Downloading Files

Learn how to download files from web pages during script execution.

Overview

The download_file() function allows your scripts to download files from web pages. Files are stored securely in a Docker volume and made available on the execution detail page for download.

Key Features:

  • ✨ Automatic filename detection: smart extraction from URLs, headers, and MIME types
  • 🔒 Security-first: blocks executables, sanitizes filenames, validates file types
  • 📏 Configurable limits: file size (default: 50MB) and count (default: 20/execution)
  • 🌍 International support: RFC 5987 compliant for encoded filenames
  • 🛡️ Script isolation: downloaded files are only accessible via the UI, not from scripts

How It Works:

  1. Specify a URL or CSS selector
  2. The system automatically detects the filename (or you can override it)
  3. The file is downloaded, validated, and sanitized
  4. The file is stored in a secure Docker volume
  5. The file is accessible from the execution detail page

Basic Usage

Download from URL

Download a file directly from a URL. The filename is automatically detected:

async def main(page):
    # Download a PDF - filename auto-detected as "report.pdf"
    result = await download_file(
        'https://example.com/files/report.pdf',
        description='Monthly sales report'
    )

    if result['success']:
        debug_log(f"Downloaded: {result['filename']} ({result['file_size']} bytes)")
    else:
        debug_log(f"Failed: {result['error']}")

With custom filename:

# Override auto-detection for specific naming
result = await download_file(
    'https://example.com/download/abc123',  # Generic URL
    description='Monthly sales report',
    filename='sales_jan_2024.pdf'  # Custom name
)

Download by Clicking

Download a file triggered by clicking a link or button. The browser provides the filename:

async def main(page):
    await page.goto('https://example.com/downloads')

    # Click a download link - browser suggests filename
    result = await download_file(
        'a[href*="invoice.pdf"]',
        description='Invoice #12345'
    )

    if result['success']:
        debug_log(f"Downloaded: {result['filename']}")
        # Example output: "Downloaded: invoice-12345.pdf"

Function Signature

await download_file(url_or_selector, description="", filename=None)

Parameters:

  • url_or_selector (str, required): if the value starts with http:// or https://, the file is downloaded from that URL; otherwise the value is treated as a CSS selector to click
  • description (str, optional): human-readable description for the UI
  • filename (str, optional): custom filename (auto-detected if not provided)
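
The dispatch rule for the first argument can be sketched as follows. This is only an illustration of the rule stated above, not the platform's actual code; classify_target is a hypothetical helper name.

```python
def classify_target(url_or_selector: str) -> str:
    """Return 'url' for http(s) targets, otherwise 'selector'.

    Hypothetical helper that mirrors the documented rule: only values
    starting with http:// or https:// count as URLs.
    """
    if url_or_selector.startswith(("http://", "https://")):
        return "url"
    return "selector"
```

One consequence worth noting: a relative link such as /files/report.pdf does not start with http:// or https://, so under this rule it would be treated as a CSS selector. Resolve relative links to absolute URLs before passing them in.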

Returns:

{
    'success': True,
    'filename': 'report.pdf',
    'file_size': 1234567,
    'mime_type': 'application/pdf'
}
# OR
{
    'success': False,
    'error': 'File size exceeds limit'
}
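
Because the two result shapes carry different keys, it can help to funnel them through one small formatter before logging. The sketch below assumes only the keys shown above; describe_result is a hypothetical helper, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
def describe_result(result: dict) -> str:
    """Build a one-line summary for either result shape (hypothetical helper)."""
    if result.get("success"):
        # Assumption: 1 MB = 1024 * 1024 bytes
        size_mb = result["file_size"] / (1024 * 1024)
        return f"{result['filename']} ({size_mb:.2f} MB, {result['mime_type']})"
    # Failure results only carry an 'error' key
    return f"failed: {result.get('error', 'unknown error')}"
```

Inside a script this could be used as debug_log(describe_result(result)), which avoids a KeyError when a failed result has no filename.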

Automatic Filename Detection

When you don't specify a filename parameter, the system extracts the filename using the strategies below, tried in priority order:

Detection Priority Order

1. Content-Disposition Header (Most Reliable)

The HTTP response header is checked first:

Content-Disposition: attachment; filename="report.pdf"
→ Uses: report.pdf

Content-Disposition: attachment; filename*=UTF-8''Sales%20Data%202024.xlsx
→ Uses: Sales Data 2024.xlsx (properly decoded)

This method supports RFC 5987 for international filenames with proper encoding.
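
To make the two header forms concrete, here is a minimal sketch of how such a header could be parsed with the standard library. It is an illustration under simplifying assumptions (regular expressions rather than a full RFC 6266 parser), not the platform's implementation; filename_from_content_disposition is a hypothetical name.

```python
import re
from urllib.parse import unquote


def filename_from_content_disposition(header: str):
    """Return the filename carried by a Content-Disposition header, or None."""
    # Prefer the RFC 5987 extended form: filename*=charset'language'percent-encoded
    ext = re.search(r"filename\*\s*=\s*([\w-]+)'[^']*'([^;\s]+)", header, re.IGNORECASE)
    if ext:
        charset, value = ext.group(1), ext.group(2)
        try:
            return unquote(value, encoding=charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to UTF-8 decoding
            return unquote(value)
    # Fall back to the plain form, quoted or unquoted
    plain = re.search(r'filename\s*=\s*(?:"([^"]*)"|([^;\s]+))', header, re.IGNORECASE)
    if plain:
        return plain.group(1) if plain.group(1) is not None else plain.group(2)
    return None
```

The extended form is checked first because, when a header carries both parameters, filename* is the one intended for clients that understand encoded names.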

2. Query Parameters

URLs with filename in query strings:

# URL: https://api.example.com/download?file=data.csv
result = await download_file('https://api.example.com/download?file=data.csv')
# Filename: data.csv

# URL: https://storage.example.com/get?filename=report-2024.pdf
result = await download_file('https://storage.example.com/get?filename=report-2024.pdf')
# Filename: report-2024.pdf

Checks for: file, filename, name, download parameters.
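
A rough sketch of this lookup, using only the standard library, is shown below. The parameter names come from the list above; the order in which they are checked and the helper name filename_from_query are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlparse

# Parameter names listed in the documentation; the lookup order is an assumption
QUERY_KEYS = ("file", "filename", "name", "download")


def filename_from_query(url: str):
    """Return the first non-empty filename-like query parameter, or None."""
    params = parse_qs(urlparse(url).query)  # values are percent-decoded lists
    for key in QUERY_KEYS:
        values = params.get(key)
        if values and values[0]:
            return values[0]
    return None
```

A URL such as https://example.com/get?id=12345 has none of these parameters, so this step yields nothing and detection would move on to the next strategy.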

3. URL Path

Extracts filename from the URL path with proper URL decoding:

# URL: https://example.com/files/report.pdf
result = await download_file('https://example.com/files/report.pdf')
# Filename: report.pdf

# URL: https://example.com/files/My%20Report%202024.pdf
result = await download_file('https://example.com/files/My%20Report%202024.pdf')
# Filename: My Report 2024.pdf (URL-decoded)
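
The path-based step can be approximated with the standard library as follows. This is a sketch of the idea rather than the actual implementation; filename_from_path is a hypothetical name.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_path(url: str):
    """Return the URL-decoded last path segment, or None if the path is empty."""
    path = unquote(urlparse(url).path)  # query string and fragment are ignored
    name = posixpath.basename(path.rstrip("/"))
    return name or None
```

Decoding before taking the last segment means an encoded slash (%2F) cannot smuggle directory components into the result.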

4. MIME Type Extension

If no filename is found or it lacks an extension, one is inferred from the MIME type:

# URL: https://api.example.com/users/1 (returns JSON)
result = await download_file('https://api.example.com/users/1')
# Filename: 1.json (adds .json extension based on Content-Type)

# URL: https://example.com/download/abc123 (returns PDF)
result = await download_file('https://example.com/download/abc123')
# Filename: abc123.pdf (adds .pdf extension)
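
The extension fallback might look like the sketch below. The mapping table is a small illustrative subset chosen for this example, and ensure_extension is a hypothetical name; the real system may recognize more types or pick different extensions.

```python
import os

# Illustrative subset; the real mapping is not documented here
MIME_EXTENSIONS = {
    "application/pdf": ".pdf",
    "application/json": ".json",
    "text/csv": ".csv",
    "text/plain": ".txt",
    "text/html": ".html",
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "application/zip": ".zip",
}


def ensure_extension(name: str, content_type: str) -> str:
    """Append an extension inferred from Content-Type when the name has none."""
    name = name or "download"  # documented default for empty filenames
    if os.path.splitext(name)[1]:
        return name  # already has an extension, leave it alone
    # Drop parameters such as "; charset=utf-8" before the lookup
    mime = content_type.split(";", 1)[0].strip().lower()
    return name + MIME_EXTENSIONS.get(mime, "")
```

Unknown MIME types leave the name unchanged, which is one way a download can end up without an extension.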

Filename Sanitization

All filenames are automatically sanitized for security:

# Removes path traversal attempts
"../../etc/passwd" → "passwd"

# Removes dangerous characters
"file<>:|?.pdf" → "file_____.pdf"

# Handles length limits (max 255 characters)
"very-long-filename-..." → "very-long-filen...ame.pdf" (preserves extension)

# Prevents empty filenames
"" → "download" or "download.pdf" (based on MIME type)
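
The rules above can be approximated in a few lines. This is a simplified sketch, not the platform's sanitizer: the exact set of replaced characters is an assumption, long names are shortened by trimming the end of the stem rather than the middle, and sanitize_filename is a hypothetical name.

```python
import os
import re

MAX_LENGTH = 255  # documented maximum filename length


def sanitize_filename(name: str, default: str = "download") -> str:
    """Reduce an untrusted filename to a safe single path component."""
    # Keep only the final component, treating both separators alike
    name = name.replace("\\", "/").split("/")[-1]
    # Replace characters that are unsafe on common filesystems (assumed set)
    name = re.sub(r'[<>:"|?*\x00-\x1f]', "_", name)
    # Leading/trailing dots and spaces would leave names like ".." behind
    name = name.strip(" .")
    if not name:
        return default
    if len(name) > MAX_LENGTH:
        stem, ext = os.path.splitext(name)
        name = stem[: MAX_LENGTH - len(ext)] + ext  # keep the extension
    return name
```

Taking the last path component before anything else is what defeats traversal attempts: whatever directories the input names, only the final segment survives.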

Examples

Automatic Detection:

# Example 1: Direct filename in URL
result = await download_file('https://example.com/reports/Q1-2024.pdf')
# Filename: Q1-2024.pdf

# Example 2: URL-encoded characters
result = await download_file('https://example.com/My%20Report%20(Final).pdf')
# Filename: My Report (Final).pdf

# Example 3: API endpoint with MIME type
result = await download_file('https://api.example.com/export/data')
# Filename: data.csv (if Content-Type is text/csv)

# Example 4: Query parameter
result = await download_file('https://cdn.example.com/dl?file=document.docx')
# Filename: document.docx

Manual Override:

# Specify custom filename (bypasses auto-detection)
result = await download_file(
    'https://example.com/download/abc123',
    description='Sales report',
    filename='Q1-2024-sales-final.pdf'
)
# Filename: Q1-2024-sales-final.pdf (uses your custom name)

Click-Based Downloads

For downloads triggered by clicking, the browser's suggested filename is used:

# Browser provides: "invoice-12345.pdf"
result = await download_file('a.download-button')
# Filename: invoice-12345.pdf

# You can still override:
result = await download_file('a.download-button', filename='my-invoice.pdf')
# Filename: my-invoice.pdf

Examples

Download Multiple Files

from urllib.parse import urljoin

async def main(page):
    await page.goto('https://data.example.com')

    # Get the first five CSV download links
    csv_links = (await page.query_selector_all('a[href$=".csv"]'))[:5]
    total = len(csv_links)

    for i, link in enumerate(csv_links, start=1):
        href = await link.get_attribute('href')
        text = (await link.text_content() or '').strip()

        # Resolve relative hrefs: anything that does not start with
        # http:// or https:// would be treated as a CSS selector
        file_url = urljoin(page.url, href)

        # Filename automatically extracted from URL or Content-Disposition
        result = await download_file(
            file_url,
            description=f'Dataset: {text}'
        )

        if result['success']:
            debug_log(f"Downloaded {i}/{total}: {result['filename']}")
            debug_log(f"  File size: {result['file_size']} bytes")
        else:
            debug_log(f"Failed {i}/{total}: {result['error']}")

Download with Error Handling

async def main(page):
    await page.goto('https://example.com/reports')

    try:
        # Filename auto-detected from URL: report.pdf
        result = await download_file(
            'https://example.com/files/report.pdf',
            description='Q1 Report'
        )

        if result['success']:
            debug_log(f"✓ Downloaded {result['filename']}")
            debug_log(f"  Size: {result['file_size']} bytes")
            debug_log(f"  Type: {result['mime_type']}")
        else:
            debug_log(f"✗ Download failed: {result['error']}")
            await capture_screenshot("Download failure")

    except Exception as e:
        debug_log(f"Exception during download: {e}")
        await capture_screenshot("Exception state")

Conditional Downloads

async def main(page):
    await page.goto('https://example.com/files')

    # Check if file exists before downloading
    download_button = page.locator('a.download-pdf')

    if await download_button.count() > 0:
        result = await download_file(
            'a.download-pdf',
            description='Available report'
        )
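        # A failed result has no 'filename' key, so check success first
        if result['success']:
            debug_log(f"Downloaded: {result['filename']}")
        else:
            debug_log(f"Download failed: {result['error']}")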
    else:
        debug_log("No download available")
        await capture_screenshot("No download found")
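
Dynamically Generated Links

async def main(page):
    await page.goto('https://example.com/dashboard')

    # Wait for dynamic content
    await page.wait_for_selector('.download-ready', timeout=10000)

    # Get dynamically generated URL (must be absolute: values that do not
    # start with http:// or https:// are treated as CSS selectors)
    download_url = await page.locator('.download-link').get_attribute('href')

    if not download_url:
        debug_log("Download link has no href")
        return

    result = await download_file(
        download_url,
        description='Generated report'
    )

    if result['success']:
        debug_log(f"Downloaded: {result['filename']}")
    else:
        debug_log(f"Download failed: {result['error']}")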

Viewing Downloads

Downloaded files appear in the execution detail page:

  1. Go to History in the navigation
  2. Click on an execution
  3. Scroll to the Downloaded Files section
  4. Click the Download button for any file

Each download shows:

  • Original filename
  • Description (if provided)
  • File size
  • MIME type
  • Download timestamp
  • File type icon

Allowed File Types

By default, the following file types are allowed:

Documents:

  • PDF (application/pdf)
  • Plain text (text/plain, text/html, text/csv)
  • JSON (application/json)

Office Formats:

  • Excel (application/vnd.ms-excel, .xlsx)
  • Word (.docx)

Images:

  • PNG (image/png)
  • JPEG (image/jpeg)

Archives:

  • ZIP (application/zip)

Blocked Types:

  • Executables (.exe)
  • Scripts (.sh, .js, .py, .php)
  • Any file type configured as blocked

See Configuration to customize allowed types.
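
As a rough illustration of how the default lists above could be applied, the sketch below checks the blocked extensions first and then the MIME allow-list. The order of the two checks, the constant names, and is_download_allowed are assumptions for this example; the .xlsx and .docx entries use the standard Office Open XML MIME types.

```python
import os

# Default allow-list from the section above
ALLOWED_MIME_TYPES = {
    "application/pdf",
    "text/plain",
    "text/html",
    "text/csv",
    "application/json",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # .xlsx
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # .docx
    "image/png",
    "image/jpeg",
    "application/zip",
}
BLOCKED_EXTENSIONS = {".exe", ".sh", ".js", ".py", ".php"}


def is_download_allowed(filename: str, mime_type: str) -> bool:
    """Reject blocked extensions first, then require an allowed MIME type."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in BLOCKED_EXTENSIONS:
        return False
    mime = mime_type.split(";", 1)[0].strip().lower()
    return mime in ALLOWED_MIME_TYPES
```

Checking the extension before the MIME type means a script served as text/plain is still rejected.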

Security & Limits

File Size Limit

Files larger than the configured limit (default: 50MB) are rejected:

result = await download_file('https://example.com/huge-file.zip')
if not result['success']:
    debug_log(f"Error: {result['error']}")  # "File size exceeds limit"

Download Count Limit

Maximum files per execution (default: 20):

# After 20 downloads
result = await download_file('https://example.com/file.pdf')
# Logs: "Maximum downloads per execution (20) exceeded"

Blocked File Types

Executables and scripts are always blocked:

result = await download_file('https://example.com/malware.exe')
# Returns: {'success': False, 'error': 'File type blocked for security'}

Filename Sanitization

Filenames are automatically sanitized to prevent security issues (see Automatic Filename Detection for details):

  • Path traversal prevention (../../etc/passwd → passwd)
  • Dangerous character removal (file<>:|?.pdf → file_____.pdf)
  • Length validation (max 255 chars with extension preservation)
  • Empty filename handling (defaults to download or download.ext)

Script Isolation

Important: Scripts CANNOT read downloaded files. The open() function is blocked for security. Files are only accessible via the web UI download buttons.

Best Practices

1. Use Descriptions (Filenames Are Auto-Detected)

The system automatically extracts filenames, so focus on providing clear descriptions:

# Good - Relies on automatic filename detection
result = await download_file(
    'https://example.com/reports/Q1-2024-Sales.pdf',
    description='Q1 2024 Sales Report'
)
# Filename auto-detected: Q1-2024-Sales.pdf
# Description helps identify the file in the UI

# Only specify filename when needed
result = await download_file(
    'https://example.com/download/abc123',  # Generic URL
    description='Sales Report',
    filename='Q1-2024-sales-final.pdf'  # Provide meaningful name
)

When to specify filename:

  • ✅ Generic URLs that don't contain meaningful filenames
  • ✅ You want a specific naming convention
  • ✅ The auto-detected name isn't user-friendly

When to rely on auto-detection:

  • ✅ URLs with clear filenames (/reports/Q1-2024.pdf)
  • ✅ Click-based downloads (browser provides filename)
  • ✅ API endpoints with Content-Disposition headers

2. Check Success Status

result = await download_file(url, description='Important file')
if result['success']:
    debug_log(f"✓ {result['filename']} downloaded")
else:
    debug_log(f"✗ Failed: {result['error']}")
    # Handle failure (retry, skip, etc.)
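
One way to handle failures is a small retry wrapper, sketched below. It takes the download function as a parameter so it makes no assumptions about the platform beyond the result format described earlier; download_with_retry and its parameters are hypothetical, and the source does not say whether failed attempts count toward the per-execution download limit.

```python
import asyncio


async def download_with_retry(download, target, attempts=3, delay=1.0, **kwargs):
    """Call download(target, **kwargs) until it succeeds or attempts run out."""
    result = {"success": False, "error": "no attempts made"}
    for attempt in range(1, attempts + 1):
        result = await download(target, **kwargs)
        if result["success"]:
            return result
        if attempt < attempts:
            # Linear backoff: wait longer after each failed attempt
            await asyncio.sleep(delay * attempt)
    return result  # last failure, including its 'error' message
```

In a script this might be called as result = await download_with_retry(download_file, url, description='Important file'). Retrying only makes sense for transient failures; errors such as a blocked file type or an exceeded size limit will fail the same way every time.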

3. Log Download Progress

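# Use full URLs: values without http:// or https:// are treated as CSS selectors
files = [
    'https://example.com/files/file1.pdf',
    'https://example.com/files/file2.csv',
    'https://example.com/files/file3.json',
]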

for i, file_url in enumerate(files):
    debug_log(f"Downloading {i+1}/{len(files)}: {file_url}")
    result = await download_file(file_url)
    debug_log(f"  {'Success' if result['success'] else 'Failed'}")

4. Combine with Screenshots

await capture_screenshot("Before download")

result = await download_file('a.download-button', description='Report')

if result['success']:
    await capture_screenshot(f"After downloading {result['filename']}")
else:
    await capture_screenshot("Download failed")

5. Respect Rate Limits

import asyncio

for file_url in file_list:
    await download_file(file_url)
    await asyncio.sleep(1)  # Wait 1 second between downloads

Troubleshooting

Download Not Starting

Problem: Click doesn't trigger download

Solution: Wait for the download trigger

# Wait for element to be ready
await page.wait_for_selector('a.download', state='visible')
result = await download_file('a.download')

File Size Too Large

Problem: "File size exceeds limit"

Solution: Contact administrator to increase MAX_DOWNLOAD_SIZE in environment variables

Wrong MIME Type

Problem: File type not in allowed list

Solution:

  1. Check the file type in the error message
  2. Contact administrator to add the type to ALLOWED_DOWNLOAD_TYPES

Download Count Exceeded

Problem: "Maximum downloads per execution exceeded"

Solution:

  1. Reduce the number of files downloaded
  2. Split into multiple script executions
  3. Contact administrator to increase MAX_DOWNLOADS_PER_EXECUTION
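
When splitting work across executions, the URL list can be divided into batches no larger than the limit. The helper below is a plain-Python sketch; split_into_batches is a hypothetical name, and the default of 20 simply mirrors the documented default limit.

```python
def split_into_batches(urls, batch_size=20):
    """Split a list of URLs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```

Each batch can then be handled by its own script execution, for example by hard-coding or parameterizing the batch index.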

Unexpected Filename

Problem: Auto-detected filename isn't what you expected

Examples:

  • Generic names like download or abc123.pdf
  • URL-encoded characters not decoded properly (old browsers)
  • Missing file extensions

Solution: Provide custom filename

# Override auto-detection with custom name
result = await download_file(
    'https://example.com/get?id=12345',
    description='Sales Report',
    filename='Q1-2024-sales.pdf'  # Specify exact name you want
)

Debugging auto-detection:

# Check what was auto-detected
result = await download_file(url, description='Test')
if result['success']:
    debug_log(f"Auto-detected filename: {result['filename']}")
    debug_log(f"MIME type: {result['mime_type']}")
    debug_log(f"Source URL: {url}")
    # If not what you want, re-download with custom filename

See Also