Downloading Files¶

Learn how to download files from web pages during script execution.

Overview¶

The download_file() function allows your scripts to download files from web pages. Files are stored securely in a Docker volume and made available on the execution detail page for download.

Key Features:

- ✨ Automatic filename detection: extracted from URLs, headers, and MIME types
- 🔒 Security-first: blocks executables, sanitizes filenames, validates file types
- 📏 Configurable limits: file size (default: 50MB) and count (default: 20 per execution)
- 🌍 International support: RFC 5987 compliant for encoded filenames
- 🛡️ Script isolation: downloaded files are only accessible via the UI, not from scripts

How It Works:

1. Specify a URL or CSS selector
2. The system automatically detects the filename (or you can override it)
3. The file is downloaded, validated, and sanitized
4. It is stored in a secure Docker volume
5. It becomes accessible from the execution detail page

Basic Usage¶

Download from URL¶

Download a file directly from a URL. The filename is automatically detected:

async def main(page):
    # Download a PDF - filename auto-detected as "report.pdf"
    result = await download_file(
        'https://example.com/files/report.pdf',
        description='Monthly sales report'
    )

    if result['success']:
        debug_log(f"Downloaded: {result['filename']} ({result['file_size']} bytes)")
    else:
        debug_log(f"Failed: {result['error']}")

With custom filename:

# Override auto-detection for specific naming
result = await download_file(
    'https://example.com/download/abc123',  # Generic URL
    description='Monthly sales report',
    filename='sales_jan_2024.pdf'  # Custom name
)

Download by Clicking¶

Download a file triggered by clicking a link or button. The browser provides the filename:

async def main(page):
    await page.goto('https://example.com/downloads')

    # Click a download link - browser suggests filename
    result = await download_file(
        'a[href*="invoice.pdf"]',
        description='Invoice #12345'
    )

    if result['success']:
        debug_log(f"Downloaded: {result['filename']}")
        # Example output: "Downloaded: invoice-12345.pdf"

Function Signature¶

await download_file(url_or_selector, description="", filename=None)

Parameters:

- url_or_selector (str, required):
    - If it starts with http:// or https://: downloads from the URL
    - Otherwise: treated as a CSS selector to click
- description (str, optional): Human-readable description for the UI
- filename (str, optional): Custom filename (auto-detected if not provided)
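
The URL-versus-selector rule above can be sketched in a few lines. This is an illustration of the documented rule only, not the platform's implementation; the helper name classify_target is hypothetical.

```python
def classify_target(url_or_selector: str) -> str:
    """Return 'url' for http(s) targets and 'selector' for everything else."""
    if url_or_selector.startswith(("http://", "https://")):
        return "url"
    return "selector"


print(classify_target("https://example.com/files/report.pdf"))  # url
print(classify_target('a[href*="invoice.pdf"]'))                # selector
print(classify_target("/files/report.pdf"))                     # selector
```

Note the last case: under this rule a relative path such as /files/report.pdf is not recognized as a URL, so resolve relative links to absolute URLs before passing them to download_file().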

Returns:

{
    'success': True,
    'filename': 'report.pdf',
    'file_size': 1234567,
    'mime_type': 'application/pdf'
}
# OR
{
    'success': False,
    'error': 'File size exceeds limit'
}
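
Because the two result shapes carry different keys, it can help to route every result through one small formatter. The sketch below assumes only the keys shown above; describe_result is a hypothetical helper name, and in a script its return value would be passed to debug_log().

```python
def describe_result(result: dict) -> str:
    """Build a one-line summary for either result shape."""
    if result.get("success"):
        size_mb = result["file_size"] / (1024 * 1024)
        return f"OK {result['filename']} ({size_mb:.2f} MB, {result['mime_type']})"
    # Failure results carry 'error' instead of file details
    return f"FAILED: {result.get('error', 'unknown error')}"


ok = {"success": True, "filename": "report.pdf",
      "file_size": 1234567, "mime_type": "application/pdf"}
failed = {"success": False, "error": "File size exceeds limit"}

print(describe_result(ok))      # OK report.pdf (1.18 MB, application/pdf)
print(describe_result(failed))  # FAILED: File size exceeds limit
```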

Automatic Filename Detection¶

When you don't specify a filename parameter, the system tries several sources in a fixed order and uses the first one that yields a filename:

Detection Priority Order¶

1. Content-Disposition Header (Most Reliable)

The HTTP response header is checked first:

Content-Disposition: attachment; filename="report.pdf"
→ Uses: report.pdf

Content-Disposition: attachment; filename*=UTF-8''Sales%20Data%202024.xlsx
→ Uses: Sales Data 2024.xlsx (properly decoded)

This method supports RFC 5987 for international filenames with proper encoding.
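
How the platform parses this header internally is not shown here. As a rough sketch of the idea, the function below prefers the RFC 5987 extended form (filename*=charset'language'value) and falls back to the plain filename= form. The function name is hypothetical and real-world headers can be messier than this handles.

```python
import re
from urllib.parse import unquote


def filename_from_content_disposition(header: str):
    """Return the filename from a Content-Disposition value, or None."""
    # Extended form: filename*=charset'language'percent-encoded-value
    ext = re.search(r"filename\*\s*=\s*([^';\s]+)'[^']*'([^;\s]+)",
                    header, re.IGNORECASE)
    if ext:
        charset, encoded = ext.group(1), ext.group(2)
        return unquote(encoded, encoding=charset, errors="replace")
    # Plain form: filename="quoted value" or filename=token
    plain = re.search(r'filename\s*=\s*(?:"([^"]*)"|([^;\s]+))',
                      header, re.IGNORECASE)
    if plain:
        return plain.group(1) if plain.group(1) is not None else plain.group(2)
    return None


print(filename_from_content_disposition('attachment; filename="report.pdf"'))
print(filename_from_content_disposition(
    "attachment; filename*=UTF-8''Sales%20Data%202024.xlsx"))
```

The two calls print report.pdf and Sales Data 2024.xlsx, matching the header examples above.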

2. Query Parameters

URLs with filename in query strings:

# URL: https://api.example.com/download?file=data.csv
result = await download_file('https://api.example.com/download?file=data.csv')
# Filename: data.csv

# URL: https://storage.example.com/get?filename=report-2024.pdf
result = await download_file('https://storage.example.com/get?filename=report-2024.pdf')
# Filename: report-2024.pdf

Checks for: file, filename, name, download parameters.
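
A minimal sketch of this lookup, using only the standard library. The parameter names come from the list above; the order in which they are tried and the helper name filename_from_query are assumptions.

```python
from urllib.parse import urlparse, parse_qs

# Parameter names from the documentation; the lookup order is an assumption
FILENAME_PARAMS = ("file", "filename", "name", "download")


def filename_from_query(url: str):
    """Return the first filename-like query parameter value, or None."""
    query = parse_qs(urlparse(url).query)
    for key in FILENAME_PARAMS:
        values = query.get(key)
        if values and values[0]:
            return values[0]
    return None


print(filename_from_query("https://api.example.com/download?file=data.csv"))
print(filename_from_query("https://example.com/get?id=12345"))
```

The first call prints data.csv; the second prints None because none of the recognized parameters is present, which is when the next detection step would apply.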

3. URL Path

Extracts filename from the URL path with proper URL decoding:

# URL: https://example.com/files/report.pdf
result = await download_file('https://example.com/files/report.pdf')
# Filename: report.pdf

# URL: https://example.com/files/My%20Report%202024.pdf
result = await download_file('https://example.com/files/My%20Report%202024.pdf')
# Filename: My Report 2024.pdf (URL-decoded)
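
The path-based step can be approximated as follows. This is a sketch of the described behavior, not the platform's code, and filename_from_path is a hypothetical name.

```python
import posixpath
from urllib.parse import urlparse, unquote


def filename_from_path(url: str):
    """Return the URL-decoded last path segment, or None if the path is empty."""
    # Decode first, then take the last segment, so an encoded slash (%2F)
    # cannot smuggle extra path components into the name
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path.rstrip("/"))
    return name or None


print(filename_from_path("https://example.com/files/report.pdf"))
print(filename_from_path("https://example.com/files/My%20Report%202024.pdf"))
```

The query string and fragment are ignored because urlparse() separates them from the path.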

4. MIME Type Extension

If no filename is found or it lacks an extension, one is inferred from the MIME type:

# URL: https://api.example.com/users/1 (returns JSON)
result = await download_file('https://api.example.com/users/1')
# Filename: 1.json (adds .json extension based on Content-Type)

# URL: https://example.com/download/abc123 (returns PDF)
result = await download_file('https://example.com/download/abc123')
# Filename: abc123.pdf (adds .pdf extension)
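
To illustrate the extension fallback, here is a small sketch with an explicit lookup table. The table contents and the name ensure_extension are assumptions for illustration; the platform's actual mapping may cover more types.

```python
import os

# Illustrative subset of MIME types; not the platform's actual table
MIME_EXTENSIONS = {
    "application/pdf": ".pdf",
    "application/json": ".json",
    "text/csv": ".csv",
    "text/plain": ".txt",
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "application/zip": ".zip",
}


def ensure_extension(filename: str, content_type: str) -> str:
    """Append an extension derived from Content-Type if the name has none."""
    # Drop parameters such as "; charset=utf-8" before the lookup
    mime = content_type.split(";", 1)[0].strip().lower()
    _, ext = os.path.splitext(filename)
    if ext:
        return filename
    return filename + MIME_EXTENSIONS.get(mime, "")


print(ensure_extension("1", "application/json; charset=utf-8"))  # 1.json
print(ensure_extension("abc123", "application/pdf"))             # abc123.pdf
```

A name that already has an extension is returned unchanged, and an unknown MIME type leaves the name without one.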

Filename Sanitization¶

All filenames are automatically sanitized for security:

# Removes path traversal attempts
"../../etc/passwd" → "passwd"

# Replaces dangerous characters with underscores
"file<>:|?.pdf" → "file_____.pdf"

# Handles length limits (max 255 characters)
"very-long-filename-..." → "very-long-filen...ame.pdf" (preserves extension)

# Prevents empty filenames
"" → "download" or "download.pdf" (based on MIME type)
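
The rules above can be approximated with a short function. This is a simplified sketch: the exact character set the platform replaces is an assumption, long names are shortened by trimming the end of the stem rather than the middle, and the limit is counted in characters rather than bytes. The name sanitize_filename is hypothetical.

```python
import os
import re

MAX_LENGTH = 255


def sanitize_filename(name: str, default: str = "download") -> str:
    """Apply path, character, length, and empty-name rules to a filename."""
    # Keep only the final path component (handles / and \ separators)
    name = name.replace("\\", "/").split("/")[-1]
    # Replace characters that are unsafe on common filesystems
    name = re.sub(r'[<>:"|?*\x00-\x1f]', "_", name)
    # Names made only of dots or spaces are treated as empty
    name = name.strip(". ")
    if not name:
        return default
    if len(name) > MAX_LENGTH:
        # Trim the stem so the extension survives
        stem, ext = os.path.splitext(name)
        name = stem[: MAX_LENGTH - len(ext)] + ext
    return name


print(sanitize_filename("../../etc/passwd"))  # passwd
print(sanitize_filename("file<>:|?.pdf"))     # file_____.pdf
print(sanitize_filename(""))                  # download
```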

Examples¶

Automatic Detection:

# Example 1: Direct filename in URL
result = await download_file('https://example.com/reports/Q1-2024.pdf')
# Filename: Q1-2024.pdf

# Example 2: URL-encoded characters
result = await download_file('https://example.com/My%20Report%20(Final).pdf')
# Filename: My Report (Final).pdf

# Example 3: API endpoint with MIME type
result = await download_file('https://api.example.com/export/data')
# Filename: data.csv (if Content-Type is text/csv)

# Example 4: Query parameter
result = await download_file('https://cdn.example.com/dl?file=document.docx')
# Filename: document.docx

Manual Override:

# Specify custom filename (bypasses auto-detection)
result = await download_file(
    'https://example.com/download/abc123',
    description='Sales report',
    filename='Q1-2024-sales-final.pdf'
)
# Filename: Q1-2024-sales-final.pdf (uses your custom name)

Click-Based Downloads¶

For downloads triggered by clicking, the browser's suggested filename is used:

# Browser provides: "invoice-12345.pdf"
result = await download_file('a.download-button')
# Filename: invoice-12345.pdf

# You can still override:
result = await download_file('a.download-button', filename='my-invoice.pdf')
# Filename: my-invoice.pdf

Examples¶

Download Multiple Files¶

async def main(page):
    await page.goto('https://data.example.com')

    # Get the first 5 CSV download links
    csv_links = (await page.query_selector_all('a[href$=".csv"]'))[:5]
    total = len(csv_links)

    for i, link in enumerate(csv_links, start=1):
        # Read the resolved href from the DOM so relative links become
        # absolute URLs; a relative path would be treated as a CSS selector
        href = await link.evaluate('el => el.href')
        text = (await link.text_content() or '').strip()

        # Filename automatically extracted from URL or Content-Disposition
        result = await download_file(
            href,
            description=f'Dataset: {text}'
        )

        if result['success']:
            debug_log(f"Downloaded {i}/{total}: {result['filename']}")
            debug_log(f"  File size: {result['file_size']} bytes")
        else:
            debug_log(f"Failed {i}/{total}: {result['error']}")

Download with Error Handling¶

async def main(page):
    await page.goto('https://example.com/reports')

    try:
        # Filename auto-detected from URL: report.pdf
        result = await download_file(
            'https://example.com/files/report.pdf',
            description='Q1 Report'
        )

        if result['success']:
            debug_log(f"✓ Downloaded {result['filename']}")
            debug_log(f"  Size: {result['file_size']} bytes")
            debug_log(f"  Type: {result['mime_type']}")
        else:
            debug_log(f"✗ Download failed: {result['error']}")
            await capture_screenshot("Download failure")

    except Exception as e:
        debug_log(f"Exception during download: {e}")
        await capture_screenshot("Exception state")

Conditional Downloads¶

async def main(page):
    await page.goto('https://example.com/files')

    # Check if the download link exists before downloading
    download_button = page.locator('a.download-pdf')

    if await download_button.count() > 0:
        result = await download_file(
            'a.download-pdf',
            description='Available report'
        )
        # 'filename' is only present on success, so check before reading it
        if result['success']:
            debug_log(f"Downloaded: {result['filename']}")
        else:
            debug_log(f"Download failed: {result['error']}")
    else:
        debug_log("No download available")
        await capture_screenshot("No download found")

Download from Dynamic Links¶

async def main(page):
    await page.goto('https://example.com/dashboard')

    # Wait for dynamic content
    await page.wait_for_selector('.download-ready', timeout=10000)

    # Get the dynamically generated URL; reading el.href from the DOM
    # returns an absolute URL even if the attribute is relative
    download_url = await page.locator('.download-link').evaluate('el => el.href')

    result = await download_file(
        download_url,
        description='Generated report'
    )

    if result['success']:
        debug_log(f"Downloaded: {result['filename']}")
    else:
        debug_log(f"Download failed: {result['error']}")

Viewing Downloads¶

Downloaded files appear in the execution detail page:

1. Go to History in the navigation
2. Click on an execution
3. Scroll to the Downloaded Files section
4. Click the Download button for any file

Each download shows:

- Original filename
- Description (if provided)
- File size
- MIME type
- Download timestamp
- File type icon

Allowed File Types¶

By default, the following file types are allowed:

Documents:

- PDF (application/pdf)
- Plain text (text/plain, text/html, text/csv)
- JSON (application/json)

Office Formats:

- Excel (application/vnd.ms-excel, .xlsx)
- Word (.docx)

Images:

- PNG (image/png)
- JPEG (image/jpeg)

Archives:

- ZIP (application/zip)

Blocked Types:

- Executables (.exe)
- Scripts (.sh, .js, .py, .php)
- Any file type configured as blocked

See Configuration to customize allowed types.
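
A script can run a rough pre-flight check against these defaults before attempting a download. The sketch below mirrors the lists above; the full MIME strings for .xlsx and .docx are the standard Office Open XML types, and the function name, return shape, and second error message are assumptions. The server-side check remains authoritative and may be configured differently.

```python
import os

# Mirrors the default lists in this section; an administrator may change them
ALLOWED_MIME_TYPES = {
    "application/pdf", "text/plain", "text/html", "text/csv",
    "application/json", "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "image/png", "image/jpeg", "application/zip",
}
BLOCKED_EXTENSIONS = {".exe", ".sh", ".js", ".py", ".php"}


def is_download_allowed(filename: str, mime_type: str):
    """Return (allowed, reason) for a filename and MIME type pair."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in BLOCKED_EXTENSIONS:
        return False, "File type blocked for security"
    if mime_type.lower() not in ALLOWED_MIME_TYPES:
        return False, f"MIME type not allowed: {mime_type}"
    return True, ""


print(is_download_allowed("report.pdf", "application/pdf"))
print(is_download_allowed("setup.exe", "application/octet-stream"))
```

The blocked-extension check runs first, so a blocked extension is rejected even when it is served with an allowed MIME type.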

Security & Limits¶

File Size Limit¶

Files larger than the configured limit (default: 50MB) are rejected:

result = await download_file('https://example.com/huge-file.zip')
if not result['success']:
    debug_log(f"Error: {result['error']}")  # "File size exceeds limit"

Download Count Limit¶

Maximum files per execution (default: 20):

# After 20 downloads
result = await download_file('https://example.com/file.pdf')
# Logs: "Maximum downloads per execution (20) exceeded"

Blocked File Types¶

Executables and scripts are always blocked:

result = await download_file('https://example.com/malware.exe')
# Returns: {'success': False, 'error': 'File type blocked for security'}

Filename Sanitization¶

Filenames are automatically sanitized to prevent security issues (see Automatic Filename Detection for details):

- Path traversal prevention (../../etc/passwd → passwd)
- Dangerous character replacement (file<>:|?.pdf → file_____.pdf)
- Length validation (max 255 characters, extension preserved)
- Empty filename handling (defaults to download or download.ext)

Script Isolation¶

Important: Scripts CANNOT read downloaded files. The open() function is blocked for security. Files are only accessible via the web UI download buttons.

Best Practices¶

1. Use Descriptions (Filenames Are Auto-Detected)¶

The system automatically extracts filenames, so focus on providing clear descriptions:

# Good - Relies on automatic filename detection
result = await download_file(
    'https://example.com/reports/Q1-2024-Sales.pdf',
    description='Q1 2024 Sales Report'
)
# Filename auto-detected: Q1-2024-Sales.pdf
# Description helps identify the file in the UI

# Only specify filename when needed
result = await download_file(
    'https://example.com/download/abc123',  # Generic URL
    description='Sales Report',
    filename='Q1-2024-sales-final.pdf'  # Provide meaningful name
)

When to specify filename:

- ✅ Generic URLs that don't contain meaningful filenames
- ✅ You want a specific naming convention
- ✅ The auto-detected name isn't user-friendly

When to rely on auto-detection:

- ✅ URLs with clear filenames (/reports/Q1-2024.pdf)
- ✅ Click-based downloads (browser provides filename)
- ✅ API endpoints with Content-Disposition headers

2. Check Success Status¶

result = await download_file(url, description='Important file')
if result['success']:
    debug_log(f"✓ {result['filename']} downloaded")
else:
    debug_log(f"✗ Failed: {result['error']}")
    # Handle failure (retry, skip, etc.)

3. Log Download Progress¶

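# Use full URLs: anything not starting with http:// or https:// is treated as a CSS selector
files = [
    'https://example.com/files/file1.pdf',
    'https://example.com/files/file2.csv',
    'https://example.com/files/file3.json',
]

for i, file_url in enumerate(files, start=1):
    debug_log(f"Downloading {i}/{len(files)}: {file_url}")
    result = await download_file(file_url)
    debug_log(f"  {'Success' if result['success'] else 'Failed'}")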

4. Combine with Screenshots¶

await capture_screenshot("Before download")

result = await download_file('a.download-button', description='Report')

if result['success']:
    await capture_screenshot(f"After downloading {result['filename']}")
else:
    await capture_screenshot("Download failed")

5. Respect Rate Limits¶

import asyncio

for file_url in file_list:
    await download_file(file_url)
    await asyncio.sleep(1)  # Wait 1 second between downloads
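
If the remote server occasionally fails under load, pacing can be combined with a bounded retry. The helper below is a sketch, not a platform feature: download_all and its parameters are hypothetical, and the download callable is passed in so the helper can be exercised outside the platform. Whether failed attempts count toward the per-execution download limit is not stated here, so keep max_attempts small.

```python
import asyncio


async def download_all(urls, download, delay_seconds=1.0, max_attempts=3):
    """Download URLs one at a time, pausing between them and retrying failures."""
    results = []
    for url in urls:
        for attempt in range(1, max_attempts + 1):
            result = await download(url)
            if result["success"] or attempt == max_attempts:
                break
            # Exponential backoff before the next try: delay, 2x delay, 4x delay
            await asyncio.sleep(delay_seconds * 2 ** (attempt - 1))
        results.append(result)
        # Fixed pause between files to respect the server's rate limits
        await asyncio.sleep(delay_seconds)
    return results
```

Inside a script this would be called as results = await download_all(file_list, download_file), with the returned list holding one result dictionary per URL in the same order.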

Troubleshooting¶

Download Not Starting¶

Problem: Click doesn't trigger download

Solution: Wait until the element that triggers the download is visible before calling download_file()

# Wait for element to be ready
await page.wait_for_selector('a.download', state='visible')
result = await download_file('a.download')

File Size Too Large¶

Problem: "File size exceeds limit"

Solution: Ask your administrator to raise the MAX_DOWNLOAD_SIZE environment variable

Wrong MIME Type¶

Problem: File type not in allowed list

Solution:

1. Check the file type in the error message
2. Contact your administrator to add the type to ALLOWED_DOWNLOAD_TYPES

Download Count Exceeded¶

Problem: "Maximum downloads per execution exceeded"

Solution:

1. Reduce the number of files downloaded
2. Split the work into multiple script executions
3. Contact your administrator to increase MAX_DOWNLOADS_PER_EXECUTION
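
For the second option, the URL list can be cut into batches that each fit under the limit, with one batch handled per execution. The sketch below assumes the default limit of 20; split_into_batches is a hypothetical helper, and how each batch is handed to a separate execution depends on how your scripts are scheduled.

```python
def split_into_batches(urls, batch_size=20):
    """Split a list of URLs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


urls = [f"https://example.com/files/report-{n}.pdf" for n in range(45)]
batches = split_into_batches(urls)
print([len(batch) for batch in batches])  # [20, 20, 5]
```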

Unexpected Filename¶

Problem: Auto-detected filename isn't what you expected

Examples:

- Generic names like download or abc123.pdf
- URL-encoded characters not decoded properly (old browsers)
- Missing file extensions

Solution: Provide custom filename

# Override auto-detection with custom name
result = await download_file(
    'https://example.com/get?id=12345',
    description='Sales Report',
    filename='Q1-2024-sales.pdf'  # Specify exact name you want
)

Debugging auto-detection:

# Check what was auto-detected
result = await download_file(url, description='Test')
if result['success']:
    debug_log(f"Auto-detected filename: {result['filename']}")
    debug_log(f"MIME type: {result['mime_type']}")
    debug_log(f"Source URL: {url}")
    # If not what you want, re-download with custom filename
