Downloading Files¶
Learn how to download files from web pages during script execution.
Overview¶
The download_file() function allows your scripts to download files from web pages. Files are stored securely in a Docker volume and made available on the execution detail page for download.
Key Features:
- ✨ Automatic filename detection: smart extraction from URLs, headers, and MIME types
- 🔒 Security-first: blocks executables, sanitizes filenames, validates file types
- 📏 Configurable limits: file size (default: 50MB) and count (default: 20/execution)
- 🌍 International support: RFC 5987 compliant for encoded filenames
- 🛡️ Script isolation: downloaded files are accessible only via the UI, not from scripts
How It Works:
1. Specify a URL or CSS selector
2. The system automatically detects the filename (or you can override it)
3. The file is downloaded, validated, and sanitized
4. The file is stored in a secure Docker volume
5. The file is accessible from the execution detail page
Basic Usage¶
Download from URL¶
Download a file directly from a URL. The filename is automatically detected:
async def main(page):
    # Download a PDF - filename auto-detected as "report.pdf"
    result = await download_file(
        'https://example.com/files/report.pdf',
        description='Monthly sales report'
    )
    if result['success']:
        debug_log(f"Downloaded: {result['filename']} ({result['file_size']} bytes)")
    else:
        debug_log(f"Failed: {result['error']}")
With custom filename:
# Override auto-detection for specific naming
result = await download_file(
    'https://example.com/download/abc123',  # Generic URL
    description='Monthly sales report',
    filename='sales_jan_2024.pdf'  # Custom name
)
Download by Clicking¶
Download a file triggered by clicking a link or button. The browser provides the filename:
async def main(page):
    await page.goto('https://example.com/downloads')
    # Click a download link - browser suggests filename
    result = await download_file(
        'a[href*="invoice.pdf"]',
        description='Invoice #12345'
    )
    if result['success']:
        debug_log(f"Downloaded: {result['filename']}")
        # Example output: "Downloaded: invoice-12345.pdf"
Function Signature¶
Parameters:
- url_or_selector (str, required):
  - If it starts with http:// or https://: downloads from that URL
  - Otherwise: treated as a CSS selector to click
- description (str, optional): Human-readable description for the UI
- filename (str, optional): Custom filename (auto-detected if not provided)
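The prefix rule for url_or_selector can be sketched as follows. This only illustrates the documented behavior and is not the actual implementation; classify_target is a hypothetical helper name used for this example.

```python
def classify_target(url_or_selector: str) -> str:
    """Hypothetical helper: apply the documented prefix rule."""
    # Only absolute http(s) URLs count as URLs; everything else is a selector.
    if url_or_selector.startswith(("http://", "https://")):
        return "url"
    return "selector"


print(classify_target("https://example.com/files/report.pdf"))
print(classify_target('a[href*="invoice.pdf"]'))
print(classify_target("/files/report.pdf"))
```

Under this rule a relative href such as /files/report.pdf would be handled as a selector, so it is safer to resolve hrefs to absolute URLs before passing them to download_file().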
Returns:
{
    'success': True,
    'filename': 'report.pdf',
    'file_size': 1234567,
    'mime_type': 'application/pdf'
}
# OR
{
    'success': False,
    'error': 'File size exceeds limit'
}
Automatic Filename Detection¶
When you omit the filename parameter, the system derives one using the strategies below, tried in priority order:
Detection Priority Order¶
1. Content-Disposition Header (Most Reliable)
The HTTP response header is checked first:
Content-Disposition: attachment; filename="report.pdf"
→ Uses: report.pdf
Content-Disposition: attachment; filename*=UTF-8''Sales%20Data%202024.xlsx
→ Uses: Sales Data 2024.xlsx (properly decoded)
This method supports RFC 5987 for international filenames with proper encoding.
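To make the two header forms concrete, here is a minimal sketch of how a filename could be pulled out of a Content-Disposition value. It is an assumption about the general technique, not the system's own parser; filename_from_content_disposition is a hypothetical name, and the regular expressions cover only the common cases shown above.

```python
import re
from urllib.parse import unquote


def filename_from_content_disposition(header):
    """Hypothetical helper: return the filename in a header value, or None."""
    # Extended form (RFC 5987): filename*=charset'language'percent-encoded-value
    match = re.search(r"filename\*\s*=\s*([\w-]+)'[^']*'([^;]+)", header, re.IGNORECASE)
    if match:
        charset, encoded = match.group(1), match.group(2).strip()
        return unquote(encoded, encoding=charset)
    # Plain quoted form: filename="report.pdf"
    match = re.search(r'filename\s*=\s*"([^"]*)"', header, re.IGNORECASE)
    if match:
        return match.group(1)
    # Plain unquoted form: filename=report.pdf
    match = re.search(r"filename\s*=\s*([^;\s]+)", header, re.IGNORECASE)
    if match:
        return match.group(1)
    return None


print(filename_from_content_disposition('attachment; filename="report.pdf"'))
print(filename_from_content_disposition("attachment; filename*=UTF-8''Sales%20Data%202024.xlsx"))
```

The extended form is checked first because, when both forms are present, filename* carries the correctly encoded name.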
2. Query Parameters
URLs with filename in query strings:
# URL: https://api.example.com/download?file=data.csv
result = await download_file('https://api.example.com/download?file=data.csv')
# Filename: data.csv
# URL: https://storage.example.com/get?filename=report-2024.pdf
result = await download_file('https://storage.example.com/get?filename=report-2024.pdf')
# Filename: report-2024.pdf
Checks for: file, filename, name, download parameters.
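A query-parameter lookup along these lines can be written with the standard library. This is a sketch under the assumption that the parameters are checked in the order listed; filename_from_query is a hypothetical name.

```python
from urllib.parse import parse_qs, urlparse

# Parameter names taken from the list above; the checking order is an assumption.
FILENAME_KEYS = ("file", "filename", "name", "download")


def filename_from_query(url):
    """Hypothetical helper: return the first filename-like query value, or None."""
    params = parse_qs(urlparse(url).query)
    for key in FILENAME_KEYS:
        values = params.get(key)
        if values and values[0]:
            return values[0]
    return None


print(filename_from_query("https://api.example.com/download?file=data.csv"))
print(filename_from_query("https://storage.example.com/get?filename=report-2024.pdf"))
```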
3. URL Path
Extracts filename from the URL path with proper URL decoding:
# URL: https://example.com/files/report.pdf
result = await download_file('https://example.com/files/report.pdf')
# Filename: report.pdf
# URL: https://example.com/files/My%20Report%202024.pdf
result = await download_file('https://example.com/files/My%20Report%202024.pdf')
# Filename: My Report 2024.pdf (URL-decoded)
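Path-based extraction amounts to taking the last path segment and URL-decoding it. The sketch below shows that idea with the standard library; filename_from_path is a hypothetical name and the real implementation may differ.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_path(url):
    """Hypothetical helper: return the decoded last path segment, or None."""
    path = urlparse(url).path
    # Take the last segment first, then decode, so an encoded slash (%2F)
    # cannot introduce an extra path component.
    name = unquote(posixpath.basename(path))
    return name or None


print(filename_from_path("https://example.com/files/report.pdf"))
print(filename_from_path("https://example.com/files/My%20Report%202024.pdf"))
```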
4. MIME Type Extension
If no filename is found or it lacks an extension, one is inferred from the MIME type:
# URL: https://api.example.com/users/1 (returns JSON)
result = await download_file('https://api.example.com/users/1')
# Filename: 1.json (adds .json extension based on Content-Type)
# URL: https://example.com/download/abc123 (returns PDF)
result = await download_file('https://example.com/download/abc123')
# Filename: abc123.pdf (adds .pdf extension)
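The extension fallback can be pictured as a lookup from Content-Type to extension, applied only when the name has none. The mapping below is a small illustrative subset and ensure_extension is a hypothetical name; the system's actual table is not documented here.

```python
import os

# Illustrative subset only; the real mapping is an assumption.
MIME_EXTENSIONS = {
    "application/json": ".json",
    "application/pdf": ".pdf",
    "text/csv": ".csv",
    "text/plain": ".txt",
}


def ensure_extension(name, content_type):
    """Hypothetical helper: append an extension inferred from the MIME type."""
    if os.path.splitext(name)[1]:
        return name  # already has an extension
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = content_type.split(";", 1)[0].strip().lower()
    return name + MIME_EXTENSIONS.get(mime, "")


print(ensure_extension("1", "application/json; charset=utf-8"))
print(ensure_extension("abc123", "application/pdf"))
```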
Filename Sanitization¶
All filenames are automatically sanitized for security:
# Removes path traversal attempts
"../../etc/passwd" → "passwd"
# Removes dangerous characters
"file<>:|?.pdf" → "file_____.pdf"
# Handles length limits (max 255 characters)
"very-long-filename-..." → "very-long-filen...ame.pdf" (preserves extension)
# Prevents empty filenames
"" → "download" or "download.pdf" (based on MIME type)
Examples¶
Automatic Detection:
# Example 1: Direct filename in URL
result = await download_file('https://example.com/reports/Q1-2024.pdf')
# Filename: Q1-2024.pdf
# Example 2: URL-encoded characters
result = await download_file('https://example.com/My%20Report%20(Final).pdf')
# Filename: My Report (Final).pdf
# Example 3: API endpoint with MIME type
result = await download_file('https://api.example.com/export/data')
# Filename: data.csv (if Content-Type is text/csv)
# Example 4: Query parameter
result = await download_file('https://cdn.example.com/dl?file=document.docx')
# Filename: document.docx
Manual Override:
# Specify custom filename (bypasses auto-detection)
result = await download_file(
    'https://example.com/download/abc123',
    description='Sales report',
    filename='Q1-2024-sales-final.pdf'
)
# Filename: Q1-2024-sales-final.pdf (uses your custom name)
Click-Based Downloads¶
For downloads triggered by clicking, the browser's suggested filename is used:
# Browser provides: "invoice-12345.pdf"
result = await download_file('a.download-button')
# Filename: invoice-12345.pdf
# You can still override:
result = await download_file('a.download-button', filename='my-invoice.pdf')
# Filename: my-invoice.pdf
Examples¶
Download Multiple Files¶
from urllib.parse import urljoin

async def main(page):
    await page.goto('https://data.example.com')
    # Get all CSV download links
    csv_links = await page.query_selector_all('a[href$=".csv"]')
    selected = csv_links[:5]  # First 5 files
    for i, link in enumerate(selected, start=1):
        # Resolve relative hrefs: a value that does not start with
        # http:// or https:// would be treated as a CSS selector
        href = urljoin(page.url, await link.get_attribute('href'))
        text = await link.text_content()
        # Filename automatically extracted from URL or Content-Disposition
        result = await download_file(
            href,
            description=f'Dataset: {text}'
        )
        if result['success']:
            debug_log(f"Downloaded {i}/{len(selected)}: {result['filename']}")
            debug_log(f"  File size: {result['file_size']} bytes")
        else:
            debug_log(f"Failed {i}/{len(selected)}: {result['error']}")
Download with Error Handling¶
async def main(page):
    await page.goto('https://example.com/reports')
    try:
        # Filename auto-detected from URL: report.pdf
        result = await download_file(
            'https://example.com/files/report.pdf',
            description='Q1 Report'
        )
        if result['success']:
            debug_log(f"✓ Downloaded {result['filename']}")
            debug_log(f"  Size: {result['file_size']} bytes")
            debug_log(f"  Type: {result['mime_type']}")
        else:
            debug_log(f"✗ Download failed: {result['error']}")
            await capture_screenshot("Download failure")
    except Exception as e:
        debug_log(f"Exception during download: {e}")
        await capture_screenshot("Exception state")
Conditional Downloads¶
async def main(page):
    await page.goto('https://example.com/files')
    # Check that the download link exists before clicking it
    download_button = page.locator('a.download-pdf')
    if await download_button.count() > 0:
        result = await download_file(
            'a.download-pdf',
            description='Available report'
        )
        # A failed download has no 'filename' key, so check success first
        if result['success']:
            debug_log(f"Downloaded: {result['filename']}")
        else:
            debug_log(f"Failed: {result['error']}")
    else:
        debug_log("No download available")
        await capture_screenshot("No download found")
Download from Dynamic Links¶
from urllib.parse import urljoin

async def main(page):
    await page.goto('https://example.com/dashboard')
    # Wait for dynamic content
    await page.wait_for_selector('.download-ready', timeout=10000)
    # Get the dynamically generated URL; resolve it in case the href is relative
    href = await page.locator('.download-link').get_attribute('href')
    download_url = urljoin(page.url, href)
    result = await download_file(
        download_url,
        description='Generated report'
    )
    if result['success']:
        debug_log(f"Downloaded: {result['filename']}")
    else:
        debug_log(f"Failed: {result['error']}")
Viewing Downloads¶
Downloaded files appear in the execution detail page:
- Go to History in the navigation
- Click on an execution
- Scroll to the Downloaded Files section
- Click the Download button for any file
Each download shows:
- Original filename
- Description (if provided)
- File size
- MIME type
- Download timestamp
- File type icon
Allowed File Types¶
By default, the following file types are allowed:
Documents:
- PDF (application/pdf)
- Plain text (text/plain, text/html, text/csv)
- JSON (application/json)
Office Formats:
- Excel (application/vnd.ms-excel, .xlsx)
- Word (.docx)
Images:
- PNG (image/png)
- JPEG (image/jpeg)
Archives:
- ZIP (application/zip)
Blocked Types:
- Executables (.exe)
- Scripts (.sh, .js, .py, .php)
- Any file type configured as blocked
See Configuration to customize allowed types.
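As a rough picture of how such a policy can be evaluated, the sketch below checks the blocked extensions first and then the allowed MIME types. The order of checks, the set names, and is_download_allowed are assumptions made for illustration; the .xlsx and .docx entries use the standard Office Open XML MIME types.

```python
import os

# Defaults mirrored from the lists above (assumed representation).
ALLOWED_MIME_TYPES = {
    "application/pdf", "text/plain", "text/html", "text/csv", "application/json",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",    # .xlsx
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # .docx
    "image/png", "image/jpeg", "application/zip",
}
BLOCKED_EXTENSIONS = {".exe", ".sh", ".js", ".py", ".php"}


def is_download_allowed(filename, mime_type):
    """Hypothetical check: blocked extensions win over an allowed MIME type."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in BLOCKED_EXTENSIONS:
        return False
    mime = mime_type.split(";", 1)[0].strip().lower()
    return mime in ALLOWED_MIME_TYPES


print(is_download_allowed("report.pdf", "application/pdf"))
print(is_download_allowed("malware.exe", "application/zip"))
```

Checking the extension before the MIME type means a blocked file cannot pass simply because the server labels it with an allowed Content-Type.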
Security & Limits¶
File Size Limit¶
Files larger than the configured limit (default: 50MB) are rejected:
result = await download_file('https://example.com/huge-file.zip')
if not result['success']:
    debug_log(f"Error: {result['error']}")  # "File size exceeds limit"
Download Count Limit¶
Maximum files per execution (default: 20):
# After 20 downloads
result = await download_file('https://example.com/file.pdf')
# Logs: "Maximum downloads per execution (20) exceeded"
Blocked File Types¶
Executables and scripts are always blocked:
result = await download_file('https://example.com/malware.exe')
# Returns: {'success': False, 'error': 'File type blocked for security'}
Filename Sanitization¶
Filenames are automatically sanitized to prevent security issues (see Automatic Filename Detection for details):
- Path traversal prevention (../../etc/passwd → passwd)
- Dangerous character removal (file<>:|?.pdf → file_____.pdf)
- Length validation (max 255 chars with extension preservation)
- Empty filename handling (defaults to download or download.ext)
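The four rules above can be combined into one small function. The version below is a sketch that reproduces the documented examples; the exact character set, the truncation strategy, and the name sanitize_filename are assumptions, not the system's implementation.

```python
import os
import re

MAX_FILENAME_LENGTH = 255  # limit stated above


def sanitize_filename(name, default="download"):
    """Hypothetical sanitizer following the four documented rules."""
    # 1. Path traversal: keep only the last path component.
    name = name.replace("\\", "/").split("/")[-1]
    # 2. Dangerous characters: replace each one with an underscore.
    name = re.sub(r'[<>:"|?*\x00-\x1f]', "_", name)
    name = name.strip(" .")
    # 3. Empty result: fall back to a default name.
    if not name:
        return default
    # 4. Length: shorten the stem and keep the extension.
    if len(name) > MAX_FILENAME_LENGTH:
        stem, extension = os.path.splitext(name)
        name = stem[:MAX_FILENAME_LENGTH - len(extension)] + extension
    return name


print(sanitize_filename("../../etc/passwd"))
print(sanitize_filename("file<>:|?.pdf"))
```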
Script Isolation¶
Important: Scripts CANNOT read downloaded files. The open() function is blocked for security. Files are only accessible via the web UI download buttons.
Best Practices¶
1. Use Descriptions (Filenames Are Auto-Detected)¶
The system automatically extracts filenames, so focus on providing clear descriptions:
# Good - Relies on automatic filename detection
result = await download_file(
    'https://example.com/reports/Q1-2024-Sales.pdf',
    description='Q1 2024 Sales Report'
)
# Filename auto-detected: Q1-2024-Sales.pdf
# Description helps identify the file in the UI
# Only specify filename when needed
result = await download_file(
    'https://example.com/download/abc123',  # Generic URL
    description='Sales Report',
    filename='Q1-2024-sales-final.pdf'  # Provide meaningful name
)
When to specify filename:
- ✅ Generic URLs that don't contain meaningful filenames
- ✅ You want a specific naming convention
- ✅ The auto-detected name isn't user-friendly
When to rely on auto-detection:
- ✅ URLs with clear filenames (/reports/Q1-2024.pdf)
- ✅ Click-based downloads (browser provides filename)
- ✅ API endpoints with Content-Disposition headers
2. Check Success Status¶
result = await download_file(url, description='Important file')
if result['success']:
    debug_log(f"✓ {result['filename']} downloaded")
else:
    debug_log(f"✗ Failed: {result['error']}")
    # Handle failure (retry, skip, etc.)
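One way to handle a failure is a bounded retry. The helper below is a sketch, not part of the documented API: download_with_retry is a hypothetical name, and the download callable is passed in so the helper works with download_file() or with a stand-in during testing. Whether failed attempts count toward the per-execution download limit is not stated here, so keep the attempt count small.

```python
import asyncio


async def download_with_retry(download, target, attempts=3, delay=1.0, **kwargs):
    """Hypothetical helper: call `download` until it succeeds or attempts run out."""
    result = {"success": False, "error": "no attempts made"}
    for attempt in range(1, attempts + 1):
        result = await download(target, **kwargs)
        if result["success"]:
            return result
        if attempt < attempts:
            # Pause before the next try to avoid hammering the server.
            await asyncio.sleep(delay)
    return result  # the last failure, including its 'error' message
```

Inside a script this could be called as result = await download_with_retry(download_file, url, description='Important file').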
3. Log Download Progress¶
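# Use full URLs: bare names like 'file1.pdf' would be treated as CSS selectors
files = [
    'https://example.com/file1.pdf',
    'https://example.com/file2.csv',
    'https://example.com/file3.json',
]
for i, file_url in enumerate(files):
    debug_log(f"Downloading {i+1}/{len(files)}: {file_url}")
    result = await download_file(file_url)
    debug_log(f"  {'Success' if result['success'] else 'Failed'}")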
4. Combine with Screenshots¶
await capture_screenshot("Before download")
result = await download_file('a.download-button', description='Report')
if result['success']:
    await capture_screenshot(f"After downloading {result['filename']}")
else:
    await capture_screenshot("Download failed")
5. Respect Rate Limits¶
import asyncio
for file_url in file_list:
    await download_file(file_url)
    await asyncio.sleep(1)  # Wait 1 second between downloads
Troubleshooting¶
Download Not Starting¶
Problem: Click doesn't trigger download
Solution: Wait for the download trigger
# Wait for element to be ready
await page.wait_for_selector('a.download', state='visible')
result = await download_file('a.download')
File Size Too Large¶
Problem: "File size exceeds limit"
Solution: Contact administrator to increase MAX_DOWNLOAD_SIZE in environment variables
Wrong MIME Type¶
Problem: File type not in allowed list
Solution:
1. Check file type in error message
2. Contact administrator to add type to ALLOWED_DOWNLOAD_TYPES
Download Count Exceeded¶
Problem: "Maximum downloads per execution exceeded"
Solution:
1. Reduce number of files downloaded
2. Split into multiple script executions
3. Contact administrator to increase MAX_DOWNLOADS_PER_EXECUTION
Unexpected Filename¶
Problem: Auto-detected filename isn't what you expected
Examples:
- Generic names like download or abc123.pdf
- URL-encoded characters not decoded properly (old browsers)
- Missing file extensions
Solution: Provide custom filename
# Override auto-detection with custom name
result = await download_file(
    'https://example.com/get?id=12345',
    description='Sales Report',
    filename='Q1-2024-sales.pdf'  # Specify exact name you want
)
Debugging auto-detection:
# Check what was auto-detected
result = await download_file(url, description='Test')
if result['success']:
    debug_log(f"Auto-detected filename: {result['filename']}")
    debug_log(f"MIME type: {result['mime_type']}")
    debug_log(f"Source URL: {url}")
    # If not what you want, re-download with custom filename
See Also¶
- Available Functions - Complete API reference
- Screenshots - Visual debugging
- Debug Logging - Execution logging
- Configuration - Download limits and settings
- Examples - More examples