Configuration¶
Configuration guide for administrators.
Environment Variables¶
Scrapazoid is configured via environment variables in the .env file.
Basic Settings¶
# Flask Configuration
SECRET_KEY=your_secret_key_here
# development or production
FLASK_ENV=development
# PostgreSQL Configuration
POSTGRES_USER=scrapazoid
POSTGRES_PASSWORD=scrapazoid_secret
POSTGRES_DB=scrapazoid
Database URL¶
# Database URL (auto-constructed in docker-compose)
DATABASE_URL=postgresql://user:pass@host:5432/database
Note: When using docker-compose, this is automatically set. Only configure manually for local development without Docker.
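For local development without Docker, the URL can be assembled from the same `POSTGRES_*` variables. A minimal sketch (the `build_database_url` helper and its defaults are illustrative, not part of Scrapazoid):

```python
import os

# Illustrative helper (not part of Scrapazoid) that assembles DATABASE_URL
# from the POSTGRES_* variables shown above, the way docker-compose does.
def build_database_url(host="localhost", port=5432):
    user = os.environ.get("POSTGRES_USER", "scrapazoid")
    password = os.environ.get("POSTGRES_PASSWORD", "scrapazoid_secret")
    db = os.environ.get("POSTGRES_DB", "scrapazoid")
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"

print(build_database_url())
```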
Execution Limits¶
Control script execution behavior and resource usage.
Execution Time¶
# Maximum execution time in seconds
# Scripts will timeout and be marked as FAILED after this duration
# Default: 300 seconds (5 minutes)
MAX_EXECUTION_TIME=300
Recommendations:
- Development: 300 (5 minutes)
- Production: 300-600 (5-10 minutes)
- Long-running scripts: 900-1800 (15-30 minutes)
⚠️ Warning: Very long timeouts can tie up resources. Monitor execution patterns.
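The timeout behavior described above can be sketched in plain Python. This is illustrative only; `run_with_timeout` is not Scrapazoid's actual implementation:

```python
import concurrent.futures
import time

# Illustrative sketch of MAX_EXECUTION_TIME-style enforcement: run the
# work in a worker thread and mark it FAILED if it exceeds the timeout.
# Not Scrapazoid's actual code.
def run_with_timeout(fn, timeout):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return "SUCCESS", future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            return "FAILED", None

status, _ = run_with_timeout(lambda: time.sleep(0.05), timeout=1.0)
print(status)  # SUCCESS
status, _ = run_with_timeout(lambda: time.sleep(0.5), timeout=0.1)
print(status)  # FAILED
```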
Concurrent Executions¶
# Maximum concurrent executions per user
# Prevents users from overwhelming the system
# Default: 3
MAX_CONCURRENT_EXECUTIONS_PER_USER=3
Recommendations:
- Small instances: 1-2
- Medium instances: 3-5
- Large instances: 5-10

Considerations:
- Each execution runs a Chromium browser (memory intensive)
- Monitor server resources when increasing limits
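A rough way to size this limit against available RAM. The ~400 MB per headless Chromium instance and the 2 GB system reserve are assumptions for illustration; measure your own workload with `docker stats`:

```python
# Back-of-envelope capacity check. The 400 MB per Chromium instance and
# 2048 MB system reserve are assumptions -- measure with `docker stats`.
def max_concurrent_users(total_ram_mb, per_exec_mb=400, per_user_limit=3, reserve_mb=2048):
    usable = total_ram_mb - reserve_mb
    return max(usable // (per_exec_mb * per_user_limit), 0)

# Users who can all max out a per-user limit of 3 on an 8 GB host:
print(max_concurrent_users(8192))
```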
Download Settings¶
Control file download behavior and security.
File Size Limit¶
# Maximum file size in bytes
# Files larger than this will be rejected
# Default: 52428800 (50MB)
MAX_DOWNLOAD_SIZE=52428800
Common Values:
# 10MB
MAX_DOWNLOAD_SIZE=10485760
# 50MB (default)
MAX_DOWNLOAD_SIZE=52428800
# 100MB
MAX_DOWNLOAD_SIZE=104857600
# 500MB
MAX_DOWNLOAD_SIZE=524288000
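The byte values above are just megabytes times 1024². A tiny helper avoids mistyping them (the `mb` function is for illustration only):

```python
# Convert megabytes to the byte values MAX_DOWNLOAD_SIZE expects.
def mb(n):
    return n * 1024 * 1024

for size in (10, 50, 100, 500):
    print(f"# {size}MB\nMAX_DOWNLOAD_SIZE={mb(size)}")
```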
Considerations:
- Larger limits require more disk space in Docker volumes
- Consider available storage and retention period
- Monitor volume usage: docker volume inspect scrapazoid_downloads
Downloads Per Execution¶
# Maximum number of files that can be downloaded per execution
# Prevents abuse and excessive storage use
# Default: 20
MAX_DOWNLOADS_PER_EXECUTION=20
Recommendations:
- Light use: 10-20
- Data collection: 50-100
- Bulk downloads: 100-500
⚠️ Warning: High limits combined with large file sizes can fill storage quickly.
Allowed File Types¶
# Comma-separated list of allowed MIME types
# If not set, uses sensible defaults
ALLOWED_DOWNLOAD_TYPES=application/pdf,text/csv,application/json
Default Allowed Types:
application/pdf
application/json
text/csv
text/plain
text/html
application/zip
image/png
image/jpeg
application/vnd.ms-excel
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.openxmlformats-officedocument.wordprocessingml.document
Custom Configuration Examples:
# Only PDFs and CSVs
ALLOWED_DOWNLOAD_TYPES=application/pdf,text/csv
# Add XML files
ALLOWED_DOWNLOAD_TYPES=application/pdf,text/csv,application/xml,text/xml
# Images only
ALLOWED_DOWNLOAD_TYPES=image/png,image/jpeg,image/gif,image/webp
Always Blocked (Cannot Override):
- application/x-msdownload (.exe)
- application/x-executable
- application/x-sh (.sh)
- application/x-shellscript
- application/x-python-code (.py)
- text/x-python
- application/x-php (.php)
- application/javascript (.js)
- text/javascript
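The interaction between the allowlist and the hard blocklist can be sketched as follows. This is illustrative; `is_allowed` is not Scrapazoid's actual code:

```python
# Illustrative sketch of the filtering rules above -- not Scrapazoid's
# actual implementation. The built-in blocklist always wins, even if a
# type is listed in ALLOWED_DOWNLOAD_TYPES.
ALWAYS_BLOCKED = {
    "application/x-msdownload", "application/x-executable",
    "application/x-sh", "application/x-shellscript",
    "application/x-python-code", "text/x-python",
    "application/x-php", "application/javascript", "text/javascript",
}

def is_allowed(mime_type, allowed_types="application/pdf,text/csv"):
    allowed = {t.strip() for t in allowed_types.split(",") if t.strip()}
    return mime_type not in ALWAYS_BLOCKED and mime_type in allowed

print(is_allowed("application/pdf"))      # True
print(is_allowed("text/javascript"))      # False (always blocked)
print(is_allowed("application/x-sh", "application/x-sh"))  # False -- blocklist wins
```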
Retention Period¶
# Days to keep downloaded files before cleanup
# Default: 7 days (matches screenshot retention)
DOWNLOAD_RETENTION_DAYS=7
Storage Calculation:
Estimated Storage = MAX_DOWNLOAD_SIZE × MAX_DOWNLOADS_PER_EXECUTION × Daily Executions × DOWNLOAD_RETENTION_DAYS

Example:
- 50MB max size
- 20 downloads per execution
- 10 executions per day
- 7 day retention

= 50MB × 20 × 10 × 7 = 70GB storage needed
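The worked example as runnable arithmetic (using 1 GB = 1000 MB, matching the round numbers above):

```python
# Storage estimate from the formula above, in GB (1 GB = 1000 MB here,
# matching the doc's round numbers).
def storage_gb(max_download_mb, downloads_per_exec, daily_execs, retention_days):
    return max_download_mb * downloads_per_exec * daily_execs * retention_days / 1000

print(f"{storage_gb(50, 20, 10, 7):.0f} GB")  # the 70GB example above
```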
Screenshot Settings¶
# Days to keep screenshots before cleanup
# Default: 7 days
SCREENSHOT_RETENTION_DAYS=7
Recommendations:
- Development: 1-3 days
- Production: 7-14 days
- Long-term auditing: 30+ days
Complete .env Example¶
# Flask Configuration
SECRET_KEY=your-super-secret-key-change-this-in-production
FLASK_ENV=production
# PostgreSQL Configuration
POSTGRES_USER=scrapazoid
POSTGRES_PASSWORD=strong_password_here
POSTGRES_DB=scrapazoid
# Execution Limits
# 10 minutes
MAX_EXECUTION_TIME=600
MAX_CONCURRENT_EXECUTIONS_PER_USER=5
# Download Limits
# 100MB
MAX_DOWNLOAD_SIZE=104857600
MAX_DOWNLOADS_PER_EXECUTION=50
DOWNLOAD_RETENTION_DAYS=14
# ALLOWED_DOWNLOAD_TYPES=application/pdf,text/csv # Override defaults
# Screenshot Settings
SCREENSHOT_RETENTION_DAYS=14
Applying Configuration Changes¶
Method 1: Restart Containers (Recommended)¶
docker-compose restart
Method 2: Recreate (If env file changed)¶
docker-compose up -d --force-recreate
Verify Configuration¶
Check that environment variables are loaded:
# View web container environment
docker-compose exec web env | grep MAX
# Should show:
# MAX_EXECUTION_TIME=600
# MAX_DOWNLOAD_SIZE=104857600
# etc.
Monitoring¶
Storage Usage¶
Monitor Docker volume usage:
# Check downloads volume
docker volume inspect scrapazoid_downloads | grep Mountpoint
# Check actual disk usage
docker system df -v
Resource Usage¶
Monitor container resources:
# Container stats
docker stats scrapazoid-web
# Shows:
# - Memory usage
# - CPU usage
# - Network I/O
# - Block I/O
Execution Patterns¶
Check database for execution metrics:
docker-compose exec web python -c "
from app import create_app, db
from app.models import Execution, DownloadedFile
from sqlalchemy import func

app = create_app()
with app.app_context():
    # Total executions
    total = Execution.query.count()
    print(f'Total executions: {total}')

    # Total downloads
    downloads = DownloadedFile.query.count()
    print(f'Total downloads: {downloads}')

    # Average downloads per execution (0.0 when there are none)
    avg = db.session.query(func.avg(
        db.session.query(func.count(DownloadedFile.id))
        .filter(DownloadedFile.execution_id == Execution.id)
        .correlate(Execution)
        .scalar_subquery()
    )).select_from(Execution).scalar()
    print(f'Avg downloads/execution: {(avg or 0):.1f}')
"
Security Considerations¶
File Type Validation¶
- Never add executable types to ALLOWED_DOWNLOAD_TYPES
- Review allowed types periodically
- Monitor for unusual file types in logs
Storage Limits¶
- Set MAX_DOWNLOAD_SIZE based on available storage
- Calculate total storage needs (see formula above)
- Monitor disk usage regularly
- Set up alerts for low storage
Resource Limits¶
- Limit concurrent executions based on server capacity
- Monitor memory usage (each execution = 1 Chromium instance)
- Consider separate execution queue for heavy workloads
User Isolation¶
- All files are user-isolated via @login_required
- Downloaded files only accessible by execution owner
- Scripts cannot access downloaded files (open() is blocked)
Troubleshooting¶
Downloads Rejected¶
Problem: "File size exceeds limit"
Solution: Increase MAX_DOWNLOAD_SIZE, then apply the change (see Applying Configuration Changes above). If the rejection concerns file type or count instead, check ALLOWED_DOWNLOAD_TYPES and MAX_DOWNLOADS_PER_EXECUTION.
Storage Full¶
Problem: Docker volume out of space
Solutions:
1. Reduce retention period:

DOWNLOAD_RETENTION_DAYS=3

2. Manual cleanup:

docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_downloads, cleanup_old_screenshots
app = create_app()
with app.app_context():
    d = cleanup_old_downloads(retention_days=1)
    s = cleanup_old_screenshots(retention_days=1)
    print(f'Deleted {d} downloads, {s} screenshots')
"

3. Increase volume size (requires volume backup/restore)
High Memory Usage¶
Problem: Server running out of memory
Solutions:
1. Reduce concurrent executions:

MAX_CONCURRENT_EXECUTIONS_PER_USER=2

2. Monitor and kill stuck executions (use docker stats to spot containers pinned at high memory; see Monitoring above)