Configuration

Configuration guide for administrators.

Environment Variables

Scrapazoid is configured via environment variables in the .env file.

Basic Settings

# Flask Configuration
SECRET_KEY=your_secret_key_here
FLASK_ENV=development  # or production

# PostgreSQL Configuration
POSTGRES_USER=scrapazoid
POSTGRES_PASSWORD=scrapazoid_secret
POSTGRES_DB=scrapazoid

Database URL

# Database URL (auto-constructed in docker-compose)
DATABASE_URL=postgresql://user:pass@host:5432/database

Note: When using docker-compose, this is automatically set. Only configure manually for local development without Docker.
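
For local development without Docker, the URL can be assembled from the PostgreSQL variables above. The following is a minimal sketch rather than Scrapazoid's actual code: the helper name build_database_url, the localhost host, the 5432 port, and the fallback values are all assumptions made for illustration.

```python
import os
from urllib.parse import quote


def build_database_url(env=os.environ, host="localhost", port=5432):
    """Assemble a PostgreSQL URL from POSTGRES_* variables (hypothetical helper)."""
    user = env.get("POSTGRES_USER", "scrapazoid")
    password = env.get("POSTGRES_PASSWORD", "")
    database = env.get("POSTGRES_DB", "scrapazoid")
    # Percent-encode credentials so characters such as '@' or '/' stay valid in a URL.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    print(build_database_url())
```

Percent-encoding matters mainly for passwords: an unescaped "@" or "/" would otherwise be read as part of the URL structure.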

Execution Limits

Control script execution behavior and resource usage.

Execution Time

# Maximum execution time in seconds
# Scripts will time out and be marked as FAILED after this duration
# Default: 300 seconds (5 minutes)
MAX_EXECUTION_TIME=300

Recommendations:

  • Development: 300 (5 minutes)
  • Production: 300-600 (5-10 minutes)
  • Long-running scripts: 900-1800 (15-30 minutes)

⚠️ Warning: Very long timeouts can tie up resources. Monitor execution patterns.
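
To illustrate how a numeric limit such as MAX_EXECUTION_TIME might be read with its documented default, here is a small sketch. The function read_int_setting is hypothetical and is not taken from the Scrapazoid code base; rejecting values below 1 is also an assumption.

```python
import os


def read_int_setting(name, default, minimum=1, env=os.environ):
    """Read an integer setting, falling back to `default` when unset or blank."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < minimum:
        raise ValueError(f"{name} must be >= {minimum}, got {value}")
    return value


if __name__ == "__main__":
    # 300 seconds is the default documented above.
    print(read_int_setting("MAX_EXECUTION_TIME", 300))
```

Failing loudly on a malformed value is a deliberate choice in this sketch: a silently ignored typo in .env is harder to diagnose than a startup error.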

Concurrent Executions

# Maximum concurrent executions per user
# Prevents users from overwhelming the system
# Default: 3
MAX_CONCURRENT_EXECUTIONS_PER_USER=3

Recommendations:

  • Small instances: 1-2
  • Medium instances: 3-5
  • Large instances: 5-10

Considerations:

  • Each execution runs a Chromium browser (memory intensive)
  • Monitor server resources when increasing limits

Download Settings

Control file download behavior and security.

File Size Limit

# Maximum file size in bytes
# Files larger than this will be rejected
# Default: 52428800 (50MB)
MAX_DOWNLOAD_SIZE=52428800

Common Values:

# 10MB
MAX_DOWNLOAD_SIZE=10485760

# 50MB (default)
MAX_DOWNLOAD_SIZE=52428800

# 100MB
MAX_DOWNLOAD_SIZE=104857600

# 500MB
MAX_DOWNLOAD_SIZE=524288000

Considerations:

  • Larger limits require more disk space in Docker volumes
  • Consider available storage and retention period
  • Locate the volume with docker volume inspect scrapazoid_downloads and check actual usage with docker system df -v
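
The byte values listed under Common Values use 1MB = 1024 × 1024 bytes. A short sketch for deriving other limits follows; the helper name mib_to_bytes is illustrative only.

```python
def mib_to_bytes(mebibytes):
    """Convert a size in MB (1024 * 1024 bytes each) to bytes."""
    return mebibytes * 1024 * 1024


if __name__ == "__main__":
    # Reproduce the common values listed above.
    for size in (10, 50, 100, 500):
        print(f"MAX_DOWNLOAD_SIZE={mib_to_bytes(size)}  # {size}MB")
```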

Downloads Per Execution

# Maximum number of files that can be downloaded per execution
# Prevents abuse and excessive storage use
# Default: 20
MAX_DOWNLOADS_PER_EXECUTION=20

Recommendations:

  • Light use: 10-20
  • Data collection: 50-100
  • Bulk downloads: 100-500

⚠️ Warning: High limits combined with large file sizes can fill storage quickly.

Allowed File Types

# Comma-separated list of allowed MIME types
# If not set, uses sensible defaults
ALLOWED_DOWNLOAD_TYPES=application/pdf,text/csv,application/json

Default Allowed Types:

application/pdf
application/json
text/csv
text/plain
text/html
application/zip
image/png
image/jpeg
application/vnd.ms-excel
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.openxmlformats-officedocument.wordprocessingml.document

Custom Configuration Examples:

# Only PDFs and CSVs
ALLOWED_DOWNLOAD_TYPES=application/pdf,text/csv

# Add XML files
ALLOWED_DOWNLOAD_TYPES=application/pdf,text/csv,application/xml,text/xml

# Images only
ALLOWED_DOWNLOAD_TYPES=image/png,image/jpeg,image/gif,image/webp

Always Blocked (Cannot Override):

  • application/x-msdownload (.exe)
  • application/x-executable
  • application/x-sh (.sh)
  • application/x-shellscript
  • application/x-python-code (.py)
  • text/x-python
  • application/x-php (.php)
  • application/javascript (.js)
  • text/javascript
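
The interaction between the three lists (custom value, defaults, and always-blocked types) can be sketched as follows. This is an illustration of the documented behavior, not the actual Scrapazoid implementation; the names DEFAULT_ALLOWED, ALWAYS_BLOCKED, and effective_allowed_types are hypothetical, and the lowercase normalization is an assumption.

```python
import os

DEFAULT_ALLOWED = {
    "application/pdf", "application/json", "text/csv", "text/plain",
    "text/html", "application/zip", "image/png", "image/jpeg",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

ALWAYS_BLOCKED = {
    "application/x-msdownload", "application/x-executable",
    "application/x-sh", "application/x-shellscript",
    "application/x-python-code", "text/x-python",
    "application/x-php", "application/javascript", "text/javascript",
}


def effective_allowed_types(env=os.environ):
    """Return the MIME types that may be downloaded (hypothetical helper)."""
    raw = env.get("ALLOWED_DOWNLOAD_TYPES", "")
    # Split the comma-separated value, ignoring blanks and stray whitespace.
    requested = {t.strip().lower() for t in raw.split(",") if t.strip()}
    allowed = requested or DEFAULT_ALLOWED  # unset or empty -> defaults
    # Blocked types are removed even when listed explicitly.
    return allowed - ALWAYS_BLOCKED


if __name__ == "__main__":
    print(sorted(effective_allowed_types()))
```

Subtracting the blocked set last is what makes the block list impossible to override from .env in this sketch.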

Retention Period

# Days to keep downloaded files before cleanup
# Default: 7 days (matches screenshot retention)
DOWNLOAD_RETENTION_DAYS=7

Storage Calculation:

Estimated Storage =
  MAX_DOWNLOAD_SIZE ×
  MAX_DOWNLOADS_PER_EXECUTION ×
  Daily Executions ×
  DOWNLOAD_RETENTION_DAYS

Example:

  • 50MB max size
  • 20 downloads per execution
  • 10 executions per day
  • 7 day retention

= 50MB × 20 × 10 × 7 = 70,000MB ≈ 70GB storage needed (worst case, with every download at the size limit)
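
The same worst-case estimate can be computed for other settings with a few lines of Python. This is a sketch of the formula above; the function name is illustrative, and the number of daily executions has to come from your own usage data.

```python
def estimated_storage_bytes(max_download_size, max_downloads_per_execution,
                            daily_executions, retention_days):
    """Upper bound on download storage: every file at the size limit."""
    return (max_download_size * max_downloads_per_execution
            * daily_executions * retention_days)


if __name__ == "__main__":
    # Values from the example above: 50MB, 20 downloads, 10 runs/day, 7 days.
    worst_case = estimated_storage_bytes(52428800, 20, 10, 7)
    print(f"{worst_case / 1024**3:.1f} GiB")  # prints "68.4 GiB"
```

The result is slightly below 70 because it divides by 1024 rather than 1000; both figures describe the same upper bound. Real usage is normally far lower, since few downloads reach the maximum size.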

Screenshot Settings

# Days to keep screenshots before cleanup
# Default: 7 days
SCREENSHOT_RETENTION_DAYS=7

Recommendations:

  • Development: 1-3 days
  • Production: 7-14 days
  • Long-term auditing: 30+ days
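
A retention period in days translates into a cutoff timestamp: anything created before the cutoff is eligible for cleanup. How Scrapazoid's cleanup job decides this internally is not documented here, so the sketch below only shows the general idea, assuming timezone-aware UTC timestamps and hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def retention_cutoff(retention_days, now=None):
    """Return the oldest creation time that is still retained."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=retention_days)


def is_expired(created_at, retention_days, now=None):
    """True when an item created at `created_at` is past its retention period."""
    return created_at < retention_cutoff(retention_days, now)


if __name__ == "__main__":
    # With SCREENSHOT_RETENTION_DAYS=7, items older than this are removable.
    print(retention_cutoff(7).isoformat())
```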

Complete .env Example

# Flask Configuration
SECRET_KEY=your-super-secret-key-change-this-in-production
FLASK_ENV=production

# PostgreSQL Configuration
POSTGRES_USER=scrapazoid
POSTGRES_PASSWORD=strong_password_here
POSTGRES_DB=scrapazoid

# Execution Limits
MAX_EXECUTION_TIME=600              # 10 minutes
MAX_CONCURRENT_EXECUTIONS_PER_USER=5

# Download Limits
MAX_DOWNLOAD_SIZE=104857600         # 100MB
MAX_DOWNLOADS_PER_EXECUTION=50
DOWNLOAD_RETENTION_DAYS=14
# ALLOWED_DOWNLOAD_TYPES=application/pdf,text/csv  # Override defaults

# Screenshot Settings
SCREENSHOT_RETENTION_DAYS=14

Applying Configuration Changes

Method 1: Stop and Start

# Stop containers
docker-compose down

# Start with new configuration
docker-compose up -d

Method 2: Recreate (If env file changed)

# Recreate containers so they pick up the new environment (images are not rebuilt)
docker-compose up -d --force-recreate

Verify Configuration

Check that environment variables are loaded:

# View web container environment
docker-compose exec web env | grep MAX

# Should show:
# MAX_EXECUTION_TIME=600
# MAX_DOWNLOAD_SIZE=104857600
# etc.

Monitoring

Storage Usage

Monitor Docker volume usage:

# Check downloads volume
docker volume inspect scrapazoid_downloads | grep Mountpoint

# Check actual disk usage
docker system df -v

Resource Usage

Monitor container resources:

# Container stats
docker stats scrapazoid-web

# Shows:
# - Memory usage
# - CPU usage
# - Network I/O
# - Block I/O

Execution Patterns

Check database for execution metrics:

docker-compose exec web python -c "
from app import create_app
from app.models import Execution, DownloadedFile

app = create_app()
with app.app_context():
    # Total executions
    total = Execution.query.count()
    print(f'Total executions: {total}')

    # Total downloads
    downloads = DownloadedFile.query.count()
    print(f'Total downloads: {downloads}')

    # Average downloads per execution (0.0 when there are no executions yet)
    avg = downloads / total if total else 0.0
    print(f'Avg downloads/execution: {avg:.1f}')
"

Security Considerations

File Type Validation

  • Never add executable types to ALLOWED_DOWNLOAD_TYPES
  • Review allowed types periodically
  • Monitor for unusual file types in logs

Storage Limits

  • Set MAX_DOWNLOAD_SIZE based on available storage
  • Calculate total storage needs (see formula above)
  • Monitor disk usage regularly
  • Set up alerts for low storage
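
A low-storage alert can start as a simple free-space check run on a schedule. The sketch below uses only the Python standard library; the 10% threshold and the checked path are assumptions, and in practice the path would be the volume Mountpoint reported by docker volume inspect.

```python
import shutil


def free_space_ratio(path="/"):
    """Return the fraction of the filesystem at `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_storage_low(path="/", threshold=0.10):
    """True when less than `threshold` (default 10%) of the disk is free."""
    return free_space_ratio(path) < threshold


if __name__ == "__main__":
    # Replace "/" with the downloads volume mountpoint on your host.
    if is_storage_low("/"):
        print("WARNING: less than 10% disk space remaining")
    else:
        print(f"Disk OK: {free_space_ratio('/'):.0%} free")
```

Hooking the warning into email or chat notifications is left out on purpose, since that depends on the tooling already in place.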

Resource Limits

  • Limit concurrent executions based on server capacity
  • Monitor memory usage (each execution = 1 Chromium instance)
  • Consider separate execution queue for heavy workloads

User Isolation

  • All file access requires an authenticated user (@login_required)
  • Downloaded files are accessible only by the execution owner
  • Scripts cannot access downloaded files (open() is blocked)

Troubleshooting

Downloads Rejected

Problem: "File size exceeds limit"

Solution: Increase MAX_DOWNLOAD_SIZE

# In .env
MAX_DOWNLOAD_SIZE=104857600  # 100MB

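# Recreate the container so the new value is loaded
# (a plain "restart" keeps the old environment)
docker-compose up -d --force-recreate web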

Storage Full

Problem: Docker volume out of space

Solutions:

  1. Reduce retention period:

    DOWNLOAD_RETENTION_DAYS=3
    SCREENSHOT_RETENTION_DAYS=3
    

  2. Manual cleanup:

    docker-compose exec web python -c "
    from app import create_app
    from app.cleanup import cleanup_old_downloads, cleanup_old_screenshots
    
    app = create_app()
    with app.app_context():
        d = cleanup_old_downloads(retention_days=1)
        s = cleanup_old_screenshots(retention_days=1)
        print(f'Deleted {d} downloads, {s} screenshots')
    "
    

  3. Increase volume size (requires volume backup/restore)

High Memory Usage

Problem: Server running out of memory

Solutions:

  1. Reduce concurrent executions:

    MAX_CONCURRENT_EXECUTIONS_PER_USER=2
    

  2. Monitor and kill stuck executions:

    docker-compose exec web python -c "
    from app import create_app
    from app.cleanup import cleanup_stuck_executions
    
    app = create_app()
    with app.app_context():
        count = cleanup_stuck_executions()
        print(f'Cleaned up {count} stuck executions')
    "
    

See Also