Cleanup System

Scrapazoid includes an automatic cleanup system that maintains system health and manages storage.

Overview

The cleanup system ensures that:

  • ✅ No execution runs longer than the configured timeout
  • ✅ Stuck executions are automatically marked as FAILED
  • ✅ Old screenshots and downloads are removed
  • ✅ Database stays clean and accurate
  • ✅ Storage is managed automatically
  • ✅ Resources are properly freed

Automatic Background Cleanup

A background thread runs continuously to perform cleanup tasks:

  • Every 60 seconds: Check for stuck executions
  • Every 24 hours: Clean up old screenshots and downloaded files

Detection Criteria

An execution is considered "stuck" if:

  1. Status is running or pending
  2. Started more than 6 minutes ago (5-minute timeout + 1-minute buffer)
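The two criteria above can be expressed as a small predicate. This is an illustrative sketch, not the actual app/cleanup.py code; the function name `is_stuck` and the `CLEANUP_BUFFER` constant are assumptions mirroring the description:

```python
from datetime import datetime, timedelta

MAX_EXECUTION_TIME = 300  # seconds, from config.py
CLEANUP_BUFFER = 60       # one-minute grace period on top of the timeout

def is_stuck(status, started_at, now=None):
    """True when an execution matches the stuck criteria above."""
    now = now or datetime.utcnow()
    threshold = timedelta(seconds=MAX_EXECUTION_TIME + CLEANUP_BUFFER)
    return status in ("running", "pending") and (now - started_at) > threshold
```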

Cleanup Actions

For each stuck execution:

  1. Status changed to failed
  2. Error message set to: "Execution stuck for {runtime}. Automatically terminated by cleanup job."
  3. Completion timestamp recorded
  4. Database changes committed
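The four actions can be sketched as a plain function. This is illustrative only: the real cleanup code updates a SQLAlchemy model and commits via the database session, whereas here a dict stands in for the ORM record and the field names are assumptions:

```python
from datetime import datetime

def mark_stuck_failed(execution, now=None):
    """Apply the cleanup actions above to one stuck execution record."""
    now = now or datetime.utcnow()
    runtime = now - execution["started_at"]
    execution["status"] = "failed"                   # 1. status -> failed
    execution["error_message"] = (                   # 2. error message
        f"Execution stuck for {runtime}. "
        "Automatically terminated by cleanup job."
    )
    execution["completed_at"] = now                  # 3. completion timestamp
    return execution                                 # 4. caller commits the change
```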

Logs

The cleanup thread logs activity to stdout:

[Cleanup Thread] Started with 60s interval
[Cleanup] Found 2 stuck execution(s)
[Cleanup] Marking execution #123 as failed (runtime: 0:07:42)
[Cleanup] Marking execution #124 as failed (runtime: 0:08:15)
[Cleanup] Cleaned up 2 stuck execution(s)
[Cleanup Thread] Running daily file cleanup...
[Cleanup Thread] Deleted 45 old screenshots, 123 old downloads

Manual Cleanup

To immediately clean up stuck executions:

flask cleanup_stuck_executions

Example Output

$ flask cleanup_stuck_executions
[Cleanup] Found 3 stuck execution(s)
[Cleanup] Marking execution #45 as failed (runtime: 0:12:34)
[Cleanup] Marking execution #46 as failed (runtime: 0:15:22)
[Cleanup] Marking execution #47 as failed (runtime: 1:03:15)
Successfully cleaned up 3 stuck execution(s).

When to Use Manual Cleanup

  • Cleaning up executions stuck before automatic cleanup was implemented
  • Immediate cleanup without waiting for the 60-second interval
  • Testing the cleanup functionality
  • After a system crash or restart

File Cleanup

Screenshot Cleanup

Old screenshots are automatically deleted based on retention period.

Configuration:

# In .env
SCREENSHOT_RETENTION_DAYS=7  # Default: 7 days

Manual Cleanup:

docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_screenshots

app = create_app()
with app.app_context():
    count = cleanup_old_screenshots(retention_days=7)
    print(f'Deleted {count} old screenshots')
"

Cleanup Process:

  1. Finds screenshots older than the retention period
  2. Deletes files from the /app/static/screenshots directory
  3. Removes database records
  4. Commits changes
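The "find expired records" step amounts to a cutoff filter. A minimal sketch, with (path, timestamp) tuples standing in for Screenshot rows; the name `select_expired` is illustrative, not the actual cleanup API:

```python
from datetime import datetime, timedelta

def select_expired(records, retention_days, now=None):
    """Return the records whose timestamp is older than the retention window."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=retention_days)
    return [(path, ts) for path, ts in records if ts < cutoff]
```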

Downloaded Files Cleanup

Old downloaded files are automatically deleted based on retention period.

Configuration:

# In .env
DOWNLOAD_RETENTION_DAYS=7  # Default: 7 days

Manual Cleanup:

docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_downloads

app = create_app()
with app.app_context():
    count = cleanup_old_downloads(retention_days=7)
    print(f'Deleted {count} old downloads')
"

Cleanup Process:

  1. Finds downloaded files older than the retention period
  2. Deletes files from the /app/static/downloads directory
  3. Removes database records
  4. Commits changes

Aggressive Cleanup

For immediate storage recovery, use shorter retention:

docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_downloads, cleanup_old_screenshots

app = create_app()
with app.app_context():
    # Delete files older than 1 day
    downloads = cleanup_old_downloads(retention_days=1)
    screenshots = cleanup_old_screenshots(retention_days=1)
    print(f'Deleted {downloads} downloads, {screenshots} screenshots')
"

Warning: This permanently deletes files. Consider downloading important files first.

Execution Cleanup

Timeout Threshold

The timeout is based on MAX_EXECUTION_TIME config:

# config.py
MAX_EXECUTION_TIME = 300  # 5 minutes

# Cleanup checks for executions older than:
timeout_threshold = MAX_EXECUTION_TIME + 60  # 6 minutes

Cleanup Interval

Default: Check every 60 seconds

To change the interval, modify app/__init__.py:

# Check every 30 seconds
cleanup_thread = start_cleanup_thread(app, interval=30)

# Check every 2 minutes
cleanup_thread = start_cleanup_thread(app, interval=120)

Production Deployment

Docker/Container Environments

The background thread runs automatically when the Flask app starts.

Cron Job (Optional)

For redundancy, you can set up a cron job:

# Run cleanup every 5 minutes
*/5 * * * * cd /path/to/scrapazoid && flask cleanup_stuck_executions >> /var/log/scrapazoid-cleanup.log 2>&1

Monitoring

Database Query

Check for currently stuck executions:

SELECT id, user_id, started_at,
       EXTRACT(EPOCH FROM (NOW() - started_at)) as runtime_seconds
FROM executions
WHERE status IN ('running', 'pending')
  AND started_at < NOW() - INTERVAL '6 minutes'
ORDER BY started_at;

Application Logs

Monitor the cleanup thread:

# Follow application logs
tail -f /var/log/scrapazoid.log | grep Cleanup

Troubleshooting

Cleanup Not Running

Check if the background thread is active:

# In `flask shell` (the `app` object is available automatically)
print(hasattr(app, 'cleanup_thread'))  # Should be True
print(app.cleanup_thread.is_alive())   # Should be True

Executions Still Stuck

  1. Run manual cleanup: flask cleanup_stuck_executions
  2. Check application logs for errors
  3. Restart the application to restart the cleanup thread
  4. Verify MAX_EXECUTION_TIME configuration

Thread Not Starting

The cleanup thread only starts when:

  • app.config['TESTING'] is False
  • Application starts successfully

Check for errors during application startup.

Technical Details

Thread Implementation

  • Type: Daemon thread
  • Interval: 60 seconds (configurable)
  • Auto-start: Yes (unless testing mode)
  • Graceful shutdown: Yes (via stop() method)
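The properties above (daemon, fixed interval, graceful stop()) can be sketched as a stoppable loop built on threading.Event. This is an assumed design, not the actual start_cleanup_thread implementation in app/__init__.py:

```python
import threading

class CleanupThread(threading.Thread):
    """Minimal sketch of a stoppable daemon cleanup loop."""

    def __init__(self, task, interval=60):
        super().__init__(daemon=True)  # daemon: won't block process exit
        self._task = task
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            self._task()
            # wait() sleeps for the interval but returns early on stop()
            self._stop_event.wait(self._interval)

    def stop(self):
        """Graceful shutdown: wake the loop and let it exit."""
        self._stop_event.set()
```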

Code Location

  • Cleanup logic: app/cleanup.py
  • Thread initialization: app/__init__.py
  • CLI command: run.py

Storage Monitoring

Check Volume Usage

Monitor Docker volume disk usage:

# List all volumes
docker volume ls

# Inspect downloads volume
docker volume inspect scrapazoid_downloads

# Check disk usage
docker system df -v

Database Statistics

Check file counts and sizes:

docker-compose exec web python -c "
from app import create_app, db
from app.models import Screenshot, DownloadedFile
from sqlalchemy import func
from datetime import datetime, timedelta

app = create_app()
with app.app_context():
    # Screenshot stats
    screenshot_count = Screenshot.query.count()
    screenshot_size = db.session.query(func.sum(Screenshot.file_size)).scalar() or 0

    # Download stats
    download_count = DownloadedFile.query.count()
    download_size = db.session.query(func.sum(DownloadedFile.file_size)).scalar() or 0

    print(f'Screenshots: {screenshot_count} files')
    print(f'Downloads: {download_count} files, {download_size / (1024**2):.2f} MB')

    # Old files (> 7 days)
    cutoff = datetime.utcnow() - timedelta(days=7)
    old_screenshots = Screenshot.query.filter(Screenshot.timestamp < cutoff).count()
    old_downloads = DownloadedFile.query.filter(DownloadedFile.timestamp < cutoff).count()

    print(f'Old screenshots (>7d): {old_screenshots}')
    print(f'Old downloads (>7d): {old_downloads}')
"

Automated Monitoring

Set up monitoring alerts:

#!/bin/bash
# monitor_storage.sh

# Get volume path
VOLUME_PATH=$(docker volume inspect scrapazoid_downloads --format '{{ .Mountpoint }}')

# Check disk usage
USAGE=$(df "$VOLUME_PATH" | tail -1 | awk '{print $5}' | sed 's/%//')

if [ "$USAGE" -gt 80 ]; then
    echo "WARNING: Downloads volume is ${USAGE}% full"
    # Send alert (email, Slack, etc.)
fi

Cleanup Best Practices

Regular Schedule

Run aggressive cleanup periodically during maintenance windows:

#!/bin/bash
# weekly_cleanup.sh

docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_downloads, cleanup_old_screenshots, cleanup_stuck_executions

app = create_app()
with app.app_context():
    stuck = cleanup_stuck_executions()
    downloads = cleanup_old_downloads(retention_days=3)
    screenshots = cleanup_old_screenshots(retention_days=3)

    print(f'Cleaned up:')
    print(f'  - {stuck} stuck executions')
    print(f'  - {downloads} old downloads')
    print(f'  - {screenshots} old screenshots')
"

Retention Tuning

Adjust retention based on usage patterns:

  • High activity, limited storage: 1-3 days
  • Moderate activity: 7-14 days (default)
  • Low activity, archival needs: 30-90 days

Storage Capacity Planning

Calculate storage needs:

Daily Storage =
  (Avg executions/day) ×
  (Screenshots/execution × Avg screenshot size +
   Downloads/execution × Avg download size)

Total Storage = Daily Storage × Retention Days × 1.2 (safety margin)

Example:

  • 50 executions/day
  • 10 screenshots/execution @ 100KB each = 1MB
  • 5 downloads/execution @ 5MB each = 25MB
  • Total per execution: 26MB
  • Daily: 50 × 26MB = 1.3GB
  • 7-day retention: 1.3GB × 7 × 1.2 ≈ 11GB needed
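The capacity formula can be wrapped in a small helper for quick what-if estimates. A sketch using decimal units (1000, matching the arithmetic above); the function name and parameters are illustrative:

```python
def storage_needed_gb(execs_per_day, shots, shot_kb, downloads, dl_mb,
                      retention_days, safety_margin=1.2):
    """Estimate total storage in GB per the capacity formula above."""
    per_exec_mb = shots * shot_kb / 1000 + downloads * dl_mb
    daily_gb = execs_per_day * per_exec_mb / 1000
    return daily_gb * retention_days * safety_margin
```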

Troubleshooting

Cleanup Not Removing Files

Check file permissions:

docker-compose exec web ls -la /app/static/downloads
docker-compose exec web ls -la /app/static/screenshots

Manually remove orphaned files:

# Find files not in database
docker-compose exec web python -c "
import os
from app import create_app
from app.models import DownloadedFile

app = create_app()
with app.app_context():
    downloads_dir = '/app/static/downloads'
    db_files = {os.path.basename(f.file_path) for f in DownloadedFile.query.all()}
    disk_files = set(os.listdir(downloads_dir))
    orphaned = disk_files - db_files

    print(f'Orphaned files: {len(orphaned)}')
    for f in orphaned:
        print(f'  - {f}')
"

Volume Full Despite Cleanup

Check Docker system usage:

docker system df

# Clean up unused Docker data (careful: --volumes also deletes unused volumes)
docker system prune -a --volumes

Check for large individual files:

docker-compose exec web du -h /app/static/downloads | sort -hr | head -20

See Also