Cleanup System

Scrapazoid includes an automatic cleanup system that maintains system health and manages storage.

Overview

The cleanup system ensures that:

  • ✅ No execution runs longer than the configured timeout
  • ✅ Stuck executions are automatically marked as FAILED
  • ✅ Old screenshots and downloads are removed
  • ✅ Database stays clean and accurate
  • ✅ Storage is managed automatically
  • ✅ Resources are properly freed

Automatic Background Cleanup

A background thread runs continuously to perform cleanup tasks:

  • Every 60 seconds: Check for stuck executions
  • Every 24 hours: Clean up old screenshots and downloaded files

Detection Criteria

An execution is considered "stuck" if:

  1. Status is running or pending
  2. Started more than 6 minutes ago (5-minute timeout + 1-minute buffer)
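The two criteria above can be expressed as a small predicate. This is an illustrative sketch, not the actual app/cleanup.py code; the function name `is_stuck` and the `CLEANUP_BUFFER` constant are assumptions mirroring the description:

```python
from datetime import datetime, timedelta

MAX_EXECUTION_TIME = 300  # seconds, from config.py
CLEANUP_BUFFER = 60       # one-minute grace period on top of the timeout

def is_stuck(status, started_at, now=None):
    """True when an execution matches the stuck criteria above."""
    now = now or datetime.utcnow()
    threshold = timedelta(seconds=MAX_EXECUTION_TIME + CLEANUP_BUFFER)
    return status in ("running", "pending") and (now - started_at) > threshold
```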

Cleanup Actions

For each stuck execution:

  1. Status changed to failed
  2. Error message set to: "Execution stuck for {runtime}. Automatically terminated by cleanup job."
  3. Completion timestamp recorded
  4. Database changes committed
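The four actions can be sketched as a plain function. This is illustrative only: the real cleanup code updates a SQLAlchemy model and commits via the database session, whereas here a dict stands in for the ORM record and the field names are assumptions:

```python
from datetime import datetime

def mark_stuck_failed(execution, now=None):
    """Apply the cleanup actions above to one stuck execution record."""
    now = now or datetime.utcnow()
    runtime = now - execution["started_at"]
    execution["status"] = "failed"                   # 1. status -> failed
    execution["error_message"] = (                   # 2. error message
        f"Execution stuck for {runtime}. "
        "Automatically terminated by cleanup job."
    )
    execution["completed_at"] = now                  # 3. completion timestamp
    return execution                                 # 4. caller commits the change
```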

Logs

The cleanup thread logs activity to stdout:

[Cleanup Thread] Started with 60s interval
[Cleanup] Found 2 stuck execution(s)
[Cleanup] Marking execution #123 as failed (runtime: 0:07:42)
[Cleanup] Marking execution #124 as failed (runtime: 0:08:15)
[Cleanup] Cleaned up 2 stuck execution(s)
[Cleanup Thread] Running daily file cleanup...
[Cleanup Thread] Deleted 45 old screenshots, 123 old downloads

Manual Cleanup

To immediately clean up stuck executions:

flask cleanup_stuck_executions

Example Output

$ flask cleanup_stuck_executions
[Cleanup] Found 3 stuck execution(s)
[Cleanup] Marking execution #45 as failed (runtime: 0:12:34)
[Cleanup] Marking execution #46 as failed (runtime: 0:15:22)
[Cleanup] Marking execution #47 as failed (runtime: 1:03:15)
Successfully cleaned up 3 stuck execution(s).

When to Use Manual Cleanup

  • Cleaning up executions stuck before automatic cleanup was implemented
  • Immediate cleanup without waiting for the 60-second interval
  • Testing the cleanup functionality
  • After a system crash or restart

File Cleanup

Screenshot Cleanup

Old screenshots are automatically deleted based on retention period.

Configuration:

# In .env
SCREENSHOT_RETENTION_DAYS=7  # Default: 7 days

Manual Cleanup:

docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_screenshots

app = create_app()
with app.app_context():
    count = cleanup_old_screenshots(retention_days=7)
    print(f'Deleted {count} old screenshots')
"

Cleanup Process:

  1. Finds screenshots older than the retention period
  2. Deletes files from the /app/static/screenshots directory
  3. Removes database records
  4. Commits changes
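The "find expired records" step amounts to a cutoff filter. A minimal sketch, with (path, timestamp) tuples standing in for Screenshot rows; the name `select_expired` is illustrative, not the actual cleanup API:

```python
from datetime import datetime, timedelta

def select_expired(records, retention_days, now=None):
    """Return the records whose timestamp is older than the retention window."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=retention_days)
    return [(path, ts) for path, ts in records if ts < cutoff]
```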

Downloaded Files Cleanup

Old downloaded files are automatically deleted based on retention period.

Configuration:

# In .env
DOWNLOAD_RETENTION_DAYS=7  # Default: 7 days

Manual Cleanup:

docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_downloads

app = create_app()
with app.app_context():
    count = cleanup_old_downloads(retention_days=7)
    print(f'Deleted {count} old downloads')
"

Cleanup Process:

  1. Finds downloaded files older than the retention period
  2. Deletes files from the /app/static/downloads directory
  3. Removes database records
  4. Commits changes

Aggressive Cleanup

For immediate storage recovery, use shorter retention:

docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_downloads, cleanup_old_screenshots

app = create_app()
with app.app_context():
    # Delete files older than 1 day
    downloads = cleanup_old_downloads(retention_days=1)
    screenshots = cleanup_old_screenshots(retention_days=1)
    print(f'Deleted {downloads} downloads, {screenshots} screenshots')
"

Warning: This permanently deletes files. Consider downloading important files first.

Execution Cleanup

Timeout Threshold

The timeout is based on MAX_EXECUTION_TIME config:

# config.py
MAX_EXECUTION_TIME = 300  # 5 minutes

# Cleanup checks for executions older than:
timeout_threshold = MAX_EXECUTION_TIME + 60  # 6 minutes

Cleanup Interval

Default: Check every 60 seconds

To change the interval, modify app/__init__.py:

# Check every 30 seconds
cleanup_thread = start_cleanup_thread(app, interval=30)

# Check every 2 minutes
cleanup_thread = start_cleanup_thread(app, interval=120)

Production Deployment

Docker/Container Environments

The background thread runs automatically when the Flask app starts.

Cron Job (Optional)

For redundancy, you can set up a cron job:

# Run cleanup every 5 minutes
*/5 * * * * cd /path/to/scrapazoid && flask cleanup_stuck_executions >> /var/log/scrapazoid-cleanup.log 2>&1

Monitoring

Database Query

Check for currently stuck executions:

SELECT id, user_id, started_at,
       EXTRACT(EPOCH FROM (NOW() - started_at)) as runtime_seconds
FROM executions
WHERE status IN ('running', 'pending')
  AND started_at < NOW() - INTERVAL '6 minutes'
ORDER BY started_at;

Application Logs

Monitor the cleanup thread:

# Follow application logs
tail -f /var/log/scrapazoid.log | grep Cleanup

Troubleshooting

Cleanup Not Running

Check if the background thread is active:

# In `flask shell` (the `app` object is available automatically)
print(hasattr(app, 'cleanup_thread'))  # Should be True
print(app.cleanup_thread.is_alive())   # Should be True

Executions Still Stuck

  1. Run manual cleanup: flask cleanup_stuck_executions
  2. Check application logs for errors
  3. Restart the application to restart the cleanup thread
  4. Verify MAX_EXECUTION_TIME configuration

Thread Not Starting

The cleanup thread only starts when:

  • app.config['TESTING'] is False
  • Application starts successfully

Check for errors during application startup.

Technical Details

Thread Implementation

  • Type: Daemon thread
  • Interval: 60 seconds (configurable)
  • Auto-start: Yes (unless testing mode)
  • Graceful shutdown: Yes (via stop() method)
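The properties above (daemon, fixed interval, graceful stop()) can be sketched as a stoppable loop built on threading.Event. This is an assumed design, not the actual start_cleanup_thread implementation in app/__init__.py:

```python
import threading

class CleanupThread(threading.Thread):
    """Minimal sketch of a stoppable daemon cleanup loop."""

    def __init__(self, task, interval=60):
        super().__init__(daemon=True)  # daemon: won't block process exit
        self._task = task
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            self._task()
            # wait() sleeps for the interval but returns early on stop()
            self._stop_event.wait(self._interval)

    def stop(self):
        """Graceful shutdown: wake the loop and let it exit."""
        self._stop_event.set()
```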

Code Location

  • Cleanup logic: app/cleanup.py
  • Thread initialization: app/__init__.py
  • CLI command: run.py

Storage Monitoring

Check Volume Usage

Monitor Docker volume disk usage:

# List all volumes
docker volume ls

# Inspect downloads volume
docker volume inspect scrapazoid_downloads

# Check disk usage
docker system df -v

Database Statistics

Check file counts and sizes:

docker-compose exec web python -c "
from app import create_app, db
from app.models import Screenshot, DownloadedFile
from sqlalchemy import func
from datetime import datetime, timedelta

app = create_app()
with app.app_context():
    # Screenshot stats
    screenshot_count = Screenshot.query.count()
    screenshot_size = db.session.query(func.sum(Screenshot.file_size)).scalar() or 0

    # Download stats
    download_count = DownloadedFile.query.count()
    download_size = db.session.query(func.sum(DownloadedFile.file_size)).scalar() or 0

    print(f'Screenshots: {screenshot_count} files')
    print(f'Downloads: {download_count} files, {download_size / (1024**2):.2f} MB')

    # Old files (> 7 days)
    cutoff = datetime.utcnow() - timedelta(days=7)
    old_screenshots = Screenshot.query.filter(Screenshot.timestamp < cutoff).count()
    old_downloads = DownloadedFile.query.filter(DownloadedFile.timestamp < cutoff).count()

    print(f'Old screenshots (>7d): {old_screenshots}')
    print(f'Old downloads (>7d): {old_downloads}')
"

Automated Monitoring

Set up monitoring alerts:

#!/bin/bash
# monitor_storage.sh

# Get volume path
VOLUME_PATH=$(docker volume inspect scrapazoid_downloads --format '{{ .Mountpoint }}')

# Check disk usage
USAGE=$(df "$VOLUME_PATH" | tail -1 | awk '{print $5}' | sed 's/%//')

if [ "$USAGE" -gt 80 ]; then
    echo "WARNING: Downloads volume is ${USAGE}% full"
    # Send alert (email, Slack, etc.)
fi

Cleanup Best Practices

Regular Schedule

Run aggressive cleanup periodically during maintenance windows:

#!/bin/bash
# weekly_cleanup.sh

docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_downloads, cleanup_old_screenshots, cleanup_stuck_executions

app = create_app()
with app.app_context():
    stuck = cleanup_stuck_executions()
    downloads = cleanup_old_downloads(retention_days=3)
    screenshots = cleanup_old_screenshots(retention_days=3)

    print(f'Cleaned up:')
    print(f'  - {stuck} stuck executions')
    print(f'  - {downloads} old downloads')
    print(f'  - {screenshots} old screenshots')
"

Retention Tuning

Adjust retention based on usage patterns:

  • High activity, limited storage: 1-3 days
  • Moderate activity: 7-14 days (default)
  • Low activity, archival needs: 30-90 days

Storage Capacity Planning

Calculate storage needs:

Daily Storage =
  (Avg executions/day) ×
  (Screenshots/execution × Avg screenshot size +
   Downloads/execution × Avg download size)

Total Storage = Daily Storage × Retention Days × 1.2 (safety margin)

Example:

  • 50 executions/day
  • 10 screenshots/execution @ 100KB each = 1MB
  • 5 downloads/execution @ 5MB each = 25MB
  • Total per execution: 26MB
  • Daily: 50 × 26MB = 1.3GB
  • 7-day retention: 1.3GB × 7 × 1.2 ≈ 11GB needed
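The capacity formula can be wrapped in a small helper for quick what-if estimates. A sketch using decimal units (1000, matching the arithmetic above); the function name and parameters are illustrative:

```python
def storage_needed_gb(execs_per_day, shots, shot_kb, downloads, dl_mb,
                      retention_days, safety_margin=1.2):
    """Estimate total storage in GB per the capacity formula above."""
    per_exec_mb = shots * shot_kb / 1000 + downloads * dl_mb
    daily_gb = execs_per_day * per_exec_mb / 1000
    return daily_gb * retention_days * safety_margin
```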

Troubleshooting

Cleanup Not Removing Files

Check file permissions:

docker-compose exec web ls -la /app/static/downloads
docker-compose exec web ls -la /app/static/screenshots

Manually remove orphaned files:

# Find files not in database
docker-compose exec web python -c "
import os
from app import create_app
from app.models import DownloadedFile

app = create_app()
with app.app_context():
    downloads_dir = '/app/static/downloads'
    db_files = {os.path.basename(f.file_path) for f in DownloadedFile.query.all()}
    disk_files = set(os.listdir(downloads_dir))
    orphaned = disk_files - db_files

    print(f'Orphaned files: {len(orphaned)}')
    for f in orphaned:
        print(f'  - {f}')
"

Volume Full Despite Cleanup

Check Docker system usage:

docker system df

# Clean up unused Docker data (careful: --volumes also deletes unused volumes)
docker system prune -a --volumes

Check for large individual files:

docker-compose exec web du -h /app/static/downloads | sort -hr | head -20

See Also