# Cleanup System

Scrapazoid includes automatic cleanup jobs that maintain system health and manage storage.
## Overview

The cleanup system ensures that:

- ✅ No execution runs longer than the configured timeout
- ✅ Stuck executions are automatically marked as FAILED
- ✅ Old screenshots and downloads are removed
- ✅ The database stays clean and accurate
- ✅ Storage is managed automatically
- ✅ Resources are properly freed

## Automatic Background Cleanup

A background thread runs continuously to perform cleanup tasks:

- **Every 60 seconds:** check for stuck executions
- **Every 24 hours:** clean up old screenshots and downloaded files
## Detection Criteria

An execution is considered "stuck" if:

- Its status is `running` or `pending`
- It started more than 6 minutes ago (the 5-minute timeout plus a 1-minute buffer)

## Cleanup Actions

For each stuck execution:

- The status is changed to `failed`
- The error message is set to: `"Execution stuck for {runtime}. Automatically terminated by cleanup job."`
- The completion timestamp is recorded
- The database changes are committed
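The detection and cleanup steps above can be sketched as plain functions. This is a simplified illustration, not the actual `app/cleanup.py` code; attribute names such as `started_at`, `error`, and `completed_at` are assumed to mirror the real model:

```python
from datetime import datetime, timedelta, timezone

MAX_EXECUTION_TIME = 300  # seconds, from config.py
TIMEOUT_BUFFER = 60       # one-minute grace period
STUCK_THRESHOLD = timedelta(seconds=MAX_EXECUTION_TIME + TIMEOUT_BUFFER)

def find_stuck(executions, now=None):
    """Return executions matching the detection criteria above."""
    now = now or datetime.now(timezone.utc)
    return [
        e for e in executions
        if e.status in ("running", "pending")
        and now - e.started_at > STUCK_THRESHOLD
    ]

def mark_failed(execution, now=None):
    """Apply the cleanup actions to one stuck execution."""
    now = now or datetime.now(timezone.utc)
    runtime = now - execution.started_at
    execution.status = "failed"
    execution.error = (
        f"Execution stuck for {runtime}. "
        "Automatically terminated by cleanup job."
    )
    execution.completed_at = now
```

In the real job, the loop over `find_stuck(...)` would end with a single database commit, matching the actions listed above.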
## Logs

The cleanup thread logs its activity to stdout:

```text
[Cleanup Thread] Started with 60s interval
[Cleanup] Found 2 stuck execution(s)
[Cleanup] Marking execution #123 as failed (runtime: 0:07:42)
[Cleanup] Marking execution #124 as failed (runtime: 0:08:15)
[Cleanup] Cleaned up 2 stuck execution(s)
[Cleanup Thread] Running daily file cleanup...
[Cleanup Thread] Deleted 45 old screenshots, 123 old downloads
```
## Manual Cleanup

To immediately clean up stuck executions, run:

```bash
flask cleanup_stuck_executions
```

### Example Output

```text
$ flask cleanup_stuck_executions
[Cleanup] Found 3 stuck execution(s)
[Cleanup] Marking execution #45 as failed (runtime: 0:12:34)
[Cleanup] Marking execution #46 as failed (runtime: 0:15:22)
[Cleanup] Marking execution #47 as failed (runtime: 1:03:15)
Successfully cleaned up 3 stuck execution(s).
```
### When to Use Manual Cleanup
- Cleaning up executions stuck before automatic cleanup was implemented
- Immediate cleanup without waiting for the 60-second interval
- Testing the cleanup functionality
- After a system crash or restart
## File Cleanup

### Screenshot Cleanup

Old screenshots are automatically deleted once they exceed the retention period.

**Configuration:**
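Retention is controlled by a configuration value; as an illustration only, it presumably looks something like this (the `SCREENSHOT_RETENTION_DAYS` name is an assumption — check `config.py` and `app/cleanup.py` for the actual setting):

```python
# config.py -- hypothetical setting name, shown for illustration
SCREENSHOT_RETENTION_DAYS = 7  # delete screenshots older than this many days
```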
**Manual Cleanup:**

```bash
docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_screenshots
app = create_app()
with app.app_context():
    count = cleanup_old_screenshots(retention_days=7)
    print(f'Deleted {count} old screenshots')
"
```
**Cleanup Process:**

1. Finds screenshots older than the retention period
2. Deletes the files from the /app/static/screenshots directory
3. Removes the matching database records
4. Commits the changes
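The file-deletion half of those steps can be sketched as follows. This is a simplified stand-in for the real `cleanup_old_screenshots` helper, with the database steps left as comments:

```python
import os
import time

def cleanup_old_files(directory, retention_days=7):
    """Steps 1-2: find and delete files older than the retention period."""
    cutoff = time.time() - retention_days * 86400
    deleted = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            deleted += 1
    # Steps 3-4 in the real helper: delete the Screenshot row for each
    # removed file, then db.session.commit()
    return deleted
```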
### Downloaded Files Cleanup

Old downloaded files are automatically deleted once they exceed the retention period.

**Configuration:**
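As with screenshots, retention is presumably a single config value (the `DOWNLOAD_RETENTION_DAYS` name is an assumption, not confirmed by this page):

```python
# config.py -- hypothetical setting name, shown for illustration
DOWNLOAD_RETENTION_DAYS = 7  # delete downloaded files older than this many days
```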
**Manual Cleanup:**

```bash
docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_downloads
app = create_app()
with app.app_context():
    count = cleanup_old_downloads(retention_days=7)
    print(f'Deleted {count} old downloads')
"
```
**Cleanup Process:**

1. Finds downloaded files older than the retention period
2. Deletes the files from the /app/static/downloads directory
3. Removes the matching database records
4. Commits the changes
### Aggressive Cleanup

For immediate storage recovery, use a shorter retention period:

```bash
docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_downloads, cleanup_old_screenshots
app = create_app()
with app.app_context():
    # Delete files older than 1 day
    downloads = cleanup_old_downloads(retention_days=1)
    screenshots = cleanup_old_screenshots(retention_days=1)
    print(f'Deleted {downloads} downloads, {screenshots} screenshots')
"
```

**Warning:** This permanently deletes files. Consider downloading important files first.
## Execution Cleanup

### Timeout Threshold

The timeout is based on the `MAX_EXECUTION_TIME` setting:

```python
# config.py
MAX_EXECUTION_TIME = 300  # 5 minutes

# Cleanup checks for executions older than:
timeout_threshold = MAX_EXECUTION_TIME + 60  # 6 minutes
```
### Cleanup Interval

Default: check every 60 seconds.

To change the interval, modify `app/__init__.py`:

```python
# Check every 30 seconds
cleanup_thread = start_cleanup_thread(app, interval=30)

# Check every 2 minutes
cleanup_thread = start_cleanup_thread(app, interval=120)
```
## Production Deployment

### Docker/Container Environments

The background thread runs automatically when the Flask app starts; no extra setup is needed.

### Cron Job (Optional)

For redundancy, you can also set up a cron job:

```bash
# Run cleanup every 5 minutes
*/5 * * * * cd /path/to/scrapazoid && flask cleanup_stuck_executions >> /var/log/scrapazoid-cleanup.log 2>&1
```
## Monitoring

### Database Query

Check for currently stuck executions:

```sql
SELECT id, user_id, started_at,
       EXTRACT(EPOCH FROM (NOW() - started_at)) AS runtime_seconds
FROM executions
WHERE status IN ('running', 'pending')
  AND started_at < NOW() - INTERVAL '6 minutes'
ORDER BY started_at;
```
### Application Logs

Monitor the cleanup thread's log output:
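For example, with the docker-compose setup used elsewhere on this page (service name `web` assumed), the thread's log lines can be filtered out of the container output:

```shell
# Follow the web container's stdout and keep only cleanup-related lines
docker-compose logs -f web | grep '\[Cleanup'
```

The pattern matches both the `[Cleanup Thread]` and `[Cleanup]` prefixes shown in the Logs section.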
## Troubleshooting

### Cleanup Not Running

Check whether the background thread is active:

```python
# In the Flask shell
from app import app
print(hasattr(app, 'cleanup_thread'))  # Should be True
print(app.cleanup_thread.is_alive())   # Should be True
```
### Executions Still Stuck

- Run manual cleanup: `flask cleanup_stuck_executions`
- Check the application logs for errors
- Restart the application to restart the cleanup thread
- Verify the `MAX_EXECUTION_TIME` configuration
### Thread Not Starting

The cleanup thread only starts when:

- `app.config['TESTING']` is `False`
- The application starts successfully

Check for errors during application startup.
## Technical Details

### Thread Implementation

- **Type:** daemon thread
- **Interval:** 60 seconds (configurable)
- **Auto-start:** yes (unless in testing mode)
- **Graceful shutdown:** yes (via the `stop()` method)
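A minimal sketch of this daemon-thread pattern (illustrative only — the real initialization lives in `app/__init__.py`, and `run_cleanup` here stands in for the actual per-interval cleanup calls):

```python
import threading

def start_cleanup_thread(app, interval=60):
    """Run cleanup on a daemon thread until stop() is called."""
    stop_event = threading.Event()

    def loop():
        # wait() doubles as the sleep and the shutdown signal,
        # so stop() takes effect within one interval
        while not stop_event.wait(interval):
            with app.app_context():
                app.run_cleanup()  # placeholder for the real cleanup work

    thread = threading.Thread(target=loop, daemon=True)
    thread.stop = stop_event.set  # graceful-shutdown hook
    thread.start()
    return thread
```

Because the thread is a daemon, it never blocks interpreter shutdown even if `stop()` is not called.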
### Code Location

- Cleanup logic: `app/cleanup.py`
- Thread initialization: `app/__init__.py`
- CLI command: `run.py`
## Storage Monitoring

### Check Volume Usage

Monitor Docker volume disk usage:

```bash
# List all volumes
docker volume ls

# Inspect the downloads volume
docker volume inspect scrapazoid_downloads

# Check disk usage
docker system df -v
```
### Database Statistics

Check file counts and sizes:

```bash
docker-compose exec web python -c "
from app import create_app, db
from app.models import Screenshot, DownloadedFile
from sqlalchemy import func
from datetime import datetime, timedelta
app = create_app()
with app.app_context():
    # Screenshot stats
    screenshot_count = Screenshot.query.count()
    screenshot_size = db.session.query(func.sum(Screenshot.file_size)).scalar() or 0

    # Download stats
    download_count = DownloadedFile.query.count()
    download_size = db.session.query(func.sum(DownloadedFile.file_size)).scalar() or 0

    print(f'Screenshots: {screenshot_count} files, {screenshot_size / (1024**2):.2f} MB')
    print(f'Downloads: {download_count} files, {download_size / (1024**2):.2f} MB')

    # Old files (> 7 days)
    cutoff = datetime.utcnow() - timedelta(days=7)
    old_screenshots = Screenshot.query.filter(Screenshot.timestamp < cutoff).count()
    old_downloads = DownloadedFile.query.filter(DownloadedFile.timestamp < cutoff).count()
    print(f'Old screenshots (>7d): {old_screenshots}')
    print(f'Old downloads (>7d): {old_downloads}')
"
```
### Automated Monitoring

Set up monitoring alerts:

```bash
#!/bin/bash
# monitor_storage.sh

# Get the volume path
VOLUME_PATH=$(docker volume inspect scrapazoid_downloads --format '{{ .Mountpoint }}')

# Check disk usage
USAGE=$(df "$VOLUME_PATH" | tail -1 | awk '{print $5}' | sed 's/%//')

if [ "$USAGE" -gt 80 ]; then
    echo "WARNING: Downloads volume is ${USAGE}% full"
    # Send an alert (email, Slack, etc.)
fi
```
## Cleanup Best Practices

### Regular Schedule

Run aggressive cleanup periodically during maintenance windows:

```bash
#!/bin/bash
# weekly_cleanup.sh
docker-compose exec web python -c "
from app import create_app
from app.cleanup import cleanup_old_downloads, cleanup_old_screenshots, cleanup_stuck_executions
app = create_app()
with app.app_context():
    stuck = cleanup_stuck_executions()
    downloads = cleanup_old_downloads(retention_days=3)
    screenshots = cleanup_old_screenshots(retention_days=3)
    print('Cleaned up:')
    print(f' - {stuck} stuck executions')
    print(f' - {downloads} old downloads')
    print(f' - {screenshots} old screenshots')
"
```
### Retention Tuning

Adjust retention based on usage patterns:

- **High activity, limited storage:** 1-3 days
- **Moderate activity:** 7-14 days (default)
- **Low activity, archival needs:** 30-90 days
### Storage Capacity Planning

Calculate storage needs:

```text
Daily Storage = (Avg executions/day) ×
                (Screenshots/execution × Avg screenshot size +
                 Downloads/execution × Avg download size)

Total Storage = Daily Storage × Retention Days × 1.2 (safety margin)
```

**Example:**

- 50 executions/day
- 10 screenshots/execution @ 100 KB each = 1 MB
- 5 downloads/execution @ 5 MB each = 25 MB
- Total per execution: 26 MB
- Daily: 50 × 26 MB = 1.3 GB
- 7-day retention: 1.3 GB × 7 × 1.2 ≈ 11 GB needed
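The example arithmetic can be checked directly:

```python
# Worked example from above, in MB
per_execution_mb = 10 * 0.1 + 5 * 5    # screenshots + downloads = 26 MB
daily_mb = 50 * per_execution_mb       # 1300 MB ~= 1.3 GB/day
total_gb = daily_mb * 7 * 1.2 / 1000   # 7-day retention x 1.2 safety margin
print(f'{total_gb:.1f} GB needed')     # ~11 GB
```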
## Storage Troubleshooting

### Cleanup Not Removing Files

Check file permissions:

```bash
docker-compose exec web ls -la /app/static/downloads
docker-compose exec web ls -la /app/static/screenshots
```

Manually remove orphaned files:

```bash
# Find files on disk that are not in the database
docker-compose exec web python -c "
import os
from app import create_app
from app.models import DownloadedFile
app = create_app()
with app.app_context():
    downloads_dir = '/app/static/downloads'
    db_files = {os.path.basename(f.file_path) for f in DownloadedFile.query.all()}
    disk_files = set(os.listdir(downloads_dir))
    orphaned = disk_files - db_files
    print(f'Orphaned files: {len(orphaned)}')
    for f in orphaned:
        print(f' - {f}')
"
```
### Volume Full Despite Cleanup

Check overall Docker system usage with `docker system df -v` (see Check Volume Usage above).

Check for large individual files:
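One way to do that, assuming the same container layout as the rest of this page:

```shell
# Show the 20 largest entries under the downloads directory
docker-compose exec web sh -c "du -ah /app/static/downloads | sort -rh | head -20"
```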
## See Also
- Configuration Options - Configure retention periods
- Database Management - Database maintenance
- Downloading Files - File download feature