Example: Using Imported Data¶
This example demonstrates how to use imported data to scrape multiple URLs without hardcoding them in your script.
The Problem¶
You want to scrape the same information from multiple websites, but you don't want to hardcode the URLs in your script. You also want to easily add or remove URLs without editing code.
The Solution¶
Use the Import Data feature to pass URLs as parameters to your script.
Step 1: Import Your Data¶
Click the "Import Data" button and paste this JSON:
{
  "urls": [
    "https://example.com",
    "https://www.iana.org/domains/reserved",
    "https://www.rfc-editor.org"
  ],
  "capture_screenshots": true,
  "extract_title": true,
  "extract_headings": true
}
Click "Import". The button should turn green with a checkmark.
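If the import fails, the most common cause is malformed JSON (a trailing comma, single quotes, or a missing bracket). You can sanity-check a payload locally with Python's json module before pasting it; this is a quick local check, not part of the tool itself:

```python
import json

payload = """
{
  "urls": ["https://example.com"],
  "capture_screenshots": true,
  "extract_title": true,
  "extract_headings": true
}
"""

try:
    config = json.loads(payload)
    print(f"Valid JSON with {len(config['urls'])} URL(s)")
except json.JSONDecodeError as e:
    print(f"Invalid JSON: {e}")
```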
Step 2: Write Your Script¶
import asyncio

async def main(page):
    # Validate imported data
    if not imported_data:
        debug_log("ERROR: No data imported. Please import JSON data first.")
        return

    if 'urls' not in imported_data:
        debug_log("ERROR: Missing 'urls' field in imported data")
        return

    # Get configuration
    urls = imported_data['urls']
    capture_screenshots = imported_data.get('capture_screenshots', False)
    extract_title = imported_data.get('extract_title', True)
    extract_headings = imported_data.get('extract_headings', False)

    debug_log(f"Processing {len(urls)} URLs")
    debug_log(f"Screenshots: {capture_screenshots}")
    debug_log(f"Extract titles: {extract_title}")
    debug_log(f"Extract headings: {extract_headings}")

    # Process each URL
    all_results = []
    for i, url in enumerate(urls):
        debug_log(f"[{i+1}/{len(urls)}] Processing: {url}")
        try:
            # Navigate to the URL and wait for the network to settle
            await page.goto(url, timeout=10000)
            await page.wait_for_load_state('networkidle')

            # Build result object
            result = {'url': url, 'success': True}

            # Extract title if requested
            if extract_title:
                title = await page.title()
                result['title'] = title
                debug_log(f"  Title: {title}")

            # Extract headings if requested
            if extract_headings:
                headings = []
                h1_elements = await page.locator('h1').all()
                for h1 in h1_elements:
                    text = await h1.text_content()
                    if text and text.strip():
                        headings.append(text.strip())
                result['headings'] = headings
                debug_log(f"  Found {len(headings)} headings")

            # Capture screenshot if requested
            if capture_screenshots:
                await capture_screenshot(f"Page {i+1}: {url}")

            all_results.append(result)
        except Exception as e:
            debug_log(f"  ERROR: {str(e)}")
            all_results.append({
                'url': url,
                'success': False,
                'error': str(e)
            })

        # Small delay between requests
        if i < len(urls) - 1:
            await asyncio.sleep(1)

    # Save all results
    scrape_data({
        'config': {
            'urls_count': len(urls),
            'capture_screenshots': capture_screenshots,
            'extract_title': extract_title,
            'extract_headings': extract_headings
        },
        'results': all_results
    })

    debug_log(f"Completed! Processed {len(all_results)} URLs")
Step 3: Run the Script¶
Click "Run" and watch the script:
- Process each URL from your imported data
- Extract titles and headings based on your configuration
- Capture screenshots if enabled
- Store all results in structured format
Step 4: View Results¶
Switch to the Data tab to see the extracted information in JSON format:
{
  "config": {
    "urls_count": 3,
    "capture_screenshots": true,
    "extract_title": true,
    "extract_headings": true
  },
  "results": [
    {
      "url": "https://example.com",
      "success": true,
      "title": "Example Domain",
      "headings": ["Example Domain"]
    },
    {
      "url": "https://www.iana.org/domains/reserved",
      "success": true,
      "title": "IANA — Reserved Domains",
      "headings": ["Reserved Domains"]
    },
    {
      "url": "https://www.rfc-editor.org",
      "success": true,
      "title": "RFC Editor",
      "headings": ["RFC Editor"]
    }
  ]
}
Variations¶
Scrape Different URLs¶
Update the imported data with new URLs:
{
  "urls": [
    "https://github.com",
    "https://stackoverflow.com",
    "https://news.ycombinator.com"
  ],
  "capture_screenshots": false,
  "extract_title": true,
  "extract_headings": false
}
Run the script again without editing code!
Single URL with Parameters¶
Import a single URL with scraping parameters:
{
  "urls": ["https://news.ycombinator.com"],
  "capture_screenshots": true,
  "extract_title": true,
  "extract_headings": true,
  "max_items": 10
}
Modify the script to use max_items:
# In the script, add:
max_items = imported_data.get('max_items', 100)

# When extracting items ('.item' is a placeholder selector; adjust it for the site):
items = await page.locator('.item').all()
for i, item in enumerate(items[:max_items]):
    text = await item.text_content()
    debug_log(f"Item {i+1}: {(text or '').strip()}")
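The cap and the default interact like this in plain Python (here an ordinary config dict stands in for imported_data):

```python
config = {"urls": ["https://news.ycombinator.com"], "max_items": 10}

# .get() falls back to the default only when the key is absent
max_items = config.get("max_items", 100)   # 10, taken from the import
default_cap = {}.get("max_items", 100)     # 100, the fallback

# Slicing never raises, even if fewer items exist than the cap allows
items = [f"story-{n}" for n in range(1, 26)]  # 25 scraped items
capped = items[:max_items]
print(len(capped))  # prints 10
```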
Error Handling¶
The script includes error handling for each URL: if one URL fails, the error is logged and recorded, and the script continues with the next. Try it with a deliberately broken URL:
{
  "urls": [
    "https://example.com",
    "https://invalid-url-that-will-fail",
    "https://www.iana.org/domains/reserved"
  ],
  "capture_screenshots": true,
  "extract_title": true,
  "extract_headings": false
}
Result:
{
  "results": [
    {"url": "https://example.com", "success": true, "title": "Example Domain"},
    {"url": "https://invalid-url-that-will-fail", "success": false, "error": "..."},
    {"url": "https://www.iana.org/domains/reserved", "success": true, "title": "IANA — Reserved Domains"}
  ]
}
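A small helper can split a result list like the one above into successes and failures, for example to decide whether a run needs retrying. This is a sketch; summarize_results is not part of the tool's API:

```python
def summarize_results(results):
    """Split scrape results into successes and failures."""
    ok = [r for r in results if r.get("success")]
    failed = [r for r in results if not r.get("success")]
    return {
        "total": len(results),
        "succeeded": len(ok),
        "failed_urls": [r["url"] for r in failed],
    }

results = [
    {"url": "https://example.com", "success": True, "title": "Example Domain"},
    {"url": "https://invalid-url-that-will-fail", "success": False, "error": "..."},
    {"url": "https://www.iana.org/domains/reserved", "success": True, "title": "IANA — Reserved Domains"},
]
summary = summarize_results(results)
print(summary)
```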
Key Takeaways¶
- Parameterization: Import Data makes scripts reusable with different inputs
- Configuration: Use imported data for feature toggles (screenshots, extraction options)
- Validation: Always check that imported_data exists and has the expected fields
- Error Handling: Handle errors gracefully so one failure doesn't stop the whole script
- Defaults: Use .get() with defaults for optional configuration
- Logging: Store the configuration in scrape_data() to track what was used
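The validation and defaults takeaways can be bundled into a single helper that checks required fields and merges in defaults. This is a sketch, not part of the tool's API; load_config and DEFAULTS are illustrative names:

```python
DEFAULTS = {
    "capture_screenshots": False,
    "extract_title": True,
    "extract_headings": False,
}

def load_config(data):
    """Validate imported data and merge it with defaults.

    Raises ValueError rather than returning silently, so a bad
    import fails loudly at the top of the script.
    """
    if not data:
        raise ValueError("No data imported. Please import JSON data first.")
    if not data.get("urls"):
        raise ValueError("Missing 'urls' field in imported data")
    # Imported values override the defaults; missing keys keep them
    return {**DEFAULTS, **data}

config = load_config({"urls": ["https://example.com"], "extract_headings": True})
print(config["capture_screenshots"], config["extract_headings"])  # prints: False True
```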