Example: Using Imported Data¶
This example demonstrates how to use imported data to scrape multiple URLs without hardcoding them in your script.
The Problem¶
You want to scrape the same information from multiple websites, but you don't want to hardcode the URLs in your script. You also want to easily add or remove URLs without editing code.
The Solution¶
Use the Import Data feature to pass URLs as parameters to your script.
Step 1: Import Your Data¶
Click the "Import Data" button and paste this JSON:
{
  "urls": [
    "https://example.com",
    "https://www.iana.org/domains/reserved",
    "https://www.rfc-editor.org"
  ],
  "capture_screenshots": true,
  "extract_title": true,
  "extract_headings": true
}
Click "Import". The button should turn green with a checkmark.
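If the import fails, the most common cause is malformed JSON (a trailing comma, single quotes, or a missing bracket). You can sanity-check a payload locally with Python's json module before pasting it; this is a quick local check, not part of the tool itself:

```python
import json

payload = """
{
  "urls": ["https://example.com"],
  "capture_screenshots": true,
  "extract_title": true,
  "extract_headings": true
}
"""

try:
    config = json.loads(payload)
    print(f"Valid JSON with {len(config['urls'])} URL(s)")
except json.JSONDecodeError as e:
    print(f"Invalid JSON: {e}")
```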
Step 2: Write Your Script¶
import asyncio

async def main(page):
    # Validate imported data
    if not imported_data:
        debug_log("ERROR: No data imported. Please import JSON data first.")
        return

    if 'urls' not in imported_data:
        debug_log("ERROR: Missing 'urls' field in imported data")
        return

    # Get configuration
    urls = imported_data['urls']
    capture_screenshots = imported_data.get('capture_screenshots', False)
    extract_title = imported_data.get('extract_title', True)
    extract_headings = imported_data.get('extract_headings', False)

    debug_log(f"Processing {len(urls)} URLs")
    debug_log(f"Screenshots: {capture_screenshots}")
    debug_log(f"Extract titles: {extract_title}")
    debug_log(f"Extract headings: {extract_headings}")

    # Process each URL
    all_results = []
    for i, url in enumerate(urls):
        debug_log(f"[{i+1}/{len(urls)}] Processing: {url}")
        try:
            # Navigate to the URL and wait for the network to settle
            await page.goto(url, timeout=10000)
            await page.wait_for_load_state('networkidle')

            # Build result object
            result = {'url': url, 'success': True}

            # Extract title if requested
            if extract_title:
                title = await page.title()
                result['title'] = title
                debug_log(f"  Title: {title}")

            # Extract headings if requested
            if extract_headings:
                headings = []
                h1_elements = await page.locator('h1').all()
                for h1 in h1_elements:
                    text = await h1.text_content()
                    if text and text.strip():
                        headings.append(text.strip())
                result['headings'] = headings
                debug_log(f"  Found {len(headings)} headings")

            # Capture screenshot if requested
            if capture_screenshots:
                await capture_screenshot(f"Page {i+1}: {url}")

            all_results.append(result)
        except Exception as e:
            debug_log(f"  ERROR: {str(e)}")
            all_results.append({
                'url': url,
                'success': False,
                'error': str(e)
            })

        # Small delay between requests
        if i < len(urls) - 1:
            await asyncio.sleep(1)

    # Save all results
    scrape_data({
        'config': {
            'urls_count': len(urls),
            'capture_screenshots': capture_screenshots,
            'extract_title': extract_title,
            'extract_headings': extract_headings
        },
        'results': all_results
    })

    debug_log(f"Completed! Processed {len(all_results)} URLs")
Step 3: Run the Script¶
Click "Run" and watch the script:
- Process each URL from your imported data
- Extract titles and headings based on your configuration
- Capture screenshots if enabled
- Store all results in structured format
Step 4: View Results¶
Switch to the Data tab to see the extracted information in JSON format:
{
  "config": {
    "urls_count": 3,
    "capture_screenshots": true,
    "extract_title": true,
    "extract_headings": true
  },
  "results": [
    {
      "url": "https://example.com",
      "success": true,
      "title": "Example Domain",
      "headings": ["Example Domain"]
    },
    {
      "url": "https://www.iana.org/domains/reserved",
      "success": true,
      "title": "IANA — Reserved Domains",
      "headings": ["Reserved Domains"]
    },
    {
      "url": "https://www.rfc-editor.org",
      "success": true,
      "title": "RFC Editor",
      "headings": ["RFC Editor"]
    }
  ]
}
Variations¶
Scrape Different URLs¶
Update the imported data with new URLs:
{
  "urls": [
    "https://github.com",
    "https://stackoverflow.com",
    "https://news.ycombinator.com"
  ],
  "capture_screenshots": false,
  "extract_title": true,
  "extract_headings": false
}
Run the script again without editing code!
Single URL with Parameters¶
Import a single URL with scraping parameters:
{
  "urls": ["https://news.ycombinator.com"],
  "capture_screenshots": true,
  "extract_title": true,
  "extract_headings": true,
  "max_items": 10
}
Modify the script to use max_items:
# In the script, add:
max_items = imported_data.get('max_items', 100)

# When extracting items ('.item' is a placeholder selector; adjust it for the site):
items = await page.locator('.item').all()
for i, item in enumerate(items[:max_items]):
    text = await item.text_content()
    debug_log(f"Item {i+1}: {(text or '').strip()}")
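The cap and the default interact like this in plain Python (here an ordinary config dict stands in for imported_data):

```python
config = {"urls": ["https://news.ycombinator.com"], "max_items": 10}

# .get() falls back to the default only when the key is absent
max_items = config.get("max_items", 100)   # 10, taken from the import
default_cap = {}.get("max_items", 100)     # 100, the fallback

# Slicing never raises, even if fewer items exist than the cap allows
items = [f"story-{n}" for n in range(1, 26)]  # 25 scraped items
capped = items[:max_items]
print(len(capped))  # prints 10
```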
Error Handling¶
The script includes error handling for each URL: if one URL fails, the error is logged and recorded, and the script continues with the next. Try it with a deliberately broken URL:
{
  "urls": [
    "https://example.com",
    "https://invalid-url-that-will-fail",
    "https://www.iana.org/domains/reserved"
  ],
  "capture_screenshots": true,
  "extract_title": true,
  "extract_headings": false
}
Result:
{
  "results": [
    {"url": "https://example.com", "success": true, "title": "Example Domain"},
    {"url": "https://invalid-url-that-will-fail", "success": false, "error": "..."},
    {"url": "https://www.iana.org/domains/reserved", "success": true, "title": "IANA — Reserved Domains"}
  ]
}
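A small helper can split a result list like the one above into successes and failures, for example to decide whether a run needs retrying. This is a sketch; summarize_results is not part of the tool's API:

```python
def summarize_results(results):
    """Split scrape results into successes and failures."""
    ok = [r for r in results if r.get("success")]
    failed = [r for r in results if not r.get("success")]
    return {
        "total": len(results),
        "succeeded": len(ok),
        "failed_urls": [r["url"] for r in failed],
    }

results = [
    {"url": "https://example.com", "success": True, "title": "Example Domain"},
    {"url": "https://invalid-url-that-will-fail", "success": False, "error": "..."},
    {"url": "https://www.iana.org/domains/reserved", "success": True, "title": "IANA — Reserved Domains"},
]
summary = summarize_results(results)
print(summary)
```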
Key Takeaways¶
- Parameterization: Import Data makes scripts reusable with different inputs
- Configuration: Use imported data for feature toggles (screenshots, extraction options)
- Validation: Always check that imported_data exists and has the expected fields
- Error Handling: Handle errors gracefully so one failure doesn't stop the whole script
- Defaults: Use .get() with defaults for optional configuration
- Logging: Store the configuration in scrape_data() to track what was used
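The validation and defaults takeaways can be bundled into a single helper that checks required fields and merges in defaults. This is a sketch, not part of the tool's API; load_config and DEFAULTS are illustrative names:

```python
DEFAULTS = {
    "capture_screenshots": False,
    "extract_title": True,
    "extract_headings": False,
}

def load_config(data):
    """Validate imported data and merge it with defaults.

    Raises ValueError rather than returning silently, so a bad
    import fails loudly at the top of the script.
    """
    if not data:
        raise ValueError("No data imported. Please import JSON data first.")
    if not data.get("urls"):
        raise ValueError("Missing 'urls' field in imported data")
    # Imported values override the defaults; missing keys keep them
    return {**DEFAULTS, **data}

config = load_config({"urls": ["https://example.com"], "extract_headings": True})
print(config["capture_screenshots"], config["extract_headings"])  # prints: False True
```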