Example: Using Imported Data

This example demonstrates how to use imported data to scrape multiple URLs without hardcoding them in your script.

The Problem

You want to scrape the same information from multiple websites, but you don't want to hardcode the URLs in your script. You also want to easily add or remove URLs without editing code.

The Solution

Use the Import Data feature to pass URLs as parameters to your script.

Step 1: Import Your Data

Click the "Import Data" button and paste this JSON:

{
  "urls": [
    "https://example.com",
    "https://www.iana.org/domains/reserved",
    "https://www.rfc-editor.org"
  ],
  "capture_screenshots": true,
  "extract_title": true,
  "extract_headings": true
}

Click "Import". The button should turn green with a checkmark.

Step 2: Write Your Script

import asyncio

async def main(page):
    # Validate imported data
    if not imported_data:
        debug_log("ERROR: No data imported. Please import JSON data first.")
        return

    if 'urls' not in imported_data:
        debug_log("ERROR: Missing 'urls' field in imported data")
        return

    # Get configuration
    urls = imported_data['urls']
    capture_screenshots = imported_data.get('capture_screenshots', False)
    extract_title = imported_data.get('extract_title', True)
    extract_headings = imported_data.get('extract_headings', False)

    debug_log(f"Processing {len(urls)} URLs")
    debug_log(f"Screenshots: {capture_screenshots}")
    debug_log(f"Extract titles: {extract_title}")
    debug_log(f"Extract headings: {extract_headings}")

    # Process each URL
    all_results = []

    for i, url in enumerate(urls):
        debug_log(f"[{i+1}/{len(urls)}] Processing: {url}")

        try:
            # Navigate to URL
            await page.goto(url, timeout=10000)
            await page.wait_for_load_state('networkidle')

            # Build result object
            result = {'url': url, 'success': True}

            # Extract title if requested
            if extract_title:
                title = await page.title()
                result['title'] = title
                debug_log(f"  Title: {title}")

            # Extract headings if requested
            if extract_headings:
                headings = []
                h1_elements = await page.locator('h1').all()

                for h1 in h1_elements:
                    text = await h1.text_content()
                    if text and text.strip():
                        headings.append(text.strip())

                result['headings'] = headings
                debug_log(f"  Found {len(headings)} headings")

            # Capture screenshot if requested
            if capture_screenshots:
                await capture_screenshot(f"Page {i+1}: {url}")

            all_results.append(result)

        except Exception as e:
            debug_log(f"  ERROR: {str(e)}")
            all_results.append({
                'url': url,
                'success': False,
                'error': str(e)
            })

        # Small delay between requests
        if i < len(urls) - 1:
            await asyncio.sleep(1)

    # Save all results
    scrape_data({
        'config': {
            'urls_count': len(urls),
            'capture_screenshots': capture_screenshots,
            'extract_title': extract_title,
            'extract_headings': extract_headings
        },
        'results': all_results
    })

    debug_log(f"Completed! Processed {len(all_results)} URLs")

Step 3: Run the Script

Click "Run" and watch the script:

  1. Process each URL from your imported data
  2. Extract titles and headings based on your configuration
  3. Capture screenshots if enabled
  4. Store all results in structured format

Step 4: View Results

Switch to the Data tab to see the extracted information in JSON format:

{
  "config": {
    "urls_count": 3,
    "capture_screenshots": true,
    "extract_title": true,
    "extract_headings": true
  },
  "results": [
    {
      "url": "https://example.com",
      "success": true,
      "title": "Example Domain",
      "headings": ["Example Domain"]
    },
    {
      "url": "https://www.iana.org/domains/reserved",
      "success": true,
      "title": "IANA — Reserved Domains",
      "headings": ["Reserved Domains"]
    },
    {
      "url": "https://www.rfc-editor.org",
      "success": true,
      "title": "RFC Editor",
      "headings": ["RFC Editor"]
    }
  ]
}
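Once the run finishes, the payload shown above can be copied out of the Data tab and post-processed like any other JSON. A minimal sketch (the `summarize` helper and the inline sample data are illustrative, not part of the tool; the structure matches the example above):

```python
def summarize(data):
    """Split a results payload into (succeeded, failed) URL lists."""
    ok = [r["url"] for r in data["results"] if r.get("success")]
    bad = [r["url"] for r in data["results"] if not r.get("success")]
    return ok, bad

# Inline sample mirroring the Data tab output above
data = {
    "results": [
        {"url": "https://example.com", "success": True,
         "title": "Example Domain"},
        {"url": "https://bad.invalid", "success": False,
         "error": "timeout"},
    ]
}

ok, bad = summarize(data)
print(ok)   # successful URLs
print(bad)  # failed URLs
```

The same pattern works on a saved file: `json.load()` the export, then pass the resulting dict to `summarize`.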

Variations

Scrape Different URLs

Update the imported data with new URLs:

{
  "urls": [
    "https://github.com",
    "https://stackoverflow.com",
    "https://news.ycombinator.com"
  ],
  "capture_screenshots": false,
  "extract_title": true,
  "extract_headings": false
}

Run the script again without editing code!

Single URL with Parameters

Import a single URL with scraping parameters:

{
  "urls": ["https://news.ycombinator.com"],
  "capture_screenshots": true,
  "extract_title": true,
  "extract_headings": true,
  "max_items": 10
}

Modify the script to use max_items:

# In the script, add:
max_items = imported_data.get('max_items', 100)

# When extracting items (the '.item' selector is a placeholder;
# use a selector that matches the target page):
items = await page.locator('.item').all()
for item in items[:max_items]:
    # Process item...

Error Handling

The script includes error handling for each URL. If one URL fails, the script continues with the next:

{
  "urls": [
    "https://example.com",
    "https://invalid-url-that-will-fail",
    "https://www.iana.org/domains/reserved"
  ],
  "capture_screenshots": true,
  "extract_title": true,
  "extract_headings": false
}

Result:

{
  "results": [
    {"url": "https://example.com", "success": true, "title": "Example Domain"},
    {"url": "https://invalid-url-that-will-fail", "success": false, "error": "..."},
    {"url": "https://www.iana.org/domains/reserved", "success": true, "title": "IANA — Reserved Domains"}
  ]
}
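The example script records a failure and moves on to the next URL. If you would rather retry each URL a few times before giving it up, a small wrapper coroutine can do that. A sketch (not part of the example script; `attempts` and `delay` are illustrative parameters):

```python
import asyncio

async def with_retries(coro_factory, attempts=3, delay=1.0):
    """Run coro_factory() up to `attempts` times, sleeping between tries.

    coro_factory is a zero-argument callable returning a fresh coroutine,
    e.g. lambda: process_one(page, url). Re-raises the last error if all
    attempts fail.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception as e:
            last_error = e
            if attempt < attempts - 1:
                # Linear backoff: 1x, 2x, 3x the base delay
                await asyncio.sleep(delay * (attempt + 1))
    raise last_error
```

Inside the URL loop you would wrap the navigation-and-extraction step, e.g. `result = await with_retries(lambda: process_one(page, url))`, where `process_one` is whatever coroutine you factor that step into.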

Key Takeaways

  1. Parameterization: Import Data makes scripts reusable with different inputs
  2. Configuration: Use imported data for feature toggles (screenshots, extraction options)
  3. Validation: Always check if imported_data exists and has expected fields
  4. Error Handling: Handle errors gracefully so one failure doesn't stop the whole script
  5. Defaults: Use .get() with defaults for optional configuration
  6. Logging: Store configuration in scrape_data() to track what was used
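Takeaways 3 and 5 combine into one reusable pattern: validate the required fields first, then merge the optional ones over defaults. A standalone sketch (the `load_config` function and `DEFAULTS` table are illustrative names, not part of the tool):

```python
# Defaults mirror the optional flags used in the example script
DEFAULTS = {
    "capture_screenshots": False,
    "extract_title": True,
    "extract_headings": False,
}

def load_config(imported_data):
    """Validate imported data and merge optional flags over defaults."""
    if not imported_data:
        raise ValueError("No data imported")
    if "urls" not in imported_data:
        raise ValueError("Missing 'urls' field in imported data")

    config = dict(DEFAULTS)
    for key in DEFAULTS:
        if key in imported_data:
            config[key] = imported_data[key]
    config["urls"] = list(imported_data["urls"])
    return config
```

Failing fast with a clear error message keeps the rest of the script free to assume a well-formed config.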

See Also