name: Web Scraping & Data Extraction Engine
description: Complete web scraping methodology — legal compliance, architecture design, anti-detection, data pipelines, and production operations. Use when building scrapers, extracting web data, monitoring competitors, or automating data collection at scale.

Web Scraping & Data Extraction Engine

Quick Health Check (Run First)

Score your scraping operation (2 points each):

| Signal | Healthy | Unhealthy |
|---|---|---|
| Legal compliance | robots.txt checked, ToS reviewed | Scraping blindly |
| Architecture | Tool matches site complexity | Using Puppeteer for static HTML |
| Anti-detection | Rotation, delays, fingerprint diversity | Single IP, no delays |
| Data quality | Validation + dedup pipeline | Raw dumps, no cleaning |
| Error handling | Retry logic, circuit breakers | Crashes on first 403 |
| Monitoring | Success rates tracked, alerts set | No visibility |
| Storage | Structured, deduplicated, versioned | Flat files, duplicates |
| Scheduling | Appropriate frequency, off-peak | Hammering during business hours |

Score: /16 → 12+: Production-ready | 8-11: Needs work | <8: Stop and redesign


Phase 1: Legal & Ethical Foundation

Pre-Scrape Compliance Checklist

compliance_brief:
  target_domain: ""
  date_assessed: ""
  
  robots_txt:
    checked: false
    target_paths_allowed: false
    crawl_delay_specified: ""
    ai_bot_rules: ""  # Many sites now block AI crawlers specifically
    
  terms_of_service:
    reviewed: false
    scraping_mentioned: false
    scraping_prohibited: false
    api_available: false
    api_sufficient: false
    
  data_classification:
    type: ""  # public-factual | public-personal | behind-auth | copyrighted
    contains_pii: false
    pii_types: []  # name, email, phone, address, photo
    gdpr_applies: false  # EU residents' data
    ccpa_applies: false  # California residents' data
    
  legal_risk: ""  # low | medium | high | do-not-scrape
  decision: ""  # proceed | use-api | request-permission | abandon
  justification: ""

Legal Landscape Quick Reference

| Scenario | Risk Level | Key Case Law |
|---|---|---|
| Public data, no login, robots.txt allows | LOW | hiQ v. LinkedIn (2022) |
| Public data, robots.txt disallows | MEDIUM | Meta v. Bright Data (2024) |
| Behind authentication | HIGH | Van Buren v. US (2021), CFAA |
| Personal data without consent | HIGH | GDPR Art. 6, CCPA §1798.100 |
| Republishing copyrighted content | HIGH | Copyright Act §106 |
| Price/product comparison | LOW | eBay v. Bidder's Edge (fair use) |
| Academic/research use | LOW-MEDIUM | Varies by jurisdiction |
| Bypassing anti-bot measures | HIGH | CFAA "exceeds authorized access" |

Decision Rules

  1. API exists and covers your needs? → Use the API. Always.
  2. robots.txt disallows your target? → Respect it unless you have written permission.
  3. Data behind login? → Do not scrape without explicit authorization.
  4. Contains PII? → GDPR/CCPA compliance required before collection.
  5. Copyrighted content? → Extract facts/data points only, never full content.
  6. Site explicitly prohibits scraping? → Request permission or find alternative source.

AI Crawler Considerations (2025+)

Many sites now specifically block AI-related crawlers:

# Common AI bot blocks in robots.txt
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Google-Extended
User-agent: CCBot
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: Bytespider
User-agent: PerplexityBot
Disallow: /

Rule: If collecting data for AI training, check for these specific blocks.
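
A minimal pre-flight sketch using Python's standard-library urllib.robotparser; the user agent, example URL, and the short AI-bot list are illustrative:

# Sketch: robots.txt pre-flight check — allowed paths, crawl-delay, AI-bot rules.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def robots_check(target_url: str, user_agent: str = "MyScraper/1.0") -> dict:
    root = "{0.scheme}://{0.netloc}".format(urlparse(target_url))
    rp = RobotFileParser(urljoin(root, "/robots.txt"))
    rp.read()  # fetch and parse robots.txt
    return {
        "allowed": rp.can_fetch(user_agent, target_url),
        "crawl_delay": rp.crawl_delay(user_agent),  # None if not specified
        "ai_bots_blocked": [
            bot for bot in ("GPTBot", "CCBot", "ClaudeBot", "PerplexityBot")
            if not rp.can_fetch(bot, target_url)
        ],
    }

# Usage: robots_check("https://example.com/products")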


Phase 2: Architecture Decision

Tool Selection Matrix

| Tool/Approach | Best For | Speed | JS Support | Complexity | Cost |
|---|---|---|---|---|---|
| HTTP client (requests/axios) | Static HTML, APIs | ⚡⚡⚡ | No | Low | Free |
| Beautiful Soup / Cheerio | Static HTML parsing | ⚡⚡⚡ | No | Low | Free |
| Scrapy | Large-scale structured crawling | ⚡⚡⚡ | Plugin | Medium | Free |
| Playwright / Puppeteer | JS-rendered, SPAs, interactions | ⚡ | Yes | Medium | Free |
| Selenium | Legacy, browser automation | ⚡ | Yes | High | Free |
| Crawlee | Hybrid (HTTP + browser fallback) | ⚡⚡ | Yes | Medium | Free |
| Firecrawl / ScrapingBee | Managed, anti-bot bypass | ⚡⚡ | Yes | Low | Paid |
| Bright Data / Oxylabs | Enterprise, proxy + browser | ⚡⚡ | Yes | Low | Paid |

Decision Tree

Is the content in the initial HTML source?
├── YES → Is the site structure consistent?
│   ├── YES → Static scraper (requests + BeautifulSoup/Cheerio)
│   └── NO → Scrapy with custom parsers
└── NO → Does the page require user interaction?
    ├── YES → Playwright/Puppeteer with interaction scripts
    └── NO → Playwright in non-interactive mode
        └── At scale (>10K pages)? → Crawlee (hybrid mode)
            └── Heavy anti-bot? → Managed service (Firecrawl/ScrapingBee)

Architecture Brief YAML

scraping_project:
  name: ""
  objective: ""  # What data, why, how often
  
  targets:
    - domain: ""
      pages_estimated: 0
      rendering: "static" | "javascript" | "spa"
      anti_bot: "none" | "basic" | "cloudflare" | "advanced"
      rate_limit: ""  # requests per second safe limit
      
  tool_selected: ""
  justification: ""
  
  data_schema:
    fields: []
    output_format: ""  # json | csv | database
    
  schedule:
    frequency: ""  # once | hourly | daily | weekly
    preferred_time: ""  # off-peak for target timezone
    
  infrastructure:
    proxy_needed: false
    proxy_type: ""  # residential | datacenter | mobile
    storage: ""
    monitoring: ""

Phase 3: Request Engineering

HTTP Request Best Practices

# Python example — production request pattern
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry strategy
retry = Retry(
    total=3,
    backoff_factor=1,      # 1s, 2s, 4s
    status_forcelist=[429, 500, 502, 503, 504],
    respect_retry_after_header=True
)
session.mount("https://", HTTPAdapter(max_retries=retry))

# Realistic headers
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Cache-Control": "no-cache",
})

Header Rotation Strategy

Rotate these to avoid fingerprinting:

| Header | Rotation Pool Size | Notes |
|---|---|---|
| User-Agent | 20-50 real browser UAs | Match OS distribution |
| Accept-Language | 5-10 locale combos | Match proxy geo |
| Sec-Ch-Ua | Match User-Agent | Chrome/Edge/Brave |
| Referer | Vary per request | Previous page or search engine |
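
A minimal rotation sketch: rotate whole header sets so User-Agent, locale, and client-hint values stay mutually consistent (the pool entries are illustrative, not verified browser fingerprints):

# Sketch: rotate complete header sets, not individual headers, to keep them consistent.
import random

HEADER_POOL = [
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-Ch-Ua": '"Chromium";v="122", "Google Chrome";v="122", "Not(A:Brand";v="24"',
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
        "Accept-Language": "en-GB,en;q=0.8",
        "Sec-Ch-Ua": '"Chromium";v="121", "Google Chrome";v="121", "Not A(Brand";v="99"',
    },
]

def pick_headers(referer: str = None) -> dict:
    headers = dict(random.choice(HEADER_POOL))  # copy so callers can mutate safely
    if referer:
        headers["Referer"] = referer            # previous page or search engine
    return headers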

Rate Limiting Rules

| Site Type | Safe Delay | Aggressive (risky) |
|---|---|---|
| Small business site | 5-10 seconds | 2-3 seconds |
| Medium site | 2-5 seconds | 1-2 seconds |
| Large platform (Amazon, etc.) | 3-5 seconds | 1 second |
| API endpoint | Per API docs | Never exceed |
| robots.txt crawl-delay | Respect exactly | Never below |

Rules:

  1. Always respect Crawl-delay in robots.txt
  2. Add random jitter (±30%) to avoid pattern detection
  3. Slow down during business hours for smaller sites
  4. Respect Retry-After headers — they mean it
  5. Watch for 429s — back off exponentially (2x each time)
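
A small sketch of rules 2 and 5, assuming a base delay chosen from the table above:

# Sketch: polite delay with ±30% jitter, plus exponential backoff on 429s.
import random
import time

def polite_sleep(base_delay: float):
    jitter = random.uniform(-0.3, 0.3)            # ±30% jitter breaks fixed patterns
    time.sleep(base_delay * (1 + jitter))

def backoff_sleep(base_delay: float, attempt: int, retry_after: str = None):
    if retry_after and retry_after.isdigit():
        time.sleep(int(retry_after))              # the server told us how long to wait
    else:
        time.sleep(base_delay * (2 ** attempt))   # double the wait per consecutive 429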

Phase 4: Parsing & Extraction

CSS Selector Strategy (Priority Order)

  1. Data attributes: `[data-product-id]`, `[data-price]` (most stable)
  2. Semantic IDs: `#product-title`, `#price` (stable but can change)
  3. ARIA attributes: `[aria-label="Price"]` (accessibility, fairly stable)
  4. Semantic HTML: `article`, `main`, `nav` (structural, stable)
  5. Class names: `.product-card` (can change with redesigns)
  6. XPath position: `//div[3]/span[2]` (FRAGILE — last resort)

Extraction Patterns

Structured data first — Check before writing CSS selectors:

# 1. Check JSON-LD (best source — structured, clean)
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for script in soup.find_all('script', type='application/ld+json'):
    data = json.loads(script.string)
    # Often contains: Product, Article, Organization, etc.

# 2. Check Open Graph meta tags
og_title = soup.find('meta', property='og:title')
og_price = soup.find('meta', property='product:price:amount')

# 3. Check microdata
items = soup.find_all(itemtype=True)

# 4. Fall back to CSS selectors only if above are empty

Table extraction pattern:

import pandas as pd

# Quick table extraction
tables = pd.read_html(html)  # Returns list of DataFrames

# For complex tables with merged cells
def extract_table(soup, selector):
    table = soup.select_one(selector)
    headers = [th.get_text(strip=True) for th in table.select('thead th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        rows.append(dict(zip(headers, cells)))
    return rows

Pagination handling:

# Pattern 1: Next button
from urllib.parse import urljoin

while True:
    # ... scrape current page from `soup` ...
    next_link = soup.select_one('a.next-page, [rel="next"], .pagination .next a')
    if not next_link or not next_link.get('href'):
        break
    url = urljoin(base_url, next_link['href'])
    soup = BeautifulSoup(session.get(url).text, 'html.parser')  # load next page
    
# Pattern 2: API pagination (infinite scroll sites)
page = 1
while True:
    resp = session.get(f"{api_url}?page={page}&limit=50")
    data = resp.json()
    if not data.get('results'):
        break
    # ... process results ...
    page += 1

# Pattern 3: Cursor-based
cursor = None
while True:
    params = {"limit": 50}
    if cursor:
        params["cursor"] = cursor
    resp = session.get(api_url, params=params)
    data = resp.json()
    # ... process ...
    cursor = data.get('next_cursor')
    if not cursor:
        break

JavaScript-Rendered Content

# Playwright pattern for JS-rendered pages
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 ...",
    )
    page = context.new_page()
    
    # Block unnecessary resources (speed + stealth)
    page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}", 
               lambda route: route.abort())
    
    page.goto(url, wait_until="networkidle")
    
    # Wait for specific content (better than arbitrary sleep)
    page.wait_for_selector('[data-product-id]', timeout=10000)
    
    # Extract after JS rendering
    content = page.content()
    # ... parse with BeautifulSoup/Cheerio ...
    
    browser.close()

Phase 5: Anti-Detection & Stealth

Detection Signals (What Sites Check)

| Signal | Detection Method | Mitigation |
|---|---|---|
| IP reputation | IP blacklists, datacenter ranges | Residential proxies |
| Request rate | Requests/min from same IP | Rate limiting + jitter |
| TLS fingerprint | JA3/JA4 hash matching | Use real browser or curl-impersonate |
| Browser fingerprint | Canvas, WebGL, fonts | Playwright with stealth plugin |
| JavaScript challenges | Cloudflare Turnstile, hCaptcha | Managed browser services |
| Cookie/session behavior | Missing cookies, no history | Full session management |
| Navigation pattern | Direct URL hits, no referrer | Simulate natural browsing |
| Mouse/keyboard events | No interaction telemetry | Event simulation (Playwright) |
| Header consistency | Mismatched headers vs UA | Header sets that match |

Proxy Strategy

proxy_strategy:
  # Tier 1: Free/Datacenter (for non-protected sites)
  basic:
    type: "datacenter"
    cost: "$1-5/GB"
    success_rate: "60-80%"
    use_for: "APIs, small sites, no anti-bot"
    
  # Tier 2: Residential (for most protected sites)
  standard:
    type: "residential"
    cost: "$5-15/GB"
    success_rate: "90-95%"
    use_for: "Cloudflare, major platforms"
    rotation: "per-request or sticky 10min"
    
  # Tier 3: Mobile/ISP (for maximum stealth)
  premium:
    type: "mobile"
    cost: "$15-30/GB"
    success_rate: "95-99%"
    use_for: "Aggressive anti-bot, social media"
    
  rules:
    - Start with cheapest tier, escalate only on blocks
    - Match proxy geo to target audience geo
    - Rotate on 403/429, not every request
    - Use sticky sessions for multi-page scrapes
    - Monitor proxy health — remove slow/blocked IPs
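
A minimal sketch of the rotation rules above, assuming a placeholder list of proxy URLs from your provider:

# Sketch: keep one sticky proxy until it gets blocked (403/429), then rotate.
import itertools
import requests

PROXIES = ["http://user:pass@proxy-1:8000", "http://user:pass@proxy-2:8000"]  # placeholders
_proxy_cycle = itertools.cycle(PROXIES)
_current = next(_proxy_cycle)

def fetch(session: requests.Session, url: str) -> requests.Response:
    global _current
    resp = session.get(url, proxies={"http": _current, "https": _current}, timeout=30)
    if resp.status_code in (403, 429):
        _current = next(_proxy_cycle)   # rotate only on a block signal, not every request
    return resp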

Playwright Stealth Configuration

# Essential stealth for Playwright
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-features=IsolateOrigins,site-per-process',
        ]
    )
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
        timezone_id="America/New_York",
        geolocation={"latitude": 40.7128, "longitude": -74.0060},
        permissions=["geolocation"],
    )
    
    # Remove automation indicators
    page = context.new_page()
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
    """)

Cloudflare Bypass Decision

Cloudflare detected?
├── JS Challenge only → Playwright with stealth + residential proxy
├── Turnstile CAPTCHA → Managed service (ScrapingBee/Bright Data)
├── Under Attack Mode → Wait, try later, or managed service
└── WAF blocking → Different approach needed
    ├── Check for API endpoints (network tab)
    ├── Check for mobile app API
    └── Consider if data is available elsewhere

Phase 6: Data Pipeline & Quality

Data Validation Rules

# Validation pattern — validate BEFORE storing
from dataclasses import dataclass, field
from typing import Optional
import re
from datetime import datetime

@dataclass
class ScrapedProduct:
    url: str
    title: str
    price: Optional[float]
    currency: str = "USD"
    scraped_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    
    def validate(self) -> list[str]:
        errors = []
        if not self.url.startswith('http'):
            errors.append("Invalid URL")
        if not self.title or len(self.title) < 3:
            errors.append("Title too short or missing")
        if self.price is not None and self.price < 0:
            errors.append("Negative price")
        if self.price is not None and self.price > 1_000_000:
            errors.append("Price suspiciously high — verify")
        if self.currency not in ("USD", "EUR", "GBP", "BTC"):
            errors.append(f"Unknown currency: {self.currency}")
        return errors

Deduplication Strategy

| Method | When to Use | Implementation |
|---|---|---|
| URL-based | Pages with unique URLs | Hash the canonical URL |
| Content hash | Same URL, changing content | MD5/SHA256 of key fields |
| Fuzzy matching | Near-duplicate detection | Jaccard similarity > 0.85 |
| Composite key | Multi-field uniqueness | Hash(domain + product_id + variant) |

import hashlib

def dedup_key(item: dict, fields: list[str]) -> str:
    """Generate dedup key from selected fields."""
    values = "|".join(str(item.get(f, "")) for f in fields)
    return hashlib.sha256(values.encode()).hexdigest()

# Usage
seen = set()
clean_items = []
for item in scraped_items:
    key = dedup_key(item, ["url", "product_id"])
    if key not in seen:
        seen.add(key)
        clean_items.append(item)
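
For the fuzzy-matching row in the table above, a token-level Jaccard sketch (the 0.85 threshold is a starting point, not a universal constant):

# Sketch: token-set Jaccard similarity for near-duplicate detection.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def is_near_duplicate(title_a: str, title_b: str, threshold: float = 0.85) -> bool:
    return jaccard(title_a, title_b) >= threshold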

Data Cleaning Pipeline

Raw HTML → Parse → Extract → Validate → Clean → Deduplicate → Store
                                ↓
                          Quarantine (failed validation)

Common cleaning operations:

| Problem | Solution |
|---|---|
| HTML entities (`&amp;`) | `html.unescape()` |
| Extra whitespace | `" ".join(text.split())` |
| Unicode issues | `unicodedata.normalize('NFKD', text)` |
| Price in text ("$49.99") | Regex: `r'[\$£€]?([\d,]+\.?\d*)'` |
| Date formats vary | `dateutil.parser.parse()` with dayfirst flag |
| Relative URLs | `urllib.parse.urljoin(base, relative)` |
| Encoding issues | `chardet.detect()` then decode |
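
A small sketch combining several of the operations above into reusable helpers (function names are illustrative):

# Sketch: apply the table's text and price cleaning operations in one pass.
import html
import re
import unicodedata

PRICE_RE = re.compile(r'[\$£€]?([\d,]+\.?\d*)')

def clean_text(raw: str) -> str:
    text = html.unescape(raw)                    # &amp; -> &
    text = unicodedata.normalize('NFKD', text)   # normalize unicode forms
    return " ".join(text.split())                # collapse extra whitespace

def parse_price(raw: str):
    match = PRICE_RE.search(raw)
    if not match:
        return None
    return float(match.group(1).replace(",", ""))

# parse_price("Now only $1,299.99!") -> 1299.99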

Phase 7: Storage & Export

Storage Decision Guide

| Volume | Frequency | Query Needs | Recommendation |
|---|---|---|---|
| <10K records | One-time | None | JSON/CSV files |
| <10K records | Recurring | Simple lookups | SQLite |
| 10K-1M records | Recurring | Complex queries | PostgreSQL |
| 1M+ records | Continuous | Analytics | PostgreSQL + partitioning |
| Append-only logs | Continuous | Time-series | ClickHouse / TimescaleDB |

SQLite Pattern (Most Common)

import sqlite3
import json
from datetime import datetime

def init_db(path="scraper_data.db"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS items (
            id INTEGER PRIMARY KEY,
            url TEXT UNIQUE,
            data JSON NOT NULL,
            scraped_at TEXT DEFAULT (datetime('now')),
            updated_at TEXT,
            checksum TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_url ON items(url)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_scraped ON items(scraped_at)")
    return conn

def upsert(conn, url, data, checksum):
    conn.execute("""
        INSERT INTO items (url, data, checksum) VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            data = excluded.data,
            updated_at = datetime('now'),
            checksum = excluded.checksum
        WHERE items.checksum != excluded.checksum
    """, (url, json.dumps(data), checksum))
    conn.commit()

Export Formats

# CSV export
import csv
import json
def to_csv(items, path, fields):
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(items)

# JSON Lines (best for large datasets — streaming)
def to_jsonl(items, path):
    with open(path, 'w') as f:
        for item in items:
            f.write(json.dumps(item) + '\n')

# Incremental export (only new/changed since last export)
def export_since(conn, last_export_time):
    cursor = conn.execute(
        "SELECT data FROM items WHERE scraped_at > ? OR updated_at > ?",
        (last_export_time, last_export_time)
    )
    return [json.loads(row[0]) for row in cursor]

Phase 8: Error Handling & Resilience

Error Classification

| HTTP Code | Meaning | Action |
|---|---|---|
| 200 | Success | Process normally |
| 301/302 | Redirect | Follow (max 5 hops) |
| 403 | Forbidden/blocked | Rotate proxy, slow down |
| 404 | Not found | Log, skip, mark URL dead |
| 429 | Rate limited | Respect Retry-After, back off 2x |
| 500-504 | Server error | Retry 3x with backoff |
| Connection timeout | Network issue | Retry with different proxy |
| SSL error | Certificate issue | Log, investigate, skip |
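
A minimal dispatcher sketch for the table above; the action names are illustrative labels to wire into your own retry loop:

# Sketch: map a status code to one of the table's actions.
def classify_response(status_code: int) -> str:
    if status_code == 200:
        return "process"
    if status_code in (301, 302):
        return "follow_redirect"   # most HTTP clients follow automatically
    if status_code == 403:
        return "rotate_proxy"
    if status_code == 404:
        return "mark_dead"
    if status_code == 429:
        return "backoff"
    if 500 <= status_code <= 504:
        return "retry"
    return "log_and_skip"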

Circuit Breaker Pattern

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=300):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure = 0
        self.state = "closed"  # closed | open | half-open
    
    def record_failure(self):
        self.failures += 1
        self.last_failure = time.time()
        if self.failures >= self.threshold:
            self.state = "open"
            # Alert: "Circuit open — too many failures"
    
    def record_success(self):
        self.failures = 0
        self.state = "closed"
    
    def can_proceed(self):
        if self.state == "closed":
            return True
        if self.state == "open":
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = "half-open"
                return True  # Try one request
            return False
        return True  # half-open: allow attempt

Checkpoint & Resume

import json
from pathlib import Path

class Checkpointer:
    def __init__(self, path="checkpoint.json"):
        self.path = Path(path)
        self.state = self._load()
    
    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"completed_urls": [], "last_page": 0, "cursor": None}
    
    def save(self):
        self.path.write_text(json.dumps(self.state))
    
    def is_done(self, url):
        return url in self.state["completed_urls"]
    
    def mark_done(self, url):
        self.state["completed_urls"].append(url)
        if len(self.state["completed_urls"]) % 50 == 0:
            self.save()  # Periodic save

Phase 9: Monitoring & Operations

Scraper Health Dashboard

dashboard:
  real_time:
    - metric: "requests_per_minute"
      alert_if: "> 60 for small sites"
    - metric: "success_rate"
      alert_if: "< 90%"
    - metric: "avg_response_time_ms"
      alert_if: "> 5000"
    - metric: "blocked_rate"
      alert_if: "> 10%"
      
  per_run:
    - metric: "pages_scraped"
    - metric: "items_extracted"
    - metric: "items_validated"
    - metric: "items_deduplicated"
    - metric: "new_items"
    - metric: "updated_items"
    - metric: "errors_by_type"
    - metric: "run_duration"
    - metric: "proxy_cost"
    
  weekly:
    - metric: "data_freshness"
      description: "% of records updated in last 7 days"
    - metric: "site_structure_changes"
      description: "Selectors that stopped matching"
    - metric: "total_cost"
      description: "Proxy + compute + storage"

Breakage Detection

Sites redesign. Selectors break. Detect it early:

def health_check(results: list[dict], expected_fields: list[str]) -> dict:
    """Check if scraper is still extracting correctly."""
    total = len(results)
    if total == 0:
        return {"status": "CRITICAL", "message": "Zero results — likely broken"}
    
    field_coverage = {}
    for field in expected_fields:
        filled = sum(1 for r in results if r.get(field))
        coverage = filled / total
        field_coverage[field] = coverage
        
    issues = []
    for field, coverage in field_coverage.items():
        if coverage < 0.5:
            issues.append(f"{field}: {coverage:.0%} fill rate (expected >50%)")
    
    if issues:
        return {"status": "WARNING", "issues": issues}
    return {"status": "OK", "field_coverage": field_coverage}

Operational Runbook

Daily:

  • Check success rate per target domain
  • Review error logs for new patterns
  • Verify data freshness

Weekly:

  • Compare extraction counts vs baseline (>20% drop = investigate)
  • Review proxy spend
  • Spot-check 10 random records for accuracy

Monthly:

  • Full selector validation against live pages
  • Review legal compliance (robots.txt changes, ToS updates)
  • Cost optimization review
  • Prune dead URLs from queue

Phase 10: Common Scraping Patterns

Pattern 1: E-commerce Price Monitor

use_case: "Track competitor prices daily"
tool: "requests + BeautifulSoup"
schedule: "Daily at 03:00 UTC (off-peak)"
targets: ["competitor-a.com/products", "competitor-b.com/api"]
data:
  - product_id
  - product_name
  - price
  - currency
  - in_stock
  - scraped_at
storage: "SQLite with price history"
alerts: "Price change > 10% → notify"

Pattern 2: Job Board Aggregator

use_case: "Aggregate job listings from multiple boards"
tool: "Scrapy with per-site spiders"
schedule: "Every 6 hours"
targets: ["board-a.com", "board-b.com", "board-c.com"]
data:
  - title
  - company
  - location
  - salary_range
  - posted_date
  - url
  - source
dedup: "Hash(title + company + location)"
storage: "PostgreSQL"

Pattern 3: News & Content Monitor

use_case: "Monitor industry news mentions"
tool: "requests + RSS feeds (preferred) + web fallback"
schedule: "Every 30 minutes"
approach:
  1: "RSS/Atom feeds (fastest, cleanest)"
  2: "Google News RSS for topic"
  3: "Direct scraping if no feed"
data:
  - headline
  - source
  - url
  - published_at
  - snippet
  - sentiment
alerts: "Keyword match → immediate notification"

Pattern 4: Social Media Intelligence

use_case: "Track brand mentions and sentiment"
tool: "Official APIs (always) + web search fallback"
rules:
  - NEVER scrape social platforms directly — use APIs
  - Twitter/X: Official API ($100/mo basic)
  - Reddit: Official API (free tier available)
  - LinkedIn: No scraping (aggressive legal action)
  - Instagram: Official API only (Meta Business)
fallback: "Brave/Google search for public mentions"

Pattern 5: Real Estate Listings

use_case: "Track property listings and prices"
tool: "Playwright (most listing sites are JS-heavy)"
schedule: "Daily"
challenges:
  - Heavy JavaScript rendering
  - Anti-bot measures (Cloudflare common)
  - Frequent layout changes
  - Map-based results
approach: "API endpoint discovery via network tab first"

Phase 11: Scaling Strategies

Concurrency Architecture

Single machine (small scale):
├── asyncio + aiohttp (Python) → 50-200 concurrent requests
├── Worker pool (ThreadPoolExecutor) → 10-50 threads
└── Scrapy reactor → Built-in concurrency

Multi-machine (large scale):
├── URL queue: Redis / RabbitMQ / SQS
├── Workers: Multiple Scrapy/custom workers
├── Results: Shared PostgreSQL / S3
└── Coordinator: Celery / custom scheduler
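
A single-machine sketch of the asyncio + aiohttp branch, assuming aiohttp is installed and a modest concurrency cap tuned to the rate-limit rules:

# Sketch: bounded-concurrency fetching with asyncio + aiohttp.
import asyncio
import aiohttp

MAX_CONCURRENCY = 50  # placeholder; tune per target and rate-limit rules

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> str:
    async with sem:  # cap in-flight requests
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            resp.raise_for_status()
            return await resp.text()

async def crawl(urls: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

# asyncio.run(crawl(["https://example.com/page1", "https://example.com/page2"]))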

Cost Optimization

| Lever | Impact | How |
|---|---|---|
| Static > Browser | 10-50x cheaper | Always try HTTP first |
| Block images/CSS/fonts | 60-80% bandwidth saved | Route filtering |
| Cache DNS | Minor but cumulative | Local DNS cache |
| Compress responses | 50-70% bandwidth | Accept-Encoding: gzip, br |
| Smart scheduling | Avoid redundant scrapes | Change detection before full re-scrape |
| Proxy tier matching | 3-10x cost difference | Don't use residential for easy sites |

Phase 12: Advanced Patterns

API Discovery (Network Tab Mining)

Before building a scraper, check if the site has hidden API endpoints:

  1. Open DevTools → Network tab
  2. Filter by XHR/Fetch
  3. Navigate the site, click load-more, filter/sort
  4. Look for JSON responses — these are your goldmine
  5. Most SPAs load data via REST/GraphQL APIs

Common hidden API patterns:

  • /api/v1/products?page=1&limit=20
  • /graphql with query parameters
  • /_next/data/... (Next.js data routes)
  • /wp-json/wp/v2/posts (WordPress)

Headless Browser Optimization

# Minimize browser resource usage
context = browser.new_context(
    viewport={"width": 1280, "height": 720},
    java_script_enabled=True,  # Only if needed
    has_touch=False,
    is_mobile=False,
)

# Block resource types you don't need
page.route("**/*", lambda route: (
    route.abort() if route.request.resource_type in 
    ["image", "stylesheet", "font", "media"] 
    else route.continue_()
))

Scraping Behind Authentication

# When authorized to scrape behind login
# ALWAYS use session-based auth, never store passwords in code

# Pattern: Login once, reuse session
import os
import requests

session = requests.Session()
login_resp = session.post("https://example.com/login", data={
    "username": os.environ["SCRAPE_USER"],
    "password": os.environ["SCRAPE_PASS"],
})
assert login_resp.ok, "Login failed"

# Session cookies are now stored — use for subsequent requests
data_resp = session.get("https://example.com/api/data")

Change Detection (Avoid Redundant Scrapes)

import hashlib

def has_changed(url, session, last_etag=None, last_modified=None):
    """Check if page changed without downloading full content."""
    headers = {}
    if last_etag:
        headers["If-None-Match"] = last_etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    
    resp = session.head(url, headers=headers)
    
    if resp.status_code == 304:
        return False, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    
    return True, resp.headers.get("ETag"), resp.headers.get("Last-Modified")

Quality Scoring Rubric (0-100)

| Dimension | Weight | What to Assess |
|---|---|---|
| Legal compliance | 20% | robots.txt, ToS, PII handling, audit trail |
| Data quality | 20% | Validation, accuracy, completeness, freshness |
| Resilience | 15% | Error handling, retries, circuit breakers, checkpointing |
| Anti-detection | 15% | Proxy rotation, fingerprint diversity, rate limiting |
| Architecture | 10% | Right tool selection, clean code, modularity |
| Monitoring | 10% | Success rates, breakage detection, alerting |
| Performance | 5% | Speed, cost efficiency, resource usage |
| Documentation | 5% | Runbook, schema docs, legal assessment |

Grading: 90+ Excellent | 75-89 Good | 60-74 Needs work | <60 Redesign


10 Common Mistakes

| # | Mistake | Fix |
|---|---|---|
| 1 | No robots.txt check | Always check first — it's your legal defense |
| 2 | Fixed delays (no jitter) | Add ±30% random jitter to all delays |
| 3 | No data validation | Validate every field before storing |
| 4 | Using browser for static HTML | HTTP client is 10-50x faster and cheaper |
| 5 | Single IP, no rotation | Proxy rotation for any serious scraping |
| 6 | No breakage detection | Monitor extraction counts and field fill rates |
| 7 | Storing raw HTML only | Extract + structure immediately |
| 8 | No checkpoint/resume | Long scrapes must be resumable |
| 9 | Ignoring structured data | JSON-LD/microdata is cleaner than CSS selectors |
| 10 | Scraping when API exists | Always check for API first |

5 Edge Cases

  1. Single-page apps (React/Vue/Angular): Must use browser rendering OR find the underlying API (network tab). Prefer API discovery — it's faster and more reliable.

  2. Infinite scroll: Intercept the XHR/fetch calls that load more content. Simulate scrolling only as last resort. The API endpoint usually accepts page or offset params.

  3. CAPTCHAs: If you're hitting CAPTCHAs, you're scraping too aggressively. Slow down first. If CAPTCHAs persist: managed services (2Captcha, Anti-Captcha) or rethink approach.

  4. Dynamic class names (CSS modules, Tailwind): Use data attributes, ARIA labels, or text content selectors instead. [data-testid="price"] survives redesigns. .sc-bdVTJa does not.

  5. Multi-language sites: Detect language via html[lang] attribute. Set Accept-Language header to get desired locale. Watch for different URL structures (/en/, /de/, subdomains).


Natural Language Commands

  1. "Check if I can scrape [URL]" → Run compliance checklist (robots.txt, ToS, data type)
  2. "What tool should I use for [site]?" → Analyze site rendering, anti-bot, recommend tool
  3. "Build a scraper for [description]" → Full architecture brief + code pattern
  4. "My scraper is getting blocked" → Anti-detection diagnostic + proxy/stealth recommendations
  5. "Extract [data] from [URL]" → Check structured data first, then CSS selectors
  6. "Monitor [site] for changes" → Change detection + scheduling + alerting setup
  7. "How do I handle pagination on [site]?" → Identify pagination type + code pattern
  8. "Scrape at scale ([N] pages)" → Concurrency architecture + cost estimate
  9. "Clean and store this scraped data" → Validation + dedup + storage recommendation
  10. "Is my scraper healthy?" → Run health check + breakage detection
  11. "Find the API behind [site]" → Network tab mining guide + common patterns
  12. "Set up price monitoring for [competitors]" → Full e-commerce monitor pattern
