name: Web Scraping & Data Extraction Engine
description: Complete web scraping methodology — legal compliance, architecture design, anti-detection, data pipelines, and production operations. Use when building scrapers, extracting web data, monitoring competitors, or automating data collection at scale.

Web Scraping & Data Extraction Engine

Quick Health Check (Run First)

Score your scraping operation (2 points each):

| Signal | Healthy | Unhealthy |
|---|---|---|
| Legal compliance | robots.txt checked, ToS reviewed | Scraping blindly |
| Architecture | Tool matches site complexity | Using Puppeteer for static HTML |
| Anti-detection | Rotation, delays, fingerprint diversity | Single IP, no delays |
| Data quality | Validation + dedup pipeline | Raw dumps, no cleaning |
| Error handling | Retry logic, circuit breakers | Crashes on first 403 |
| Monitoring | Success rates tracked, alerts set | No visibility |
| Storage | Structured, deduplicated, versioned | Flat files, duplicates |
| Scheduling | Appropriate frequency, off-peak | Hammering during business hours |

Score: /16 → 12+: Production-ready | 8-11: Needs work | <8: Stop and redesign


Phase 1: Legal & Ethical Foundation

Pre-Scrape Compliance Checklist

compliance_brief:
  target_domain: ""
  date_assessed: ""
  
  robots_txt:
    checked: false
    target_paths_allowed: false
    crawl_delay_specified: ""
    ai_bot_rules: ""  # Many sites now block AI crawlers specifically
    
  terms_of_service:
    reviewed: false
    scraping_mentioned: false
    scraping_prohibited: false
    api_available: false
    api_sufficient: false
    
  data_classification:
    type: ""  # public-factual | public-personal | behind-auth | copyrighted
    contains_pii: false
    pii_types: []  # name, email, phone, address, photo
    gdpr_applies: false  # EU residents' data
    ccpa_applies: false  # California residents' data
    
  legal_risk: ""  # low | medium | high | do-not-scrape
  decision: ""  # proceed | use-api | request-permission | abandon
  justification: ""

Legal Landscape Quick Reference

| Scenario | Risk Level | Key Case Law |
|---|---|---|
| Public data, no login, robots.txt allows | LOW | hiQ v. LinkedIn (2022) |
| Public data, robots.txt disallows | MEDIUM | Meta v. Bright Data (2024) |
| Behind authentication | HIGH | Van Buren v. US (2021), CFAA |
| Personal data without consent | HIGH | GDPR Art. 6, CCPA §1798.100 |
| Republishing copyrighted content | HIGH | Copyright Act §106 |
| Price/product comparison | LOW | eBay v. Bidder's Edge (fair use) |
| Academic/research use | LOW-MEDIUM | Varies by jurisdiction |
| Bypassing anti-bot measures | HIGH | CFAA "exceeds authorized access" |

Decision Rules

  1. API exists and covers your needs? → Use the API. Always.
  2. robots.txt disallows your target? → Respect it unless you have written permission.
  3. Data behind login? → Do not scrape without explicit authorization.
  4. Contains PII? → GDPR/CCPA compliance required before collection.
  5. Copyrighted content? → Extract facts/data points only, never full content.
  6. Site explicitly prohibits scraping? → Request permission or find alternative source.

AI Crawler Considerations (2025+)

Many sites now specifically block AI-related crawlers:

# Common AI bot blocks in robots.txt
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Google-Extended
User-agent: CCBot
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: Bytespider
User-agent: PerplexityBot
Disallow: /

Rule: If collecting data for AI training, check for these specific blocks.
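
A minimal pre-flight sketch using Python's standard-library urllib.robotparser; the user agent, example URL, and the short AI-bot list are illustrative:

# Sketch: robots.txt pre-flight check — allowed paths, crawl-delay, AI-bot rules.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def robots_check(target_url: str, user_agent: str = "MyScraper/1.0") -> dict:
    root = "{0.scheme}://{0.netloc}".format(urlparse(target_url))
    rp = RobotFileParser(urljoin(root, "/robots.txt"))
    rp.read()  # fetch and parse robots.txt
    return {
        "allowed": rp.can_fetch(user_agent, target_url),
        "crawl_delay": rp.crawl_delay(user_agent),  # None if not specified
        "ai_bots_blocked": [
            bot for bot in ("GPTBot", "CCBot", "ClaudeBot", "PerplexityBot")
            if not rp.can_fetch(bot, target_url)
        ],
    }

# Usage: robots_check("https://example.com/products")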


Phase 2: Architecture Decision

Tool Selection Matrix

| Tool/Approach | Best For | Speed | JS Support | Complexity | Cost |
|---|---|---|---|---|---|
| HTTP client (requests/axios) | Static HTML, APIs | ⚡⚡⚡ | No | Low | Free |
| Beautiful Soup / Cheerio | Static HTML parsing | ⚡⚡⚡ | No | Low | Free |
| Scrapy | Large-scale structured crawling | ⚡⚡⚡ | Plugin | Medium | Free |
| Playwright / Puppeteer | JS-rendered, SPAs, interactions | ⚡ | Yes | Medium | Free |
| Selenium | Legacy, browser automation | ⚡ | Yes | High | Free |
| Crawlee | Hybrid (HTTP + browser fallback) | ⚡⚡ | Yes | Medium | Free |
| Firecrawl / ScrapingBee | Managed, anti-bot bypass | ⚡⚡ | Yes | Low | Paid |
| Bright Data / Oxylabs | Enterprise, proxy + browser | ⚡⚡ | Yes | Low | Paid |

Decision Tree

Is the content in the initial HTML source?
├── YES → Is the site structure consistent?
│   ├── YES → Static scraper (requests + BeautifulSoup/Cheerio)
│   └── NO → Scrapy with custom parsers
└── NO → Does the page require user interaction?
    ├── YES → Playwright/Puppeteer with interaction scripts
    └── NO → Playwright in non-interactive mode
        └── At scale (>10K pages)? → Crawlee (hybrid mode)
            └── Heavy anti-bot? → Managed service (Firecrawl/ScrapingBee)

Architecture Brief YAML

scraping_project:
  name: ""
  objective: ""  # What data, why, how often
  
  targets:
    - domain: ""
      pages_estimated: 0
      rendering: "static" | "javascript" | "spa"
      anti_bot: "none" | "basic" | "cloudflare" | "advanced"
      rate_limit: ""  # requests per second safe limit
      
  tool_selected: ""
  justification: ""
  
  data_schema:
    fields: []
    output_format: ""  # json | csv | database
    
  schedule:
    frequency: ""  # once | hourly | daily | weekly
    preferred_time: ""  # off-peak for target timezone
    
  infrastructure:
    proxy_needed: false
    proxy_type: ""  # residential | datacenter | mobile
    storage: ""
    monitoring: ""

Phase 3: Request Engineering

HTTP Request Best Practices

# Python example — production request pattern
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry strategy
retry = Retry(
    total=3,
    backoff_factor=1,      # 1s, 2s, 4s
    status_forcelist=[429, 500, 502, 503, 504],
    respect_retry_after_header=True
)
session.mount("https://", HTTPAdapter(max_retries=retry))

# Realistic headers
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Cache-Control": "no-cache",
})

Header Rotation Strategy

Rotate these to avoid fingerprinting:

| Header | Rotation Pool Size | Notes |
|---|---|---|
| User-Agent | 20-50 real browser UAs | Match OS distribution |
| Accept-Language | 5-10 locale combos | Match proxy geo |
| Sec-Ch-Ua | Match User-Agent | Chrome/Edge/Brave |
| Referer | Vary per request | Previous page or search engine |
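
A minimal rotation sketch: rotate whole header sets so User-Agent, locale, and client-hint values stay mutually consistent (the pool entries are illustrative, not verified browser fingerprints):

# Sketch: rotate complete header sets, not individual headers, to keep them consistent.
import random

HEADER_POOL = [
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-Ch-Ua": '"Chromium";v="122", "Google Chrome";v="122", "Not(A:Brand";v="24"',
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
        "Accept-Language": "en-GB,en;q=0.8",
        "Sec-Ch-Ua": '"Chromium";v="121", "Google Chrome";v="121", "Not A(Brand";v="99"',
    },
]

def pick_headers(referer: str = None) -> dict:
    headers = dict(random.choice(HEADER_POOL))  # copy so callers can mutate safely
    if referer:
        headers["Referer"] = referer            # previous page or search engine
    return headers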

Rate Limiting Rules

| Site Type | Safe Delay | Aggressive (risky) |
|---|---|---|
| Small business site | 5-10 seconds | 2-3 seconds |
| Medium site | 2-5 seconds | 1-2 seconds |
| Large platform (Amazon, etc.) | 3-5 seconds | 1 second |
| API endpoint | Per API docs | Never exceed |
| robots.txt crawl-delay | Respect exactly | Never below |

Rules:

  1. Always respect Crawl-delay in robots.txt
  2. Add random jitter (±30%) to avoid pattern detection
  3. Slow down during business hours for smaller sites
  4. Respect Retry-After headers — they mean it
  5. Watch for 429s — back off exponentially (2x each time)
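
A small sketch of rules 2 and 5, assuming a base delay chosen from the table above:

# Sketch: polite delay with ±30% jitter, plus exponential backoff on 429s.
import random
import time

def polite_sleep(base_delay: float):
    jitter = random.uniform(-0.3, 0.3)            # ±30% jitter breaks fixed patterns
    time.sleep(base_delay * (1 + jitter))

def backoff_sleep(base_delay: float, attempt: int, retry_after: str = None):
    if retry_after and retry_after.isdigit():
        time.sleep(int(retry_after))              # the server told us how long to wait
    else:
        time.sleep(base_delay * (2 ** attempt))   # double the wait per consecutive 429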

Phase 4: Parsing & Extraction

CSS Selector Strategy (Priority Order)

  1. Data attributes: `[data-product-id]`, `[data-price]` (most stable)
  2. Semantic IDs: `#product-title`, `#price` (stable but can change)
  3. ARIA attributes: `[aria-label="Price"]` (accessibility, fairly stable)
  4. Semantic HTML: `article`, `main`, `nav` (structural, stable)
  5. Class names: `.product-card` (can change with redesigns)
  6. XPath position: `//div[3]/span[2]` (FRAGILE — last resort)

Extraction Patterns

Structured data first — Check before writing CSS selectors:

# 1. Check JSON-LD (best source — structured, clean)
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for script in soup.find_all('script', type='application/ld+json'):
    data = json.loads(script.string)
    # Often contains: Product, Article, Organization, etc.

# 2. Check Open Graph meta tags
og_title = soup.find('meta', property='og:title')
og_price = soup.find('meta', property='product:price:amount')

# 3. Check microdata
items = soup.find_all(itemtype=True)

# 4. Fall back to CSS selectors only if above are empty

Table extraction pattern:

import pandas as pd

# Quick table extraction
tables = pd.read_html(html)  # Returns list of DataFrames

# For complex tables with merged cells
def extract_table(soup, selector):
    table = soup.select_one(selector)
    headers = [th.get_text(strip=True) for th in table.select('thead th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        rows.append(dict(zip(headers, cells)))
    return rows

Pagination handling:

# Pattern 1: Next button
from urllib.parse import urljoin

while True:
    # ... scrape current page from `soup` ...
    next_link = soup.select_one('a.next-page, [rel="next"], .pagination .next a')
    if not next_link or not next_link.get('href'):
        break
    url = urljoin(base_url, next_link['href'])
    soup = BeautifulSoup(session.get(url).text, 'html.parser')  # load next page
    
# Pattern 2: API pagination (infinite scroll sites)
page = 1
while True:
    resp = session.get(f"{api_url}?page={page}&limit=50")
    data = resp.json()
    if not data.get('results'):
        break
    # ... process results ...
    page += 1

# Pattern 3: Cursor-based
cursor = None
while True:
    params = {"limit": 50}
    if cursor:
        params["cursor"] = cursor
    resp = session.get(api_url, params=params)
    data = resp.json()
    # ... process ...
    cursor = data.get('next_cursor')
    if not cursor:
        break

JavaScript-Rendered Content

# Playwright pattern for JS-rendered pages
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 ...",
    )
    page = context.new_page()
    
    # Block unnecessary resources (speed + stealth)
    page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}", 
               lambda route: route.abort())
    
    page.goto(url, wait_until="networkidle")
    
    # Wait for specific content (better than arbitrary sleep)
    page.wait_for_selector('[data-product-id]', timeout=10000)
    
    # Extract after JS rendering
    content = page.content()
    # ... parse with BeautifulSoup/Cheerio ...
    
    browser.close()

Phase 5: Anti-Detection & Stealth

Detection Signals (What Sites Check)

| Signal | Detection Method | Mitigation |
|---|---|---|
| IP reputation | IP blacklists, datacenter ranges | Residential proxies |
| Request rate | Requests/min from same IP | Rate limiting + jitter |
| TLS fingerprint | JA3/JA4 hash matching | Use real browser or curl-impersonate |
| Browser fingerprint | Canvas, WebGL, fonts | Playwright with stealth plugin |
| JavaScript challenges | Cloudflare Turnstile, hCaptcha | Managed browser services |
| Cookie/session behavior | Missing cookies, no history | Full session management |
| Navigation pattern | Direct URL hits, no referrer | Simulate natural browsing |
| Mouse/keyboard events | No interaction telemetry | Event simulation (Playwright) |
| Header consistency | Mismatched headers vs UA | Header sets that match |

Proxy Strategy

proxy_strategy:
  # Tier 1: Free/Datacenter (for non-protected sites)
  basic:
    type: "datacenter"
    cost: "$1-5/GB"
    success_rate: "60-80%"
    use_for: "APIs, small sites, no anti-bot"
    
  # Tier 2: Residential (for most protected sites)
  standard:
    type: "residential"
    cost: "$5-15/GB"
    success_rate: "90-95%"
    use_for: "Cloudflare, major platforms"
    rotation: "per-request or sticky 10min"
    
  # Tier 3: Mobile/ISP (for maximum stealth)
  premium:
    type: "mobile"
    cost: "$15-30/GB"
    success_rate: "95-99%"
    use_for: "Aggressive anti-bot, social media"
    
  rules:
    - Start with cheapest tier, escalate only on blocks
    - Match proxy geo to target audience geo
    - Rotate on 403/429, not every request
    - Use sticky sessions for multi-page scrapes
    - Monitor proxy health — remove slow/blocked IPs
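
A minimal sketch of the rotation rules above, assuming a placeholder list of proxy URLs from your provider:

# Sketch: keep one sticky proxy until it gets blocked (403/429), then rotate.
import itertools
import requests

PROXIES = ["http://user:pass@proxy-1:8000", "http://user:pass@proxy-2:8000"]  # placeholders
_proxy_cycle = itertools.cycle(PROXIES)
_current = next(_proxy_cycle)

def fetch(session: requests.Session, url: str) -> requests.Response:
    global _current
    resp = session.get(url, proxies={"http": _current, "https": _current}, timeout=30)
    if resp.status_code in (403, 429):
        _current = next(_proxy_cycle)   # rotate only on a block signal, not every request
    return resp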

Playwright Stealth Configuration

# Essential stealth for Playwright
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-features=IsolateOrigins,site-per-process',
        ]
    )
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
        timezone_id="America/New_York",
        geolocation={"latitude": 40.7128, "longitude": -74.0060},
        permissions=["geolocation"],
    )
    
    # Remove automation indicators
    page = context.new_page()
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
    """)

Cloudflare Bypass Decision

Cloudflare detected?
├── JS Challenge only → Playwright with stealth + residential proxy
├── Turnstile CAPTCHA → Managed service (ScrapingBee/Bright Data)
├── Under Attack Mode → Wait, try later, or managed service
└── WAF blocking → Different approach needed
    ├── Check for API endpoints (network tab)
    ├── Check for mobile app API
    └── Consider if data is available elsewhere

Phase 6: Data Pipeline & Quality

Data Validation Rules

# Validation pattern — validate BEFORE storing
from dataclasses import dataclass, field
from typing import Optional
import re
from datetime import datetime

@dataclass
class ScrapedProduct:
    url: str
    title: str
    price: Optional[float]
    currency: str = "USD"
    scraped_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    
    def validate(self) -> list[str]:
        errors = []
        if not self.url.startswith('http'):
            errors.append("Invalid URL")
        if not self.title or len(self.title) < 3:
            errors.append("Title too short or missing")
        if self.price is not None and self.price < 0:
            errors.append("Negative price")
        if self.price is not None and self.price > 1_000_000:
            errors.append("Price suspiciously high — verify")
        if self.currency not in ("USD", "EUR", "GBP", "BTC"):
            errors.append(f"Unknown currency: {self.currency}")
        return errors

Deduplication Strategy

| Method | When to Use | Implementation |
|---|---|---|
| URL-based | Pages with unique URLs | Hash the canonical URL |
| Content hash | Same URL, changing content | MD5/SHA256 of key fields |
| Fuzzy matching | Near-duplicate detection | Jaccard similarity > 0.85 |
| Composite key | Multi-field uniqueness | Hash(domain + product_id + variant) |

import hashlib

def dedup_key(item: dict, fields: list[str]) -> str:
    """Generate dedup key from selected fields."""
    values = "|".join(str(item.get(f, "")) for f in fields)
    return hashlib.sha256(values.encode()).hexdigest()

# Usage
seen = set()
clean_items = []
for item in scraped_items:
    key = dedup_key(item, ["url", "product_id"])
    if key not in seen:
        seen.add(key)
        clean_items.append(item)
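
For the fuzzy-matching row in the table above, a token-level Jaccard sketch (the 0.85 threshold is a starting point, not a universal constant):

# Sketch: token-set Jaccard similarity for near-duplicate detection.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def is_near_duplicate(title_a: str, title_b: str, threshold: float = 0.85) -> bool:
    return jaccard(title_a, title_b) >= threshold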

Data Cleaning Pipeline

Raw HTML → Parse → Extract → Validate → Clean → Deduplicate → Store
                                ↓
                          Quarantine (failed validation)

Common cleaning operations:

| Problem | Solution |
|---|---|
| HTML entities (`&amp;`) | `html.unescape()` |
| Extra whitespace | `" ".join(text.split())` |
| Unicode issues | `unicodedata.normalize('NFKD', text)` |
| Price in text ("$49.99") | Regex: `r'[\$£€]?([\d,]+\.?\d*)'` |
| Date formats vary | `dateutil.parser.parse()` with dayfirst flag |
| Relative URLs | `urllib.parse.urljoin(base, relative)` |
| Encoding issues | `chardet.detect()` then decode |
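
A small sketch combining several of the operations above into reusable helpers (function names are illustrative):

# Sketch: apply the table's text and price cleaning operations in one pass.
import html
import re
import unicodedata

PRICE_RE = re.compile(r'[\$£€]?([\d,]+\.?\d*)')

def clean_text(raw: str) -> str:
    text = html.unescape(raw)                    # &amp; -> &
    text = unicodedata.normalize('NFKD', text)   # normalize unicode forms
    return " ".join(text.split())                # collapse extra whitespace

def parse_price(raw: str):
    match = PRICE_RE.search(raw)
    if not match:
        return None
    return float(match.group(1).replace(",", ""))

# parse_price("Now only $1,299.99!") -> 1299.99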

Phase 7: Storage & Export

Storage Decision Guide

| Volume | Frequency | Query Needs | Recommendation |
|---|---|---|---|
| <10K records | One-time | None | JSON/CSV files |
| <10K records | Recurring | Simple lookups | SQLite |
| 10K-1M records | Recurring | Complex queries | PostgreSQL |
| 1M+ records | Continuous | Analytics | PostgreSQL + partitioning |
| Append-only logs | Continuous | Time-series | ClickHouse / TimescaleDB |

SQLite Pattern (Most Common)

import sqlite3
import json
from datetime import datetime

def init_db(path="scraper_data.db"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS items (
            id INTEGER PRIMARY KEY,
            url TEXT UNIQUE,
            data JSON NOT NULL,
            scraped_at TEXT DEFAULT (datetime('now')),
            updated_at TEXT,
            checksum TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_url ON items(url)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_scraped ON items(scraped_at)")
    return conn

def upsert(conn, url, data, checksum):
    conn.execute("""
        INSERT INTO items (url, data, checksum) VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            data = excluded.data,
            updated_at = datetime('now'),
            checksum = excluded.checksum
        WHERE items.checksum != excluded.checksum
    """, (url, json.dumps(data), checksum))
    conn.commit()

Export Formats

# CSV export
import csv
import json
def to_csv(items, path, fields):
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(items)

# JSON Lines (best for large datasets — streaming)
def to_jsonl(items, path):
    with open(path, 'w') as f:
        for item in items:
            f.write(json.dumps(item) + '\n')

# Incremental export (only new/changed since last export)
def export_since(conn, last_export_time):
    cursor = conn.execute(
        "SELECT data FROM items WHERE scraped_at > ? OR updated_at > ?",
        (last_export_time, last_export_time)
    )
    return [json.loads(row[0]) for row in cursor]

Phase 8: Error Handling & Resilience

Error Classification

| HTTP Code | Meaning | Action |
|---|---|---|
| 200 | Success | Process normally |
| 301/302 | Redirect | Follow (max 5 hops) |
| 403 | Forbidden/blocked | Rotate proxy, slow down |
| 404 | Not found | Log, skip, mark URL dead |
| 429 | Rate limited | Respect Retry-After, back off 2x |
| 500-504 | Server error | Retry 3x with backoff |
| Connection timeout | Network issue | Retry with different proxy |
| SSL error | Certificate issue | Log, investigate, skip |
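
A minimal dispatcher sketch for the table above; the action names are illustrative labels to wire into your own retry loop:

# Sketch: map a status code to one of the table's actions.
def classify_response(status_code: int) -> str:
    if status_code == 200:
        return "process"
    if status_code in (301, 302):
        return "follow_redirect"   # most HTTP clients follow automatically
    if status_code == 403:
        return "rotate_proxy"
    if status_code == 404:
        return "mark_dead"
    if status_code == 429:
        return "backoff"
    if 500 <= status_code <= 504:
        return "retry"
    return "log_and_skip"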

Circuit Breaker Pattern

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=300):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure = 0
        self.state = "closed"  # closed | open | half-open
    
    def record_failure(self):
        self.failures += 1
        self.last_failure = time.time()
        if self.failures >= self.threshold:
            self.state = "open"
            # Alert: "Circuit open — too many failures"
    
    def record_success(self):
        self.failures = 0
        self.state = "closed"
    
    def can_proceed(self):
        if self.state == "closed":
            return True
        if self.state == "open":
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = "half-open"
                return True  # Try one request
            return False
        return True  # half-open: allow attempt

Checkpoint & Resume

import json
from pathlib import Path

class Checkpointer:
    def __init__(self, path="checkpoint.json"):
        self.path = Path(path)
        self.state = self._load()
    
    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"completed_urls": [], "last_page": 0, "cursor": None}
    
    def save(self):
        self.path.write_text(json.dumps(self.state))
    
    def is_done(self, url):
        return url in self.state["completed_urls"]
    
    def mark_done(self, url):
        self.state["completed_urls"].append(url)
        if len(self.state["completed_urls"]) % 50 == 0:
            self.save()  # Periodic save

Phase 9: Monitoring & Operations

Scraper Health Dashboard

dashboard:
  real_time:
    - metric: "requests_per_minute"
      alert_if: "> 60 for small sites"
    - metric: "success_rate"
      alert_if: "< 90%"
    - metric: "avg_response_time_ms"
      alert_if: "> 5000"
    - metric: "blocked_rate"
      alert_if: "> 10%"
      
  per_run:
    - metric: "pages_scraped"
    - metric: "items_extracted"
    - metric: "items_validated"
    - metric: "items_deduplicated"
    - metric: "new_items"
    - metric: "updated_items"
    - metric: "errors_by_type"
    - metric: "run_duration"
    - metric: "proxy_cost"
    
  weekly:
    - metric: "data_freshness"
      description: "% of records updated in last 7 days"
    - metric: "site_structure_changes"
      description: "Selectors that stopped matching"
    - metric: "total_cost"
      description: "Proxy + compute + storage"

Breakage Detection

Sites redesign. Selectors break. Detect it early:

def health_check(results: list[dict], expected_fields: list[str]) -> dict:
    """Check if scraper is still extracting correctly."""
    total = len(results)
    if total == 0:
        return {"status": "CRITICAL", "message": "Zero results — likely broken"}
    
    field_coverage = {}
    for field in expected_fields:
        filled = sum(1 for r in results if r.get(field))
        coverage = filled / total
        field_coverage[field] = coverage
        
    issues = []
    for field, coverage in field_coverage.items():
        if coverage < 0.5:
            issues.append(f"{field}: {coverage:.0%} fill rate (expected >50%)")
    
    if issues:
        return {"status": "WARNING", "issues": issues}
    return {"status": "OK", "field_coverage": field_coverage}

Operational Runbook

Daily:

  • Check success rate per target domain
  • Review error logs for new patterns
  • Verify data freshness

Weekly:

  • Compare extraction counts vs baseline (>20% drop = investigate)
  • Review proxy spend
  • Spot-check 10 random records for accuracy

Monthly:

  • Full selector validation against live pages
  • Review legal compliance (robots.txt changes, ToS updates)
  • Cost optimization review
  • Prune dead URLs from queue

Phase 10: Common Scraping Patterns

Pattern 1: E-commerce Price Monitor

use_case: "Track competitor prices daily"
tool: "requests + BeautifulSoup"
schedule: "Daily at 03:00 UTC (off-peak)"
targets: ["competitor-a.com/products", "competitor-b.com/api"]
data:
  - product_id
  - product_name
  - price
  - currency
  - in_stock
  - scraped_at
storage: "SQLite with price history"
alerts: "Price change > 10% → notify"

Pattern 2: Job Board Aggregator

use_case: "Aggregate job listings from multiple boards"
tool: "Scrapy with per-site spiders"
schedule: "Every 6 hours"
targets: ["board-a.com", "board-b.com", "board-c.com"]
data:
  - title
  - company
  - location
  - salary_range
  - posted_date
  - url
  - source
dedup: "Hash(title + company + location)"
storage: "PostgreSQL"

Pattern 3: News & Content Monitor

use_case: "Monitor industry news mentions"
tool: "requests + RSS feeds (preferred) + web fallback"
schedule: "Every 30 minutes"
approach:
  1: "RSS/Atom feeds (fastest, cleanest)"
  2: "Google News RSS for topic"
  3: "Direct scraping if no feed"
data:
  - headline
  - source
  - url
  - published_at
  - snippet
  - sentiment
alerts: "Keyword match → immediate notification"

Pattern 4: Social Media Intelligence

use_case: "Track brand mentions and sentiment"
tool: "Official APIs (always) + web search fallback"
rules:
  - NEVER scrape social platforms directly — use APIs
  - Twitter/X: Official API ($100/mo basic)
  - Reddit: Official API (free tier available)
  - LinkedIn: No scraping (aggressive legal action)
  - Instagram: Official API only (Meta Business)
fallback: "Brave/Google search for public mentions"

Pattern 5: Real Estate Listings

use_case: "Track property listings and prices"
tool: "Playwright (most listing sites are JS-heavy)"
schedule: "Daily"
challenges:
  - Heavy JavaScript rendering
  - Anti-bot measures (Cloudflare common)
  - Frequent layout changes
  - Map-based results
approach: "API endpoint discovery via network tab first"

Phase 11: Scaling Strategies

Concurrency Architecture

Single machine (small scale):
├── asyncio + aiohttp (Python) → 50-200 concurrent requests
├── Worker pool (ThreadPoolExecutor) → 10-50 threads
└── Scrapy reactor → Built-in concurrency

Multi-machine (large scale):
├── URL queue: Redis / RabbitMQ / SQS
├── Workers: Multiple Scrapy/custom workers
├── Results: Shared PostgreSQL / S3
└── Coordinator: Celery / custom scheduler
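
A single-machine sketch of the asyncio + aiohttp branch, assuming aiohttp is installed and a modest concurrency cap tuned to the rate-limit rules:

# Sketch: bounded-concurrency fetching with asyncio + aiohttp.
import asyncio
import aiohttp

MAX_CONCURRENCY = 50  # placeholder; tune per target and rate-limit rules

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> str:
    async with sem:  # cap in-flight requests
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            resp.raise_for_status()
            return await resp.text()

async def crawl(urls: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

# asyncio.run(crawl(["https://example.com/page1", "https://example.com/page2"]))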

Cost Optimization

| Lever | Impact | How |
|---|---|---|
| Static > Browser | 10-50x cheaper | Always try HTTP first |
| Block images/CSS/fonts | 60-80% bandwidth saved | Route filtering |
| Cache DNS | Minor but cumulative | Local DNS cache |
| Compress responses | 50-70% bandwidth | Accept-Encoding: gzip, br |
| Smart scheduling | Avoid redundant scrapes | Change detection before full re-scrape |
| Proxy tier matching | 3-10x cost difference | Don't use residential for easy sites |

Phase 12: Advanced Patterns

API Discovery (Network Tab Mining)

Before building a scraper, check if the site has hidden API endpoints:

  1. Open DevTools → Network tab
  2. Filter by XHR/Fetch
  3. Navigate the site, click load-more, filter/sort
  4. Look for JSON responses — these are your goldmine
  5. Most SPAs load data via REST/GraphQL APIs

Common hidden API patterns:

  • /api/v1/products?page=1&limit=20
  • /graphql with query parameters
  • /_next/data/... (Next.js data routes)
  • /wp-json/wp/v2/posts (WordPress)

Headless Browser Optimization

# Minimize browser resource usage
context = browser.new_context(
    viewport={"width": 1280, "height": 720},
    java_script_enabled=True,  # Only if needed
    has_touch=False,
    is_mobile=False,
)

# Block resource types you don't need
page.route("**/*", lambda route: (
    route.abort() if route.request.resource_type in 
    ["image", "stylesheet", "font", "media"] 
    else route.continue_()
))

Scraping Behind Authentication

# When authorized to scrape behind login
# ALWAYS use session-based auth, never store passwords in code

# Pattern: Login once, reuse session
import os
import requests

session = requests.Session()
login_resp = session.post("https://example.com/login", data={
    "username": os.environ["SCRAPE_USER"],
    "password": os.environ["SCRAPE_PASS"],
})
assert login_resp.ok, "Login failed"

# Session cookies are now stored — use for subsequent requests
data_resp = session.get("https://example.com/api/data")

Change Detection (Avoid Redundant Scrapes)

import hashlib

def has_changed(url, session, last_etag=None, last_modified=None):
    """Check if page changed without downloading full content."""
    headers = {}
    if last_etag:
        headers["If-None-Match"] = last_etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    
    resp = session.head(url, headers=headers)
    
    if resp.status_code == 304:
        return False, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    
    return True, resp.headers.get("ETag"), resp.headers.get("Last-Modified")

Quality Scoring Rubric (0-100)

| Dimension | Weight | What to Assess |
|---|---|---|
| Legal compliance | 20% | robots.txt, ToS, PII handling, audit trail |
| Data quality | 20% | Validation, accuracy, completeness, freshness |
| Resilience | 15% | Error handling, retries, circuit breakers, checkpointing |
| Anti-detection | 15% | Proxy rotation, fingerprint diversity, rate limiting |
| Architecture | 10% | Right tool selection, clean code, modularity |
| Monitoring | 10% | Success rates, breakage detection, alerting |
| Performance | 5% | Speed, cost efficiency, resource usage |
| Documentation | 5% | Runbook, schema docs, legal assessment |

Grading: 90+ Excellent | 75-89 Good | 60-74 Needs work | <60 Redesign


10 Common Mistakes

| # | Mistake | Fix |
|---|---|---|
| 1 | No robots.txt check | Always check first — it's your legal defense |
| 2 | Fixed delays (no jitter) | Add ±30% random jitter to all delays |
| 3 | No data validation | Validate every field before storing |
| 4 | Using browser for static HTML | HTTP client is 10-50x faster and cheaper |
| 5 | Single IP, no rotation | Proxy rotation for any serious scraping |
| 6 | No breakage detection | Monitor extraction counts and field fill rates |
| 7 | Storing raw HTML only | Extract + structure immediately |
| 8 | No checkpoint/resume | Long scrapes must be resumable |
| 9 | Ignoring structured data | JSON-LD/microdata is cleaner than CSS selectors |
| 10 | Scraping when API exists | Always check for API first |

5 Edge Cases

  1. Single-page apps (React/Vue/Angular): Must use browser rendering OR find the underlying API (network tab). Prefer API discovery — it's faster and more reliable.

  2. Infinite scroll: Intercept the XHR/fetch calls that load more content. Simulate scrolling only as last resort. The API endpoint usually accepts page or offset params.

  3. CAPTCHAs: If you're hitting CAPTCHAs, you're scraping too aggressively. Slow down first. If CAPTCHAs persist: managed services (2Captcha, Anti-Captcha) or rethink approach.

  4. Dynamic class names (CSS modules, Tailwind): Use data attributes, ARIA labels, or text content selectors instead. [data-testid="price"] survives redesigns. .sc-bdVTJa does not.

  5. Multi-language sites: Detect language via html[lang] attribute. Set Accept-Language header to get desired locale. Watch for different URL structures (/en/, /de/, subdomains).


Natural Language Commands

  1. "Check if I can scrape [URL]" → Run compliance checklist (robots.txt, ToS, data type)
  2. "What tool should I use for [site]?" → Analyze site rendering, anti-bot, recommend tool
  3. "Build a scraper for [description]" → Full architecture brief + code pattern
  4. "My scraper is getting blocked" → Anti-detection diagnostic + proxy/stealth recommendations
  5. "Extract [data] from [URL]" → Check structured data first, then CSS selectors
  6. "Monitor [site] for changes" → Change detection + scheduling + alerting setup
  7. "How do I handle pagination on [site]?" → Identify pagination type + code pattern
  8. "Scrape at scale ([N] pages)" → Concurrency architecture + cost estimate
  9. "Clean and store this scraped data" → Validation + dedup + storage recommendation
  10. "Is my scraper healthy?" → Run health check + breakage detection
  11. "Find the API behind [site]" → Network tab mining guide + common patterns
  12. "Set up price monitoring for [competitors]" → Full e-commerce monitor pattern
