2dbi

Design a Scalable Web Crawler

viaLeetCode

Problem Design a scalable web crawler: start from seed URLs, download pages, extract links, and keep crawling — politely and at scale.

Functional requirements

  • Fetch pages, parse and extract URLs, enqueue unseen ones; store page content for downstream indexing; re-crawl on a freshness schedule; respect robots.txt.

Non-functional requirements

  • Scale to discuss: ~1B pages/month (≈400 fetches/sec sustained); politeness — bounded request rate per domain regardless of cluster size; dedupe both URLs and content; tolerate crashes without losing frontier state.

Key components

  • URL frontier: the heart — a two-level queue system separating priority selection from per-domain FIFO politeness queues (one active fetch stream per domain, delayed by crawl-delay); persistent/checkpointed.
  • Fetcher workers (async I/O, DNS cache), robots.txt cache per domain, parser/extractor, URL normalizer + seen-set (Bloom filter for scale — accept false positives, know the trade), content dedupe (hash/SimHash for near-duplicates), content store (blob store + metadata DB), scheduler for re-crawls (priority by change frequency/PageRank-ish importance).

Deep dives / trade-offs

  • Partitioning: shard by domain-hash so politeness state stays local to one worker; rebalancing on worker failure.
  • Traps and hostile input: crawler traps (infinite calendars), URL explosion, spider-trap detection via per-domain budgets; timeouts, retry policy, DNS as a bottleneck.
  • Frontier persistence vs memory (the frontier can exceed RAM); prioritization mixing freshness, importance, and fairness without starving small domains.
Add a follow-up question they asked
No follow-ups yet. Be the first to add one.
asked …
LeaderboardSalary
Language
Account