Design a Scalable Web Crawler
viaLeetCode
Problem Design a scalable web crawler: start from seed URLs, download pages, extract links, and keep crawling — politely and at scale.
Functional requirements
- Fetch pages, parse and extract URLs, enqueue unseen ones; store page content for downstream indexing; re-crawl on a freshness schedule; respect robots.txt.
Non-functional requirements
- Scale to discuss: ~1B pages/month (≈400 fetches/sec sustained); politeness — bounded request rate per domain regardless of cluster size; dedupe both URLs and content; tolerate crashes without losing frontier state.
Key components
- URL frontier: the heart — a two-level queue system separating priority selection from per-domain FIFO politeness queues (one active fetch stream per domain, delayed by crawl-delay); persistent/checkpointed.
- Fetcher workers (async I/O, DNS cache), robots.txt cache per domain, parser/extractor, URL normalizer + seen-set (Bloom filter for scale — accept false positives, know the trade), content dedupe (hash/SimHash for near-duplicates), content store (blob store + metadata DB), scheduler for re-crawls (priority by change frequency/PageRank-ish importance).
Deep dives / trade-offs
- Partitioning: shard by domain-hash so politeness state stays local to one worker; rebalancing on worker failure.
- Traps and hostile input: crawler traps (infinite calendars), URL explosion, spider-trap detection via per-domain budgets; timeouts, retry policy, DNS as a bottleneck.
- Frontier persistence vs memory (the frontier can exceed RAM); prioritization mixing freshness, importance, and fairness without starving small domains.
asked …