Design a Scalable Web Crawler

viaLeetCode

Problem Design a scalable web crawler: start from seed URLs, download pages, extract links, and keep crawling — politely and at scale.

Functional requirements

Fetch pages, parse and extract URLs, enqueue unseen ones; store page content for downstream indexing; re-crawl on a freshness schedule; respect robots.txt.

Non-functional requirements

Scale to discuss: ~1B pages/month (≈400 fetches/sec sustained); politeness — bounded request rate per domain regardless of cluster size; dedupe both URLs and content; tolerate crashes without losing frontier state.

Key components

URL frontier: the heart — a two-level queue system separating priority selection from per-domain FIFO politeness queues (one active fetch stream per domain, delayed by crawl-delay); persistent/checkpointed.
Fetcher workers (async I/O, DNS cache), robots.txt cache per domain, parser/extractor, URL normalizer + seen-set (Bloom filter for scale — accept false positives, know the trade), content dedupe (hash/SimHash for near-duplicates), content store (blob store + metadata DB), scheduler for re-crawls (priority by change frequency/PageRank-ish importance).

Deep dives / trade-offs

Partitioning: shard by domain-hash so politeness state stays local to one worker; rebalancing on worker failure.
Traps and hostile input: crawler traps (infinite calendars), URL explosion, spider-trap detection via per-domain budgets; timeouts, retry policy, DNS as a bottleneck.
Frontier persistence vs memory (the frontier can exceed RAM); prioritization mixing freshness, importance, and fairness without starving small domains.

Add a follow-up question they asked

No follow-ups yet. Be the first to add one.

asked …