
Design a web crawler

Reference

https://github.com/donnemartin/system-design-primer/blob/master/solutions/system_design/web_crawler/README.md

Use cases

Search engine
Copyright violation detection
Keyword-based search
Web malware detection
Web analytics
Data collection for data science

Things to consider

Politeness/crawl rate
DNS query
Distributed crawling
Priority crawling
Duplicate detection
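
As a rough illustration of how politeness, priority crawling, and URL-level duplicate detection might fit together, here is a minimal single-process sketch of a crawl frontier. The `Frontier` class, its method names, and the fixed per-host delay are assumptions made for this example; they do not come from the referenced solution, and a distributed crawler would need to shard this state across workers.

```python
import heapq
from urllib.parse import urlsplit


class Frontier:
    """Hypothetical crawl frontier: priority queue plus per-host politeness."""

    def __init__(self, delay_seconds=1.0):
        self.delay_seconds = delay_seconds
        self._heap = []          # entries: (priority, sequence, url)
        self._seen = set()       # URLs already enqueued (URL-level dedup)
        self._next_allowed = {}  # host -> earliest time of the next fetch
        self._sequence = 0       # tie-breaker that keeps FIFO order

    def add(self, url, priority):
        """Enqueue a URL once; lower priority values are crawled first."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._sequence, url))
        self._sequence += 1
        return True

    def pop_ready(self, now):
        """Return the best URL whose host may be fetched at `now`, else None."""
        deferred = []
        result = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            host = urlsplit(entry[2]).netloc
            if self._next_allowed.get(host, 0.0) <= now:
                # Reserve the host so the next fetch waits for the delay.
                self._next_allowed[host] = now + self.delay_seconds
                result = entry[2]
                break
            deferred.append(entry)
        # Put back the URLs that were skipped because their host is cooling down.
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return result


frontier = Frontier(delay_seconds=2.0)
frontier.add("https://a.example/1", priority=0)
frontier.add("https://a.example/2", priority=1)
frontier.add("https://b.example/1", priority=2)
first = frontier.pop_ready(now=0.0)   # highest-priority URL
second = frontier.pop_ready(now=0.5)  # a.example is cooling down, so b.example is served
```

Passing the clock in as `now` keeps the sketch deterministic; a real crawler would more likely read a monotonic clock and also honor each site's robots.txt rules.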

Questions

How to generate a content signature?
What to do with similar content?
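
One possible way to approach both questions, sketched under assumptions: hash the normalized page text to get an exact-duplicate signature, and compare sets of word shingles with Jaccard similarity to flag near-duplicates. The function names, the whitespace and case normalization, and the shingle size of 3 are illustrative choices for this example, not something the notes prescribe. Large crawlers often use compact fingerprints such as SimHash or MinHash instead of full shingle sets.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def content_signature(text):
    """Exact-duplicate signature: SHA-256 of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, size=3):
    """Set of overlapping word n-grams (assumed size 3) for the text."""
    words = normalize(text).split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a, b, size=3):
    """Share of shingles two texts have in common, between 0.0 and 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


same = content_signature("Hello   World") == content_signature("hello world")
score = jaccard_similarity("a b c d e", "a b c d f")
```

What to do with similar content is then a policy choice: for example, a crawler could skip storing pages whose similarity to an already indexed page exceeds a chosen threshold, or keep them and mark one as the canonical copy.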


