https://github.com/donnemartin/system-design-primer/blob/master/solutions/system_design/web_crawler/README.md
Search engine
Copyright violation detection
Keyword-based finding
Web malware detection
Web analytics
Data for data science
Politeness/crawl rate
DNS query
Distributed crawling
Priority crawling
Duplicate detection
How to generate content signature?
What to do with similar content?
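
One possible way to approach the two duplicate-detection questions above is sketched below. It is an illustration under stated assumptions, not a method taken from the linked primer: exact duplicates get a SHA-256 hash of whitespace-normalized text as their signature, and similar content is scored with Jaccard similarity over word shingles. The function names, the shingle size k=3, and the normalization rules are all hypothetical choices.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences do not change the signature (assumed rules)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def content_signature(text: str) -> str:
    """Exact-duplicate signature: SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles (overlapping word windows) of the normalized text."""
    words = normalize(text).split()
    if not words:
        return set()
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, in the range [0.0, 1.0]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty pages are considered identical
    return len(sa & sb) / len(sa | sb)
```

A crawler could store `content_signature(page)` in a set and skip pages whose signature it has already seen. For similar content, it could compare `jaccard_similarity` against a threshold chosen for the application and then drop, merge, or down-rank the page. Comparing every pair of pages does not scale, so large crawls typically approximate this with techniques such as MinHash or SimHash.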
Last updated 4 years ago