Design a web crawler

Use cases

  • Search engine

  • Copyright violation detection

  • Keyword-based search

  • Web malware detection

  • Web analytics

  • Data collection for data science

Things to consider

  • Politeness/crawl rate

  • DNS resolution

  • Distributed crawling

  • Priority crawling

  • Duplicate detection
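
Several of the considerations above (politeness, priority crawling, and URL-level duplicate detection) meet in the crawl frontier. The following is a minimal single-process sketch, not a definitive design. The class name Frontier, the delay_seconds parameter, and the "lower number means higher priority" convention are all assumptions made for illustration. The clock is passed in as an argument so the behavior is easy to test.

```python
import heapq
from urllib.parse import urlsplit


class Frontier:
    """Toy crawl frontier: priority ordering plus a per-host politeness delay."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds   # assumed minimum gap between hits to one host
        self.heap = []               # entries: (priority, sequence, url)
        self.seen = set()            # URLs already queued (exact-match dedupe)
        self.next_allowed = {}       # host -> earliest time of the next fetch
        self.counter = 0             # tie-breaker that preserves insertion order

    def add(self, url, priority=10):
        """Queue a URL once; a lower priority number is crawled earlier."""
        if url in self.seen:
            return False
        self.seen.add(url)
        heapq.heappush(self.heap, (priority, self.counter, url))
        self.counter += 1
        return True

    def pop_ready(self, now):
        """Return the best URL whose host may be fetched at time `now`, else None."""
        deferred = []
        result = None
        while self.heap:
            entry = heapq.heappop(self.heap)
            host = urlsplit(entry[2]).netloc
            if self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.delay
                result = entry[2]
                break
            deferred.append(entry)   # host is cooling down; try the next entry
        for entry in deferred:
            heapq.heappush(self.heap, entry)
        return result
```

A caller would pass time.monotonic() as now in a real loop. In a distributed setup, one common option is to assign each host to a single worker (for example by hashing the host name) so that the politeness state stays local to that worker.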

Questions

  • How to generate content signature?

  • What to do with similar content?
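
As one possible way to approach both questions, here is a sketch of two kinds of content signature: an exact hash over normalized text, and a SimHash-style fingerprint whose Hamming distance tends to be small for similar pages. The function names, the crude tokenizer, and the 64-bit width are assumptions chosen for illustration; a production crawler would likely use a better tokenizer and weighted features.

```python
import hashlib
import re


def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately crude normalization step."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_signature(text):
    """SHA-256 over normalized tokens: equal only for identical token sequences."""
    normalized = " ".join(tokens(text))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def simhash(text, bits=64):
    """SimHash-style fingerprint: each token votes on every bit position."""
    weights = [0] * bits
    for tok in tokens(text):
        # Use the first 8 bytes of an MD5 digest as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

The exact signature catches byte-for-byte copies after normalization. For similar content, a crawler could compare fingerprints with hamming() and skip or down-rank pages that fall under some distance threshold; the right threshold is a tuning decision that depends on the corpus.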
