Design a web crawler

Last updated 4 years ago

Reference

Use cases

  • Search engine

  • Copyright violation detection

  • Keyword-based search

  • Web malware detection

  • Web analytics

  • Data collection for data science

Things to consider

  • Politeness/crawl rate

  • DNS query

  • Distributed crawling

  • Priority crawling

  • Duplicate detection
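
Several of the considerations above can be illustrated with a toy URL frontier. This is a minimal sketch under stated assumptions, not a reference design: the `Frontier` class, the `worker_for` helper, the fixed per-host delay, and the "lower number means higher priority" convention are all hypothetical choices made for illustration. It covers politeness (one fetch per host per delay window), priority crawling (a heap), exact-URL duplicate detection (a seen set), and distributed crawling (hashing the host to pick a worker). DNS handling is not shown.

```python
import hashlib
import heapq
from urllib.parse import urlsplit


class Frontier:
    """Toy URL frontier: priority ordering plus a per-host crawl delay."""

    def __init__(self, delay_seconds=1.0):
        self.delay_seconds = delay_seconds  # assumed fixed politeness delay
        self._heap = []          # entries: (priority, sequence, url)
        self._seq = 0            # tie-breaker keeps insertion order stable
        self._seen = set()       # exact-URL duplicate detection
        self._next_allowed = {}  # host -> earliest time of the next fetch

    def add(self, url, priority=10):
        """Queue a URL once; lower priority numbers are crawled first."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._seq, url))
        self._seq += 1
        return True

    def pop_ready(self, now):
        """Return the best URL whose host may be fetched at `now`, else None."""
        deferred = []
        result = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            host = urlsplit(entry[2]).hostname
            if self._next_allowed.get(host, 0.0) <= now:
                # Host is free: reserve its next slot and hand out the URL.
                self._next_allowed[host] = now + self.delay_seconds
                result = entry[2]
                break
            deferred.append(entry)  # host still cooling down, try later
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return result


def worker_for(url, num_workers):
    """Pick a worker by hashing the host, so one worker owns each host."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```

Routing by host rather than by full URL keeps the politeness state for a site on a single worker, at the cost of uneven load when a few hosts dominate the crawl.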

Questions

  • How to generate a content signature?

  • What to do with similar content?
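
One possible answer to both questions, sketched under assumptions: hash the normalized page text for an exact-duplicate signature, and compare sets of word shingles with Jaccard similarity to flag near-duplicates. The function names, the normalization rules, and the shingle size `k=3` are hypothetical choices for illustration; production crawlers often use more compact schemes such as SimHash or MinHash instead.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting is ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def content_signature(text):
    """Exact-duplicate signature: SHA-256 of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of k-word shingles taken from the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)
```

A crawler could skip a page whose signature is already stored, and treat a page whose similarity to a stored page exceeds some chosen threshold as similar content, for example by indexing only one representative copy.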

https://github.com/donnemartin/system-design-primer/blob/master/solutions/system_design/web_crawler/README.md