Design a decentralized web crawler
Question
We can't use a centralized data store.
We deploy the same crawler software on each node, and each node does the job of both crawling and storing data. We have 10,000 nodes, and each node can communicate with all other nodes. We have to minimize communication between nodes and make sure each node does an equal amount of work.
Solution
After a node crawls a page, it extracts a list of URLs to crawl next.
The brute-force solution would be for this node to ask every other node which of those links have already been crawled, which is exactly the kind of communication we want to avoid.
A better solution is for the node to use consistent hashing to determine which node owns each URL, and then send the URL directly to that node. In this way, every node acts as a task assigner, and communication between nodes is minimized: one message per discovered URL instead of a broadcast to all 10,000 nodes.
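Below is a minimal sketch of this idea in Python, assuming an MD5-based hash ring with virtual nodes; the node names, replica count, and hash function are illustrative choices, not part of the original design.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring mapping URLs to crawler node IDs."""

    def __init__(self, nodes=(), vnodes=100):
        # Each physical node gets `vnodes` points on the ring so that
        # the keyspace (and hence the crawl work) is spread evenly.
        self.vnodes = vnodes
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node_id):
        # A joining node only claims the keys that now hash to its points.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node_id}#{i}"), node_id))

    def remove_node(self, node_id):
        # A leaving node's keys fall through to the next points on the ring.
        self.ring = [p for p in self.ring if p[1] != node_id]

    def get_node(self, url):
        # The URL belongs to the first node point clockwise from its hash.
        idx = bisect.bisect(self.ring, (self._hash(url), ""))
        return self.ring[idx % len(self.ring)][1]


# Every crawler node holds the same ring and forwards newly discovered
# URLs to whichever node the ring points at (a 50-node cluster here,
# purely for illustration).
ring = HashRing([f"node-{i}" for i in range(50)])
print(ring.get_node("https://example.com/some/page"))
```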
When a new node is added to the system, only a small fraction of the links (both already crawled and still pending) owned by existing nodes is migrated to the new node.
When a node goes offline, the links assigned to it are redistributed across the remaining nodes.
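The rebalancing behavior can be checked with a small demo that reuses the HashRing sketch above; the URL set and cluster size are hypothetical, chosen only to make the moved fraction visible.

```python
urls = [f"https://example.com/page/{i}" for i in range(10_000)]

ring = HashRing([f"node-{i}" for i in range(50)])
before = {u: ring.get_node(u) for u in urls}

# A new node joins: only the URLs it now owns change owner (~1/51 of them).
ring.add_node("node-50")
after_join = {u: ring.get_node(u) for u in urls}
moved = sum(before[u] != after_join[u] for u in urls)
print(f"moved on join: {moved / len(urls):.1%}")

# A node goes offline: only its URLs are redistributed to the others.
ring.remove_node("node-7")
after_leave = {u: ring.get_node(u) for u in urls}
moved = sum(after_join[u] != after_leave[u] for u in urls)
print(f"moved on leave: {moved / len(urls):.1%}")
```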