> For the complete documentation index, see [llms.txt](https://liuzhenglaichn.gitbook.io/system-design/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://liuzhenglaichn.gitbook.io/system-design/web-crawler/design-a-decentralized-web-crawler.md).

# Design a decentralized web crawler

## Question

<https://leetcode.com/discuss/interview-question/system-design/124657/facebook-system-design-a-web-crawler-that-will-crawl-wikipedia>

We can't use a centralized data store.

We need to deploy the same crawler software on each node and each node will do the job of crawling and storing data. We have 10,000 nodes and each node can communicate with all other nodes. We have to minimize communication and make sure each node does equal amount of work.

## Solution

After a node crawls a page, it will get a list of URLs to crawl next.

The brute force solution would be that this node communicates with all other nodes to see what are the links that haven't been crawled.

A better solution would be that a node use **Consistent Hashing** to determine which node should handle that URL and send the URL to that node. In this way, each node becomes a task assigner, and the communication between nodes are minimized.

* When a new node is added to the system, a small fraction of links (crawled or not) that were or will be crawled in an old node will get migrated to the new node.&#x20;
* When a node becomes offline, the links assigned to this node will be shared across other nodes.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://liuzhenglaichn.gitbook.io/system-design/web-crawler/design-a-decentralized-web-crawler.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
