What is the history of the ccBot crawler?
The ccBot crawler is a distributed crawling infrastructure built on the Apache Hadoop and Apache Nutch projects. We use MapReduce (via the open-source Hadoop project) to process our crawl database and extract crawl candidates from it. The candidate list is sorted by host (domain name) and then distributed via RPC to a set of spider (bot) servers. The resulting crawl data is post-processed for link extraction and deduplication, and then reintegrated into the crawl database.
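As a rough illustration of the "group candidates by host" step, here is a minimal MapReduce sketch, not ccBot's actual code: it assumes the crawl candidates arrive as one URL per line of text, and the class and job names (HostPartitionJob, HostMapper, HostReducer) are hypothetical. The mapper keys each URL by its host, and the shuffle then delivers all candidates for a host to one reducer, producing per-host fetch lists that could be handed to individual spider servers.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical sketch: group crawl-candidate URLs by host so that each
// spider server can be handed the candidate list for the hosts it owns.
public class HostPartitionJob {

  // Map: emit (host, url) for every candidate URL in the input.
  public static class HostMapper extends Mapper<Object, Text, Text, Text> {
    private final Text host = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String url = value.toString().trim();
      try {
        String h = URI.create(url).getHost();
        if (h != null) {
          host.set(h);
          context.write(host, value);   // key by host (domain name)
        }
      } catch (IllegalArgumentException e) {
        // skip malformed URLs
      }
    }
  }

  // Reduce: all candidates for one host arrive together (sorted by key
  // during the shuffle); emit them as a single per-host fetch list.
  public static class HostReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text host, Iterable<Text> urls, Context context)
        throws IOException, InterruptedException {
      for (Text url : urls) {
        context.write(host, url);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "host-partition");
    job.setJarByClass(HostPartitionJob.class);
    job.setMapperClass(HostMapper.class);
    job.setReducerClass(HostReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In a setup like this, partitioning by host also makes politeness easier to enforce, since each host's fetch queue ends up owned by a single spider server rather than spread across many.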