
What is the history of the ccBot crawler?


The ccBot crawler is a distributed crawling infrastructure built on the Apache Hadoop and Nutch projects. We use MapReduce (via the open source Hadoop project) to process our crawl database and extract crawl candidates from it. The candidate list is sorted by host (domain name) and then distributed via RPC to a set of spider (bot) servers. The resulting crawl data is post-processed for link extraction and deduplication, then reintegrated into the crawl database.
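
The host-sorted distribution step described above can be illustrated with a small, single-process sketch. This is not ccBot's actual code: the function names, the round-robin assignment, and the example URLs are all assumptions made for illustration, and the real system runs these stages as Hadoop MapReduce jobs and hands batches to spiders over RPC.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def deduplicate(urls):
    """Drop repeated URLs while keeping first-seen order (hypothetical helper)."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def group_by_host(candidate_urls):
    """Group candidate URLs by host name, with hosts in sorted order."""
    by_host = defaultdict(list)
    for url in candidate_urls:
        host = urlsplit(url).hostname or ""
        by_host[host].append(url)
    return dict(sorted(by_host.items()))


def assign_to_spiders(by_host, spider_count):
    """Give each host's URLs to exactly one spider.

    Round-robin over sorted hosts is an assumption; the source only says
    the list is sorted by host and distributed to spider servers.
    """
    batches = [[] for _ in range(spider_count)]
    for index, urls in enumerate(by_host.values()):
        batches[index % spider_count].extend(urls)
    return batches


# Example candidates (made-up URLs on reserved example domains).
candidates = [
    "http://b.example/1",
    "http://a.example/x",
    "http://b.example/2",
    "http://a.example/x",
]
unique_candidates = deduplicate(candidates)
hosts = group_by_host(unique_candidates)
batches = assign_to_spiders(hosts, spider_count=2)
```

Keeping all URLs for one host on the same spider is a common reason to sort by host, since it lets a single server pace its requests to that host, though the answer above does not state the motivation.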
