What is the history of the ccBot crawler?
The ccBot crawler is a distributed crawling infrastructure built on the Apache Hadoop and Apache Nutch projects. We use MapReduce (via the open-source Hadoop project) to process our crawl database and extract crawl candidates from it. The candidate list is sorted by host (domain name) and then distributed via RPC to a set of spider (bot) servers. The resulting crawl data is post-processed for link extraction and deduplication, and then reintegrated into the crawl database.
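As a rough illustration of the "group candidates by host" step, here is a minimal MapReduce sketch, not ccBot's actual code: it assumes the crawl candidates arrive as one URL per line of text, and the class and job names (HostPartitionJob, HostMapper, HostReducer) are hypothetical. The mapper keys each URL by its host, and the shuffle then delivers all candidates for a host to one reducer, producing per-host fetch lists that could be handed to individual spider servers.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical sketch: group crawl-candidate URLs by host so that each
// spider server can be handed the candidate list for the hosts it owns.
public class HostPartitionJob {

  // Map: emit (host, url) for every candidate URL in the input.
  public static class HostMapper extends Mapper<Object, Text, Text, Text> {
    private final Text host = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String url = value.toString().trim();
      try {
        String h = URI.create(url).getHost();
        if (h != null) {
          host.set(h);
          context.write(host, value);   // key by host (domain name)
        }
      } catch (IllegalArgumentException e) {
        // skip malformed URLs
      }
    }
  }

  // Reduce: all candidates for one host arrive together (sorted by key
  // during the shuffle); emit them as a single per-host fetch list.
  public static class HostReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text host, Iterable<Text> urls, Context context)
        throws IOException, InterruptedException {
      for (Text url : urls) {
        context.write(host, url);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "host-partition");
    job.setJarByClass(HostPartitionJob.class);
    job.setMapperClass(HostMapper.class);
    job.setReducerClass(HostReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In a setup like this, partitioning by host also makes politeness easier to enforce, since each host's fetch queue ends up owned by a single spider server rather than spread across many.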