Where can I get more background on Heritrix and learn more about “crawling” in general?
The following are all worth at least a quick skim:
• The Wikipedia Webcrawler page offers a nice introduction to the general crawling problem. It has a good overview of the current, most-cited literature.
• Mercator: A Scalable, Extensible Web Crawler is an overview of the original Mercator design, which the Heritrix crawler parallels in many ways.
• High-performance Web Crawling covers the experience of scaling Mercator.
• Performance Limitations of the Java Core Libraries covers Mercator's experience working around Java problems and bottlenecks. Fortunately, later JVMs and Java core API updates have improved many of these issues, but some remain, and in any case the paper gives a good flavor of the kinds of problems and profiling one might need to do.
• Ubicrawler, a scalable distributed web crawler.
• The Viuva Negra crawler paper describes common architectures and common issues encountered when crawling, as an introduction to the VN crawler. The paper ends with compar