Where can I get more background on Heritrix and learn more about “crawling” in general?

April 26, 2017 | Tags: background, crawling, heritrix, learn

1 Answer

The following are all worth at least a quick skim:

• The Wikipedia Web crawler page offers a nice introduction to the general crawling problem, with a good overview of the current, most-cited literature.
• Mercator: A Scalable, Extensible Web Crawler is an overview of the original Mercator design, which the Heritrix crawler parallels in many ways.
• High-performance Web Crawling covers the experience of scaling Mercator.
• Performance Limitations of the Java Core Libraries covers Mercator's experience working around Java problems and bottlenecks. Fortunately, many of these issues have been improved by later JVMs and Java core API updates, but some remain, and in any case the paper gives a good flavor of the kinds of problems and profiling one might need to do.
• Ubicrawler, a scalable distributed web crawler.
• The Viuva Negra crawler paper describes common architectures and common issues encountered in crawling, as an introduction to the VN crawler. The paper ends with compar