
How does the harvester work? Is there a relationship between how the harvester works and how the crawler works?

April 26, 2017 (tags: crawler, harvester, relationship)

1 Answer

The harvester extracts URLs from text and code. It recognizes URLs in three forms: http://www.govcom.org, http://govcom.org and www.govcom.org. A bare domain such as govcom.org is not retained. The Issue Crawler, by contrast, analyzes hyperlinks in HTML and other code, where hyperlinks must include http://. The Issue Crawler therefore handles all hyperlinks, not only those that begin with http:// or www. Note that the Issue Crawler, like all crawlers we know of, does not handle JavaScript. It also does not extract links from PDFs, although PDFs may appear in the final results.
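
To make the difference concrete, here is a rough Python sketch of the two behaviors described in this answer. It is not the actual harvester or Issue Crawler code. The names `harvest_urls`, `extract_hrefs` and `URL_PATTERN` are made up for illustration, and accepting `https://` alongside `http://` is an assumption that the answer does not mention.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: keep only URLs that start with http://, https:// or www.
# (https:// is an assumption; the answer above only mentions http:// and www.)
URL_PATTERN = re.compile(r"\b(?:https?://|www\.)[^\s<>\"'()]+", re.IGNORECASE)


def harvest_urls(text):
    """Harvester-style: pull URLs out of free text, skipping bare domains."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that trails the URL.
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_hrefs(html):
    """Crawler-style: read hyperlinks from HTML markup, not from the visible text."""
    parser = _LinkParser()
    parser.feed(html)
    return parser.links


sample_text = (
    "See http://www.govcom.org, http://govcom.org and www.govcom.org, "
    "but not govcom.org."
)
sample_html = '<a href="http://govcom.org/about">About</a> plain govcom.org'

print(harvest_urls(sample_text))
print(extract_hrefs(sample_html))
```

In this sketch the first function keeps the three URL forms listed above and ignores the bare govcom.org, while the second reads only the `href` attributes of anchor tags, so text that merely mentions a domain is never treated as a link.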