How does the harvester work? Is there a relationship between how the harvester works and how the crawler works?
The harvester extracts URLs from text and code. It recognizes three forms: http://www.govcom.org, http://govcom.org and www.govcom.org. A bare domain such as govcom.org is not retained. The Issue Crawler itself works differently: it analyzes hyperlinks in HTML and other code, where hyperlinks must include http://. The Issue Crawler therefore deals with all hyperlinks in the pages it crawls, not just those that begin with http:// or www. Note that the Issue Crawler, like all crawlers we know of, does not handle JavaScript. The Issue Crawler also does not extract links from PDFs, although PDFs may appear in the final results.
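To make the harvester's matching rule concrete, here is a minimal Python sketch. It is not the harvester's actual code: the pattern, the function name harvest_urls and the trailing-punctuation cleanup are assumptions made for illustration. It only mirrors the behavior described above, keeping URLs that start with http:// or www. and skipping bare domains.

```python
import re

# Hypothetical pattern (an assumption, not the harvester's real rule):
# a token counts as a URL only if it starts with "http://" or "www.".
URL_PATTERN = re.compile(r"\b(?:http://|www\.)[^\s<>\"']+", re.IGNORECASE)


def harvest_urls(text):
    """Return the URLs found in text, in order of first appearance."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that happens to trail the URL.
        url = match.group(0).rstrip(".,;:!?)")
        if url not in urls:
            urls.append(url)
    return urls


sample = (
    "See http://www.govcom.org, http://govcom.org and www.govcom.org, "
    "but not govcom.org."
)
print(harvest_urls(sample))
```

With this sample, the three prefixed forms are kept and the bare govcom.org at the end is skipped, because nothing marks it as a URL.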