How does Nutch relate to distributed Web crawler Grub, and what do you think of it?
As far as I can tell, Grub is a project that lets folks donate their hardware and bandwidth to LookSmart’s crawling effort. Only the client is open source, not the server, so folks can neither deploy their own version of Grub, nor can they access the data that Grub gathers.

What about distributed crawling more generally?

When a search engine gets big, crawl-related expenses are dwarfed by search-related expenses. So a distributed crawler doesn’t significantly improve costs; rather, it makes more complicated something that is already relatively inexpensive. That’s not a good tradeoff.

Widely distributed search is interesting, but I’m not sure it can yet be done while keeping things as fast as they need to be. A faster search engine is a better search engine: when folks can quickly revise queries, they more frequently find what they’re looking for before they get impatient. But building a widely distributed search system that can search billions of pages in a fraction of a second is difficult.