How does Nutch relate to distributed Web crawler Grub, and what do you think of it?
As far as I can tell, Grub is a project that lets folks donate their hardware and bandwidth to LookSmart’s crawling effort. Only the client is open source, not the server, so folks can neither deploy their own version of Grub, nor can they access the data that Grub gathers.

What about distributed crawling more generally?

When a search engine gets big, crawl-related expenses are dwarfed by search-related expenses. So a distributed crawler doesn’t significantly improve costs; rather, it makes more complicated something that is already relatively inexpensive. That’s not a good tradeoff.

Widely distributed search is interesting, but I’m not sure it can yet be done while keeping things as fast as they need to be. A faster search engine is a better search engine: when folks can quickly revise queries, they more frequently find what they’re looking for before they get impatient. But building a widely distributed search system that can search billions of pages in a fraction of a second is difficult.