Important Notice: Our web hosting provider recently started charging us for additional visits, which was unexpected. In response, we're seeking donations. Depending on the situation, we may explore different monetization options for our Community and Expert Contributors. It's crucial to provide more returns for their expertise and offer more Expert Validated Answers or AI Validated Answers. Learn more about our hosting issue here.

Can I use Lucene to crawl my site or other sites on the Internet?

April 26, 2017Crawl internet lucene site sites

0

Posted

Can I use Lucene to crawl my site or other sites on the Internet?

1 Answer

0

Posted

No. Lucene does not know how to access external document, nor does it know how to extract the content and links of HTML and other document format. Lucene focuses on the indexing and searching and does it great. However, several crawlers are available which you could use: list of Open Source Crawlers in Java. regain is an Open Source tool that crawls web sites, stores them in a Lucene index and offers a search web interface. Also see Nutch for a powerful Lucene-based search engine.