
What kind of data/documents can Lemur index?

April 26, 2017

Tags: data, documents, index, lemur


1 Answer


Actually, you can create your own parser for whatever text documents you have, as long as your parser takes whatever it wants to recognize as a term and “pushes” it into the index. However, we do provide several parsers with the toolkit. Lemur is primarily a research system, so the included parsers were designed to facilitate indexing many documents that are stored in the same file. For the index to know where the document boundaries are within a file, each document must have begin-document and end-document tags. These tags are similar to HTML or XML tags and are actually the format used for NIST’s Text REtrieval Conference (TREC) documents. The two most frequently used parsers are the TrecParser and WebParser.

TrecParser: This parser recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. For example:

<DOC>
<DOCNO> document_number </DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>

WebParser: This parser removes HTML tags, text within SCRIPT tags, as well as text in HTML c
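To make the "parser pushes terms into the index" idea concrete, here is a minimal Python sketch. It is not Lemur's API (Lemur is a C++ toolkit, and its real parser and index classes are not shown here). The names ToyIndex, push, and parse_trec are hypothetical, and the sketch assumes the usual TREC markup in which DOC tags delimit each document and a DOCNO tag holds its identifier. Only text inside the fields listed for the TrecParser is indexed.

```python
import re
from collections import defaultdict

# Fields whose text gets indexed, as listed for the TrecParser.
INDEXED_FIELDS = ("TEXT", "HL", "HEAD", "HEADLINE", "TTL", "LP")

# Assumption: standard TREC markup with <DOC> ... </DOC> and <DOCNO> ... </DOCNO>.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL | re.IGNORECASE)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL | re.IGNORECASE)
FIELD_RE = re.compile(
    r"<(%s)>(.*?)</\1>" % "|".join(INDEXED_FIELDS), re.DOTALL | re.IGNORECASE
)
TERM_RE = re.compile(r"[A-Za-z0-9]+")


class ToyIndex:
    """Hypothetical in-memory inverted index: term -> {doc id: count}."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(int))

    def push(self, doc_id, term):
        self.postings[term][doc_id] += 1


def parse_trec(text, index):
    """Split text into documents and push each recognized term into index."""
    doc_ids = []
    for doc in DOC_RE.finditer(text):
        body = doc.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip documents that carry no identifier
        doc_id = docno.group(1)
        doc_ids.append(doc_id)
        for field in FIELD_RE.finditer(body):
            for term in TERM_RE.findall(field.group(2)):
                index.push(doc_id, term.lower())
    return doc_ids


sample = """<DOC>
<DOCNO> doc-1 </DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
<DOC>
<DOCNO> doc-2 </DOCNO>
<HEADLINE>Another document</HEADLINE>
<NOTE>ignored text</NOTE>
</DOC>"""

index = ToyIndex()
print(parse_trec(sample, index))          # ['doc-1', 'doc-2']
print(dict(index.postings["document"]))   # {'doc-1': 1, 'doc-2': 1}
print("ignored" in index.postings)        # False
```

Both documents live in one string, which mirrors the many-documents-per-file layout the bundled parsers were designed for. Text in the NOTE field never reaches the index because NOTE is not one of the recognized fields.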