What kind of data/documents can Lemur index?
You can create your own parser for whatever text documents you have, as long as the parser takes whatever it wants to recognize as a term and “pushes” it into the index. The toolkit also provides several parsers. Lemur is primarily a research system, so the included parsers were designed to make it easy to index many documents stored in the same file. For the index to know where the document boundaries fall within a file, each document must be enclosed in begin-document and end-document tags. These tags resemble HTML or XML tags and follow the format used for NIST’s Text REtrieval Conference (TREC) documents. The two most frequently used parsers are the TrecParser and the WebParser.

TrecParser: This parser recognizes text in the TEXT, HL, HEAD, HEADLINE, TTL, and LP fields. For example: