
How can I index file formats like OpenDocument (aka OpenOffice.org), RTF, Microsoft Word, Excel, PowerPoint, Visio, etc.?

April 26, 2017


1 Answer


Have a look at Apache Tika, the content analysis toolkit.

Alternatively, you can extract the text yourself. OpenDocument and OpenOffice.org files (.odt, .sxw, .sxc, etc.) are ZIP archives that contain XML files. You can open the archive using Java's ZIP support, then parse, for example, meta.xml to get the title and content.xml to get the document's content. You can then add these values to the Lucene index, typically using one Lucene field per property.

You can also use the LIUS framework for indexing OpenOffice.org documents (http://www.bibl.ulaval.ca/lius/). LIUS supports metadata and full-text indexing using XPath.

For MS Word, MS Excel, MS Visio, and MS PowerPoint, you might also want to take a look at Apache POI. For RTF, Lucene in Action contains an example of how to extract text using the Swing RTFEditorKit class.
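To make the ZIP-plus-XML approach concrete, here is a minimal sketch using only the Java standard library. The class name `OdfTextExtractor`, the method names, and the "title"/"content" map keys are made up for illustration, and the code assumes the archive contains meta.xml with a Dublin Core `dc:title` element and content.xml with paragraph (`p`) and heading (`h`) elements, as OpenDocument files normally do. It stops at returning the extracted strings; turning each map entry into a Lucene field is left out because that API differs between Lucene versions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: pulls the title and body text out of an ODF-style ZIP archive. */
public class OdfTextExtractor {

    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns a map with the (illustrative) keys "title" and "content". */
    public static Map<String, String> extract(Path file)
            throws IOException, ParserConfigurationException, SAXException {
        Map<String, String> fields = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(file.toFile())) {
            // meta.xml holds document properties such as the title.
            Document meta = parseEntry(zip, "meta.xml");
            String title = "";
            if (meta != null) {
                NodeList titles = meta.getElementsByTagNameNS(DC_NS, "title");
                if (titles.getLength() > 0) {
                    title = titles.item(0).getTextContent().trim();
                }
            }
            fields.put("title", title);

            // content.xml holds the document body.
            Document content = parseEntry(zip, "content.xml");
            StringBuilder text = new StringBuilder();
            if (content != null) {
                collectText(content.getDocumentElement(), text);
            }
            fields.put("content", text.toString().trim());
        }
        return fields;
    }

    /** Parses one XML entry of the archive, or returns null if the entry is missing. */
    private static Document parseEntry(ZipFile zip, String name)
            throws IOException, ParserConfigurationException, SAXException {
        ZipEntry entry = zip.getEntry(name);
        if (entry == null) {
            return null;
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Older files may declare a DTD that is not available locally;
        // resolve every external entity to an empty stream instead of fetching it.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        try (InputStream in = zip.getInputStream(entry)) {
            return builder.parse(in);
        }
    }

    /** Appends all text under the node, adding a newline after each paragraph or heading. */
    private static void collectText(Node node, StringBuilder out) {
        if (node instanceof Text) {
            out.append(node.getNodeValue());
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
        String local = node.getLocalName();
        if ("p".equals(local) || "h".equals(local)) {
            out.append('\n');
        }
    }
}
```

Calling `OdfTextExtractor.extract(Paths.get("report.odt"))` (the file name is only an example) would give you the two strings to index. This is a rough extraction: it ignores styles, tables, and embedded objects, so for production use Tika is likely the more robust choice.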