How can I index file formats like OpenDocument (aka OpenOffice.org), RTF, Microsoft Word, Excel, PowerPoint, Visio, etc?
Have a look at Apache Tika, a content analysis toolkit that extracts text and metadata from many of these formats.
Alternatively, many modern office file formats (.odt, .sxw, .sxc, etc.) are ZIP archives that contain XML files. You can uncompress the file using Java's ZIP support, then parse meta.xml to get properties such as the title, and content.xml to get the document's content. You can then add these to the Lucene index, typically using one Lucene field per property.
You can also use the LIUS framework for indexing OpenOffice.org documents (http://www.bibl.ulaval.ca/lius/). LIUS supports metadata and full-text indexing using XPath.
For MS Word, MS Excel, MS Visio, and MS PowerPoint, you might also want to take a look at Apache POI. For RTF, Lucene in Action contains an example of how to extract text using the Swing RTFEditorKit class.
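The ZIP-plus-XML approach described above can be sketched roughly as follows, using only the JDK. The class name OdfTextExtractor and the method extractText are made up for this illustration and are not part of Lucene, Tika, or LIUS. The sketch simply concatenates all text nodes of the chosen entry, so paragraphs run together; a real extractor would probably walk the elements and insert separators.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, named for this example only.
public class OdfTextExtractor {

    /**
     * Returns the concatenated text of the named XML entry (for example
     * "meta.xml" or "content.xml"), or null if the archive has no such entry.
     */
    public static String extractText(InputStream archive, String entryName) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals(entryName)) {
                    continue;
                }
                // Copy the entry into memory first, so the XML parser
                // never gets a chance to close the underlying ZIP stream.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int read;
                while ((read = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, read);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buffer.toByteArray()));
                // getTextContent() joins every descendant text node with no separator.
                return doc.getDocumentElement().getTextContent().trim();
            }
        }
        return null;
    }
}
```

You would call extractText once per entry (opening the file anew each time, since the stream is consumed) and add each returned string to the Lucene document as its own field. The exact field classes to use depend on your Lucene version.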