How do I index Word, Excel, PowerPoint or PostScript documents?
This must be done with an external parser or converter. One sample external converter is the Perl script contrib/doc2html/doc2html.pl. It handles Word, PostScript, PDF and other documents when used with the appropriate document-to-text converters: it uses catdoc for Word documents and ps2ascii for PostScript files. The comments in the Perl script and its accompanying documentation indicate where you can obtain these converters.

Versions of htdig before 3.1.4 don’t support external converters, so with those versions you have to use an external parser script such as contrib/parse_doc.pl (or better yet, upgrade htdig if you can). External converter scripts are simpler to write and maintain than a full external parser, because they only convert input documents to text/plain or text/html and pass the result back to htdig to be parsed. Parsing is also more consistent across document types with external converters, because the final work is done by htdig’s internal parsers.
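To give a feel for how little an external converter has to do, here is a minimal sketch in Python. It is not the doc2html.pl script and is not part of the htdig distribution. It assumes that htdig invokes the converter with the input file name and the content type as the first two arguments, and that catdoc and ps2ascii are installed and write their text to standard output. The names CONVERTERS and build_command are made up for this illustration.

```python
#!/usr/bin/env python3
"""Sketch of an external converter: Word or PostScript in, plain text out."""
import subprocess
import sys

# Assumed mapping from MIME type to converter command.
# Adjust it to the converters actually installed on your system.
CONVERTERS = {
    "application/msword": ["catdoc"],
    "application/postscript": ["ps2ascii"],
}


def build_command(content_type, path):
    """Return the argument list of the converter that handles content_type."""
    # Drop parameters such as "; charset=..." and normalise the case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime not in CONVERTERS:
        raise ValueError("no converter configured for " + mime)
    return CONVERTERS[mime] + [path]


def main(argv):
    # Assumption: argv[1] is the input file, argv[2] is its content type.
    infile, content_type = argv[1], argv[2]
    try:
        command = build_command(content_type, infile)
    except ValueError as err:
        sys.stderr.write(str(err) + "\n")
        return 1
    # The child process inherits stdout, so the converted text goes
    # straight back to whoever called this script.
    return subprocess.call(command)


# Only run when the expected arguments are actually present.
if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(main(sys.argv))
```

A script like this would typically be registered through the external_parsers attribute in the htdig configuration, with an entry of the form application/msword->text/plain followed by the path to the script. Check the attribute documentation for your htdig version before relying on that syntax.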
External parser scripts tend to b