How can I get htdig not to index some directories, but still follow links?
You can add the directory name to your robots.txt file or to the exclude_urls attribute in your configuration, but that excludes every file under that directory. If you want the files in the directory to be indexed, but not the directory listing itself, you have a couple of options. One is to add an index.html file to the directory that includes a robots meta tag (see question 4.15) to prevent the page itself from being indexed, and that contains links to all the files in the directory. The drawback is that you must maintain index.html yourself, as it won’t be updated automatically when new files are added to the directory.
The other technique, if you want the web server to generate the directory index, is to have the server insert the robots meta tag into the index page it generates. In Apache, this is done using the HeaderName and IndexOptions directives in the directory’s .htaccess file.
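As a rough sketch of the Apache approach, the .htaccess file might look something like the following. It assumes that directory listings (mod_autoindex) are already enabled for the directory and that the server's AllowOverride setting permits these directives in .htaccess. The file name HEADER.html is only an example.

```
# Use our own file as the top of the generated listing,
# in place of the preamble Apache would normally write.
IndexOptions +SuppressHTMLPreamble
HeaderName HEADER.html
```

With SuppressHTMLPreamble set, the header file has to supply the start of the HTML document itself, so that is where the robots meta tag can go. A minimal HEADER.html, assuming the "noindex, follow" form of the tag described in question 4.15, could be:

```
<html>
<head>
<title>Directory listing</title>
<meta name="robots" content="noindex, follow">
</head>
<body>
```

The same meta tag works in a hand-maintained index.html if you use the first technique instead. Check the mod_autoindex documentation for your Apache version, as the details of these directives have changed between releases.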