What are the practical and/or theoretical limits of ht://Dig?
The code itself doesn’t put any real limit on the number of pages; there are several sites indexing hundreds of thousands of pages. The practical limits depend largely on how many pages you plan to index. Some operating systems limit files to 2 GB, which can become a problem with a large database.

Each of the programs also has slightly different limits. Right now htmerge performs a sort on the indexed words, and most sort programs use a fair amount of RAM and temporary disk space as they assemble the sorted list. The htdig program stores a fair amount of information about the URLs it visits, in part so that each page is indexed only once, and this too takes a fair amount of RAM. With RAM as cheap as it is, it never hurts to throw more memory at indexing larger sites. In a pinch, swap will work, but it obviously slows things down considerably.

The 3.2 development code helps with many of these limitations. In particular, it generates the databases on the fly, which means you don’t have to sort them before searching.
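For illustration only, here is a minimal C++ sketch (not ht://Dig's actual code, and the URLs are made up) of the kind of per-URL bookkeeping htdig has to keep so that each page is indexed only once. The visited set grows with the number of pages crawled, which is why RAM use scales with the size of the site:

    #include <iostream>
    #include <set>
    #include <string>

    int main()
    {
        // One entry per URL ever seen; this is what grows with the site.
        std::set<std::string> visited;

        const char* urls[] = {
            "http://www.example.com/",
            "http://www.example.com/docs/",
            "http://www.example.com/",   // a link back to a page already seen
        };

        for (const std::string url : urls)
        {
            // insert().second is false when the URL is already in the set,
            // so each page is fetched and indexed only once.
            if (visited.insert(url).second)
                std::cout << "indexing " << url << '\n';
            else
                std::cout << "skipping " << url << " (already indexed)\n";
        }

        // Every URL stays in 'visited' for the whole run, so memory use is
        // roughly proportional to the number of pages crawled.
        return 0;
    }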