What are the practical and/or theoretical limits of ht://Dig?
The code itself doesn’t put any real limit on the number of pages; there are several sites indexing hundreds of thousands of pages. The practical limits depend largely on how many pages you plan to index. Some operating systems limit files to 2 GB, which can become a problem with a large database.

Each of the programs also has slightly different limits. Right now htmerge performs a sort on the indexed words, and most sort programs use a fair amount of RAM and temporary disk space as they assemble the sorted list. The htdig program stores a fair amount of information about the URLs it visits, in part so that each page is indexed only once, and this too takes a fair amount of RAM. With RAM as cheap as it is, it never hurts to throw more memory at indexing larger sites. In a pinch, swap will work, but it obviously slows things down considerably.

The 3.2 development code helps with many of these limitations. In particular, it generates the databases on the fly, which means you don’t have to sort them before searching.
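For illustration only, here is a minimal C++ sketch (not ht://Dig's actual code, and the URLs are made up) of the kind of per-URL bookkeeping htdig has to keep so that each page is indexed only once. The visited set grows with the number of pages crawled, which is why RAM use scales with the size of the site:

    #include <iostream>
    #include <set>
    #include <string>

    int main()
    {
        // One entry per URL ever seen; this is what grows with the site.
        std::set<std::string> visited;

        const char* urls[] = {
            "http://www.example.com/",
            "http://www.example.com/docs/",
            "http://www.example.com/",   // a link back to a page already seen
        };

        for (const std::string url : urls)
        {
            // insert().second is false when the URL is already in the set,
            // so each page is fetched and indexed only once.
            if (visited.insert(url).second)
                std::cout << "indexing " << url << '\n';
            else
                std::cout << "skipping " << url << " (already indexed)\n";
        }

        // Every URL stays in 'visited' for the whole run, so memory use is
        // roughly proportional to the number of pages crawled.
        return 0;
    }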