Why is NSP so slow on large files of text?

As of version 0.61 (and earlier), NSP does all of its counting with a Perl hash: each Ngram is a hash key, and a counter is kept for that Ngram. This works very nicely, but once you start dealing with millions of words of text the hash can grow very large and consume quite a bit of memory. Remember, the crucial constraint is not how many words are in the corpus, but how many unique Ngrams the text contains. As N gets larger, the number of hash entries grows larger and larger. If you are only counting unigrams, however, NSP can process extremely large files (50 million words) quite efficiently.

To start with, I would recommend gradually increasing both the value of N and the amount of text you ask NSP to process, just to get a sense of how long things take on your system. Start by processing unigrams in a 1,000,000 word corpus, then try bigrams. Then move up to 5,000,000 words with unigrams, and so forth.
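
For illustration, here is a minimal Perl sketch of the hash-based counting idea described above. It is not the actual NSP source, just an assumed simplification: every distinct bigram becomes a hash key, so memory grows with the number of unique Ngrams, not with the raw word count. The '<>' separator follows the convention NSP uses to join Ngram tokens; the file and variable names are made up for the example.

#!/usr/bin/perl
# Sketch of hash-based bigram counting (illustrative, not NSP itself).
use strict;
use warnings;

my %count;    # one hash entry per unique bigram -- this is what eats memory
my @window;   # sliding window of the last N tokens (N = 2 here)

while (my $line = <STDIN>) {
    for my $token (split /\s+/, $line) {
        next unless length $token;          # skip empty fields from leading whitespace
        push @window, $token;
        shift @window if @window > 2;       # keep only the most recent two tokens
        $count{ join '<>', @window }++ if @window == 2;
    }
}

# Print bigrams by descending frequency.
for my $ngram (sort { $count{$b} <=> $count{$a} } keys %count) {
    print "$ngram\t$count{$ngram}\n";
}

You could run it as "perl count_bigrams.pl < corpus.txt" (hypothetical names). As you move from unigrams to bigrams to trigrams, the number of distinct keys, and therefore the size of %count, grows much faster than the corpus itself, which is why larger values of N slow NSP down.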
