Doesn't a lossless compressor waste resources on coding random noise and (spelling) errors in enwik8, in contrast to lossy compressors like the human brain?
— Not really, for the following reasons:
• The test data is very clean. Misspelled words and grammatical errors are very rare, which is one reason why we chose this data set. See http://cs.fit.edu/~mmahoney/compression/textdata.html for an analysis of the data.
• Even if the corpus contained a lot of noise, lossless compression would still be the right way to go. One can show that among the shortest codes for a noisy data corpus there is a two-part code of length l(A)+l(B), where A contains all "useful" information and B contains all the noise. The theory behind this is called "algorithmic statistics"; in practice, two-part MDL is used: one encodes a probabilistic model M together with the negative log-likelihood of the data under it, CodeLength(x) = L(M) - log P(x|M) (see the sketch after this list). In short: noise does not harm the strong relation between compression and understanding/intelligence/predictability/etc.
• It is not always clear what counts as error/noise and what as useful data. Consider e.g. some keyboard-layout,
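As an illustration of the two-part code length CodeLength(x) = L(M) - log P(x|M), here is a minimal Python sketch (not part of any submitted compressor). It assumes a crude order-0 character model with fixed-precision parameter encoding; the function two_part_code_length, the per-parameter bit budget, and the 5% noise rate are hypothetical choices made only for this example.

  import math
  import random
  from collections import Counter

  def two_part_code_length(text, precision_bits=16):
      # Two-part MDL code length in bits: L(M) for a crude order-0
      # character model plus -log2 P(text | M) for the data itself.
      counts = Counter(text)
      total = sum(counts.values())
      # L(M): roughly precision_bits per quantized symbol probability
      # plus 8 bits to name each distinct symbol (a hedged estimate).
      model_bits = len(counts) * (precision_bits + 8)
      # -log2 P(text | M): Shannon code length of the data under M.
      data_bits = -sum(c * math.log2(c / total) for c in counts.values())
      return model_bits, data_bits

  clean = "the quick brown fox jumps over the lazy dog " * 50
  random.seed(0)
  # Corrupt about 5% of the characters with random letters ("noise").
  noisy = "".join(ch if random.random() > 0.05
                  else random.choice("abcdefghijklmnopqrstuvwxyz")
                  for ch in clean)

  for name, text in [("clean", clean), ("noisy", noisy)]:
      m, d = two_part_code_length(text)
      print(f"{name}: L(M) = {m:.0f} bits, -log P(x|M) = {d:.0f} bits")

On this toy data the model part L(M) is the same for the clean and the noisy string, while only the -log P(x|M) part grows. The noise is paid for once in the data part; it does not corrupt the "useful" regularities captured by the model, which is the point of the second bullet above.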