Doesn't a lossless compressor waste resources on coding random noise and (spelling) errors in enwik8, in contrast to lossy compressors like the human brain?
— Not really, for the following reasons:
• The test data is very clean. Misspelled words and grammatical errors are very rare, which is one reason why we chose this data set. See http://cs.fit.edu/~mmahoney/compression/textdata.html for an analysis of the data.
• Even if the corpus contained a lot of noise, lossless compression would still be the right way to go. One can show that among the shortest codes for a noisy data corpus there is a two-part code of length l(A)+l(B), where A contains all "useful" information and B contains all the noise. The theory behind this is called "algorithmic statistics"; in practice, two-part MDL is used: one encodes a probabilistic model M together with the negative log-likelihood of the data under it, CodeLength(x) = L(M) - log P(x|M) (see the sketch after this list). In short: noise does not harm the strong relation between compression and understanding/intelligence/predictability/etc.
• It is not always clear what counts as error/noise and what as useful data. Consider e.g. some keyboard-layout,
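As an illustration of the two-part code length CodeLength(x) = L(M) - log P(x|M), here is a minimal Python sketch (not part of any submitted compressor). It assumes a crude order-0 character model with fixed-precision parameter encoding; the function two_part_code_length, the per-parameter bit budget, and the 5% noise rate are hypothetical choices made only for this example.

  import math
  import random
  from collections import Counter

  def two_part_code_length(text, precision_bits=16):
      # Two-part MDL code length in bits: L(M) for a crude order-0
      # character model plus -log2 P(text | M) for the data itself.
      counts = Counter(text)
      total = sum(counts.values())
      # L(M): roughly precision_bits per quantized symbol probability
      # plus 8 bits to name each distinct symbol (a hedged estimate).
      model_bits = len(counts) * (precision_bits + 8)
      # -log2 P(text | M): Shannon code length of the data under M.
      data_bits = -sum(c * math.log2(c / total) for c in counts.values())
      return model_bits, data_bits

  clean = "the quick brown fox jumps over the lazy dog " * 50
  random.seed(0)
  # Corrupt about 5% of the characters with random letters ("noise").
  noisy = "".join(ch if random.random() > 0.05
                  else random.choice("abcdefghijklmnopqrstuvwxyz")
                  for ch in clean)

  for name, text in [("clean", clean), ("noisy", noisy)]:
      m, d = two_part_code_length(text)
      print(f"{name}: L(M) = {m:.0f} bits, -log P(x|M) = {d:.0f} bits")

On this toy data the model part L(M) is the same for the clean and the noisy string, while only the -log P(x|M) part grows. The noise is paid for once in the data part; it does not corrupt the "useful" regularities captured by the model, which is the point of the second bullet above.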