What types of mistakes do OCR packages typically make?
Each text has its own peculiarities, but a number of well-known scanning errors turn up all the time. Punctuation is always a problem. Periods, commas and semi-colons are often confused, as are colons and semi-colons. There are also usually a number of extra or missing spaces in the e-text. Quotes can assume nightmarish proportions in a text which contains a lot of dialog, particularly when single and double quotes are nested. The numeral 1, the lower-case letter l, the exclamation mark ! and the capital I are routinely confused, and single or double quotes are often mistaken for one of these. Lower-case m is often mistaken for rn or ni. The letter pairs h/b and e/c are commonly misread, and these are probably the hardest of all to catch, since ear/car, eat/cat, he/be, hear/bear and heard/beard are all pairs of common words, so no spell-checker will flag them as problems. For example: ” Hello1′ caIled jirnmy breczily. 11Anyone home ? ” There seemed to
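Some of the error types described above can be hunted for mechanically before a human read-through. The following is only a rough sketch in Python, not a tool this FAQ prescribes: the function names, the `CONFUSABLE_WORDS` set, the `SUBSTITUTIONS` list and the `known_words` parameter are all assumptions made for illustration, and every hit still needs a person to decide whether it is a real error.

```python
import re

# Hypothetical watch-list: real words that differ from another real word
# only by an h/b or e/c swap, so a spell-checker will never flag them.
CONFUSABLE_WORDS = {"ear", "car", "eat", "cat", "he", "be",
                    "hear", "bear", "heard", "beard"}

# (what the OCR may have printed, what the page may really say)
SUBSTITUTIONS = [("rn", "m"), ("ni", "m"), ("c", "e"),
                 ("e", "c"), ("b", "h"), ("h", "b")]


def single_swaps(word, wrong, right):
    """Yield each word formed by replacing one occurrence of `wrong`."""
    start = word.find(wrong)
    while start != -1:
        yield word[:start] + right + word[start + len(wrong):]
        start = word.find(wrong, start + 1)


def guess_misread(word, known_words):
    """Return a label if one swap turns `word` into a known word."""
    for wrong, right in SUBSTITUTIONS:
        for candidate in single_swaps(word, wrong, right):
            if candidate in known_words:
                return f"possible {right} misread as {wrong}"
    return None


def find_ocr_suspects(text, known_words=frozenset()):
    """Return (reason, token) pairs that deserve a human look."""
    suspects = []
    # Stray space before punctuation, e.g. "home ?"
    for match in re.finditer(r"\w+ +[?!;:,.]", text):
        suspects.append(("space before punctuation", match.group()))
    for token in re.findall(r"[A-Za-z0-9]+", text):
        lower = token.lower()
        if re.search(r"[A-Za-z]", token) and re.search(r"\d", token):
            # 1 standing in for l, I, ! or a quote mark
            suspects.append(("digit mixed with letters", token))
        elif re.search(r"[a-z]I", token):
            suspects.append(("capital I after lower-case letter", token))
        elif lower in CONFUSABLE_WORDS:
            suspects.append(("h/b or e/c confusable word", token))
        elif known_words and lower not in known_words:
            label = guess_misread(lower, known_words)
            if label:
                suspects.append((label, token))
    return suspects


if __name__ == "__main__":
    sample = "Hello1' caIled jirnmy breczily. 11Anyone home ? "
    words = {"hello", "called", "jimmy", "breezily", "anyone", "home"}
    for reason, token in find_ocr_suspects(sample, words):
        print(f"{token!r}: {reason}")
```

The confusable-word check is deliberately noisy: it reports every "he", "be", "ear" or "car", because only the context can show which one the printed page had. The swap check is only as good as the word list passed in as `known_words`, which here is a tiny made-up set.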