How accurate should OCR be?
OCR packages commonly claim to be “99%+” accurate, or something like that. Let’s analyze what that actually means. Say there are 1,000 characters on each page: with 99.9% accuracy, you would expect to make about 1 correction per page; with 99% accuracy, about 10 corrections per page. In a 400-page book, this adds up to roughly 400 corrections in the first case and 4,000 in the second.

But there’s a “Your Mileage May Vary” clause built into that. Typically, the manufacturers test their OCR on fresh, laser-printed or press-printed copy with perfect scans. This is fair, since they aim their products primarily at businesses that process these kinds of materials. You are not dealing with fresh print; you’re dealing with old books that are yellowed, spotted, marked, imperfectly printed in the first place, and possibly set in unfamiliar fonts. And it’s unlikely that you will have the patience to get a perfect scan of every page. The result is that the accuracy of OCR for typical PG work doesn’t match the accuracy