I’ve never been particularly good at reading handwriting, cursive especially. When I saw a letter from Alexander Hamilton at Loyola’s University Archives and Special Collections during the Chicago Open Archives weekend, I found the historicity of the document fascinating, but my eyes skimmed across the text without my putting in the deliberate effort to read what A.Ham was writing. I don’t think I’m out of line to say that old texts are hard. Between unfamiliar words and sentence structures, and, in the case of some authors, extremely liberal approaches to spelling, historical documents often present real challenges in the very basic task of reading them.
I say all this because I have a lot of sympathy for the problem of computer text recognition. When I ran a simple test of Google Books’s OCR on a 1782 edition of Don Quixote, the surprise was not that the text came out somewhat garbled. Instead, I was struck by how much the software had to accomplish to get as far as it did. This is a document with two columns, numerous small marks and splotches, and unconventional typographic standards in how it handles long quotations. Notably, the document also makes extensive use of the long s, the character that looks like an f in many older texts (and which occasionally serves as a source of inadvertent humor or inappropriateness).
Google’s conversion to plaintext handles this fairly adroitly, even if the end result is still pretty messy. Notably, it accounts for the long s very well and properly scans a full column at a time. If the standard is perfection, the results fall far short, but I’d argue that for some documents it’s easier to read plaintext with a bit of stray punctuation or the occasional ‘j’ instead of ‘i’ than to work through the original text in its image form. The bigger problems arise when doing larger data analytics…
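To give a sense of what handling the long s involves, here is a minimal sketch of post-processing OCR output from eighteenth-century printing. This is my own illustration, not Google’s actual pipeline; the specific substitution table is an assumption about typical confusions.

```python
# Illustrative sketch (not Google's pipeline): mapping archaic glyphs
# from 18th-century type to their modern equivalents after OCR.

LONG_S = "\u017f"  # 'ſ', the long s that resembles an 'f'

def normalize_ocr(text: str) -> str:
    """Replace the long s and common typographic ligatures."""
    replacements = {
        LONG_S: "s",      # 'ſ' -> 's'
        "\ufb00": "ff",   # 'ﬀ' ligature
        "\ufb01": "fi",   # 'ﬁ' ligature
        "\ufb02": "fl",   # 'ﬂ' ligature
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

print(normalize_ocr("a moſt ſucceſsful knight"))  # -> "a most successful knight"
```

A table like this only covers glyphs that the OCR engine has already recognized correctly; the harder cases are where ‘ſ’ is misread as ‘f’ in the first place, which requires dictionary or language-model correction rather than simple substitution.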
In playing around with Google’s ngram viewer, I discovered an outlier: I doubt that ‘blog’ had much reason to show up in texts in 1909. In researching why, I found some questionable metadata (including an unauthorized biography of Taylor Swift purportedly from 1906), but also a much more common culprit: the abbreviation ‘bldg.’ misread as ‘blog.’ This gets at how OCR’s usefulness depends on its purpose. If I were reading a catalog and saw the word in context, I could easily spot the error. And if the goal is to produce a readable digital document, Google Books is making real strides. But the fidelity of its work isn’t quite fine enough to produce clean data for analytics, which leads to these oddities at the very fringes of its operation (and calls into question the accuracy of its other, similar statistics).
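The ‘bldg.’/‘blog’ mix-up suggests one cheap sanity check for this kind of corpus: flag words that could not plausibly appear in a text of a given date. The sketch below is purely hypothetical; the word list and coinage years are my own rough assumptions, not data from the ngram corpus.

```python
# Hypothetical sketch: flagging likely OCR errors by publication date.
# The coinage years below are rough assumptions for illustration only.

ANACHRONISMS = {
    "blog": 1999,      # term coined in the late 1990s
    "internet": 1970,
    "email": 1970,
}

def flag_anachronisms(tokens, year):
    """Return tokens unlikely to appear in a text from the given year."""
    return [t for t in tokens if ANACHRONISMS.get(t.lower(), 0) > year]

print(flag_anachronisms(["first", "natl.", "blog", "assn."], 1909))
# -> ['blog']
```

A flagged token like ‘blog’ in an ostensibly 1909 catalog is then a candidate for re-examination against the page image, where ‘bldg.’ would be the obvious reading.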
Really, the worst damage in the Don Quixote text comes from the quotation marks repeated down the side of a long block quote, as well as from the footnotes, which contain more mangled text. What this shows is that the conversion performs strongly at picking up characters and words; the bigger issue is not character recognition but formatting. The goal of the conversion seems to be to preserve as many of the specific words and characters as possible, but it largely fails to capture the placement of the text (the columns, the footnotes, and so on), which constitutes information and conveys meaning in its own right.
Whether this kind of text recognition is successful depends a lot on your metrics for success. What’s the goal of the project? Being able to search a database of documents, being able to read the text, and being able to operate on the data are all perfectly valid reasons to try to convert book images into plaintext. But each goal has its own set of challenges and best practices. At the moment, Google Books does a little bit of everything and suffers in some of those realms as a result. But it’s much easier to pick out flaws than to recognize that the technology here is performing a complex and rather advanced service.