As we work towards a re-release of full text search on Open Library (peek), we’ve seen much more of the OCR output of our book scans. Depending on the text, the OCR can range from 99% perfect to 99% covered in gobbledygook. Hence my delight to see oldweather.org from the Zooniverse Project, where you and I can “help improve reconstructions of past weather and climate across the world by finding and recording historical weather observations in handwritten Royal Navy ship logs.”
If you look at the site’s tutorial, you’ll see that there’s a transcription interface that’s specific to maritime log books. That’s exciting! Imagine tools like this for all sorts of other documents!
It’s fun to think about ways we might encourage people to help correct bad OCR in all sorts of documents. The Internet Archive has just scanned several years’ worth of the US Census, for example.
Here’s a page from the 1880 Los Angeles Census. Throw this at an OCR program, and it would throw it straight back. Possibly shredded.
And yet to our eyes, it’s a well-structured document, and relatively easy to read (once you’re used to the lovely old script). That’s what’s so clever about the Old Weather project: the team has isolated the structure of the log books, and designed the tool to guide you in providing the information it needs within that structure.
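To make that idea a bit more concrete, here’s a rough sketch of what "structure first, transcription second" might look like in code. Everything in it is an assumption for illustration: the field names, the prompts, and the validation rules are made up, and this is not how Old Weather itself is built. The point is only that once you know which fields a page should contain, the tool can ask for each one and flag anything that doesn’t fit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Field:
    """One piece of information the tool asks the transcriber for (hypothetical)."""
    name: str
    prompt: str
    is_valid: Callable[[str], bool]


def _is_number(text: str) -> bool:
    """True if the text parses as a number, e.g. a barometer reading."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Illustrative only: a simplified set of compass points.
COMPASS_POINTS = {"N", "NE", "E", "SE", "S", "SW", "W", "NW"}

# Hypothetical structure for one log book entry; not the project's real schema.
LOG_ENTRY_FIELDS = [
    Field("wind_direction", "Wind direction (e.g. NW)",
          lambda t: t.upper() in COMPASS_POINTS),
    Field("wind_force", "Wind force (0-12)",
          lambda t: t.isdigit() and 0 <= int(t) <= 12),
    Field("barometer", "Barometer reading", _is_number),
]


def check_transcription(fields: List[Field], answers: Dict[str, str]) -> List[str]:
    """Return a list of problems: fields left blank or values that don't fit."""
    problems = []
    for field in fields:
        value = answers.get(field.name, "").strip()
        if not value:
            problems.append(f"{field.name}: missing")
        elif not field.is_valid(value):
            problems.append(f"{field.name}: unexpected value {value!r}")
    return problems
```

Calling `check_transcription(LOG_ENTRY_FIELDS, answers)` with a volunteer’s answers returns an empty list when every field is filled in plausibly, and a list of short messages otherwise. A census page would need a different list of fields, but the same approach could apply.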