As we work towards a re-release of full text search on Open Library (peek), we’ve seen much more of the OCR output of our book scans. Depending on the text, the OCR can range from 99% perfect to 99% covered in gobbledygook. Hence my delight to see oldweather.org from the Zooniverse Project, where you and I can “help improve reconstructions of past weather and climate across the world by finding and recording historical weather observations in handwritten Royal Navy ship logs.”
Why computers can’t do the job from National Maritime Museum on Vimeo.
If you look at the site’s tutorial, you’ll see that there’s a transcription interface that’s specific to maritime log books. That’s exciting! Imagine tools like this for all sorts of other documents!
Old Weather – Weather and Events from The Zooniverse on Vimeo.
It’s fun to think about ways we might be able to encourage people to help correct bad OCR, in all sorts of documents. The Internet Archive has just scanned several years worth of the US Census, for example.
Here’s a page from the 1880 Los Angeles Census. Throw this at an OCR program, and it would throw it straight back. Possibly shredded.
And yet to our eyes, it’s a well-structured document, and relatively easy to read (once you’re used to the lovely old script). That’s what’s so clever about the Old Weather project: the team has isolated the structure of the log books, and designed the tool to help guide you to provide the information it needs within that structure.
We’re definitely looking towards Old Weather and the National Library of Australia‘s great Trove Newspapers site for inspiration (and collaboration?).
Always happy to find new collaborations – get in touch!
I’d really like to know what the Archive.org plans are for allowing corrections to OCR.
I’ve been working on using the OpenLibrary BookReader (fantastic tool and team, by the way) to enable transcription of some handwritten material hosted on the Internet Archive. I’d love to be able to update the OCR (which is currently garbage) with the results of that process so that the spiffy PDF and e-book convertors would pick up the transcriptions. But so far, it looks like OCR corrections are only done via re-upload of the book as a Project Gutenberg text. Are there plans to allow changes to the OCR for existing books after scans have been made?
(Incidentally, a good directory on manuscript transcription tools is Melissa Terras’ blog post “Crowdsourcing Manuscript Material”, if you’re looking for similar projects.
great progress thus far….congratulations!
This cross-written book is another example of something that defies OCR:
http://www.archive.org/stream/lettertoannewarr00west
(via TikiRobot: http://www.tikirobot.net/wp/2010/10/07/cross-writing/)
Pingback: Technology By Day » Why Computers Can't Do The Job « The Open Library Blog
Pingback: Archives Outside » Ahoy there! Oldweather.org uses Crowdsourcing to transcribe archives and support scientific and historical research
Pingback: Crowd-sourcing now and then « all things cataloged