Why Computers Can't Do The Job

As we work towards a re-release of full text search on Open Library (peek), we’ve seen much more of the OCR output of our book scans. Depending on the text, the OCR can range from 99% perfect to 99% covered in gobbledygook. Hence my delight to see oldweather.org from the Zooniverse Project, where you and I can “help improve reconstructions of past weather and climate across the world by finding and recording historical weather observations in handwritten Royal Navy ship logs.”

Why computers can’t do the job from National Maritime Museum on Vimeo.

If you look at the site’s tutorial, you’ll see that there’s a transcription interface that’s specific to maritime log books. That’s exciting! Imagine tools like this for all sorts of other documents!

Old Weather – Weather and Events from The Zooniverse on Vimeo.

It’s fun to think about ways we might be able to encourage people to help correct bad OCR in all sorts of documents. The Internet Archive has just scanned several years’ worth of the US Census, for example.

Here’s a page from the 1880 Los Angeles Census. Throw this at an OCR program, and it would throw it straight back. Possibly shredded.

US Census

And yet to our eyes, it’s a well-structured document, and relatively easy to read (once you’re used to the lovely old script). That’s what’s so clever about the Old Weather project: the team has isolated the structure of the log books, and designed the tool to help guide you to provide the information it needs within that structure.

We’re definitely looking towards Old Weather and the National Library of Australia’s great Trove Newspapers site for inspiration (and collaboration?).


4 Comments

  1. Posted October 12, 2010 at 9:55 am | Permalink

    Always happy to find new collaborations – get in touch!

  2. Posted October 12, 2010 at 12:30 pm | Permalink

    I’d really like to know what the Archive.org plans are for allowing corrections to OCR.

    I’ve been working on using the OpenLibrary BookReader (fantastic tool and team, by the way) to enable transcription of some handwritten material hosted on the Internet Archive. I’d love to be able to update the OCR (which is currently garbage) with the results of that process so that the spiffy PDF and e-book convertors would pick up the transcriptions. But so far, it looks like OCR corrections are only done via re-upload of the book as a Project Gutenberg text. Are there plans to allow changes to the OCR for existing books after scans have been made?

    (Incidentally, a good directory of manuscript transcription tools is Melissa Terras’ blog post “Crowdsourcing Manuscript Material”, if you’re looking for similar projects.)

  3. jezzabelly
    Posted October 12, 2010 at 12:49 pm | Permalink

    great progress thus far….congratulations!

  4. Posted October 12, 2010 at 1:20 pm | Permalink

    This cross-written book is another example of something that defies OCR:
    http://www.archive.org/stream/lettertoannewarr00west
    (via TikiRobot: http://www.tikirobot.net/wp/2010/10/07/cross-writing/)

