Towards better EPUBs at Open Library and the Internet Archive


You may have read about our recent downtime. We thought it might be a good opportunity to let you know about some of the other behind-the-scenes things going on here. We continue to answer email, keep the FAQ updated and improve our metadata. Many of you have written about the quality of some of our EPUBs. As you may know, all of our OCR (optical character recognition) is done automatically without manual corrections, and while it’s pretty good, it could be better. Specifically, we had a pernicious bug where some books’ formatting caused the first page of chapters to be missing from their OCRed EPUBs. I personally had this happen to me with a series of books I was reading on Open Library and I know it’s beyond frustrating.

To address this and other scanning quality issues, we’re changing the way EPUBs work. We’ve improved our OCR algorithm and we’re shifting from stored EPUB files to on-the-fly generation. This means that further developments and improvements in our OCR capabilities will be available immediately. This is good news and has the side benefit of radically decreasing our EPUB storage needs. It also means that we have to:

  • remove all of our old EPUBs (approximately eight million items for EPUBs generated by the Archive)
  • put the new on-the-fly EPUB generation in place (now active)
  • do some testing to make sure it’s working as expected (in process)

We hope that this addresses some of the EPUB errors people have been finding. Please continue to give us feedback on how this is working for you. Coming soon: improvements to Open Library’s search features!

Posted in Code/API | Comments closed

Not just scanning – Thoreau’s Cape Cod

It makes no odds what it is you carry, so long as you carry the truth along with you. – intro to 1893 edition

There are many good responses to “Why do we still have libraries when everything is online?” My favorite one has to do with the importance of finding people to curate and sort and sift through the enormous bulk of online material to create knowledge and wisdom from what is merely data. Small projects which do not scale. Henry David Thoreau went to Cape Cod in the mid-1800s and wrote about the experience. His writings on Cape Cod were published in 1865 and reprinted many times after that. The text can be found in any number of places, but actually flipping through the books reveals a lot more about the cultural history of this book and the text it contains. The covers alone are lovely to look at.

cover of Cape Cod featuring windmill

Cover featuring the Eastham Windmill

cover of Cape Cod featuring cranberry motif

Cover featuring cranberry motif

Looking through the many copies Open Library has, there’s a lot of marginalia and other interesting things to peek at. One version appears to have been purchased for a dollar while another may have cost upwards of thirty.

image from inside of Cape Cod book

The book was frequently given to libraries as a gift. Sometimes by people you may have heard of.

bookplate of Frederick Law Olmsted Jr.

Some of these versions have beautiful and unusual illustrations and some have photographs.

a Cape Cod citizen

first page with flower water color

Some have illustrations nearly obliterated by low-quality scanning (not ours).

low quality scan

And some have little mysteries. What does “By transfer The White House” mean? What did the War Department think of this book?

by transfer, the White House

front page of book with War Department stamp

All of these are aspects of the book–one work, many editions–that surface through close inspection, with human eyes.

The Concord, MA library has scanned, assembled and annotated a set of images of Thoreau’s surveys, another wonderfully curated set of digitized ephemera that helps us understand our world.

Posted in thingsinbooks | Comments closed

25,000 emails in three years

25,000

A slightly more personal note here… it’s been a little over three years since I started working at Open Library and just this past week we hit a milestone of 25,000 emails sent. That’s slightly lower than the number of emails we get because some are just saying “Thank you!” and some we forward to other departments and yes, a few are spam. But the rest–the tech support, the early book returns, the reference questions, the merge requests–have been answered by me and Michelle and Laurel.

It’s been very gratifying to help keep Open Library’s ebook lending library open and thriving, and very interesting to watch the ebook environment changing around us since we first opened in a much more limited fashion in 2005. Here’s to ten more years of free ebook lending and a continually improving ebook reader experience!

screenshot of the first Open Library page

Posted in News | Comments closed

February 1-5 is #ColorOurCollections Week

There are a lot of neat public domain images in our collections. We’ve highlighted them in the past and continue to encourage people to use, remix and share our content. This week for the #ColorOurCollections event, we’ve pulled out some especially colorable images and made them into PDFs that you can print out and color. We’ve created a few pairs of images we think you’ll like. Here are the images and links to the books where you can find and download even more. If you just want to download a zip file of all eight images, click here.

apollos_genii nuptial_bath

punkah  mandan

greek_costume2 greek_costume

papilio cicada


Posted in News | Comments closed

It’s Aaron Swartz Day – Here’s an Open Library status report

Next week is the third annual Aaron Swartz Day (2013, 2014), a celebration and hackathon that takes place at the Internet Archive on November 7th. Please consider joining us. More information about this year’s events can be found here. We have a lot of good news on our end.

Image from page 323 of "The bicycling world" (1881)

My name is Jessamyn and I’ve been working for Open Library for the past few years after being inspired by Aaron Swartz Day 2013. I work with Giovanni Damiola and Michelle Krasowski and many of the other wonderful people at the Archive to keep this valuable resource up and running.

Posted in Uncategorized | Comments closed