Towards better EPUBs at Open Library and the Internet Archive

You may have read about our recent downtime. We thought it would be a good opportunity to tell you about some of the other behind-the-scenes work going on here. We continue to answer email, keep the FAQ updated and improve our metadata. Many of you have written to us about the quality of some of our EPUBs. As you may know, all of our OCR (optical character recognition) is done automatically, without manual correction, and while it’s pretty good, it could be better. In particular, we had a pernicious bug where the formatting of some books caused the first page of chapters to be left out of the OCRed EPUB. This happened to me with a series of books I was reading on Open Library, and I know it’s beyond frustrating.

To address this and other scanning quality issues, we’re changing the way EPUBs work. We’ve improved our OCR algorithm and we’re shifting from stored EPUB files to on-the-fly generation. This means that future improvements to our OCR will show up in EPUBs immediately. This is good news, and it has the side benefit of radically decreasing our EPUB storage needs. It also means that we have to:

  • remove all of our old EPUBs (approximately eight million items whose EPUBs were generated by the Archive)
  • put the new on-the-fly EPUB generation in place (now active)
  • do some testing to make sure it’s working as expected (in progress)
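
To illustrate why on-the-fly generation picks up improvements right away while stored files do not, here is a toy sketch in Python. It is not the Archive's actual pipeline: the class names, the two builder functions and the plain-text output are all hypothetical stand-ins, and the "old" builder simply drops the first page to mimic a bug.

```python
from typing import Callable, Dict, List

# A "builder" turns a book's OCRed pages into output text.
# Real EPUBs are zip archives; plain text keeps this sketch short.
Builder = Callable[[List[str]], str]


def build_old(pages: List[str]) -> str:
    # Hypothetical older builder with a bug: the first page is dropped.
    return "\n".join(pages[1:])


def build_new(pages: List[str]) -> str:
    # Hypothetical improved builder: every page is kept.
    return "\n".join(pages)


class StoredEpubs:
    """Builds the output once at ingest time and keeps the result."""

    def __init__(self, builder: Builder) -> None:
        self.builder = builder
        self._files: Dict[str, str] = {}

    def ingest(self, book_id: str, pages: List[str]) -> None:
        self._files[book_id] = self.builder(pages)

    def get(self, book_id: str) -> str:
        return self._files[book_id]


class OnTheFlyEpubs:
    """Keeps only the source pages and builds the output on each request."""

    def __init__(self, builder: Builder) -> None:
        self.builder = builder
        self._pages: Dict[str, List[str]] = {}

    def ingest(self, book_id: str, pages: List[str]) -> None:
        self._pages[book_id] = list(pages)

    def get(self, book_id: str) -> str:
        return self.builder(self._pages[book_id])


pages = ["Chapter 1, page 1", "Chapter 1, page 2"]
stored = StoredEpubs(build_old)
on_the_fly = OnTheFlyEpubs(build_old)
stored.ingest("book-1", pages)
on_the_fly.ingest("book-1", pages)

# Later, the builder is improved in both systems.
stored.builder = build_new
on_the_fly.builder = build_new
```

After the builder is swapped, `stored.get("book-1")` still returns the old output with the first page missing, because that file was built before the fix. `on_the_fly.get("book-1")` returns both pages, since it rebuilds from the source on every request. This is also why the old stored files have to be removed rather than left in place.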

We hope that this addresses some of the EPUB errors people have been finding. Please continue to give us feedback on how this is working for you. Coming soon: improvements to Open Library’s search features!

openlibrary.org downtime (resolved)

Apologies for the service interruption. Our new Virtual Machine Environment hiccuped this morning, and we’re shaking out the process we need to go through to restart everything smoothly. Covers should still be served normally from covers.openlibrary.org using ISBN or other supported identifiers.

I’ll update here if there’s relevant news – hopefully it won’t be down much longer.

Update, 4:00PM PST: OK. We’re stable again now. Damn race conditions in Linux kernels!

Update, 10:00PM PST: The website is down again.

Update, May 17 2:00AM PST: We’re back online now. However, covers.openlibrary.org is offline, which unfortunately means that covers requested from it by external sites using ISBNs are also unavailable. Covers on openlibrary.org are being served from an alternate location. (We need to wrangle some DNS settings to get this fixed, hopefully within a few hours.)
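
For anyone running an external site that links to covers by ISBN, here is a minimal Python sketch of how such a cover URL is put together. The `/b/{key}/{value}-{size}.jpg` pattern, the identifier keys and the S/M/L sizes come from Open Library's public Covers API documentation, not from this post, so treat them as assumptions and check the current docs. The function name `cover_url` is ours.

```python
from urllib.parse import quote

# Assumed from the public Covers API documentation, not from this post.
COVERS_HOST = "https://covers.openlibrary.org"
VALID_KEYS = {"isbn", "oclc", "lccn", "olid", "id"}
VALID_SIZES = {"S", "M", "L"}


def cover_url(key: str, value: str, size: str = "M") -> str:
    """Build a cover image URL for one identifier (hypothetical helper)."""
    key = key.lower()
    size = size.upper()
    if key not in VALID_KEYS:
        raise ValueError(f"unsupported identifier type: {key}")
    if size not in VALID_SIZES:
        raise ValueError(f"unsupported size: {size}")
    # Percent-encode the identifier so it is safe inside a URL path.
    return f"{COVERS_HOST}/b/{key}/{quote(value.strip(), safe='')}-{size}.jpg"


print(cover_url("isbn", "0385472579", "S"))
```

Because these URLs point at the covers host directly, any page that embeds them depends on covers.openlibrary.org being reachable, which is why external sites are affected while that host is offline.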