Towards better EPUBs at Open Library and the Internet Archive


You may have read about our recent downtime. We thought it might be a good opportunity to let you know about some of the other behind-the-scenes things going on here. We continue to answer email, keep the FAQ updated, and improve our metadata.

Many of you have written about the quality of some of our EPUBs. As you may know, all of our OCR (optical character recognition) is done automatically, without manual corrections, and while it’s pretty good, it could be better. Specifically, we had a pernicious bug where the formatting of some books caused the first page of chapters to be missing from those books’ OCRed EPUBs. I personally had this happen with a series of books I was reading on Open Library, and I know it’s beyond frustrating.

To address this and other scanning quality issues, we’re changing the way EPUBs work. We’ve improved our OCR algorithm, and we’re shifting from stored EPUB files to on-the-fly generation. This means that further improvements in our OCR capabilities will be available immediately, because each EPUB is built at request time rather than read from a previously stored file. This is good news and has the side benefit of radically decreasing our EPUB storage needs. It also means that we have to:

  • remove all of our old EPUBs (approximately eight million items for EPUBs generated by the Archive)
  • put the new on-the-fly EPUB generation in place (now active)
  • do some testing to make sure it’s working as expected (in process)
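
To make the idea of on-the-fly generation concrete, here is a minimal sketch in Python of building an EPUB in memory from OCR text at request time, so nothing needs to be stored. It is an illustration only, not the Archive's generator, which this post does not describe. The function name `build_epub`, the file names, and the one-file-per-page layout are assumptions made for the example, and the result omits the navigation document and some metadata that a strict EPUB validator would require.

```python
import io
import zipfile
from html import escape

# Fixed container file that tells a reading system where the package document lives.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

PAGE_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>{title}</title></head>
  <body><p>{body}</p></body>
</html>
"""

OPF_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="id">example-id</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    {manifest}
  </manifest>
  <spine>
    {spine}
  </spine>
</package>
"""


def build_epub(title, pages):
    """Build a simplified EPUB in memory from a list of OCR page strings.

    Hypothetical helper for illustration: 'pages' stands in for whatever
    the OCR step produces, one string per scanned page.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The EPUB container format expects 'mimetype' first and uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)

        manifest, spine = [], []
        # Emit every page, including the first page of each chapter.
        for i, text in enumerate(pages, start=1):
            name = f"page{i:04d}.xhtml"
            zf.writestr(name, PAGE_TEMPLATE.format(title=escape(title),
                                                   body=escape(text)))
            manifest.append(
                f'<item id="p{i}" href="{name}" '
                f'media-type="application/xhtml+xml"/>'
            )
            spine.append(f'<itemref idref="p{i}"/>')

        zf.writestr("content.opf", OPF_TEMPLATE.format(
            title=escape(title),
            manifest="\n    ".join(manifest),
            spine="\n    ".join(spine),
        ))
    return buf.getvalue()
```

Because the bytes are produced per request, a later improvement to the OCR text would show up in the next download without regenerating or replacing any stored file.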

We hope that this addresses some of the EPUB errors people have been finding. Please continue to give us feedback on how this is working for you. Coming soon: improvements to Open Library’s search features!

This entry was posted in Code/API.

2 Comments

  1. R VENKATARAMAN
    Posted August 13, 2016 at 8:14 am

    At present, Open Library books can be sent to Kindle only for users in the USA. This should be extended to any Kindle user anywhere. In the days of the internet, what is the problem?

  2. Elizabeth Rhodes
    Posted August 16, 2016 at 9:46 pm

    I recently discovered Open Library, and after reading this blog post I feel I should let you know that every one of the books I’ve checked out so far has had the first page of every chapter missing. This has been present in both epub and pdf files. Any estimate on when repairs to this glitch will be completed? Anything readers can do to assist?
