Tag Archives: Search

Towards better EPUBs at Open Library and the Internet Archive


You may have read about our recent downtime. We thought it might be a good opportunity to let you know about some of the other behind-the-scenes things going on here. We continue to answer email, keep the FAQ updated and improve our metadata. Many of you have written about the quality of some of our EPUBs. As you may know, all of our OCR (optical character recognition) is done automatically without manual corrections, and while it’s pretty good, it could be better. Specifically, we had a pernicious bug where the formatting of some books caused the first page of chapters to be left out of the OCRed EPUB. I personally had this happen to me with a series of books I was reading on Open Library, and I know it’s beyond frustrating.

To address this and other scanning quality issues, we’re changing the way EPUBs work. We’ve improved our OCR algorithm and we’re shifting from stored EPUB files to on-the-fly generation. This means that further developments and improvements in our OCR capabilities will be available immediately. This is good news and has the side benefit of radically decreasing our EPUB storage needs. It also means that we have to:

  • remove all of our old EPUBs (approximately eight million items for EPUBs generated by the Archive)
  • put the new on-the-fly EPUB generation in place (now active)
  • do some testing to make sure it’s working as expected (in process)
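For the curious, here is a toy sketch of why on-the-fly generation picks up OCR improvements immediately while a stored file does not. Everything in it is hypothetical: the `Library` class, the two `ocr_*` functions, and the sample pages are stand-ins for illustration, not the Archive's actual code.

```python
def ocr_with_bug(pages):
    # Toy stand-in for the old behaviour: chapter-opening pages get dropped.
    return [p for p in pages if not p.startswith("CHAPTER")]


def ocr_fixed(pages):
    # Toy stand-in for an improved pipeline: every page is kept.
    return list(pages)


class Library:
    """Hypothetical service that can either store EPUB text or build it on request."""

    def __init__(self, ocr):
        self.ocr = ocr      # the OCR pipeline currently in use
        self.stored = {}    # book_id -> text generated at store time

    def store_epub(self, book_id, pages):
        # Stored approach: run OCR once and keep the result.
        self.stored[book_id] = "\n".join(self.ocr(pages))

    def stored_epub(self, book_id):
        # Returns whatever was generated earlier, however old the pipeline was.
        return self.stored[book_id]

    def generate_epub(self, pages):
        # On-the-fly approach: run the current OCR pipeline at request time.
        return "\n".join(self.ocr(pages))


pages = ["CHAPTER 1 It was a dark night.", "The rain fell.", "CHAPTER 2 Morning came."]

library = Library(ocr_with_bug)
library.store_epub("book-1", pages)   # saved with the buggy pipeline

library.ocr = ocr_fixed               # the OCR pipeline is improved later

stale = library.stored_epub("book-1")   # still missing the chapter openers
fresh = library.generate_epub(pages)    # includes them straight away
```

In this sketch the stored copy stays wrong until someone regenerates it, which is the reason the old files have to be removed once generation happens on request.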

We hope that this addresses some of the EPUB errors people have been finding. Please continue to give us feedback on how this is working for you. Coming soon: improvements to Open Library’s search features!

New Bits!

A few hours ago we released a couple of new bits and pieces we thought were worth mentioning.

First, we’ve rearranged the way search results display: our search facets are more obvious, there’s a new cover view, and the pagination is tidier.

You’ve always been able to see facets on the search page, but we were trying to find a way to make them more exploratory and interactive – hopefully, this redesign is a start. So, you can click on a facet to narrow your search, then another, and another. It starts to get interesting when you remove previously selected facets from the search, and begin to move sideways through the catalogue. (The team has wasted some hours playing with this!)
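If you'd like to see the idea in miniature, here is a rough sketch of facet narrowing and the "sideways" move. It assumes a tiny made-up catalogue; the records, field names, and the `apply_facets` and `facet_counts` helpers are all hypothetical and are not how Open Library's search is actually built.

```python
from collections import Counter

# Hypothetical records: titles, fields, and values are made up for illustration.
CATALOGUE = [
    {"title": "Book A", "subject": "Travel", "language": "English", "decade": "1880s"},
    {"title": "Book B", "subject": "Travel", "language": "German", "decade": "1880s"},
    {"title": "Book C", "subject": "Cookery", "language": "English", "decade": "1900s"},
    {"title": "Book D", "subject": "Linguistics", "language": "English", "decade": "1850s"},
]


def apply_facets(books, selected):
    """Keep only the books that match every selected facet value."""
    return [b for b in books if all(b[field] == value for field, value in selected.items())]


def facet_counts(books, field):
    """Count how many of the given books fall under each value of one facet."""
    return Counter(b[field] for b in books)


def titles(books):
    return [b["title"] for b in books]


# Click one facet to narrow the search...
selected = {"subject": "Travel"}
narrowed = apply_facets(CATALOGUE, selected)

# ...then another...
selected["language"] = "English"
narrower = apply_facets(CATALOGUE, selected)

# ...then remove the first facet to move sideways through the catalogue.
del selected["subject"]
sideways = apply_facets(CATALOGUE, selected)
```

After the last step the results are no longer limited to travel books: anything in English shows up, and `facet_counts(sideways, "subject")` gives the new set of subjects to click on next.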

As I was bouncing around, I found a few gems, including 6 digitized books about the Masai, written between 1857 and 1905, including the fascinating Vocabulary of the Enguduk Iloigob and Through Masai land: a journey of exploration among the snowclad volcanic mountains and strange tribes of eastern equatorial Africa.

There’s also Cookery recipes by St. Mary’s Guild, Mill Valley, California – just around the corner from us here in San Francisco – published in 1902 and available to read online. Pickles, Marmalades, Jellies, Preserves is “swooning in sweetness” on Page 71, and the scan is full of hand-written notes, as any good cookbook should be!

And, as NASA celebrates the 40th Anniversary of the Apollo mission, here’s a bit of Mars-related science fiction to whet your appetite. If you like space stuff, you’ll love the collection of fantastic 16mm videos shot on board Apollo, hosted over at nasaimages.org, another project of the Internet Archive.

The other cool thing that we released is integration with the new, improved book reader available on archive.org. Improvements include a one-page view, access to the full resolution of the original scan (in that one page view), and the ability to link into a specific page in a scanned book, just by grabbing the URL in the navigation bar whenever you’re looking at a certain page, like I did above to link to Page 71 of the cookery book. (The URL updates on the fly as you turn the pages – super cool!) There’s more information over at the Open Content Alliance blog.

We’d love to hear what you think of the new search results page, so please leave us a comment!