The Little Bot That Could

By George Oates


Meet oclcBot. He was written by Bruce Washburn at OCLC Research to help connect Open Library records to Worldcat.org. He’s just finished updating almost 4 million Open Library editions with links! No metadata exchange at all, except these identifiers. Tiny, but powerful, because that lets systems that “speak OCLC” communicate directly with Open Library without knowing any Open Library IDs. As Anand mentioned in his recent post about Coverstore Improvements, we’ve also made the system for displaying covers externally using other types of identifiers more efficient.

There was a bit of a bumpy start to oclcBot’s updates, and Bruce and I thought it might be good to hear what it was like in the trenches. From Bruce:

This project was essentially very simple: find corresponding Open Library and OCLC WorldCat records by a shared attribute (ISBN), and update the Open Library record with the corresponding OCLC number. Once OCLC had generated a list of OCLC numbers and their corresponding ISBNs, it seemed to be a simple matter of using the very robust Open Library API to look for matching records, check to see if they already included an OCLC number, and update the record accordingly. Complications arose, related to scale. There were about 90 million ISBNs to check from the OCLC list, and checking them one at a time via the API was projected to take a very long time. So we used a data dump of all the Open Library records to identify those with ISBNs, and also built a very fast index of the OCLC list to check against. With that we were able to produce a new list of Open Library records and corresponding new OCLC numbers. And a batch update facility in the Open Library API made it possible to send API requests 1,000 records at a time. The pre-processing and the batch process both yielded some additional lists that will require more scrutiny to process (records associated with multiple ISBNs, API exceptions for individual records), but the great majority of records were updated via the oclcBot without any further effort.
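The approach Bruce describes (match on ISBN against a prebuilt index, skip records that already carry an OCLC number, then send updates in groups of 1,000) can be sketched roughly as follows. This is only an illustration, not oclcBot's actual code: the field names `key`, `isbns` and `oclc_numbers`, and the rule of setting ambiguous matches aside, are assumptions made for the example.

```python
from itertools import islice


def normalize_isbn(isbn):
    """Strip hyphens and spaces so the same ISBN always compares equal."""
    return isbn.replace("-", "").replace(" ", "").upper()


def find_updates(editions, isbn_to_oclc):
    """Yield (edition_key, oclc_number) for editions that need a new link.

    `editions` is an iterable of dicts read from a data dump; the field
    names used here are hypothetical. `isbn_to_oclc` is an in-memory
    index built from the OCLC list.
    """
    for edition in editions:
        if edition.get("oclc_numbers"):
            continue  # already linked, nothing to do
        isbns = (normalize_isbn(i) for i in edition.get("isbns", []))
        matches = {isbn_to_oclc[i] for i in isbns if i in isbn_to_oclc}
        # Exactly one candidate: safe to update. Zero or several
        # candidates are left for manual review (an assumed policy).
        if len(matches) == 1:
            yield edition["key"], matches.pop()


def batches(items, size=1000):
    """Group any iterable into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each list produced by `batches(find_updates(...))` would then be handed to whatever batch update call the API provides; that call is left out here because its exact form is not described above.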

So, it’s still early days with our Bot operations, but we’re looking for external developers who might be interested in trying these “surgical strike” style updates to loads of Open Library records at once. If you’re curious, please visit our Writing Open Library Bots page in the Open Library Developers area.

Thank you, Bruce!

(And thanks to Solo for the CC BY-NC-SA 2.0 oclcBot photo.)

Mike Matas: A next-generation digital book (TED)

By George Oates

Coverstore Improvements

By Anand Chitipothu

We have recently made some improvements to coverstore, the Open Library book covers service.

Now it is possible to access book covers by all the available identifiers. For example:

http://covers.openlibrary.org/b/goodreads/6383507-M.jpg
http://covers.openlibrary.org/b/librarything/8071257-M.jpg

Cover lookups by ISBN now ignore hyphens. For example, both of the following URLs point to the same cover, and this works even if the ISBN is stored with hyphens in the edition record.

http://covers.openlibrary.org/b/isbn/1-59286-793-6-M.jpg
http://covers.openlibrary.org/b/isbn/1592867936-M.jpg

Please refer to the Open Library Covers API for more details.
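As a small illustration of the URL pattern shown above, here is a sketch of a helper that builds a cover URL from an identifier type and value. The function name `cover_url` is made up for this example, and only the `M` size that appears in the URLs above is assumed. The service ignores hyphens itself, so stripping them on the client is optional; it is done here only to show that both spellings resolve to the same address.

```python
def cover_url(id_type, value, size="M"):
    """Build a cover URL following the pattern used in the examples above.

    `id_type` is an identifier name such as "isbn", "goodreads" or
    "librarything"; `value` is the identifier itself.
    """
    if id_type == "isbn":
        # Hyphenated and plain ISBNs refer to the same cover.
        value = value.replace("-", "")
    return f"http://covers.openlibrary.org/b/{id_type}/{value}-{size}.jpg"


print(cover_url("isbn", "1-59286-793-6"))
print(cover_url("goodreads", "6383507"))
```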

We have built a secondary database that maps edition identifiers to cover IDs, which makes accessing covers faster. Because of this, there is some delay between adding an identifier to an edition record and being able to access the cover using the newly added identifier. The delay is usually a couple of seconds.

Please note that this API is intended for displaying covers on public-facing websites and not for bulk download. To download book covers in bulk, please refer to the Bulk Access section of the API documentation.

We have recently noticed that some bots are downloading book covers by ISBN at a very high rate, which badly affected the performance of the system. We have added rate-limiting to limit the number of requests per IP address. The current limit is 100 requests per IP address every 5 minutes. The limit applies only to cover accesses by the various identifiers; there is no limit on accessing covers by cover ID.

This limit should be good enough for linking covers on public-facing websites. Please consider using the Open Library Books API if your website demands more. Since the Books API provides book cover URLs using cover IDs, the rate limit won’t apply.
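If you fetch covers by identifier from a script, one way to stay under a limit of 100 requests per 5 minutes is to keep track of your own request times. The sketch below is a client-side guard only. It assumes a sliding window, which may not be how the server counts requests, and the class name is invented for the example.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window_seconds` span."""

    def __init__(self, max_requests=100, window_seconds=300,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so it can be tested
        self._stamps = deque()      # times of recently allowed requests

    def allow(self):
        """Return True and record the request if it fits in the window."""
        now = self.clock()
        # Forget requests that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) < self.max_requests:
            self._stamps.append(now)
            return True
        return False
```

A script would call `allow()` before each cover request and wait, or fall back to cover IDs, whenever it returns `False`.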

Please get in touch with us if you need any assistance in using this API to show book covers on your website.

pystatsd & 5,000 Lists!

By George Oates

We’re working hard to improve Open Library’s general stability and performance, after a few harrowing weeks moving our hardware infrastructure around. We’re beginning to measure more stuff across the site, from general activity levels (about 40,000 catalog edits every month!) to quite specific actions (like, seeing that every second, 1-3 people open up our BookReader).

We’ve begun using a super awesome, real-time stats processing package called pystatsd, a Python implementation of Etsy’s statsd server. My favourite bit is a program that sits on top of that called graphite which takes all the stats we collect with pystatsd and renders them as graphs in a browser. Suddenly, we can see the system in a new and useful way.

We’re also looking hard at improving our memcached configuration, recently introducing another 4 memcached machines into our pool. Now that we can measure memcached hits and misses using pystatsd and graphite, we’ll be able to tell when our caching stuff is actually improving. Yay!
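For the curious, counting something like memcached hits and misses with a statsd-style server comes down to sending tiny UDP packets of the form `name:value|c`. The sketch below builds those packets by hand rather than using pystatsd's own client, and the metric names, host and port (8125 is the usual statsd default) are assumptions for illustration, not our actual configuration.

```python
import socket


def counter_packet(name, value=1):
    """Format a statsd counter increment, e.g. b'memcached.hits:1|c'."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name, value=1, host="localhost", port=8125):
    """Fire-and-forget a counter increment over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(counter_packet(name, value), (host, port))
    finally:
        sock.close()


def hit_ratio(hits, misses):
    """Fraction of cache lookups that were hits (0.0 if there were none)."""
    total = hits + misses
    return hits / total if total else 0.0
```

A cache wrapper would call `send_counter("memcached.hits")` or `send_counter("memcached.misses")` after each lookup, and the graphs then show whether the ratio is heading in the right direction.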

Memcached hits & misses

Another tweak you might find interesting… it used to be that lists would only show up on the main Lists page if they contained at least 3 seeds. The other day, Raj and I upped that to at least 5 seeds, and that immediately produced a selection of arguably more interesting lists, most of which settle around a subject area. Here’s a small selection:

Have you made a great list, or found someone else’s? Let us know in the comments!

Alice: The On-Line Catalog

By George Oates

Ohio University's Alden Library Alice Catalog, 1983

So awesome. That’s 1983 in Ohio, folks.

New Titles in Lending Library!

By George Oates

Our little lending library is continuing to grow, this time with 90 new titles purchased directly from two fabulous eBook publishers: A Book Apart & Smashwords.

3 titles from A Book Apart are all must-reads for any discerning web professional…

Thanks to Mandy, Jeffrey and Jason at A Book Apart for joining in the fun. (Incidentally, Mandy’s blog, A Working Library, is a great read.)

There are also 87 ePubs from Smashwords, by authors like Amanda Hocking, Ruth Ann Nordin and Gerald M. Weinberg.

Thanks to @markcoker at Smashwords for working with us to get these new titles online.

Loans through the Open Library are exclusive to one Open Library account holder at a time, for up to two weeks. For most titles, you can access the eBooks in one of three ways: directly in your web browser (using our BookReader), as a PDF, or as an ePub (downloaded into Adobe Digital Editions). The new Smashwords titles are a bit different – they’re only available in ePub format, so only downloadable and readable in Adobe Digital Editions.

The Internet Archive (and Open Library) is actively seeking publishers who’d like us to buy their eBooks and make them available in the Lending Library. If you are a publisher interested in selling us your wares, please get in touch!

While that Lending Library — available to anyone with an Open Library account — is growing, we’re also working to expand the collection for our “In-Library” loans, currently at about 85,000 eBooks. This special In-Library program is a bit different, because it requires patrons to literally be inside a participating library’s network. Once that’s the case, patrons can see all the books available in the In-Library collection on Open Library, from all the libraries in the In-Library pool, currently around 150 North American libraries.

Public Library: An American Commons

By George Oates





Public Library: An American Commons is a photography exhibition on at the San Francisco Public Library’s Jewett Gallery, running from April 9 to June 12. The photographer, Robert Dawson, has captured the American relationship with public libraries across the country in a series of intimate portraits. From the Design Observer review:

What’s at stake here is more than access to a room full of books. The modern American public library is reading room, book lender, video rental outlet, internet café, town hall, concert venue, youth activity center, research archive, history museum, art gallery, homeless day shelter, office suite, coffeeshop, seniors’ clubhouse and romantic hideaway rolled into one.

Minimum Viable Record?

By George Oates

Having worked more closely with bibliographic data than I had ever expected to over the last couple of years, I still can’t quite believe how complicated it can be. I keep holding tight something Karen Coyle told me when I first started at Open Library, that “library metadata is diabolically rational.” Now that I’ve witnessed the cataloging from lots of different sources and am more familiar with the level of detail that’s possible in a library catalog, I have a new fondness for these intensely variegated information systems; at times devilishly detailed, at others wildly incomplete or arcanely abbreviated. Everyone likes to arrange things and classify them into groups. It’s when you try to get people to put things into groups that someone else has come up with that it starts getting messy.

At Open Library, we’re attempting to ingest catalog data from, well, everywhere. Every “dialect” of cataloging practice makes this mass consumption harder. In spite of the rational goal of standardized data entry, there is an intense diffusion of practice. (Have a look at Seeing Standards: A Visualization of the Metadata Universe by Jenn Riley and Devin Becker if you haven’t already.)

A challenge I think we face today is a metastasized level of complexity, particularly as we attempt to begin to catalog works that have no physical form, but only exist electronically. Any challenge presents opportunity, and the opportunity here is to radically simplify the way things are represented in catalogs.

In February, I gave a presentation at the API Workshop held at the Maryland Institute for Technology in the Humanities (MITH). I talked about Open Library and paid particular attention to the resources we’re trying to put in place for developers to hook into the system.

Part of the presentation was an impromptu survey of the audience, where I passed around an index card for everyone, and asked people to write down the 5 fields they thought were adequate to describe a book. I framed the survey as a search for a “minimum viable record,” and it was fascinating to watch the audience squirm a bit as they asked for more guidance on the challenge. Can fields repeat? What’s the audience for this description? etc.

I’ve collated the results of the forty or so respondents into an ugly spreadsheet. There are 4 sheets, linked in the green strip at the bottom of the page:

  1. Book Raw – unfiltered results, in the order they were written
  2. Book Cooked V1 – all results blended, sorted alphabetically
  3. Book Merged – all results grouped
  4. Summary – with counts and a graph!

Here’s the final result:

So, on the shoulders of “minimum viable product”, a way for web application developers to get working code deployed quickly and effectively, I wonder if it’s time to put a “minimum viable record” in place for bibliographic systems. Enough detail for a computer to match, correlate and compare, but not so much that having to process each record stops everything in its tracks.

You might have heard of the Open Publication Distribution System (OPDS) Catalog specification, which is a syndication format for electronic publications. Certainly, this new standard is a great step towards simpler representations of books — in this case, OPDS was initially designed to represent eBooks specifically — but I find myself wondering if it could be reduced further still, to pave the way for even easier exchange between systems. (Please note that all our edition records are now available in OPDS format, as well as RDF and JSON.)

Something like Title, Author, Date, Subject[s] and Identifier[s] might just do the trick, though it is of course irresistibly debatable. It’s an idea we’re going to look to as we work on our new Write API for Open Library. This minimum viable record will play gatekeeper for any new records we ingest (or that you export).
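To make the idea a little more concrete, here is one way such a record and its gatekeeper check might look in code. This is only a sketch of the five fields suggested above, not the design of the Write API: the class name, the choice of which fields are mandatory, and the shape of the identifiers (a mapping from scheme name to a list of values) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MinimumViableRecord:
    """Title, Author, Date, Subject[s] and Identifier[s], nothing more."""
    title: str
    authors: List[str]
    date: str
    subjects: List[str] = field(default_factory=list)
    identifiers: Dict[str, List[str]] = field(default_factory=dict)

    def missing_fields(self):
        """List the fields this sketch treats as required but empty."""
        missing = []
        if not self.title.strip():
            missing.append("title")
        if not self.authors:
            missing.append("authors")
        if not self.date.strip():
            missing.append("date")
        # At least one identifier with at least one value, so that
        # another system has something to match on.
        if not any(self.identifiers.values()):
            missing.append("identifiers")
        return missing

    def is_viable(self):
        """Gatekeeper check: True when nothing required is missing."""
        return not self.missing_fields()
```

Subjects are left optional here, which is exactly the sort of decision that is up for debate.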

What do you think of this minimum viable blog post?

Book as Art Object

By George Oates

The Making of Tree of Codes, written by Jonathan Safran Foer, constructed by Die Keure, a printing house in Belgium.

Plus a fabulous write up from the publisher, Visual Editions: “The book is as much a sculptural object as it is a work of masterful storytelling: here is an “enormous last day of life” that looks like it feels.” [via foe]

In these reaction snippets, I love that a chap mentions, “OK, I’m getting the hang of it now.”

Open Library Architecture Diagram

By raj

Here is a diagram of the current Open Library architecture:
