Public Library: An American Commons

By George Oates





Public Library: An American Commons is a photography exhibition on at the San Francisco Public Library’s Jewett Gallery, running from April 9 to June 12. The photographer, Robert Dawson, has captured the American relationship with public libraries across the country in a series of intimate portraits. From the Design Observer review:

What’s at stake here is more than access to a room full of books. The modern American public library is reading room, book lender, video rental outlet, internet café, town hall, concert venue, youth activity center, research archive, history museum, art gallery, homeless day shelter, office suite, coffeeshop, seniors’ clubhouse and romantic hideaway rolled into one.

Minimum Viable Record?

By George Oates

Having worked more closely with bibliographic data than I had ever expected to over the last couple of years, I still can’t quite believe how complicated it can be. I keep holding tight to something Karen Coyle told me when I first started at Open Library, that “library metadata is diabolically rational.” Now that I’ve witnessed cataloging from lots of different sources and am more familiar with the level of detail that’s possible in a library catalog, I have a new fondness for these intensely variegated information systems; at times devilishly detailed, at others wildly incomplete or arcanely abbreviated. Everyone likes to arrange things and classify them into groups. It’s when you try to get people to put things into groups that someone else has come up with that it starts getting messy.

At Open Library, we’re attempting to ingest catalog data from, well, everywhere. Every “dialect” of cataloging practice makes this mass consumption harder. In spite of the rational goal of standardized data entry, there is an intense diffusion of practice. (Have a look at Seeing Standards: A Visualization of the Metadata Universe by Jenn Riley and Devin Becker if you haven’t already.)

A challenge I think we face today is a metastasized level of complexity, particularly as we begin to catalog works that have no physical form and exist only electronically. Any challenge presents opportunity, and the opportunity here is to radically simplify the way things are represented in catalogs.

In February, I gave a presentation at the API Workshop held at the Maryland Institute for Technology in the Humanities (MITH). I talked about Open Library and paid particular attention to the resources we’re trying to put in place for developers to hook into the system.

Part of the presentation was an impromptu survey of the audience, where I passed around an index card for everyone, and asked people to write down the 5 fields they thought were adequate to describe a book. I framed the survey as a search for a “minimum viable record,” and it was fascinating to watch the audience squirm a bit as they asked for more guidance on the challenge. Can fields repeat? What’s the audience for this description? etc.

I’ve collated the results of the forty or so respondents into an ugly spreadsheet. There are 4 sheets, linked in the green strip at the bottom of the page:

  1. Book Raw – unfiltered results, in the order they were written
  2. Book Cooked V1 – all results blended, sorted alphabetically
  3. Book Merged – all results grouped
  4. Summary – with counts and a graph!

Here’s the final result:

So, on the shoulders of “minimum viable product”, a way for web application developers to get working code deployed quickly and effectively, I wonder if it’s time for a “minimum viable record” for bibliographic systems. Enough detail for a computer to match, correlate and compare, but not so much that having to process each record stops everything in its tracks.

You might have heard of the Open Publication Distribution System (OPDS) Catalog specification, which is a syndication format for electronic publications. Certainly, this new standard is a great step towards simpler representations of books — in this case, OPDS was initially designed to represent eBooks specifically — but I find myself wondering if it could be reduced further still, to pave the way for even easier exchange between systems. (Please note that all our edition records are now available in OPDS format, as well as RDF and JSON.)

Something like Title, Author, Date, Subject[s] and Identifier[s] might just do the trick, though it is of course irresistibly debatable. It’s an idea we’re going to look to as we work on our new Write API for Open Library. This minimum viable record will play gatekeeper for any new records we ingest (or that you export).
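To make the gatekeeper idea a little more concrete, here is a minimal sketch in Python. It is only an illustration: the class name, the method names and the choice of types (a list for subjects, a dictionary for identifiers) are assumptions made for this example, not the design of the Write API. The sample record borrows the title, author and date of A Library Primer; its subject and identifier values are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MinimumViableRecord:
    """Hypothetical five-field record: title, author, date, subjects, identifiers."""

    title: str
    author: str
    date: str
    subjects: List[str] = field(default_factory=list)
    identifiers: Dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> List[str]:
        """Return the names of the fields that are empty."""
        missing = [
            name
            for name in ("title", "author", "date")
            if not getattr(self, name).strip()
        ]
        if not self.subjects:
            missing.append("subjects")
        if not self.identifiers:
            missing.append("identifiers")
        return missing

    def is_viable(self) -> bool:
        """A record passes the gate only when all five fields are filled in."""
        return not self.missing_fields()


# Placeholder subject and identifier values, for illustration only.
record = MinimumViableRecord(
    title="A Library Primer",
    author="John Cotton Dana",
    date="1906",
    subjects=["Libraries"],
    identifiers={"local": "example-1"},
)
print(record.is_viable())
```

Whether every one of the five fields should be mandatory is exactly the sort of thing that remains debatable; the sketch simply treats them all as required.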

What do you think of this minimum viable blog post?

Book as Art Object

By George Oates

The Making of Tree of Codes, written by Jonathan Safran Foer, constructed by Die Keure, a printing house in Belgium.

Plus a fabulous write up from the publisher, Visual Editions: “The book is as much a sculptural object as it is a work of masterful storytelling: here is an ‘enormous last day of life’ that looks like it feels.” [via foe]

In these reaction snippets, I love that a chap mentions, “OK, I’m getting the hang of it now.”

Open Library Architecture Diagram

By raj

Here is a diagram of the current Open Library architecture:


A Library Primer

By George Oates

Just discovered this wonderful book of 60 short chapters on how to start a library: A Library Primer by John Cotton Dana, Fourth Edition, published 1906 by Library Bureau.

To the librarian himself one may say: Be punctual; be attentive; help develop enthusiasm in your assistants; be neat and consistent in your manner. Be careful in your contracts; be square with your board; be concise and technical; be accurate; be courageous and self-reliant; be careful about acknowledgments; be not worshipful of your work; be careful of your health. Last of all, be yourself.

And, it’s fantastic that our Read To Me feature in the BookReader can understand the librarian’s neat hand on the page of examples in the Ink and Handwriting chapter.

Specimen Alphabets and Figures

Scheduled Downtime: (Again) 9:30AM PST, 2011-03-10

By George Oates

Original post, 2011-03-07: The time has come for Open Library to migrate fully to the Internet Archive’s new virtual machine architecture. We expect the site to be down for about 2 hours as we move data and update various config files. Please bear with us… there are lots of balls in the air that we need to catch!

Also, we’ll post updates here if the plan changes.

Update, 11:30pm PST, 2011-03-07: Ok! The site’s back online, on brand new hardware. Everything looks about right, and we’re warming various caches and testing performance on various elements. Fingers crossed everything will warm up nicely over the next few hours. Yay!

Update, 9:30am PST, 2011-03-08: Just a little note to let you know that we’re still working on the migration. Our coverstore is struggling, and we’re tweaking our Gunicorn & lighttpd config in the new system. Apologies for the service interruption – you might see covers loading slowly, amongst other things we’re still discovering. More updates as they come to hand…

Update, 8:45am PST, 2011-03-10: Apologies for the short notice, but we’ll be bringing the site offline around 9:30am PST this morning, since we need to downgrade our lighttpd install from 1.4.26 to 1.4.19. The theory is that the newer version is still a bit unstable, and that’s part of the reason the site has been a little “bouncy” since Monday.

Heads Up! Data Center Migration in progress

By George Oates

You might notice a few hiccups, timeouts or slow-loading pages as you wander around Open Library over the next few days. The whole Internet Archive is migrating to a new virtual machine data center, which is no mean feat.

From Open Library’s point of view, that means moving data and services to the new virtual machine configuration, and making sure that everything’s running smoothly. We’re hoping this move will result in faster performance, and flexibility for increasing hardware and improving tools into the future.

Your patience is appreciated. See you on the other side!

UPDATE, 3:25pm PST: Our cover service is spluttering at the moment. That’s affecting the whole site’s performance. We’re looking into it. Apologies for the service interruption.

UPDATE, 5:40pm PST: OK. We’re pretty sure we’ve fixed the covers trouble. Yay! Also, we’re considering taking the site offline on Sunday evening (PST) to do the heavy lifting associated with the migration to the new virtual machines. We’ll let you know as far in advance as we can exactly when and for how long.

Get Thee to a Library!

By George Oates

For our first big release of 2011, we’d like to introduce you to a couple of new bits and pieces on Open Library:

  1. A new home page design
    When we launched the site redesign almost a year ago, the home page was trying to make it clear that it was possible to edit the Open Library site, and that we welcome your contributions. You might remember the cheeky "Ever wanted to play librarian?" phrase. Now that the new design has settled somewhat, and we have a great level of activity across the site, we wanted to shift the focus again, to make it clearer that you can actually get to books as well. Not only over 1 million free eBooks, but also our small but growing Lending Library.

    So, the new home page has 3 new "carousels" that show an assortment of free eBooks to read, a small curated selection of titles from the Lending Library, and Version 1 of a new "Return Cart" feature, which shows you eBooks that have, well, been recently returned.

    Stats!

    We’ve also added some activity graphs at the bottom of the page, which tell you that in the last 28 days (at time of writing), we’ve had:

    • 5,794,587 unique visitors,
    • 14,219 new members sign up,
    • 39,939 edits to the catalog, 
    • 990 new lists created, and
    • 3,340 eBooks borrowed.

    Wow!

  2. The "In Library" lending program
    In one small step for library kind and readers around the world, today we’re announcing a new collection of "In Library" eBooks available for loan. Here’s the idea: there’s a group of libraries participating in the pilot program, each of which has added eBooks to the new pool.


    See a map displaying the participating libraries – Yay OpenStreetMap!

    The interesting part is that you, dear patron, need to get your bones into the actual libraries themselves to borrow any of the titles from any of the libraries in the pool. Once you’ve done that, the loan acts just like the "normal" Lending Library loans that are available to any Open Library account holder around the world, 5 books at a time, for up to 2 weeks. Cool, huh?

Read the Internet Archive announcement about the In Library program, or if you happen to be from a library that’s interested in joining in the fun, please get in touch with us.

As an aside, I’m attending the Maryland Institute for Technology in the Humanities API Workshop this weekend to talk about the Open Library API and how people are using it, so if you happen to be there, please come and say hello!

Loading the text of 2 million books into solr

By Edward Betts

Here at the Internet Archive we’ve scanned millions of books. One of our challenges is helping people find books they want to read. The bibliographic records can be searched in Open Library. If somebody knows the title they can find the book. It would be useful if all the text inside the book was also searchable. This is a problem I’ve been working on.

I started by taking the output in XML from our OCR and tidying up the results. The OCR program only considers individual pages, so paragraphs that span a page were split up. A phrase search that crosses the page boundary wouldn’t match. My code detects paragraphs broken across pages or columns and recombines them. It also deals intelligently with hyphenated words by joining them back together.
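As a rough illustration of that tidy-up step, here is a minimal Python sketch. It is not the code that runs on our cluster. It assumes each page has already been reduced to a list of paragraph strings, and it uses a simple heuristic that I've picked for the example: if the last paragraph on a page doesn't end with sentence-ending punctuation, the first paragraph of the next page is treated as its continuation. The function names are made up for this post.

```python
import re
from typing import List

# Assumed heuristic: a paragraph is complete if it ends with ., ! or ?,
# optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r"[.!?][\"')\]]*$")


def dehyphenate(text: str) -> str:
    """Rejoin a word that was split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def merge_pages(pages: List[List[str]]) -> List[str]:
    """Flatten per-page paragraphs, rejoining paragraphs split by a page break."""
    merged: List[str] = []
    for page in pages:
        first_on_page = True
        for paragraph in page:
            text = dehyphenate(paragraph).replace("\n", " ").strip()
            if not text:
                continue
            if first_on_page and merged and not SENTENCE_END.search(merged[-1]):
                previous = merged[-1]
                if previous.endswith("-"):
                    # The word itself was broken across the page boundary.
                    merged[-1] = previous[:-1] + text
                else:
                    merged[-1] = previous + " " + text
            else:
                merged.append(text)
            first_on_page = False
    return merged


pages = [
    ["First paragraph.", "A phrase search that crosses the page bound-"],
    ["ary would not match.", "Last paragraph."],
]
for paragraph in merge_pages(pages):
    print(paragraph)
```

A heuristic this simple will get some cases wrong. It would, for example, turn a genuinely hyphenated compound that happens to break at a line end into a single word, so real code needs to be more careful about when it removes a hyphen.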

Fixing up the OCR involves parsing XML, which is slower than just reading a text file. Fortunately the books are stored on a few hundred computers in our petabox storage cluster, so I’m able to run the XML parsing and OCR tidy stage on the computers in the cluster in parallel.

Next I take this data and feed it into solr. My solr schema is pretty simple. The fields are just the Internet Archive identifier and the text of the book in a field called body. When we show a search inside results page we can pull the book title from the Open Library database. It might be quicker to generate result pages if all the data needed to display them were in the search index, but that would mean updating the index whenever one of these pieces of information changes. It is much simpler to only update the search index if the output from OCR changes.
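To give a feel for how small each document is, here is a sketch that builds a solr XML add message with just those two fields. The field name body comes from the schema described above; the name identifier for the other field, the function name and the sample values are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def build_add_message(identifier: str, body: str) -> str:
    """Build a solr XML <add> message for one book.

    The field name "identifier" is an assumption for this sketch;
    "body" is the field that holds the full text of the book.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("identifier", identifier), ("body", body)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Placeholder identifier and text, for illustration only.
print(build_add_message("example-identifier", "Full text of the book goes here."))
```

ElementTree takes care of escaping characters such as & and < in the book text, which matters a lot when the input is raw OCR output.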

The books we scan are in many different languages. It would be nice to provide stemming for all of these languages, but at first I’m loading them into a single field with minimal English stemming rules. The body field where the text of the book goes is compressed, which keeps the index smaller. The field has termVectors="true" set, which improves the speed of highlighting, and that is important to us.

In the solr config I had to increase the value of maxFieldLength. By default only the first 100,000 words in a field are indexed. Many of our books are longer than this.

Hathi Trust are doing some very similar work. They spotted that solr was using a lot of memory, and were able to fix it by increasing the value of termInfosIndexDivisor. This was a helpful tip. I set termInfosIndexDivisor to 4 and our memory usage dropped from 13GB to 7GB.

Unlike the Hathi Trust, we’re not using solr sharding. We’ve loaded just over 2 million books into a single 3TB index.

Most searches take about a second, but some complex searches can take 10 seconds or longer. I’m working on speeding these up.

In a later blog post I’ll write about how we use the results we get from solr.

Happy New Year! And, lists…

By George Oates

Wishing you and yours a very happy 2011, and hoping you enjoyed your holidays!

I wanted to take a moment to show you some of the fantastic Lists we’ve noticed over the last few weeks since the new feature launched.

As with most of the other types of things we have on Open Library (Works, Editions etc), you can watch the recent activity on Lists to keep an eye out for new developments, and we’re just putting the finishing touches on new API documentation for Lists to accompany the release. Actually, this was one of the first times we’ve drafted the API docs before building the feature, which will hopefully be the shape of things to come.

It’s been really exciting to watch how you’re making use of Lists. There’s a ton of variety and interesting new “collections” emerging, like the ones I’ve pulled out above. While we’re really happy with Version 1, there are lots of directions we could go with Lists, so if you have ideas, please leave a comment here or drop us a line.

Also, we’ve noticed a bit of trouble with the search in our new Bookreader over the holiday period. For those of you who’ve written in, thank you, and we wanted to let you know we’re looking into it and hope to resolve it quickly.