An update on Open Library

By George Oates

It’s been some months since we’ve updated you about what the Open Library is up to. Sorry about that. Thought it might be nice to produce a novella/brain dump to let you know where we’re at.

The short answer is: all sorts of things! I’ve been leading the project now for about 6 months, and have finally settled down enough to tell you what we’re up to. We’d love to hear what you think of our ideas, perhaps in the comments of this post, or on our general discussion mailing list.

The Open Library project began in February of 2007, and launched in November that year, so it’s approaching 3 years old. During that time, we’ve amassed one of the biggest virtual library catalogs online, at some 23 million edition entries and some 6 million or so author records. We also have a ton of book covers. Our catalog is entirely open and free to use. You can download everything if you wish, or use our API to either link to our records, or to display Open Library data on your website.

When I started, we didn’t have much insight into what was happening on Open Library. We knew people were using it, but didn’t really know how much, or who or what was happening. The majority of edits to the catalog are made by our bots, running updates across the system, and creating new stub records. While this work is essential, I couldn’t see any of the humans using the catalog. So, we started trying to get some insight into the site usage using tools like Alexa amongst others.

Here’s what we’ve uncovered so far, with more to come:

  • We have an average of about 400 concurrent visitors at any time, peaking at up to 900 people
  • We’ve increased daily unique IPs from about 100,000 back in June to somewhere around 250,000 today
  • Our site uptime has steadied to a very healthy 100% on a good week
  • Our bounce rate is high. Too high. It’s a concern for us that people lob into Open Library thanks to our high search engine ranking, but bounce straight out again
  • There are over 3,000 sites that link directly into Open Library. Wonderful! We’re working on understanding what those links are, and from where.
  • Our membership is fairly small, but growing every day. You can edit and use Open Library without creating an account, which probably accounts for the modest membership.

I think the story that numbers like those tell is that we have an excellent foundation for growth. This is precisely what we’re banking on as we announce that we’ll be releasing a redesign of Open Library in the next few months. We’ll stay in touch about the actual dates, and it’s very likely we’ll do a soft release before we make the final transition live. Please, watch the blog for updates on timing.

There are a number of enhancements to Open Library that we’re planning to make in the upcoming redesign (or, “realignment” as Cameron Moll has written about). That’s not to say we won’t be taking the opportunity to update the site’s look and overall usability, but, the core of the release will be about the catalog and how you see it.

Having researched a lot of the historical documentation surrounding the project, I saw tons of ideas that sounded great and that it’s time to build, like the ability to tag records or authors, tools to upload small collections from special interest or rural libraries, and ways to push what bibliographic data means on the web. We’re looking forward to beginning to make some of these ideas a reality in 2010.

Key Components of the Redesign

Works
Open Library deals with books at the edition level. This makes finding “War and Peace” really tricky, because all we currently display are the hundreds of editions in a big unordered list. Tricky to find what you’re searching for… Luckily, the cataloging standards initiative called Functional Requirements for Bibliographic Records (FRBR) describes a “super-level” of book called “the Work”, which represents the abstract idea of a book and not its constituent editions, probably making it easier to get started with research and the like.

We’ve been toiling for the past several months to roll up all our editions into logical Works. This is incredibly tricky for all sorts of reasons and as much as we would like it to be bulletproof perfect on the first go, it’s likely people will see an edition that should be in a certain Work, or Work records that are really the same book. Providing tools for fixing dupes like that is next on the list. That said, we’ve been testing our brand new Work search lately, and it’s given me (at least) an entirely different and exciting view on the Open Library. We can suddenly see things like the books in our catalog with the most editions, or all the Works by Mark Twain (instead of a massive list of all the editions he’s supposed to have written) and more. Truly, it’s invigorating after being stuck in the edition “mud” for so long. Not that edition data is bad, of course, just that the aggregate is extremely useful.
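To give a feel for what “rolling up” means, here is a deliberately naive sketch, not the method the Open Library team actually uses. It assumes each edition is a plain dictionary with hypothetical `title` and `author` fields, and groups editions whose normalized title and author match exactly. Real catalog data needs far more care (translations, subtitles, author name variants), which is exactly why the real thing is tricky.

```python
import re
from collections import defaultdict


def work_key(title, author):
    """Build a crude grouping key: lowercase, strip punctuation, collapse spaces."""
    def norm(text):
        text = re.sub(r"[^\w\s]", "", text.lower())
        return " ".join(text.split())
    return (norm(title), norm(author))


def group_into_works(editions):
    """Map each (title, author) key to the list of editions that share it."""
    works = defaultdict(list)
    for edition in editions:
        works[work_key(edition["title"], edition["author"])].append(edition)
    return dict(works)


# Hypothetical edition records, for illustration only
editions = [
    {"title": "War and Peace", "author": "Leo Tolstoy", "year": 1942},
    {"title": "War and peace.", "author": "Leo Tolstoy", "year": 1968},
    {"title": "Anna Karenina", "author": "Leo Tolstoy", "year": 1954},
]
works = group_into_works(editions)
# Works sorted by number of editions, largest first
most_editions = sorted(works.items(), key=lambda kv: len(kv[1]), reverse=True)
```

Once editions are grouped this way, questions like “which Work has the most editions?” become a simple sort over the groups.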

Subjects
As a non-librarian, I have been both shocked and awed by the degree of classification that’s possible using library practices. Catalogers have worked hard to put books into very specific descriptive boxes and hierarchies. Being a fan of messy data and classification, I have stumbled upon lots of classifications for books whose “order” seems quite nonsensical.

For example, many of the science fiction books listed on Open Library have several very similar, convoluted subject classifications, separated by all manner of different characters. To the human eye, it seems like duplication of effort. One book might have the following subjects assigned to it:

Science Fiction – General, Fiction / Science Fiction / General, Fiction, Fiction – Science Fiction, Science Fiction

We could just show a list of concepts, like:

Science Fiction, General and Fiction

…instead, and turn each of those terms into links, which take you through to a page that can show all books with the same subjects.

Similarly…

Probability & statistics, Probabilities, Mathematics, Science/Mathematics, Probability & Statistics – General, Mathematics / Statistics

… could be consolidated into Probability, Statistics, Probabilities, Mathematics, Science and that pesky “General” subject. People are good at reading collections of words in a list and understanding the concepts of the list, we think. It’s almost more difficult to parse the variants as you see above, with all their repeats and the use of characters to indicate some sort of hierarchy.

So, we’re going to try that (but not delete the LCSHs, of course).
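As a rough illustration of the consolidation described above, here is a minimal sketch. The set of separator characters is an assumption based only on the two examples in this post, and `consolidate_subjects` is a hypothetical helper, not part of Open Library. It splits a raw subject string into terms and keeps each term once, in first-seen order, ignoring case.

```python
import re

# Assumed separators: comma, slash, ampersand, and a hyphen or dash
# with spaces on both sides (so "Non-fiction" would stay whole).
SEPARATORS = re.compile(r"\s*[,/&]\s*|\s+[-\u2013\u2014]+\s+")


def consolidate_subjects(raw):
    """Split a raw subject string into unique terms, keeping first-seen order."""
    seen = {}
    for term in SEPARATORS.split(raw):
        term = term.strip()
        if term:
            # Compare case-insensitively, but keep the first spelling we saw
            seen.setdefault(term.lower(), term)
    return list(seen.values())


raw = "Science Fiction – General, Fiction / Science Fiction / General, Fiction, Fiction – Science Fiction, Science Fiction"
print(consolidate_subjects(raw))
# → ['Science Fiction', 'General', 'Fiction']
```

The original strings are only read, never modified, so the source headings stay intact alongside the flattened list of concepts.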

Links, links, links
The key interface into the current catalog is a search box: essential if you know what you’re looking for, but useless for browsing. We’re going to introduce new navigation elements into the site that will help people dive into the catalog and bounce around. Certainly, we’ll still have search (much improved, upgraded to SOLR 1.4), but, as we think about that high bounce rate, we want to help people hop around the catalog instead of coming and going so quickly. To borrow a phrase from Tom Coates, we are constructing a new view to the catalog to represent it as a web of data instead of discrete records. The more connections we can create between records, the richer that browsing experience can be.

From a linked data perspective, we also want to introduce the ability for people to connect our records with many more systems online. Right now, you can assign up to 6 identifiers to Open Library edition records: ISBN (10 & 13), Library of Congress Control Number (LCCN), Library of Congress Classification System (LC), Internet Archive and OCLC. These IDs are certainly valuable, and in deep circulation in library catalogs around the world, buuuut… there are loads of other bookish sites out there on the web that also have wonderful, rich information about books that we’d like to connect to. Examples include Goodreads, LibraryThing and Zotero, amongst others; really any resources that people think are useful! The idea is to stop worrying about a canonical identifier and simply try to accumulate as many identifiers as we can. This idea will take a while to bear fruit, but it works on the premise that we have a new opportunity in cataloging now: to place books in a network instead of on a shelf.

Similarly, we would like to collect links to other sites that are relevant to a certain book or author. Did you know Alain de Botton has a Twitter account? Sites like The Guardian, Flashlight Worthy or the New York Review of Books have incredibly rich information about books and authors that would be wonderful to connect with from the Open Library catalog.

We’re also excited about the role Open Library can play in the new Book Server initiative that was launched by the Internet Archive in October this year:

The BookServer is a growing open architecture for vending and lending digital books over the Internet. Built on open catalog and open book formats, the BookServer model allows a wide network of publishers, booksellers, libraries, and even authors to make their catalogs of books available directly to readers through their laptops, phones, netbooks, or dedicated reading devices.

The basic idea is that publishers can publish a list of any/all epubs in their catalog to be aggregated by other services online. Open Library could be one of those aggregators. We hope to show a real time representation of whatever we can aggregate, so when you look for individual books, you see a live list of where you can get your hands on the document, whether for purchase or download. After all, isn’t the job of a library to get people to books?

Librarianship as the Foundation of Open Library
Open Library’s mission has always been to build a page on the web for every book ever published. We have only been able to start achieving that mission on the shoulders of the work of librarians. While it’s possible (and encouraged) for people to add new records for books we don’t know about yet, the vast majority of our records come directly from library catalogs.

The opportunity we have now is to help interested contributors enrich these records. People who love a particular book, or who have some knowledge in a particular subject area, or who enjoy correcting typos, or who like to make sure all the boxes are filled in, or who have a photo of a book they’ve read, or who found a great review of a book on another site can all contribute information to the Open Library. As Tim Spalding, founder of LibraryThing, noted in his Social Cataloging talk presented in New Zealand this October, nobody’s quite sure where this “social cataloging” might go, or when it might become useful to librarians in a cataloging sense. What we do know is that there’s a lot of knowledge out there on the web about books, and we want to make Open Library a place where people can contribute any amount, no matter how small, to make the catalog more useful.

The Open Library is an amazing resource, and now it’s time to take it to the next level. Yeah!

By the way, we’re looking for at least one senior web developer to join the team too, so if you’d like to join a small team doing interesting things with library catalogs, APIs and SOLR on an extensible wiki-editable platform built in Python, and you live close to San Francisco or would move here, please drop a line to info@.

Note: I did a small copy edit Dec 4. Pretty sure I didn’t remove anything of substance.

20 Responses to “An update on Open Library”

  1. I work for a library consortium as Systems Administrator for the LAN, Web Server, email Server and Digital library server and collections. I have been putting a lot of materials up on the Internet Archive and am not pleased with the pace of collection page development. I am willing and could contribute a lot to the development of Collections pages and have some ideas for using the Open Library as a Catalog for small to medium sized libraries in my area who do not have the funds to pay for OPACS. Union catalogs as subsets of the Open Library would be very useful for encouraging the participation of small libraries. I would be willing to volunteer my time to help in the development process, but it does not seem like Internet Archive or Open Library are willing to give administrative Access to non-employees. This is really a sub-optimum way to drive the development of an organization. I am near retirement and would be willing to work full-time on such a project, but it seems that there is no interest in having others work in the community in other than a paid capacity.

    Gerry


    Gerard Arthus
    Systems Administrator:
    Long Island library Resources Council
    627 N. Sunrise Service Road
    Medford, New York
    USA 11713
    Office: 631-675-1563
    Fax: 631-675-1573
    Home: 631-289-7565
    Cell: 631-335-5250

    Professor:
    Graduate Computer Science/Engineering Management Program/Environmental Science
    Long Island University/CW Post Campus
    720 Northern Boulevard
    Brookville, New York
    11548-1300

  2. Martin says:

    Wow … so much good stuff in this post, hard to make just one comment and keep it from turning into a piece as long as the post. So I’ll limit this comment to the identifiers portion of the post.

    I’m all for multiple identifiers, in fact, I like to term them “Yet Another Digital Identifier” (or YADI). The more the better.

    An added point, in some of the identifiers you note (specifically the ISBN/ISSN and LC classification; and to some extent the LCCN), there is also interesting semantic meaning that could be used for data mining. In the case of the ISBN, it is, of course, the name of the publisher encoded in the number. In the case of the LC class number, it’s a very specific subject area; and the LCCN gives you, well, the year it was cataloged (I said meaning, not the meaning of life!).

    Looking forward to more news from OL!

  3. Kolja21 says:

    Sounds like an exciting future. I only miss two points:

    When will the beta-status of Open Library end?

    How can Open Library improve the authority control for authors? Today one author is listed under different names and spellings. There is no determination between a name (John Smith) and a person (John Smith, 1900-1970, French engineer).

    Probably it would help to use authority records like PND and LCCN and/or to increase the number of links to Wikipedia with a help of a bot.

    Greetings from Germany, Kolja

  4. MJ Suhonos says:

    Great news! Lots of very exciting stuff coming down the pipe.

    A short comment on subjects (from a librarian, but not necessarily an expert on classification) — when you say “separated by all manner of different characters. To the human eye, it seems like duplication of effort”, what you’re actually seeing are the limitations of pre-digital encoding of a subject hierarchy (and subdivisions).

    For example, the subjects you list above are intended to show that an item is classified as containing “science fiction”, which is a subclass of “fiction”; and at the same time, it fits into the subdivision of “general science fiction”, as opposed to, say, “star trek fiction” (which is a real LCSH heading) or “canadian science fiction” (which would be geographic subdivision). So it’s not quite the same as the “messy data list of concepts” you describe.

    Now that we don’t have to try to squeeze all this information into a printed card, a better approach is probably to use linked data to tie the item to the LC subject authority, e.g. using id.loc.gov:

    http://id.loc.gov/authorities/sh85118629#concept

    As Kolja raises, this linked data approach could be used for author authorities as well, assuming LOC or someone else exposed such a service.

    There’s lots of discussion of these issues on the NGC4lib mailing list, and definitely Open Library could benefit from being among the first large databases to use linked data for subjects. Not to mention the impact this would have on facilitating browsing (which was a “nice side-effect” from subject ordering anyway). But I’ll stop there. ;-)

  5. cjordan62 says:

    well done to the open library team. I’ve been enjoying the way the Trove catalogue at the National Library of Australia groups editions into works and lets us (the public!) change these groupings. The more these sites can cooperate, the better!

  6. George Oates says:

    @Gerry – I’m sorry to hear that you’re not happy with your IA experience. Not to pass the buck, but Open Library doesn’t have too much to do with Collections on archive.org on a day-to-day basis.

    I’m glad to hear you have ideas around smaller libraries using Open Library to represent their catalogs. As I mentioned, this has been an idea surrounding Open Library for some time now, but we haven’t managed to make it real yet. It would seem prudent for us to look to work with an existing online ILS like Koha, for example, to help manage the transition of records. I think it’s important not for Open Library to have too broad a scope (like delving into the world of circulation and other library systems not intimately tied to catalog management), but to find partners to collaborate with.

    I take your point about participation in development being difficult (impossible?) too. Even though I’m not a programmer myself, I understand that the Open Library’s underlying system is elegant, yet fairly obscure with a steep learning curve. We could be doing a better job of making this easier, for sure. Please be assured it’s not for lack of interest! We’d love to have a thriving community help contribute to Open Library dev! It would be really useful to hear where the current set up breaks down for you. I don’t have the skills to install and run Open Library myself, but I might be able to transform your feedback into making the overall process clearer…

    I’ll drop you an email to see if I can put you in touch with someone who might help with IA questions, and I’d love to hear your ideas about small libraries getting involved more deeply with Open Library.

  7. George Oates says:

    @Gerry – Gah! Just realised there’s no email address for you – Please email me? glo@… Would love to talk more.

  8. George Oates says:

    @Martin – Yep – the identifiers are a goldmine! Edward Betts (on the Open Library team) has already poked at ISBNs to reveal publisher codes. See http://blog.openlibrary.org/2009/07/20/isbn-publisher-codes/

    It would be awesome to have a little tool which expands these codes back into information. It seems like we have a lot of records with an LCCN but without the accompanying text that describes what the LCCN is referring to. We’ve definitely got the big list of LCSH from http://id.loc.gov/authorities/search/, full of RDF snippets like this:

    http://id.loc.gov/authorities/sh95004917.rdf

    … and we’re very interested in exploring the relatedness between terms, so we might be able to say something like “Virginia Woolf is related to Leonard Woolf, Duncan Grant, Vita Sackville-West, Bloomsbury, England and cups of tea, and we have 24 books about her.” (or whatever).

    Please, feel free to point me at resources I probably don’t know about! This really is uncharted water for me.

  9. George Oates says:

    @Kolja21 –

    > When will the beta-status of Open Library end?

    Beta is weird, isn’t it? I think it has been useful for sites over the past 4-5 years because it gives a way for people to say “there are probably bugs,” which relieves pressure.

    We haven’t quite decided yet, but, I don’t see any particular reason to keep a beta label on Open Library anymore. While we would certainly plan for the new release to be bug free, that seems virtually impossible in my experience. We just need to make sure that there aren’t any Super Bugs in the key processes that will literally halt usage.

    Just between you and me, I’m also betting that the redesign won’t be right first time, and that there’ll be a ton of feedback (and probably criticism) of the approach we’re trying to take. So, it might not be that we’ll be fixing bugs as much as adapting features :)

    > How can Open Library improve the authority control for authors?
    > Today one author is listed under different names and spellings.

    Good question. I think the reason there’s so much variation in the system today is because we’ve received records from so many different sources. I suppose you could rely on a single catalog using the same data entry format for a name, but to expect that to translate into every other catalog seems impossible.

    A lot of our author records also have close to zero metadata, like a birthdate, which makes them very difficult to disambiguate.

    We also have a lot of fields for authors about names (however empty or unclear they are currently): personal name, title, alternate name, fuller name etc etc. I’m proposing to the team that we only use two fields for author name: Name, and Alternate Names. We can use Alternate Names for any/all variants both in the names themselves – pseudonyms and such – and the order the name is written. Even though it feels like we might be de-normalizing the data, I think we might be OK.

    Here’s an example we’ve used to talk about:

    Name: Arthur Conan Doyle
    Alternative names:
    Arthur Conan Doyle DL
    Arthur Ignatius Conan Doyle
    Arthur Ignatius Conan Doyle DL
    Doyle, Arthur Conan
    Doyle, Arthur Ignatius Conan
    Doyle, Arthur Conan DL
    Doyle, Arthur Conan Sir
    Doyle, Arthur Conan Sir DL
    Doyle, Sir Arthur Conan
    Doyle, Sir Arthur Conan DL
    Sir Arthur Conan Doyle
    Sir Arthur Conan Doyle DL
    Sir Arthur Ignatius Conan Doyle
    Sir Arthur Ignatius Conan Doyle DL
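A minimal sketch of how a two-field model like this could be used for lookups. The dictionary shape and the `name` / `alternate_names` keys are assumptions for illustration, not Open Library’s actual schema. Every spelling, whether the main Name or one of the Alternate Names, is indexed back to the same author record.

```python
def normalize(name):
    """Lowercase and collapse whitespace so lookups ignore case and spacing."""
    return " ".join(name.lower().split())


def build_name_index(authors):
    """Index every Name and Alternate Name back to its author record."""
    index = {}
    for author in authors:
        for name in [author["name"], *author["alternate_names"]]:
            # First record to claim a spelling keeps it
            index.setdefault(normalize(name), author)
    return index


# Hypothetical record using a few of the variants listed above
authors = [
    {
        "name": "Arthur Conan Doyle",
        "alternate_names": [
            "Doyle, Arthur Conan",
            "Doyle, Sir Arthur Conan",
            "Sir Arthur Ignatius Conan Doyle",
        ],
    }
]
index = build_name_index(authors)
match = index.get(normalize("doyle,  sir arthur conan"))
```

Note that this only finds spellings somebody has already recorded; it does nothing to tell two different people named John Smith apart, which still needs dates or an external authority identifier.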

    > Probably it would help to use authority records like PND and LCCN
    > and/or to increase the number of links to Wikipedia with a help of a
    > bot.

    Definitely. We’ve also been experimenting with systems like Open Calais and AlchemyAPI to see what we might be able to glean from the Internet Archive’s full text corpus, but nothing to show from that just yet.

    I don’t suppose you have a spare Normdaten-DVD-ROM lying around, do you?

    ;)

  10. George Oates says:

    @MJ Suhonos – Yep, I see that the subject headings represent a hierarchy. It was more a comment on the fact that you can have two remarkably similar subject headings, like: Fiction/Science Fiction and Fiction–Science Fiction.

    As Karen Coyle (who advises the Open Library team on all things library metadata related) noted for me, “Basically, LCSH isn’t about the world, it’s about books. So LC creates a subject when a book is written about it. if no one has written about brie, then there’s no entry.”

    I find that fascinating. And curious that the average number of subjects assigned to any one book is probably less than 5. (That’s a guess, of course.)

    > As Kolja raises, this linked data approach could be used for author
    > authorities as well, assuming LOC or someone else exposed such a
    > service.

    Definitely. Actually, people can download all the author records on Open Library today – if anyone feels like having a play, you’re absolutely welcome! http://openlibrary.org/dev/docs/jsondump (Author file is 209MB.) I’m ashamed to admit I’ve never actually looked at that myself!

    It’s downloading now :)

  11. George Oates says:

    @cjordan62 – I was thrilled to see the NLA Open Library integration! (I’m Australian.)

    http://trove.nla.gov.au/

    On a search results page, you can see links to Open Library, Hathi Trust, and Google Preview:

    http://trove.nla.gov.au/result?q=moby+dick

    Great stuff!

  12. Kolja21 says:

    @George
    > I don’t suppose you have a spare Normdaten-DVD-ROM lying around, do you?

    Unfortunately not. I would love to have one. But the German Wikipedia has got 92.200 PNDs connected with (biographical-)articles.

    Besides: it would be a great improvement if one could link the names in the field “contributions” with an author.

  13. [...] Finder, mentioned in my last post, is not the only example of a work-oriented catalog.  A recent status report from the OpenLibrary Project indicates that they are moving to make their catalog work-oriented as [...]

  14. MJ Suhonos says:

    @George:

    >And curious that the average number of subjects assigned to any one book is probably less than 5. (That’s a guess, of course.)

    I think you’ll find the average is closer to 3 — why?

    1) It’s hard for a single cataloguer to assign more than a handful of subjects without spending excess time and effort. (a great argument for social cataloguing à la LibraryThing)

    2) Historically, it’s the number of subjects that could reasonably fit on a typewritten 7.5 x 12.5 cm catalogue card. :-)

  15. zim says:

    Open Library uses the Gnu Bookreader, and i am trying to find documentation, or a support forum for the open-source Open Library GnuBookreader.

    I am trying to develop an online php application that utilizes the Bookreader and have a ton of questions, but cannot find anywhere to ask them. The readme.txt file is literally one line: openlibrary book reader, so i am getting no directions there; can anybody point me to anywhere i can get code development help with the Bookreader?

  16. George Oates says:

    Hi Zim,

    There’s a page on Open Library about the Internet Archive Bookreader:

    http://openlibrary.org/dev/docs/bookreader

    Try that, and please use the channels outlined there if you need further help.

    Cheers.

  17. pk says:

    Quick Question: When a user enters ISBN-10 for a particular book, doesn’t it make sense to add a ISBN-13 for that book automatically? This would save some data entry for many people who use OL. Just a suggestion.

  18. [...] Comments pk on An update on Open LibraryGeorge Oates on An update on Open Libraryzim on An update on Open LibraryMJ Suhonos on An update on [...]

  19. George Oates says:

    @pk – Sorry for the slow reply – yes, that’s a good idea. We’ve had a poke around ISBN 10 –> 13 converters.

    http://www.isbn-13.info/

    You’re right. It would be handy. Just another thing on our list!

    Actually, this seems like the sort of plug-in that someone else out there on the interwebs might like to contribute to Open Library…

    Any takers?
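For anyone tempted, the conversion itself is small. This is a sketch of the standard ISBN-10 to ISBN-13 rule (prefix 978, drop the old check digit, compute a new one), not code from Open Library, and `isbn10_to_isbn13` is a hypothetical helper name. It assumes the input is a well-formed ISBN-10 and does not verify the incoming check digit.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to ISBN-13 by prefixing 978 and recomputing the check digit."""
    digits = isbn10.replace("-", "").replace(" ", "")
    if len(digits) != 10 or not digits[:9].isdigit():
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    core = "978" + digits[:9]  # drop the old check digit (which may be "X")
    # ISBN-13 weights alternate 1, 3, 1, 3, ... across the first 12 digits
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


print(isbn10_to_isbn13("0-306-40615-2"))
# → 9780306406157
```

Because the result is fully determined by the first nine digits, the ISBN-13 could be filled in automatically whenever an ISBN-10 is entered.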

  20. [...] we mentioned in two previous blog posts [1][2], the main features of the new design [...]