Having worked more closely with bibliographic data over the last couple of years than I had ever expected to, I still can’t quite believe how complicated it can be. I keep holding tight to something Karen Coyle told me when I first started at Open Library: that “library metadata is diabolically rational.” Now that I’ve seen cataloging from lots of different sources and am more familiar with the level of detail that’s possible in a library catalog, I have a new fondness for these intensely variegated information systems; at times devilishly detailed, at others wildly incomplete or arcanely abbreviated. Everyone likes to arrange things and classify them into groups. It’s when you try to get people to put things into groups that someone else has come up with that it starts getting messy.
At Open Library, we’re attempting to ingest catalog data from, well, everywhere. Every “dialect” of cataloging practice makes this mass consumption harder. In spite of the rational goal of standardized data entry, there is an intense diffusion of practice. (Have a look at Seeing Standards: A Visualization of the Metadata Universe by Jenn Riley and Devin Becker if you haven’t already.)
A challenge I think we face today is a metastasized level of complexity, particularly as we begin to catalog works that have no physical form, but only exist electronically. Any challenge presents opportunity, and the opportunity here is to radically simplify the way things are represented in catalogs.
In February, I gave a presentation at the API Workshop held at the Maryland Institute for Technology in the Humanities (MITH). I talked about Open Library and paid particular attention to the resources we’re trying to put in place for developers to hook into the system.
Part of the presentation was an impromptu survey: I handed an index card to everyone in the audience and asked them to write down the five fields they thought were adequate to describe a book. I framed the survey as a search for a “minimum viable record,” and it was fascinating to watch the audience squirm a bit as they asked for more guidance on the challenge. Can fields repeat? Who’s the audience for this description? And so on.
I’ve collated the results of the forty or so respondents into an ugly spreadsheet. There are 4 sheets, linked in the green strip at the bottom of the page:
- Book Raw – unfiltered results, in the order they were written
- Book Cooked V1 – all results blended, sorted alphabetically
- Book Merged – all results grouped
- Summary – with counts and a graph!
Here’s the final result:
So, on the shoulders of “minimum viable product”, a way for web application developers to get working code deployed quickly and effectively, I wonder if it’s time to put a “minimum viable record” in place for bibliographic systems. Enough detail for a computer to match, correlate and compare, but not so much that processing each record stops everything in its tracks.
You might have heard of the Open Publication Distribution System (OPDS) Catalog specification, which is a syndication format for electronic publications. Certainly, this new standard is a great step towards simpler representations of books — in this case, OPDS was initially designed to represent eBooks specifically — but I find myself wondering if it could be reduced further still, to pave the way for even easier exchange between systems. (Please note that all our edition records are now available in OPDS format, as well as RDF and JSON.)
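For a sense of what “simpler” looks like in practice, here’s a rough sketch of the handful of Atom elements a bare-bones OPDS catalog entry carries, built with nothing but the Python standard library. The title, URLs, and identifier are placeholders of my own invention, and this is an illustration of the format’s general shape, not a validated feed.

```python
# Sketch of a minimal OPDS (Atom-based) catalog entry.
# All values below are placeholders, not real Open Library data.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize Atom as the default namespace

entry = ET.Element(f"{{{ATOM}}}entry")
ET.SubElement(entry, f"{{{ATOM}}}id").text = "urn:uuid:00000000-0000-0000-0000-000000000000"
ET.SubElement(entry, f"{{{ATOM}}}title").text = "An Example Edition"
ET.SubElement(entry, f"{{{ATOM}}}updated").text = "2011-03-01T00:00:00Z"

author = ET.SubElement(entry, f"{{{ATOM}}}author")
ET.SubElement(author, f"{{{ATOM}}}name").text = "A. N. Author"

# An acquisition link is what makes the entry useful to a reading system.
ET.SubElement(entry, f"{{{ATOM}}}link", {
    "rel": "http://opds-spec.org/acquisition",
    "href": "/books/example.epub",
    "type": "application/epub+zip",
})

xml = ET.tostring(entry, encoding="unicode")
```

Even this stripped-down entry still carries more machinery (IDs, timestamps, link relations) than the five-or-so fields the index cards converged on, which is the gap I’m wondering about.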
Something like Title, Author, Date, Subject[s] and Identifier[s] might just do the trick, though it is of course irresistibly debatable. It’s an idea we’re going to look to as we work on our new Write API for Open Library. This minimum viable record will play gatekeeper for any new records we ingest (or that you export).
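To make the gatekeeper idea concrete, here’s a minimal sketch of what such a check could look like, assuming those five fields; the field names, shapes, and sample values are illustrative guesses on my part, not the Open Library schema or the actual Write API.

```python
# A sketch of a "minimum viable record" check. Field names and the
# sample record are illustrative placeholders, not a real schema.
from typing import Any

REQUIRED_FIELDS = ("title", "author", "date", "subjects", "identifiers")

def is_minimum_viable(record: dict[str, Any]) -> bool:
    """True if the record has a non-empty value for every required field."""
    return all(record.get(field) for field in REQUIRED_FIELDS)

record = {
    "title": "An Example Title",
    "author": "A. N. Author",
    "date": "1908",
    "subjects": ["Fiction"],
    "identifiers": {"isbn_10": ["0000000000"]},  # placeholder value
}
```

A gate this small is trivial for any contributor to satisfy, while still giving a matching process something to correlate and compare on.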
What do you think of this minimum viable blog post?