Having worked more closely with bibliographic data than I had ever expected to over the last couple of years, I still can’t quite believe how complicated it can be. I keep holding tight to something Karen Coyle told me when I first started at Open Library: that “library metadata is diabolically rational.” Now that I’ve seen cataloging from lots of different sources and am more familiar with the level of detail that’s possible in a library catalog, I have a new fondness for these intensely variegated information systems; at times devilishly detailed, at others wildly incomplete or arcanely abbreviated. Everyone likes to arrange things and classify them into groups. It’s when you try to get people to put things into groups that someone else has come up with that it starts getting messy.
At Open Library, we’re attempting to ingest catalog data from, well, everywhere. Every “dialect” of cataloging practice makes this mass consumption harder. In spite of the rational goal of standardized data entry, there is an intense diffusion of practice. (Have a look at Seeing Standards: A Visualization of the Metadata Universe by Jenn Riley and Devin Becker if you haven’t already.)
A challenge I think we face today is a metastasized level of complexity, particularly as we begin to catalog works that have no physical form and exist only electronically. Any challenge presents an opportunity, and the opportunity here is to radically simplify the way things are represented in catalogs.
In February, I gave a presentation at an API workshop held at the Maryland Institute for Technology in the Humanities (MITH). I talked about Open Library and paid particular attention to the resources we’re trying to put in place for developers to hook into the system.
Part of the presentation was an impromptu survey of the audience: I passed an index card to everyone and asked people to write down the five fields they thought were adequate to describe a book. I framed the survey as a search for a “minimum viable record,” and it was fascinating to watch the audience squirm a bit as they asked for more guidance on the challenge. Can fields repeat? What’s the audience for this description? And so on.
I’ve collated the results of the forty or so respondents into an ugly spreadsheet. There are 4 sheets, linked in the green strip at the bottom of the page:
- Book Raw – unfiltered results, in the order they were written
- Book Cooked V1 – all results blended, sorted alphabetically
- Book Merged – all results grouped
- Summary – with counts and a graph!
Here’s the final result, summarized in the counts and graph on the Summary sheet.
So, standing on the shoulders of the “minimum viable product” (a way for web application developers to get working code deployed quickly and effectively), I wonder if it’s time to put a “minimum viable record” in place for bibliographic systems. Enough detail for a computer to match, correlate and compare, but not so much that having to process each record stops everything in its tracks.
You might have heard of the Open Publication Distribution System (OPDS) Catalog specification, which is a syndication format for electronic publications. Certainly, this new standard is a great step towards simpler representations of books — in this case, OPDS was initially designed to represent eBooks specifically — but I find myself wondering if it could be reduced further still, to pave the way for even easier exchange between systems. (Please note that all our edition records are now available in OPDS format, as well as RDF and JSON.)
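If you want to poke at those records from code, here’s a minimal sketch of fetching one edition as JSON. The OLID is only an example, and the URL pattern and field names shown are assumptions about how our machine-readable records are exposed rather than a formal API reference:

```python
import json
import urllib.request

# Minimal sketch: fetch one edition record as JSON and read a couple of
# fields. The OLID is only an example; the URL pattern and field names
# are assumptions, not a formal API reference.
olid = "OL7353617M"
url = "https://openlibrary.org/books/%s.json" % olid
with urllib.request.urlopen(url) as response:
    record = json.load(response)

print(record.get("title"), record.get("publish_date"))
```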
Something like Title, Author, Date, Subject[s] and Identifier[s] might just do the trick, though it is of course irresistibly debatable. It’s an idea we’re going to look to as we work on our new Write API for Open Library. This minimum viable record will play gatekeeper for any new records we ingest (or that you export).
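To make that concrete, here’s a rough sketch of what such a record and its gatekeeper check might look like. The field names and the rule itself are purely illustrative; this isn’t the actual Write API schema, which is still being worked out:

```python
# Illustrative sketch only, not the actual Open Library Write API schema.
# The five suggested fields, with subjects and identifiers repeatable.
REQUIRED = ("title", "author", "date")
REPEATABLE = ("subjects", "identifiers")

def is_minimum_viable(record):
    """Gatekeeper check: every required field present and non-empty,
    plus at least one subject or identifier to hang matching on."""
    if any(not record.get(field) for field in REQUIRED):
        return False
    return any(record.get(field) for field in REPEATABLE)

example = {
    "title": "The Wind in the Willows",
    "author": "Kenneth Grahame",
    "date": "1908",
    "subjects": ["Fiction"],
    "identifiers": ["olid:OL12345M"],  # placeholder, not a real identifier
}
assert is_minimum_viable(example)
```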
What do you think of this minimum viable blog post?
George, we’ve been doing some investigating in this area, largely by seeing what MARC tags have actually been used. I recently posted about this work, which was actually performed by a working group headed up by my colleague Karen Smith-Yoshimura. My post might be seen as a gentle introduction to that work, but I wouldn’t stop there. The full report by her working group (to which my blog post points) is likely to be very helpful, and it has the actual evidence that supports the assertions. Start here: http://hangingtogether.org/?p=834
Thanks, Roy – good to know there are others thinking that this might be an issue. Nice to have the support of a more rigorous, data-driven survey too.
I reckon you could ditch the subject[s] field – they’re essentially arbitrary and open to interpretation anyway, so you can leave that kind of stuff to the crowd/some future process.
Date feels like a nice-to-have. It’s useful for disambiguation and occasionally adds a bit of context, but most of the time you don’t care about the exact year or date.
So I’ve got your minimum viable fields list down to 3! 🙂
F
That’s an interesting point, about date. I’ve noticed quite a few dummy dates like ‘9999’ in Open Library data. I bet $10 they’re caused by a required field in some software somewhere 🙂
Date is important for copyright status.
Subject is the Dewey or CI/SfB classification.
I guess you can do without these. In fact, if all you need is disambiguation then you don’t really need more than the ISBN. On Wikipedia, an ISBN is all you need to link to a wide variety of book catalogs:
http://en.wikipedia.org/wiki/Special:BookSources/
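The pattern is just that BookSources path with an ISBN tacked on the end; a trivial sketch (the ISBN is only an example):

```python
def booksources_url(isbn):
    """Build a Wikipedia Special:BookSources link for an ISBN,
    stripping hyphens and spaces for tidiness."""
    clean = isbn.replace("-", "").replace(" ", "")
    return "http://en.wikipedia.org/wiki/Special:BookSources/" + clean

print(booksources_url("978-0-306-40615-7"))  # example ISBN
```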
Thanks, Joe. But of course, as useful as it can be, the trouble with reliance on ISBN as the one true identifier is that not every book has one. (It was only published as an international standard in 1970.) I’m with Frankie too, that the subject doesn’t have to be a formal classification.
Another problem with relying on ISBN is that the numbers are not truly unique. Some publishers purchase a range of numbers and then re-use them.
There is no such thing as “the one true identifier”, no matter if you choose ISBN or any other system. By the way, the title can also be used as an identifier of a book. It all depends on what you mean by “a book” and in which context you want to identify it. I wonder why you did not mention Dublin Core, which obviously has a very similar motivation.
Absolutely agree that there is no One True Identifier – that’s why I’m suggesting the approach of building a network of possible identifiers instead. The hope is that eventually people can query Open Library (or other systems that share this network) using identifiers they are already familiar with, instead of needing awareness of, say, Open Library’s data structure or identifiers.
Certainly, Dublin Core comes from a very similar motivation. I noted DC in an interview with Audrey Watters on the O’Reilly Radar site last week. The curious thing about DC is that it has been unable to resist the subdivision & splintering that comes with any information system.
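As a small illustration of that, here’s a sketch of asking Open Library about a book via an identifier you already know (an ISBN in this case) using the Books API; the parameters and the example ISBN are illustrative, not a full reference:

```python
import json
import urllib.request

# Sketch: look a book up by an identifier the caller already knows
# (an ISBN here) via the Books API. Parameters and ISBN are illustrative.
isbn = "0451526538"
url = ("https://openlibrary.org/api/books"
       "?bibkeys=ISBN:" + isbn + "&format=json&jscmd=data")
with urllib.request.urlopen(url) as response:
    data = json.load(response)

entry = data.get("ISBN:" + isbn, {})
print(entry.get("title"), entry.get("publish_date"))
```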
9999 is common in serials MARC records (008 field) to signify that the serial is still being published, e.g. start date 1976, end date 9999 for a periodical that started in 1976 and shows no sign of stopping.
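Those dates sit at fixed character positions in the 008, so reading them is simple; a tiny sketch (the example string is abbreviated and illustrative):

```python
def serial_dates(field_008):
    """Read the two date slots from a serials 008 field: characters 7-10
    hold the start date, 11-14 the end date; '9999' means still published."""
    start = field_008[7:11]
    end = field_008[11:15]
    return start, (None if end == "9999" else end)

# Abbreviated, illustrative 008 for a periodical that began in 1976
# and shows no sign of stopping.
print(serial_dates("760101c19769999xxu"))  # -> ('1976', None)
```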
Date matters. 1611, 1811 or 2011? Date of original publication (when was Homer, or Archimedes, or Ptolemy originally published?) or latest revision, or latest translation?
This is helpful, George. I’m kicking around the idea of an OpenBookShelf for WordPress, capable of drawing from Open Library and other sources. A minimum viable record would be part of the core design. Maybe we’ll talk more on this.
Yes, John. Let’s talk more… it’s always useful to have real bodies on the other side of work like this. And, we love you!
I agree 100% with all you’re saying! You put your finger on some issues library cataloging is facing: can we sustain the current practice? Shouldn’t we rather focus on a few high-value fields? I certainly look forward to reading more about your ideas and experiences in this area.
I read a lot, across subject areas, across fiction genres and across time periods. Even if I’m only reading light fiction, I prefer to have some idea of the date it was published, for context. Occasionally I’ve picked up older e-books without enough data and I’ve had to go check a catalogue anyway to figure it out. If I were only reading the latest releases it wouldn’t matter, and I suppose there are people who do that. And frankly, try reading a lot of non-fiction without having subject access to the mass of information out there! Any sort of subject will help, although yes, of course it is open to interpretation and inaccuracies.
The identifier / ISBN is of most use to those for whom editions matter, and that isn’t me, but I’d hate to be studying and have all that specific information disappear. It’s also useful when people can’t remember a title properly, and that includes staff you are requesting books from who, no matter how slowly I went, couldn’t write down what I was spelling.
Publisher data for me is mostly about place context. Hopefully the language will be obvious from the title. I’d like to know if it’s an e-book, but if it is, I suppose there’d be a link to make it obvious. The description will tell you if you can read it in a day or a month, and if you can carry it in your hand or need a trolley! Cover: I don’t care what a cover looks like, but if there’s a significant variation in the title on the cover, that is useful. Availability goes out of date very quickly. Edition: see identifier. Fiction / non-fiction: well, if you use subjects, that will be obvious! Illustrations are of minor usefulness to me, but:
the whole problem with this debate around quality of cataloguing is that it’s about cataloguing for the majority. That isn’t fair to the minority. Of any type or need.
And what’s with smell??????
I completely agree about date needed for context of older fiction. However, what I care about is the original date, not the date of this printing (although this can be important as well). The original date is often not included in a record. I also use the date to assess scholarly work. When a book about the Civil War or astronomy (for example) was published will help you assess the book.
I was directed to the “arXiv meta-data format” by Jerome McDonough, via Twitter. Similar idea, though I’d suggest that some fields be repeatable, in particular, IDENTIFIER. I wonder what the uptake is like…
Viability depends on usage scenarios, of course. But I will say that when I rolled out the Online Books Page database in 1994, all that was in my records was title, author(s) (with one of 5 author rules), and URLs (annotated with format and name of site).
I think there might have been an unstructured note early on as well (as there still is).
Later on, I added LC call number, persistent record ID, and eventually LCSH subject. The note field often, but not always, included date and place of publication, and publisher, though not in a structured form. More recently, language has been added for non-English titles, not yet in structured form.
The author fields also acquired subfields for linkages with external databases (such as the Celebration of Women Writers) and for informal name display (when the usual de-inversion of Lastname, Firstname doesn’t look right).
More specialized fields have also been added for serials and works (as distinct from editions/manifestations).
So it’s been a slow accretion of attributes to meet the needs of my catalog; but it started quite simple (even omitting things like date), and is fairly simple still. I don’t know how generally this approach will work, but I thought you might find it of interest.
ISBN works as an identifier a lot of the time, and is essential. However, ISBN assignment is done by publishers. Some publishers (I’ve noticed this especially with publications from parts of Europe and Latin America that I’ve worked with) reuse ISBNs.
The minimum elements totally depend on the user tasks – if all you want is inventory, then an identifier (whether one that comes with the item or is assigned locally) is all that’s needed. On the other hand if you want to fulfill the FRBR and/or FRAD user tasks, you need the basic elements those models identified (that are also stated in RDA as “core” elements).
MARC analyses will only tell us past practice, which may or may not be helpful guidance.
That’s a good point, Barbara, about inventory or not. The use here, for me at least, is about computers being able to exchange awareness of inventory in a lightweight way, something akin to a handshake. “I have this book. Do you have it too?” Once that question is answered, you can interrogate for more information. Maybe an MVR is just about establishing that hook.
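To make the handshake a little more concrete, here’s a sketch of the kind of lightweight comparison I mean; the field names and matching rules are illustrative only, not a spec:

```python
# Sketch of the handshake: do two minimum viable records plausibly
# describe the same book? Field names and rules are illustrative only.
def same_book(mine, theirs):
    # Any shared identifier settles it immediately.
    if set(mine.get("identifiers", [])) & set(theirs.get("identifiers", [])):
        return True

    # Otherwise fall back to a loose title + author + date comparison.
    def norm(value):
        return " ".join(str(value or "").lower().split())

    return all(norm(mine.get(f)) == norm(theirs.get(f))
               for f in ("title", "author", "date"))
```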
But you can only identify one item with another if you have an adequate description, and that is a harder task than one might think, even if two people agree on what elements allow one to say “too.” That is why we have different cataloging rules and why descriptive bibliographers are not the same as catalogers. Certainly more than 5 elements are needed. The list of elements is interesting, but undoubtedly biased by your audience (e.g., I doubt that most library users would place an identifier so high).
At this point in time, identifiers can represent more than one manifestation (ISBNs, anyone?), and more than one identifier of the same type is frequently applied to the same manifestation (e.g. duplicate OCLC numbers).
“if you have an adequate description” – what do you think that is, Larry?