Meet oclcBot. He was written by Bruce Washburn at OCLC Research to help connect Open Library records to Worldcat.org. He’s just finished updating almost 4 million Open Library editions with links! No metadata exchange at all, except these identifiers. Tiny, but powerful, because those identifiers let systems that “speak OCLC” communicate directly with Open Library without knowing any Open Library IDs. As Anand mentioned in his recent post about Coverstore Improvements, we’ve also made the system for displaying covers externally using other types of identifiers more efficient.
There was a bit of a bumpy start to oclcBot’s updates, and Bruce and I thought it might be good to hear what it was like in the trenches. From Bruce:
This project was essentially very simple: find corresponding Open Library and OCLC WorldCat records by a shared attribute (ISBN), and update the Open Library record with the corresponding OCLC number. Once OCLC had generated a list of OCLC numbers and their corresponding ISBNs, it seemed to be a simple matter of using the very robust Open Library API to look for matching records, check whether they already included an OCLC number, and update the record accordingly.

Complications arose related to scale. There were about 90 million ISBNs to check from the OCLC list, and checking them one at a time via the API was projected to take a very long time. So we used a data dump of all the Open Library records to identify those with ISBNs, and also built a very fast index of the OCLC list to check against. With that we were able to produce a new list of Open Library records and their corresponding new OCLC numbers. A batch update facility in the Open Library API then made it possible to send API requests 1,000 records at a time.

The pre-processing and the batch process both yielded some additional lists that will require more scrutiny to process (records associated with multiple ISBNs, API exceptions for individual records), but the great majority of records were updated via the oclcBot without any further effort.
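For developers curious what that workflow looks like in practice, here is a rough Python sketch of the match-and-batch approach Bruce describes: build a fast in-memory index of the OCLC list, scan an Open Library editions dump for records that have ISBNs but no OCLC number yet, and group the matches into batches of 1,000. This is not the actual oclcBot code; the file layouts, field names, and the final upload step are assumptions made for illustration.

```python
# Sketch only (not the actual oclcBot code) of the matching-and-batching approach.
# Assumptions: the OCLC list is a tab-separated "oclc_number<TAB>isbn" file, and
# the Open Library dump has the edition JSON in its last tab-separated column.

import json

BATCH_SIZE = 1000


def load_oclc_index(path):
    """Build an in-memory ISBN -> OCLC number index from the OCLC list."""
    index = {}
    with open(path) as f:
        for line in f:
            oclc_number, isbn = line.rstrip("\n").split("\t")
            index[isbn] = oclc_number
    return index


def editions_needing_oclc(dump_path):
    """Yield editions from an Open Library dump that have ISBNs but no OCLC id."""
    with open(dump_path) as f:
        for line in f:
            record = json.loads(line.rstrip("\n").split("\t")[-1])
            if record.get("oclc_numbers"):
                continue  # already linked to WorldCat
            isbns = record.get("isbn_10", []) + record.get("isbn_13", [])
            if isbns:
                yield record, isbns


def match_and_batch(dump_path, oclc_list_path):
    """Pair editions with OCLC numbers and group them into update batches."""
    index = load_oclc_index(oclc_list_path)
    batch = []
    for record, isbns in editions_needing_oclc(dump_path):
        matches = {index[i] for i in isbns if i in index}
        if len(matches) != 1:
            # No match, or multiple ISBNs pointing at different OCLC numbers;
            # these go on the "needs more scrutiny" list Bruce mentions.
            continue
        record["oclc_numbers"] = [matches.pop()]
        batch.append(record)
        if len(batch) == BATCH_SIZE:
            yield batch
            batch = []
    if batch:
        yield batch

# Each yielded batch would then be sent to Open Library's bulk-save endpoint
# under a bot account -- omitted here, since the exact call is API-specific.
```

The point of the in-memory index is simply to replace tens of millions of individual API lookups with a single pass over the dump, which is what makes the whole run tractable.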
So, it’s still early days for our Bot operations, but we’re looking for external developers who might be interested in trying these “surgical strike” style updates to loads of Open Library records at once. If you’re curious, please visit Writing Open Library Bots in the Open Library Developers area.
Thank you, Bruce!
(And thanks to Solo for the CC BY-NC-SA 2.0 oclcBot photo.)
Just curious, would you be able to make the isbn/oclc mapping available?
That’s really a question for Bruce, Ed. He’s the man who made the various mapping files. I’d love to see that in the wild though, for what it’s worth.
Bruce is an amazing and resourceful colleague and I’m pleased you’ve had the opportunity to work with him!
Hmm, Open Library data is bulk-downloadable, isn’t it? So for the portion of OCLCnums that were successfully attached to OL records… the OCLCnum to ISBN mapping is already available, derivable from available OL records, yeah? Is this true?
Jonathan: I’m not certain that the dumps are produced regularly yet. It may be that the latest oclcBot updates aren’t represented yet. Besides, working with the bulk dumps is far too unwieldy for lots of people… It seems to me it might be useful to produce a “minimum viable record” set from the system: a dataset that contained (something like) Title, Author(s), Subject(s), Date, and Identifier(s) for everything…
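Purely as an illustration of that idea (this is not an existing Open Library schema, and the field names and values are placeholders), one record in such a set might look something like this:

```python
# Hypothetical "minimum viable record" carrying only the fields named above.
# Field names and values are placeholders, not an Open Library schema.
minimum_viable_record = {
    "title": "Example Title",
    "authors": ["Example Author"],
    "subjects": ["Example subject"],
    "date": "1999",
    "identifiers": {
        "openlibrary": "OL0000000M",   # placeholder edition key
        "isbn_13": ["9780000000000"],  # placeholder ISBN
        "oclc": ["00000000"],          # placeholder OCLC number
    },
}
```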