Google Summer of Code 2018

This is Internet Archive’s second year participating in Google Summer of Code, but for Open Library, it’s an exciting first. Open Library’s mission is to create “a web page for every book”, and this summer we’re fortunate to team up with Salman Shah to advance this mission. Salman’s Google Summer of Code roadmap targets two core needs of openlibrary.org: modernizing and increasing the coverage of its book catalog, and improving website reliability.

Bots & Open Library

Every day, users contribute thousands of edits and improvements to Open Library’s book catalog. Anyone with an Open Library account can add a book record to the catalog if it doesn’t already exist. There’s also a great walkthrough on adding or editing data for existing book pages. Making edits manually can be tedious and so the majority of new book pages on Open Library are automatically created by Bots which have been programmed to perform specific tasks by our amazing community of developers and digital librarians. This month, Salman programmed two new bots. The first one is called ia-wishlist-bot. It makes sure an Open Library catalog record exists for each of the 1M books on the Internet Archive’s Wishlist, compiled by Chris Freeland and Matt Miller. The second bot, named onix-bot, takes book feeds (in a special format called ONIX) from our partners (e.g. Cory McCloud at Bibliometa), and makes sure the books exist in our catalog.

Importing Internet Archive Wishlist

Earlier this year, as part of the Open Libraries initiative, Chris Freeland, with the help of Matt Miller and others, compiled a Wishlist of hundreds of thousands of book recommendations for the Internet Archive to digitize:

“Our goal is to bring 4 million more books online, so that all digital learners have access to a great digital library on par with a major metropolitan public library system. We know we won’t be able to make this vision a reality alone, which is why we’re working with libraries, authors, and publishers to build a collaborative digital collection accessible to any library in the country.”

In support of this mission, the Open Library team decided it would be helpful if the metadata for these books were imported into the openlibrary.org catalog. 

Importing thousands of books in bulk into Open Library’s catalog presents several challenges. First, many precautions have to be taken to avoid adding duplicate book and author records to the database. To avoid the creation of duplicate records, Salman used the Open Library Book API to check for existing works by ISBN10, ISBN13, and OCLC identifiers. For this project, we were specifically interested in books which had no other editions on Open Library, so any time we noticed an existing edition for the same work, we skipped it. A second check used the Open Library Search API to check for any existing editions with a similar title and author. If there’s a plausible match, we don’t add it to Open Library. This process leaves us with a much shorter list of presumably unique works to add to Open Library.
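The two-stage check described above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: it assumes the public `/api/books` and `/search.json` endpoints, the function names and record layout are made up for the example, and `fetch_json` is a hypothetical callable (for instance a thin wrapper around an HTTP library) that returns the decoded JSON for a URL.

```python
from typing import Callable
from urllib.parse import urlencode

BOOKS_API = "https://openlibrary.org/api/books"
SEARCH_API = "https://openlibrary.org/search.json"


def bibkeys_for(record: dict) -> list:
    """Build Books API bibkeys such as 'ISBN:...' and 'OCLC:...' from a record."""
    keys = []
    for isbn in list(record.get("isbn_10", [])) + list(record.get("isbn_13", [])):
        keys.append(f"ISBN:{isbn}")
    for oclc in record.get("oclc", []):
        keys.append(f"OCLC:{oclc}")
    return keys


def already_in_catalog(record: dict, fetch_json: Callable[[str], dict]) -> bool:
    """Return True if the record plausibly exists already (hypothetical helper)."""
    # Stage 1: identifier lookup. The Books API returns an empty object
    # when none of the bibkeys are known.
    keys = bibkeys_for(record)
    if keys:
        url = BOOKS_API + "?" + urlencode({"bibkeys": ",".join(keys), "format": "json"})
        if fetch_json(url):
            return True
    # Stage 2: title/author search. Any hit is treated as a plausible match,
    # which is deliberately conservative (assumption for this sketch).
    query = {"title": record["title"]}
    if record.get("author"):
        query["author"] = record["author"]
    url = SEARCH_API + "?" + urlencode(query)
    return fetch_json(url).get("numFound", 0) > 0
```

Passing the fetcher in as an argument keeps the matching logic testable without network access. A real bot would likely want a fuzzier notion of "similar title" than the search hit count used here.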

Finding book covers for this new shortlist was the next challenge to overcome. These book covers typically come from an Open Library partner like Better World Books. Because Better World Books doesn’t have book covers for every book in our list, we had to be mindful that sometimes their service returns a default fallback image (which we had to detect). We wouldn’t want to add these placeholder images into Open Library’s catalog.
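One way to detect a fallback image is to compare a hash of the downloaded bytes against the hash of a known placeholder. The post does not say how detection was actually done, so treat the approach below, the function name, and the idea of a precomputed digest set as assumptions.

```python
import hashlib
from typing import Optional, Set


def cover_or_none(image_bytes: bytes, placeholder_digests: Set[str]) -> Optional[bytes]:
    """Return the image unless it is empty or matches a known placeholder."""
    if not image_bytes:
        return None
    digest = hashlib.sha256(image_bytes).hexdigest()
    # A byte-identical placeholder always hashes to the same digest.
    return None if digest in placeholder_digests else image_bytes
```

This only catches placeholders that are byte-for-byte identical. If a service re-encodes its fallback image, a size or perceptual comparison would be needed instead.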

The last step is to make sure we’re not accidentally creating new Author records when we add our shortlist of books to Open Library. Taking precautions to ensure that a book with the same identifiers, title, and author doesn’t already exist doesn’t guarantee that the author isn’t already registered in our database. If they are, duplicating the author record would result in a confusing experience for readers searching for this author. We check whether an author already exists on Open Library by using the Author search API and faceting on their name, as well as birth and death dates (where available in our shortlist).
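Once candidate authors come back from a search, the comparison itself might look like the sketch below. The field names (`name`, `birth_date`, `death_date`) and both helpers are illustrative assumptions, not the bot's real code. Dates are only allowed to veto a match when both sides have them.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and reorder 'Last, First'."""
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in unaccented.lower())
    return " ".join(cleaned.split())


def same_author(candidate: dict, existing: dict) -> bool:
    """Names must agree; dates may only rule a match out, never confirm it."""
    if normalize_name(candidate["name"]) != normalize_name(existing["name"]):
        return False
    for field in ("birth_date", "death_date"):
        ours, theirs = candidate.get(field), existing.get(field)
        if ours and theirs and ours != theirs:
            return False
    return True
```

With few dates available, a name-only match is weak evidence, so a cautious bot might flag such cases for review rather than linking automatically.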

In summary:

  • The project started with a list of 1 million books that were candidates for addition to Open Library.
  • Many of these works were duplicates that already existed on Open Library and were merged there. After this round, 255,276 works remained.
  • Records were matched on ISBN, title, and author name. We started by adding the top 1,000 works to Open Library. One example of the books that were added can be found here.

An important output from this step was the standardization and generalization of our bot creation process.

Importing ONIX Records

In late 2017, one of our partners, Cory McCloud from Bibliometa, gifted Open Library access to tens of thousands of book metadata records in ONIX format:

“ONIX for Books is an XML format for sharing bibliographic data pertaining to both traditional books and eBooks. It is the oldest of the three ONIX standards, and is widely implemented in the book trade in North America, Europe and increasingly in the Asia-Pacific region. It allows book and ebook publishers to create and manage a corpus of rich metadata about their products, and to exchange it with their customers (distributors and retailers) in a coherent, unambiguous, and largely automated manner.”

Many publishers use ONIX feeds to disseminate the metadata and prices of their books to partner vendors. Cory and his team thought Bibliometa’s ONIX records could be a great opportunity for synergy: a way to get publishers and authors increased exposure and recognition, and to improve the completeness and quality of Open Library’s catalog.

The steps for processing Bibliometa’s ONIX records are similar to those for importing books from the Internet Archive Wishlist, especially the steps for ensuring we don’t create duplicate records in Open Library. At the same time, determining which authors already exist and which need to be created in the catalog was made harder by the fact that fewer birth and death dates were available, greatly reducing our confidence in author searching and matching. In other ways, creating an ONIX import pipeline was simplified by our earlier efforts, which had established key conventions for how new bots may be created using the openlibrary-bots repository. Additionally, our ONIX feeds have the advantage of coming with book covers, whereas we had to manually source book covers for items in the Wishlist.

The first step towards adding these records to Open Library was to write a parser to convert these ONIX feeds into a format which Open Library can understand. Open Library did already have an ONIX parser and import script, written by the co-founder of Open Library, Aaron Swartz, to parse ONIX records and add them to the Open Library database. Like many of Open Library’s scripts, this code was written in Python 2.7, encoded a much earlier version of the ONIX specification, and made use of a very old XML parser which was difficult to extend. Unfortunately, we couldn’t find any drop-in Python replacements for the ONIX parser on GitHub. These factors motivated rolling our own new ONIX parser.
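To give a feel for what such a parser does, here is a small sketch using Python's standard `xml.etree.ElementTree`. It is not the onix-bot parser. It assumes ONIX 3.0 reference tag names, a made-up sample record with no XML namespace, and an output layout invented for the example. Real feeds often declare a namespace and carry far more fields.

```python
import xml.etree.ElementTree as ET

# Made-up sample record using ONIX 3.0 reference tags (assumption).
SAMPLE = """<ONIXMessage release="3.0">
  <Product>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>
      <IDValue>9780306406157</IDValue>
    </ProductIdentifier>
    <DescriptiveDetail>
      <TitleDetail>
        <TitleElement><TitleText>An Example Title</TitleText></TitleElement>
      </TitleDetail>
      <Contributor><PersonName>Jane Doe</PersonName></Contributor>
    </DescriptiveDetail>
  </Product>
</ONIXMessage>"""

# ONIX code list 5: 02 is ISBN-10 and 15 is ISBN-13.
ID_TYPES = {"02": "isbn_10", "15": "isbn_13"}


def parse_onix(xml_text: str) -> list:
    """Convert each <Product> into a plain dict (illustrative layout)."""
    records = []
    for product in ET.fromstring(xml_text).iter("Product"):
        record = {"title": product.findtext(".//TitleText"), "authors": []}
        for ident in product.iter("ProductIdentifier"):
            key = ID_TYPES.get(ident.findtext("ProductIDType"))
            if key:
                record.setdefault(key, []).append(ident.findtext("IDValue"))
        for contributor in product.iter("Contributor"):
            name = contributor.findtext("PersonName")
            if name:
                record["authors"].append(name)
        records.append(record)
    return records
```

Calling `parse_onix(SAMPLE)` yields one dict with the title, a single author, and an `isbn_13` list, which is the kind of intermediate form the duplicate checks can then consume.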

To start with, Salman received a dump of ~70,000 ONIX records from Bibliometa to be evaluated for import into Open Library. Two checks were implemented for this procedure:

  1. Checking whether a work with the same ISBN-10 or ISBN-13 already exists on Open Library, using the Open Library Client.
  2. Matching on title or author via an API call to see whether the record already exists on Open Library.
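Because a feed may carry only one ISBN form while the catalog holds the other, it can help to look up both. The sketch below derives the ISBN-13 from an ISBN-10 using the standard 978 prefix and check-digit rule. The helper names are our own, and whether the bot does this conversion is an assumption.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 to its 978-prefixed ISBN-13 equivalent."""
    digits = isbn10.replace("-", "").replace(" ", "")
    if len(digits) != 10:
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    # Drop the old check digit, add the prefix, then recompute the check digit
    # with alternating weights of 1 and 3.
    core = "978" + digits[:9]
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(core))
    return core + str((10 - total % 10) % 10)


def isbn_variants(isbn: str) -> set:
    """Return every ISBN form worth looking up for one input ISBN."""
    clean = isbn.replace("-", "").replace(" ", "")
    variants = {clean}
    if len(clean) == 10:
        variants.add(isbn10_to_isbn13(clean))
    return variants
```

For example, `isbn10_to_isbn13("0-306-40615-2")` returns `"9780306406157"`.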

While much of the ONIX parser is complete, the ONIX Bot project is still in development.

A Guide on Writing Bots

Interested in writing your own Open Library Bot? For more information on how to make an Open Library Bot and what bots can do, please consult our documentation. The basic steps are:

  1. Apply for a bot account by contacting the Open Library maintainer. A good way to do this is to respond to this issue on GitHub.
  2. Once your bot account has been registered and approved, you can write a bot by extending the openlibrary-client to accomplish tasks like adding new works to Open Library. You can refer to the openlibrary-client examples.
  3. All bots that add works to Open Library must be added to the Open Library Bots repository on GitHub. Every bot has its own directory with a README containing instructions on how to reproducibly run the bot. Each bot should also link to a corresponding directory within the openlibrary-bots archive.org item where the outputs of the bot may be stored for provenance.
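As a rough picture of how a bot's main loop can be organized, here is a skeleton with a dry-run mode and a per-record log that could be saved for provenance. The `CatalogClient` interface is hypothetical and stands in for whatever openlibrary-client calls your bot uses. Consult the openlibrary-client examples for the real API.

```python
from typing import Iterable, Protocol


class CatalogClient(Protocol):
    """Hypothetical interface; adapt it to the real openlibrary-client calls."""

    def exists(self, record: dict) -> bool: ...

    def add(self, record: dict) -> str: ...


def run_bot(records: Iterable[dict], client: CatalogClient, dry_run: bool = True) -> list:
    """Process records and return a log of what was (or would be) done."""
    log = []
    for record in records:
        if client.exists(record):
            outcome = {"title": record["title"], "action": "skipped"}
        elif dry_run:
            # Nothing is written in dry-run mode, so results can be reviewed first.
            outcome = {"title": record["title"], "action": "would_add"}
        else:
            outcome = {"title": record["title"], "action": "added", "key": client.add(record)}
        log.append(outcome)
    return log
```

Defaulting to `dry_run=True` means a first run never edits the catalog, and the returned log is the kind of output that can be stored alongside the bot.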

Next Steps: Provisioning

Unfortunately, there wasn’t enough time during the GSoC program to complete all three phases of our roadmap (Wishlist, ONIX, and Provisioning). The objective of the third phase of our plan was to make Open Library deployment more robust and reliable using Docker and Ansible. Docker has been a discussion point of several Open Library Community Calls and has catalyzed the creation of a docker branch on the Open Library GitHub repository which addresses some of the basic use cases outlined in the GSoC proposal. One important outcome is the identification of concrete steps and recommendations which the community can implement to improve Open Library’s provisioning process:

  • Switch from Docker to Docker Compose: Currently the docker branch uses standalone Dockerfiles to manage dependencies. The goal is to use a single docker-compose file which will manage all the services being used.
  • Switch Open Library to use Ansible (software that automates provisioning, configuration management, and application deployment), with a Production as well as a Development playbook. Playbooks are Ansible’s configuration, deployment, and orchestration language. They can describe a policy you want your remote systems to enforce, or a set of steps in a general IT process.
  • Use Ansible Vault, a feature of Ansible that allows keeping sensitive data such as passwords or keys in encrypted files. This will replace the current system of having an olsystem.

Retrospective

In retrospect, Google Summer of Code 2018 resulted in thousands of new books being added to the Open Library catalog. Conventions were established to make it easier for others to create new bots in the future and to continue and extend this summer’s work.

Some of the key points that we overlooked while drafting the proposal were as follows:

  1. Checking whether a book exists on Open Library is hard. We started with a simple title match and ended up formatting the title, formatting the authors to ensure no new author objects are created, and changing the code to ensure it doesn’t break when there are no authors for a work in our data.
  2. The openlibrary-client needed to be improved and documented extensively so that future developers don’t have to read through the code to understand what a particular function does and how it can be used.
  3. The openlibrary-bots directory needed a structure so that future developers can easily find the code they need when writing their own bot.
  4. We assumed that the data would be perfect and that importing it would be a matter of copy-pasting. In reality, Salman and Mek had to go through the data to understand where the code broke, for reasons such as a string containing a comma (‘,’).
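The comma problem is easy to reproduce. Assuming, purely for illustration, that the input is comma-separated text with quoted fields, splitting each line on commas breaks titles that contain one, whereas Python's `csv` module handles the quoting. The sample data and the `read_rows` helper below are made up, and the helper also tolerates rows with no author.

```python
import csv
import io

# Made-up sample: one title contains a comma, one row has no author.
RAW = 'title,author\n"Hello, World",Jane Doe\nNo Author Book,\n'


def read_rows(text: str) -> list:
    """Parse rows with the csv module and tolerate missing authors."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        author = (row["author"] or "").strip()
        rows.append({"title": row["title"].strip(), "authors": [author] if author else []})
    return rows
```

A naive `line.split(",")` on the second line of `RAW` yields three fields instead of two, which is exactly the sort of breakage that only shows up once real data is run through the pipeline.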

One lesson we learned from participating in GSoC for the first time is that we may have been better off focusing on two deliverables instead of three. By the end of the program, we didn’t have enough time for our third phase, even though we were proud of the progress we made. On the flip side, because of discussions catalyzed during our community calls and suggestions outlined in our GSoC proposal, there is now ongoing community progress on this final phase, the dockerization of Open Library, which can be found here.

A major win of this GSoC project is that its complexity required Salman to explore writing test cases for the first time, which provided firsthand experience of the importance of a test harness in developing an end-to-end data processing pipeline.

Three of our biggest key results during this program were:

  1. Quality assuring and updating the documentation of the openlibrary-client tool to support future developers.
  2. Creating a new `openlibrary-bots` repository with documentation and processes to ensure that there is a standard way to add bots moving forward, and making sure our Wishlist and ONIX bot processes are well documented, with reproducible results.
  3. Adding thousands of new modern books to the Open Library catalog.

Project Links

  1. Open Library Client – https://github.com/internetarchive/openlibrary-client
  2. Open Library Bots (IA Wishlist Bot) – https://github.com/internetarchive/openlibrary-bots/tree/master/ia-wishlist-bot
  3. Open Library Bots (ONIX Bot) – https://github.com/internetarchive/openlibrary-bots/tree/master/onix-bot
  4. Docker (In Progress) – https://github.com/internetarchive/openlibrary/tree/docker