A High Schooler’s Experience Contributing to the Open Book Genome Project

Meet Teo Cheng, a high school student who has been volunteering to lead development on the Open Book Genome Project. In 2021, Teo took Harvard’s online CS50 Intro to Computer Science to prepare for his AP Computer Science Principles exam. The following summer, he took MIT’s Introduction to Computer Science and Programming course. To put these learnings into practice and gain more hands-on experience, he searched for impactful opportunities within the non-profit Internet Archive, where his father Brenton Cheng runs the UX team. For the past year, Teo has been working closely with Mek, making improvements to the Open Book Genome Project Sequencer — a software robot that reads the Internet Archive’s publicly available books and derives public insights to enable greater access to their themes and data. Meet Teo!

Goals

This year, by joining the Open Book Genome Project team, I hoped to understand a piece of production software well enough to make meaningful contributions. Also, because this project may someday be run on every book digitized by the Internet Archive, I wanted to gain experience contributing to something which needs to have a high level of accuracy and runtime performance. When I joined the project, I learned of several problems. For example, the book sequencer module, which is responsible for deriving ngrams, was noisy and wasn’t honoring the defined stop words. Also, the page type detection would frequently break because it was too strict and wasn’t robust against OCR errors, punctuation, and variety in syntax. Furthermore, because I already have experience programming, I was interested in learning more about the engineering development process, such as using tools like git, writing tests, and running pipelines.
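
As a rough illustration of the stop-word problem described above, the sketch below derives ngrams from text and discards any ngram that contains a stop word. It is not the Sequencer's actual code: the `STOP_WORDS` set, the tokenizer, and the function names are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the Sequencer's real list is not reproduced here.
STOP_WORDS = {"the", "of", "and", "a", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, n=1, stop_words=STOP_WORDS):
    """Count ngrams, skipping any ngram that contains a stop word."""
    kept = (
        gram
        for gram in ngrams(tokenize(text), n)
        if not any(token in stop_words for token in gram)
    )
    return Counter(kept)
```

Dropping whole ngrams, rather than deleting stop words before grouping, avoids inventing word pairs that never appeared next to each other in the book. Either choice is defensible; this one is simply the assumption used here.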

What I’ve Learned

So far, while working on the Open Book Genome Project (OBGP), I’ve gained experience with the following 10 things: I learned how to use Docker to install a project in a contained way without cluttering my computer’s file system. I used ssh to run the OBGP pipeline on a more powerful remote computer. Because the internet connection could be disrupted, we did our work in a program called tmux, which kept our processes running even if the connection between the client and server died. This remote computer ran Linux, so I needed to learn basic Bash commands. I also needed to learn about the XML and JSON formats and how they are used in the results of our pipeline. We used Bash commands and regex-based tools (e.g. grep) to analyze the pipeline results, such as to extract URL counts from books. Some of the Bash features I used to discover link counts are for loops, grep, variables, cat, and wc. I worked on improving the existing OBGP Sequencer, so I had to learn how to read through and understand a new codebase. To submit our code changes, we used git, and we managed our tasks on GitHub.
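
The link-counting analysis mentioned above was done with Bash tools. For readers who prefer Python, here is a minimal sketch of the same idea: loop over books, match URLs with a regular expression, and count the matches. The `URL_PATTERN` expression and the shape of the `books` mapping are assumptions for illustration, not the format the pipeline actually produces.

```python
import re

# Simplified URL pattern (an assumption); noisy OCR text may need something more forgiving.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def count_urls(text):
    """Return the number of URL-like strings found in the text."""
    return len(URL_PATTERN.findall(text))


def url_counts(books):
    """Map each book identifier to the number of URLs in its text.

    `books` is a hypothetical dict of {identifier: full text}.
    """
    return {book_id: count_urls(text) for book_id, text in books.items()}
```

For example, `url_counts({"book_a": "See http://example.com today", "book_b": "no links"})` gives one URL for `book_a` and zero for `book_b`.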

Accomplishments

In addition to learning a lot while working on the Open Book Genome Project, I’ve accomplished a few different things. I noticed and fixed the issue with the Page Type Detector. My improvement allows the detector to use regex patterns in addition to exact text matches. I also improved the ISBN detector to reduce false positives, which were fairly common. Lastly, I fixed the bug that kept stop words from being removed from the ngrams, which makes the ngrams less noisy and more useful. I also added more stop words to reduce clutter in the ngram results.
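
To show why regex patterns are more robust than exact text matches for page type detection, here is a small sketch. The page types, the patterns, and the `detect_page_types` function are hypothetical and are not taken from the Sequencer. The point is only that a pattern can tolerate extra whitespace, letter case, and small variations that an exact string comparison would miss.

```python
import re

# Hypothetical page types and patterns, chosen for illustration only.
PAGE_TYPE_PATTERNS = {
    "copyright": [
        # Tolerates extra spaces and an optional (c) or © symbol before the year.
        re.compile(r"copyright\s*(©|\(c\))?\s*\d{4}", re.IGNORECASE),
        re.compile(r"all\s+rights\s+reserved", re.IGNORECASE),
    ],
    "table_of_contents": [
        # Matches "Contents" or "Table of Contents" at the start of any line.
        re.compile(r"^\s*(table\s+of\s+)?contents\b", re.IGNORECASE | re.MULTILINE),
    ],
}


def detect_page_types(page_text):
    """Return the sorted page types whose patterns match the page text."""
    return sorted(
        page_type
        for page_type, patterns in PAGE_TYPE_PATTERNS.items()
        if any(pattern.search(page_text) for pattern in patterns)
    )
```

An exact match on a string such as "Copyright ©" would fail on a page that reads "COPYRIGHT  (c) 2019", while the first pattern above still matches it.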

How it Works

Here’s an inside look, from a developer on the Open Book Genome Project, at what it’s like when staff members run the Sequencer on the Internet Archive’s books:

  1. Set up the project using the Docker instructions
  2. On Archive.org, identify a search query which returns the books we want to sequence
  3. Create an AdvancedSearch query which returns identifiers for these books in JSON
  4. Reformat the results from this query and feed it into the Sequencer pipeline

Here’s an example of a completed book_genome.json created by this process.
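
Steps 2 through 4 above can be sketched roughly as follows. The sketch builds a URL for the Internet Archive's `advancedsearch.php` endpoint that asks for identifiers in JSON, then pulls the identifiers out of a response. The one-identifier-per-line output of `to_sequencer_input` is an assumption; the input format the Sequencer pipeline actually expects is not described here, and the function names are made up for this example.

```python
import json
from urllib.parse import urlencode

ADVANCED_SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(query, rows=50, page=1):
    """Build an AdvancedSearch URL that returns only identifiers, as JSON."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return f"{ADVANCED_SEARCH_URL}?{urlencode(params)}"


def extract_identifiers(response_text):
    """Pull the identifier of each matching item out of the JSON response."""
    payload = json.loads(response_text)
    return [doc["identifier"] for doc in payload["response"]["docs"]]


def to_sequencer_input(identifiers):
    """Reformat identifiers as one per line (an assumed input format)."""
    return "\n".join(identifiers)
```

Fetching the URL, for instance with `urllib.request.urlopen`, is left out so the sketch runs without network access.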

Want to try it yourself?

You can add your own processing modules too! If you’d like to try out the Open Book Genome Project Sequencer using just your browser, you can try it using the OBGP Google Colab.

Learn More

Want to learn more about the Open Book Genome Project? Check out the official bookgenomeproject.org website, read Open Library’s announcement of the project, and learn about the work of Nolan Windham, who previously led development on the project as a high school senior and incoming college freshman as part of Google Summer of Code.

Want to contribute?

Come volunteer to be an Open Library or Open Book Genome Project fellow!

Towards better EPUBs at Open Library and the Internet Archive

You may have read about our recent downtime. We thought it might be a good opportunity to let you know about some of the other behind-the-scenes things going on here. We continue to answer email, keep the FAQ updated and improve our metadata. Many of you have written about the quality of some of our EPUBs. As you may know, all of our OCR (optical character recognition) is done automatically without manual corrections, and while it’s pretty good, it could be better. Specifically, we had a pernicious bug where the formatting of some books caused the first page of chapters to be missing from the book’s OCRed EPUB. I personally had this happen to me with a series of books I was reading on Open Library and I know it’s beyond frustrating.

To address this and other scanning quality issues, we’re changing the way EPUBs work. We’ve improved our OCR algorithm and we’re shifting from stored EPUB files to on-the-fly generation. This means that further developments and improvements in our OCR capabilities will be available immediately. This is good news and has the side benefit of radically decreasing our EPUB storage needs. It also means that we have to:

  • remove all of our old EPUBs (approximately eight million items for EPUBs generated by the Archive)
  • put the new on-the-fly EPUB generation in place (now active)
  • do some testing to make sure it’s working as expected (in process)

We hope that this addresses some of the EPUB errors people have been finding. Please continue to give us feedback on how this is working for you. Coming soon: improvements to Open Library’s search features!

KohaCon 2011

Anand and I attended the Koha Conference in Thane, Mumbai earlier in November and spoke about Open Library. The conference ran from October 31 to November 2, and a hackfest followed from November 4 to 6.

We missed the first day and presented our talk on the second day of the event. The first day had a number of interesting talks mainly about libraries shifting to Koha and about deployment issues. We spent our free time speaking to Robin Sheat, Dobrica Pavlinušić and Ian Walls, among others, about ways to tie the Open Library data in with Koha installations. While the audience was somewhat small, it was truly international. There were folks from Kenya, Nigeria, France, the States, New Zealand, Australia, Croatia and of course various parts of India. We also met Savitra, who, apart from being a Koha developer, runs a Bangalore-based company called OSS labs that provides hosted Koha instances for libraries.

We presented on the last day. Our slides are available at http://internetarchive.github.com/kohacon2011-presentation/. It was an introduction to Open Library, the data we have and some discussions on the API. There were a few questions mainly about copyright issues and about the classification system we use on the website. The conference was attended by many librarians and two of them (The Institute of Management Studies Library at VPM Thane and the University of Zagreb Faculty of Humanities and Social Sciences Library, Croatia) have applied to join the Open Library Lending Library program.

After the presentations, November 3rd was a day off and we spent it wandering around the older parts of Mumbai. On November 4th, we went back and spent the morning brainstorming ideas to implement. We came up with a few:

The first is a simple database update that presents OL as a search option when a book is not found while searching in a Koha installation. It’s been done and signed off.

The second was a simple JavaScript change that fetches covers and borrow information from Open Library and then presents it when searches are done on Koha. This has been implemented as well.
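
The Koha change itself was written in JavaScript and is not reproduced here. As a hedged illustration of the kind of lookup involved, the Python sketch below builds request URLs for Open Library's publicly documented Covers API and Books API from an ISBN. The function names are made up for this example, and whether the Koha patch used exactly these endpoints is an assumption.

```python
from urllib.parse import urlencode

COVER_SIZES = {"S", "M", "L"}


def cover_url(isbn, size="M"):
    """Return the Open Library Covers API URL for an ISBN (size S, M or L)."""
    if size not in COVER_SIZES:
        raise ValueError(f"size must be one of {sorted(COVER_SIZES)}")
    return f"https://covers.openlibrary.org/b/isbn/{isbn}-{size}.jpg"


def books_api_url(isbns):
    """Return a Books API URL requesting JSON data for the given ISBNs."""
    bibkeys = ",".join(f"ISBN:{isbn}" for isbn in isbns)
    params = {"bibkeys": bibkeys, "format": "json", "jscmd": "data"}
    return "https://openlibrary.org/api/books?" + urlencode(params)
```

A client such as a library catalogue could request the second URL for the books on a results page and use the first to display a cover image next to each record.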

The third is the most involved part and we have started work on an API to upload covers to OL which can be used by any external program. We have also started work on an API for Koha to search our records to see if the book being added is already in our database (in which case, it can autocomplete the details for them). The search will also return the cover if it exists. On our end, if the Koha side agrees, we can populate our database with the catalogue record being searched for, and if a cover is uploaded, we can get a copy of that as well. This means that if a Koha instance in one library has uploaded a cover, other libraries will be able to use it. On the Koha side, Robin has a private branch that contains the work in progress. Details are in the Bugzilla entry.

We’re following up on the bugs and the requests to join the lending library. Overall, it was a wonderful event and one that benefited Open Library as well as Koha.

BookReader Work Sprint at NYPL Labs

We had a really fantastic code/work sprint for the BookReader organized by the most excellent NYPL Labs.  The sprint was designed to bring together organizations that have an interest in the BookReader as a way to foster the sharing of interest, code and expertise.

New York Public Library

We started by making a list of desired features and prioritizing them.  High on the list was to make the code more modular and easier to understand, reuse and extend.  We made great progress towards that goal by creating a new plugin architecture that allows new views of the book to be added cleanly to the existing code.  For example, it will be possible to create a book view that uses the <canvas> tag or other advanced web technologies and have it automatically included in the BookReader application simply by including that plugin’s JavaScript file.

Looking down into the stacks

Another highly desired feature is making it easier for people to use their own books with the BookReader application.  Doug Reside from NYPL Labs contributed a “book loader” (our new term for the piece of code that connects the BookReader to the underlying images and metadata for a book to display) that allows you to specify the images for a book directly inside an HTML file.  This new loader provides a simple way to use the BookReader for your own books.

The new code is currently on the codesprint branch of the BookReader GitHub repository.  We plan to integrate the new plugin system once the code has been polished and tested. Updated documentation is also coming. You can subscribe to the bookreader-announce mailing list to be notified when the code is released. You can also find more information about developing and using the BookReader in our developer resources.

Mitch Brodsky with his BookReader customized for the NY Philharmonic

This work sprint hosted by NYPL Labs marks an exciting new milestone in the development of the BookReader. We’re setting the foundation for greater re-use and collaboration around the BookReader. Many thanks to Doug Reside, David Riordan and Ben Vershbow of NYPL Labs for organizing the sprint and the fantastic attendees who contributed ideas and code commits!

BookReader Sprinters