Category Archives: Data

A High Schooler’s Experience Contributing to the Open Book Genome Project

Meet Teo Cheng, a high school student who has been volunteering to lead development on the Open Book Genome Project. In 2021, Teo took Harvard’s online CS50 Intro to Computer Science to prepare for his AP Computer Science Principles exam. The following summer, he took MIT’s Introduction to Computer Science and Programming course. To put these learnings into practice and gain more hands-on experience, he searched for impactful opportunities within the non-profit Internet Archive, where his father Brenton Cheng runs the UX team. For the past year, Teo has been working closely with Mek, making improvements to the Open Book Genome Project Sequencer — a software robot that reads the Internet Archive’s publicly available books and derives public insights to enable greater access to their themes and data. Meet Teo!

Goals

This year, by joining the Open Book Genome Project team, I hoped to understand a piece of production software well enough to make meaningful contributions. Also, because this project may someday be run on every book digitized by the Internet Archive, I wanted to gain experience contributing to something which needs to have a high level of accuracy and runtime performance. When I joined the project, I learned of several problems. For example, the book sequencer module, which is responsible for deriving ngrams, was noisy and wasn’t honoring the defined stop words. Also, the page type detection would frequently break because it was too strict and wasn’t robust against OCR errors, punctuation, and variety in syntax. Furthermore, because I already have experience programming, I was interested in learning more about the engineering development process, such as using tools like git, writing tests, and running pipelines.
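To make the stop word problem concrete, here is a minimal sketch of the kind of filtering the ngram module is meant to perform. The stop word list, function, and sample text below are illustrative only, not the Sequencer’s actual code:

```python
from collections import Counter

# Illustrative stop words; the real Sequencer uses a much longer, curated list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def ngrams(words, n=2):
    """Yield n-grams, skipping any that contain a stop word."""
    for i in range(len(words) - n + 1):
        gram = words[i : i + n]
        if not any(w.lower() in STOP_WORDS for w in gram):
            yield " ".join(gram)

text = "the history of the internet archive and the open library"
print(Counter(ngrams(text.split(), n=2)).most_common(3))
# [('internet archive', 1), ('open library', 1)]
```

When stop words are honored like this, common glue words stop dominating the counts and the remaining phrases say much more about a book’s actual content.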

What I’ve Learned

So far while working on the Open Book Genome Project (OBGP), I’ve gained experience with the following 10 things:

  1. I learned how to use Docker to install a project in a contained way, without having to mess up my computer’s file system.
  2. I used ssh to run the OBGP pipeline on a more powerful remote computer.
  3. Because the internet connection could be disrupted, we did our work using a program called tmux, which ensured our processes would continue running even if the connection between the client and server died.
  4. This remote computer ran Linux, so I needed to learn basic Bash commands.
  5. I also needed to learn about the XML and JSON formats, and how those are used in the results of our pipeline.
  6. We used Bash commands and regex (e.g. grep) to analyze the pipeline results, such as extracting URL counts from books.
  7. Some of the Bash building blocks I used to discover link counts are: for loops, grep, variables, cat, and wc.
  8. I worked on improving the existing OBGP Sequencer, so I had to learn how to read through and understand a new codebase.
  9. To submit our code changes, we used the git protocol.
  10. We managed our tasks on GitHub.
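As one example, here’s a rough Python equivalent of the kind of URL counting we did with grep and wc. The directory name and regex below are placeholders, not the exact commands we ran on the server:

```python
import re
from pathlib import Path

# A simple pattern for http/https URLs; real OCR text is messier,
# so the actual analysis has to be more forgiving than this.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")

def count_urls(text: str) -> int:
    """Count URL-looking strings in a block of OCR text."""
    return len(URL_PATTERN.findall(text))

# Hypothetical layout: one plain-text OCR dump per book.
for path in Path("ocr_dumps").glob("*.txt"):
    print(f"{path.name}: {count_urls(path.read_text(errors='ignore'))} URLs")
```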

Accomplishments

In addition to learning a lot while working on the Open Book Genome Project, I’ve accomplished a few different things. I noticed the issue with the Page Type Detector and solved it: my improvement allows regex patterns in addition to exact text matches. I also improved the ISBN detector to reduce false positives, which were happening pretty commonly. Lastly, I fixed the bug where stop words weren’t being removed from the ngrams, which makes the results less noisy and more useful. I also added more stop words to decrease the amount of clutter in the ngram results.
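To give a feel for the regex approach (the patterns and labels below are invented for illustration; the real detector has its own label set and rules), matching patterns instead of exact strings lets the detector tolerate OCR noise, stray punctuation, and variations in wording:

```python
import re

# Invented example patterns; the actual Page Type Detector defines its own.
PAGE_TYPE_PATTERNS = {
    "copyright": re.compile(r"copyright|all\s+rights\s+reserved|\(c\)\s*\d{4}", re.I),
    "table_of_contents": re.compile(r"table\s+of\s+contents|^\s*contents\s*$", re.I | re.M),
    "index": re.compile(r"^\s*index\s*$", re.I | re.M),
}

def detect_page_type(page_text: str) -> str:
    """Return the first label whose pattern appears in the page text."""
    for label, pattern in PAGE_TYPE_PATTERNS.items():
        if pattern.search(page_text):
            return label
    return "unknown"

print(detect_page_type("Copyright (c) 1998.  All  rights reserved,"))  # copyright
```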

How it Works

As a developer on the Open Book Genome Project, here’s an inside look at what it’s like when staff members run the Sequencer on Internet Archive’s books:

  1. Set up the project using the Docker instructions
  2. On Archive.org, identify a search query which returns the books we want to sequence
  3. Create an AdvancedSearch query which returns identifiers for these books in JSON
  4. Reformat the results from this query and feed them into the Sequencer pipeline (a rough sketch of steps 2–4 follows below)
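For example, fetching identifiers from the AdvancedSearch endpoint and writing them out for the pipeline might look roughly like the snippet below. The query, row count, and output file name are placeholders, not the exact script staff use:

```python
import requests

# Placeholder query; in practice, use whatever Archive.org search returns the books to sequence.
params = {
    "q": "collection:(inlibrary) AND mediatype:(texts)",
    "fl[]": "identifier",
    "rows": 50,
    "output": "json",
}
resp = requests.get("https://archive.org/advancedsearch.php", params=params, timeout=60)
docs = resp.json()["response"]["docs"]

# One identifier per line, reformatted into whatever input the Sequencer pipeline expects.
with open("identifiers.txt", "w") as f:
    for doc in docs:
        f.write(doc["identifier"] + "\n")
```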

Here’s an example of a completed book_genome.json created by this process.

Want to try it yourself?

You can add your own processing modules too! If you’d like to try out the Open Book Genome Project Sequencer using just your browser, you can do so with the OBGP Google Colab.
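Conceptually, a processing module is just something that reads a book’s text and contributes a structured insight to the results. Purely as an illustration (this toy function is hypothetical and is not the actual OBGP module interface), a module might look like:

```python
# Hypothetical example only; not the real OBGP module API.
def count_exclamations(pages):
    """Toy 'module': given a list of page texts, return a small insight dict."""
    total = sum(page.count("!") for page in pages)
    return {"exclamation_marks": total, "pages": len(pages)}

print(count_exclamations(["Hello world!", "What a book!!", "The end."]))
# {'exclamation_marks': 3, 'pages': 3}
```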

Learn More

Want to learn more about the Open Book Genome Project? Check out the official bookgenomeproject.org website, Open Library’s announcement of the project, and learn about the work of Nolan Windham, who previously led development on the project as a high school senior and incoming college freshman as part of Google Summer of Code.

Want to contribute?

Come volunteer to be an Open Library or Open Book Genome Project fellow!

The Open Book Genome Project

We’ve all heard the advice: don’t judge a book by its cover. But then how should we go about identifying books which are good for us? The secret depends on understanding two things:

  1. What is a book?
  2. What are our preferences?

We can’t easily answer the second question without understanding the first one. But we can help by being good library listeners and trying to provide tools, such as the Reading Log and Lists, to help patrons record and discover books they like. Since everyone is different, the second question is key to understanding why patrons like these books and to making Open Library as useful as possible to them.

What is a book?

As we’ve explored before, determining whether something is a book is a deceptively difficult task, even for librarians. It’s a bound thing made of paper, right? But what about audiobooks and ebooks? Ok, books have ISBNs right? But many formats can have ISBNs and books published before 1967 won’t have one. And what about yearbooks? Is a yearbook a book? Is a dictionary a book? What about a phonebook? A price guide? An atlas? There are entire organizations, like the San Francisco Center for the Book, dedicated to exploring and pushing the limits of the book format.

In some ways, it’s easier to answer this question about humans than about books, because every human is built according to a specific genetic blueprint called DNA. We all have DNA; what makes us unique are the variations in the more than 20,000 genes our DNA is made of, which help encode characteristics like hair and eye color. In 1990, an international research group called the Human Genome Project (HGP) began sequencing the human genome to definitively uncover “nature’s complete genetic blueprint for building a human being”. The result, completed in 2003, was a compelling answer to the question, “what is a human?”.

Nine years later, Will Glaser & Tim Westergren drew inspiration from HGP and launched a similar effort called the Music Genome Project, using trained experts to classify and label music according to a taxonomy of characteristics, like genre and tempo. This system became the engine which powers song recommendations for Pandora Radio.

Circa 2003, Aaron Stanton, Matt Monroe, Sidian Jones, and Dan Bowen adapted the idea of Pandora to books, creating a book recommendation service called BookLamp. Under the hood, they devised a Book Genome Project which combined computers and crowds to “identify, track, measure, and study the multitude of features that make up a book”.

Their system analyzed books and surfaced insights about their structure, themes, age-appropriateness, and even pace, bringing us within grasping distance of the answer to our question: What is a book?

[Image: BookLamp’s Theme Currents for Carrie]

Sadly, the project did not release its data; it was acquired by Apple in 2014 and subsequently discontinued. But it left an exciting treasure map for others to follow.

And follow, others did. In 2006, a project called the Open Music Genome Project attempted to create a public, open, community alternative to Pandora’s Music Genome Project. We thought this was a beautiful gesture and a great opportunity for Open Library: perhaps we could facilitate public book insights which any project in the ecosystem could use to create their own answer to “what is a book?”. We also found inspiration in complementary projects like StoryGraph, which elegantly crowdsources book tags from patrons to help you “choose your next book based on your mood and your favorite topics and themes”; the HathiTrust Research Center (HTRC), which has led the way in making book data available to researchers; and the Open Syllabus Project, which is surfacing useful academic books based on their usage across college curricula.

Introducing the Open Book Genome Project

Over the last several months, we’ve been talking to communities, conducting research, speaking with some of the teams behind these innovative projects, and building experiments to shape a non-profit adaptation of these approaches called the Open Book Genome Project (OBGP).

Our hope is that this Open Book Genome Project will help responsibly make book data more useful and accessible to the public: to power book recommendations, to compare books based on their similarities and differences, to produce more accurate summaries, to calculate reading levels that match audiences to books, to surface citations and URLs mentioned within books, and more.

OBGP hopes to achieve these things by employing a two-pronged approach, which readers may continue learning about in the following two blog posts:

  1. The Sequencer – a community-engineered bot which reads millions of Internet Archive books and extracts key insights for public consumption.
  2. Community Reviews – a new crowd-sourced book tagging system which empowers readers to collaboratively classify & share structured reviews of books.

Or hear an overview of the OBGP in this half-hour tech talk:

Amplifying the Voices Behind Books With the Power of Data

Exploring how Open Library uses author data to help readers move from imagination to impact

By Nick Norman, Edited by Mek & Drini

Image Source: Pexels / Pixabay from popsugar

According to René Descartes, a creative mathematician, “The reading of all good books is like a conversation with the finest [people] of past centuries.” If that’s true, then who are some of the people you’re talking to?


Time travel through millions of historic Open Library images

The BBC has an article about Kalev Leetaru’s project to extract images from millions of Open Library pages.

You can read about how it works…

The Internet Archive had used an optical character recognition (OCR) program to analyse each of its 600 million scanned pages in order to convert the image of each word into searchable text. As part of the process, the software recognised which parts of a page were pictures in order to discard them.

Mr Leetaru’s code used this information to go back to the original scans, extract the regions the OCR program had ignored, and then save each one as a separate file in the Jpeg picture format. The software also copied the caption for each image and the text from the paragraphs immediately preceding and following it in the book. Each Jpeg and its associated text was then posted to a new Flickr page, allowing the public to hunt through the vast catalogue using the site’s search tool.

“I think one of the greatest things people will do is time travel through the images,” Mr Leetaru said.

… or just check out some of the results. Images plus citations plus metadata! We couldn’t be happier. Free to use with no restrictions.
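For readers curious what the extraction step might look like in code, here is a very loose sketch. The file names, region coordinates, and workflow below are invented for illustration; this is not Mr Leetaru’s actual code, and the Archive’s real OCR output is richer than a simple list of boxes:

```python
from PIL import Image

# Hypothetical inputs: a page scan plus the regions an OCR pass marked as pictures rather than text.
page = Image.open("page_0042.jpg")
picture_regions = [(120, 340, 980, 1210)]  # (left, top, right, bottom), made up for illustration

for i, box in enumerate(picture_regions):
    cropped = page.crop(box)
    cropped.save(f"page_0042_image_{i}.jpg", "JPEG")
    # In the project described above, the caption and the surrounding paragraphs of text
    # were kept alongside each Jpeg and uploaded to Flickr with the book's metadata.
```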
