A High Schooler’s Experience Contributing to the Open Book Genome Project

Meet Teo Cheng, a high school student who has been volunteering to lead development on the Open Book Genome Project. In 2021, Teo took Harvard’s online CS50 Intro to Computer Science to prepare for his AP Computer Science Principles exam. The following summer, he took MIT’s Introduction to Computer Science and Programming course. To put these learnings into practice and gain more hands-on experience, he searched for impactful opportunities within the non-profit Internet Archive, where his father Brenton Cheng runs the UX team. For the past year, Teo has been working closely with Mek, making improvements to the Open Book Genome Project Sequencer — a software robot that reads the Internet Archive’s publicly available books and derives public insights to enable greater access to their themes and data. Meet Teo!

Goals

This year, by joining the Open Book Genome Project team, I hoped to understand a piece of production software well enough to make meaningful contributions. Also, because this project may someday be run on every book digitized by the Internet Archive, I wanted to gain experience contributing to something which needs to have a high level of accuracy and runtime performance. When I joined the project, I learned of several problems. For example, the book sequencer module, which is responsible for deriving ngrams, was noisy and wasn’t honoring the defined stop words. Also, the page type detection would frequently break because it was too strict and wasn’t robust against OCR errors, punctuation, and variety in syntax. Furthermore, because I already have experience programming, I was interested in learning more about the engineering development process, such as using tools like git, writing tests, and running pipelines.

What I’ve Learned

So far, while working on the Open Book Genome Project (OBGP), I’ve gained experience with the following:

  1. I learned how to use Docker to install a project in a contained way without cluttering my computer’s file system.
  2. I used ssh to run the OBGP pipeline on a more powerful remote computer.
  3. Because the internet connection could be disrupted, we did our work in a program called tmux, which ensured our processes would continue running even if the connection between the client and server died.
  4. This remote computer ran Linux, so I needed to learn basic Bash commands.
  5. I also needed to learn about the XML and JSON formats, and how they are used in the results of our pipeline.
  6. We used Bash commands and regular expressions (e.g. with grep) to analyze the pipeline results, such as extracting URL counts from books.
  7. Some Bash features I used to discover link counts were for loops, grep, variables, cat, and wc.
  8. I worked on improving the existing OBGP Sequencer, so I had to learn how to read through and understand a new codebase.
  9. To submit our code changes, we used git, and we managed our tasks on GitHub.
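To give a feel for the link-counting analysis mentioned above, here is a minimal Python sketch of the same idea as the for/grep/wc approach. It is not the project's actual tooling: the URL pattern is deliberately simplified, and the function names and sample texts are hypothetical.

```python
import re

# Simplified URL pattern, for illustration only; OCR'd book text
# would likely need something more forgiving.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def count_urls(text: str) -> int:
    """Return the number of URL-like strings found in the text."""
    return len(URL_PATTERN.findall(text))


def url_counts(books: dict[str, str]) -> dict[str, int]:
    """Map each (hypothetical) book identifier to its URL count."""
    return {book_id: count_urls(text) for book_id, text in books.items()}


if __name__ == "__main__":
    # Made-up sample data standing in for pipeline results.
    sample = {
        "book_a": "See https://archive.org and http://openlibrary.org for more.",
        "book_b": "No links here.",
    }
    for book_id, count in url_counts(sample).items():
        print(book_id, count)
```

In the shell version, the loop over files plays the role of `url_counts`, while grep and wc together do the work of `count_urls`.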

Accomplishments

In addition to learning a lot throughout working on the Open Book Genome Project, I’ve accomplished a few different things. I noticed the issue with the Page Type Detector, which I solved. My improvement to the detector involved allowing regex patterns in addition to exact text matches. I also improved the ISBN detector to reduce false positives, which were happening pretty commonly. Lastly, I solved the bug with the stop words that get removed from the ngrams to make them less noisy and more useful. I also added more stop words to decrease the amount of clutter in the ngram results.
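The two ideas above, regex-based page type detection and stop-word filtering for ngrams, can be sketched roughly as follows. This is an illustrative sketch rather than the Sequencer's real code: the page types, patterns, stop-word list, and function names are all assumptions made up for the example.

```python
import re

# Hypothetical page-type rules. Each rule is a regex instead of an exact
# string, so extra spaces, case changes, and small variations still match.
PAGE_TYPE_PATTERNS = {
    "copyright": re.compile(
        r"copyright|all\s+rights\s+reserved|\(c\)\s*\d{4}", re.IGNORECASE
    ),
    "contents": re.compile(
        r"table\s+of\s+contents|^\s*contents\s*$", re.IGNORECASE | re.MULTILINE
    ),
}

# A tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = frozenset({"the", "of", "and", "a", "to", "in"})


def detect_page_types(page_text):
    """Return the names of all page types whose pattern matches the page."""
    return [
        name
        for name, pattern in PAGE_TYPE_PATTERNS.items()
        if pattern.search(page_text)
    ]


def ngrams(text, n=2, stop_words=STOP_WORDS):
    """Build n-grams from lowercased tokens after dropping stop words."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in stop_words]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Note that this sketch removes stop words before forming ngrams, which joins words that were not adjacent in the original text. Dropping only the ngrams that contain a stop word is an equally reasonable choice.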

How it Works

Here’s an inside look, from a developer on the Open Book Genome Project, at what it’s like when staff members run the Sequencer on the Internet Archive’s books:

  1. Set up the project using the Docker instructions
  2. On Archive.org, identify a search query which returns the books we want to sequence
  3. Create an AdvancedSearch query which returns identifiers for these books in JSON
  4. Reformat the results from this query and feed it into the Sequencer pipeline
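Steps 3 and 4 might look something like the sketch below. It assumes the Archive.org AdvancedSearch endpoint at `https://archive.org/advancedsearch.php` with the `q`, `fl[]`, `rows`, and `output` parameters, and a JSON response with a `response.docs` list. The example query, the helper names, and the one-identifier-per-line output are assumptions, since the exact input format the Sequencer expects is not described here.

```python
import json
from urllib.parse import urlencode

ADVANCED_SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(query, rows=50):
    """Build an AdvancedSearch URL that returns only identifiers, as JSON."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("rows", rows),
        ("output", "json"),
    ]
    return f"{ADVANCED_SEARCH_URL}?{urlencode(params)}"


def extract_identifiers(response_text):
    """Pull the identifier of every document out of a JSON search response."""
    payload = json.loads(response_text)
    return [doc["identifier"] for doc in payload["response"]["docs"]]


if __name__ == "__main__":
    # Hypothetical query; no network request is made in this sketch.
    print(build_search_url("collection:inlibrary AND mediatype:texts", rows=5))

    # A made-up response in the assumed shape, reformatted one id per line.
    fake_response = '{"response": {"docs": [{"identifier": "book1"}, {"identifier": "book2"}]}}'
    print("\n".join(extract_identifiers(fake_response)))
```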

Here’s an example of a completed book_genome.json created by this process.

Want to try it yourself?

If you’d like to try out the Open Book Genome Project Sequencer using just your browser, you can use the OBGP Google Colab. You can add your own processing modules too!

Learn More

Want to learn more about the Open Book Genome Project? Check out the official bookgenomeproject.org website, Open Library’s announcement of the project, and learn about the work of Nolan Windham who previously led development on the project as a high school senior and incoming college freshman as part of Google Summer of Code.

Want to contribute?

Come volunteer to be an Open Library or Open Book Genome Project fellow!

Introducing Trusted Book Providers

Building the Internet’s library is no easy task, and it can’t be done alone. Thankfully, we’re not alone in wanting to provide access to knowledge, books, and reading — which is why we’re excited to introduce Trusted Book Providers into Open Library. This feature allows us to provide direct “Read” links to a number of carefully selected, reputable sources of books online. Integrations with Project Gutenberg and LibriVox are up and running, and integrations with Standard Ebooks, OpenStax, and Wikisource are in progress. By linking to these outstanding organizations, we’re excited to help promote their wonderful work as well as give Open Library patrons easy access to more trusted sources for digital books. We see this as a step in helping the world of open access books flourish.

Viewing LibriVox and Gutenberg works in Open Library

For more than ten years, Open Library has allowed patrons from across the globe to read, borrow, and listen to digital books from the Internet Archive’s prodigious lending library and public domain collection. Since then, the Internet Archive has partnered closely with more than 1,000 US libraries to accession books, ensure their digital preservation, and make them useful to select audiences, such as those with print disabilities, through controlled library practices.

Open Library is now excited to expand its “Read” buttons to include not only the millions of books made available by the Internet Archive, but also works from other trusted digital collections. What does this mean for patrons? It means more books and more reading options — such as LibriVox’s human-read public domain audiobooks, Standard Ebooks’ lovingly formatted modern epubs, or Project Gutenberg’s reflowable-text books. We hope this will result in a more inclusive ecosystem and shine more light on the amazing work done by these other mission-aligned non-profit organizations.

Choosing the First Trusted Book Providers

We selected the first group of Trusted Book Providers based on several factors. First, we prioritized non-profit organizations that are reputable, well-established, and have a similar focus on serving the public good. Second, we looked for providers whose holdings increased the diversity of book formats Open Library may link to. Third, we looked for providers that focus on open & permissive licensing, or public domain material.

Project Gutenberg

Project Gutenberg is the oldest digital library online. Founded in 1971 (was the internet even around then?), the volunteer-driven organization is dedicated to creating free, open, long-lasting eBooks that are easily accessible from many devices. The Internet Archive already proudly preserves most of Project Gutenberg’s over 60,000 titles, and Open Library is excited to be able to have users read from Project Gutenberg directly. For patrons, the human-curated, reflowable-text formats made available by Project Gutenberg are ideal for reading on small screens, e-readers, and also for powerful accessibility customization, like dyslexic fonts and screen readers.

Browse on Open Library

LibriVox

Founded in 2005, LibriVox’s stated mission is “to make all books in the public domain available, narrated by real people and distributed for free, in audio format on the internet.” And with over 15,000 editions in over 80 languages, they’re making great headway! The Internet Archive also works with LibriVox, and provides storage for their mass of audio files. For patrons, LibriVox integration means they will now have access to human-spoken audiobooks for many public domain works.

Browse on Open Library

Standard Ebooks

Standard Ebooks is a volunteer-driven project dedicated to producing new editions of public domain ebooks that are lovingly formatted, open source, free of copyright restrictions, and free of cost. Founded in 2015, Standard Ebooks carefully standardizes and normalizes its books to work great as reflowable-text HTML, as well as modern epubs with all the trimmings: table of contents, typographical attention to detail, beautiful public domain cover art, and more. For patrons, Standard Ebooks’ over 500 titles are perfect for reading on web browsers, phones, or e-readers due to their reflowable text and modern epub features specifically optimized for every e-reader platform.

In Progress… | Browse at Standard Ebooks

OpenStax

OpenStax is a non-profit dedicated to creating original, free, open-access high school and college textbooks. Part of the non-profit corporation, Rice University, OpenStax has created over 60 high quality, peer-reviewed textbooks since its launch in 2012, with some titles available in English, Spanish, and Polish. Open Library will include OpenStax read links so our patrons can find and access these digital-only materials online or as PDF or ePub downloads.

In Progress… | Browse at OpenStax

Wikisource

Launched in 2003, Wikisource is an online digital library of free-content textual sources on a wiki, operated by the Wikimedia Foundation (the folks who run Wikipedia). Wikisource has a huge community of editors dedicated to converting scans of classic books to error-free, proofread digital books. And improving their records is as easy as editing a Wikipedia page! Offering reading options online or offline as PDF, ePub, mobi, etc., for millions of records, Wikisource’s catalog, spanning over 30 languages, is unparalleled. And soon, you’ll be able to find these works right in Open Library!

In Progress… | Browse at Wikisource

How Trusted Book Providers Work

As a patron, you shouldn’t have to do anything special to access titles from our Trusted Book Providers.

When designing support for Trusted Providers, we wanted to find the right balance between convenience and trust. We didn’t want patrons to get confused by a button taking them to a new website without warning. But we also didn’t want to introduce unnecessary friction and multiple clicks preventing patrons from easily accessing books. As a result, our team converged on two strategies:

  1. When a Read button is for a Trusted Provider, the button will display an external link icon.
  2. When you click a Trusted Provider button, a message will appear on Open Library providing context about the Trusted Provider. The Trusted Provider link will open in a new browser tab.

Recommend a Trusted Book Provider

Are you a book service, library, or publisher which would like to integrate with the Open Library’s catalog? Or is there a service you’d like to recommend?

Please recommend or apply to become a Trusted Book Provider using this form.

Open Library in Every Language

The Open Library catalog is used by patrons from across the globe, but its usage is predominated by English speakers (32% US, 9% India, 5% UK, 4% Canada). This is driven by four factors which we’re working to change.

  1. International Holdings – It goes without saying that, in order to be an Open Library for the Internet™ our catalog needs to include book records and link to source material from more languages. We’re actively working with the acquisitions team within the Internet Archive to fight for greater diversification of our book holdings, including more languages and regions. If you are an international library or publisher, you may help us by sharing your catalog metadata and we’ll happily include these records on Open Library & provide back-links so patrons know where the metadata comes from.
  2. Search – In order for Open Library to be as useful as possible for diverse communities around the globe, our search engine has to show patrons the right books with appropriately translated titles. Managing a search engine for a service like Open Library is a full-time job, and this gargantuan task is presently spearheaded by Drini Cami. For historical and performance reasons, the Open Library search engine indexes Works (collections of editions) as opposed to Editions. This limits our ability to tailor search results and show patrons book editions in their preferred language. This year we made progress on supporting Edition-level indexing, and “search for books in language” (one of our most requested features) will be on our roadmap for 2022.
  3. Marketing – Open Library is run by a small team of staff that you can count on one hand, and our success depends on the efforts of volunteers who champion literacy and librarianship for their communities. We’re still learning which channels may be best to extend our offerings to patrons in regions which we’re currently under-serving. If you have an idea on how we can reach a new community, we’d love your advice and your help. Please send us your ideas using the “Communication & Outreach” link on our volunteer page.
  4. Translation & Localization – Making a website like Open Library accessible and usable to an international audience takes more than clicking “google translate”. For years Open Library has had a pipeline and process for adding translations.

Goal: 5 Languages

Our current goal is to fully localize the Open Library website into 5 languages. We currently have contributions for translations across 7 languages: Čeština, Deutsch, English, Español, Français, Hrvatski, and తెలుగు.

English, Spanish, French, and Croatian (Hrvatski) are the most up to date and you can try the website in those languages by clicking their respective links. Can you help us get one of these other languages across the finish line?

Why Contribute Now?

In the past, translators did not have an automatic way to receive feedback about whether they had contributed translations correctly. Translators would need to have a conversation with staff in order to get started, submit translations for review, and then a member of staff would report back if there was a mistake. This process had so much friction that it resulted in many incomplete translation submissions.

This year, Jim Champ, Drini Cami, and others in the community added automated validation so translators get near-real-time feedback about whether their translations were submitted correctly. Now, submitting a translation is much simpler and only requires one to know the target language. Here’s how!

How it Works

All you need in order to contribute translations is a GitHub account. Translations can be contributed directly on the GitHub website by following the Translator’s Contributor’s Guide, with no special software required to participate.
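To illustrate the kind of check automated validation can perform, here is a small Python sketch. It does not show Open Library's actual validator. It assumes, purely for illustration, that translatable messages use Python-style placeholders such as `%(name)s`, and it flags translations that are empty or whose placeholders differ from the source message. All names in it are hypothetical.

```python
import re

# Matches named placeholders like %(count)d and bare ones like %s.
PLACEHOLDER = re.compile(r"%\(\w+\)[sd]|%[sd]")


def placeholders(message):
    """Return the sorted list of placeholders that appear in a message."""
    return sorted(match.group(0) for match in PLACEHOLDER.finditer(message))


def validate_translation(source, translation):
    """Return a list of human-readable problems; empty means it looks fine."""
    errors = []
    if not translation.strip():
        errors.append("translation is empty")
    elif placeholders(source) != placeholders(translation):
        errors.append(
            f"placeholders differ: expected {placeholders(source)}, "
            f"found {placeholders(translation)}"
        )
    return errors


if __name__ == "__main__":
    print(validate_translation("Hello %(name)s", "Hola %(name)s"))
    print(validate_translation("Hello %(name)s", "Hola %(nombre)s"))
```

Running a check like this automatically on every submission is what lets a translator see a mistake right away instead of waiting for a staff review.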

Want to Help Translate?

Let us know here: https://openlibrary.org/volunteer#translator

Meet our Translators

Daniel – Spanish

Daniel Capilla lives in Málaga, Spain and has been contributing to Open Library since 2013. Daniel’s interest in contributing to Open Library was sparked by his joy of reading and all things library-and-book-related, as well as the satisfaction he gets from contributing to open source projects and knowing that everyone will be able to freely enjoy his contributions in the future. Daniel has made significant contributions by adding a first Spanish translation and believes:

“The issue of the internationalization of the Open Library seems to me to be a fundamental issue for the project to have more acceptance, especially in non-English speaking countries. This is an issue on which there is still much to be done.”

Follow Daniel on twitter: @dcapillae

2021 End-of-Year Community Updates

Hi Open Library Community! This is going to be a less formal post detailing some of our recent community meetings and exciting Q4 (quarter 4) opportunities to learn, celebrate, and participate with the Open Library project.

Earlier this Month

Upcoming Events

  1. 📙 Library Leader’s Forum 2021-10-13 & 2021-10-20
  2. 🎉 Open Library Community Celebration (RSVP) 2021-10-26
  3. 📅 2022 Roadmap Community Planning (join) 2021-11-02 @ 10am PT

Open Library Community Celebration 2021

Last year we started the tradition of doing an Open Library Community Celebration to honor the contributions & impact of those in our community. On October 26, 2021 @ 10am Pacific we will be hosting our 2nd annual community celebration. We hope you can join us!

During this online event, you’ll hear from members of the community as we:

  • Announce our latest developments and their impacts
  • Raise awareness about opportunities to participate
  • Show a sneak-peek into our future: 2022

EDIT: The Community Celebration happened and you may watch it here!


5-Year Vision

At the end of September, on 2021-09-28 @ 10am PT, the Open Library community came together to brainstorm Open Library’s possible long-term directions. Anyone in the community is welcome to comment and add their notes and thoughts:

https://docs.google.com/document/d/1q_jAcdEc705H3gsZv_Yt_08c8YFmefdSvMbiljc2O8g/edit#


2021 Year-End Review

On 2021-10-12 @ 10am PT the community met to review what we had accomplished (see review doc) on our 2021 roadmap.


2022 Community Planning

In the first week of November, on 2021-11-02 @ 10am PT, the community will meet to brainstorm goals for Open Library’s 2022 roadmap. This community planning call will be open to the public here.

EDIT: Community Planning happened and you can see the results or leave comments using the 2022 roadmap link.

EDIT: Our 2022 Priorities can now be seen here: https://docs.google.com/document/d/1edU3lCTHAjFr1mXUilh8l1_rNek33pRTzFHltBU–fM/edit#

How one volunteer is sharing a better reading experience with all of us

For nearly 15 years Open Library has been giving patrons free access to information about books in its catalog, direct to their computers. But for millions of readers across the globe who rely on their phones for access, this hasn’t always presented the ideal mobile reading experience.

This year, a volunteer within the Open Library community named Mark developed an independent mobile app, an unofficial companion to the website called the Open Library Reader. This lite app, which is available for free from the Apple store and Play store, emphasizes the mobile reading experience and showcases the books within a patron’s Open Library reading log. It’s a great way to take your personal library with you on the go.

While Open Library Reader is an unofficial app which is not maintained or supported by the staff at Internet Archive, we’re ecstatic that talented volunteers within our community are stepping up to design new experiences they wish existed for themselves and others. We applaud Mark, not only for the time he invested and for showing what’s possible with our APIs, but also, true to the spirit of Open Library, for sharing his app for free with patrons in a way that seems to respect patron privacy.

We sat down with Mark for an interview to learn why he created the Open Library Reader and which of its features may be appreciated by book lovers who are on the go.

A picture of a patron’s personal library when logged in to the Open Library Reader app

Open Librarian: “Why did you find the need to build an Open Library Reader?”

Mark: I read a lot of books on my iPad, especially old, hard-to-find mystery novels. Open Library has a lot of great reads, but I was getting frustrated trying to manage my Reading Log and read books in the tablet browser. There was a lot of scrolling and clicking around, a tap in the wrong place could send me off somewhere else, and the book I was reading was always surrounded by browser and bookreader controls. I just wanted to sit down and read, and not have to be reminded of the fact that I was looking at a website through a browser.

Open Librarian: What were some of the approaches Open Library Reader used to solve these problems?

Mark: I thought about some of the good tablet-based reading experiences I’ve had, and imagined what it could look like if the interface were centered around the individual reader and the small set of tools they need to find, manage, and read books. So the Reading Log shelves and the reading interface are the core of the app, and everything else kind of happens at the edges. Everything you need is just one tap away. The reading interface is still the familiar Internet Archive BookReader, but I’ve overlaid some additional functionality. You can hide all the controls with a single tap, and the book expands to completely fill the screen. I also added a swipe gesture, so it’s easy to turn pages if you’re holding your device with one hand on the couch.

Open Librarian: What does it feel like to use? Can we have a tour?

Mark:

Open Librarian: What is your favorite part of the app? I like how it shows the return time.

Mark: That is cool — that’s another example of centering the needs of the reader. It’s hard to pick a favorite part. Every feature is the result of me reading in the app every day for months before I released it. Periodically, I’d think “that’s kind of annoying” or “I wish I could…” and I’d go code for a while until I was happy with the experience. But the full-screen reading mode is probably my favorite. With the high-resolution page scans expanded to fill the screen, it’s almost like reading a physical book.

Open Librarian: What was your experience like developing the Reader?

Mark: I’m a retired web developer, so interface design, user experience, APIs and that sort of thing are nothing new, but I’ve never built a native app. After some reading, I picked Google’s Flutter tool, which allows easy cross-platform app development. I was amazed at how fast it was to assemble a simple app with just a few lines of code, and then it was just a matter of layering on the functionality I wanted. I spent a lot of time exploring the Open Library and Internet Archive APIs to figure out the best way to get at the data I needed, and even submitted a few updates to the Open Library codebase to support features I wanted to build. The Open Library team was extremely welcoming and supportive, and really made this app possible.

How can you support Mark’s work?

First, try downloading the Open Library Reader app from the Apple store or Play store. If you have a suggestion, question, or feedback for Mark, email him at olreader@loomis-house.com. If you appreciate his work, consider rating the app on the app stores and leaving a review so others may discover and enjoy it too. To learn more about Mark and the Open Library Reader, look out for his upcoming interview on the Open Library Community Podcast.

Want to contribute to Open Library too?

See all the ways you can volunteer within the Open Library community!