Author Archives: Drini Cami

Improving Open Library’s Translation Pipeline

A foreword by Drini Cami
Drini Cami here, Open Library staff developer. It’s my pleasure to introduce Rebecca Shoptaw, a 2024 Open Library Engineering Fellow, to the Open Library blog in her first blog post. Rebecca began volunteering with us a few months ago and has already made many great improvements to Open Library. I’ve had the honour of mentoring her during her fellowship, and I’ve been incredibly impressed by her work and her trajectory. Combining her technical competence, work ethic, always-ready positive attitude, and her organization and attention to detail, Rebecca has been an invaluable and rare contributor. I can rely on her to take a project, break it down, learn anything she needs to learn (and fast), and then run it to completion. All while staying positive and providing clear communication of what she’s working on and not dropping any details along the way.

In her short time here, she has also already taken a guidance role with other new contributors, improving our documentation and helping others get started. I don’t know how you found us, Rebecca, but I’m very glad you did!

And with that, I’ll pass it to Rebecca to speak about one of her first projects on Open Library: improving our translation/internationalization pipeline.

Improving Open Library’s Translation Pipeline

Picture this: you’re browsing around on a site, not a care in the world, and suddenly out of nowhere you are told you can “cliquez ici pour en savoir plus.” 

Maybe you know enough French to figure it out, maybe you throw it into Google Translate, maybe you can infer from the context, or maybe you just give up. In any of these cases, your experience of using the site just became that much less straightforward.

This is what the Open Library experience has been here and there for many non-English-speaking readers. All our translation is done by volunteers, so with over 300 site contributors and an average of 40 commits added to the codebase each week, there has typically been some delay between new text getting added to the site and that text being translated.

One major source of this delay was on the developer side of the translation process. To make translation of the site possible, the developers need to provide every translator with a list of all the phrases that will be visible to readers on-screen, such as the names of buttons (“Submit,” “Cancel,” “Log In”), the links in the site menu (“My Books,” “Collections,” “Advanced Search”), and the instructions for adding and editing books, covers, and authors. While updates to the visible text occur very frequently, the translation “template” which lists all the site’s visible phrases was previously only updated manually, a process that would usually happen every 3-6 months.

This meant that new text could sit on the site for weeks or months before our volunteer translators were able to access it for translation. There had to be a better way.

And there was! I’m happy to report that the Open Library codebase now automatically generates that template file every time a change is made, so translators no longer have to wait. But how does it work, and how did it all happen? Let’s get into some technical details.

How It Began

Back in February, one of the site’s translators requested an update to the template file so as to begin translating some of the new text. I’d done a little developer-side translation work on the site already, so I was assigned to the issue. 

I ran the script to generate the new file, and right away noticed two things:

  1. The process was very simple to run (a single command), and it ran very quickly.
  2. The update resulted in a 2,132-line change to the template file, which meant it had fallen very, very out of date.

I pointed this out to the issue’s lead, Drini, and he mentioned that there had been talk of finding a way to automate the process, but they hadn’t settled on the best way to do so.

I signed off and went to make some lunch, then ran back and suggested that the most efficient way to automate it would be to check whether each incoming change includes new/changed site text, and to run the script automatically if so. He liked the idea, so I wrote up a proposal for it, but nothing really came of it until:

The Hook

In March, Drini reached back out to me with an idea about a potentially simple way to do the automation. Whenever a developer submits a new change they would like to make to the code, we run a set of automatic tests, called “pre-commit hooks,” mostly to make sure that their submission does not contain any typos and would not cause any problems if integrated into the site. 

Since my automation idea had been to update the translation template each time a relevant change was made, Drini suggested that the most natural way to do that would be to add a quick template re-generation to the series of automated tests we already have.

The method seemed shockingly simple, so I went ahead and drafted an implementation of it. I tested it a few times on my own computer, found that it worked like a charm, and then submitted it, only to encounter:

The Infinite Loop of Failure

Here’s where things got interesting. The first version of the script simply generated a new template file whether or not the site’s text had actually been changed – this made the most sense since the process was so fast and if nothing actually had changed in the template, the developer wouldn’t notice a difference.

But strangely enough, even though my changes to the code didn’t include any new text, I was failing the check that I wrote! I delved into the code, did some more research into how these hooks work, and soon discovered the culprit. 

The process for a simple check and auto-fix usually works as follows:

  1. When the change comes in, the automated checks run; if the program notices that something is wrong (e.g. extra whitespace), it fixes any problems automatically if possible.
  2. If it doesn’t notice anything wrong and/or doesn’t make any changes, it will report a success and stop there. If it notices a problem, even if it already auto-fixed it, it will report a failure and run again to make sure its fix was successful.
  3. On the second run, if the automatic fix was successful, the program should not have to make any further changes, and will report a success. If the program does have to make further changes, or notices that there is still a problem, it will fail again and require human intervention to fix the problem.
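
To make those three steps concrete, here is a minimal sketch of an auto-fixing check. It is not Open Library's actual hook: the function names and the sample file are made up for illustration, and the "exit status" is just an integer where 0 means success and 1 means failure.

```python
def strip_trailing_whitespace(text: str) -> str:
    """Remove trailing spaces and tabs from every line."""
    return "\n".join(line.rstrip(" \t") for line in text.split("\n"))


def run_fixer(files: dict[str, str]) -> int:
    """Fix files in place; return 1 if anything changed, else 0.

    `files` is a hypothetical stand-in for the files in a submitted change.
    """
    changed = False
    for name, text in files.items():
        fixed = strip_trailing_whitespace(text)
        if fixed != text:
            files[name] = fixed
            changed = True
    return 1 if changed else 0


files = {"page.html": "<p>Hello</p>   \n<p>World</p>\n"}
first_run = run_fixer(files)   # makes a fix, so it reports a failure
second_run = run_fixer(files)  # nothing left to fix, so it reports a success
```

The key property is that the fix is repeatable: running it on already-fixed text changes nothing, which is what lets the second run pass.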

This is the typical process for fixing small formatting errors that can easily be handled by an automation tool. But in this case, the script was running twice and reporting a failure both times.

By comparing the versions of the template, I discovered that the problem was very simple: the hook is designed, as described above, to report a failure and re-run if it has made any changes to the code. The template includes a timestamp that records when it was last updated, down to the second. Online, more pre-commit checks run than locally, so the whole suite takes long enough that a few seconds have elapsed by the time the hook runs again. The second run therefore generates a new timestamp, notices a one-line difference between the current and previous templates (the timestamp itself), and fails again. I.e.:

  1. The changes come in, and the program auto-updates the translation template, including the timestamp.
  2. It notices that it has made a change (the timestamp and any new/changed phrases), so it reports a failure and runs again.
  3. The program auto-updates the translation template again, including the timestamp.
  4. It notices that it has made a change (the timestamp has changed), and reports a second failure.

And so on. An infinite loop of failure!
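
The loop can be reproduced with a toy version of the template. This is only a sketch under stated assumptions: the `# Generated:` header and the `render_template` and `regenerate` helpers are invented for illustration and are not the real template format or script.

```python
from datetime import datetime, timedelta


def render_template(phrases: list[str], now: datetime) -> str:
    """Render a toy template: a timestamp header followed by the phrases."""
    header = f"# Generated: {now:%Y-%m-%d %H:%M:%S}"
    return "\n".join([header, *sorted(phrases)])


def regenerate(current: str, phrases: list[str], now: datetime) -> tuple[str, int]:
    """Always rewrite the template; return it with 1 if the file changed."""
    new = render_template(phrases, now)
    return new, (1 if new != current else 0)


phrases = ["Cancel", "Log In", "Submit"]
start = datetime(2024, 5, 1, 12, 0, 0)

template, first = regenerate("", phrases, start)
# The second run happens a few seconds later, so only the header differs.
template, second = regenerate(template, phrases, start + timedelta(seconds=5))
```

Both `first` and `second` come back as failures even though the phrases never changed, because the timestamp alone is enough to make the file differ.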

We could find no way to simply remove the timestamp from the template, so to get out of the infinite loop of failure, I ended up modifying the script so that it actually checks whether the incoming changes would affect the template before updating it. Basically, the script gathers up all the phrases in the current template and compares them to all the incoming phrases. If there is no difference, it does nothing and reports a success. If there is a difference, i.e. if the changes have added or changed the site’s text, it updates the template and reports a failure, so that now:

  1. The changes come in, and the program checks whether an auto-update of the template would have any effect on the phrases. 
  2. If there are no phrase changes, it decides not to update the template and reports a success. If there are phrase changes, it auto-updates the template, reports a failure and runs again.
  3. The program checks again whether an auto-update would have any effect, and this time it will not (since all the new phrases have been added), so it does not update the template or timestamp, and reports a success.
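
The fixed behaviour can be sketched the same way. Again, this is an illustration rather than the real script: the toy template format (a `#` header line followed by one phrase per line) and the names `extract_phrases` and `check_template` are assumptions made for the example.

```python
def extract_phrases(template: str) -> set[str]:
    """Return the template's phrases, skipping '#' header lines and blanks."""
    return {
        line for line in template.splitlines()
        if line and not line.startswith("#")
    }


def check_template(current: str, incoming: set[str], timestamp: str) -> tuple[str, int]:
    """Rewrite the template only when the phrases differ.

    Returns the (possibly unchanged) template and a status:
    0 if nothing needed updating, 1 if the template was rewritten.
    """
    if extract_phrases(current) == incoming:
        return current, 0  # leave the file, and its timestamp, untouched
    header = f"# Generated: {timestamp}"
    return "\n".join([header, *sorted(incoming)]), 1


old = "# Generated: 2024-05-01 12:00:00\nCancel\nDelete Account\nSubmit"
incoming = {"Cancel", "Delete Your Account", "Submit"}

template, first = check_template(old, incoming, "2024-05-02 09:30:00")
template, second = check_template(template, incoming, "2024-05-02 09:30:05")
```

Here the first run rewrites the template and fails, while the second run sees identical phrases and passes without touching the timestamp, so the loop ends after one retry.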

What it looks like locally:

A screen recording of the new translation script in action. A developer adds the word "Your" to the phrase "Delete Your Account" and submits the change. The automated tests run; the translation test fails, and updates the template. The developer submits the updated template change, and the automated tests run again and pass.

I also added a few other options to the script so that developers could run it manually if they chose, and could decide whether or not to see a list of all the files that the script found translatable phrases in.

The Rollout

To ensure we were getting as much of the site’s text translated as possible, I also proposed and oversaw a bulk formatting of a lot of the onscreen text which had previously not been findable by the template-updating function. The project was heroically taken on by Meredith (@merwhite11), who successfully updated the formatting for text across almost 100 separate files. I then did a full rewrite of the instructions for how to format text for translation, using the lessons we learned along the way.

When the translation automation project went live, I also wrote a new guide for developers so they would understand what to expect when the template-updating check ran, and answered various questions from newer developers re: how the process worked.

The next phase of the translation project reused the automated process we had worked out for updating the template, this time to notify developers if their changes include text that isn’t correctly formatted for translation. Stef (@pidgezero-one) did a fantastic job making that a reality, and it has allowed us to properly internationalize upwards of 500 previously untranslatable phrases, which will make internationalization much easier to keep track of for future developers.

When I first updated the template file back in February of this year, it had not been updated since March of the previous year, about 11 months. The automation has now been live since May 1, and since then the template has already been auto-updated 35 times, or approximately every two to three days. 

While the Open Library translation process will never be perfect, I think we can be very hopeful that this automation project will make une grosse différence.

Search Is Getting Smarter on Open Library

Image of Mississauga cityscape near sunset. Photo credit: Bart Brewinski

Dear readers,

I sit here cosily on a cold winter’s night, looking out over the Mississauga cityscape, thinking about the important mission we planned for and set out to accomplish almost a year ago: Empowering you, dear readers, to better search for and discover books on Open Library.

For too many years now you’ve been limited in how books can be found from Open Library’s extensive catalogue. Since the dawn of its existence, Open Library’s goal has been to make one web page for every book ever published. And to make those books accessible! But one problem with having millions and millions of book records is that finding just the book you need can be difficult. Search is your gateway. Your one way to find what you’re looking for. But what if search can’t get you what you need?

Well for many readers, it was impossible to find what they were looking for. The search experience was plagued with limitations. It was impossible to find books in a certain language, or from a certain publisher. Sometimes, your search queries would even return no results at all — even for books actually in the library!

This past week I’ve been busy rolling out our improved search experience as the default across the site. Here are the previously impossible searches that are now possible!

Find borrowable or readable books in a specific language. Previously, the results wouldn’t guarantee that a borrowable or readable edition of the search result was in the specified language. Now they do! For example, any fellow readers who are trying to learn German can now easily find borrowable or readable books in German! Or… how about Spanish? Japanese? Polish? Take your pick!

Search results now prefer editions matching your language. If you have Open Library’s language set to French and you search for “harry potter”, you will see the French cover and title of Harry Potter first. Try it!

Combinations of edition query fields. Now, queries can filter on edition data as well as work data. All these queries used to be impossible on Open Library:

Search results now show the edition that best matches your query. Now, if you search for “one hundred years of solitude”, because your query is in English (regardless of your display language), the English title One Hundred Years of Solitude will be displayed instead of the original Spanish title, Cien años de soledad. Try it! Previously, searching for “one hundred years of solitude” wouldn’t match the correct book at all!

And for any developers out there, these features are also available via the Search API. You just need to add `editions` to the `fields` parameter to get back a new editions subfield with matching edition data.
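
For developers who want to try this, here is a rough sketch of building such a request with Python's standard library. The `fields` parameter and the `editions` value come from the description above; the helper names are made up, and the nesting of the response (`docs`, then `editions`, then `docs`) is an assumption illustrated with a hand-written sample, so check a live response before relying on it.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://openlibrary.org/search.json"


def build_search_url(query: str, fields: list[str]) -> str:
    """Build a search URL, making sure `editions` is among the fields."""
    if "editions" not in fields:
        fields = [*fields, "editions"]
    params = urlencode({"q": query, "fields": ",".join(fields)})
    return f"{SEARCH_URL}?{params}"


def edition_titles(response: dict) -> list[str]:
    """Collect edition titles from a response shaped like `sample` below."""
    titles = []
    for work in response.get("docs", []):
        for edition in work.get("editions", {}).get("docs", []):
            if "title" in edition:
                titles.append(edition["title"])
    return titles


url = build_search_url("one hundred years of solitude", ["key", "title"])

# Hand-written sample, NOT real API output; the nesting is an assumption.
sample = {
    "docs": [
        {
            "title": "Cien años de soledad",
            "editions": {"docs": [{"title": "One Hundred Years of Solitude"}]},
        }
    ]
}
titles = edition_titles(sample)
```

Fetching `url` with any HTTP client and passing the decoded JSON to `edition_titles` would be the next step; that part is left out so the sketch runs without network access.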

Search is a behemoth, and there’s always more to do! Here are some of the tweaks and improvements we have lined up to improve upon this work:

These changes required an overhaul of our core Solr-based search infrastructure to make search results edition-aware. But now that this information is in our search engine, we just need to add it to more and more places. These are features that readers have long desired for searching Open Library. And now, their expectations are reality! Open Library just got a little easier to use, and a little more accessible and inclusive.

Happy Reading!

Drini (with some generous writing support and photography from Bart Brewinski)

Introducing Trusted Book Providers

Building the Internet’s library is no easy task, and it can’t be done alone. Thankfully, we’re not alone in wanting to provide access to knowledge, books, and reading — which is why we’re excited to introduce Trusted Book Providers into Open Library. This feature allows us to provide direct “Read” links to a number of carefully selected, reputable sources of books online. Integrations with Project Gutenberg and LibriVox are up and running, and integrations with Standard Ebooks, OpenStax, and Wikisource are in progress. By linking to these outstanding organizations, we’re excited to help promote their wonderful work as well as give Open Library patrons easy access to more trusted sources for digital books. We see this as a step in helping the world of open access books flourish.

Viewing LibriVox and Gutenberg works in Open Library

For more than ten years, Open Library has allowed patrons from across the globe to read, borrow, and listen to digital books from the Internet Archive’s prodigious lending library and public domain collection. Since then, the Internet Archive has partnered closely with more than 1,000 US libraries to accession books, ensure their digital preservation, and make them useful to select audiences, such as those with print disabilities, through controlled library practices.

Open Library is now excited to expand its “Read” buttons to include not only the millions of books made available by the Internet Archive, but also works from other trusted digital collections. What does this mean for patrons? It means more books and more reading options — such as LibriVox’s human-read public domain audiobooks, Standard Ebooks’ lovingly formatted modern epubs, or Project Gutenberg’s reflowable-text books. We hope this will result in a more inclusive ecosystem and shine more light on the amazing work done by these other mission-aligned non-profit organizations.

Choosing the First Trusted Book Providers

We selected the first group of Trusted Book Providers based on several factors. First, we prioritized non-profit organizations that are reputable, well-established, and have a similar focus on serving public good. Second, we looked for providers whose holdings increased the diversity of book formats Open Library may link to. Third, we looked for providers who focus on open & permissive licensing, or public domain material.

Project Gutenberg

Project Gutenberg is the oldest digital library online. Founded in 1971 (was the internet even around then?), the volunteer-driven organization is dedicated to creating free, open, long-lasting eBooks that are easily accessible from many devices. The Internet Archive already proudly preserves most of Project Gutenberg’s over 60,000 titles, and Open Library is excited to be able to have users read from Project Gutenberg directly. For patrons, the human-curated, reflowable-text formats made available by Project Gutenberg are ideal for reading on small screens, e-readers, and also for powerful accessibility customization, like dyslexic fonts and screen readers.

Browse on Open Library

LibriVox

Founded in 2005, LibriVox’s stated mission is “to make all books in the public domain available, narrated by real people and distributed for free, in audio format on the internet.” And with over 15,000 editions in over 80 languages, they’re making great headway! The Internet Archive also works with LibriVox, and provides storage for their mass of audio files. For patrons, LibriVox integration means they will now have access to human-spoken audiobooks for many public domain works.

Browse on Open Library

Standard Ebooks

Standard Ebooks is a volunteer-driven project dedicated to producing new editions of public domain ebooks that are lovingly formatted, open source, free of copyright restrictions, and free of cost. Founded in 2015, the project carefully standardizes and normalizes its books to work great as reflowable-text HTML, as well as modern epubs with all the trimmings: table of contents, typographical attention to detail, beautiful public domain cover art, and more. For patrons, Standard Ebooks’ over 500 titles are perfect for reading on web browsers, phones, or e-readers due to their reflowable text and modern epub features specifically optimized for every e-reader platform.

In Progress… | Browse at Standard Ebooks

OpenStax

OpenStax is a non-profit dedicated to creating original, free, open-access high school and college textbooks. Part of Rice University, a non-profit corporation, OpenStax has created over 60 high-quality, peer-reviewed textbooks since its launch in 2012, with some titles available in English, Spanish, and Polish. Open Library will include OpenStax read links so our patrons can find and access these digital-only materials online or as PDF or ePub downloads.

In Progress… | Browse at OpenStax

Wikisource

Launched in 2003, Wikisource is an online digital library of free-content textual sources on a wiki, operated by the Wikimedia Foundation (the folks who run Wikipedia). Wikisource has a huge community of editors dedicated to converting scans of classic books to error-free, proofread digital books. And improving their records is as easy as editing a Wikipedia page! Offering reading options online or offline as PDF, ePub, mobi, etc. for millions of records, Wikisource’s catalog, spanning over 30 languages, is unparalleled. And soon, you’ll be able to find these works right in Open Library!

In Progress… | Browse at Wikisource

How Trusted Book Providers Work

As a patron, you shouldn’t have to do anything special to access titles from our Trusted Book Providers.

When designing support for Trusted Providers, we wanted to find the right balance between convenience and trust. We didn’t want patrons to get confused by a button taking them to a new website without warning. But we also didn’t want to introduce unnecessary friction and multiple clicks preventing patrons from easily accessing books. As a result, our team converged on two strategies:

  1. When a Read button is for a Trusted Provider, the button will display an external link icon.
  2. When you click a Trusted Provider button, a message will appear on Open Library providing context about the Trusted Provider. The Trusted Provider link will open in a new browser tab.

Recommend a Trusted Book Provider

Are you a book service, library, or publisher that would like to integrate with Open Library’s catalog? Or is there a service you’d like to recommend?

Please recommend or apply to become a Trusted Book Provider using this form.

Giacomo Cignoni: My Internship at the Internet Archive

This summer, Open Library and the Internet Archive took part in Google Summer of Code (GSoC), a Google initiative to help students gain coding experience by contributing to open source projects. I was lucky enough to mentor Giacomo while he worked on improving our BookReader experience and infrastructure. We have invited Giacomo to write a blog post to share some of the wonderful work he has done and what he learned. It was a pleasure working with you, Giacomo, and we all wish you the best of luck with the rest of your studies! – Drini


Hi, I am Giacomo Cignoni, a 2nd year computer science student from Italy. I submitted my 2020 Google Summer of Code (GSoC) project to work with the Internet Archive and I was selected for it. In this blogpost, I want to tell you about my experience and my accomplishments working this summer on BookReader, Internet Archive’s open source book reading web application.
