
Open Library Search: Balancing High-Impact with High-Demand

The Open Library is a card catalog of every book published, spanning more than 50 million edition records. Fetching all of these records at once is computationally expensive, so we use Apache Solr to power our search engine. This system is primarily maintained by Drini Cami, with support from myself, Scott Barnes, Ben Deitch, and Jim Champ.

Our search engine is responsible for rapidly generating results when patrons use the autocomplete search box, when apps request book data through our programmatic Search API, when book carousels load the data they need to render, and much more.

A key challenge of maintaining such a search engine is keeping its schema manageable: compact and efficient, yet versatile enough to serve the diverse needs of millions of registered patrons. Small decisions, like whether a certain field should be made sortable, can — at scale — make or break the system’s ability to keep up with requests.

This year, the Open Library team committed to releasing several ambitious search improvements during a time when the search engine was already struggling to meet existing load:

  • Edition-powered Carousels that go beyond the general work to show you the most relevant, specific, available edition in your desired language.
  • Trending algorithms that showcase which books are seeing sudden upticks in interest, as opposed to which are consistently popular over long stretches of time.
  • 10K Reading Levels to make the K-12 Student Library more relevant and useful.

Rather than tout a success story (we’re still in the thick of figuring out performance day-by-day), our goal is to pay it forward, document our journey, and give others reference points and ideas for how to maintain, tune, and advance a large production search system with a small team. The vibe is “keep your head above water.”

An AI-generated image of someone holding a book above water

Starting in the Red

Toward the third quarter of last year, the Internet Archive and the Open Library were victims of a large-scale, coordinated DDoS attack. The result was significant excess load on our search engine and material changes in how we secured and accessed our networks. During this time, the entire Solr re-indexing process (i.e. the technical process for rebuilding a fresh search engine from the latest data dumps) was left in a broken state.

In this pressurized state, our first action was to tune Solr’s heap. We had allocated 10GB of RAM to the Solr instance, but the heap was also allowed to use the full 10GB, leaving no headroom and resulting in memory exhaustion. When Scott lowered the heap to 8GB, we encountered fewer heap errors. The problem had been compounded by the fact that we previously dealt with long spikes of 503s by restarting Solr, causing a thundering-herd problem where the server would restart only to be immediately overwhelmed by heap errors again.

With 8GB of heap, our memory utilization gradually rose until we were using about 95% of memory, and without further tuning and monitoring, we had few options other than to increase the RAM available to the host. Fortunately, we were able to grow from ~16GB to ~24GB. We now typically operate within 10GB and are fairly CPU-bound, with a load average of around 8 across 8 CPUs.

We then fixed our Solr re-indexing flow, enabling us to more regularly “defragment” the index — i.e. run Solr’s optimize operation. In rare cases, we’ve been able to split traffic between our prod-solr and staged-solr instances to withstand large spikes in traffic, though typically we operate from a single Solr instance.
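For readers unfamiliar with Solr's optimize operation, here is a minimal sketch of how a re-indexing flow might trigger it through Solr's standard update handler. The host and core name are illustrative placeholders, not our production configuration.

```python
"""Minimal sketch: triggering a Solr optimize ("defragment") from Python.
The Solr URL and core name below are illustrative placeholders."""
import requests

SOLR_CORE_URL = "http://localhost:8983/solr/openlibrary"  # placeholder


def optimize_solr(max_segments: int = 1) -> dict:
    # Solr's update handler accepts optimize=true, which merges index segments
    # and reclaims the space left behind by deleted or re-indexed documents.
    resp = requests.post(
        f"{SOLR_CORE_URL}/update",
        params={
            "optimize": "true",
            "maxSegments": max_segments,
            "waitSearcher": "true",  # block until the optimized index is live
        },
        timeout=3600,  # optimizing a large index can take a long time
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    print(optimize_solr().get("responseHeader"))
```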

Even with more memory, there are only so many long, expensive requests Solr can queue before getting overwhelmed. Outside of looking at the raw Solr logs, our visibility into what was happening across Solr was still limited, so we put our heads together to identify obvious cases where the website makes expensive calls wastefully. Jim Champ helped us implement book carousels that load asynchronously and only when scrolled into view. He also switched the search page to asynchronously load the search facets sidebar. This was especially helpful because previously, a failure to render the expensive search facets would cause the entire search results page, rather than only the facets side menu, to fail.

Sentry on the Hill

After several tiers of low-hanging fruit were plucked, we turned to more specific tools and added monitoring. First, we added Sentry profiling, which gave us much more clarity about which queries were expensive and how often Solr errors were occurring.

Sentry allows us to see a panoramic view of our search performance.

Sentry also gives us the ability to drill in and explore specific errors and their frequencies.

With profiling, we can even explore individual function calls to learn where the process is spending the most time.

Docker Monitoring & Grafana

To further increase our visibility, Drini developed a new type of monitoring Docker container that can be deployed agnostically to each of our VMs and uses environment variables so that only the jobs relevant to that host are run. This approach has allowed us to centrally configure recipes so each host collects the data it needs and uploads it to our central dashboards in Grafana.
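We won't reproduce that container here, but the core pattern (one monitoring image whose environment variables select which collection jobs run on a given host) might look roughly like this sketch. The MONITORING_JOBS variable and the job names are illustrative assumptions, not our actual configuration.

```python
"""Illustrative sketch of env-var-driven monitoring jobs (not the real container)."""
import os
import time


def collect_solr_stats() -> None:
    print("collecting Solr request/latency stats...")  # placeholder job


def collect_nginx_stats() -> None:
    print("parsing nginx access logs...")  # placeholder job


JOBS = {
    "solr": collect_solr_stats,
    "nginx": collect_nginx_stats,
}

if __name__ == "__main__":
    # e.g. MONITORING_JOBS="solr,nginx" on the Solr host, "nginx" elsewhere,
    # so the same image can be deployed everywhere with host-specific recipes.
    enabled = [j for j in os.environ.get("MONITORING_JOBS", "").split(",") if j]
    while True:
        for name in enabled:
            if name in JOBS:
                JOBS[name]()  # run only the jobs relevant to this host
        time.sleep(60)
```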

Recently, we added labels to all of our Solr calls so we can view exactly how many requests are being made of each query type and what their performance characteristics are.
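We won't go into our exact implementation, but labeling outbound Solr calls by query type can be as simple as wrapping the query function. The sketch below uses the prometheus_client library purely for illustration; our actual metrics pipeline, metric names, and labels may differ.

```python
"""Illustrative sketch: tagging each Solr call with a query-type label so
dashboards can break down request volume and latency per endpoint."""
import time

import requests
from prometheus_client import Counter, Histogram

SOLR_SELECT_URL = "http://localhost:8983/solr/openlibrary/select"  # placeholder

SOLR_REQUESTS = Counter(
    "solr_requests_total", "Solr requests by query type", ["query_type"]
)
SOLR_LATENCY = Histogram(
    "solr_request_seconds", "Solr request latency by query type", ["query_type"]
)


def labeled_solr_query(params: dict, query_type: str) -> dict:
    # The label (e.g. "autocomplete", "carousel", "search_page") records which
    # part of the site generated the query, so Grafana can group by it.
    SOLR_REQUESTS.labels(query_type=query_type).inc()
    start = time.monotonic()
    try:
        resp = requests.get(SOLR_SELECT_URL, params=params, timeout=10)
        resp.raise_for_status()
        return resp.json()
    finally:
        SOLR_LATENCY.labels(query_type=query_type).observe(time.monotonic() - start)
```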

At a top level, we can see in blue how much total traffic we’re getting to Solr, and the green colors (darker is better) let us know how many requests are being served quickly.

We can then drill in and explore each Solr query by type, identifying which endpoints are causing the greatest strain and giving us a way to further analyze nginx web traffic in case the load is the result of a DDoS.

For some time, we have been able to see how hard each of our main Open Library web application workers was working at any given time. Spikes of pink or purple were when Open Library was waiting for requests to finish from Archive.org. Yellow patches — until recently — were classified as “other,” meaning we didn’t know exactly what was going on (even though Sentry profiling and flame graphs gave us strong clues that Solr was the culprit). By using py-spy with our new Docker monitoring setup, we were able to add Solr profiling into our worker graphs on Grafana and visualize the complete story clearly:

Once we turned on this new monitoring flow, it was clear that these large sections of yellow, where workers were inundated with “unknown” work, were largely (~50%) Solr.

With Great Knowledge…

Each graph helped us further direct and focus our efforts. Once we knew Open Library was being slowed down primarily by Solr, we began investigating requests, and Drini noticed many Solr requests living on for more than 10 seconds, even though the Open Library app is instructed to abandon any Solr query that takes longer than 10 seconds. It turns out that, even in these cases, Solr may continue processing the query in the background (so it can finish and cache the result for the future). This “feature” was exhausting Solr’s free connections and creating a long haproxy queue. Drini modified our Solr queries to include a timeAllowed parameter matching Open Library’s contract to quit after 10 seconds, and almost immediately the service showed signs of recovery:
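As a rough illustration of that change, the sketch below passes Solr's timeAllowed parameter (in milliseconds) alongside the client-side timeout the app already enforced. The host, core name, and exact wiring are placeholders, not our production code.

```python
"""Sketch: giving Solr the same 10-second budget the web app enforces client-side."""
import requests

SOLR_SELECT_URL = "http://localhost:8983/solr/openlibrary/select"  # placeholder
QUERY_BUDGET_MS = 10_000


def solr_select(params: dict) -> dict:
    params = dict(params)
    # timeAllowed tells Solr to stop searching after this many milliseconds and
    # return partial results, instead of burning a connection on a query the
    # client has already abandoned.
    params.setdefault("timeAllowed", QUERY_BUDGET_MS)
    resp = requests.get(
        SOLR_SELECT_URL,
        params=params,
        # Client-side timeout slightly above the server budget so Solr has a
        # chance to return its partial-results response first.
        timeout=QUERY_BUDGET_MS / 1000 + 1,
    )
    resp.raise_for_status()
    data = resp.json()
    # Solr flags partialResults in the response header when timeAllowed was hit.
    if data.get("responseHeader", {}).get("partialResults"):
        print("warning: Solr returned partial results")
    return data
```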

After we set the timeAllowed parameter, we began to encounter clearer examples of queries failing and investigated patterns within Sentry. We noticed a prominent trend of very expensive, unhelpful, one-character or stop-word-like queries such as “*”, “a”, or “the”. By looking at the full request and URL parameters in our nginx logs, we discovered that the autocomplete search bar was likely responsible for submitting many of these requests as patrons typed out the beginning of their search.

To fix this, we patched our autocomplete to require at least 3 characters (and to reject bare stop words like “the”), and we are also building backend directives for Solr to pre-validate queries so these cases are never processed.
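The exact validation rules live in the Open Library codebase; the sketch below simply illustrates the kind of pre-check described above, with an illustrative (not our actual) stop-word list.

```python
"""Sketch of pre-validating autocomplete queries before they ever reach Solr."""
MIN_QUERY_LENGTH = 3
STOP_WORDS = {"the", "a", "an", "and", "of", "*"}  # illustrative subset


def is_worth_querying(q: str) -> bool:
    q = q.strip().strip('"')
    if len(q) < MIN_QUERY_LENGTH:
        return False  # too short to autocomplete usefully
    if q.lower() in STOP_WORDS:
        return False  # bare stop words match nearly everything, expensively
    return True


assert not is_worth_querying("th")
assert not is_worth_querying("the")
assert is_worth_querying("thelonious monk")
```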

Conclusion

Sometimes, you just need more RAM. Sometimes, it’s really important to understand how complex systems work and how heap space needs to be tuned. More than anything, having visibility and monitoring tools has been critical to learning which opportunities to pursue in order to use our time effectively. Always, having talented, dedicated engineers like Drini Cami, and the support of Scott Barnes, Jim Champ, and many other contributors, is the reason Open Library is able to keep running day after day. I’m proud to work with all of you and grateful for all the search features and performance improvements we’ve been able to deliver to Open Library’s patrons in 2025.

Bringing Sidewalk Libraries Online

by Roni Bhakta & Mek

All around the world, sidewalk libraries have been popping up and improving people’s lives, grounded in our basic right to pass along the books we own: take a book, leave a book.

As publishers transition from physical books to ebooks, they are rewriting the rules to strip away the ownership rights that make libraries possible. Instead of selling physical books that can be preserved, publishers are forcing libraries to rent ebooks on locked platforms with restrictive licenses. What is a library that doesn’t own books? And it’s not just libraries losing this right — it’s us too.

⚠️ Did you know: When a patron borrows a book from their library using platforms like Libby, the library typically pays each year to rent the ebook. When individuals purchase ebooks on Amazon/Kindle, they don’t own the book — they are agreeing to a “perpetual” lease that can’t be resold or transferred and might disappear at any moment. In 2019, Microsoft Books shut down and customers lost access to their books.

This year, Roni Bhakta, from Maharashtra, India, joined Mek from the Internet Archive’s Open Library team for Google Summer of Code 2025 to ask: how can the idea of a sidewalk library exist on the Internet?

Our response is a new open-source, free, plug-and-play “Labs” prototype called Lenny that lets anyone, anywhere – libraries, archives, individuals – set up their own digital lending library online to lend the digital books they own. You may view Roni’s initial proposal for Google Summer of Code here. To make a concept like Lenny viable, we’re eagerly following the progress of publishers like Maria Bustillos’s BRIET, which are creating a new market of ebooks, “for libraries, for keeps“.

Design Goals

Lenny is designed to be:

  • Self-hostable. Anyone can host a Lenny node with minimal compute resources.
  • Easy to install. A single https://lennyforlibraries.org/install.sh install script uses Docker so Lenny works right out of the box.
  • Preloaded with books. Lenny comes preloaded with more than 500 open-access books.
  • Compatible with dozens of existing apps. Each Lenny uses the OPDS standard to publish its collection, so any compatible reading app (Open Library, Internet Archive, Moon Reader, and others) can be used to browse its books.

Features

Lenny comes integrated with:

  • A seamless reading experience. An onboard Thorium Web EPUB reader lets patrons read digital books instantly from their desktop or mobile browsers.
  • A secure, configurable lending system. All the basic options and best practices a library or individual may need to make the digital books they own borrowable with protections.
  • A marketplace. Lenny is designing a connection to an experimental marketplace so libraries and individuals can easily buy and add new digital books to their collections.

Learn More

Lenny is an early-stage prototype and there’s still much work to be done to bring the idea to life. At the same time, we’ve made great progress towards a working prototype and are proud of what Roni has achieved this year through Google Summer of Code 2025.

We invite you to visit https://lennyforlibraries.org to learn more about how Lenny works and how you can try an early prototype on your personal computer.

What’s Trending on Open Library?

A major update to the Open Library search engine now makes it easy for patrons to find books that are receiving spikes of interest.

You may be familiar with the trending books featured on Open Library’s home page. Actually, you might be very familiar with them, because many seldom change! Our previous trending algorithm approximated popularity by tracking how often patrons clicked that they wanted to read a book. While this approach is great for showcasing popular books, the results often remain the same for weeks at a time.

The new trending algorithm, developed by Benjamin Deitch (Open Library volunteer and Engineering Fellow) and Drini Cami (Open Library staff and senior software engineer), uses hour-by-hour statistics to give patrons fresh, timely, high-interest books that are gaining traction now on Open Library.

This improved algorithm now powers much of the Open Library homepage and is fully integrated into Open Library’s search engine, meaning: 

  • A patron can sort any search on Open Library by trending scores. Check out what’s trending today in Sci-fi, Romance, and Short-reads in French.
  • A more diverse selection of books should be displayed within the carousels on the homepage, the library explorer, and on subject pages.
  • Librarians can leverage sort-by-trending to discover which high-traffic book records may be impactful to prioritize.

Sorting by Trending

From the search results page, a patron may change the “Relevance” sort dropdown to “Trending” to sort results by the new z-score trending algorithm:

The Algorithm

Open Library’s trending algorithm works by computing a z-score for each book, which compares (a) the book’s “activity score” over the last 24 hours with (b) its total “activity score” over the last 7 days.

Activity scores are computed for a given time interval by adding the book’s total human page views (how often the book page is visited) to an amplified count of its reading log events (e.g. when a patron marks a book as “Want to Read”). Here, amplified means that a single reading log event has a higher impact on the activity score than a single page view.
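As a simplified sketch (the exact weights, guards, and time windows in the production implementation may differ), the calculation looks roughly like this:

```python
"""Simplified sketch of an activity-score and z-score trending calculation.
The weight and the zero-variance guard are illustrative assumptions."""
from statistics import mean, pstdev

READING_LOG_WEIGHT = 10  # illustrative: log events count more than page views


def activity_score(page_views: int, reading_log_events: int) -> float:
    # Reading-log events (e.g. "Want to Read") are amplified relative to views.
    return page_views + READING_LOG_WEIGHT * reading_log_events


def trending_z_score(last_24h_score: float, daily_scores: list[float]) -> float:
    # How unusual is today's activity compared to this book's own past week?
    mu = mean(daily_scores)
    sigma = pstdev(daily_scores) or 1.0  # avoid dividing by zero for a flat history
    return (last_24h_score - mu) / sigma


# A book that normally earns ~5 points a day but earned 40 today trends strongly:
print(trending_z_score(40, [5, 6, 4, 5, 7, 5, 6]))
```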

All of the intermediary data used to compose the z-score is stored and accessible from our search engine in the following ways:

For Developers

While the trending_z_score is the ultimate value used to determine a book’s trending score on Open Library, developers may also query the search engine directly to access many of the intermediary, raw values used to compute this score.

For instance, we’ve been experimenting with the trending_score_daily_[0-6] fields and the trending_score_hourly_sum field to create useful ways of visualizing trending data as a chart over time:

The search engine may be queried and filtered by the following fields (a combined example follows these lists):

  • trending_score_hourly_sum – Find books with the highest accumulative hourly score for today, as opposed to the computed weekly trending score.
  • trending_score_daily_0 through trending_score_daily_6 – Find books with a certain total score on a previous day of the week.
  • trending_z_score:{0 TO *] – Find books with a trending score strictly greater than 0. Note that this is using Lucene syntax, so squiggly brackets {} can be used for an exclusive range, and square brackets [] for an inclusive range.

The results of these queries may be sorted by:

  • trending – View books ordered by the greatest change in activity score growth first. This uses the trending_z_score.
  • trending_score_hourly_sum – View books with the highest activity score in the last 24 hours.
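For example, here is one way to combine a trending filter, the trending sort, and the raw daily fields in a single request to the public search API (field and parameter names as described above; the subject filter is just an illustration):

```python
"""Example: querying search.json for trending science fiction books."""
import requests

resp = requests.get(
    "https://openlibrary.org/search.json",
    params={
        # Lucene range syntax: { } is exclusive, [ ] is inclusive.
        "q": 'subject:"science fiction" AND trending_z_score:{0 TO *]',
        "sort": "trending",
        "fields": ",".join(
            ["key", "title", "trending_z_score", "trending_score_hourly_sum"]
            + [f"trending_score_daily_{i}" for i in range(7)]
        ),
        "limit": 5,
    },
    timeout=10,
)
resp.raise_for_status()
for doc in resp.json().get("docs", []):
    print(doc.get("title"), doc.get("trending_z_score"))
```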

We’d love your feedback on the new Trending Algorithm. Share your thoughts with us on bluesky.

API Search.json Performance Tuning

This is a technical post regarding a breaking change, scheduled to be deployed on January 21st, 2025, for developers whose applications depend on the /search.json endpoint.

Description: This change reduces the default fields returned by /search.json to a more restrictive and performant set that we believe will meet most clients’ metadata needs and result in faster, higher quality service for the entire community.

Change: Developers are strongly encouraged to follow our documentation and set the fields parameter on their requests to the specific fields their application requires, e.g.:

https://openlibrary.org/search.json?q=sherlock%20holmes&fields=key,title,author_key,author_name,cover_i

Those relying on the previous behavior can still access the endpoint’s previous, full behavior by setting fields=* to return every field.
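For instance, the same request made from Python might look like the sketch below, requesting only the fields the application actually uses (or passing fields=* to restore the previous full payload):

```python
"""Requesting only the needed fields from /search.json (or "*" for everything)."""
import requests

resp = requests.get(
    "https://openlibrary.org/search.json",
    params={
        "q": "sherlock holmes",
        "fields": "key,title,author_key,author_name,cover_i",
        # "fields": "*",  # opt back in to every field (larger, slower responses)
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["numFound"])
```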

Reasoning: Our performance monitoring at Open Library has shown a high number of 500 responses related to search engine (Solr) performance. During our investigation, we found that some endpoints, like search.json, return up to 500KB of payload and often include fields with large lists of data that are not frequently used by many clients. For more details, you can refer to the pull request implementing this change: https://github.com/internetarchive/openlibrary/pull/10350

As always, if you have questions or comments, please message us on x/twitter @openlibrary, bluesky, open an issue on github, or contact mek@archive.org.

Warmly,

The Open Library Maintainers

Improving Search, Removing Dead-Ends

Thanks to the work of 2024 Design & Engineering Fellow Meredith White, the Open Library search page now suggests Search Inside results any time a search fails to find books matching by title or author.

Before:

After:

The planning and development of this feature were led by volunteer and 2024 Design & Engineering Fellow Meredith White, who did a fantastic job bringing the idea to fruition.

Meredith writes: Sooner or later, a patron will take a turn that deviates from what a system expects. When this happens, the system might show a patron a dead-end, something like: ‘0 results found’. A good UX though, will recognize the dead-end and present the patron with options to carry on their search. The search upgrade was built with this goal in mind: help patrons seamlessly course correct past disruptive dead-ends.

Many patrons have likely experienced a case where they’ve typed in a specific search term and been shown the dreaded “0 results found” message. If the system doesn’t provide any next steps to the patron, like a “did you mean […]?” link, then this is a dead-end. When patrons are shown dead-ends, they have the full burden of figuring out what went wrong with their search and what to do next. Is the item the patron is searching for not in the system? Is the wrong search type selected (e.g. are they searching in titles rather than authors)? Is there a typo in their search query? Predicting a patron’s intent and how they veered off course can be challenging, as each case may require a different solution. In order to develop solutions that are grounded in user feedback, it’s important to talk with patrons.

In the case of Open Library, interviewing learners and educators revealed many patrons were unaware that the platform has search inside capabilities.

“Several interviewees were unaware of Open Library’s [existing] full-text search, read aloud, or note-taking capabilities, yet expressed interest in these features.”

https://blog.openlibrary.org/2024/06/16/listening-to-learners-and-educators/

Several patrons were also unaware that there’s a way to switch search modes from the default “All” to e.g. “Authors” or “Subjects”. Furthermore, several patrons expected the search box to be type-agnostic.

From our conversations with patrons and reviewing analytics, we learned many dead-end searches were the result of patrons trying to type Search Inside queries into the default search, which primarily considers titles and authors. What does this experience look like for a patron? An Open Library patron might type a book quote into the default book search box, such as: “All grown-ups were once children… but only a few of them remember it“. Unbeknownst to them, the system only searches for matching book titles and authors and, as it finds no matches, the patron’s search returns an anticlimactic ‘No results found’ message. In red. A dead-end.

As a Comparative Literature major who spent a disproportionate amount of time of my undergrad flipping through book pages while muttering, “where oh where did I read that quote?”, I know I would’ve certainly benefitted from the Search Inside feature, had I known it existed. With a little brainstorming, we knew the default search experience could be improved to show more relevant results for dead-end queries. The idea that emerged is: display Search Inside results as a “did you mean?” type suggestion when a search returns 0 matches. This approach would help reduce dead-ends and increase discoverability of the Search Inside feature. Thus the “Search Inside Suggestion Card” was born.

The design process started out as a series of Figma drawings:

Discussions with the design team helped narrow in on a prototype that would provide the patron with enough links and information to send them on their way to the Search Inside results page, a book page or the text reader itself, with occurrences of the user’s search highlighted. At the same time, the card had to be compact and easy to digest at a glance, so design efforts were made to make the quote stand out first and foremost.

After several revisions, the prototype evolved into this design:

Early Results

The Search Inside Suggestion card went live on August 21st, and thanks to link tracking that I rigged up on all the clickable elements of the card, we were able to observe its effectiveness. Some findings:

  • In the first day, 2k people landed on the Search Inside Suggestion card when previously they would have seen nothing. That’s 2,000 dead-end evasion attempts!
  • Of these 2,000 users, 60% clicked on the card to be led to Search Inside results.
  • 40% clicked on one of the suggested books with a matching quote.
  • ~8% clicked on the quote itself to be brought directly into the text.

I would’ve thought more people would click the quote itself but alas, there are only so many Comparative Literature majors in this world.

Follow-up and Next Steps

To complement the efforts of the Search Inside Suggestion card’s redirect to the Search Inside results page, I worked on re-designing the Search Inside results cards. My goal for the redesign was to make the card more compact and match its styling as closely as possible to the Search Inside Suggestion card to create a consistent UI.

Before:

After:

The next step for the Search Inside Suggestion card is to explore weaving it into the search results, regardless of result count. The card will offer an alternate search path in a list of potentially repetitive results. Say you searched ‘to be or not to be’ and there happen to be several books with a matching title. Rather than scrolling through these potentially irrelevant results, the search result card can intervene to anticipate that perhaps it’s a quote inside a text that you’re searching for. With the Search Inside Suggestion card taking the place of a dead-end, I’m proud to report that a search for “All grown-ups were once children…” will now lead Open Library patrons to Antoine de Saint-Exupéry’s The Little Prince, page 174!

Technical Implementation

For the software engineers in the room who want a peek behind the curtain, working on the “Search Inside Suggestion Card” project was a great opportunity to learn how to asynchronously lazy-load “parts” of webpages, using an approach called partials. Because Search Inside results can take a while to generate, we decided to lazy load the Search Inside Suggestion Card only after the regular search had completed.

If you’ve never heard of a partial, well, I hadn’t either. Rather than waiting to fetch all the Search Inside matches to the user’s search before the user sees anything, a ‘No books directly matched your search’ message and a loading bar appear immediately. The loading bar indicates that Search Inside results are being checked, which is UX speak for “this partial HTML template chunk is loading.”

So how does a partial load? There are a few key players:

  1. The template (html file) – this is the page that initially renders with the ‘No books directly matched your search’ message. It has a placeholder div for where the partial will be inserted.
  2. The partial (html file) – this is the Search Inside Suggestion Card
  3. The Javascript logic – this is the logic that says, “get that placeholder div from the template and attach it to an initialization function and call that function”
  4. More Javascript logic – this logic says, “ok, show that loading indicator while I make a call to the partials endpoint”
  5. A Python class – this is where the partials endpoint lives. When it’s called, it calls a model to send a fulltext search query to the database. This is where the user’s wrong turn is at last “corrected”: their initial search in the Books tab that found no matching titles is now redirected to perform a Search Inside tab search to find matching quotes. (A minimal sketch of this step follows the list.)
  6. The data returned from the Python class is sent back up the line and the data-infused partial is inserted in the template from step 1. Ta-da!
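Open Library's real partials endpoint lives in its own codebase and web framework, but for a rough sense of the shape of step 5, here is a minimal sketch using Flask purely for illustration; the route, the template fragment, and the fulltext_search helper are all hypothetical.

```python
"""Illustrative sketch only: a partials endpoint that returns an HTML fragment."""
from flask import Flask, render_template_string, request

app = Flask(__name__)

CARD_FRAGMENT = """
<div class="search-inside-card">
  <p>Found inside the text for "{{ query }}":</p>
  <ul>{% for m in matches %}<li>{{ m.title }}: "{{ m.snippet }}"</li>{% endfor %}</ul>
</div>
"""


def fulltext_search(q: str, limit: int = 3) -> list[dict]:
    # Hypothetical stand-in for the real Search Inside backend query (step 5).
    return []


@app.route("/partials/search_inside_card")
def search_inside_card():
    q = request.args.get("q", "")
    matches = fulltext_search(q)
    # Return only the fragment; client-side JS inserts it into the placeholder
    # div and hides the loading bar (steps 4 and 6 above).
    return render_template_string(CARD_FRAGMENT, query=q, matches=matches)
```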

About the Open Library Fellowship Program

The Internet Archive’s Open Library Fellowship is a flexible, self-designed independent study which pairs volunteers with mentors to lead development of a high impact feature for OpenLibrary.org. Most fellowship programs last one to two months and are flexible, according to the preferences of contributors and availability of mentors. We typically choose fellows based on their exemplary and active participation, conduct, and performance within the Open Library community. The Open Library staff typically only accepts 1 or 2 fellows at a time to ensure participants receive plenty of support and mentor time. Occasionally, funding for fellowships is made possible through Google Summer of Code or Internet Archive Summer of Code & Design. If you’re interested in contributing as an Open Library Fellow and receiving mentorship, you can apply using this form or email openlibrary@archive.org for more information.