
Open Library Search: Balancing High-Impact with High-Demand

The Open Library is a card catalog of every book ever published, spanning more than 50 million edition records. Fetching all of these records at once is computationally expensive, so we use Apache Solr to power our search engine. This system is primarily maintained by Drini Cami and has enjoyed support from myself, Scott Barnes, Ben Deitch, and Jim Champ.

Our search engine is responsible for rapidly generating results when patrons use the autocomplete search box, when apps make book data requests through our programmatic Search API, when we load data to render book carousels, and much more.

A key challenge of maintaining such a search engine is keeping its schema manageable so it is both compact and efficient yet also versatile enough to serve the diverse needs of millions of registered patrons. Small decisions, like whether a certain field should be made sortable, can — at scale — make or break the system’s ability to keep up with requests.

This year, the Open Library team was committed to releasing several ambitious search improvements, during a time when the search engine was already struggling to meet the existing load:

  • Edition-powered Carousels that go beyond the general work to show you the most relevant, specific, available edition in your desired language.
  • Trending algorithms that showcase what books are having sudden upticks, as opposed to what is consistently popular over stretches of time.
  • 10K Reading Levels to make the K-12 Student Library more relevant and useful.

Rather than tout a success story (we’re still in the thick of figuring out performance day-by-day), our goal is to pay it forward, document our journey, and give others reference points and ideas for how to maintain, tune, and advance a large production search system with a small team. The vibe is “keep your head above water”.

An AI-generated image of someone holding a book above water

Starting in the Red

Toward the third quarter of last year, the Internet Archive and the Open Library were the targets of a large-scale, coordinated DDoS attack. The result was significant excess load on our search engine and material changes in how we secured and accessed our networks. During this time, the entire Solr re-indexing process (i.e. the technical process for rebuilding a fresh search engine from the latest data dumps) was left in a broken state.

In this pressurized state, our first action was to tune Solr’s heap. We had allocated 10GB of RAM to the Solr instance, but the JVM heap was also allowed to use 10GB, leaving no headroom and resulting in memory exhaustion. When Scott lowered the heap to 8GB, we encountered fewer heap errors. This was compounded by the fact that previously, we dealt with long spikes of 503s by restarting Solr, causing a thundering herd problem where the server would restart only to be overwhelmed by heap errors again.

With 8GB of heap, our memory utilization gradually rose until we were using about 95% of memory, and without further tuning and monitoring, we had few options other than to increase the RAM available to the host. Fortunately, we were able to grow from ~16GB to ~24GB. We typically operate within 10GB and are fairly CPU-bound, with a load average of around 8 across 8 CPUs.

We then fixed our Solr re-indexing flow, enabling us to more regularly “defragment” — i.e. run optimize on Solr. In rare cases, we’ve been able to split traffic between our prod-solr and staged-solr to withstand large spikes in traffic, though typically we’re operating from a single Solr instance.
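For reference, a defragmentation pass like this can be triggered over Solr’s HTTP update handler. Below is a minimal sketch in Python; the host, port, and core name are placeholders rather than our actual configuration.

```python
# Minimal sketch: trigger a Solr "optimize" (segment merge / defragmentation)
# over HTTP. The URL and core name are placeholders, not Open Library's setup.
import requests

SOLR_URL = "http://localhost:8983/solr/openlibrary"  # placeholder core name

def optimize_solr(max_segments: int = 1) -> None:
    """Ask Solr to merge its index down to `max_segments` segments."""
    resp = requests.get(
        f"{SOLR_URL}/update",
        params={"optimize": "true", "maxSegments": max_segments, "wt": "json"},
        timeout=None,  # optimizing a large index can take a very long time
    )
    resp.raise_for_status()
    print(resp.json())

if __name__ == "__main__":
    optimize_solr()
```

Because optimize rewrites large portions of the index, we only run it during quieter periods as part of the re-indexing flow.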

Even with more memory, there are only so many long, expensive requests Solr can queue before getting overwhelmed. Outside of looking at the raw Solr logs, our visibility into what was happening across Solr was still limited, so we put our heads together to identify obvious cases where the website makes expensive calls wastefully. Jim Champ helped us implement book carousels that load asynchronously and only when scrolled into view. He also switched the search page to asynchronously load the search facets sidebar. This was especially helpful because previously, trying to render expensive search facets would cause the entire search results page to fail, rather than only the facets side menu.

Sentry on the Hill

After several tiers of low-hanging fruit were plucked, we reached for more specific tools and added monitoring. First, we added Sentry profiling, which gave us much more clarity about which queries were expensive and how often Solr errors were occurring.

Sentry allows us to see a panoramic view of our search performance.

Sentry also gives us the ability to drill in and explore specific errors and their frequencies.

With profiling, we can even explore individual function calls to learn where the process is spending the most time.
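For anyone setting up something similar, enabling tracing and profiling with the Sentry Python SDK takes only a few lines. This is a minimal sketch with a placeholder DSN and sample rates, not our exact settings:

```python
# Minimal sketch: turn on Sentry performance tracing and profiling in a Python
# web app. DSN and sample rates below are placeholders for illustration only.
import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    traces_sample_rate=0.1,    # capture 10% of transactions for performance data
    profiles_sample_rate=0.1,  # profile 10% of the sampled transactions
)
```

Sampling a fraction of transactions keeps the profiling overhead small while still surfacing which endpoints and function calls dominate request time.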

Docker Monitoring & Grafana

To further increase our visibility, Drini developed a new type of monitoring Docker container that can be deployed agnostically to each of our VMs and uses environment variables so that only the relevant jobs run on each host. This approach has allowed us to centrally configure recipes so each host collects the data it needs and uploads it to our central dashboards in Grafana.
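The idea, roughly, is that the same container image runs everywhere and each host’s environment variables select which collection jobs it performs. The sketch below is only illustrative; the variable name and job functions are hypothetical, not our actual recipes.

```python
# Hedged sketch of env-driven monitoring job selection. The MONITORING_JOBS
# variable and the job functions are hypothetical, for illustration only.
import os

def collect_solr_metrics():
    ...  # e.g. scrape Solr stats and push them to the metrics store

def collect_nginx_metrics():
    ...  # e.g. summarize nginx access logs for the dashboards

JOBS = {
    "solr": collect_solr_metrics,
    "nginx": collect_nginx_metrics,
}

def main():
    # e.g. MONITORING_JOBS="solr,nginx" set per host in its deploy config
    enabled = os.environ.get("MONITORING_JOBS", "").split(",")
    for name in enabled:
        job = JOBS.get(name.strip())
        if job:
            job()

if __name__ == "__main__":
    main()
```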

Recently, we added labels to all of our Solr calls so we can view exactly how many requests are being made of each query type and what their performance characteristics are.
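One plausible way to label queries (not necessarily our exact mechanism) is to wrap the Solr call, attach a label, and record timings that a metrics job can aggregate per query type. Here is a hedged Python sketch with a placeholder endpoint and parameter name:

```python
# Hedged sketch: tag each Solr query with a label so logs/metrics can be grouped
# by query type. The endpoint and the extra "ol.label" parameter are illustrative.
import time
import requests

SOLR_SELECT = "http://localhost:8983/solr/openlibrary/select"  # placeholder

def solr_search(params: dict, label: str) -> dict:
    start = time.monotonic()
    # Solr ignores parameters it doesn't recognize, so the label rides along
    # in the request and shows up in the access logs for later aggregation.
    resp = requests.get(SOLR_SELECT, params={**params, "ol.label": label}, timeout=10)
    elapsed = time.monotonic() - start
    print(f"solr query label={label} took={elapsed:.3f}s status={resp.status_code}")
    return resp.json()

# Example: solr_search({"q": "harry potter", "wt": "json"}, label="search-page")
```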

At a top level, we can see in blue how much total traffic we’re getting to Solr, and the green colors (darker is better) let us know how many requests are being served quickly.

We can then drill in and explore each Solr query by type, identifying which endpoints are causing the greatest strain and giving us a way to analyze nginx web traffic further in case the load is the result of a DDoS.

Until recently, we were only able to see how hard each of our main Open Library web application workers was working at any given time. Spikes of pink or purple were when Open Library was waiting for requests to finish from Archive.org. Yellow patches — until recently — were classified as “other”, meaning we didn’t know exactly what was going on (even though Sentry profiling and flame graphs gave us strong clues that Solr was the culprit). By using py-spy with our new Docker monitoring setup, we were able to add Solr profiling to our worker graphs in Grafana and visualize the complete story clearly:

Once we turned on this new monitoring flow, it was clear that these large sections of yellow, where workers were inundated with “unknown” work, were in large part (~50%) Solr.

With Great Knowledge…

Each graph helped us further direct and focus our efforts. Once we knew Open Library was being slowed down primarily by Solr, we began investigating requests, and Drini noticed many Solr requests were living on for more than 10 seconds, even though the Open Library app had been instructed to abandon any Solr query that takes more than 10 seconds. It turns out that even in these cases, Solr may continue to process the query in the background (so it can finish and cache the result for the future). This “feature” was resulting in Solr’s free connections becoming exhausted and a long haproxy queue. Drini modified our Solr queries to include a timeAllowed parameter to match Open Library’s contract to quit after 10 seconds, and almost immediately the service showed signs of recovery:
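For illustration, here is a minimal sketch of keeping the server-side time budget in sync with the client-side timeout; the endpoint and core name are placeholders, and 10 seconds mirrors the limit described above.

```python
# Minimal sketch: align Solr's timeAllowed (server-side budget, in milliseconds)
# with the client-side timeout so abandoned queries stop consuming Solr resources.
# The URL and core name are placeholders, not Open Library's configuration.
import requests

SOLR_SELECT = "http://localhost:8983/solr/openlibrary/select"  # placeholder
TIMEOUT_SECONDS = 10

def solr_select(query: str) -> dict:
    resp = requests.get(
        SOLR_SELECT,
        params={
            "q": query,
            "wt": "json",
            # Solr stops working on the query after this budget and may return
            # partial results instead of grinding on in the background.
            "timeAllowed": TIMEOUT_SECONDS * 1000,
        },
        timeout=TIMEOUT_SECONDS,  # the client gives up after the same 10 seconds
    )
    resp.raise_for_status()
    return resp.json()
```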

After we set the timeAllowed parameter, we began to encounter clearer examples of queries failing and investigated patterns within Sentry. We noticed a prominent trend of very expensive, unhelpful, one-character or stop-word-like queries such as “*”, “a”, or “the”. By looking at the full request and URL parameters in our nginx logs, we discovered that the autocomplete search bar was likely responsible for submitting lots of these low-value requests as patrons typed out the beginning of their search.

To fix this, we patched our autocomplete to require at least 3 characters (and to reject, e.g., the word “the”), and we are also building backend directives into Solr to pre-validate queries so these cases are never processed.
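As a sketch of the kind of pre-validation involved (the stop-word list and helper name here are illustrative, not our production code):

```python
# Hedged sketch of pre-validating autocomplete queries before they reach Solr.
STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative subset

def is_worth_searching(q: str) -> bool:
    """Return False for queries unlikely to produce useful autocomplete results."""
    q = q.strip()
    if len(q) < 3:                 # require at least 3 characters
        return False
    if q in {"*", "*:*"}:          # bare wildcards match everything
        return False
    if q.lower() in STOP_WORDS:    # single stop words are not useful
        return False
    return True

assert not is_worth_searching("th")
assert not is_worth_searching("the")
assert is_worth_searching("the hobbit")
```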

Conclusion

Sometimes, you just need more RAM. Sometimes, it’s really important to understand how complex systems work and how heap space needs to be tuned. More than anything, having visibility and monitoring tools has been critical to learning which opportunities to pursue in order to use our time effectively. Always, having talented, dedicated engineers like Drini Cami, and the support of Scott Barnes, Jim Champ, and many other contributors, is the reason Open Library is able to keep running day after day. I’m proud to work with all of you and grateful for all the search features and performance improvements we’ve been able to deliver to Open Library’s patrons in 2025.

Connecting K-12 Students With Books by Reading Level

We recently improved the relevance of our Student Library collection by adding more than 10,000 reading levels to borrowable K-12 books in our search engine.

Screenshot of the Pre-K & Kindergarten and Grade 1 reading levels on Open Library.

What are Reading Levels?

In the same way a library-goer may be interested in finding a book on a specific topic, like vampires, they may also be interested in narrowing their search for books that are within their reading level. The readability of a book may be scored in numerous ways, such as analyzing sentence lengths or complexity, or the percentage of words that are unique or difficult, to name a few.
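As a toy illustration only (and emphatically not Lexile, which is a proprietary measure), here is how two of those simple signals might be computed:

```python
# Toy illustration of readability signals: average sentence length and the
# ratio of unique words. This is NOT the Lexile formula, which is proprietary.
import re

def simple_readability_signals(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "unique_word_ratio": len(set(words)) / max(len(words), 1),
    }

print(simple_readability_signals("The cat sat. The cat sat on the mat."))
```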

For our purposes, we chose to incorporate Lexile scores—a proprietary system developed by MetaMetrics®. Lexile scores are widely used within school systems, and the scoring system is reliable, accessible, and well documented.

While the goal of our initiative was to add reading levels specifically for borrowable K-12 books within the Open Library catalog, Lexile also offers a fantastic Find a Book hub where teachers, parents, and students may search more broadly for books by Lexile score. We’re grateful that Lexile features “Find in Library” links to the Open Library so readers can check nearby libraries for the books they love!

Screenshot of Lexile's Find a Book hub.

Before Reading Levels

Before Open Library had reading level scores, the system used subject tags to identify books according to grade level. Many of these categories were noisy, inaccurate, and had high overlap, making it difficult to find relevant books. Furthermore, with grade level bucketing, there was no intuitive way to search for books across a range of reading levels.

Searching for Books by Reading Level

Now, Lexile score ranges like “lexile:[500 TO 900]” can be used flexibly in search queries to find the exact books that are right for a reader, without results being limited to rigid grade-level buckets. For example: https://openlibrary.org/search?q=lexile%3A%5B700+TO+900%5D
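If you’d like to do this programmatically, the same query works against our public Search API. A small Python example is below, assuming the requests library; the fields and limit parameters simply trim the response.

```python
# Query the public Open Library Search API for books in a Lexile range.
import requests

resp = requests.get(
    "https://openlibrary.org/search.json",
    params={
        "q": "lexile:[700 TO 900]",  # same range syntax as the search box
        "fields": "key,title",       # only return the fields we need
        "limit": 5,
    },
    timeout=30,
)
resp.raise_for_status()
for doc in resp.json().get("docs", []):
    print(doc.get("key"), "-", doc.get("title"))
```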

Putting It All Together

By using these Lexile ranges, we’ve been able to develop a more coherent and expansive K-12 portal experience with fewer duplicate books across grade levels.

We expect this improvement will make it easier for K-12 students and teachers to find appropriate books to satisfy their reading and learning goals. You can explore the newly improved K-12 student collection at http://openlibrary.org/k-12.

What Do You Think?

Is there something you miss from the previous K-12 page? Is the new organization more useful to you? Share your thoughts on bluesky.

Credits

This reading level import initiative was led by Mek, the Open Library program lead, with assistance from Drini Cami, Open Library senior software engineer. The project received support from Jordan Frederick, an Open Library intern who shadowed this project as part of her Master’s in Library and Information Science (MLIS) degree.

Loading the text of 2 million books into solr

Here at the Internet Archive we’ve scanned millions of books. One of our challenges is helping people find books they want to read. The bibliographic records can be searched in Open Library. If somebody knows the title they can find the book. It would be useful if all the text inside the book was also searchable. This is a problem I’ve been working on.

I started by taking the output in XML from our OCR and tidying up the results. The OCR program only considers individual pages, so paragraphs that span a page were split up. A phrase search that crosses the page boundary wouldn’t match. My code detects paragraphs broken across pages or columns and recombines them. It also deals intelligently with hyphenated words by joining them back together.
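As a simplified sketch of that kind of clean-up (real OCR output is XML with layout information, so plain strings are used here purely for illustration):

```python
# Simplified sketch: rejoin a word hyphenated across a page break and merge a
# paragraph that was split across pages. Real input is OCR XML, not strings.
import re

def join_pages(prev_page: str, next_page: str) -> str:
    prev_page = prev_page.rstrip()
    next_page = next_page.lstrip()
    if prev_page.endswith("-"):
        # "... continu-" + "ation ..." -> "... continuation ..."
        return prev_page[:-1] + next_page
    if prev_page and not re.search(r"[.!?]['\"\)]?$", prev_page):
        # Previous page doesn't end a sentence: treat it as a broken paragraph.
        return prev_page + " " + next_page
    return prev_page + "\n\n" + next_page

print(join_pages("The story continu-", "ation spans two pages."))
```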

The process of fixing up the OCR involves parsing XML, which is slower than just reading a text file. Fortunately, the books are stored on a few hundred computers in our petabox storage cluster, so I’m able to run the XML parsing and OCR tidy stage on those computers in parallel.

Next I take this data and feed it into solr. My solr schema is pretty simple: the fields are just the Internet Archive identifier and the text of the book in a field called body. When we show a search-inside results page, we can pull the book title from the Open Library database. It might be quicker to generate result pages if all the data needed to display them were in the search index, but that would mean updating the index whenever one of those pieces of information changes. It is much simpler to only update the search index when the output from OCR changes.
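For illustration, indexing documents with this two-field schema over HTTP might look like the following; the core name and the exact field name for the identifier are placeholders, and real indexing happens in bulk across the cluster rather than one request at a time.

```python
# Sketch: post documents with an identifier and a "body" field to Solr's JSON
# update handler. The core name and identifier field name are placeholders.
import requests

SOLR_UPDATE = "http://localhost:8983/solr/fulltext/update"  # placeholder core

docs = [
    {"identifier": "exampleitem00arch", "body": "Full OCR text of the book..."},
]

resp = requests.post(
    SOLR_UPDATE,
    params={"commit": "true"},
    json=docs,  # Solr's /update handler accepts a JSON array of documents
    timeout=60,
)
resp.raise_for_status()
```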
