Category Archives: Search

Connecting K-12 Students With Books by Reading Level

We recently improved the relevance of our Student Library collection by adding more than 10,000 reading levels to borrowable K-12 books in our search engine.

Screenshot of the Pre-K & Kindergarten and Grade 1 reading levels on Open Library.

What are Reading Levels?

In the same way a library-goer may be interested in finding a book on a specific topic, like vampires, they may also be interested in narrowing their search to books that are within their reading level. The readability of a book may be scored in numerous ways, for example by analyzing sentence length or complexity, or the percentage of words that are unique or difficult.
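To make the idea of scoring readability concrete, here is a minimal sketch that computes two of the simple signals mentioned above: average sentence length and the share of words that are unique. This is only an illustration. It is not the Lexile formula, which is proprietary, and the function name `readability_features` and its tokenizing rules are assumptions made for this example.

```python
import re


def readability_features(text: str) -> dict[str, float]:
    """Return two rough readability signals for a passage of English text.

    Illustrative only: real readability systems use far more careful
    tokenization and calibrated formulas.
    """
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Treat runs of letters and apostrophes as words, case-insensitively.
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence")
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "unique_word_ratio": len(set(words)) / len(words),
    }


features = readability_features("The cat sat. The cat ran away.")
print(features["avg_sentence_length"])  # 3.5 (7 words over 2 sentences)
```

Longer sentences and a higher share of unique words usually suggest harder text, but these raw numbers are not comparable to Lexile scores.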

For our purposes, we chose to incorporate Lexile scores, a proprietary system developed by MetaMetrics®. Lexile scores are widely used within school systems and have a reliable scoring system that is accessible and well documented.

While the goal of our initiative was to add reading levels specifically for borrowable K-12 books within the Open Library catalog, Lexile also offers a fantastic Find a Book hub where teachers, parents, and students may search more broadly for books by Lexile score. We’re grateful that Lexile features “Find in Library” links to the Open Library so readers can check nearby libraries for the books they love!

Screenshot of Lexile's Find a Book hub.

Before Reading Levels

Before Open Library had reading level scores, the system used subject tags to identify books according to grade level. Many of these categories were noisy, inaccurate, and had high overlap, making it difficult to find relevant books. Furthermore, with grade level bucketing, there was no intuitive way to search for books across a range of reading levels.

Searching for Books by Reading Level

Now, Lexile score ranges like “lexile:[500 TO 900]” can be used flexibly in search queries to find the exact books that are right for a reader, with results being limited by grade levels. For example, this search returns books with Lexile scores from 700 to 900: https://openlibrary.org/search?q=lexile%3A%5B700+TO+900%5D
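If you want to build these search links yourself, the range query only needs to be URL-encoded. The sketch below reproduces the example link above using Python's standard library. The helper name `lexile_search_url` is made up for this example and is not part of any Open Library client.

```python
from urllib.parse import urlencode


def lexile_search_url(low: int, high: int) -> str:
    """Build an Open Library search URL for a Lexile range query."""
    if low > high:
        raise ValueError("low must not exceed high")
    # Range syntax as shown in the post: lexile:[LOW TO HIGH]
    query = f"lexile:[{low} TO {high}]"
    # urlencode percent-encodes ":" "[" "]" and turns spaces into "+".
    return "https://openlibrary.org/search?" + urlencode({"q": query})


print(lexile_search_url(700, 900))
# https://openlibrary.org/search?q=lexile%3A%5B700+TO+900%5D
```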

Putting It All Together

By using these Lexile ranges, we’ve been able to develop a more coherent and expansive K-12 portal experience where there are fewer duplicate books across grade levels.

We expect this improvement will make it easier for K-12 students and teachers to find appropriate books to satisfy their reading and learning goals. You can explore the newly improved K-12 student collection at http://openlibrary.org/k-12.

What Do You Think?

Is there something you miss from the previous K-12 page? Is the new organization more useful to you? Share your thoughts on Bluesky.

Credits

This reading level import initiative was led by Mek, the Open Library program lead, with assistance from Drini Cami, Open Library senior software engineer. The project received support from Jordan Frederick, an Open Library intern who shadowed this project as part of her Master’s in Library and Information Science (MLIS) degree.

Loading the text of 2 million books into Solr

Here at the Internet Archive we’ve scanned millions of books. One of our challenges is helping people find books they want to read. The bibliographic records can be searched in Open Library. If somebody knows the title they can find the book. It would be useful if all the text inside the book was also searchable. This is a problem I’ve been working on.

I started by taking the XML output from our OCR and tidying up the results. The OCR program only considers individual pages, so paragraphs that span a page break were split up, and a phrase search that crosses the page boundary wouldn’t match. My code detects paragraphs broken across pages or columns and recombines them. It also deals intelligently with hyphenated words by joining them back together.
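Here is a rough sketch of the kind of tidying described above. It is not the code I actually run: it works on plain-text paragraphs rather than OCR XML, and the rule it uses (a page's first paragraph continues the previous one when that one does not end in sentence punctuation) is a simplifying assumption. The function names are made up for this example.

```python
import re

# A paragraph looks finished if it ends in . ! or ?, optionally followed
# by closing quotes or brackets. This is a heuristic, not a guarantee.
SENTENCE_END = re.compile(r"[.!?][\"')\]]*$")


def join_hyphenated_lines(text: str) -> str:
    """Rejoin words hyphenated across a line break inside a paragraph.

    Caveat: a genuine compound broken at its hyphen ("well-" / "known")
    also loses its hyphen with this simple rule.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def join_pages(pages: list[list[str]]) -> list[str]:
    """Merge per-page paragraph lists, rejoining paragraphs split at page breaks."""
    merged: list[str] = []
    for page in pages:
        paragraphs = [p.strip() for p in page if p.strip()]
        for index, text in enumerate(paragraphs):
            continues_previous = (
                index == 0
                and bool(merged)
                and SENTENCE_END.search(merged[-1]) is None
            )
            if not continues_previous:
                merged.append(text)
            elif merged[-1].endswith("-"):
                # The word was hyphenated across the page break: drop the hyphen.
                merged[-1] = merged[-1][:-1] + text
            else:
                merged[-1] = merged[-1] + " " + text
    return merged


pages = [
    ["First paragraph.", "A phrase search should match across the page bound-"],
    ["ary without any trouble.", "Next paragraph."],
]
print(join_pages(pages)[1])
# A phrase search should match across the page boundary without any trouble.
```

Working from the OCR XML gives more to go on than this, such as where text sits on the page, so treat the heuristic here as a starting point only.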

The process of fixing up the OCR involves parsing XML, so it is slower than just reading a text file. Fortunately the books are stored on a few hundred computers in our petabox storage cluster. I’m able to run the XML parsing and OCR tidy stage on the computers in the cluster in parallel.

Next I take this data and feed it into Solr. My Solr schema is pretty simple. The fields are just the Internet Archive identifier and the text of the book in a field called body. When we show a search inside results page we can pull the book title from the Open Library database. It might be quicker to generate result pages if all the data needed to display them were in the search index, but that would mean updating the index whenever any of these pieces of information changes. It is much simpler to only update the search index if the output from OCR changes.
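To show how small each indexed document is, here is a sketch of what one might look like as a JSON update payload. The `body` field name comes from the schema described above; the `identifier` field name, the sample identifier, and the helper names are assumptions for illustration. The sketch only builds the payload and does not talk to a Solr server.

```python
import json


def make_solr_doc(identifier: str, paragraphs: list[str]) -> dict[str, str]:
    """Build one search document: an identifier plus the full text of the book."""
    return {
        "identifier": identifier,  # assumed field name for the Internet Archive identifier
        "body": "\n\n".join(paragraphs),
    }


def make_update_payload(docs: list[dict[str, str]]) -> str:
    """Serialize documents as a JSON array, the shape Solr's JSON update handler accepts."""
    return json.dumps(docs, ensure_ascii=False)


# "examplebook00auth" is a made-up identifier, not a real item.
doc = make_solr_doc("examplebook00auth", ["First paragraph.", "Second paragraph."])
print(make_update_payload([doc]))
```

Because the title and other bibliographic details are left out, a document like this only has to be rebuilt when the OCR text itself changes.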
