Here at the Internet Archive we’ve scanned millions of books. One of our challenges is helping people find books they want to read. The bibliographic records can be searched in Open Library. If somebody knows the title they can find the book. It would be useful if all the text inside the book was also searchable. This is a problem I’ve been working on.
I started by taking the XML output from our OCR and tidying up the results. The OCR program only considers individual pages, so paragraphs that span a page break are split up, and a phrase search that crosses the page boundary won’t match. My code detects paragraphs broken across pages or columns and recombines them. It also deals with words hyphenated across a line break by joining them back together.
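A minimal sketch of the recombining idea in Python might look like this. It is not the production code: it assumes the OCR XML has already been parsed into a list of pages, each a list of paragraph strings, and the function names and the "ends with sentence punctuation" heuristic are my own illustrative assumptions.

```python
import re

# Heuristic (an assumption): a paragraph is finished if it ends with
# sentence punctuation, optionally followed by a closing quote or bracket.
_SENTENCE_END = re.compile(r'[.!?]["\')\]]*$')


def tidy(text):
    """Rejoin words hyphenated across a line break and collapse whitespace.

    Note: this also removes the hyphen from real compounds that happen to
    break at a line end ("well-\\nknown" becomes "wellknown").
    """
    text = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', text)
    return ' '.join(text.split())


def merge_pages(pages):
    """Flatten pages into paragraphs, gluing a page's first paragraph onto
    the previous page's last paragraph when that one looks unfinished."""
    merged = []
    for page in pages:
        first_on_page = True
        for para in page:
            para = tidy(para)
            if not para:
                continue
            if first_on_page and merged and not _SENTENCE_END.search(merged[-1]):
                prev = merged[-1]
                if prev.endswith('-'):
                    # Word split across the page break: drop the hyphen.
                    merged[-1] = prev[:-1] + para
                else:
                    merged[-1] = prev + ' ' + para
            else:
                merged.append(para)
            first_on_page = False
    return merged
```

With this, a phrase such as "the lazy dog" becomes searchable even when "la-" ends one page and "zy dog" starts the next. The same function could be applied to columns instead of pages.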
Fixing up the OCR involves parsing XML, which is slower than just reading a text file. Fortunately the books are stored on a few hundred computers in our petabox storage cluster, so I’m able to run the XML parsing and OCR tidy stage on the computers in the cluster in parallel.
Next I take this data and feed it into solr. My solr schema is pretty simple. The fields are just the Internet Archive identifier and the text of the book in a field called body. When we show a search inside results page we can pull the book title from the Open Library database. It might be quicker to generate result pages if all the data needed to display them were in the search index, but that would mean updating the index whenever one of these pieces of information changes. It is much simpler to only update the search index if the output from OCR changes.
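To make the two-field layout concrete, here is a rough Python sketch that builds a solr XML update message for one book. The field name identifier and the sample values are assumptions for illustration; only the body field name comes from the schema described above.

```python
import xml.etree.ElementTree as ET


def solr_add_doc(identifier, body):
    """Build a solr <add> message containing a single two-field document.

    'identifier' is a hypothetical field name; 'body' holds the tidied
    text of the whole book.
    """
    add = ET.Element('add')
    doc = ET.SubElement(add, 'doc')
    for name, value in (('identifier', identifier), ('body', body)):
        field = ET.SubElement(doc, 'field', name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding='unicode')


# Hypothetical identifier and text, for illustration only.
message = solr_add_doc('examplebook00auth', 'Chapter 1. Fish & chips.')
```

The resulting string can be posted to the solr update handler. Titles and authors are deliberately left out, since they are looked up in Open Library when the results page is generated.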
The books we scan are in many different languages. It would be nice to provide stemming for all of these languages, but at first I’m loading them into a single field with minimal English stemming rules. The body field where the text of the book goes is compressed, which keeps the index smaller. The field has termVectors="true" set, which improves the speed of highlighting, something that is important to us.
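For reference, a highlighted phrase search against the body field can be requested with solr's standard hl parameters. The sketch below only builds the request URL. The base URL, the identifier field name and the snippet count are assumptions, not our actual settings.

```python
from urllib.parse import urlencode


def search_inside_url(base_url, phrase, rows=10):
    """Build a solr select URL for a phrase search with highlighting.

    base_url and the 'identifier' field name are hypothetical.
    """
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace('\\', '\\\\').replace('"', '\\"')
    params = {
        'q': 'body:"%s"' % escaped,
        'fl': 'identifier',   # only return the book identifier
        'rows': rows,
        'hl': 'true',         # turn highlighting on
        'hl.fl': 'body',      # highlight matches in the book text
        'hl.snippets': 3,     # assumed number of snippets per book
    }
    return base_url + '?' + urlencode(params)


# Hypothetical solr location, for illustration only.
url = search_inside_url('http://localhost:8983/solr/select', 'call me Ishmael')
```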
In the solr config I had to increase the value of maxFieldLength. By default only the first 100,000 words in a field are indexed. Many of our books are longer than this.
Hathi Trust are doing some very similar work. They spotted that solr was using a lot of memory and were able to fix it by increasing the value of termInfosIndexDivisor. This was a helpful tip. I set termInfosIndexDivisor to 4 and our memory usage dropped from 13GB to 7GB.
Unlike the Hathi Trust we’re not using solr sharding. We’ve loaded just over 2 million books into a single 3TB index.
Most searches take about a second, but some complex searches can take 10 seconds or longer. I’m working on speeding these up.
In a later blog post I’ll write about how we use the results we get from solr.