Loading the text of 2 million books into Solr

Here at the Internet Archive we’ve scanned millions of books. One of our challenges is helping people find books they want to read. The bibliographic records can be searched in Open Library. If somebody knows the title they can find the book. It would be useful if all the text inside the book was also searchable. This is a problem I’ve been working on.

I started by taking the XML output from our OCR and tidying up the results. The OCR program only considers individual pages, so paragraphs that span a page break are split up, and a phrase search that crosses the page boundary won’t match. My code detects paragraphs broken across pages or columns and recombines them. It also handles hyphenated words by joining the two halves back together.
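To give a rough idea of what this kind of tidying looks like, here is a minimal sketch in Python. It is not the code that runs at the Archive. It assumes each page has already been reduced to a list of paragraph strings, and it uses a simple heuristic of my choosing for illustration: if the last paragraph seen so far doesn't end with sentence-ending punctuation, the first paragraph on the next page is treated as its continuation. The names `merge_pages` and `join_fragments` are made up for this example.

```python
import re

# Assumption: a paragraph is "finished" if it ends with ., ! or ?,
# optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r'[.!?]["\')\]]*$')


def join_fragments(first, second):
    """Join two pieces of text, undoing a hyphenated word split if present."""
    first = first.rstrip()
    second = second.lstrip()
    # "tor-" + "rents" becomes "torrents"; only when the next piece
    # starts with a lowercase letter, to avoid gluing unrelated text.
    if first.endswith("-") and second[:1].islower():
        return first[:-1] + second
    return first + " " + second


def merge_pages(pages):
    """Flatten pages (lists of paragraph strings) into one paragraph list,
    recombining paragraphs that were broken across a page boundary."""
    paragraphs = []
    for page in pages:
        for i, para in enumerate(page):
            continues_previous = (
                i == 0
                and len(paragraphs) > 0
                and not SENTENCE_END.search(paragraphs[-1].rstrip())
            )
            if continues_previous:
                paragraphs[-1] = join_fragments(paragraphs[-1], para)
            else:
                paragraphs.append(para)
    return paragraphs


pages = [
    ["It was a dark and stormy night. The rain fell in tor-"],
    ["rents across the city.", "A new paragraph."],
]
merged = merge_pages(pages)
```

A real implementation would need more signals than punctuation (indentation, column layout, running headers), and would have to decide what to do with words that are legitimately hyphenated, but the shape of the problem is the same.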

Fixing up the OCR involves parsing XML, which is slower than just reading a text file. Fortunately the books are stored on a few hundred computers in our petabox storage cluster, so I’m able to run the XML parsing and OCR tidy stage on the computers in the cluster in parallel.

Next I take this data and feed it into Solr. My Solr schema is pretty simple: the only fields are the Internet Archive identifier and the text of the book, in a field called body. When we show a search inside results page we can pull the book title from the Open Library database. It might be quicker to generate result pages if all the data needed to display them were in the search index, but that would mean updating the index whenever one of these pieces of information changes. It is much simpler to only update the search index if the output from OCR changes.
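As an illustration of how small a document with this schema is, here is a sketch that builds a Solr XML update message for one book using only the Python standard library. The field name `body` comes from the schema described above; the field name `identifier`, the function name `solr_add_message` and the sample identifier are assumptions made for this example, and the real schema may name things differently.

```python
import xml.etree.ElementTree as ET


def solr_add_message(identifier, body):
    """Build a Solr <add> update message containing a single document.

    Assumption: the identifier field is called "identifier".
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="identifier").text = identifier
    ET.SubElement(doc, "field", name="body").text = body
    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(add, encoding="unicode")


# Hypothetical identifier and text, for illustration only.
message = solr_add_message("examplebook00auth", "Fish & chips, page one.")
```

The resulting string would then be sent to the Solr update handler. Because the document carries nothing but the identifier and the text, it only has to be rebuilt when the OCR output for that book changes.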


Happy New Year! And, lists…

Wishing you and yours a very happy 2011, and hoping you enjoyed your holidays!

I wanted to take a moment to show you some of the fantastic Lists we’ve noticed over the last few weeks since the new feature launched.


Lists are here!

Open Library is happy to report that the Lists feature is here, for your collection-building enjoyment! You can create and share lists that include authors, works (all editions of a specific title), specific editions, and subjects.


While browsing around, you can see when an item has been added to somebody’s list – just look underneath the Add to List button.  Here are some items that have been listed so far:
