Here at the Internet Archive we’ve scanned millions of books. One of our challenges is helping people find books they want to read. The bibliographic records can be searched in Open Library. If somebody knows the title they can find the book. It would be useful if all the text inside the book was also searchable. This is a problem I’ve been working on.
I started by taking the output in XML from our OCR and tidying up the results. The OCR program only considers individual pages, so paragraphs that span a page were split up. A phrase search that crosses the page boundary wouldn’t match. My code detects paragraphs broken across pages or columns and recombines them. It also deals intelligently with hyphenated words by joining them back together.
The process of fixing up the OCR involves parsing XML, it is slower than just reading a text file. Fortunately the books are stored on a few hundred computers in our petabox storage cluster. I’m able to run the XML parsing and OCR tidy stage on the computers in the cluster in parallel.
Next I take this data and feed it into solr. My solr schema is pretty simple. The fields are just the Internet Archive identifier and the text of book in a field called body. When we show a search inside results page we can pull the book title from the Open Library database. It might be quicker to generate result pages if all the data need to display them were in the search index, but that would mean updating the index whenever one of these piece of information changes. It is much simpler to only update the search index if the output from OCR changes.
Continue reading →
Open Library is doing some maintenance. Back in a bit.
09:35AM: And we are back.
There can be more than way to say the same thing, for example gramophone record, phonograph record and vinyl records. When libraries write catalog records they pick one of these terms and sticks to it, they use what is known as a ‘controlled vocabulary‘. This makes it easier to browse library catalogs.
Traditionally it has been thought that patrons want to browse by author and subject headings, so these fields have been controlled. The data in these fields can be used in other ways, Ross Singer has been experimenting with geographic subject headings.
Publisher is an uncontrolled field. Penguin and Penguin Books are the same publisher, but their name has been entered in catalog records differently, making it difficult to browse by publisher.
A workaround is to use the ISBN field in the catalog record. Almost every book published since 1970 has an ISBN. English-language books start with a 0 or 1, followed by a variable-length publisher code, item number and finally a checksum digit.
For example: 0-14-043531-X
0 = English language
14 = Publisher code
043531 = Item number
X = checksum
We are able to build a list of ISBN publisher codes by picking the most popular publisher name, as it appears in library records, for each code. Using ISBN we can start the process of making publisher a controlled field.
I’ve just found three books by J. Peter May with descriptions of mathematics notation in the title:
- E [sign for infinity] ring spaces and E [sign for infinity] ring spectra
- [The mathematical expression for infinite loop] ring spaces and [The mathematical expression for infinite loop] ring spectra
- E [infinity subscript] ring spaces and E [infinity subscript] ring spectra
It is difficult to write software that can tell these titles are the same, yet they are. The Open Library book merging code has failed to recognize these are all the same book, and so we have three records in the database, where there should be one.
I wonder how often descriptions of mathematical notation appear in bibliographic records; this is the first time I’ve seen it.