A while back, Ben Gimpert wrote a guest post for us called Open Library Ore, explaining how he had begun to hack on the massive full text corpus on the Internet Archive, practising various Natural Language Processing techniques to begin to teach machines to glean topics of books by sheer letter crunching. Turns out the elements in the ore are beginning to emerge, particularly in the form of a dataset available for download under an Attribution-Noncommercial-Share Alike 3.0 CC license… Please, if you know SQL, why not download the dataset and see what you can find out? We’d love to hear about any discoveries you make, perhaps in the comments?
Ben says: Back in April I posted about taxonomies and loading Internet Archive books into a database tailored for natural language processing. The goal is to build a machine-learning model that automatically categorizes some text. In this case, we want to use the Open Library mapping between Internet Archive book and Library of Congress topic to train a statistical model of what sorts of language turn up in documents about a topic. Authors use phrases like “district-attorney” more when writing about the law than about science, and the real work is inferring these relationships with code. People smarter than I am build knowledge-based models, code that looks at semantic structures like parts-of-speech and grammar. I take a cruder approach, and hope that if we feed a machine-learning model a large enough number of books, a statistical model will be “good enough.”
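To make the “district-attorney” intuition concrete, here is a minimal sketch that measures how often a phrase turns up per topic. The function name `phrase_rates` and the toy texts are made up for illustration; this is only the kind of signal a statistical model would pick up on, not the model itself.

```python
from collections import Counter

def phrase_rates(labeled_texts, phrase):
    """Return {topic: occurrences of phrase per 1,000 words} for (topic, text) pairs."""
    hits = Counter()
    words = Counter()
    for topic, text in labeled_texts:
        lowered = text.lower()
        # Count non-overlapping occurrences of the phrase in this text.
        hits[topic] += lowered.count(phrase.lower())
        # Whitespace-separated tokens are a rough word count.
        words[topic] += len(lowered.split())
    return {t: 1000.0 * hits[t] / words[t] for t in words if words[t]}

# Toy data, invented for this example.
books = [
    ("law", "the district-attorney spoke and the district-attorney left"),
    ("science", "the cell divides in the dish"),
]
rates = phrase_rates(books, "district-attorney")
```

A phrase whose rate differs sharply between topics is a useful feature; one that shows up equally everywhere is not.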
Why all this effort? Besides being a very hard problem and therefore an intrinsically good hacking project, there are a handful of applications in my lazy file. Obviously one would be an automatic tagger of new Internet Archive books. [Ed: Yes, please!!] Before a human being has had the time to intelligently tag a recently-uploaded book with topics, an automatic model could fill in as a stopgap. Another application might be automatically organizing email by topic, and not simply by keyword, sender, or date. Imagine an application that speaks IMAP, and organizes my email by topic on-the-fly. These models need not be perfect. Speaking as someone with a nefarious background in trading and finance, a 51% accuracy would be a decent start.
The database I have been hacking is stable and no longer so embarrassing as to keep private. So I just uploaded a MySQL dump of the data on the Internet Archive, under a Creative Commons license.
Once you download the dump and pipe the three split SQL files into your MySQL server, you will end up with a database containing about 72,000 Open Library books. (The database is called “tree”, for obscure reasons.) Each of the books in the database is ready for a statistical approach to natural language processing applications. Here is how the books have been processed and prepared:
- First I downloaded an Open Library dump. Then I selected just those editions cross-referenced to both the Library of Congress classification scheme and to a plain ol’ ASCII text file at the Internet Archive. This ensured that I would have the actual text of each book as well as its topic. I ignored non-English editions, for now.
- I downloaded each book from the Internet Archive, using a polite Ruby web spider.
- Next, each book’s punctuation and capitalization were stripped away. This is the first heuristic (educated guess, really) for reducing the amount of data the system must analyze, while trying to keep the classification accuracy decent.
- Then a book’s text was parsed into single words (“1grams”). Storing just 1grams in the database is another heuristic, but not terrible because good machine-learning algorithms can create 2grams on-the-fly. The “district-attorney” example above is a 2gram.
- Next an English-language stop word list of uninformative words was used to ignore certain 1grams. Articles like “a” and “the” are filtered away.
- Then each 1gram was stemmed, to remove plural and adjective suffixes. This is a serious heuristic, since no stemming algorithm is perfect. Consider the difference between “middle-ages” and “middle-aged” for an example. However, my experience from other projects is that stemming is more boon than curse.
- Finally a table was built to rank the 1grams by popularity. Later we can restrict our models to just the most common few thousand 1grams, which would force our books into a smaller language dimensional space. Building this table and indexes took almost a week on my beleaguered Linux machine!
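The text-processing steps above can be sketched roughly as follows. This is not the code that built the database: the stop word list is a tiny made-up sample, `crude_stem` is a deliberately naive stand-in for a real stemmer, and all function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word list; the real list is not given in the post.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}

def crude_stem(word):
    """Strip a few common suffixes. A naive stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def to_1grams(text):
    """Lower-case, drop punctuation, split into words, filter stop words, stem."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return [crude_stem(w) for w in cleaned.split() if w not in STOP_WORDS]

def to_2grams(one_grams):
    """Build 2grams on-the-fly from adjacent 1grams."""
    return list(zip(one_grams, one_grams[1:]))

def rank_1grams(books):
    """Rank 1grams across all books, most popular first."""
    counts = Counter()
    for text in books:
        counts.update(to_1grams(text))
    return counts.most_common()

tokens = to_1grams("The District-Attorney argued the cases.")
ranking = rank_1grams(["the law and the courts", "courts of law", "law"])
```

Note how the crude stemmer turns “cases” into “cas”: a small reminder of why stemming counts as a serious heuristic. Restricting a model to the most common few thousand 1grams would then be a matter of slicing the front of the ranking.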
Hopefully the database is useful straight away, but there are sure to be a few problems with data duplication and normalization. Please drop me a line if you find anything glaring.
Next time, I will talk about actually training a machine-learning model using all this data. My implementation of Breiman’s random forest algorithm has not been scaling well, so I am experimenting with the Vowpal Wabbit project. This funnily named creature provides scalability guarantees, since it learns on-the-fly and uses the hashing trick for mapping ngrams into a feature space. (Hashing is better than a Bloom filter for machine-learning, since models can adapt to collisions.) My Vowpal Wabbit-based models are so far nothing to write home about, but they get better every night I have some spare cycles for Open Library NLP hacking.
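For the curious, the hashing trick can be sketched in a few lines. This is an illustration of the general idea only, not Vowpal Wabbit’s implementation: `hash_features` is a hypothetical helper, the bucket count is arbitrary, and `zlib.crc32` stands in for whatever hash function a real learner would use.

```python
import zlib

def hash_features(ngrams, num_buckets=2**18):
    """Map ngrams to a sparse {bucket index: count} vector via the hashing trick."""
    vector = {}
    for ngram in ngrams:
        # crc32 is deterministic across runs, unlike Python's built-in hash() on strings.
        index = zlib.crc32(ngram.encode("utf-8")) % num_buckets
        # Colliding ngrams simply share a bucket; the model learns one weight for both.
        vector[index] = vector.get(index, 0) + 1
    return vector

# A deliberately small bucket count to make the fixed-size feature space obvious.
features = hash_features(["district", "attorney", "district"], num_buckets=1024)
```

The appeal is that the feature space has a fixed size no matter how many distinct ngrams turn up, so no vocabulary table has to be built before learning starts.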