Ben Gimpert is a friend of the Open Library. He and I got together over lunch a few months ago to talk about big data, statistical natural language processing, and extracting meaning from Open Library programmatically. His efforts are beginning to bear some really interesting fruit, and while we work out how we might be able to present it online, we thought you might be interested to hear what he’s been up to…
Ben says: Back in the dawn of time (2004), I founded a one-hit startup to classify arbitrary web pages by their topic. The business inspiration was a floundering sales pitch by a company doing this for three-letter agencies, the Department of Defense and other sealed-bid sorts. My geek funny bone was struck by a What the Hack? talk on using zip compression fingerprints and entropic information to compare documents.
This is a very hard problem. We quickly spiraled down to doing serious natural language processing and dimensionality reduction. The startup’s alpha release read the web page currently in your browser, and told you what it was about. My own lazy file has a hundred uses for this tech, one of which is passively offering the user a “pages like this?” list. The project got some interest from old friends at Britannica, but its accuracy was just not up to snuff. Somehow the topic “synaesthesia” seemed to turn up for every web page. This pointed to a classic bottleneck in statistical NLP, machine learning, data mining or whatever the hip buzzword is this week: finding good, authoritative data.
Garbage in, garbage out, as they say. A large, free, edited taxonomy or folksonomy of topic-labeled text remains a rarity. If you are pragmatic and willing to look beyond the “semantic web” squabbling, there are a few contenders. Yahoo was moving away from its own topic hierarchy, Dmoz had too little coverage, and Usenet was too spammy. The entries in an encyclopedia might do a good enough job of defining an edited taxonomy of knowledge. However, a single document per topic made for an overfit model. In the terminology of machine learning, the models were unfortunately high on “variance” as opposed to “bias.” Hence our Wikipedia article on synaesthesia seemed to match everything. (There is a quip here about a sci-fi synaesthete robot involuntarily experiencing itself, but I digress.)
Fast forward a few years, when my day job is statistical NLP for machine-learning models of financial securities. I write code that reads the text of company news, and then trades a prediction of what will happen in the stock and stock options markets. A lunch with the amazing George suggested that I dust off my old project, and use Open Library’s mapping of subjects onto the Library of Congress topic taxonomy as my authoritative data.
To give myself some motivation to learn the Open Library’s architecture and API, I wrote a script to solve an easy but nagging problem for Open Library: some Ruby code to pull down Goodreads IDs and tag Open Library editions via the ISBN middleman. This project has now been finished by Open Library’s official coding badass.
Next, I wrote some Ruby to build a hierarchical mapping from an LCCO code like “BF885” into a tuple like <Philosophy, Psychology, Religion / Psychology / Physiognomy, Phrenology>. The LoC’s PDFs containing the hierarchical mapping are a bit messy, but this part was easy work.
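To make the idea concrete, here is a minimal Python sketch of that kind of lookup. It is not the original Ruby. The three tables are a tiny hand-typed excerpt standing in for whatever gets parsed out of the LoC PDFs, and the table names, the function name lcc_to_topics and the exact range bounds are assumptions made for illustration.

```python
import re

# Hypothetical, hand-typed excerpt of the classification outline.
# A real table would be parsed from the Library of Congress PDFs.
TOP_LEVEL = {"B": "Philosophy, Psychology, Religion"}
SUBCLASS = {"BF": "Psychology"}
# Number ranges within a subclass: (low, high, label). Bounds are illustrative.
RANGES = {"BF": [(839.8, 885.0, "Physiognomy, Phrenology")]}

# One to three class letters, optionally followed by a (decimal) number.
CODE_PATTERN = re.compile(r"^([A-Z]{1,3})\s*(\d+(?:\.\d+)?)?")


def lcc_to_topics(code):
    """Map a code such as 'BF885' to a tuple of topics, general to specific."""
    match = CODE_PATTERN.match(code.strip().upper())
    if match is None:
        return ()
    letters, number = match.group(1), match.group(2)
    topics = []
    # Level 1: the first letter names the main class.
    if letters[0] in TOP_LEVEL:
        topics.append(TOP_LEVEL[letters[0]])
    # Level 2: the full letter prefix names the subclass.
    if letters in SUBCLASS:
        topics.append(SUBCLASS[letters])
    # Level 3: the number falls inside a labeled range of the subclass.
    if number is not None:
        value = float(number)
        for low, high, label in RANGES.get(letters, []):
            if low <= value <= high:
                topics.append(label)
    return tuple(topics)
```

With these toy tables, lcc_to_topics("BF885") returns all three levels, while a code outside every known range simply stops at the subclass.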
Then, I downloaded a recent dump of the Open Library data, and filtered to those editions that had both an LCCO code and a link to the public domain text of the edition on the Internet Archive. I also removed non-English language editions, for now. While the number of Open Library editions is in the millions, the number of matches to my filter was a more manageable 75,000. This is several orders of magnitude more authoritative data than the Wikipedia concise articles from years ago.
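A filter along those lines might look like the Python sketch below. Everything specific in it is an assumption rather than a description of the actual script: that each dump line is tab-separated with a JSON record in the last column, and that the record uses the field names lc_classifications, ocaid (the Internet Archive identifier) and languages. Treating editions with no language recorded as non-English is also a choice made here, not something stated above.

```python
import json


def parse_dump_line(line):
    """Return the JSON record from one dump line (assumed: tab-separated, JSON last)."""
    return json.loads(line.rstrip("\n").split("\t", 4)[-1])


def keep_edition(edition):
    """True if the edition has a classification, a scanned text, and is in English."""
    has_lcc = bool(edition.get("lc_classifications"))  # assumed field name
    has_text = bool(edition.get("ocaid"))              # assumed field name
    language_keys = [lang.get("key") for lang in edition.get("languages", [])]
    is_english = "/languages/eng" in language_keys     # assumed key format
    return has_lcc and has_text and is_english


def filter_editions(lines):
    """Yield only the edition records that pass keep_edition."""
    for line in lines:
        edition = parse_dump_line(line)
        if keep_edition(edition):
            yield edition
```

Passing an open file handle on the dump to filter_editions would stream the matches without holding millions of records in memory.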
Armed with this huge mapping of public domain books to topic, I built a MySQL architecture to hold the data. This system is hipper than a cache because of some significant pre-processing. The books need to be downloaded from the Internet Archive; split into words and phrases, or the “n-grams” in the machine-learning world; stripped of words such as “the” and “an,” which are light on meaning (“stop words”); cleaned of plurals and other suffixes (“stemming”); and then stored with at least a half-hearted attempt at normalization. Perhaps using a new document database like CouchDB would have simplified the work, but that will be version 2.0.
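The text-processing steps in that pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, crude_stem is a deliberately naive suffix stripper rather than a real stemmer such as Porter’s, and stop words are removed before the n-grams are built, which is one possible ordering and not necessarily the one used in the real loader.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip one common suffix if at least three letters remain (naive on purpose)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, max_n=2):
    """Tokenize, drop stop words, stem, then emit 1-grams up to max_n-grams."""
    tokens = [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features
```

For example, extract_features("The bumps of the skull") yields the words “bump” and “skull” plus the phrase “bump skull”.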
As of the 29th of April, I have loaded about 56,000 editions into 40GB of MySQL tables and indexes. The population is obviously skewed toward books in the public domain, but there are still a couple of fun statistics: Massachusetts has the most history, California and Ohio the most state law, and zoology is the most covered science.
While the loading chugs along in another window, I am building a dimensionality reduction system. Working with text in the machine-learning space means tens of thousands of independent variables (“features”), and trimming the uninformative ones in a first pass. One naïve approach, which is probably a no-brainer, would be to drop any features that are included in at most two or three editions. This heuristic assumes that extremely rare words and phrases are not informative in choosing an edition’s topic, relative to all others in the population. Logistic regression and entropic information gain are the next step up in dimensionality reduction complexity.
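The naïve first pass is simple enough to sketch in Python. The version below assumes each edition has already been reduced to a list of feature strings; the function names and the max_rare_editions parameter are hypothetical, and the default of three is just one reading of “two or three.”

```python
from collections import Counter


def document_frequencies(documents):
    """Count, for each feature, how many editions contain it at least once."""
    counts = Counter()
    for features in documents:
        counts.update(set(features))  # set() so repeats within one edition count once
    return counts


def prune_rare_features(documents, max_rare_editions=3):
    """Drop every feature that appears in at most max_rare_editions editions."""
    counts = document_frequencies(documents)
    keep = {f for f, c in counts.items() if c > max_rare_editions}
    return [[f for f in features if f in keep] for features in documents]
```

Counting editions rather than raw occurrences matters here: a phrase repeated a hundred times in a single book is still rare across the population.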
Ironically, the model training itself should be easy since my implementation of Breiman’s random forests parallelizes decently and is disk-based. I will train a forest of trees for each topic, with simple binary labels. So each document will be either about “Phrenology” or “Not Phrenology.” I will start with training the most general classifiers, trying to differentiate “Law” from “Not Law” before tackling “Massachusetts Law.” Maybe I should fire up a giant EC2 instance and see how it does on the unreduced documents…
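The labeling scheme, though not the forest training itself, can be sketched in Python. It assumes each edition carries its topic path as a tuple ordered from general to specific; both function names are hypothetical. The resulting 0/1 lists are what any binary classifier, random forest or otherwise, would be trained against.

```python
def binary_labels(topic_paths, topic):
    """Label each edition 1 if its topic path includes the topic, else 0."""
    return [1 if topic in path else 0 for path in topic_paths]


def topics_general_first(topic_paths):
    """List every topic, shallowest level of the hierarchy first."""
    depth = {}
    for path in topic_paths:
        for level, topic in enumerate(path):
            # Remember the shallowest level at which each topic appears.
            depth[topic] = min(level, depth.get(topic, level))
    # Ties within a level are broken alphabetically for a stable order.
    return sorted(depth, key=lambda t: (depth[t], t))
```

Walking the output of topics_general_first and calling binary_labels for each topic gives one “X” versus “Not X” training set per classifier, with “Law” coming up before “Massachusetts Law.”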
George says: So, yay! Thank you, Ben! Really interesting stuff! We’re expecting to have a whopper dataset worthy of exploration/visualization online some time soon. In general, we’re poking and prodding at all the subjects we have in Open Library, and enjoying stomping around our new Subject pages. Open Library advisor Karen Coyle wrote about Social Aspects of Subject Headings the other day too. Definitely worth a look, if you’re interested in classification, etc.
Also, we’re really pleased to have had the help of Otis from Goodreads and Ben, of course, to weave Goodreads identifiers into Open Library records where there was an ISBN crossover. That meant over 3 million updates, and now we’re just trying to work out where to put the link back to Open Library from Goodreads. Pleased to report we’re in the process of doing the same thing with Tim from LibraryThing. Woo!