Archive for April, 2010

Open Library Ore

By George Oates

Ben Gimpert is a friend of the Open Library. He and I got together over lunch a few months ago to talk about big data, statistical natural language processing, and extracting meaning from Open Library programmatically. His efforts are beginning to bear some really interesting fruit, and while we work out how we might be able to present it online, we thought you might be interested to hear what he’s been up to…

Ben says: Back in the dawn of time (2004), I founded a one-hit startup to classify arbitrary web pages by their topic. The business inspiration was a floundering sales pitch by a company doing this for three-letter agencies, the Department of Defense and other sealed-bid sorts. My geek funny bone was struck by a What the Hack? talk on using zip compression fingerprints and entropic information to compare documents.

This is a very hard problem. We quickly spiral down to doing serious natural language processing and dimensionality reduction. The startup’s alpha release read the web page currently in your browser, and told you what it was about. My own lazy file has a hundred uses for this tech, one of which is passively offering the user a “pages like this?” list. The project got some interest from old friends at Britannica, but its accuracy was just not up to snuff. Somehow the topic “synaesthesia” seemed to turn up for every web page. This pointed to a classic bottleneck in statistical NLP, machine-learning, data mining or whatever the hip buzzword is this week — finding good, authoritative data.

Garbage in, garbage out, as they say. A large, free, edited taxonomy or folksonomy of topic-labeled text remains a rarity. If you are pragmatic and willing to look beyond the “semantic web” squabbling, there are a few contenders. Yahoo was moving away from its own topic hierarchy, Dmoz had too little coverage, and the Usenet is too spammy. The entries in an encyclopedia might do a good enough job of defining an edited taxonomy of knowledge. However, a single language document per topic made for an overfit model. In the terminology of machine-learning, the models were unfortunately high on “variance” as opposed to “bias.” Hence our Wikipedia article on synaesthesia seemed to match everything. (There is a quip here about a sci-fi synesthete robot involuntarily experiencing itself, but I digress.)

Fast forward a few years, when my day job is statistical NLP for machine-learning models of financial securities. I write code that reads the text of company news, and then trades a prediction of what will happen in the stock and stock options markets. A lunch with the amazing George suggested that I dust off my old project, and use the Open Library’s subjects mapping on the Library of Congress topic taxonomy as my authoritative data.

To give myself some motivation to learn the Open Library’s architecture and API, I wrote a script to solve an easy but nagging problem for Open Library. In this case, some Ruby code to pull down Goodreads IDs and tag Open Library editions via the ISBN middleman. Now, this project has been finished by Open Library’s official coding badass.

Next, I wrote some Ruby to build a hierarchical mapping from an LCCO code like “BF885” into a tuple like <Philosophy, Psychology, Religion / Psychology / Physiognomy, Phrenology>. The LoC’s PDFs containing the hierarchical mapping are a bit messy, but this part was easy work.
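A minimal sketch of that kind of lookup, written in Python for illustration (Ben's script was in Ruby and is not reproduced here). The tiny tables below hold only the entries needed for the “BF885” example; the function name `lcc_to_topics` and the numeric range boundaries are assumptions for the sketch, and a real table would be built from the LoC outline.

```python
import re

# Illustrative excerpt only: a real table would be parsed from the LoC outline.
CLASS_NAMES = {"B": "Philosophy, Psychology, Religion"}
SUBCLASS_NAMES = {"BF": "Psychology"}
# (subclass letters, low, high, label). The range boundaries are assumed for this sketch.
RANGES = [("BF", 839.8, 885.0, "Physiognomy, Phrenology")]

def lcc_to_topics(code):
    """Map a code like 'BF885' to a tuple of topics, most general first."""
    match = re.match(r"^([A-Z]{1,3})\s*(\d+(?:\.\d+)?)?", code.strip().upper())
    if not match:
        return ()
    letters, number = match.group(1), match.group(2)
    topics = []
    # Top-level class is the first letter, e.g. "B".
    if letters[0] in CLASS_NAMES:
        topics.append(CLASS_NAMES[letters[0]])
    # Subclass is the full letter prefix, e.g. "BF".
    if len(letters) > 1 and letters in SUBCLASS_NAMES:
        topics.append(SUBCLASS_NAMES[letters])
    # Numeric part selects a range within the subclass.
    if number is not None:
        value = float(number)
        for subclass, low, high, label in RANGES:
            if subclass == letters and low <= value <= high:
                topics.append(label)
    return tuple(topics)
```

Codes that fall outside any known range simply return the shorter tuple, so `lcc_to_topics("BF100")` stops at the subclass level.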

Then, I downloaded a recent dump of the Open Library data, and filtered to those editions that had both an LCCO code and a link to the public domain text of the edition on the Internet Archive. I also removed non-English language editions, for now. While the number of Open Library editions is in the millions, the number of matches to my filter was a more manageable 75,000. This is several orders of magnitude more authoritative data than the Wikipedia concise articles from years ago.
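The filter described above might look something like the following Python sketch. It assumes each dump line is a single JSON edition record and that the relevant fields are named `lc_classifications`, `ocaid` (the Internet Archive identifier) and `languages`; the exact dump layout Ben worked with is not described here, so treat the field names and the function names as assumptions. Editions with no language recorded are dropped in this sketch, which is also an assumption.

```python
import json

def keep_edition(edition):
    """True if the edition has a classification, an archive.org text, and is English."""
    has_lcc = bool(edition.get("lc_classifications"))
    has_text = bool(edition.get("ocaid"))
    language_keys = [lang.get("key") for lang in edition.get("languages", [])]
    is_english = "/languages/eng" in language_keys
    return has_lcc and has_text and is_english

def filter_dump(lines):
    """Yield only the edition records that pass keep_edition."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        edition = json.loads(line)
        if keep_edition(edition):
            yield edition
```

Because `filter_dump` is a generator, it can stream over a multi-gigabyte dump file without holding it in memory.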

Armed with this huge mapping of public domain books to topic, I built a MySQL architecture to hold the data. This system is hipper than a cache because of some significant pre-processing. The books need to be downloaded from the Internet Archive; split into words and phrases, or the “n-grams” in the machine-learning world; stripped of words such as “the” and “an,” which are light on meaning (“stop words”); cleaned of plurals and other suffixes (“stemming”); and then stored with at least a half-hearted attempt at normalization. Perhaps using a new document database like CouchDB would have simplified the work, but that will be version 2.0.
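To make the pre-processing steps concrete, here is a small Python sketch of one possible ordering: tokenize, drop stop words, stem, then build n-grams. The stop word list and the suffix-stripping `crude_stem` are deliberately tiny stand-ins (a real system would more likely use a full stop list and an established stemmer such as Porter's), and none of these names come from Ben's code.

```python
import re

# A tiny illustrative stop list; real lists run to hundreds of words.
STOP_WORDS = {"a", "an", "and", "of", "the"}

def crude_stem(word):
    """Strip one common suffix. A stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text, max_n=2):
    """Lowercase, tokenize, remove stop words, stem, then emit 1..max_n-grams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features
```

For example, `extract_features("The bumps of the skull")` yields the unigrams `bump` and `skull` plus the bigram `bump skull`. Note that removing stop words before forming n-grams joins words that were not adjacent in the original text; whether that is desirable is a design choice.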

As of the 29th of April, I have loaded about 56,000 editions into 40GB of MySQL tables and indexes. The population is obviously skewed toward books in the public domain, but there are still a couple fun statistics: Massachusetts has the most history, California and Ohio the most state law, and zoology is the most covered science.

While the loading chugs along in another window, I am building a dimensionality reduction system. Working with text in the machine-learning space means tens of thousands of independent variables (“features”), and trimming the uninformative ones in a first pass. One naïve approach, which is probably a no-brainer, would be to drop any features that are included in at most two or three editions. This heuristic assumes that extremely rare words and phrases are not informative in choosing an edition’s topic, relative to all others in the population. Logistic regression and entropic information gain are the next step up in dimensionality reduction complexity.
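The rare-feature heuristic amounts to a document-frequency cutoff. A minimal Python sketch, assuming each edition has already been reduced to a list of features and that the threshold name `max_editions` is ours rather than Ben's:

```python
from collections import Counter

def prune_rare_features(editions, max_editions=3):
    """Drop features that occur in at most `max_editions` distinct editions.

    `editions` maps an edition id to its list of features (words or n-grams).
    """
    edition_counts = Counter()
    for features in editions.values():
        # Count each feature once per edition, however often it repeats inside it.
        edition_counts.update(set(features))
    keep = {f for f, count in edition_counts.items() if count > max_editions}
    return {
        edition_id: [f for f in features if f in keep]
        for edition_id, features in editions.items()
    }
```

Counting editions rather than raw occurrences matters here: a phrase repeated a hundred times in a single book is still uninformative for telling that book's topic apart from the rest of the population.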

Ironically, the model training itself should be easy since my implementation of Breiman’s random forests parallelizes decently and is disk-based. I will train a forest of trees for each topic, with simple binary labels. So each document will be either about “Phrenology” or “Not Phrenology.” I will start with training the most general classifiers, trying to differentiate “Law” from “Not Law” before tackling “Massachusetts Law.” Maybe I should fire up a giant EC2 instance and see how it does on the unreduced documents…
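The labelling scheme described above can be sketched independently of the classifier. The Python below assumes each edition carries a topic path from most general to most specific (as in the hierarchical tuples built earlier) and shows how the binary labels and the general-first training order could be derived. The function names and the sample data are hypothetical, and the random forest itself is left out.

```python
def binary_labels(edition_topics, topic):
    """Label each edition True if `topic` appears anywhere in its topic path."""
    return {eid: topic in path for eid, path in edition_topics.items()}

def topics_general_first(edition_topics):
    """Order topics by their shallowest depth in any path, most general first."""
    depth = {}
    for path in edition_topics.values():
        for level, topic in enumerate(path):
            depth[topic] = min(level, depth.get(topic, level))
    # Ties at the same depth are broken alphabetically for a stable order.
    return sorted(depth, key=lambda t: (depth[t], t))

# Hypothetical sample data for illustration only.
sample = {
    "ed1": ("Law", "Massachusetts Law"),
    "ed2": ("Law",),
    "ed3": ("Psychology", "Phrenology"),
}
training_order = topics_general_first(sample)
law_labels = binary_labels(sample, "Law")
```

Each topic in `training_order` would then get its own classifier trained on the corresponding True/False labels, so "Law" versus "Not Law" is tackled before "Massachusetts Law".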

George says: So, yay! Thank you, Ben! Really interesting stuff! We’re expecting to have a whopper dataset worthy of exploration/visualization online some time soon. In general, we’re poking and prodding at all the subjects we have in Open Library, and enjoying stomping around our new Subject pages. Open Library advisor Karen Coyle wrote about Social Aspects of Subject Headings the other day too. Definitely worth a look, if you’re interested in classification, etc.

Also, we’re really pleased to have had the help of Otis from Goodreads and Ben, of course, to weave Goodreads identifiers into Open Library records where there was an ISBN crossover. That meant over 3 million updates, and now we’re just trying to work out where to put the link back to Open Library from Goodreads. Pleased to report we’re in the process of doing the same thing with Tim from LibraryThing. Woo!

Alice for the iPad

By George Oates

Exciting stuff! Revealed through @BibliOdyssey’s Twitter stream, with a link to a great post by Tali Krakowsky about the changing nature and new potential of pop-up books. What sort of physicality can books have on screen? Not just page-turning animations anymore!

Thumbnail View in BookReader!

By mang

We’re pleased to introduce a new thumbnail view for the Internet Archive BookReader. The thumbnail view gives you a quick visual impression of a book by seeing thumbnails of many pages at once. It’s a great way to quickly scan through a book.

Here’s how it looks for a book about the painter Goya:

The thumbnail view also makes it easy to pick out particular pages of interest, for example if you were trying to find the Burrowing Owl in Bird life in an Arctic Spring. Hint: here’s what he looks like:

You might also try looking at Old English colour prints or some of the other books about color prints.

This feature was submitted by Stephanie Collett of the California Digital Library via our BookReader GitHub account. It’s great to have this feature come in from the open source community building around the BookReader!

We're Hiring!

By George Oates

The Open Library team is seeking an experienced Python developer to join our small, experienced team. Born in 2007, Open Library is a large, wiki-editable library catalog and all our data and software are open. We want to enhance the way data moves in and out of Open Library by building features that make it simple for people to contribute records to the library as well as extract them. We want to connect our records to as many online resources as possible, to be the locus for information about books online.

You will be responsible for core application development (running a system called Infogami) as well as development of new website features. You will review and enhance the Open Library’s current API offering, as well as looking out on to the broader web to find and develop useful API integrations back into Open Library. Learn more about the Open Library system at http://www.openlibrary.org/developers.

Must haves:

  • Software engineering experience, 3-5 years
  • Mad Python skillz
  • Applied use of PostgreSQL, Ubuntu/Linux, JavaScript/AJAX
  • Demonstrable working code online
  • Experience with triplestore database architecture; RDF/XML formats
  • Experience with open-source development projects and practice
  • Ability to work under your own supervision towards a shared outcome
  • Excellent communication skills, both written and verbal

Desirable:

  • Wikipedia hacks
  • Experience using GitHub or similar
  • Demonstrable, creative API integration projects, preferably with mashes from more than one system
  • A presence in the Python community
  • An interest in excellent user interface design
  • Experience working with SOLR/Lucene
  • Experience with data processing (we have millions of records)!
  • Experience working in teams dispersed around the world
  • Interest in data visualisation
  • Located in, or prepared to relocate to San Francisco

We’re working towards big goals at Open Library. The online presence of books is a very interesting space at the moment, ripe for an innovative outlook and wide integration with all sorts of other systems. If you enjoy breaking new ground, iterative development and huge datasets, please let us know!

How To Apply
Please send your resume and cover letter to jennifer@archive.org with the subject line “Open Library Engineer”. We thank all applicants for their interest, but advise that only those selected for an interview will be contacted. No phone calls please.

About the Internet Archive
The Internet Archive is a non-profit digital library committed to preserving the world’s digital cultural artifacts. Used by over 6 million people, this resource is becoming part of how the Internet works. Our job is to put the best humanity has to offer within reach of students, educators and the general public. Find out more about our organization and web archive at www.archive.org.

The Internet Archive is an equal opportunity employer.