Archive for August, 2010

Reading in the Sun!

By George Oates

Pixel Qi screen vs MacBook in direct sunlight
Photo by Raj Kumar

We’ve just had a visit from Mary Lou and John from Pixel Qi, showing off their amazing new screens that can operate in two modes: with the backlight, like a normal LCD, or with the backlight off, like a highly reflective “epaper” display that uses an incredible 80% less power than a “standard” display.

Mary Lou is one of the founders of One Laptop Per Child (which Pixel Qi collaborates with these days), and the OLPC is one of the biggest distribution channels for Internet Archive books.

It's a Merge-Fest!

By George Oates

Thank you to everyone who’s merged an author or two since we launched the feature on Monday! The response has been excitingly wonderful – there have been about 200 merges run, with a record 31,246 edits for the last 7 days! And, not just by staff!

You can see all the merges as they happen from the new sub-section of Recent Changes:

http://openlibrary.org/recentchanges/merge-authors

I must say I was quite pleased to find Somerset Maugham in need of so much merge love. Check out all his alternate names! It’s so satisfying when you find a juicy one like that.

Onward!

Duplicate Authors? Wave your Magic Wand!

By George Oates

In your wanderings around Open Library, you may occasionally have seen two records for a person you know to be a single author, like Brooks, Terry & Terry Brooks.

Look for the Magic Wand around the site to start merging!

Today, we’re releasing a new feature to help you merge those two separate Terry entries into one. This, in turn, will update all the Works listed under each Terry and try to reconcile them into a single, tighter list of Works for the newly merged Terry. Magic!

Try a search for your favourite author now, browse recent author merges, or read on…

A few things bear explaining:

  • The merge feature works on the idea of a Master author and its Duplicates. As you do the merge, it will be up to you to elect the most suitable Master. We select the author record with the most Works as the default, but you can change that.
  • Only people with an Open Library account can merge authors
  • Updating the search engine after a merge takes a little while at the moment, up to about 10 minutes, so you won’t see the list of the new Master’s Works updated immediately. We’re looking to speed this up, but are very happy to release this as a “minimum viable product.” Merging an author with lots of works, lots of editions, or both takes a long time to update, so please be patient.
  • Duplicate authors’ names will be saved as an alternate on the Master record. For example, the (new) Master record for H. P. Lovecraft now lists alternates like Howard Philips Lovecraft, H. P Lovecraft, Howard P. Lovecraft and H.P Lovecraft. These alternates are often just subtle differences in spacing or capitalization, and we’re hoping they might prove useful later if we begin to stockpile them now.
  • If you’re in any doubt about whether or not to merge an author, don’t. It’s possible you might come across an odd-looking author name like August (re: H. P. Lovecraft) Derleth or H. P. (introduction by Lin Carter) (with Harry Houdini on Pharoahs) Lovecraft in a search for H. P. Lovecraft… these are trickier, because they’re noting contributors in the author name. Ideally, those contributors would be siphoned out into the contributors field per edition, and not merged into the H. P. Lovecraft Master. That would be a loss of information. So, it’s probably easier to just leave those long, odd “authors” alone for now.
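To make the Master/Duplicate idea concrete, here is a minimal Python sketch of the behaviour described in the list above: the record with the most Works is the default Master, and each Duplicate's name is kept as an alternate. The `Author` class, its field names, and the sample work titles are assumptions made for illustration only; this is not Open Library's actual schema or merge code, and real Work reconciliation involves more than comparing titles.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    """Hypothetical stand-in for an author record (not Open Library's schema)."""
    name: str
    works: list = field(default_factory=list)
    alternate_names: list = field(default_factory=list)


def default_master(authors):
    """Suggest the record with the most works as the Master (first one wins ties)."""
    return max(authors, key=lambda a: len(a.works))


def merge_authors(master, duplicates):
    """Fold duplicates into master: keep their names as alternates, pool their works."""
    for dup in duplicates:
        if dup is master:
            continue
        for name in [dup.name] + dup.alternate_names:
            if name != master.name and name not in master.alternate_names:
                master.alternate_names.append(name)
        for work in dup.works:
            # Naive reconciliation: identical titles are treated as the same work.
            if work not in master.works:
                master.works.append(work)
    return master


# Sample data, invented for this sketch.
records = [
    Author("H. P. Lovecraft", works=["The Dunwich Horror", "At the Mountains of Madness"]),
    Author("Howard Philips Lovecraft", works=["The Dunwich Horror"]),
    Author("H.P Lovecraft", works=["Dagon"]),
]
master = default_master(records)
merged = merge_authors(master, records)
print(merged.name, merged.alternate_names, merged.works)
```

Keeping the alternates on the Master means no name variant is lost, which matches the hope above that they might prove useful later.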

I’ve actually found it really fun to test this new feature. I found a useful directory listing of authors on Yahoo, and began merging the authors it lists in Open Library. By referring to an external list like this, I could just move from one to the next, rather than trying to come up with authors to search for.

We’ve also bundled another enhancement into this release, Recent Changes V2. There’s a new little bit of navigation on the recent changes page, so you can see things like all the authors merged on 8/16/2010, or all the bot edits made in June 2010. We’re looking forward to adding other bits and pieces to these new filtered views, for example, all the new ebooks made available on a certain day, or all the new covers uploaded in a certain month. Perhaps these could also have feeds available, so you could subscribe to a feed of changes to keep your version of the Open Library dataset up-to-date.

As well as Recent Changes V2, we’ve introduced the concept of “save_many” for transactions that contain lots of little updates. This is a performance improvement, and each such transaction is entered as a single line in Recent Changes – look for the little “expand” link to open up the contents of the save_many transaction.

So, why not have a shot at merging two duplicate authors? The best place to start is the Author search page.

Anyhoo, we’re excited to show you the first major feature we’ve rolled out since the launch of the redesign back in May, and we’re excited to see what you make of it. Go forth and merge!

Easy permanent links to book page images

By mang

We just launched a new image permalinks feature for downloading and linking to page images of books hosted on the Internet Archive. Using a page image permalink makes it easier to reference the contents of a book hosted on the Archive without having to know the details of how or where the book is stored. Since a book’s data could be moved around within the multiple petabytes of data in the Archive at any time, the permalinks provide a consistent and stable way to access the page images.

Here are a few quick examples. For each of these URLs you would add http://www.archive.org/download/{item identifier} to the beginning (hover over an image to see its full URL).

Referencing the cover image for a book at thumbnail size:
/page/cover_thumb.jpg

You can also request other sizes and rotations (in 90 degree increments):
/page/n194_rotate90_medium.jpg
Vertical Migration of Plankton

The full list of options is given in the Downloading / Linking Page Images section of the Book URLs developer documentation.
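As a rough illustration of how these URLs are put together, the Python sketch below joins the download prefix, an item identifier, and one of the page paths shown above. The helper name `page_image_url` and the identifier `exampleitem` are placeholders invented for this example; only the two page paths come from this post, and the developer documentation remains the reference for the other options.

```python
BASE = "http://www.archive.org/download"


def page_image_url(identifier, page_spec):
    """Build a page image permalink from an item identifier and a page spec.

    page_spec is passed through verbatim, e.g. "cover_thumb.jpg" or
    "n194_rotate90_medium.jpg" (the two examples given in this post).
    """
    return f"{BASE}/{identifier}/page/{page_spec}"


# "exampleitem" is a hypothetical identifier; substitute a real item identifier.
print(page_image_url("exampleitem", "cover_thumb.jpg"))
print(page_image_url("exampleitem", "n194_rotate90_medium.jpg"))
```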

We’re hoping that the image permalinks will make it easier for people to access the wealth of books hosted on the Archive and stimulate new uses of the images. Let us know if you do something cool!

Improved Set Up for Developers

By George Oates

Over the last few months, a handful of the developers at the Internet Archive have begun working more closely with Open Library code, where previously, the project was more isolated and had really only been worked on by the core team of, well, two: Anand & Edward. Apart from more fun collaborating with colleagues across the Archive, this increased exposure of the Open Library code base has been profoundly useful for the project. Beyond the very useful fresh perspectives and questions, it’s also led to an improved toolset for getting a developer’s instance of Open Library up and running on your local machine – so important when you’re trying to find your way around a new system.

The cherry on top is an install script for Linux, written by Raj Kumar, on top of the awesome work done by Michael Ang (Mang) to prepare for our recent Lending launch. The updated docs are here:

http://openlibrary.org/dev/docs/setup

This is a bit of a milestone for us – making the codebase more accessible and easier to work with is something we’ve wanted for ages, so it’s nice to see it well on its way.

Open Library Ore: A MySQL data dump is available

By George Oates

A while back, Ben Gimpert wrote a guest post for us called Open Library Ore, explaining how he had begun to hack on the massive full text corpus on the Internet Archive, practising various Natural Language Processing techniques to begin to teach machines to glean topics of books by sheer letter crunching. Turns out the elements in the ore are beginning to emerge, particularly in the form of a dataset available for download under Attribution-Noncommercial-Share Alike 3.0 CC license… Please, if you know SQL, why not download the dataset and see what you can find out? We’d love to hear any discoveries you make, perhaps in the comments?

Ben says: Back in April I posted about taxonomies and loading Internet Archive books into a database tailored for natural language processing. The goal is to build a machine-learning model that automatically categorizes some text. In this case, we want to use the Open Library mapping between Internet Archive book and Library of Congress topic to train a statistical model of what sorts of language turns up in documents about a topic. Authors use phrases like “district-attorney” more when writing about the law than science, and the real work is inferring these relationships with code. People smarter than I am build knowledge-based models, code that looks at semantic structures like parts-of-speech and grammar. I take a cruder approach, and hope that if we feed a machine-learning model a large enough number of books, a statistical model will be “good enough.”

Why all this effort? Besides being a very hard problem and therefore an intrinsically good hacking project, there are a handful of applications in my lazy file. Obviously one would be an automatic tagger of new Internet Archive books. [Ed: Yes, please!!] Before a human being has had the time to intelligently tag a recently-uploaded book with topics, an automatic model could fill in as a stopgap. Another application might be automatically organizing email by topic, and not simply by keyword, sender, or date. Imagine an application that speaks IMAP, and organizes my email by topic on-the-fly. These models need not be perfect. Speaking as someone with a nefarious background in trading and finance, a 51% accuracy would be a decent start.

The database I have been hacking is stable and no longer so embarrassing as to keep private. So I just uploaded a MySQL dump of the data on the Internet Archive, under a Creative Commons license.

Once you download the dump and pipe the three split SQL files into your MySQL server, you will end up with a database containing about 72,000 Open Library books. (The database is called “tree”, for obscure reasons.) Each of the books in the database is ready for a statistical approach to natural language processing applications. Here is how the books have been processed and prepared:

  • First I downloaded an Open Library dump. Then I selected just those editions cross-referenced to both the Library of Congress classification scheme and to a plain ‘ole ASCII text file at the Internet Archive. This ensured that I would have the actual text of each book as well as its topic. I ignored non-English editions, for now.
  • I downloaded each book from the Internet Archive, using a polite Ruby web spider.
  • Next each book’s punctuation and capitalization was stripped away. This is the first heuristic (educated guess, really) for reducing the amount of data the system must analyze, while trying to keep the classification accuracy decent.
  • Then a book’s text was parsed into single words (“1grams”). Storing just 1grams in the database is another heuristic, but not terrible because good machine-learning algorithms can create 2grams on-the-fly. The “district-attorney” example above is a 2gram.
  • Next an English-language stop word list of uninformative words was used to ignore certain 1grams. Articles like “a” and “the” are filtered away.
  • Then each 1gram was stemmed, to remove plural and adjective suffixes. This is a serious heuristic, since no stemming algorithm is perfect. Consider the difference between “middle-ages” and “middle-aged” for an example. However, my experience from other projects is that stemming is more boon than curse.
  • Finally a table was built to rank the 1grams by popularity. Later we can restrict our models to just the most common few thousand 1grams, which would force our books into a smaller language dimensional space. Building this table and indexes took almost a week on my beleaguered Linux machine!
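The text-preparation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop word list is a tiny invented sample, `crude_stem` is a toy suffix stripper standing in for whatever stemmer was actually used, and an in-memory `Counter` stands in for the MySQL popularity table. None of it is taken from the code that built the dump.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; the list actually used is not given here.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}


def to_1grams(text):
    """Strip capitalization and punctuation, then split into single words."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return text.split()


def crude_stem(word):
    """Toy stemmer (an assumption): drop a few common suffixes from longer words."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def prepare(text):
    """Apply the steps in order: 1grams, stop word filter, stemming."""
    return [crude_stem(w) for w in to_1grams(text) if w not in STOP_WORDS]


def rank_1grams(books):
    """Count stemmed 1grams across all books, most common first."""
    counts = Counter()
    for text in books:
        counts.update(prepare(text))
    return counts.most_common()


def bigrams(tokens):
    """Create 2grams on the fly from a list of stored 1grams."""
    return list(zip(tokens, tokens[1:]))


# Two invented one-sentence "books" for demonstration.
books = [
    "The district-attorney argued the laws.",
    "Laws bind the district attorney.",
]
print(rank_1grams(books))
print(bigrams(prepare(books[0])))
```

Note how the toy stemmer turns “argued” into “argu”: stems need not be real words, which is one reason stemming counts as a serious heuristic.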

Hopefully the database is useful straight away, but there are sure to be a few problems with data duplication and normalization. Please drop me a line if you find anything glaring.

Next time, I will talk about actually training a machine-learning model using all this data. My implementation of Breiman’s random forest algorithm has not been scaling well, so I am experimenting with the Vowpal Wabbit project. This funnily named creature provides scalability guarantees, since it learns on-the-fly and uses the hashing trick for mapping ngrams into a feature space. (Hashing is better than a bloom filter for machine-learning, since models can adapt to collisions.) My Vowpal Wabbit-based models are so far nothing to write home about, but they get better every night I have some spare cycles for Open Library NLP hacking.
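For readers unfamiliar with the hashing trick mentioned above, here is a minimal Python sketch of the general idea: each ngram is hashed to a bucket index in a fixed-size feature space, so no vocabulary table is needed. The function name, the bucket count, and the use of MD5 (chosen only because it gives the same result on every run) are assumptions for illustration; Vowpal Wabbit's own hash function and settings are not reproduced here.

```python
import hashlib


def hashed_features(tokens, num_buckets=2 ** 18):
    """Map ngrams to a sparse {bucket_index: count} vector via the hashing trick."""
    vector = {}
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % num_buckets
        # Different ngrams may collide in one bucket; their counts simply add up.
        vector[index] = vector.get(index, 0) + 1
    return vector


features = hashed_features(["law", "law", "district attorney"], num_buckets=1024)
print(features)
```

Because the feature space has a fixed size, memory use stays bounded however many distinct ngrams appear, at the cost of occasional collisions that the model has to absorb.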