Category Archives: Bulk Access

The Open Book Genome Project

We’ve all heard the advice, don’t judge a book by its cover. But then how should we go about identifying books which are good for us? The secret depends on understanding two things:

  1. What is a book?
  2. What are our preferences?

We can’t easily answer the second question without understanding the first one. But we can help by being good library listeners and trying to provide tools, such as the Reading Log and Lists, to help patrons record and discover books they like. Since everyone is different, the second question is key to understanding why patrons like these books and making Open Library as useful as possible to patrons.

What is a book?

As we’ve explored before, determining whether something is a book is a deceptively difficult task, even for librarians. It’s a bound thing made of paper, right? But what about audiobooks and ebooks? Ok, books have ISBNs right? But many formats can have ISBNs and books published before 1967 won’t have one. And what about yearbooks? Is a yearbook a book? Is a dictionary a book? What about a phonebook? A price guide? An atlas? There are entire organizations, like the San Francisco Center for the Book, dedicated to exploring and pushing the limits of the book format.

In some ways, it’s easier to answer this question about humans than books because every human is built according to a specific genetic blueprint called DNA. We all have DNA; what makes us unique are the variations in the more than 20,000 genes that our DNA is made of, which help encode characteristics like hair and eye color. In 1990, an international research group called the Human Genome Project (HGP) began sequencing the human genome to definitively uncover “nature’s complete genetic blueprint for building a human being”. The result, which was completed in 2003, was a compelling answer to “what is a human?”.

Nine years after HGP began, Will Glaser & Tim Westergren drew inspiration from it and launched a similar effort called the Music Genome Project, using trained experts to classify and label music according to a taxonomy of characteristics, like genre and tempo. This system became the engine which powers song recommendations for Pandora Radio.

Circa 2003, Aaron Stanton, Matt Monroe, Sidian Jones, and Dan Bowen adapted the idea of Pandora to books, creating a book recommendation service called BookLamp. Under the hood, they devised a Book Genome Project which combined computers and crowds to “identify, track, measure, and study the multitude of features that make up a book”.

Their system analyzed books and surfaced insights about their structure, themes, age-appropriateness, and even pace, bringing us within grasping distance of the answer to our question: What is a book?

[Figure: BookLamp’s Theme Currents for Carrie]

Sadly, the project did not release its data, was acquired by Apple in 2014, and was subsequently discontinued. But they left an exciting treasure map for others to follow.

And follow, others did. In 2006, a project called the Open Music Genome Project attempted to create a public, open, community alternative to Pandora’s Music Genome Project. We thought this was a beautiful gesture and a great opportunity for Open Library; perhaps we could facilitate public book insights which any project in the ecosystem could use to create their own answer to “what is a book?”. We also found inspiration in complementary projects like StoryGraph, which elegantly crowdsources book tags from patrons to help you “choose your next book based on your mood and your favorite topics and themes”; HathiTrust Research Center (HTRC), which has led the way in making book data available to researchers; and the Open Syllabus Project, which is surfacing useful academic books based on their usage across college curricula.

Introducing the Open Book Genome Project

Over the last several months, we’ve been talking to communities, conducting research, speaking with some of the teams behind these innovative projects, and building experiments to shape a non-profit adaptation of these approaches called the Open Book Genome Project (OBGP).

Our hope is that this Open Book Genome Project will help responsibly make book data more useful and accessible to the public: to power book recommendations, to compare books based on their similarities and differences, to produce more accurate summaries, to calculate reading levels to match audiences to books, to surface citations and URLs mentioned within books, and more.

OBGP hopes to achieve these things by employing a two-pronged approach, which readers may continue learning about in the following two blog posts:

  1. The Sequencer – a community-engineered bot which reads millions of Internet Archive books and extracts key insights for public consumption.
  2. Community Reviews – a new crowd-sourced book tagging system which empowers readers to collaboratively classify & share structured reviews of books.

Or hear an overview of the OBGP in this half-hour tech talk:

Google Summer of Code 2020: Adoption by Book Lovers

by Tabish Shaikh & Mek

OpenLibrary.org, the world’s best-kept library secret: Let’s make it easier for book lovers to discover and get started with Open Library.

Hi, my name is Tabish Shaikh and this summer I participated in the Google Summer of Code program with Open Library to develop improvements which will help book lovers discover and use OpenLibrary.org.


Bulk Access to OCR for 1 Million Books

The Internet Archive provides bulk access to the more than one million public domain books in its Texts Collection. The entire collection is over 0.5 petabytes, which includes raw camera images, cropped and deskewed images, PDFs, and raw OCR data. The Internet Archive is scanning 1,000 books per day, so this collection is always growing.
 
The OCR data is in a few different formats: character-based, word-based, and plain text. The word-based format, called DJVU XML, is very useful for researchers analyzing the collection. It contains coordinates for every word, as well as markup to indicate page, column, and paragraph structure. This is also the format that we use for making searchable PDFs and DjVu files, and for powering flipbook search.
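As a rough illustration of working with the word-based format, here is a minimal sketch that pulls each word and its coordinates out of a DJVU XML string. It assumes that words appear as WORD elements carrying a comma-separated coords attribute; the sample fragment and the helper name extract_words are hypothetical, made up for this example, and real files may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration only. Real DJVU XML files are far
# larger, and the element nesting and attributes shown here are assumptions.
SAMPLE = """<DjVuXML>
  <BODY>
    <OBJECT>
      <HIDDENTEXT>
        <PAGECOLUMN>
          <REGION>
            <PARAGRAPH>
              <LINE>
                <WORD coords="100,220,180,200">Open</WORD>
                <WORD coords="190,220,290,200">Library</WORD>
              </LINE>
            </PARAGRAPH>
          </REGION>
        </PAGECOLUMN>
      </HIDDENTEXT>
    </OBJECT>
  </BODY>
</DjVuXML>"""


def extract_words(xml_text):
    """Return a list of (text, coords) tuples, one per WORD element."""
    root = ET.fromstring(xml_text)
    words = []
    for word in root.iter("WORD"):
        # Split the coords attribute into integers; tolerate a missing attribute.
        raw = word.get("coords", "")
        coords = tuple(int(n) for n in raw.split(",") if n)
        words.append((word.text or "", coords))
    return words


if __name__ == "__main__":
    for text, coords in extract_words(SAMPLE):
        print(text, coords)
```

For files too large to hold in memory, ET.iterparse would be the more appropriate entry point; the loop body stays the same.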

The OCR files are much smaller than the image files that make up the bulk of the collection, but the full set of OCR results is still tens of terabytes. Using carefully selected query parameters, you can use the advanced XML search page to choose the part of the collection that is most useful to you. The Advanced Search returns results in XML, JSON, or CSV format. You can then use a script to download the results.
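The query string can also be assembled in code rather than by hand. The following is a minimal sketch that rebuilds the search URL used by the wget command in this post's download script; the parameter names (q, fl[], rows, fmt, xmlsearch) are copied from that command, while the function name build_search_url and its defaults are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "http://www.archive.org/advancedsearch.php"


def build_search_url(collection, file_format="djvu xml", rows=1000000, fmt="csv"):
    """Build an Advanced Search URL listing identifiers in a collection.

    The parameter names mirror the wget example in this post; whether the
    service still accepts exactly these parameters is an assumption.
    """
    query = "collection:(%s) AND format:(%s)" % (collection, file_format)
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # only return the identifier field
        ("rows", rows),          # upper bound on the number of results
        ("fmt", fmt),            # xml, json, or csv
        ("xmlsearch", "Search"),
    ]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("biodiversity"))
```

Passing "americana" or "toronto" as the collection changes only the q parameter, which matches the advice about switching collections.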

Here is an example of a bulk download of the OCR results for the Biodiversity Heritage Library collection. As of November 23, 2008, this collection contains 22,694 texts with OCR results in DJVU XML format. We first download a CSV file from the Advanced Search page that contains the Internet Archive identifiers that correspond to items in the collection, and then we use the script below to download the DJVU XML files. Be sure you have at least 200GB of free disk space before running this script!

The Biodiversity Heritage Library collection is pretty small, which is why I chose it for this example. To get the bulk of the Texts Collection, change the collection parameter to “americana” (719,156 items) or “toronto” (154,857 items). You will need a few terabytes of storage to hold it all.

To speed up download, run a couple of these scripts in parallel. Be sure to set startnum and endnum appropriately. You should be able to overcome latency issues and saturate your incoming internet connection by using enough parallel downloads. Enjoy!
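Choosing startnum and endnum by hand is error-prone, so here is a minimal sketch of one way to split the item list into non-overlapping ranges, one per downloader. It assumes endnum is inclusive, which is how the script's exit check behaves; the function name split_ranges is hypothetical.

```python
def split_ranges(total_items, workers):
    """Split item numbers 0..total_items-1 into inclusive (startnum, endnum) pairs.

    Returns at most `workers` pairs that cover every item exactly once.
    """
    if total_items <= 0 or workers <= 0:
        return []
    chunk = -(-total_items // workers)  # ceiling division
    ranges = []
    for start in range(0, total_items, chunk):
        end = min(start + chunk, total_items) - 1  # inclusive upper bound
        ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    # 22694 is the item count quoted for the Biodiversity Heritage Library.
    for startnum, endnum in split_ranges(22694, 4):
        print("startnum = %d, endnum = %d" % (startnum, endnum))
```

Each pair can then be written into its own copy of the script before the copies are started side by side.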

#!/usr/bin/python2.5

#Copyright(c)2008 Internet Archive. Software license GPL version 3.

# This script downloads the OCR results for the Internet Archive's 
# Biodiversity Heritage Library collection. It is meant to show you how to 
# download large datasets from archive.org.

# The csv file that drives the downloader can be fetched using the url below. 
# Change the collection parameter in the url to americana or toronto to get the
# bulk of the Internet Archive's scanned books. The advanced search page
# can help you tune the query string for accessing the rest of the collection:
# http://www.archive.org/advancedsearch.php

# Multiple downloaders can be run at the same time using the startnum and endnum
# parameters, which will help to speed up the download process.

import csv
import os
import commands
import sys

csvfile = "biodiversity.csv"
#download biodiversity.csv with
#wget 'http://www.archive.org/advancedsearch.php?q=collection%3A%28biodiversity%29+AND+format%3A%28djvu+xml%29&fl%5B%5D=identifier&rows=1000000&fmt=csv&xmlsearch=Search'


startnum = 0
endnum   = 999999

reader = csv.reader(open(csvfile, "rb"))
reader.next() #the first row is a header

for i in range(startnum):
    reader.next()

filenum = startnum

for row in reader:
    id = row[0]
    dirnum = "%09d"%filenum
    print "downloading file #%s, id=%s" % (dirnum, id)

    #place 1000 items per directory
    assert filenum<1000000
    parentdir = dirnum[0:3]
    subdir    = dirnum[3:6]
    path = '%s/%s' % (parentdir, subdir)
    
    if not os.path.exists(path):
        os.makedirs(path)
        
    
    url = "http://www.archive.org/download/%s/%s_djvu.xml" % (id, id)
    dlpath = "%s/%s_djvu.xml"%(path, id)
    
    if not os.path.exists(dlpath):
        #urllib.urlretrieve(url, dlpath)
        #use rate limiting to be nicer to the cluster
        (status, output) = commands.getstatusoutput("""wget '%s' -O '%s' --limit-rate=250k --user-agent='IA Bulk Download Script' -q""" % (url, dlpath))
        assert 0 == status
    else:
        print "\talready downloaded, skipping..."

    filenum+=1
    if (filenum > endnum):
        sys.exit()
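
The script above targets Python 2.5 and relies on the commands module, which is not available in Python 3. For readers on a newer interpreter, here is a minimal sketch of the same per-item logic in Python 3. It is an adaptation, not the original author's code: the function names are hypothetical, it swaps wget for urllib.request.urlretrieve, and it therefore drops the rate limiting and custom user agent that the original applies. It also downloads to a temporary ".part" name first, so an interrupted transfer is not mistaken for a finished file on the next run.

```python
import os
import urllib.request


def shard_path(filenum):
    """Mirror the script's directory layout: 1000 items per directory."""
    if not 0 <= filenum < 1000000:
        raise ValueError("filenum must be between 0 and 999999")
    dirnum = "%09d" % filenum
    return os.path.join(dirnum[0:3], dirnum[3:6])


def ocr_url(identifier):
    """Build the DJVU XML download URL, using the pattern from the script."""
    return "http://www.archive.org/download/%s/%s_djvu.xml" % (identifier, identifier)


def download_item(identifier, filenum, fetch=urllib.request.urlretrieve):
    """Download one item's OCR file unless it already exists.

    Returns (path, fetched), where fetched is False if the file was skipped.
    `fetch` is injectable so the logic can be exercised without a network.
    """
    path = shard_path(filenum)
    os.makedirs(path, exist_ok=True)
    dlpath = os.path.join(path, "%s_djvu.xml" % identifier)
    if os.path.exists(dlpath):
        return dlpath, False
    partial = dlpath + ".part"
    fetch(ocr_url(identifier), partial)
    os.replace(partial, dlpath)  # only completed downloads get the final name
    return dlpath, True
```

A driver loop would read identifiers from the CSV file with csv.reader, skip the header row, and call download_item for each row between startnum and endnum, pausing between requests to stay gentle on the cluster.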