OCLC pushes back policy to fall 2009

Fellow OpenLibrarian Karen Coyle reports:

OCLC has just announced that it is pushing back the date on which the new record use and transfer policy will take effect. The actual new date isn’t known, but the announcement says:

In order to allow sufficient time for feedback and discussion, implementation of the Policy will be delayed until the third quarter of the 2009 calendar year.

OCLC will form a “review board” to solicit info from members and others, and to advise the OCLC board of trustees about the policy. Jennifer Younger will chair this committee.

Read more at Karen’s blog, Coyle’s InFormation

Happy New Year!

Best wishes for the New Year from the Open Library team!

It has been a busy year for us, and we’re happy to have met a lot of our goals for 2008:

  • This year we dramatically grew the site. Open Library now has 23 million pages about books and authors.
  • We added Full-Text search, and now have more than 1 million books in our full-text search engine.
  • We launched Scan-On-Demand, which allows you to request that a public domain book from the Boston Public Library be scanned. Open Library will deliver a PDF of the book to you in about a week, and the digital copy will be available for others to read online as well.
  • We launched a new open source AJAX bookreader, which is now in beta testing on the archive.org site.
  • We launched a JSON API, a book covers API, and a JavaScript API to help developers interact with Open Library.

We’re looking forward to making the site even better in 2009! We hope you have a great year!

OpenBook WordPress Plugin

The OpenBook WordPress Plugin by John Miedema can be used to easily reference books from inside your WordPress blog and automatically pull covers and book data from Open Library.

We’re quite excited to see people using Open Library and building new tools using our public APIs.

Here’s an example of using the plugin. To pull in a book, place this tag in a WordPress post:

[openbook booknumber="0143034650"]

The “booknumber” in this case is an ISBN, which triggers a search behind the scenes.
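
If you’re curious what that lookup involves, here’s a rough sketch of the kind of ISBN request a plugin like OpenBook could make against Open Library’s public books API. The endpoint and parameters shown here are an illustration, not OpenBook’s actual internals:

# Hypothetical sketch: look up a book on Open Library by ISBN.
# This is not OpenBook's actual code; it only illustrates the idea.
import urllib

isbn = "0143034650"
url = "http://openlibrary.org/api/books?bibkeys=ISBN:%s&format=json" % isbn
print urllib.urlopen(url).read()  # a JSON object keyed by "ISBN:0143034650"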

You can also use Open Library’s opaque IDs to refer to a book when an ISBN isn’t available (e.g. for books published before ISBNs existed!).

In that case your link would look like this:
[openbook booknumber="/b/OL14015131M"]

Many of the Open Library books (particularly older ones) do not have a cover associated with them, but you can add one from the Open Library page for that book!

The OpenBook article in the Code4Lib journal describes some of the design decisions and implementation. (Note: the article mentions the old [openbook]0864921535[/openbook] way of using the openbook shortcode. The newer [openbook booknumber="0864921535"] style should now be used instead.)

Bulk Access to OCR for 1 Million Books

The Internet Archive provides bulk access to the over one million public domain books in its Texts Collection. The entire collection is over 0.5 petabytes, which includes raw camera images, cropped and deskewed images, PDFs, and raw OCR data. The Internet Archive is scanning 1,000 books per day, so this collection is always growing.
 
The OCR data comes in a few different formats: character-based, word-based, and plain text. The word-based format, called DJVU XML, is very useful for researchers analyzing the collection. It contains coordinates for every word, as well as markup indicating page, column, and paragraph structure. This is also the format we use for making searchable PDFs and DjVu files, and for powering flipbook search.
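
To give a feel for the word-based format, here is a minimal sketch of pulling words and their coordinates out of one of these files with Python. The element and attribute names (WORD, coords) and the sample filename are assumptions about the format, so check an actual file before relying on them:

# Rough sketch: print every word and its coordinates from a DJVU XML file.
# "exampleitem_djvu.xml" is a placeholder filename.
import xml.etree.cElementTree as ET

for event, elem in ET.iterparse("exampleitem_djvu.xml"):
    if elem.tag == "WORD":
        # coords is a comma-separated list of pixel coordinates for the word
        print elem.get("coords"), elem.text
    elem.clear()  # keep memory use down on very large files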

The OCR files are much smaller than the image files that make up the bulk of the collection, but the full set of OCR results is still tens of terabytes. Using carefully-selected query parameters, you can use the Advanced Search page to choose the part of the collection that is most useful to you. The Advanced Search returns results in XML, JSON, or CSV format. You can then use a script to download the results.
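
For example, here is a small sketch of fetching a CSV of identifiers from Python rather than with wget; the query mirrors the one in the comments of the download script below:

# Sketch: fetch the list of identifiers as CSV via the Advanced Search page.
import urllib

params = urllib.urlencode({
    "q": "collection:(biodiversity) AND format:(djvu xml)",
    "fl[]": "identifier",
    "rows": "1000000",
    "fmt": "csv",
    "xmlsearch": "Search",
})
url = "http://www.archive.org/advancedsearch.php?" + params
open("biodiversity.csv", "w").write(urllib.urlopen(url).read())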

Here is an example of bulk downloading the OCR results for the Biodiversity Heritage Library collection. As of November 23, 2008, this collection contains 22,694 texts with OCR results in DJVU XML format. We first download a CSV file from the Advanced Search page that contains the Internet Archive identifiers corresponding to items in the collection, and then we use the script below to download the DJVU XML files. Be sure you have at least 200GB of free disk space before running this script!

The Biodiversity Heritage Library collection is pretty small, which is why I chose it for this example. To get the bulk of the Texts Collection, change the collection parameter to “americana” (719,156 items) or “toronto” (154,857 items). You will need a few terabytes of storage to hold it all.

To speed up the download, run a few copies of this script in parallel. Be sure to set startnum and endnum appropriately in each copy. You should be able to overcome latency issues and saturate your incoming internet connection by using enough parallel downloads. Enjoy!

#!/usr/bin/python2.5

#Copyright(c)2008 Internet Archive. Software license GPL version 3.

# This script downloads the OCR results for the Internet Archive's 
# Biodiversity Heritage Library collection. It is meant to show you how to 
# download large datasets from archive.org.

# The csv file that drives the downloader can be fetched using the url below. 
# Change the collection parameter in the url to americana or toronto to get the
# bulk of the Internet Archive's scanned books. The advanced search page
# can help you tune the query string for accessing the rest of the collection:
# http://www.archive.org/advancedsearch.php

# Multiple downloaders can be run at the same time using the startnum and endnum
# parameters, which will help to speed up the download process.

import csv
import os
import commands
import sys

csvfile = "biodiversity.csv"
#download biodiversity.csv with
#wget 'http://www.archive.org/advancedsearch.php?q=collection%3A%28biodiversity%29+AND+format%3A%28djvu+xml%29&fl%5B%5D=identifier&rows=1000000&fmt=csv&xmlsearch=Search'


startnum = 0
endnum   = 999999

reader = csv.reader(open(csvfile, "rb"))
reader.next() #the first row is a header

for i in range(startnum):
    reader.next()

filenum = startnum

for row in reader:
    id = row[0]
    dirnum = "%09d"%filenum
    print "downloading file #%s, id=%s" % (dirnum, id)

    #place 1000 items per directory
    assert filenum<1000000
    parentdir = dirnum[0:3]
    subdir    = dirnum[3:6]
    path = '%s/%s' % (parentdir, subdir)
    
    if not os.path.exists(path):
        os.makedirs(path)
        
    
    url = "http://www.archive.org/download/%s/%s_djvu.xml" % (id, id)
    dlpath = "%s/%s_djvu.xml"%(path, id)
    
    if not os.path.exists(dlpath):
        #urllib.urlretrieve(url, dlpath)
        #use rate limiting to be nicer to the cluster
        (status, output) = commands.getstatusoutput("""wget '%s' -O '%s' --limit-rate=250k --user-agent='IA Bulk Download Script' -q""" % (url, dlpath))
        assert 0 == status
    else:
        print "\talready downloaded, skipping..."

    filenum+=1
    if (filenum > endnum):
        sys.exit()