Open Library beta

One web page for every book.


Archive for November 2008


Bulk Access to OCR for 1 Million Books

November 24th, 2008 — 08:18 am

The Internet Archive provides bulk access to more than one million public domain books in its Texts Collection. The entire collection is over 0.5 petabytes, which includes raw camera images, cropped and deskewed images, PDFs, and raw OCR data. The Internet Archive is scanning 1,000 books per day, so this collection is always growing.
 
The OCR data is in a few different formats: character-based, word-based, and plain text. The word-based format, called DJVU XML, is very useful for researchers analyzing the collection. It contains coordinates for every word, as well as markup to indicate page, column, and paragraph structure. This is also the format that we use for making searchable PDFs and DjVu files, and for powering flipbook search.
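As a rough illustration of how the per-word coordinates might be read, here is a minimal Python 3 sketch. The sample markup is hand-written for this example, and the element and attribute names it uses (WORD, coords, LINE, PARAGRAPH) are assumptions about the DJVU XML layout; check them against a real _djvu.xml file before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample. Real files are far larger, and the
# element/attribute names here are assumptions to verify against real data.
SAMPLE = """<OBJECT>
  <PARAGRAPH>
    <LINE>
      <WORD coords="100,200,180,170">Open</WORD>
      <WORD coords="190,200,300,170">Library</WORD>
    </LINE>
  </PARAGRAPH>
</OBJECT>"""


def extract_words(xml_text):
    """Return a list of (word, [coordinates]) pairs from DJVU-XML-like text."""
    root = ET.fromstring(xml_text)
    words = []
    for word in root.iter("WORD"):
        # Keep the numbers as given; their meaning is not interpreted here.
        coords = [int(n) for n in word.get("coords", "").split(",") if n]
        words.append((word.text, coords))
    return words


for text, coords in extract_words(SAMPLE):
    print(text, coords)
```

The sketch deliberately returns the coordinate numbers without naming them, since the order of the values is not described in this post.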

The OCR files are much smaller than the image files that make up the bulk of the collection, but the full set of OCR results is still tens of terabytes. Using carefully selected query parameters, you can use the advanced XML search page to choose the part of the collection that is most useful to you. The Advanced Search returns results in XML, JSON, or CSV format. You can then use a script to download the results.
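The query parameters can also be assembled in code. The Python 3 sketch below rebuilds the same Advanced Search URL that appears in the comments of the download script in this post; the function name build_search_url and its arguments are made up for illustration, and only the "csv" value of fmt is taken from that URL.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://www.archive.org/advancedsearch.php"


def build_search_url(collection, fmt="csv", rows=1000000):
    """Build an Advanced Search URL listing identifiers that have DJVU XML."""
    params = [
        ("q", "collection:(%s) AND format:(djvu xml)" % collection),
        ("fl[]", "identifier"),   # return only the identifier field
        ("rows", rows),           # upper bound on the number of results
        ("fmt", fmt),             # "csv" is the value used in this post
        ("xmlsearch", "Search"),
    ]
    # urlencode percent-escapes ':', '(', ')', '[' and ']' and turns spaces into '+'.
    return SEARCH_URL + "?" + urlencode(params)


print(build_search_url("biodiversity"))
```

Passing a different collection name, such as "americana" or "toronto", changes only the q parameter.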

Here is an example of a bulk download of the OCR results for the Biodiversity Heritage Library collection. As of November 23, 2008, this collection contains 22,694 texts with OCR results in DJVU XML format. We first download a CSV file from the Advanced Search page that contains the Internet Archive identifiers of the items in the collection, and then we use the script below to download the DJVU XML files. Be sure you have at least 200GB of free disk space before running this script!

The Biodiversity Heritage Library collection is pretty small, which is why I chose it for this example. To get the bulk of the Texts Collection, change the collection parameter to “americana” (719,156 items) or “toronto” (154,857 items). You will need a few terabytes of storage to hold it all.

To speed up the download, run a couple of these scripts in parallel. Be sure to set startnum and endnum appropriately. You should be able to overcome latency issues and saturate your incoming internet connection by using enough parallel downloads. Enjoy!
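One way to pick non-overlapping startnum and endnum values for several parallel runs is sketched below in Python 3. The helper split_ranges is hypothetical and not part of the downloader; it assumes that endnum is inclusive, which matches how the script in this post stops only once filenum exceeds endnum.

```python
def split_ranges(total_items, workers):
    """Split item numbers 0..total_items-1 into inclusive (startnum, endnum) pairs."""
    base, extra = divmod(total_items, workers)
    ranges = []
    start = 0
    for w in range(workers):
        # The first `extra` workers take one additional item each.
        size = base + (1 if w < extra else 0)
        if size == 0:
            continue  # more workers than items
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# 22694 is the size of the Biodiversity Heritage Library collection quoted above.
for startnum, endnum in split_ranges(22694, 4):
    print("startnum = %d, endnum = %d" % (startnum, endnum))
```

Each printed pair can be copied into a separate copy of the script before starting it.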

#!/usr/bin/python2.5
 
#Copyright(c)2008 Internet Archive. Software license GPL version 3.
 
# This script downloads the OCR results for the Internet Archive's
# Biodiversity Heritage Library collection. It is meant to show you how to
# download large datasets from archive.org.
 
# The csv file that drives the downloader can be fetched using the url below. 
# Change the collection parameter in the url to americana or toronto to get the
# bulk of the Internet Archive's scanned books. The advanced search page
# can help you tune the query string for accessing the rest of the collection:
# http://www.archive.org/advancedsearch.php
 
# Multiple downloaders can be run at the same time using the startnum and endnum
# parameters, which will help to speed up the download process.
 
import csv
import os
import commands
import sys
 
csvfile = "biodiversity.csv"
#download biodiversity.csv with
#wget 'http://www.archive.org/advancedsearch.php?q=collection%3A%28biodiversity%29+AND+format%3A%28djvu+xml%29&fl%5B%5D=identifier&rows=1000000&fmt=csv&xmlsearch=Search'
 
 
startnum = 0
endnum   = 999999
 
reader = csv.reader(open(csvfile, "rb"))
reader.next() #the first row is a header
 
for i in range(startnum):
    reader.next()
 
filenum = startnum
 
for row in reader:
    id = row[0]
    dirnum = "%09d"%filenum
    print "downloading file #%s, id=%s" % (dirnum, id)
 
    #place 1000 items per directory
    assert filenum<1000000
    parentdir = dirnum[0:3]
    subdir    = dirnum[3:6]
    path = '%s/%s' % (parentdir, subdir)
 
    if not os.path.exists(path):
        os.makedirs(path)
 
 
    url = "http://www.archive.org/download/%s/%s_djvu.xml" % (id, id)
    dlpath = "%s/%s_djvu.xml"%(path, id)
 
    if not os.path.exists(dlpath):
        #urllib.urlretrieve(url, dlpath)
        #use rate limiting to be nicer to the cluster
        (status, output) = commands.getstatusoutput("""wget '%s' -O '%s' --limit-rate=250k --user-agent='IA Bulk Download Script' -q""" % (url, dlpath))
        assert 0 == status
    else:
        print "\talready downloaded, skipping..."
 
    filenum+=1
    if (filenum > endnum):
        sys.exit()
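To make the script's directory layout easier to follow, here is a small Python 3 sketch of how an item number and identifier map to a download URL and a local path. The function plan_download and the identifier "exampleitem" are made up for illustration; the URL pattern and the slicing of the nine-digit number are copied from the script above.

```python
def plan_download(filenum, identifier):
    """Return (url, local_path) for one item, mirroring the script's layout."""
    dirnum = "%09d" % filenum   # zero-padded to nine digits, e.g. 12345 -> "000012345"
    parentdir = dirnum[0:3]     # first three digits
    subdir = dirnum[3:6]        # middle three digits
    # The last three digits are not part of the path, so 1000 consecutive
    # items end up in the same parentdir/subdir directory.
    url = "http://www.archive.org/download/%s/%s_djvu.xml" % (identifier, identifier)
    local_path = "%s/%s/%s_djvu.xml" % (parentdir, subdir, identifier)
    return url, local_path


# "exampleitem" is a made-up identifier, used only for illustration.
print(plan_download(12345, "exampleitem"))
```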

3 comments » | Bulk Access, OCR

OLPC Bookreader Demonstration

November 18th, 2008 — 07:32 pm

Open Library recently launched a web demonstration designed to illustrate how Internet Archive book collections can be viewed on the OLPC XO Laptop.

We invite you to take a look!

It is still in the early stages and is built on open source software: an embeddable AJAX reader and a component known as Carousel, which scrolls through the collection of books on the right.

We would like to have feedback on how to improve the user experience of this demonstration and its underlying components for users of the OLPC XO and Open Library.

There are a few known bugs in the collections (i.e., Carousel) navigation:

  • A user cannot quickly scroll to the top and bottom of the collection
  • A user cannot view multiple collections and sub-collections, or filter book selections on a variable such as author or subject
  • More books should be revealed in the collections carousel on screen rotate
  • Book titles should be truncated so they do not break into two lines (this interferes with the viewable area of the carousel)

It is not yet fully functional in tablet mode on the XO; this depends on changes to the GnuBook reader (the embeddable reader that enables scrolling through the individual book pages). We need it to:

  • Respond to the arrow keys when in tablet mode (book and collection navigation)
  • Zoom the book to fit the width of the book viewing area
  • Default to a two-page view rather than a single-page view

We created an OLPC bundle for browsing books offline on the XO. It currently contains 5 books, and uses low resolution images to improve download speed:

Demo: http://openlibrary.org/static/olpc_bundle/openlibrary/
Bundle: http://openlibrary.org/static/olpc_bundle/openlibrary.xol

We encourage anyone interested in these two projects to help beta test the software components. The primary focus is on GnuBook:

Documentation: http://openlibrary.org/dev/docs/bookreader
Bug Tracker: https://bugs.launchpad.net/gnubook/
Source Code: http://github.com/openlibrary/bookreader/tree/master

We also have a demonstration of the GnuBook reader without the Carousel navigation here.

Enjoy! (And be sure to check out a 2008 Google Summer of Code project for a Sugar app book viewer by Aleksander Kalev!)

4 comments » | Open Source

One Webpage For Every Book Ever Published!

November 18th, 2008 — 12:32 am

Welcome to the Open Library Blog. This site will have the latest news about openlibrary.org.

#!/usr/bin/python
print "Hello World"

1 comment » | News