

Ryan Shaw left a great comment under the item below about Paul Courant's defense of the University of Michigan's partnership with Google. Here is the text of his post about the image formats that Google is delivering to libraries and what is at stake.

Once again, the details matter.

Libraries Look a Gift Horse in the Mouth

Filed under: books, future, library | ryan @ 9:51 pm

UC Berkeley I-School professor Paul Duguid is quoted in an article from tomorrow's NYT about libraries rejecting Google's digitization program in favor of working with the Internet Archive. The article focuses on Google's clauses against allowing other commercial search engines to index the scans, but doesn't mention another aspect of the deals which is worse: the OCR output of Google-scanned books isn't made available to the participating libraries or to the public. Thus researchers who need digitized corpora for developing information retrieval or natural language processing technology can't make use of their own university libraries' resources. This isn't the case with books scanned by the Internet Archive, the OCR output of which is made available to everyone. Fortunately UC Berkeley is one of the libraries working with the Internet Archive's scanning program, and the OCR output of those scans is proving to be very useful for my own research. As Clifford Lynch has written, providing access to library resources must go beyond simply making them available to human readers, toward making them available to be computed upon. Kudos to the libraries that are realizing this and choosing to work with the Internet Archive.


Comments (6)

This isn't true, the libraries ARE getting the OCR. See the contracts at:
http://books.google.com/googlebooks/partners.html

Besides, if you have the scan, you can OCR it yourself -- after all, the Google OCR is derived from the scan. What most libraries do not get is the map that tells you where the word is on the page. Again, they should be able to derive this from the scan, but it's more processing.

Leslie Johnston on November 8, 2007 5:09 PM:

The OCR data is indeed included with the page images.

It's also not true that the OCR text is never available to the public. For works in the public domain, GBS offers a "View Plain Text" option where you can read the OCR instead of browsing the page images. Here's a link to "The Poetical Works of Sir Walter Scott" with such a link: http://books.google.com/books?id=iza0kGRRvEEC&printsec=frontcover

Karen, Leslie, you are not making a distinction between plain text and machine-readable XML output of OCR, which typically has far more information. OCA makes the latter available, Google seems to only make the former available. Sure, maybe the cash-strapped libraries could reconstruct the original OCR output with "more processing," but Google is the one with the world's most powerful supercomputer at its disposal, not the libraries.
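To make the distinction in this comment concrete, here is a minimal Python sketch of what structured OCR output carries that plain text does not. The XML layout below (`page`, `line`, and `word` elements with `x0`/`y0`/`x1`/`y1` attributes) is invented purely for illustration; it is not the format used by Google or the OCA, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output for one page. The element and attribute names are
# assumptions made for this example, not any vendor's actual schema.
SAMPLE_PAGE = """
<page number="1" width="600" height="900">
  <line>
    <word x0="50" y0="40" x1="120" y1="60">The</word>
    <word x0="130" y0="40" x1="260" y1="60">Poetical</word>
    <word x0="270" y0="40" x1="350" y1="60">Works</word>
  </line>
  <line>
    <word x0="50" y0="70" x1="90" y1="90">of</word>
    <word x0="100" y0="70" x1="150" y1="90">Sir</word>
    <word x0="160" y0="70" x1="250" y1="90">Walter</word>
    <word x0="260" y0="70" x1="330" y1="90">Scott</word>
  </line>
</page>
"""


def parse_words(xml_text):
    """Return a list of (word, (x0, y0, x1, y1)) tuples in reading order."""
    root = ET.fromstring(xml_text.strip())
    words = []
    for word in root.iter("word"):
        box = tuple(int(word.get(key)) for key in ("x0", "y0", "x1", "y1"))
        words.append((word.text, box))
    return words


def plain_text(xml_text):
    """Flatten the structured output to plain text, discarding coordinates."""
    root = ET.fromstring(xml_text.strip())
    lines = []
    for line in root.iter("line"):
        lines.append(" ".join(word.text for word in line.iter("word")))
    return "\n".join(lines)


def locate(xml_text, query):
    """Return the bounding box of every word matching query (case-insensitive)."""
    return [box for text, box in parse_words(xml_text)
            if text.lower() == query.lower()]


if __name__ == "__main__":
    print(plain_text(SAMPLE_PAGE))
    print(locate(SAMPLE_PAGE, "scott"))
```

Going from the structured form to plain text is a one-way flattening: `plain_text` needs only a join, but nothing in its output lets you recover the boxes that `locate` relies on. Getting them back means running OCR on the page images again, which is the "more processing" under dispute in this thread.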

geez ryan, you've repeated this
all over cyberspace, and it's weak.

it's simple to program the routine
to determine location information.

and the libraries can do o.c.r. too.

plus, in the long run, libraries will
be serving _text_, not _scans_, so
you're placing too much emphasis
-- _far_ too much -- on that data...

-bowerbird

Please allow me to introduce two projects from Secret Studio that you may be interested in.

http://www.pointmore.com has about 20 thousand books (mostly) from Google Book Search that we have recognized and made available for download.

http://www.ultrapedia.com is a search engine we have created for searching through these books - currently standing at about six million pages. Please take a look at my site. I launched it on 1st January.

PS Hello again bowerbird - we keep meeting in the same forums ;-)

Frank

Nice post, but it would be better if Ryan described the full topic.


A book in progress by

Siva Vaidhyanathan


This blog, the result of a collaboration between me and the Institute for the Future of the Book, is dedicated to exploring the process of writing a critical interpretation of the actions and intentions behind the cultural behemoth that is Google, Inc. The book will answer three key questions: What does the world look like through the lens of Google? How is Google's ubiquity affecting the production and dissemination of knowledge? And how has the corporation altered the rules and practices that govern other companies, institutions, and states?
