John Wilkin thinks not:

Our hidden digital libraries

Two of my very talented colleagues, Kat Hagedorn and Josh Santelli, just published a nice piece in D-Lib entitled "Google Still Not Indexing Hidden Web URLs." Kat and Josh and I have discussed this problem off and on, stimulated in part by our frustrations in getting the OAI data collected by OAIster into search services like Google.

In preparation for a recent talk in China (the challenges and opportunities for digital libraries), I talked to Kat and Josh about the extent of OAIster data not findable through standard web searches. That so much of our digital library content is not findable through standard search engines has always been a troublesome issue, and I would have expected that with the passage of time, this particular problem would have been solved. It hasn't, and that has made me wonder about what we do in digital libraries and how we do it.

Kat's and Josh's numbers are compelling. OAIster focuses on the hidden web--resources not typically stored as files in a crawlable web directory--and so OAIster, with its 16 million records, is a particularly good resource for finding digital library resources. Kat and Josh conclude that more than 55% of the content in OAIster can't be found in Google.

As much as I like Kat's and Josh's analysis, I draw a different conclusion from the data. They write that, "[g]iven the resulting numbers from our investigation, it seems that Google needs to do much more to gather hidden resources." This perspective is one many of us share. We're inclined to point a finger at Google (or other search engines) and wish they tried harder to look into our arcane systems. We believe that if only Google and others had a deeper appreciation of our content or tried harder, this problem would go away. I've been fortunate enough to be able to try to advance this argument one-on-one with the heads of Google and Google Scholar, and their responses are similar--too much trouble for the value of the content. As time has passed, I've come to agree.

Complexity in digital library resources is at the heart of our work, and is frankly one reason why many of us find the work so interesting. Anyone who thinks that the best way to store the 400,000 pages (140 million words) of the texts in the Patrologia Latina is as a bunch of static web pages knows nothing of the uses or users of that resource or what's involved in managing it. Similarly, to effectively manage the tens of thousands of records for a run-of-the-mill image collection, you can't store them as individual HTML pages lacking well-defined fields and relationships. These things are obvious to people in our profession.

We often go wrong, however, when we try to share our love of complexity with the consumers. We've come to understand that success in building our systems involves making complicated uses possible without at the same time requiring the user to have a complicated understanding of the resource. What we must also learn is that a simplified rendering of the content, so that it can be easily found by the search engines, is not an unfortunate compromise, but rather a necessary part of our work. ...

I completely concur with John and the Google folks. This is why we cannot continue to depend on Google to do the work of "organizing the world's information and making it universally accessible." Read John's essay. It's very important.

Comments (1)

Jardinero1 on July 28, 2008 12:00 PM:

When I read a piece like this I wonder the same five thoughts:
1. Prior to Google, seven or eight years ago, was any of this searchable? Could you even define this as a problem?
2. Why the alarm? Is Google engaging in coercion? Is anyone being forced to use Google?
3. Isn't it a good thing that there are a few areas where Google is not spoonfeeding us our heart's desire on demand?
4. From a risk management point of view, centralization is a very, very bad thing. I am glad Google is choosing not to "organize all the world's information" in some areas.
5. From an economic point of view, this represents an opportunity for other market participants to enter.

A book in progress by

Siva Vaidhyanathan

This blog, the result of a collaboration between myself and the Institute for the Future of the Book, is dedicated to exploring the process of writing a critical interpretation of the actions and intentions behind the cultural behemoth that is Google, Inc. The book will answer three key questions: What does the world look like through the lens of Google? How is Google's ubiquity affecting the production and dissemination of knowledge? And how has the corporation altered the rules and practices that govern other companies, institutions, and states?