The Battle Over Digital Archives: Who Owns the Internet's Collective Knowledge?

The internet, born out of DARPA’s ambitious vision for a robust communication network, has evolved into a global community-driven platform. It began as a means for academics to share information, conduct research, and engage in lively discussions. Over time, this ethos of community-driven content creation and information sharing became deeply ingrained in the fabric of the internet. On mega-sites such as GeoCities, AOL, Yahoo, and Tripod, millions of users contributed content, generating an astonishing amount of data. This content emerged from various sources, including discussion groups, mailing lists, newsgroups, and content sites like Suite101 and Webseed. Amidst the chaos, even the occasional visitor could stumble upon valuable nuggets of knowledge and opinion.

With the rise of search engines and directories catering to this ever-expanding internet landscape, it was only a matter of time before they targeted this vast content repository. One such comprehensive player was Deja. It meticulously crawled and indexed the rapidly growing Usenet scene, which comprised tens of thousands of daily messages. By the time Google acquired it, Deja’s archives held over 500 million messages, cross-referenced in every conceivable way and covering an astonishing array of topics.

Google, by then the dominant search engine, had outpaced veteran competitors like Northern Light, Fast, and AltaVista. Its colossal database of over 1.3 billion web pages, its innovative caching technology (which effectively turned it into one of the world’s largest libraries), and its site-ranking algorithm based on popularity and linking made it unbeatable. However, Google’s integration of the Deja treasure trove into its search interface ran into significant difficulties and prompted a protest movement.

Behind the bickering and often abrasive exchanges within Usenet groups (some of them deranged, racist, or bordering on stalking) lay a more profound question: Who owns the content generated by the global public on computers funded by taxpayers? Can a commercial entity lay claim to and monopolize the collective efforts of millions of individuals worldwide? Or should such intellectual property remain in the public domain, perhaps maintained by public institutions like the Library of Congress? Should open-source movements gain access to Deja’s source code to launch Deja II? And who holds the copyright to these messages (in theory, the authors)? Google, like Deja before it, offers compilations of this content, the copyright to which it neither possesses nor can possess. The very concept of intellectual property is at the heart of this virtual conflict.

To address these concerns, Google offered alternative (non-Google) archiving systems free access to the content of the Deja archives. However, it remained silent on sharing the search programming code and the user interface. A fledgling open-source group, Dela News, has formed, though it’s unclear who will bear the enormous storage and processing costs such a project would entail. Dela aims to have a physical copy of the archive entrusted to a dot-org.

This raises a myriad of equally intriguing issues. The Deja Usenet search technology, programming code, and systems are intertwined with, and nearly indistinguishable from, the Usenet archive itself. Without these elements, both structural and dynamic, there can be no archive and no way to extract meaningful information from the chaotic Usenet environment. In this case, the information lies not in the content but in the organization and classification of raw data. This is why open-source proponents demand that Google share both the content and the tools to access it. Google’s abrupt deactivation of Deja in February only exacerbated the frustrations of die-hard Deja users.

Usenet is not solely a haven for unsavory elements; it also hosts thousands of academically rigorous and research-oriented discussion groups. It documents over twenty years of wisdom and erudition from scholars all over the world, offering valuable insights and expertise. Usenet serves as a record of Western intellectual history over the last three decades, which makes it invaluable. However, Google’s decision to sever the internal links between Deja messages threatens the integrity of this hyperlinked resource unless an alternative (and costly) solution emerges.

Google has aimed to provide better, faster, and more comprehensive access to the entire archive. Still, its clash with the more combative side of the open-source movement has surfaced long-suppressed issues. Ironically, this confrontation may be the most significant contribution of an otherwise inopportune transaction.