The Externalities of Search 2.0: The Emerging Privacy Threats when the Drive for the Perfect Search Engine meets Web 2.0
Web search engines have emerged as ubiquitous and vital tools for the successful navigation of the growing online informational sphere. As Google puts it, the goal is to “organize the world’s information and make it universally accessible and useful” and to create the “perfect search engine” that provides only intuitive, personalized, and relevant results. Meanwhile, the so–called Web 2.0 phenomenon has blossomed based, largely, on the faith in the power of the networked masses to capture, process, and mashup one’s personal information flows in order to make them more useful, social, and meaningful. The (inevitable) combining of Google’s suite of information–seeking products with Web 2.0 infrastructures – what I call Search 2.0 – intends to capture the best of both technical systems for the touted benefit of users. By capturing the information flowing across Web 2.0, search engines can better predict users’ needs and wants, and deliver more relevant and meaningful results. While intended to enhance mobility in the online sphere, this paper argues that the drive for Search 2.0 necessarily requires the widespread monitoring and aggregation of users’ online personal and intellectual activities, bringing with it particular externalities, such as threats to informational privacy while online.

Contents

Introduction
The Drive for the Perfect Search Engine
Web 2.0 and Personal Information Flows
Search 2.0: The Perfect Search Engine Meets Web 2.0
Externalities of Search 2.0
Potential Effects of Search 2.0
Conclusion

 


 

Introduction

The rhetoric surrounding Web 2.0 infrastructures presents certain cultural claims about media, identity, and technology. It suggests that everyone can and should use new Internet technologies to organize and share information, to interact within communities, and to express oneself. It promises to empower creativity, to democratize media production, and to celebrate the individual while also relishing the power of collaboration and social networks. Web sites such as Flickr, Wikipedia, del.icio.us, MySpace, and YouTube are all part of this second–generation Internet phenomenon, which has spurred a variety of new services and communities – and venture capitalist dollars. But Web 2.0 also embodies a set of unintended consequences emerging from the resultant blurring of the boundaries between Web users and producers, consumption and participation, authority and amateurism, play and work, data and the network, reality and virtuality.

The focus of this article is the unintended consequence of the increased flow of personal information across Web 2.0 infrastructures, and in particular, the efforts by Web search engines to crawl and aggregate this data in order to build profiles, predict intentions, and deliver personalized products and services. This drive for the perfect search engine through the capture of personal information flowing across the networks – the quest for Search 2.0 – brings with it particular value externalities, such as threats to the privacy of individuals’ online intellectual activities. This article argues that the externalities of Search 2.0 represent a new and powerful infrastructure of data surveillance – otherwise referred to as “dataveillance” (Clarke, 1988) – for the aggregation of one’s online information–seeking activities, inflaming a growing environment of discipline and social control.

This article is divided into five sections [1]. The first section describes the quest for the “perfect search engine,” with the requisite components of the “perfect reach” and the “perfect recall.” The next section introduces various quintessential Web 2.0 applications, and how they are increasingly being incorporated by search engines – either through indexing or integrating the applications themselves – to fuel the perfect search engine, resulting in what I call Search 2.0. The third section reveals two key externalities of Search 2.0, which leads to the potential effects of Search 2.0 outlined in the fourth section. Finally, the article outlines possible spaces for intervention, including the value–conscious design of future Search 2.0 platforms in order to mitigate its externalities.

 


The Drive for the Perfect Search Engine

Since the first search engines started to provide a way of interfacing with the content on the Web, there has been a drive for the “perfect search engine,” one that has indexed all available information and provides fast and relevant results (see Kushmerick, 1998; Andrews, 1999; Gussow, 1999; Mostafa, 2005). A perfect search engine would deliver intuitive results based on users’ past searches and general browsing history (Pitkow, et al., 2002; Teevan, et al., 2005), knowing, for example, whether a search for the keywords “Washington” and “apple” is meant to help a user locate Apple Computer stores in Washington, D.C. or nutritional information about the Washington variety of the fruit. Search engine companies have clear financial incentives for achieving the “perfect search”: receiving personalized search results might contribute to a user’s allegiance to a particular search engine service, increasing exposure to that site’s advertising partners as well as improving chances the user would use fee–based services. Similarly, search engines can charge higher advertising rates when ads are accurately placed before the eyes of users with relevant needs and interests (i.e., someone shopping for computers rather than fruit) (Hansell, 2005).

Web journalist John Battelle summarizes how such a perfect search engine might work:

Imagine the ability to ask any question and get not just an accurate answer, but your perfect answer – an answer that suits the context and intent of your question, an answer that is informed by who you are and why you might be asking. The engine providing this answer is capable of incorporating all the world’s knowledge to the task at hand – be it captured in text, video, or audio. It’s capable of discerning between straightforward requests – who was the third president of the United States? – and more nuanced ones – under what circumstances did the third president of the United States foreswear his views on slavery?

This perfect search also has perfect recall – it knows what you’ve seen, and can discern between a journey of discovery – where you want to find something new – and recovery – where you want to find something you’ve seen before. (Battelle, 2004)

To attain such an omnipresent and omniscient ideal, search engines must have both “perfect reach” in order to provide access to all available information on the Web and “perfect recall” in order to deliver personalized and relevant results that are informed by who the searcher is.

Perfect Reach

To achieve the reach necessary for the realization of Search 2.0, Web search engines amass enormous indexes of the Web’s content. Expanding beyond just HTML–based Web pages, search engine providers have indexed a wide variety of media found on the Web, including images, video files, PDFs and other computer documents. For example, in 2005 Yahoo! claimed to have indexed over 20 billion items, including over 19.2 billion Web documents, 1.6 billion images, and over 50 million audio and video files (Mayer, 2005). The increasing sophistication and reach of Web crawler and indexing technology provide search engine companies with the means to obtain an increasingly perfect reach, indexing an incredible diversity of content types available on the Internet and World Wide Web. In addition to expansive and diverse searchable indexes, today’s search engines also obtain a “perfect reach” by developing various tools and services to help users organize and use information in contexts not considered traditional Web searching. These include communication and social networking platforms, personal data management, financial data management, shopping and product research, computer file management, and enhanced Internet browsing.

Combining these two aspects of the perfect reach – expansive searchable indexes and diverse information organization products – the perfect search engine empowers users to search, find, and relate to nearly all forms of information they need in their everyday lives. The reach of the perfect search engine allows users to search and access nearly all content on the Web, and also enables them to communicate, navigate, shop, and organize their lives, both online and off.

Perfect Recall

Complementing the perfect reach of the perfect search engine is the desire of search engine providers to obtain perfect recall of each individual searcher, allowing the personalization of both services and advertising. To achieve this perfect recall, Web search engines must be able to identify and understand searchers’ intellectual wants, needs and desires when they perform information seeking tasks online. In order to discern the context and intent of a search for “Washington apple,” for example, the perfect search engine would know if the searcher has shown interest in computer products and lives in the Washington D.C. area, or whether she spends time online searching for recipes and various food items.

The primary means for search engines to obtain perfect recall is to monitor and track users’ search habits and history (see, for example, Pitkow, et al., 2002; Speretta, 2004; Teevan, et al., 2005). To gather users’ search histories, most Web search engines maintain detailed server logs recording each Web search request processed through their search engine, the pages viewed, and the results clicked (see, for example, Google, 2005a; IAC Search & Media, 2005; Yahoo!, 2006). Google, for example, records the originating IP address, cookie ID, date and time, search terms, and results clicked for each of the 100 million search requests processed daily (Google, 2005b).

Logging this array of data enhances a search engine’s ability to reconstruct a particular user’s search activities in support of obtaining perfect recall. For example, by cross–referencing the IP address of each request sent to the server along with the particular page being requested and other server log data, it is possible to find out which pages, and in which sequence, a particular IP address has visited. When asked, “Given a list of search terms, can Google produce a list of people who searched for that term, identified by IP address and/or Google cookie value?” and “Given an IP address or Google cookie value, can Google produce a list of the terms searched by the user of that IP address or cookie value?”, Google responded in the affirmative to both questions, confirming the general ability of search providers to track a particular user’s (or, at least, a particular browser or IP address) activity through such logs (Battelle, 2006a; 2006b).
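The mechanics of both lookups are trivial once such logs exist. The following sketch, using an invented, simplified log format (the field names and records are illustrative, not Google’s actual schema), shows how grouping entries by IP address and cookie ID and sorting by timestamp reconstructs one browser’s search sequence, and how a term can be mapped back to the users who searched for it:

```python
from collections import defaultdict

# Hypothetical, simplified server-log records: who issued the request
# (IP address and cookie ID), when, and what was searched.
log = [
    {"ip": "203.0.113.7", "cookie": "abc123",
     "ts": "2006-01-10T09:01:00", "query": "washington apple"},
    {"ip": "198.51.100.2", "cookie": "xyz789",
     "ts": "2006-01-10T09:02:10", "query": "neck pain"},
    {"ip": "203.0.113.7", "cookie": "abc123",
     "ts": "2006-01-10T09:03:45", "query": "apple store washington dc"},
]

def sessions_by_user(entries):
    """Group entries by (ip, cookie) and order each group by timestamp,
    reconstructing the sequence of queries a single browser issued."""
    grouped = defaultdict(list)
    for e in entries:
        grouped[(e["ip"], e["cookie"])].append(e)
    # ISO-8601 timestamps sort correctly as strings.
    return {k: sorted(v, key=lambda e: e["ts"]) for k, v in grouped.items()}

def users_who_searched(entries, term):
    """The reverse lookup: which (ip, cookie) pairs searched for a term?"""
    return {(e["ip"], e["cookie"]) for e in entries if term in e["query"]}

history = sessions_by_user(log)
print([e["query"] for e in history[("203.0.113.7", "abc123")]])
print(users_who_searched(log, "apple"))
```

As the sketch suggests, neither direction of Battelle’s question requires any sophisticated analysis; the privacy stakes lie in the retention of the raw log itself.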

The practice of collecting and retaining search query data in support of attaining “perfect recall” has not escaped controversy. In January 2006, it was revealed that, as part of the government’s effort to uphold an online pornography law, the U.S. Department of Justice had asked a federal judge to compel the Web search engine Google to turn over records on millions of its users’ search queries (Hafner and Richtel, 2006; Mintz, 2006). Google resisted, but three of its competitors, America Online (AOL), Microsoft, and Yahoo!, complied with similar government subpoenas of their search records (Hafner and Richtel, 2006). Later that year, AOL released over 20 million search queries from 658,000 of its users to the public in an attempt to support academic research on search engine query analysis (Hansell, 2006). Despite AOL’s attempts to anonymize the data, individual users remained identifiable based solely on their search histories, which included search terms matching users’ names, social security numbers, addresses, phone numbers, and other personally identifiable information (McCullagh, 2006a).
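The AOL release illustrates why replacing a user identifier with a pseudonym does not anonymize query logs: the query text itself carries identifying content. The sketch below is illustrative only – the queries loosely paraphrase press accounts of the AOL data, and the user IDs and matching logic are invented for demonstration:

```python
# Pseudonymous query records: the real identifier has been replaced with an
# opaque user ID, but the query strings are intact (data invented/paraphrased
# from press accounts of the 2006 AOL release).
records = [
    ("user-4417749", "landscapers in lilburn ga"),
    ("user-4417749", "homes sold in shadow lake subdivision gwinnett county georgia"),
    ("user-0000001", "weather boston"),
]

def identifying_queries(rows, clues):
    """Flag pseudonymous users whose query text matches known personal
    clues (a town, a subdivision, a name), re-linking the 'anonymous'
    ID to a real person."""
    flagged = {}
    for user, query in rows:
        for clue in clues:
            if clue in query:
                flagged.setdefault(user, []).append(query)
    return flagged

print(identifying_queries(records, ["lilburn", "shadow lake"]))
```

A simple substring match against publicly known facts is enough to single out one pseudonym; once any query re-links the ID to a person, every other query under that ID is exposed as well.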

These cases brought search query retention practices into a more public light, creating anxiety among many searchers about the presence of such systematic monitoring of their online information–seeking activities (Barbaro and Zeller, 2006; Hansell, 2006; McCullagh, 2006a), and leading news organizations to investigate and report on the information search engines routinely collect from their users (Glasner, 2005; Ackerman, 2006). In turn, various advocacy groups have criticized the extent to which Web search engines are able to track and collect search queries, often with little knowledge by the users themselves (see, for example, Electronic Frontier Foundation, 2007; Privacy International, 2007), while both European and U.S. government regulators have started to investigate search engine query retention practices and policies (Associated Press, 2007; Lohr, 2007).

Yet, while public attention has recently focused on the industry practice of archiving users’ Web search queries in server logs, less attention has been paid to how search engine providers are able to monitor and aggregate activity across their growing array of products and services. Most notably, search companies like Google and Yahoo! have taken great steps to add the latest trend of Web services to their information infrastructures: Web 2.0.

 


Web 2.0 and Personal Information Flows

In 2004, Tim O’Reilly and Dale Dougherty of O’Reilly Media (a company known for its information technology–related books and conferences) sought to describe the common features of various Web companies that survived the “dot–com burst” of the late 1990s (O’Reilly, 2005). The companies – and their services and technologies – that survived, they argued, all had certain characteristics in common: they were collaborative, interactive, dynamic, user–centered, network–based, and data–rich. To describe this emerging trend in Web technologies and services, they coined the term “Web 2.0,” a concept that has been hailed as the “new wisdom of the Web” (Levy and Stone, 2006) and “a new cultural force based on mass collaboration” (Kelly, 2005).

While Web 2.0 has not been universally embraced – some deride it as merely a hyped–up buzzword (Boutin, 2006), “millenialist rhetoric” (Carr, 2006), and even an extension of Marxist ideology that is “inherently dangerous for the vitality of culture and the arts” (Keen, 2006) – the concept does seem to encapsulate the growing trend of user–generated and user–driven Web technologies. Popular Web sites such as Flickr, Wikipedia, del.icio.us, Facebook, and YouTube are all part of this second–generation Internet phenomenon, featuring user–generated content, opportunities for collaboration and harnessing collective intelligence, and relatively open platforms for anyone to participate, modify (mash–up) or share content (via RSS feeds, APIs, and the like).

Much of Web 2.0 is based upon – indeed built upon – increased personal information flows online. Inherent in Web 2.0 evangelism is an overall faith in the logic of the networked masses to be vehicle to provide meaning to your otherwise solitary existence – to give up your information to the Web, and allow various services, APIs, and communities capture, process, and mashup your information flows to make them more useful, more social, and more meaningful. For example, users of Web 2.0 are encouraged to put as much of their lives as possible online, to divulge and share their personal lives through blogs or on Live Journal, their professional development on LinkedIn, share bookmarks of favorite Web sites on del.icio.us, upload the music they listen to on last.fm, detail their friendships on Facebook and MySpace, share their appointments and social events on UpComing, where they are traveling on Dopplr, where they’ve connected to wi–fi on Plazer, just to name a few.

The prevalence of open flows of personal information on and across Web 2.0 platforms has prompted both general concerns over user privacy (see, for example, Barnes, 2006; George, 2006; Harris, 2006; Solove, 2007), as well as explorations into whether expectations of privacy online are shifting towards acceptance of – or at least ambivalence toward – the sharing of personal information in these contexts, especially among younger users (see, for example, Lenhart and Madden, 2007; Nussbaum, 2007). Often missing from these vital investigations and debates, however, is recognition of the growing integration of Web 2.0 platforms – and the personal information flows they contain – with the power of Web search engines: the emergence of Search 2.0.

 


Search 2.0: The Perfect Search Engine Meets Web 2.0

In their pursuit of the perfect search engine, search providers have increasingly capitalized on the growing Web 2.0 infrastructure to complement both the reach of the search engine’s indexes, as well as the user information fueling their perfect recall. Enhancing their perfect reach, many search engines incorporate the information flows from Web 2.0 applications directly into their searchable indexes. For example, a Google search for an individual’s name routinely returns Facebook and LinkedIn profile pages, and even the minute and often personal details shared with friends through the Web 2.0 service Twitter. Taking Search 2.0 one step further, Yahoo!, through the purchase of Web 2.0 properties like Flickr and del.icio.us, has integrated user–generated photos and folksonomies of bookmarks directly into their search engine results (Yahoo!, 2007; Sullivan, 2008).

Yahoo!’s purchase and integration of these two popular Web 2.0 services also contributes to their ability to attain the perfect recall necessary for the perfect search engine. Recalling that search providers typically track user activity in order to personalize results and target advertising, adding various Web 2.0 technologies into their suite of products allows search providers to amass even more detailed records of user actions and interests. By requiring users to create Yahoo! accounts to use Web 2.0 services such as Flickr or UpComing, Yahoo! can add user data about their photos and social events, respectively, to their vast search history logs. Similarly, by linking Web 2.0 products, such as Orkut, Dodgeball, Picasa and YouTube, to traditional Google Accounts (see Weinberg, 2005), Google can amass much more detailed and personal information about users of these services, including their personal interests (Orkut), the places they visit (Dodgeball), the photos they share (Picasa), and the videos they enjoy (YouTube). In short, Search 2.0 empowers search providers to capture the personal information flows inherent in Web 2.0 applications and link them to users’ other search activities, resulting in the ability to amass detailed and comprehensive records of users’ online activities.
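Structurally, this cross–service aggregation is a simple join on a shared account identifier. The sketch below uses invented data and field names (no actual Yahoo! or Google schema is implied) to show how records held separately by several services collapse into one dossier once they key on the same account:

```python
# Invented per-service records, each keyed by the same account ID --
# the shared identifier is what makes aggregation possible.
search_log = {"acct42": ["washington apple", "privacy law"]}
photo_tags = {"acct42": ["vacation", "family", "dc"]}
bookmarks = {"acct42": ["eff.org", "firstmonday.org"]}

def aggregate_profile(acct, *sources):
    """Merge per-service records for one account into a unified profile:
    the cross-service linkage Search 2.0's recall depends on."""
    profile = {}
    for name, data in sources:
        profile[name] = data.get(acct, [])
    return profile

profile = aggregate_profile(
    "acct42",
    ("searches", search_log),
    ("photos", photo_tags),
    ("bookmarks", bookmarks),
)
print(profile)
```

The point of the sketch is that no single service needs to collect more data than before; the shift comes entirely from one entity holding the key that links the services together.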

 


Externalities of Search 2.0

In their effort to achieve the perfect search engine, search providers such as Google and Yahoo! have captured many of the personal information flows inherent within the new Web 2.0 infrastructures within their searchable indexes, as well as integrating Web 2.0 platforms directly into their suite of products. The result is Search 2.0, a powerful Web search information infrastructure that promises to provide more extensive and relevant search results and information management services to users. But not without a price. Inherent in the Search 2.0 infrastructure are two key externalities: one, the deterioration of what I call “privacy via obscurity” of one’s personal information online; and two, the concentrated surveillance, capture, and aggregation of one’s online intellectual and social activities by a single provider.

Lack of “Privacy via Obscurity”

The notion of “Googling” someone has become common practice. People use search engines to learn about prospective blind dates (Lobron, 2006). Almost one in four Web users have searched online for information about co–workers or business contacts (Sharma, 2004), and employers are Googling prospective employees before making hiring decisions (Weiss, 2006). Through the powerful reach of search engines, obscure pieces of personal information – such as court records in the archives of a county government building, e–mail messages sent a decade ago to a now–defunct discussion forum, or a newsletter from an obscure social club – are increasingly retrievable by a simple keyword search. As a result, any “privacy via obscurity” that generally kept such information from public view has been diminished.

The personal information flows normally relegated to particular Web 2.0 platforms have similarly become broadly accessible via search engines’ desire to expand their reach by including these flows in their searchable indexes. Bits of personal information previously thought to exist merely on relatively obscure Web 2.0 platforms such as Twitter or Plazes, or even the early Facebook [2], are now increasingly available to anyone searching through Google or Yahoo!. As a result, the playful or investigative searching done by potential dates or employers can now reveal much more personal insights. The consequences can be significant: job applicants have lost offers due to postings on social networking sites (Lewis, 2006), others have lost existing jobs (Czekaj, 2007), and social networking sites have been used for dozens of criminal and other police investigations. By integrating the information flows from disparate – and often obscure – Web 2.0 services into the indexes of popular search engines, any notion of “privacy via obscurity” is diminished, and the availability of these personal information flows for disciplinary or discriminatory activity increases.

Concentrated Surveillance of Online Activities

While the potential harms that emerge once Web 2.0–related personal data streams are indexed and searchable within the major Web search engines are significant, they are matched – if not exceeded – by the externalities of the integration of Web 2.0 applications within search companies’ suites of products. By offering their users Web 2.0 services, search providers are increasingly able to track users’ social and intellectual activities across these innovative services, adding the personal information flows within Web 2.0 to the stores of information they can leverage for personalized services and advertising. This represents a significant shift in the norms of personal information flow online. Previously, a person’s social and intellectual activities were distributed across multiple Web 2.0 applications scattered across the Web. But with the drive towards Search 2.0, single entities, such as Google or Yahoo!, have the means of monitoring, collecting and aggregating an increasing amount of one’s online social and intellectual activities. Search 2.0’s ability to collect and aggregate a wide array of personal and intellectual information about its users now extends beyond just what websites a user searches for (the original goal of “perfect recall”) to potentially include detailed demographic and profile information on linked social networking sites, the friends in one’s social networks, the photos shared (and the tags used to describe them), the various websites bookmarked (and, again, the descriptive tags), the RSS feeds subscribed to, and so on.

 


Potential Effects of Search 2.0

In their quest for Search 2.0, Web search engines have gained the ability to track, capture, and aggregate a wealth of personal information stemming from the increased flow of personal information made available by growing use and reliance on Web 2.0–based applications. The full effects and consequences of the emerging Search 2.0 infrastructure are difficult to predict, but potentially include the exercise of disciplinary power against users, the panoptic sorting of users, and the general invisibility and inescapability of Search 2.0’s impact on users’ online activities.

Disciplinary Power

Clive Norris warns of how infrastructures of dataveillance could be used to “[render] visualization meaningful for the basis of disciplinary social control” [3]. Instances of how users of Search 2.0 were made visible for the exercise of disciplinary power include a court ordering Google to provide the complete contents of a user’s Gmail account, including e–mail messages he thought were deleted (McCullagh, 2006b); the introduction of evidence that a suspected murderer performed a Google search for the words “neck snap break” (Cohen, 2005); the Brazilian government asking Google to release data on users of its Orkut social networking site to help authorities investigate potential use of the site for illegal activities (Downie, 2006); and Yahoo! providing e–mail and other account data to Chinese officials, resulting in the jailing of dissidents within that country (Olesen, 2005; Schonfeld, 2006). The possibility of search providers providing detailed Search 2.0 data to government bodies for disciplinary action has reached new heights within the United States with the passage of the USA PATRIOT Act, which greatly expands the ability of law enforcement to access such records while prohibiting the source of the records from disclosing that any such request has even been made [4]. Given the recent discovery of the National Security Agency having direct access to citizens’ telecommunication activities (Singel, 2006), fears that the personal information flows inherent in Search 2.0 could similarly fall into government hands become all too real.

Panoptic Sorting

Search 2.0’s infrastructure of dataveillance also spawns instances of “panoptic sorting” where users of search engines are identified, assessed and classified “to coordinate and control their access to the goods and services that define life in the modern capitalist economy” [5]. Google, like most for–profit search engine providers, is financially motivated to collect as much information as possible about each user: receiving personalized search results might contribute to a user’s allegiance to a particular search engine service, increasing exposure to that site’s advertising partners as well as improving the chances the user would use fee–based services. Similarly, search engines can charge higher advertising rates when ads are accurately placed before the eyes of users with relevant needs and interests (Hansell, 2005). Through the panoptic gaze of their diverse suite of products – fueled by the growing Web 2.0 portion of their offerings – search providers capture as much information as possible about an individual’s behavior, considering it potentially useful in the profiling and categorization of a user’s economic value: recognizing that targeted advertising will be the “growth engine of Google for a very long time,” Google CEO Eric Schmidt stressed the importance of collecting user information, acknowledging that “Google knows a lot about the person surfing, especially if they have used personal search or logged into a service such as Gmail” (Miller, 2006). Beyond Gmail, the personal information flows gleaned from search providers’ Web 2.0 offerings fuel a more detailed panoptic sorting of their users.

Invisibility and Allure of Search 2.0
