A Critical Look at Decentralized Personal Data Architectures

I have a new paper with the above title, currently under peer review, with Vincent Toubiana, Solon Barocas, Helen Nissenbaum and Dan Boneh (the Adnostic gang). We argue that distributed social networking, personal data stores, vendor relationship management, etc. — movements that we see as closely related in spirit, and which we collectively term “decentralized personal data architectures” — aren’t quite the panacea that they’ve been made out to be.

The paper is only a synopsis of our work so far — in our notes we have over 80 projects, papers and proposals that we’ve studied, so we intend to follow up with a more complete analysis. For now, our goal is to kick off a discussion and give the community something to think about. The paper was a lot of fun to write, and we hope you will enjoy reading it. We recognize that many of our views and conclusions may be controversial, and we welcome comments.

Abstract:

While the Internet was conceived as a decentralized network, the most widely used web applications today tend toward centralization. Control increasingly rests with centralized service providers who, as a consequence, have also amassed unprecedented amounts of data about the behaviors and personalities of individuals.

Developers, regulators, and consumer advocates have looked to alternative decentralized architectures as the natural response to threats posed by these centralized services.  The result has been a great variety of solutions that include personal data stores (PDS), infomediaries, Vendor Relationship Management (VRM) systems, and federated and distributed social networks.  And yet, for all these efforts, decentralized personal data architectures have seen little adoption.

This position paper attempts to account for these failures, challenging the accepted wisdom in the web community on the feasibility and desirability of these approaches. We start with a historical discussion of the development of various categories of decentralized personal data architectures. Then we survey the main ideas to illustrate the common themes among these efforts. We tease apart the design characteristics of these systems from the social values that they (are intended to) promote. We use this understanding to point out numerous drawbacks of the decentralization paradigm, some inherent and others incidental. We end with recommendations for designers of these systems for working towards goals that are achievable, but perhaps more limited in scope and ambition.


To stay on top of future posts, subscribe to the RSS feed or follow me on Google+.

February 21, 2012 at 8:27 am

Is Writing Style Sufficient to Deanonymize Material Posted Online?

I have a new paper appearing at IEEE S&P with Hristo Paskov, Neil Gong, John Bethencourt, Emil Stefanov, Richard Shin and Dawn Song on Internet-scale authorship identification based on stylometry, i.e., analysis of writing style. Stylometric identification exploits the fact that we all have a ‘fingerprint’ based on our stylistic choices and idiosyncrasies with the written word. To quote from my previous post speculating on the possibility of Internet-scale authorship identification:

Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.
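To make the idea of a stylistic 'marker' concrete, here is a minimal Python sketch of measuring a single such marker; the function name and the word pair are just for illustration, and the real system combines on the order of a thousand features rather than one ratio.

    import re
    from collections import Counter

    def marker_ratio(text, pair=("since", "because")):
        # Relative frequency of one near-synonym versus the other.
        # Returns a value in [0, 1], or None if neither word appears.
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(words)
        a, b = counts[pair[0]], counts[pair[1]]
        return a / (a + b) if (a + b) > 0 else None

    # Each marker contributes a little information (typically under a bit);
    # a profile is built by combining many of them into a feature vector.
    print(marker_ratio("I left early since it was raining, because I was tired."))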

The basic idea that people have distinctive writing styles is very well-known and well-understood, and there is an extremely long line of research on this topic. This research began in modern form in the early 1960s when statisticians Mosteller and Wallace determined the authorship of the disputed Federalist papers, and were featured in TIME magazine. It is never easy to make a significant contribution in a heavily studied area. No surprise, then, that my initial blog post was written about three years ago, and the Stanford-Berkeley collaboration began in earnest over two years ago.

Impact. So what exactly did we achieve? Our research has dramatically increased the number of authors that can be distinguished using writing-style analysis: from about 300 to 100,000. More importantly, the accuracy of our algorithms drops off gently as the number of authors increases, so we can be confident that they will continue to perform well as we scale the problem even further. Our work is therefore the first to show that stylometry has serious implications for online anonymity.[1]

Anonymity and free speech have been intertwined throughout history. For example, anonymous discourse was essential to the debates that gave birth to the United States Constitution. Yet a right to anonymity is meaningless if an anonymous author’s identity can be unmasked by adversaries. While there have been many attempts to legally force service providers and other intermediaries to reveal the identity of anonymous users, courts have generally upheld the right to anonymity. But what if authors can be identified based on nothing but a comparison of the content they publish to other web content they have previously authored?

Experiments. Our experimental methodology is set up to directly address this question. Our primary data source was the ICWSM 2009 Spinn3r Blog Dataset, a large collection of blog posts made available to researchers by Spinn3r.com, a provider of blog-related commercial data feeds. To test the identifiability of an author, we remove k randomly chosen posts (typically 3) from the corresponding blog, treat those posts as anonymous, and apply our algorithm to try to determine which blog they came from. In these experiments, the labeled (identified) and unlabeled (anonymous) texts are drawn from the same context. We call this post-to-blog matching.

In some applications of stylometric authorship recognition, the context for the identified and anonymous text might be the same. This was the case in the famous study of the Federalist papers — each author hid his name from some of his papers, but wrote about the same topic. In the blogging scenario, an author might decide to selectively distribute a few particularly sensitive posts anonymously through a different channel. But in other cases, the unlabeled text might be political speech, whereas the only available labeled text by the same author might be a cooking blog; i.e., the labeled and unlabeled text might come from different contexts. Context encompasses much more than topic: the tone might be formal or informal; the author might be in a different mental state (e.g., more emotional) in one context versus the other; and so on.

We feel that it is crucial for authorship recognition techniques to be validated in a cross-context setting. Previous work has fallen short in this regard because of the difficulty of finding a suitable dataset. We were able to obtain about 2,000 pairs (and a few triples, etc.) of blogs, each pair written by the same author, by looking at a dataset of 3.5 million Google profiles and searching for users who listed more than one blog in the ‘websites’ field.[2] We are thankful to Daniele Perito for sharing this dataset. We added these blogs to the Spinn3r blog dataset to bring the total to 100,000. Using this data, we performed experiments as follows: remove one of a pair of blogs written by the same author, and use it as unlabeled text. The goal is to find the other blog written by the same author. We call this blog-to-blog matching. Note that although the number of blog pairs is only a few thousand, we match each anonymous blog against all 99,999 other blogs.

Results. Our baseline result is that in the post-to-blog experiments, the author was correctly identified 20% of the time. This means that when our algorithm uses three anonymously published blog posts to rank the possible authors in descending order of probability, the top guess is correct 20% of the time.

But it gets better from there. In 35% of cases, the correct author is one of the top 20 guesses. Why does this matter? Because in practice, algorithmic analysis probably won’t be the only step in authorship recognition, and will instead be used to produce a shortlist for further investigation. A manual examination may incorporate several characteristics that the automated analysis does not, such as choice of topic (our algorithms are scrupulously “topic-free”). Location is another signal that can be used: for example, if we were trying to identify the author of the once-anonymous blog Washingtonienne, we’d know that she almost certainly resides in or around Washington, D.C. Alternatively, a powerful adversary such as law enforcement may require Blogger, WordPress, or another popular blog host to reveal the login times of the top suspects, which could be correlated with the timing of posts on the anonymous blog to confirm a match.

We can also improve the accuracy significantly over the baseline of 20% for authors for whom we have more than an average number of labeled or unlabeled blog posts. For example, with 40–50 labeled posts to work with (the average is 20 posts per author), the accuracy goes up to 30–35%.

An important capability is confidence estimation, i.e., modifying the algorithm to also output a score reflecting its degree of confidence in the prediction. We measure the efficacy of confidence estimation via the standard machine-learning metrics of precision and recall. We find that we can improve precision from 20% to over 80% with only a halving of recall. In plain English, what these numbers mean is: the algorithm does not always attempt to identify an author, but when it does, it finds the right author 80% of the time. Overall, it identifies 10% (half of 20%) of authors correctly, i.e., 10,000 out of the 100,000 authors in our dataset. Strong as these numbers are, it is important to keep in mind that in a real-life deanonymization attack on a specific target, it is likely that confidence can be greatly improved through methods discussed above — topic, manual inspection, etc.
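In code terms, confidence-based abstention and its effect on precision and recall can be sketched as follows; this is a hypothetical illustration, not our evaluation code, and the names are mine.

    def precision_recall(predictions, truth, confidences, threshold):
        # The classifier "abstains" whenever its confidence is below the threshold.
        attempted = [(p, t) for p, t, c in zip(predictions, truth, confidences)
                     if c >= threshold]
        correct = sum(p == t for p, t in attempted)
        precision = correct / len(attempted) if attempted else 0.0
        recall = correct / len(truth) if truth else 0.0
        return precision, recall

    # Raising the threshold means fewer attempts (lower recall) but a larger
    # fraction of correct answers among the attempts (higher precision), as in
    # the 20% -> 80% figure above.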

We confirmed that our techniques work in a cross-context setting (i.e., blog-to-blog experiments), although the accuracy is lower (~12%). Confidence estimation works really well in this setting as well and boosts accuracy to over 50% with a halving of recall. Finally, we also manually verified that in cross-context matching we find pairs of blogs that are hard for humans to match based on topic or writing style; we describe three such pairs in an appendix to the paper. For detailed graphs as well as a variety of other experimental results, see the paper.

We see our results as establishing early lower bounds on the efficacy of large-scale stylometric authorship recognition. Having cracked the scale barrier, we expect accuracy improvements to come more easily in the future. In particular, we report experiments in the paper showing that a combination of two very different classifiers works better than either alone, but there is a lot more mileage to squeeze from this approach, given that ensembles of classifiers are known to work well for most machine-learning problems. Also, there is much work to be done in terms of analyzing which aspects of writing style are preserved across contexts, and using this understanding to improve accuracy in that setting.

Techniques. Now let’s look in more detail at the techniques I’ve hinted at above. The author identification task proceeds in two steps: feature extraction and classification. In the feature extraction stage, we reduce each blog post to a sequence of about 1,200 numerical features (a “feature vector”) that acts as a fingerprint. These features fall into various lexical and grammatical categories. Two example features: the frequency of uppercase words, the number of words that occur exactly once in the text. While we mostly used the same set of features that the authors of the Writeprints paper did, we also came up with a new set of features that involved analyzing the grammatical parse trees of sentences.
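To give a flavor of such features, here is a small illustrative sketch; it covers only a handful of features, with names and details of my choosing, and is not the ~1,200-feature extractor we actually used.

    import re
    from collections import Counter

    # A tiny, illustrative list; the real feature set restricts word-based
    # features to a longer list of topic-neutral function words.
    FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "for", "with"]

    def extract_features(post):
        words = re.findall(r"\S+", post)
        lower = [w.lower().strip(".,;:!?\"'()") for w in words]
        counts = Counter(lower)
        n = max(len(words), 1)
        features = [
            sum(w.isupper() for w in words) / n,        # frequency of uppercase words
            sum(1 for c in counts.values() if c == 1),  # words occurring exactly once
            sum(len(w) for w in lower) / n,             # average word length
        ]
        features += [counts[fw] / n for fw in FUNCTION_WORDS]
        return features

    vec = extract_features("The quick brown fox jumps over the lazy dog. THE END.")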

An important component of feature extraction is ensuring that the analysis is purely stylistic. We do this in two ways: first, we preprocess the blog posts to filter out signatures, markup, or anything else that might not have been directly entered by a human. Second, we restrict our features to those that bear little resemblance to the topic of discussion. In particular, our word-based features are limited to stylistic “function words” that we list in an appendix to the paper.

In the classification stage, we algorithmically “learn” a characterization of each author (from the set of feature vectors corresponding to the posts written by that author). Given a set of feature vectors from an unknown author, we use the learned characterizations to decide which author it most likely corresponds to. For example, viewing each feature vector as a point in a high-dimensional space, the learning algorithm might try to find a “hyperplane” that separates the points corresponding to one author from those of every other author, and the decision algorithm might determine, given a set of hyperplanes corresponding to each known author, which hyperplane best separates the unknown author from the rest.
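For illustration only (this is not our actual implementation, and the paper reports stronger classifiers and an ensemble), the one-hyperplane-per-author setup can be sketched with a generic one-vs-rest linear SVM in scikit-learn, ranking candidate authors by their average decision score over the anonymous posts:

    import numpy as np
    from sklearn.svm import LinearSVC

    def rank_authors(X, y, X_anon):
        # X: feature vectors of labeled posts; y: author ids;
        # X_anon: feature vectors of the anonymous posts by one unknown author.
        clf = LinearSVC()                       # one-vs-rest: one hyperplane per author
        clf.fit(X, y)
        scores = clf.decision_function(X_anon).mean(axis=0)
        order = np.argsort(scores)[::-1]        # most to least likely author
        return clf.classes_[order], scores[order]

    # Toy usage with random data (3 authors, 5-dimensional features):
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5)); y = np.repeat([0, 1, 2], 20)
    ranked, scores = rank_authors(X, y, rng.normal(size=(3, 5)))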

We made several innovations that allowed us to achieve the accuracy levels that we did. First, contrary to some previous authors who hypothesized that only relatively straightforward “lazy” classifiers work for this type of problem, we were able to avoid various pitfalls and use more high-powered machinery. Second, we developed new techniques for confidence estimation, including a measure very similar to “eccentricity” used in the Netflix paper. Third, we developed techniques to improve the performance (speed) of our classifiers, detailed in the paper. This is a research contribution by itself, but it also enabled us to rapidly iterate the development of our algorithms and optimize them.
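As a rough sketch of the eccentricity idea (how far the best-matching author stands out from the second best, in units of the standard deviation of all scores), with the caveat that our paper’s exact measure differs in its details:

    import numpy as np

    def eccentricity(scores):
        # scores: one matching score per candidate author (higher = better match).
        s = np.sort(np.asarray(scores, dtype=float))[::-1]
        spread = s.std()
        return (s[0] - s[1]) / spread if spread > 0 else 0.0

    # The algorithm can abstain when eccentricity falls below a threshold,
    # trading recall for precision as described above.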

In an earlier article, I noted that we don’t yet have as rigorous an understanding of deanonymization algorithms as we would like. I see this paper as a significant step in that direction. In my series on fingerprinting, I pointed out that in numerous domains, researchers have considered classification/deanonymization problems with tens of classes, with implications for forensics and security-enhancing applications, but that to explore the privacy-infringing/surveillance applications the methods need to be tweaked to be able to deal with a much larger number of classes. Our work shows how to do that, and we believe that insights from our paper will be generally applicable to numerous problems in the privacy space.

Concluding thoughts. We’ve thrown open the doors for the study of writing-style based deanonymization that can be carried out on an Internet-wide scale, and our research demonstrates that the threat is already real. We believe that our techniques are valuable by themselves as well.

The good news for authors who would like to protect themselves against deanonymization is that manually changing one’s style appears to be enough to throw off these attacks. Developing fully automated methods to hide traces of one’s writing style remains a challenge. For now, however, few people are aware of the existence of these attacks and defenses, and all the sensitive text that has already been written anonymously remains at risk of deanonymization.

[1] A team from Israel has studied authorship recognition with 10,000 authors. While this is interesting and impressive work that bears some similarities to ours, they do not restrict themselves to stylistic analysis, and their method is therefore comparatively limited in scope. Incidentally, they have been in the news recently for some related work.

[2] Although the fraction of users who listed even a single blog in their Google profile was small, more than 2,000 users listed more than one. We did not use the full number that was available.

To stay on top of future posts, subscribe to the RSS feed or follow me on Google+.

February 20, 2012 at 9:40 am

An Update on Career Plans and Some Observations on the Nature of Research

I’ve had a wonderful time at Stanford these last couple of years, but it’s time to move on. I’m currently in the middle of my job search, looking for faculty and other research positions. In the next month or two I will be interviewing at several places. It’s been an interesting journey.

My Ph.D. years in Austin were productive and blissful. When I finished and came West, I knew I enjoyed research tremendously, but there were many aspects of research culture that made me worry if I’d fit in. I hoped my postdoc would give me some clarity.

Happily, that’s exactly what happened, especially after I started being an active participant in program committees and other community activities. It’s been an enlightening and humbling experience. I’ve come to realize that in many cases, there are perfectly good reasons why frequently-criticized aspects of the culture are just the way they are. Certainly there are still facets that are far from ideal, but my overall view of the culture of scientific research and the value of research to society is dramatically more positive than it was when I graduated.

Let me illustrate. One of my major complaints when I was in grad school was that almost nobody does interdisciplinary research (which is true — the percentage of research papers that span different disciplines is tiny). Then I actually tried doing it, and came to the obvious-in-retrospect realization that collaborating with people who don’t speak your language is hard.

Make no mistake, I’m as committed to cross-disciplinary research as I ever was (I just finished writing a grant proposal with Profs. Helen Nissenbaum and Deirdre Mulligan). I’ve gradually been getting better at it and I expect to do a lot of it in my career. But if a researcher decides to stick to their sub-discipline, I can’t really fault them for that.

As another example, consider the lack of a “publish-then-filter” model for research papers, a whole two decades after the Web made it technologically straightforward. Many people find this incomprehensibly backward and inefficient. Academia.edu founder Richard Price wrote an article two days ago arguing that the future of peer review will look like a mix of PageRank and Twitter. Three years ago, that could have been me talking. Today my view is very different.

Science is not a popularity contest; PageRank is irrelevant as a peer-review mechanism. Basically, scientific peer review is the only process that exists for systematically separating truths from untruths. Like democracy, it has its problems, but at least it works. Social media is probably the worst analogy — it seems to be better at amplifying falsehoods than facts. Wikipedia-style crowdsourcing has its strengths, but it can be hit-or-miss.

To be clear, I think peer review is probably going to change; I would like it to be done in public, for one. But even this simple change is fraught with difficulty — how would you ensure that reviewers aren’t influenced by each other’s reviews? This is an important factor in the current system. During my program committee meetings, I came to realize just how many of these little procedures for minimizing bias are built into the system and how seriously people take the spirit of this process. Revamping peer review while keeping what works is going to be slow and challenging.

Moving on, some of my other concerns have been disappearing due to recent events. Restrictive publisher copyrights are a perfect example. I have more of a problem with this than most researchers do — I did my Master’s in India, which means I’ve been on the other side of the paywall. But it looks like that pot may finally have boiled over. I think it’s only a matter of time now before open access becomes the norm in all disciplines.

There are certainly areas where the status quo is not great and not getting any better. Today, if a researcher makes a discovery that’s not significant enough to write a paper about, they choose not to share that discovery at all. Unfortunately, this is the rational behavior for a self-interested researcher, because there is no way to get credit for anything other than published papers. Michael Nielsen’s excellent book exploring the future of networked science gives me some hope that change may be on the horizon.

I hope this post has given you a more nuanced appreciation of the nature of scientific research. Misconceptions about research and especially about academia seem to be widespread among the people I talk to both online and offline; I harbored a few myself during my Ph.D., as I said earlier. So I’m thinking of doing posts like this one on a semi-regular basis on this blog or on Google+. But that will probably have to wait until after my job search is done.

To stay on top of future posts, subscribe to the RSS feed or follow me on Google+.

February 7, 2012 at 11:05 am

Printer Dots, Pervasive Tracking and the Transparent Society

So far in the fingerprinting series, we’ve seen how a variety of objects and physical devices [1, 2, 3, 4], often even supposedly identical ones, can be uniquely fingerprinted. This article is non-technical; it is an opinion on some philosophical questions about tracking and surveillance.

Here’s a fascinating example of tracking that’s all around you but that you’re probably unaware of:

Color laser printers and photocopiers print small yellow dots on every page for tracking purposes.

My source for this is the EFF’s Seth Schoen, who has made his presentation on the subject available.


The dots are not normally visible, but can be seen by a variety of methods, such as shining a blue LED flashlight on the page, magnifying it under a microscope, or scanning the document with a commodity scanner. The pattern of dots typically encodes the device serial number and a timestamp; some parts of the code are as yet unidentified. There are interesting differences between the codes used by different manufacturers. [1] Some examples are shown in the pictures. There’s a lot more information in the presentation.


Pattern of dots from three different printers: Epson, HP LaserJet and Canon.

Schoen says the dots could have been the result of the Secret Service pressuring printer manufacturers to cooperate, going back as far as the 1980s. The EFF’s Freedom of Information Act request on the matter from 2005 has been “mired in bureaucracy.”

The EFF as well as the Seeing Yellow project would like to see these dots gone. The EFF has consistently argued against pervasive tracking. In this article on biometric surveillance, they say:

EFF believes that perfect tracking is inimical to a free society. A society in which everyone’s actions are tracked is not, in principle, free. It may be a livable society, but would not be our society.

Eloquently stated. You don’t have to be a privacy advocate to see that there are problems with mass surveillance, especially by the State. But I’d like to ask the question: can we really hope to stave off a surveillance society forever, or are efforts like the Seeing Yellow project just buying time?

My opinion is that it is impossible to put the genie back into the bottle — the cost of tracking every person, object and activity will continue to drop exponentially. I hope the present series of articles has convinced you that even if privacy advocates are successful in preventing the deployment of explicit tracking mechanisms, just about everything around you is inherently trackable. [2]

And even if we can prevent the State from setting up a surveillance infrastructure, there are undeniable commercial benefits in tracking everything that’s trackable, which means that private actors will deploy this infrastructure, as they’ve done with online tracking. If history is any indication, most people will happily allow themselves to be tracked in exchange for free or discounted services. From there it’s a simple step for the government to obtain the records of any person of interest.

If we accept that we cannot stop the invention and use of tracking technologies, what are our choices? Our best hope, I believe, is a world in which the ability to conduct tracking and surveillance is symmetrically distributed, a society in which ordinary citizens can and do turn the spotlight on those in power, keeping that power in check. On the other hand, a world in which only the government, large corporations and the rich are able to utilize these technologies, but themselves hide under a veil of secrecy, would be a true dystopia.

Another important principle is for those who do conduct tracking to be required to be transparent about it, to have social and legal processes in place to determine what uses are acceptable, and to allow opting out in contexts where that makes sense. Because ultimately what matters in terms of societal freedom is not surveillance itself, but how surveillance affects the balance of power. To be sure, the society I describe — pervasive but transparent tracking, accessible to everyone, and with limited opt-outs — would be different from ours, and would take some adjusting to, but that doesn’t make it worse than ours.

I am hardly the first to make this argument. A similar position was first prominently articulated by David Brin in his 1998 book The Transparent Society. What the last decade has shown is just how inevitable pervasive tracking is. For example, Brin focused too much on cameras and assumed that tracking people indoors would always be infeasible. That view seems almost quaint today.

Let me be clear: I have absolutely no beef with efforts to oppose pervasive tracking. Even if being watched all of the time is our eventual destiny, society won’t be ready for it any time soon — these changes take decades if not generations. The pace at which the industry wants to make us switch to “living in public” is far faster than we’re capable of adjusting to. Buying time is therefore extremely valuable.

That said, embracing the Transparent Society view has important consequences for civil libertarians. It suggests working toward an achievable if sub-optimal goal instead of an ideal but impossible one. It also suggests that the “democratization of surveillance” should be encouraged rather than feared.

Here are some currently hot privacy and civil-liberties issues that I think will have a significant impact on the distribution of power in a ubiquitous-surveillance society: the right to videotape on-duty police officers and other public officials, transparent government initiatives including FOIA requests, and closer to my own interests, the Do Not Track opt-out mechanism, and tools like FourthParty which have helped illuminate the dark world of online tracking.

Let me close by calling out one battle in particular. Throughout this series, we’ve seen that fingerprinting techniques have security-enhancing applications (such as forensics), as well as privacy-infringing ones, but that most research papers on fingerprinting consider only the former question. I believe the primary reason is that funding is for the most part available only for the former type of research and not for the latter. However, we need a culture of research into privacy-infringing technologies, whether funded by federal grants or otherwise, in order to achieve the goals of symmetry and transparency in tracking.

[1] Note that this is just an encoding and not encryption. The current system allows anyone to read the dots; public-key encryption would allow at least nominally restricting the decoding ability to only law-enforcement personnel, but there is no evidence that this is being done.

[2] This is analogous to the cookies-vs-fingerprinting issue in online tracking, and why cookie-blocking alone is not sufficient to escape tracking.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

October 18, 2011 at 11:35 am

Everything Has a Fingerprint — Don’t Forget Scanners and Printers

Previous articles in this series looked at fingerprinting of blank paper, digital cameras and RFID chips. This article will discuss scanners and printers, rounding out the topic of physical-device fingerprinting.

To readers who’ve followed the series so far, it should come as no surprise that scanners can be fingerprinted, and that this can be used to match an image to the device that scanned it. Scanners capture images via a process similar to digital cameras, so the underlying principle used in fingerprinting is the same: characteristic ‘pattern noise’ in the sensor array as well as idiosyncrasies of the algorithms used in the post-processing pipeline. The former is device-specific whereas the latter is make/model-specific.

There are two important differences, however, that make scanner fingerprinting more difficult: first, scanner sensor arrays are one-dimensional (the sensor moves along the length of the device to generate the image), which means that there is much less entropy available from sensor imperfections. Second, the paper may not be placed in the same part of the scanner bed each time, which rules out a straightforward pixel-wise comparison.

A group at Purdue has been very active in this area, as well as in printer identification, which I will discuss later in this article. These two papers are very relevant for our purposes. The application they have in mind is forensics; in this context, it can be assumed that the investigator has physical possession of the scanner to generate a fingerprint against which a scanned image of unknown or uncertain origin can be tested.

To extract 1-dimensional noise from a 2-dimensional scanned image, the authors first extract 2-dimensional noise, in a process similar to what is used in camera fingerprinting, and then they collapse each noise pattern into a single row, which is the average of all the rows. Simple enough.
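A rough sketch of that collapse step, using a crude median-filter residual as a stand-in for the more sophisticated noise-extraction filters the papers actually use:

    import numpy as np
    from scipy.ndimage import median_filter

    def row_fingerprint(scanned_image):
        # scanned_image: 2-D array of pixel intensities.
        img = scanned_image.astype(float)
        noise = img - median_filter(img, size=3)   # crude 2-D noise estimate
        return noise.mean(axis=0)                  # average all rows into a single row

    fp = row_fingerprint(np.random.rand(512, 512))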

Dealing with the other problem, the lack of synchronization, is trickier. There are broadly two approaches: (1) synchronize the image by testing various alignments, or (2) extract fingerprints using statistical features of the image that are robust to desynchronization. The authors use the latter approach, relying mainly on moment-based features of the noise vector.
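For illustration, order-independent ‘moment’ features of the 1-D noise vector might look like the sketch below; the actual feature set in these papers is richer.

    import numpy as np
    from scipy import stats

    def moment_features(noise_row):
        # Statistics that don't depend on where the paper sat on the scanner bed,
        # which is what makes them robust to desynchronization.
        x = np.asarray(noise_row, dtype=float)
        return np.array([x.mean(), x.std(),
                         stats.skew(x),        # 3rd standardized moment
                         stats.kurtosis(x)])   # 4th standardized moment (excess)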

Here are the results. At the native resolution of scanners, 1200–4800 dpi, they were able to distinguish between 4 scanners with an average accuracy of 96%, including a pair with identical make and model. In subsequent work, they improved the feature extraction to be able to handle images that are reduced to 200 dpi, which is typically the resolution used for saving and emailing images. While they achieved 99.9% accuracy in classifying 10 scanners, they can no longer distinguish devices of identical make and model.

The authors claim that a correlation based approach — searching for the right alignment between two images, and then directly comparing the noise vectors — won’t work. I am skeptical about this claim. The fact that it hasn’t worked so far doesn’t mean it can’t be made to work. If it does work, it is likely to give far higher accuracies and be able to distinguish between a much larger number of devices.

The privacy implications of scanner fingerprinting are of an analogous nature to digital camera fingerprinting: a whistleblower exposing scanned documents may be deanonymized. However, I would judge the risk to be much lower: scanners usually aren’t personal devices, and a labeled corpus of images scanned by a particular device is typically not available to outsiders.

The Purdue group has also worked on printer identification, both laser and inkjet. In laser printers, one prominent type of observable signature arising from printer artifacts is banding — alternating light and dark horizontal bands. The bands are subtle and not noticeable to the human eye, but they are easily detectable algorithmically, constituting a 1–2% deviation from average intensity.


Fourier Transform of greyscale amplitudes of a background fill (printed with an HP LaserJet)

Banding can be demonstrated by printing a constant grey background image, scanning it, measuring the row-wise average intensities and taking the Fourier Transform of the resulting 1-dimensional vector. One such plot is shown here: the two peaks (132 and 150 cycles/inch) constitute the signature of the printer. The amount of entropy here is small — the two peak frequencies — and unsurprisingly the authors believe that the technique is good enough to distinguish between printer models but not individual printers.
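Here is a minimal sketch of that measurement pipeline; the peak-picking is deliberately naive and the dpi value is only an example.

    import numpy as np

    def banding_signature(scan, dpi, n_peaks=2):
        # Strongest spatial frequencies (cycles/inch) in the row-averaged
        # intensity of a scanned constant-grey page.
        rows = scan.mean(axis=1)          # average intensity of each row
        rows = rows - rows.mean()         # drop the DC component
        spectrum = np.abs(np.fft.rfft(rows))
        freqs = np.fft.rfftfreq(len(rows), d=1.0 / dpi)   # cycles per inch
        top = np.argsort(spectrum)[::-1][:n_peaks]
        return sorted(freqs[top])         # e.g., roughly [132, 150] for the HP above

    sig = banding_signature(np.random.rand(2400, 2400), dpi=600)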

Detecting banding in printed text is difficult because the power of the signal dominates the power of the noise. Instead the authors classify individual letters. By extracting a set of statistical features and applying an SVM classifier, they show that instances of the letter ‘e’ from 10 different printers can be correctly classified with an accuracy of over 90%.

Needless to say, by combining the classification results from all the ‘e’s in a typical document, the confidence of the document-level identification can be pushed much higher.
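One natural way to do that combination is a simple majority vote over the per-letter predictions; the papers may aggregate differently, and this sketch is only illustrative.

    from collections import Counter

    def document_printer(per_letter_predictions):
        # Combine the per-'e' classifier outputs for one document.
        votes = Counter(per_letter_predictions)
        printer, count = votes.most_common(1)[0]
        return printer, count / len(per_letter_predictions)

    print(document_printer(["printer-A", "printer-A", "printer-B", "printer-A"]))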
