We Don’t Know What ‘Personal Data’ Means

It’s Not Just What We Tell Them. It’s What They Infer.

Many of us seem to think that “personal data” is a straightforward concept. In discussions about Facebook, Cambridge Analytica, GDPR, and the rest of the data-drenched world we live in now, we proceed from the assumption that personal data means something like “data about myself that I provide to a platform.” Personal data means my birthdate, my gender, my family connections, my political affiliations. It is this data that needs special protection and that we should be particularly concerned about providing to online services.

This is partly true, but in another way it is seriously misleading. Further, its misleading aspects trickle down into too much of our thinking about how we can and should protect our personal or private data, and more broadly our political institutions. The fact is that personal data must be understood as a much larger and even more invasive class of information than the straightforward items we might imagine.

A key to understanding this can be found in a 2014 report by Martin Abrams, Executive Director of the Information Accountability Foundation (a somewhat pro-regulation industry think tank), called “The Origins of Personal Data and Its Implications for Governance.” Abrams offers a fairly straightforward description of four different types of personal data: provided data, which “originates via direct actions taken by the individual in which he or she is fully aware of actions that led to the data origination”; observed data, which is “simply what is observed and recorded,” a category which includes an enormous range of data points: “one may observe where the individual came from, what he or she looks at, how often he or she looks at it, and even the length of pauses”; derived data, “derived in a fairly mechanical fashion from other data and becomes a new data element related to the individual”; and inferred data, “the product of a probability-based analytic process.”
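To make Abrams’s four categories concrete, here is a minimal sketch, in Python, of how a single user’s record might be partitioned under those headings. Every field name and value below is invented for illustration; none of it is drawn from any actual platform’s schema.

```python
# A hypothetical sketch of Abrams's four categories applied to one imaginary
# user. Every field and value here is invented for illustration.
user_data = {
    "provided": {            # the user knowingly typed these in
        "birthdate": "1980-04-12",
        "relationship_status": "married",
    },
    "observed": {            # recorded from behavior, not explicitly entered
        "pages_viewed_per_session": 37,
        "median_pause_before_click_ms": 420,
        "referrer": "news-site.example",
    },
    "derived": {             # mechanically computed from other fields
        "age": 45,                      # from birthdate
        "sessions_per_week": 12,        # from raw activity logs
    },
    "inferred": {            # probabilistic guesses produced by models
        "openness_score": 0.81,                         # e.g., predicted from "likes"
        "likely_political_leaning": ("liberal", 0.73),  # label plus confidence
    },
}

# Only the first block is what most people picture when they hear
# "personal data"; the other three are generated about us, not by us.
for category, fields in user_data.items():
    print(category, "->", list(fields))
```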

To this list we’d need to add at least two more categories: anonymized data and aggregate data. Anonymized data is data from which identifying information, for example a person’s name, has in some way been stripped; unlike the other categories, anonymized and pseudonymized data are directly addressed by the GDPR, which notes in Recital 26 that the “regulation does not therefore concern the processing of such anonymous information.” This might be more comforting if it were not clear that “true data anonymization is an extremely high bar, and data controllers often fall short of actually anonymizing data.”

Aggregate data, as I’m using the term here, refers to data that is collected at the level of the group, but does not allow drilling down to specific individuals. In both cases, the lack of direct personal identification may not interfere with the ability to target individuals, even if they can’t necessarily be targeted by name (although in the case of anonymized data, a major concern is that advances in technology all too often make it very possible to de-anonymize what had once been anonymized). GDPR’s impact on aggregate data is one of the areas of the regulation that remains unclear.
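The de-anonymization worry is easiest to see in miniature. The toy sketch below, in plain Python with entirely invented records, joins a nominally “anonymized” dataset against a public, named dataset using nothing more than ZIP code, birth date, and gender. No name was ever provided to the first dataset, yet the combination of those mundane fields is often unique enough to restore it.

```python
# Toy illustration of re-identification by joining on quasi-identifiers.
# Both datasets and all values are invented.

# "Anonymized" health records: names stripped, quasi-identifiers kept.
anonymized = [
    {"zip": "60622", "birthdate": "1984-07-09", "gender": "F", "diagnosis": "cirrhosis"},
    {"zip": "60614", "birthdate": "1991-02-20", "gender": "M", "diagnosis": "asthma"},
]

# A public record set (say, a voter roll) that does carry names.
public = [
    {"name": "J. Doe", "zip": "60622", "birthdate": "1984-07-09", "gender": "F"},
    {"name": "R. Roe", "zip": "60614", "birthdate": "1991-02-20", "gender": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birthdate", "gender")

def key(record):
    return tuple(record[f] for f in QUASI_IDENTIFIERS)

names_by_key = {key(p): p["name"] for p in public}

# The join: the "anonymized" records were never given a name, but the
# combination of ordinary fields points back to one anyway.
for record in anonymized:
    name = names_by_key.get(key(record))
    if name:
        print(f"{name} -> {record['diagnosis']}")
```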

To understand how data that may on the surface seem “impersonal” can in fact be used to target us as individuals, consider, for example, one form of analysis that is found all over the Facebook/Cambridge Analytica story: the so-called Big Five personality traits, sometimes known by the acronym OCEAN, for openness, conscientiousness, extroversion, agreeableness, and neuroticism. Researchers and advertisers focus on the Big Five because they appear to give marketers and behavioral manipulators particularly strong tools with which to target us.

A recent New York Times story takes some examples from the research of Michal Kosinski, a Stanford University researcher whose work is often placed at the center of these conversations and may have been used by Cambridge Analytica. While Kosinski might be a particularly good salesperson for the techniques he employs, we do not have to accept everything he says at face value to see that the general methods he uses are widely employed and appear to have significant validity.

Kosinski provided the Times with inferences he made about individuals’ OCEAN scores based on nothing more than Facebook “like” data. He generated these inferences by taking a group of experimental participants, matching their “likes” against their measured OCEAN scores, and then using machine learning to infer probabilities about the associations between particular likes and particular OCEAN ranks. Beyond OCEAN rankings, Kosinski has in some places claimed even more for this technique; in a famous segment in Jamie Bartlett’s recent Secrets of Silicon Valley documentary, Kosinski correctly infers Bartlett’s religious background from his Facebook likes. This is also exactly the kind of aggregate, inferential data that the infamous “This Is Your Digital Life” app, which Facebook has said Cambridge Analytica used, gathered about not just individual users but their friends.
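To be clear about what “using machine learning to infer probabilities” involves, here is a rough sketch of the general technique, not Kosinski’s actual pipeline. Assuming scikit-learn and NumPy, a binary users-by-likes matrix and questionnaire-based openness scores for a training group are enough to fit a regression that then scores people who never answered a questionnaire; the data here is random and every number is a meaningless placeholder.

```python
# Sketch of likes-to-personality inference. Illustrative only: random data,
# arbitrary model settings, no claim about any real system's internals.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

n_users, n_pages = 1000, 500
likes = rng.integers(0, 2, size=(n_users, n_pages))   # 1 = this user "liked" this page
openness = rng.normal(0, 1, size=n_users)              # self-reported scores (fake)
# In a real study, `openness` would come from questionnaires taken by a
# training group, and `likes` from their actual profiles.

model = Ridge(alpha=10.0)        # a garden-variety regularized regression
model.fit(likes[:800], openness[:800])

# Score users who never answered a questionnaire: their personality is now
# *inferred* data, produced about them rather than provided by them.
predicted = model.predict(likes[800:])
print(predicted[:5])
```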

Even in the examples Kosinski gives, it is tempting to draw causal relationships between the proffered data and the inferred data. Perhaps people who like A Clockwork Orange are particularly interested in alternate versions of reality, and perhaps this makes them “open” to new experiences; perhaps liking Marilyn Manson gives off visual or aural cues for being neurotic. This kind of reasoning is just the mistake we need to avoid. These causal relationships may even exist, but it does not matter, and that is not what Kosinski’s techniques aim to discover. The software that decides that more “open” people like A Clockwork Orange is not asking why that is the case; it only registers that it is.

The fact that some of these data points look like they have causal relationships to the inferred personality traits is misleading. This was just a large body of data with many points (that is, thousands of different things people could “like”), which made it possible to create rich mappings between that data and an experimental set of people whose characteristics on the OCEAN scale were already known, along with their own “likes.”

The same kind of comparisons can be done with any set of data. It could be done with numbers, colors, the speed of mouse clicks, the timbre of voice, and many, many other kinds of data.
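That interchangeability can be shown in a couple of lines: the model in the previous sketch never looks at what a column “means,” so a matrix of, say, mouse-movement statistics (hypothetical features, invented here) slots in exactly where the likes matrix was.

```python
# Same technique, different raw material: behavioral telemetry instead of
# "likes." All features and data are invented; the point is only that the
# model is indifferent to what the columns represent.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Hypothetical per-user mouse statistics: mean speed, speed variance,
# path curvature, dwell time before clicks, clicks per minute.
mouse_features = rng.normal(size=(1000, 5))
neuroticism = rng.normal(size=1000)     # training labels from questionnaires (fake)

model = Ridge(alpha=1.0).fit(mouse_features[:800], neuroticism[:800])
print(model.predict(mouse_features[800:805]))
```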

These facts create a huge conundrum for privacy advocates. We might think we have told Facebook that we like A Clockwork Orange, and that that is the end of the story. But what if, by telling them that fact, we have also told them that we are gay, or have cirrhosis of the liver, or always vote against granting zoning permits to big box stores? What if we tell them that not through anything as concrete as “liking” a movie, but simply through the speed and direction of our mouse movements?

It is critical to understand that it does not matter whether this information is accurate at the level of each specific individual. Data collectors know it is not entirely accurate. Again, people mistake the correlation for a linear, causal relationship: “if I like hiking as a hobby, I am open to new experiences.” They then tend to evaluate that relationship on the grounds of whether or not it makes sense to them. But this is a mistake. What is accurate about these techniques is the statistical inference, something along the lines of “75% of those who report they like hiking rank high on the ‘openness’ scale.” Of course it is wrong in some cases. That does not matter. If an advertiser or political operative wants to affect behavior, they look for triggers that motivate people who score highly on “openness,” and then offer products, services, and manipulative media tuned to those scores.
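What “accurate” means here is just a conditional frequency over a population, something like the toy calculation below (all records and rates are invented): the question asked of the data is not whether the link makes sense, only whether the rate is high enough to make targeting pay off on average.

```python
# Toy illustration of population-level inference and targeting.
# The population and the 75% association are fabricated for the example.
import random

random.seed(0)

people = []
for _ in range(10_000):
    likes_hiking = random.random() < 0.4
    # Build in a purely statistical association; no causal story is needed.
    p_high_openness = 0.75 if likes_hiking else 0.45
    people.append({
        "likes_hiking": likes_hiking,
        "openness_high": random.random() < p_high_openness,
    })

hikers = [p for p in people if p["likes_hiking"]]
rate = sum(p["openness_high"] for p in hikers) / len(hikers)
print(f"P(high openness | likes hiking) ~ {rate:.2f}")   # close to 0.75

# The operative question is not "why?" but "is the rate high enough that
# showing openness-targeted material to hikers pays off on average?"
targets = hikers if rate > 0.6 else []
print(f"{len(targets)} people queued for 'openness'-styled messaging")
```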

That is, they don’t have to know why a point of data implies another point of data. They only have to know that it does, within a certain degree of accuracy. This is why categories like inferential, aggregate, and anonymized data must be of primary concern in understanding what we typically mean by “privacy.”

Research into big data analytics is replete with examples of the kinds of inferential and derived data that we need to understand better. Data scientist and mathematician Cathy O’Neil’s important 2016 book Weapons of Math Destruction includes many examples. Some of her examples make sense, even if they offend pretty much any sense of justice we might have: the use of personality tests to screen job applicants based on inferences made about them rather than anything about their work history or qualifications (105-6), or O’Neil’s own experience building a system to determine a shopper’s likelihood to purchase based on their behavior clicking on various ads (46-7). Others derive inferences from data apparently remote from the subject, such as O’Neil’s citation of a Consumer Reports investigation that found a Florida auto insurer basing insurance rates more heavily on credit scores than on accident records (164-5). In that case, observed data (use of credit) is converted into derived data (the details of the credit report) and then, via big data analysis, converted into inferential data (likelihood of making an insurance claim).

Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor, a 2018 volume by political scientist and activist Virginia Eubanks, contains many examples of derived and inferential data being used to harm already-vulnerable members of society. One is an Allegheny County (Pennsylvania) algorithm that attempted to predict which families were at high risk for child abuse, but which was based on big data analysis of hundreds of variables, many with no proven direct relationship to child abuse, and which ended up disproportionately targeting African American families (Chapter 4). In another chapter, Eubanks writes about massive data collection programs in Los Angeles intended to match housing resources to the homeless, but which end up favoring some individuals over others for reasons that remain opaque to human case workers, and that are not clearly based on the directly relevant observations about clients that those case workers would prefer to use when making life-changing decisions.

In the Cambridge Analytica case, inferential data appears to play a key role. When David Carroll and I requested our data from the company under UK law, the most interesting part of the results was a table of 10 ranked hot-button political issues. No information was provided about how this data was produced, but it clearly cannot have been provided data, since it is not data I ever directly provided to anyone; I have not even thought about these issues in this form, and if the data is correct, much of it is news to me. The data is likely not observed either, since that would require a forum in which I had taken actions indicating the relative importance of these issues to me, and I can’t think of any forum in which I’ve done anything close to that. That leaves inferential and derived data, and both Carroll and I and the analysts we’ve been working with presume that this data was in fact inferred from some larger body of data. (Cambridge Analytica has at some points claimed to hold upwards of 4000 data points on individuals, on which it performs what Kosinski and others call “psychographics”: just the kind of inferential personal data I’ve been describing, used to determine very specific aspects of an individual’s personality, including their susceptibility to various kinds of behavioral manipulation.) While it is hard to judge the accuracy of the ranked list precisely (in part because we don’t really know how it was meant to be used), overall it seems quite accurate, and thus offers a fairly complete inferred and/or derived political profile of me, built from provided and observed data that likely had at best a partial relationship to my political leanings.

[Image: the ranked political-issues table returned in response to the author’s data request to Cambridge Analytica/SCL]
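How could a ranked list of issues like the one in that table be inferred rather than provided? One plausible mechanism, offered only as a guess at the general shape of such systems and not as a description of Cambridge Analytica’s actual models, is to score a person’s profile of inferred traits against a learned weight vector for each issue and sort the results. The traits, issues, and weights below are all invented.

```python
# Hypothetical sketch: turning a profile of inferred traits into a ranked
# list of issues. Everything here is made up for illustration.
profile = {"openness": 0.8, "conscientiousness": 0.3, "neuroticism": 0.6,
           "age_bracket_30_44": 1.0, "suburban": 1.0}

# Per-issue weights a campaign might have fit from polling plus profile data.
issue_weights = {
    "gun rights":    {"openness": -0.5, "suburban": 0.2, "neuroticism": 0.4},
    "immigration":   {"openness": -0.3, "age_bracket_30_44": 0.1, "neuroticism": 0.5},
    "environment":   {"openness": 0.7, "conscientiousness": 0.2},
    "national debt": {"conscientiousness": 0.6, "age_bracket_30_44": 0.3},
}

def score(issue):
    # Weighted sum of whatever traits the issue model cares about.
    weights = issue_weights[issue]
    return sum(weights.get(trait, 0.0) * value for trait, value in profile.items())

ranked = sorted(issue_weights, key=score, reverse=True)
for rank, issue in enumerate(ranked, start=1):
    print(rank, issue, round(score(issue), 2))
```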

Yes, we should be very concerned about putting direct personal data out onto social media. Obviously, putting “Democrat” or even “#Resist” in our public Twitter profiles tells anyone who looks what party we belong to. We should be asking hard questions about whether it is wise to allow even that minimal kind of declaration in public and whether it is wise to allow it to be stored in any form, and by whom. But perhaps even more seriously, and much less obviously, we need to be asking who is allowed to process and store information like that, regardless of where they got it from, even if they did not get it directly from us.

A side note: academics and activists sometimes protest the inaccessibility of some kinds of data, given the importance of understanding what companies like Facebook are doing with our data. That’s an important conversation to have, but it’s worth noting that both Kosinski and Alexander Kogan, another researcher at the heart of the Cambridge Analytica story, got access to the data they used because they were academics.

In his testimony before the US House of Representatives Energy and Commerce Committee on April 11, 2018, Facebook CEO Mark Zuckerberg offered the following reassurance to Facebook users:

The content that you share, you put there. You can take it down at any time. The information that we collect, you can choose to have us not collect. You can delete any of it, and, of course, you can leave Facebook if you want.

At first glance, this might seem to cover everything users would care about. But read the language closely. The content “users share” and the content that Facebook “collects” name much thinner segments of Facebook’s user data than the words might seem to suggest.

Just taking Zuckerberg’s language literally, “the content you share” sounds like provided data, and “the information that we collect” sounds like some unspecified mix of provided and observed data.

[Image: Mark Zuckerberg on Facebook data]

But what about derived, inferred, and aggregate data?

What this kind of data can do for those who want to manipulate us is unknown, but its potential for harm is too clear to be overlooked. The existing regulations and enforcement agreements imposed on Facebook and other data brokers have proven insufficient. If there is one takeaway from the Cambridge Analytica story and the Facebook hearings and so on, it is that democracies, and that means democratic governments, need to get a handle on these phenomena right away, because the general public does not and cannot know the extent to which giving away apparently “impersonal” data might, in fact, reveal our most intimate secrets.

Further, as a few commentators have noted, Facebook and Google are only the most visible tips of a huge iceberg. The hundreds of data brokers whose entire business consists in selling data about us that we never directly gave them may be even more concerning, in part because their actions are so much more hidden from the public. Companies like Acxiom aggregate, analyze, and sell data, both for advertising and for a wide range of other activities that impact us in ways we do not understand nearly well enough, up to and including the “social credit score” that the Chinese government appears to be developing to track and control many aspects of public behavior. Possibly even worse, this data fuels the activities of full-scale surveillance companies like Peter Thiel’s Palantir, with which Mark Zuckerberg, in his Congressional testimony, declared he “isn’t that familiar,” despite Thiel being a very visible and outspoken early Facebook investor and mentor to Zuckerberg. Facebook itself has a disturbing interest in the data of people who have never signed up for the service, which only underscores its similarity to data brokers like Acxiom.

If Facebook and Google and the data brokers were to say, “you can obtain, and if you choose to, delete, all the data we have about you,” or better yet, “you have to actively opt-in to give us your data and agree to the processing we do with it,” that might go a long way toward addressing the kind of concerns I and others have been raising for a long time about what is happening with surveillance and behavioral manipulation in digital technology. But would that even be enough? Is it clear that data “about” me is all the data that is directly attached to my name, or whatever other unique personal identifier Facebook uses? Would these companies even be able to stay in business if they offered users that much control?

Even the much-vaunted and very important GDPR is not nearly as clear as it could be about these different kinds of data. If we are to rein in the massive invasions of our privacy found in social media, we need to understand much more clearly and specifically what that data is, and what social media companies, data brokers, and even academics do with it.
