Bad Data “For Good”: How Data Brokers Try to Hide in Academic Research

When data broker SafeGraph got caught selling location information on Planned Parenthood visitors, it had a public relations trick up its sleeve. After the company agreed to remove family planning center data from its platforms in response to public outcry, CEO Auren Hoffman tried to flip the narrative: he claimed that his company’s harvesting and sharing of sensitive data was, in fact, an engine for beneficial research on abortion access. He even argued that SafeGraph’s post-scandal removal of the clinic data was the real problem: “Once we decided to take it down, we had hundreds of researchers complain to us about…taking that data away from them.” Of course, when pressed, Hoffman could not name any individual researchers or institutions.

SafeGraph is not alone among location data brokers in trying to “research wash” its privacy-invasive business model and data through academic work. Other shady actors like Veraset, Cuebiq, Spectus, and X-Mode also operate so-called “data for good” programs with academics, and have seized on the pandemic to expand them. These data brokers provide location data to academic researchers across disciplines, with resulting publications appearing in peer-reviewed venues as prestigious as Nature and the Proceedings of the National Academy of Sciences. These companies’ data is so widely used in human mobility research—from epidemic forecasting and emergency response to urban planning and business development—that the literature has progressed to meta-studies comparing, for example, Spectus, X-Mode, and Veraset datasets.

Data brokers variously claim to be bringing “transparency” to tech or “democratizing access to data.” But these data sharing programs are nothing more than data brokers’ attempts to control the narrative around their unpopular and non-consensual business practices. Critical academic research must not become reliant on profit-driven data pipelines that endanger the safety, privacy, and economic opportunities of millions of people without any meaningful consent. 

Data Brokers Do Not Provide Opt-In, Anonymous Data

Location data brokers do not come close to meeting human subject research standards. This starts with the fact that meaningful opt-in consent is consistently missing from their business practices. In fact, Google concluded that SafeGraph’s practices were so out of line that it banned any apps using the company’s code from its Play Store, and both Apple and Google banned X-Mode from their respective app stores. 

Data brokers frequently argue that the data they collect is “opt-in” because a user has agreed to share it with an app—even though the overwhelming majority of users have no idea that it’s being sold on the side to data brokers who in turn sell to businesses, governments, and others. Technically, it is true that users have to opt in to sharing location data with, say, a weather app before it will give them localized forecasts. But no reasonable person believes that this constitutes blanket consent for the laundry list of data sharing, selling, and analysis that any number of shadowy third parties are conducting in the background. 

On top of being collected and shared without consent, the data feeding into data brokers’ products can easily be linked to identifiable people. The companies claim their data is anonymized, but there’s simply no such thing as anonymous location data. Information about where a person has been is itself enough to re-identify them: one widely cited study from 2013 found that researchers could uniquely characterize 50% of people using only two randomly chosen time and location data points. Data brokers today collect sensitive user data from a wide variety of sources, including hidden tracking in the background of mobile apps. While techniques vary and are often hidden behind layers of non-disclosure agreements (or NDAs), the resulting raw data they collect and process is based on sensitive, individual location traces.
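To make that re-identification risk concrete, here is a minimal sketch, in Python, of the kind of “unicity” test such studies describe: sample a handful of (time, place) points from one device’s trace and count how many traces in the dataset contain all of them. The data structure and values below are invented for illustration; this is not any broker’s or researcher’s actual pipeline.

```python
import random

def unicity(traces, num_points=2, trials=1_000):
    """Estimate the fraction of randomly sampled point-sets that pin down
    exactly one device -- a rough sketch of the 'unicity' measure used in
    location re-identification studies.

    `traces` maps a pseudonymous device ID to a set of (hour, cell) visits.
    """
    unique = 0
    for _ in range(trials):
        device = random.choice(list(traces))
        sample = random.sample(sorted(traces[device]), num_points)
        # Count how many devices' traces contain every sampled point.
        matches = sum(1 for visits in traces.values()
                      if all(point in visits for point in sample))
        if matches == 1:
            unique += 1
    return unique / trials

# Toy traces: a few hourly cell/geohash observations per device.
toy_traces = {
    "device_a": {(9, "cell_12"), (13, "cell_40"), (18, "cell_12")},
    "device_b": {(9, "cell_12"), (14, "cell_07"), (19, "cell_33")},
    "device_c": {(8, "cell_55"), (13, "cell_40"), (20, "cell_55")},
}
print(f"Share uniquely identified by 2 points: {unicity(toy_traces):.2f}")
```

Even with this toy dataset, most two-point samples match a single device; real traces, with thousands of points per person, are far more distinctive.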

Aggregating location data can sometimes preserve individual privacy, given appropriate parameters that take into account the number of people represented in the data set and its granularity. But no privacy-preserving aggregation protocols can justify the initial collection of location data from people without their voluntary, meaningful opt-in consent, especially when that location data is then exploited for profit and PR spin. 
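As a rough illustration of what “appropriate parameters” means in practice, the sketch below aggregates raw pings into coarse cell-and-hour buckets and suppresses any bucket contributed to by fewer than a minimum number of distinct devices. The threshold and cell size are invented, not any broker’s published protocol, and as noted above, no amount of downstream aggregation makes the initial collection consensual.

```python
MIN_DEVICES = 20          # illustrative suppression threshold
CELL_SIZE_DEG = 0.01      # roughly 1 km grid cells at mid-latitudes (illustrative)

def aggregate(pings):
    """Turn raw pings of (device_id, lat, lon, hour) into per-cell,
    per-hour counts of distinct devices, dropping sparse buckets."""
    devices_per_bucket = {}
    for device_id, lat, lon, hour in pings:
        cell = (round(lat / CELL_SIZE_DEG), round(lon / CELL_SIZE_DEG), hour)
        devices_per_bucket.setdefault(cell, set()).add(device_id)
    # Publish a bucket only if enough distinct devices contributed to it.
    return {cell: len(devices)
            for cell, devices in devices_per_bucket.items()
            if len(devices) >= MIN_DEVICES}
```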

Data brokers’ products are notoriously easy to re-identify, especially when combined with other data sets. And combining datasets is exactly what some academic studies are doing. Published studies have combined data broker location datasets with Census data, real-time Google Maps traffic estimates, and local household surveys and state Department of Transportation data. While researchers appear to be simply building the most reliable and comprehensive possible datasets for their work, this kind of merging is also the first step someone would take if they wanted to re-identify the data. 
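For a sense of what that merging looks like in practice, here is a hedged sketch of joining broker-style dwell records to census-style attributes on a shared geographic key. All column names, codes, and values are invented; the point is that the same join key that enriches a dataset is also what lets it be linked to outside records about specific places and people.

```python
import pandas as pd

# Hypothetical broker-style dwell records, keyed by census block group.
dwells = pd.DataFrame({
    "device_id":     ["a1", "b2", "c3"],
    "block_group":   ["360610031001", "360610031001", "360610045002"],
    "dwell_minutes": [42, 18, 65],
})

# Hypothetical ACS-style attributes for the same block groups.
census = pd.DataFrame({
    "block_group":   ["360610031001", "360610045002"],
    "households":    [512, 37],
    "median_income": [68_000, 91_000],
})

# The enrichment step researchers describe -- and the first step of a
# linkage attack: a block group with only 37 households narrows an
# "anonymous" device's likely home to a very small pool of people.
enriched = dwells.merge(census, on="block_group", how="left")
print(enriched)
```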

NDAs, NDAs, NDAs

Data brokers are not good sources of information about data brokers, and researchers should be suspicious of any claims they make about the data they provide. As Cracked Labs researcher Wolfie Christl puts it, what data brokers have to offer is “potentially flawed, biased, untrustworthy, or even fraudulent.”

Some researchers incorrectly describe the data they receive from data brokers. For example, one paper describes SafeGraph data as “anonymized human mobility data” or “foot traffic data from opt-in smartphone GPS tracking.” Another describes Spectus as providing “anonymous, privacy-compliant location data” with an “ironclad privacy framework.” Again, this location data is not opt-in, not anonymized, and not privacy-compliant.

Other researchers make internally contradictory claims about location data. One Nature paper characterizes Veraset’s location data as achieving the impossible feat of being both “fine-grained” and “anonymous.” This paper further states it used such specific data points as “anonymized device IDs” and “the timestamps, and precise geographical coordinates of dwelling points” where a device spends more than 5 minutes. Such fine-grained data cannot be anonymous. 

A Veraset Data Access Agreement obtained by EFF includes a Publicity Clause, giving Veraset control over how its partners may disclose Veraset’s involvement in publications. This includes Veraset’s prerogative to approve language or to remain anonymous as the data source. While the Veraset Agreement we’ve seen was with a municipal government, its suggested language appears in multiple academic publications, indicating that similar agreements may be in play with academics.

A similar pattern appears in papers using X-Mode data: some use nearly verbatim language to describe the company. They even claim its NDA is a good thing for privacy and security, stating: “All researchers processed and analyzed the data under a non-disclosure agreement and were obligated to not share data further and not to attempt to re-identify data.” But those same NDAs prevent academics, journalists, and others in civil society from understanding data brokers’ business practices, or identifying the web of data aggregators, ad tech exchanges, and mobile apps that their data stores are built on.

All of this should be a red flag for Institutional Review Boards, which review proposed human subject research and need visibility into whether and how data brokers and their partners actually obtain consent from users. Likewise, academics themselves need to be able to confirm the integrity and provenance of the data on which their work relies.

From Insurance Against Bad Press to Accountable Transparency

Data sharing programs with academics are only the tip of the iceberg. To paper over the dangerous role they play in the online data ecosystem, data brokers forge relationships not only with academic institutions and researchers, but also with government authorities, journalists and reporters, and non-profit organizations. 

The question of how to balance data transparency with user privacy is not a new one, and it can’t be left to the Verasets and X-Modes of the world to answer. Academic data sharing programs will continue to function as disingenuous PR operations until companies are subjected to data privacy and transparency requirements. While SafeGraph claims its data could pave the way for impactful research on abortion access, the fact remains that the very same data puts actual abortion seekers, providers, and advocates in danger, especially in the wake of Dobbs. The sensitive location data these brokers deal in should only be collected and used with specific, informed consent, and subjects must have the right to withdraw that consent at any time. No such consent currently exists.

We need comprehensive federal consumer data privacy legislation to enforce these standards, with a private right of action to empower ordinary people to bring their own lawsuits against data brokers who violate their privacy rights. Moreover, we must pull back the NDAs to allow research investigating these data brokers themselves: their business practices, their partners, how their data can be abused, and how to protect the people whom data brokers are putting in harm’s way.

