Is Data Collection Evil?

Privacy or data, a convenient false dichotomy

With a lot of caveats, the short answer is yes. And yet, we [Cliqz] do it, at scale.

Data from people, which is not the same as personal data, is an extremely efficient and often the only way to build X. In Cliqz’s case, X is a search engine and a privacy-preserving browser.

Evilness does not necessarily come out of the data itself but rather from how and why it is collected. There is great resistance to fine-grained discussion of this topic because we are trapped in a false dichotomy: data or privacy.

The proponents on the privacy side can only lose if they vouch for data. The cost-to-benefit ratio pushes them to maximalist positions: if data, then no privacy. It does not matter if the risk is small; the cost to their reputation for failing is tremendous. The proponents on the data side also suffer from the wrong incentives. They will collect the data they need, that much is clear. But since they will be accused of compromising on privacy regardless, they have no incentive to collect responsibly, or to limit themselves to just the data they need. Rinse and repeat, and both sides entrench. Everybody is happy, but privacy goes down the drain.

In this article we hope to convince you that you can have both data and privacy. It is a complicated relationship, true, but the two are not mutually exclusive if the right compromises are made.

Data Collection in Search

Let us start with the need for data, going straight to the point: one cannot build a competitive search engine without data collected from people. The claim is somewhat controversial, but it can be backed up.

Crawling and indexing without taking into account the popularity of web pages is wasteful. It is possible, but much more costly. And attention to costs is important when building a competitive system; there are lots of places where the money is needed more. Determining the quality of a page is very hard, but if you can observe how people interact with it, it becomes much easier. Knowing which pages people visit after a query is vital. This kind of data is collected by Cliqz through a system that we named Human Web.
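To make this concrete, here is a toy sketch of how query-to-page signals turn into a popularity ranking. To be clear, this is not the Human Web pipeline (upcoming articles will describe that properly); the data, names and scoring below are made up for illustration.

```python
from collections import Counter

# Hypothetical Human Web-style signals: which page a person visited
# after issuing a query. Note that no user identifier is attached.
signals = [
    ("berlin weather", "wetter.de/berlin"),
    ("berlin weather", "weather.com/berlin"),
    ("berlin weather", "wetter.de/berlin"),
    ("python tutorial", "docs.python.org/3/tutorial"),
]

# Count how often each page is chosen for each query: a cheap,
# behavior-based signal of page quality and popularity that also
# tells the crawler which pages are worth fetching first.
popularity = Counter(signals)

def rank(query):
    """Rank candidate pages for a query by observed popularity."""
    candidates = [(url, n) for (q, url), n in popularity.items() if q == query]
    return sorted(candidates, key=lambda item: -item[1])

print(rank("berlin weather"))
# [('wetter.de/berlin', 2), ('weather.com/berlin', 1)]
```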

Incumbent search engines like Google, Bing and Yandex rely on this kind of data too. In their case it might be out of convenience or cost-reduction, but for us, it is a necessity to keep up with competitors.

The competitive advantage of data is so strong that the industry cannot regulate itself. External actors are called on to intervene, but in our humble opinion, regulation will not really fix the issue.

First, large organizations are the only ones with enough resources to deal with the overhead caused by regulation. Look at the early months of GDPR in the EU: Google’s and Facebook’s trackers, unlike most others, did not lose “market share”[1].

Second, large organizations’ deep pockets allow them to choose whether to follow regulation or not: if the gains from breaking the law are higher than the costs, the rational course of action is to pay the fine. Google has already paid multiple fines in the range of billions in Europe alone.

Finally, even if, under some hypothetical regulation, data collection were totally banned or severely restricted, the existing big-data monopolies would become the sole owners of something that is now scarce. That is a good place to be in.

If you have read up to this point, some names might have crossed your mind: DuckDuckGo, Qwant, StartPage, etc.

They are search engines, they claim not to collect data, and their quality is good enough to be competitive. Does that mean the claim is false?

Not really; what happens is that the premise is wrong. None of the aforementioned search engines is really a search engine, at least not an independent one. They rely on the backends of the big players for the heavy lifting, and those are the ones collecting data. Without Microsoft’s Bing and Yandex, they would not be as competitive as they are; they might not even exist. While we have great respect for the products they build, there is a catch-22: the “alternative” search engines fundamentally depend on the companies they are trying to be an alternative to. Furthermore, the privacy protection offered is somewhat problematic. They do help their users protect their privacy, but without fixing the underlying privacy problem[2]. It is like sweeping the problem under the rug instead of fixing it.

Building a search engine is an extremely complex and tedious task engineering-wise and a black hole for money business-wise. Downloading the Common Crawl and spawning an Elasticsearch cluster, all seasoned with embeddings, transformers and regression trees, will not do it. That will give you a search engine, but not one that people would stop using Google for. Stay tuned if you want to know how we build our search engine; there will be plenty of articles in the coming months. Also, feel free to disregard the unsolicited advice of not trying to build a search engine without collecting data; if you are brave enough, go ahead and try, and we will help in any way we can. Someone ought to succeed. Perhaps it will not be Cliqz, but we will be very happy if someone else does. We all would be better off.

Data Collection in Browsers

At Cliqz we build both a search engine and a browser. If you think of a browser as a shell for an HTML rendering engine, then no data collection should be needed. But if you consider the browser as a user agent, then data collection is an important ingredient for must-have features like tracking protection, security, etc. We will have time to delve into the details in upcoming articles.

How to Do Data Collection

We hope that we have made our point and that you agree that data is needed, a necessary evil if you will. We also introduced the dichotomy of data or privacy; whether it is false or true depends greatly on how the data is collected.

Allow us to introduce an example to illustrate the point. Let’s say that you want to implement a cool new feature: showing how busy a business is over the course of a day.

Basically, what you want to do is a query like this:

Count how many different people are in a given location [the business] for longer than a certain time range [10 minutes] by hour.

If you have access to people’s location, you will instrument the OS / router / device / browser / app / extension to send the following data: <lat, long, timestamp, uid>, and wait. Most likely, however, someone else needed this data in the past, so the instrumentation and data are already available and can be re-used.
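With that raw feed in hand, the busy-hours query is a few lines of aggregation. A minimal sketch, under the made-up record format above; the location matching and thresholds are deliberately crude:

```python
from collections import defaultdict

# Raw records exactly as instrumented: <lat, long, timestamp, uid>.
records = [
    (52.52, 13.40, 1_577_880_000, "uid-a"),
    (52.52, 13.40, 1_577_880_700, "uid-a"),  # same person, ~12 min later
    (52.52, 13.40, 1_577_880_100, "uid-b"),
    (52.52, 13.40, 1_577_880_400, "uid-b"),  # only 5 min: too short
]

BUSINESS = (52.52, 13.40)  # the business we were asked about
MIN_STAY = 10 * 60         # "longer than 10 minutes"

# First and last sighting of each uid at the business.
stays = {}
for lat, lng, ts, uid in records:
    if (lat, lng) == BUSINESS:
        first, last = stays.get(uid, (ts, ts))
        stays[uid] = (min(first, ts), max(last, ts))

# Busy-ness by hour: distinct people who stayed at least MIN_STAY.
busy = defaultdict(set)
for uid, (first, last) in stays.items():
    if last - first >= MIN_STAY:
        busy[(first // 3600) % 24].add(uid)

for hour, people in sorted(busy.items()):
    print(f"hour {hour:02d} UTC: {len(people)} visitor(s)")
# hour 12 UTC: 1 visitor(s)
```

The feature works. The problem is everything else this table can answer.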

Needless to say, the raw location data collected as described is not private. There are multiple ways to determine the real person behind that random uid [3]. But that will not stop you; after all, the feature you have been asked to implement is totally legitimate and for a good cause.

The problem is that the data can be re-used ad infinitum, and there is no way to guarantee that there will not be an ethically questionable use-case.

Consider this query over the raw location data:

For all the different people that were at location X at time T, tell me their current location.

Certainly a bit creepier, even dangerous for personal safety. It is true that the people targeted could be criminals or terrorists, but they could also be people under protection, dissidents, or members of a minority. And this one is way too dangerous, if you ask us:

Tell me the locations at night hours [to infer the address] of all people that have been at least twice on this list of locations [ church | trade union | abortion clinic ] in the last year so that we can put up some billboards.
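To see how cheap such re-use is: reusing the hypothetical records table from the sketch above, each of these queries is a few lines, precisely because the uid stitches a person’s entire history together.

```python
# Same raw <lat, long, timestamp, uid> records as before.
def people_at(records, location, start, end):
    """Everyone seen at `location` within [start, end]."""
    return {uid for lat, lng, ts, uid in records
            if (lat, lng) == location and start <= ts <= end}

def last_known_location(records, targets):
    """Latest position of each targeted uid."""
    latest = {}
    for lat, lng, ts, uid in records:
        if uid in targets and (uid not in latest or ts > latest[uid][0]):
            latest[uid] = (ts, (lat, lng))
    return {uid: loc for uid, (ts, loc) in latest.items()}

targets = people_at(records, (52.52, 13.40), 1_577_880_000, 1_577_881_000)
print(last_known_location(records, targets))
# {'uid-a': (52.52, 13.4), 'uid-b': (52.52, 13.4)}
```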

Privacy is not only about individuals; its loss can endanger society as a whole. Americans have recently become aware of how propaganda and false information can be weaponized, if only because they were the targets for once.

People, the writer of this text included, accept being used as sensors in return for free products and services. But we can only trust that the data we provide will be used solely for good.

Let’s say that you trust the companies that collect this location data (Google and AT&T/Verizon if you are an Android user in the US, plus many others if you have apps installed). Do you extend that trust to third parties? Do you trust the ethics and professionalism of all the employees with access to the data? Data can be leaked, or hacked. And do you trust that they will resist the pressure of certain government agencies? If you still do not have doubts, let us pose one last question: do you believe that all this will still be true in 20 years? Governments can change (for the worse) even in the EU; see Spain, Poland or Hungary. Companies can be sold or go bankrupt; see Yahoo or Fitbit. Last but not least, institutions evolve: Google removed the don’t be evil motto, for coherence, we suppose.

There are so many things that can go wrong that resorting to trust is the only way to sleep at night. But we are signing a blank check, hoping that it will never be cashed.

What would a potential alternative be? Instead of collecting the location of people, why not make the sensor smarter and only emit the signal you want when the conditions of the query are met:

On the cellphone, if stationary for more than 10 minutes in a rarely seen location, then, after a random timeout, send a message back to the collector with <lat, long, event_time_capped>.

And then,

On the collector, count the number of duplicated messages, which will give you how many people were there, without having to send any uid, so no historical record of people’s movements is ever collected.

There is no need to send an explicit uid that would link all messages from the same person; hence, the illegitimate use-cases from before are not possible.
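A minimal sketch of this alternative, with both halves side by side. The thresholds, the message format and the function names are our own illustration, not the actual Human Web code:

```python
import random
from collections import Counter

# --- On the device: emit only when the query's conditions are met ---
def maybe_emit(stationary_seconds, seen_before, location, event_time):
    if stationary_seconds < 10 * 60 or seen_before:
        return None                               # nothing leaves the device
    capped_time = event_time - event_time % 3600  # truncate to the hour
    delay = random.uniform(0, 3600)               # random timeout before sending,
                                                  # to hinder timing correlation
    # (in a real client, the send would be scheduled `delay` seconds later)
    return (location, capped_time)                # <lat, long, event_time_capped>

msg = maybe_emit(stationary_seconds=720, seen_before=False,
                 location=(52.52, 13.40), event_time=1_577_881_234)
print(msg)  # ((52.52, 13.4), 1577880000) -- no uid anywhere

# --- On the collector: identical messages come from different people ---
messages = [
    ((52.52, 13.40), 1_577_880_000),
    ((52.52, 13.40), 1_577_880_000),
    ((52.52, 13.40), 1_577_883_600),
]
visitors = Counter(messages)  # counting duplicates counts visitors
for (location, hour_ts), count in sorted(visitors.items(), key=lambda kv: kv[0][1]):
    print(location, hour_ts, count)
# (52.52, 13.4) 1577880000 2
# (52.52, 13.4) 1577883600 1
```

The collector answers exactly the busy-hours question and nothing more; the creepy queries above simply have no data to run on.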

Is it provably private? No. It cannot be formally proven, but it looks much safer. We welcome you to read the upcoming articles on Human Web and the Human Web Proxy Network (HPN) if you are interested in the details. De-anonymization attacks, counter-measures, a description of the data fed to Cliqz’s search engine, fraud prevention, the relation to differential privacy: all of these topics and more will be addressed.

Why Don’t Others Do This, and Why Does Cliqz?

That we cannot answer categorically, but we can share our experience of collecting data this way, and some thoughts on why it is not widespread:

  • It is more difficult to implement and debug.
  • Deployments are extremely heavy; shipping logic to clients is cumbersome.
  • Prototyping and research are orders of magnitude more complex.
  • You can only measure a priori known hypotheses (the conditions to be met). No more fishing expeditions to see what can be extracted from the data, neither manually (statistics) nor algorithmically (AI/ML).
  • You cannot apply queries retrospectively; the data to satisfy your use case should not exist yet. So it might take weeks to be able to evaluate the outcome.
  • Good luck convincing your manager that you will spend two months on what could be done in four days.

Any new query (or task) that requires data will need to be implemented, deployed, awaited and finally analyzed. Any mistake or iteration entails starting from scratch. Compare this against the current modus operandi: access to collect-all-you-can datasets on Databricks (hopefully anonymized, even more hopefully using differential privacy).

There is one last catch: whatever is collected is not re-usable. To be more precise, it should not be re-usable, as the data should only serve the use-case it was intended for.

You might think that our approach is like shooting yourself in the foot, and you are absolutely right. The methodology is cumbersome, inconvenient and full of drawbacks. There is only one advantage: privacy.

The dichotomy of data or privacy is indeed false, but only if you pay a hefty price.

We have multiple reasons to force this upon ourselves:

We do not do it for marketing. We started to collect data like this in 2014, back when privacy was even more niche than it is today.

Love for [over]-engineering. We must admit that this has something to do with it, but only a small part of it.

Social responsibility. That’s the bulk of it. Every Cliqzer has a different take on it, but personally, I am not doing it out of altruism but rather out of fear. Some of us are old enough to have seen datasets that should not exist, and yet, those datasets exist. We would not want this kind of data to ever exist at Cliqz. Our families, our friends, and we ourselves use it.

We will not defeat global surveillance capitalism, certainly not on our own. But one thing is sure, we do not want to become a contributor to it. Neither by action nor by omission.

Allow us to end with a fitting quote:

“The world is a dangerous place, not because of those who do evil, but because of those who look on and do nothing.” - Albert Einstein

This is not aimed at our colleagues at DuckDuckGo, Qwant or StartPage. They are small enough not to be guilty of not trying. But others are not. You know who you are; get to work.

Footnotes


  1. Data collected by Cliqz allows us to create services like WhoTracksMe. The open-source aggregated data enabled studies like What happened after GDPR. ↩︎

  2. Bing’s monetization API requires the IP of the user to be forwarded. Hence, Microsoft has the ability to collect query sessions. According to them, the data has a very short retention period and is only used to fight fraud. We must trust that Microsoft does as it claims; there is no way to verify. Anyone using the Bing API to power their service must either compromise the privacy of their users or breach the terms of service. That is the price to pay for not being independent. Of course, terms of service can change, and not necessarily for the better. For instance, DuckDuckGo has access to Bing via the now-defunct Yahoo, which did not force them to send the user’s IP. How this will play out in the future is unknown, but for sure, it is safer to be independent. ↩︎

  3. There are many papers on the topic. Just a couple for reference: side-channels, statistical inference. The uid allows records to be linked; if only one of these records leaks the identity, the full history is compromised. One way to think about it is probabilistically: if the chance of a single record being compromised is very small, say p = 0.001, the chance that at least one of n records is compromised is 1 - (1 - p)^n, which for n = 1000 is already about 63%. Keep collecting, and eventually it will hit. And only one bad record compromises the full history for that uid. ↩︎