Blocking tracking without blocking trackers

privacy

web

tracking

Cliqz's self-learning anti-tracking: Past, Present, Future

December 19th, 2019

As we mentioned at the start of this blog series, Cliqz set out from the beginning to build products with both data and privacy. In “Is Data Collection Evil?” and the “Human Web” posts we showed that linking together information about users’ actions on the Web is a dangerous privacy minefield, and the steps we take to ensure that we do not do that. At the same time, as a user browses the Web, hundreds or even thousands of companies are collecting this kind of data, so-called online trackers. This post covers how we developed the most advanced anti-tracking system currently available, how we leveraged that to publish the largest public dataset on the behaviour of trackers across the Web^[1], and how we continue to improve our protection to counter constant changes in the tracking landscape.

Past

Back in 2015 online tracking was ubiquitous and easy. All major browsers (with the notable exception of Safari) shipped with no limitations on third-party cookies and storage per default, and tracking protection was limited to a handful of browser extensions like Ghostery and Privacy Badger, or Adblockers with ‘privacy’ blocklists.

At the time, we were shipping our search as part of a Firefox extension. Having developed a principle of forbidding record linkage in our own data collection, we wanted to extend this protection to the rest of our users’ web browsing. To align with our principles, the intention was not to prevent ‘safe’ data being collected, but to instead prevent the collection of profiles. This meant that we would try to avoid blocking requests, and instead remove identifiers such as cookies and browser fingerprints.

Protecting from online tracking follows the pareto principle^[2]—that for roughly 20% of the effort you get 80% protection. In this case, 80% of protection comes from simply blocking third-party cookies. This was the first thing Cliqz anti-tracking did.

Data-driven Fingerprint Protection

The second component of the Cliqz anti-tracking that we shipped in 2015 was our “Unsafe Data Removal”, or anti-fingerprinting protection. Browser fingerprinting is the process of generating unique user identifiers via a stateless method. Rather than storing data in the browser like a cookie, fingerprinting runs a function that analyses the user’s browser environment to output a unique value. This method became infamous due to its ability to work across browser profiles, in private mode and more.

The idea behind this component came with the realisation that a browser fingerprint is only useful if it is unique to an individual user. Therefore, given a potential fingerprint value sent by a tracker, if we can find enough examples of it being sent from other users’ devices, we know that this cannot be used for record linkage, and is therefore safe in this regard. Anything that cannot be proven safe should be removed.

This was a departure from other anti-fingerprinting methods at the time, such as Tor’s, that focused on minimising the entropy that can be generated from various browser APIs. Rather than focusing on preventing fingerprints being generated, we detected, and blocked them from being sent back to the tracker server. As this method is agnostic of the actual fingerprinting technique, it does not require us to constantly vet each new API added to the browser, and is thus low maintenance in the long term.

When loading a page in the browser we analyse each request to third-party domains on that page. We look at the HTTP headers and URL parameters in order to extract a set of key-value pairs of data that is being sent by that tracker. Any values deemed to be not safe will be stripped from the outgoing request.

How do we determine if a value is safe? A value is safe if it cannot be used as a user-identifier to link messages for a specific user. Therefore, we can immediately exclude some values locally, if we can determine they are not persistent, or lack the required entropy:

If a value has not been seen previously for the (third party, key) pair, it is safe.
If a value is too short, i.e. has too little entropy to be a UID, it is safe.
If more than 3 different values have been seen for a (third party, key) pair over a two-day period, then the value is not persistent, and therefore safe.

If we cannot determine safety locally, we can check if a value is not unique by looking in the global safe set. This is a set of keys and values that have been observed by a quorum of users and are thus determined to be safe. This set is calculated by aggregating the key-value tuples we extracted from the tracker URL earlier over the entire user-base. Note: Data is split into minimal tuple sets and data values hashed before sending to prevent both record linkage and collection of sensitive user data—no plaintext values will leave user devices. We employ client side rate limiting (enforced on the server side with HPN) in order to count unique users per (tracker, key, value) tuple without the need for a user identifier.

The global safe set, along with a dynamically generated list of domains we consider to be trackers, is updated daily on our backend, and served as a single Bloom Filter^[3] to all clients. This allows the client to check against the global safe set and tracker list in real time while loading URLs. Using a Bloom Filter allows us to reduce the in memory size of the list—it is usually less than 0.5 MB.

Further technical details of this system can be found in our paper^[4], previous blog posts, and the open-source code. Note, that this system is not intended to provide 100% protection against fingerprinting—from this description you can probably come up with a simple attack to circumvent it. However, it is very effective against fingerprinting as it is commonly deployed today, and without the onerous limitations of other anti-fingerprinting techniques, like Firefox and Tor’s resistFingerpinting^[5]. When the attack model is that we have to allow malicious actors to run code (i.e. JavaScript)—NoScript is not a viable option for 99% of web users—we will always be on the back-foot.

WhoTracks.Me

Our anti-tracking system went live for users of the Cliqz browser extension in 2015, and the first version of the Cliqz desktop browser shipped in 2016 with anti-tracking on by default. We also published a paper, Tracking the Trackers^[4:1], at the WWW 2016 conference with our methodology. This paper included a detailed analysis of tracking behaviour in the wild, collected over 350,000 page loads observed by our system.

While the 2016 article was useful to get a picture of just how prevalent tracking is across the Web, it was just a snapshot. Our data-driven protection collects data on the presence of trackers across the Web, in order to generate dynamic tracker lists that respond to threats as they appear. We realised that we could use this data to create a regularly updated transparency report on trackers and websites using trackers. This could then be used for longitudinal analyses of tracking trends.

This project was realised in 2017, when we launched whotracks.me, the largest public resource on online tracking to date. We also published the data alongside the site, to allow researchers to dig around for interesting trends. This proved very useful, for example to measure the impact of the introduction of the GDPR on tracking in the EU.

Tracking Facebook’s reach over 2.5 years [source]

Alongside the website, we also published a paper^[6], outlining how we collect the data behind the site, and expanding on how we prevent any record linkage in this data set.

WhoTracks.Me continues to be updated on a monthly basis, and each month we analyse around 1.5 billion page loads. In total the data covers 2.5 years and almost 20 billion page loads, making it the most comprehensive data source on online tracking that we know of.

Ghostery

In 2017, Ghostery was acquired by Cliqz, and we immediately set to work combining Ghostery’s detailed tracker database with Cliqz’s technologies. The result was Ghostery 8, that combined the detailed reporting and blocking control of Ghostery with Cliqz’s system. This was also a boon for WhoTracks.Me, which could include a much larger number of pages in its analyses.

Present

In 2017 Apple’s Webkit team announced “Intelligent Tracking Prevention” (ITP) for Safari, a system for limiting storage access for trackers which would be enabled in Safari by default. In September this year Firefox turned on tracking protection by default. The new Chromium-based Microsoft Edge browser will ship with tracking prevention on by default. In short: the consensus among browser vendors^[7] is that anti-tracking should now be a core browser feature, and it should be on by default—something we have been doing at Cliqz from the start.

In this environment it may seem that the value proposition of Cliqz and Ghostery’s protection is diminished. However, this is not the case:

Firefox and Edge storage blocking all work on a blacklist system. This means a fixed list of domains are considered as trackers and have their storage access constrained in third-party contexts. In Firefox and Edge’s cases, this list is manually curated, meaning that it is slow to update to new domains being used for tracking. We also constrain cookies for tracking domains, but using a list updated daily based on the real activity and presence of trackers, as observed by our users.
Firefox and Edge will by default block requests to domains identified as fingerprinters or cryptominers by Disconnect. These categories currently contain 106 and 18 entries respectively, many of which are no longer active. In contrast our anti-fingerprinting system is currently removing identifiers from over 1.3k domains (10x more).
The Disconnect blocklist used by Firefox and Edge has limited coverage (particularly for trackers used outside of the US), and is updated irregularly. Nowadays, with increasing usage of tracker and ad blockers, trackers, and in particular less scrupulous ones, are turning over domains faster and faster in order to stay ahead of the blocklists. In this environment, our data-driven approach gives us a significant advantage.

It should be noted that as more users now have tracking protection while browsing the web, it makes it easier for others to implement tracking protection. ITP has already shifted developer assumptions about when cookies and storage are available in third-party contexts, which in turn means there are fewer cases where sites break because of cookies being blocked.

Future

The Cliqz anti-tracking system has been continually tweaked over the years to improve performance, reduce false positives and breakages, and improve protection. Alongside this we continue to develop new components to complement the existing protection.

Most recently we have been looking at the browser’s cookie storage retention policy. By default, the browser will store a cookie for as long as the site issuing asks for it to be stored for, even if it is over 50 years. Safari’s ITP introduced the idea that maybe we should not always trust sites on cookie retention, and sometimes purge them earlier. We introduced a new cookie policy in the Cliqz browser which applies some policies to limit cookie retention. For example, if a cookie is set on a tracker domain (e.g. if a user disables protection for a site with trackers), then the cookie will be deleted after 1 hour. If the user has actually visited this domain as a first party, this window is extended, up to 30 days. This system acts as a safety net, cleaning up any cookies that may get through the standard cookie blocking.

This can also be extended to non-trackers. Most sites you visit will drop a first-party cookie which won’t expire for several years after your visit, and with third-party cookies becoming less useful for trackers, many have started dumping their identifiers on first-party domains. Most users would not expect to be remembered by a site over a year after visiting. We set this to a more reasonable 30 days, which is refreshed if you visit the site again.

The recent changes to browser defaults have shifted the tracking landscape significantly, so we must be continuously on the lookout for new techniques being used to circumvent the current protections. One such example is CNAME masking, which hides third-party trackers behind first-party hostnames to avoid being blocked. On Firefox we can detect this as a DNS API is available to browser extensions, however on Chrome this will not be possible.

This speaks to one of the past and future challenges of building anti-tracking in the browser: platform limitations. Up to this point Firefox bootstrap APIs, and then WebExtensions APIs gave us the ability to build these protections up with relative ease. It is easy to hook into browser internals to block requests and modify cookies, for example. Some are more tricky, but possible: Modifying cookie retention requires a parallel cookie database to add a cookie creation date. Most issues we have had have been on mobile, where the Webview APIs are more limited. The anti-tracking system described here cannot be implemented for our iOS browser due to the limitations of the WKWebview APIs—it does not provide information about individual requests made. Chrome’s Manifest V3 proposals threaten the same fate for our ability to do anti-tracking on that platform.

Conclusion

We built our anti-tracking with simple principles in mind: Prevent trackers linking user data cross-site, prefer not blocking requests if possible, and minimise site breakage. These principles were designed to build in incentives to use privacy-by-design approaches. If trackers managed to solve their use-case without identifiers our system would not affect them. We deployed this protection on by default in 2015, and now the majority of mainstream browsers are following suit.

You may be wondering why we don’t just block trackers by default. Most adblockers support privacy lists such as EasyPrivacy that claim to block tracking requests. This is only partially effective against tracking, for the following reasons:

The tracking ecosystem is highly dynamic (as shown by the WhoTracks.Me data), and new trackers appear constantly. While list maintainers are very quick to add new entries, they can only add what they can see. In the past we have seen trackers deliberately rolling new domains to keep ahead of blocklists, more on this in a future post.
Blocking is an all-or-nothing protection. There is no middle ground where a request can be allowed but user identifiers removed. A classic example of this is Google’s Recaptcha. If it is blocked many tasks become impossible for users to complete. If not blocked, all loads of pages with Recaptcha are linked to your Google account (it is loaded from google.com).

Thus, as we will elaborate more on in tomorrow’s post, we believe in having a robust, on-by-default, base tracking protection that doesn’t block. We then use blocking as a supplementary tool for users who want either more aggressive protection, or the speed and quality-of-life improvements that blocking can bring.

We can hope that there is now a critical mass sufficient to change behaviours from trackers and publishers—current evidence suggests they haven’t given up trying to fight it yet. However, the web moves slowly (in some ways), and the mistake of allowing third-party cookies everywhere will leave lasting legacy.

Over the last 4 years of building technology to protect against online tracking, as well as provide transparency and tools^[8] to help users control what companies are doing with their data, we have seen a dramatic shift in attitudes, both from consumers and regulators. Regulation is now in-place to bring tracking into line, and a vocal minority of users are holding companies to task about data usage. While anti-tracking will still be needed for a while yet, we can perhaps (naively) hope for a future where users and their data are respected.

References

WhoTracks.Me. Data available on Github ↩︎
https://en.wikipedia.org/wiki/Pareto_principle ↩︎
https://en.wikipedia.org/wiki/Bloom_filter ↩︎
Z. Yu, S. Macbeth, K. Modi, J.M. Pujol “Tracking the Trackers”, WWW 2016 [pdf] ↩︎ ↩︎
Tor attempts to make all user-agents homogeneous, the downside being that everyone must use the same window size, live in the same timezone and disable other useful web features. ↩︎
A. Karaj, S. Macbeth, R. Berson, J.M. Pujol, “WhoTracks.Me: Shedding light on the opaque world of online tracking”. [pdf] ↩︎
Except Google. The Chrome team have indicated that they would like the web to be more private, but blocking cookies should not be the way to do it, as they’re needed for targeted advertising. ↩︎
We built an open-source browser extension to allow users to view and modify their GDPR consent options for some websites. More recently, we added a feature to the Cliqz browser to automatically apply a user’s consent settings when presented with a consent popup. ↩︎