Human Web Proxy Network (HPN)

Preventing record linkage for improved anonymity. An introduction to the HPN protocol.

With Human Web, users can contribute data that will be used to build services such as Cliqz search, anti-tracking, and WhoTracks.me. By design, we prevent record linkage, so no data element we collect can be associated with a user (see the previous article on Human Web if you are not familiar with record linkage and re-identifiability). We need to collect data, but it has to be anonymous.

But how can we provide anonymity? And how can you, as a user, ensure that we stick to our promises?

This blog post will cover two parts:

  1. Understand the risks: Why is the ability to decide whether messages are from the same sender a threat to data anonymity?
  2. Learn about HPN: How it prevents record linkage based on network fingerprinting, but still helps to detect malicious data injection.

If you came from the previous blog post on Human Web and are only interested in the HPN (Human Web Proxy Network) protocol, you can skip the first part, and jump right to part 2.

Part 1: Understanding the dangers of record linkage

Data can be valuable, but it comes with responsibilities. Unfortunately, we live in times when data breaches affecting millions of people are becoming more and more frequent. Once data is stolen, you lose control over how it is used.

There are situations where you have to store sensitive data:

  • Compliance with anti-money laundering regulations (KYC)
  • Medical data (health records of patients)

Fortunately, for Human Web, there is no need for that. All of our data-driven services (e.g. search, anti-tracking, even the advertisements) are designed to operate exclusively on anonymous data. It is a deliberate business decision, even if it comes with trade-offs.

It is our belief that the best protection is to keep sensitive data out of our system.

However, the answer is not to stigmatize data collection per se. When data can be collected in an ethically responsible fashion, one that respects people's privacy, the benefits that data brings to society will outweigh the risks. We discussed that topic in Is Data Collection Evil?. Nevertheless, data anonymization is a hard problem and easy to get wrong.

An early, infamous example happened in 2006 when AOL published a set of search data. Although AOL did not include the identity of the users, it was often possible to find out a user's real name by looking at their search history.

De-anonymization attacks also become more potent once attackers get hold of additional datasets (Background Knowledge Attack). In a thought experiment, imagine an attacker has the abilities of a company such as Google, Amazon, or Facebook. In combination with existing data sources, de-anonymization becomes easier.

Since we consider it impossible to control the privacy side effects once separate messages can be systematically linked to the same sender, attaching session information (e.g. device IDs) to messages is strictly forbidden in Human Web.

Requirements for messages

Unlinkability of messages is a necessary property for anonymity in Human Web, although on its own it is not sufficient. In addition, messages have to be free of meta information that can be used to de-anonymize a user. As an obvious example, messages must not contain identifiers. Otherwise, linking messages becomes straightforward, and given enough messages, de-anonymization becomes possible.

In reality, violations of this requirement can be more subtle. Often it is a combination of parameters that, taken together, can be used for fingerprinting. This is closely related to the observation that k-anonymity has known weaknesses for high-dimensional datasets.

For that reason, messages should be as simple as possible:

  • Data minimization: Include only the minimum amount of information for a given use case.
  • Atomicity: Splitting unrelated information over multiple messages is better than sending it in one combined message.

Interestingly, the second technique is sometimes misinterpreted by users. Sending multiple messages can be perceived as sending more data, although the technique itself is intended to reduce the amount of information being shared. Note that, in both cases, the server ends up with the same data points. The difference is only that in the first case, the server also knows that all data came from the same device, while in the second case, this information is lost.
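To make this concrete, here is a sketch of the two message shapes (the schemas are illustrative, not the actual Human Web format):

// Illustrative only: not the real Human Web message schema.
// Combined: the server learns that ONE device has both properties,
// which makes fingerprinting easier.
const combined = {
  type: "demographics",
  browserLanguage: "de",
  platform: "linux",
};

// Atomic: the same two data points, sent as separate, unlinkable
// messages, so the server cannot tell whether they share a device.
const atomicMessages = [
  { type: "browser-language", value: "de" },
  { type: "platform", value: "linux" },
];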

Note that Human Web implements additional heuristics that messages have to pass; otherwise, those messages will be dropped. We will not go into detail here; the techniques are described in the previous blog post on Human Web.

For the rest of the article we will assume that individual messages are safe. What we need now is an anonymous channel to transfer our message to the server.

Part 2: Preventing network fingerprinting

In the first part, we saw the dangers of data collection systems that allow profile building by linking data points to the same sender. We concluded that, for our use case (first and foremost, building a competitive search engine), preventing record linkage at the message level is an effective defence against de-anonymization attacks, even if it means that some messages will have to be dropped at the client before they are sent to the server.

Next, we will cover two questions:

  • User perspective: When data is sent, how can you guarantee anonymity? Don’t you still see my IP?
  • Our perspective: When we receive data without knowing the sender, how can we trust it?

First, we will explain how our system implements an anonymous channel by sending data through a third-party proxy. We will see that this alone is not enough, as there are other pitfalls at the network level. Then, we will give an introduction to HPN (Human Web Proxy Network), which includes a cryptographic protocol to enforce rate limits without revealing the identity of the sender.

Sanitizing HTTP requests

Let us first clarify the term network fingerprinting: whenever data is sent over the internet, the receiver gets the sender's IP address along with the message. That information alone can be used to link multiple messages to the same network. Besides the IP address, there is additional data (e.g. user-agent HTTP headers, TCP session information). Individually, this information may not be reliable as an explicit identifier; when combined, however, it might be unique enough to link messages to one device, at least for a limited period of time.

To prevent that attack, the Cliqz extension removes HTTP headers such as user-agent, accept-language, and origin.

The origin header in particular deserves attention: whenever an extension makes HTTP calls using the fetch API, Firefox automatically sets the origin header. It includes the extension ID, which is unique per user (a random UUID generated at installation):

origin: moz-extension://539a0d5b-60d9-4726-b3de-27d6979edc26

This illustrates how easy it is to end up with a strong identifier in the transmitted data if we do not pay enough attention.

We solve this particular problem by using the webRequest API to strip the header after Firefox has added it, but before the request is sent.
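A minimal sketch of that approach (the collector URL is a placeholder, and the real extension handles more headers and edge cases):

// Strip identifying headers from our own telemetry requests before
// they leave the browser. The endpoint below is a placeholder.
const STRIPPED = new Set(["origin", "user-agent", "accept-language"]);

browser.webRequest.onBeforeSendHeaders.addListener(
  (details) => ({
    requestHeaders: (details.requestHeaders || []).filter(
      (header) => !STRIPPED.has(header.name.toLowerCase())
    ),
  }),
  { urls: ["https://collector.example.com/*"] },
  ["blocking", "requestHeaders"]
);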

Sending messages via Proxy Servers

Finally, to hide the IP address, we need to send the data through a third party. We experimented with sending data through the Tor network, but since our code needs to run inside a WebExtension, this approach turned out to be impractical[1].

In the end, we settled on a more conventional solution. Cliqz has a contract with the VPN provider FoxyProxy, which takes the role of a trusted third party. Instead of sending directly to our servers, the Cliqz extension sends each request through their proxy servers. The data is end-to-end encrypted, so FoxyProxy will not be able to read or modify the content:

FoxyProxy operates under GDPR regulations, keeps no logs, and most importantly, will give us neither access to the machines nor any information that we can use to determine the original IP of the sender.

As there is no WebExtensions API to force a connection to be closed, the proxy servers are available under multiple subdomains:

proxy1.cliqz.foxyproxy.com
...
proxy100.cliqz.foxyproxy.com

By randomly selecting one of these subdomains whenever we send a message, we prevent HTTP connection reuse. Otherwise, when a client sends a burst of messages, they could arrive over the same TCP connection.
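A sketch of the endpoint selection (hostnames as above):

// Pick a random proxy subdomain for every message, so consecutive
// messages do not reuse the same HTTP connection. Non-cryptographic
// randomness is fine here; it only spreads connections.
function randomProxyUrl() {
  const n = 1 + Math.floor(Math.random() * 100); // proxy1 ... proxy100
  return `https://proxy${n}.cliqz.foxyproxy.com/`;
}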

HPN protocol

The system as described in the previous section is called Human Web Proxy Network, or HPN for short. As the name implies, it is part of Human Web, where it is used to allow users to anonymously send messages from their browser to our servers.

When looking at two individual messages, we should not be able to decide whether they came from the same user. From a privacy standpoint, that is a win, but there is one caveat: once you allow anonymity, it becomes more difficult to stop manipulation attempts by malicious clients. This is a concern. Without effective countermeasures, we would be inviting malicious actors to tamper with our search results.

Although the vast majority of clients (operated by normal users) will send legitimate data, due to anonymity there is no concept of trust. Thus, it is impossible to prevent attackers from sending malicious data. Still, we have to take measures to prevent them from overruling the majority to subvert the system. This is where the HPN protocol comes into play, as it allows us to prevent those attempts by enforcing rate limits without revealing the identity of the sender.

What could such an attack look like? Imagine that Mallory, a malicious actor, owns a website. To give it a boost in popularity, she decides to send us the URL she wants to promote a hundred times, to make us believe those were real visits by different people. That attack should not be possible, thanks to the rate-limiting system. Mallory might still be able to send fake messages, but not enough to alter the ranking of the search. Of course, it would be easier to enforce those limits if we had user identifiers, but those are forbidden to preserve anonymity and privacy. We will revisit this example in the section How to detect duplicates.

Overall, attacks on the system should become harder with each additional user participating in Human Web. At the moment, we estimate the number of users contributing Human Web data to be on the order of millions (counting daily active users). Each additional honest client is a stabilizing factor and increases the amount of resources that an attacker needs to bias the results, be it for spreading misinformation or for promoting links to malware. Preventing manipulation is important, not only for us, but also for you, as a user of the search!

By sharing Human Web data, you are not only supporting us, but you will also play a role in keeping all other users safe.

We need a mechanism that allows senders to prove, without revealing their identity, that they have never sent an identical message before. The server requires only enough information to decide whether to trust the sender (and accept the message), or to reject it. In cryptography, the term for this kind of protocol is zero-knowledge proof.

How to detect duplicates

A quick recap:

  1. Messages must not contain sensitive information.
  2. Messages must not be linkable based on their content.
  3. Before sending, strip all unnecessary information related to the transportation protocol (HTTPS).
  4. Send through a third party to eliminate network identifiers (IP hiding).

After these steps, the server gets messages that it cannot link back to the original sender.

Imagine we want to improve the ranking of our search by looking at incoming messages containing the search term and the page that was chosen. Let us assume the following messages are sent:

  • Alice: { "yt" => "https://youtube.com" }
  • Bob: { "yt" => "https://youtube.com" }
  • Mallory: { "yt" => "https://mallory.com/evil" }
  • Mallory: { "yt" => "https://mallory.com/evil" }
  • Mallory: { "yt" => "https://mallory.com/evil" }

As the server only sees the message itself, without knowing who sent it, it could be tricked into believing that Mallory's page is the best result for the search term “yt”. The question is: how can the server detect the duplicates and stop Mallory from outranking the messages from Alice and Bob?

Direct Anonymous Attestation (DAA) and zero-knowledge proofs

The approach used in HPN is based on work by Jan Camenisch and Anna Lysyanskaya (Signatures with efficient protocols). Their signature scheme is used for a cryptographic primitive called Direct Anonymous Attestation (DAA).

Today, DAA is mainly used in hardware, inside a Trusted Platform Module (TPM). Most likely, your notebook or smartphone has such a hardware chip that implements DAA. TPMs are out of scope of this document, but we mention them briefly to avoid confusion, as most literature on DAA focuses on TPMs. DAA adoption came with the TPM 1.2 standard, which improved privacy by eliminating the need for a trusted third party (the Privacy CA in TPM 1.1).

Even though it is widely deployed now, Trusted Computing remains a controversial topic. However, the concerns are about how the capabilities of these chips can be abused, not about the cryptographic primitives inside them.

In HPN, we are not using a TPM. Instead we have built a software implementation, which we open-sourced under the MPL 2.0 license: anonymous-credentials.

What our implementation has in common with TPMs is the algorithm, which is similar to the one in the latest TPM 2.0 Library specification. Both use pairing-based elliptic curve cryptography with 256-bit Barreto-Naehrig curves. For the elliptic curve and pairing primitives, we are using the MIRACL Core library (formerly known as Milagro Crypto C).

In this blog post, we will not go into details about the cryptography, nor do we assume any knowledge of elliptic curve or pairing-based cryptography. It is a difficult topic, but for the interested reader, the first two chapters of Ben Lynn's thesis can help you get familiar with the math behind pairings and elliptic curves and introduce the notation.

Camenisch-Lysyanskaya signatures are introduced in their paper (A Signature Scheme with Efficient Protocols). To get at least a general idea of how they work, it helps to look at an older scheme: Schnorr signatures. Some of the ideas published by Schnorr in 1991 reappear in the more complex signature schemes used in DAA.

From a high-level perspective, what all these signature schemes have in common is that they rely on the hardness of the discrete logarithm problem: the discrete logarithm is the secret key that only the signer knows.

The trick used by Schnorr, and also in the more complex schemes, is a construction based on hashing. For correct signatures, both signer and verifier end up with the same hash value, although the steps that the two sides take are different:

  • The signer combines the message, a random number and the secret key.
  • The verifier combines the message, the public key of the signer and the signature (the resulting hash computed by the signer).

Due to some neat math tricks, for correct signatures, the hashes will be the same for both sides. However, without knowing the secret, there is no systematic way to forge such a signature.
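As a concrete illustration, here is a simplified textbook Schnorr scheme (not the exact construction used in DAA). Work in a group of prime order q with generator g, and fix a hash function H:

  • Keys: the secret key is x, the public key is y = g^x
  • Sign(m): pick a random k, compute r = g^k, e = H(m ‖ r), and s = k − x·e (mod q); the signature is (e, s)
  • Verify(m, e, s): compute r′ = g^s · y^e and accept iff e = H(m ‖ r′)

Correctness follows from g^s · y^e = g^(k − x·e) · g^(x·e) = g^k = r: for a valid signature, both sides hash the same value, exactly as described above. Without knowledge of x, there is no known efficient way to produce a matching pair (e, s).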

In the more complex schemes by Camenisch and Lysyanskaya, similar constructions are used[2], which allow one party to prove to the other that it knows one or more secrets, but without revealing the secrets. This is the idea behind the zero-knowledge proofs in their work.

Direct Anonymous Attestation in HPN

Let us now look at the DAA protocols in the context of HPN to understand how they can be used to detect duplicates without revealing the sender's identity. The explanations stay informal; formal definitions can be found in the paper Preventing Attacks on Anonymous Data Collection.

Instead of two parties (user/sender and receiver/server), three parties are now involved conceptually:

  1. DAA member: a user who wants to send data.
  2. DAA issuer: verifies that the user can be trusted and gives out DAA credentials.
  3. DAA verifier: verifies that data was correctly signed with DAA credentials.

Getting credentials from the issuer is called the “join” operation, while the interaction with the verifier is the “send” operation.

Join operation

First, each user creates their own RSA key. The public key will be shared with the issuer to obtain DAA credentials. However, it will not be shared in the send operation; otherwise, we would immediately lose anonymity, as the key is long-lived and can therefore be linked to the user's identity.

The join operation itself is an interactive protocol. It has to guarantee that once the issuer has given out credentials to multiple users, the verifier will not be able to distinguish which of the issued credentials was used to create a signature.

In other words, it will not be possible to link signed messages based on which user signed them, with one exception that we will see in a moment: the verifier will be able to decide whether two signatures for an identical message were created with the same credentials. This controlled form of message linkage is exactly the property that we need to reject duplicates.

Send operation

After a successful join, the user can use the DAA credentials to sign messages and send them to the verifier. Each signature is a zero-knowledge proof to verify that the user has valid credentials and that the signed message has not been sent before by that user.

The zero-knowledge property is crucial, as the signature has to incorporate the user's credentials to detect collisions with previously sent messages. Being zero-knowledge means that the verifier cannot learn anything about the user's identity, regardless of whether the signature is valid or not.

Basenames and how they can be used for rate limiting

For simplicity's sake, the meaning of “identical message” has not been explained yet. DAA has the concept of a “basename”: a function that maps each message to a value. Both sender and verifier need to agree on the same algorithm to compute the basename. If there is a mismatch, the verifier will reject the message.

In our previous example, the basename could be defined as the normalized search term:

  • { "yt" => "https://youtube.com" }: yt
  • { "twit" => "https://twitter.com" }: twit

The normalizing step is important; otherwise, you could bypass limits by modifying the query, for example by adding whitespace:

  • { " twit" => "https://twitter.com" }: twit

Now we can allow a user to send one message for each search term. You could say the user can vote once per search term; attempts to vote multiple times will be rejected by the verifier.
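Here is a sketch of both sides of that idea (simplified: we use the normalized query as the basename, and we assume the DAA layer derives an unlinkable, per-credential tag for each basename):

// Client and server agree on the basename: the normalized query.
function basename(query) {
  return query.trim().toLowerCase().replace(/\s+/g, " ");
}

// The DAA signature deterministically derives a tag from
// (credentials, basename): same user and same basename give the
// same tag, but the tag reveals nothing else about the user.
const seenTags = new Set();

function acceptMessage(tag) {
  if (seenTags.has(tag)) return false; // duplicate: Mallory voting again
  seenTags.add(tag);
  return true;
}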

If we wanted to allow three votes, we could extend the message to contain a number between one and three, and include it in the message as well as the basename.

After that change, the following messages will all be allowed:

  • { "yt" => "https://youtube.com", 2 }: [yt,2]
  • { "yt" => "https://en.wikipedia.org/wiki/YouTube", 1 }: [yt,1]
  • { "yt" => "https://de.wikipedia.org/wiki/YouTube", 3 }: [yt,3]

After these three messages, the rate limit is exceeded, and the user can send no more messages for that query.
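As a sketch, reusing the basename() helper from above (the encoding is illustrative):

// Allow up to three votes per query by making the vote number part
// of both the message and the basename.
function countedBasename(query, voteIndex) {
  if (voteIndex < 1 || voteIndex > 3) {
    throw new Error("vote index out of range");
  }
  return JSON.stringify([basename(query), voteIndex]);
}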

Another example would be to limit sending to one message per hour, by redefining message and basename to include the hour instead:

  • { "yt" => "https://youtube.com", 22 }: [yt,22]
  • { "yt" => "https://youtube.com", 23 }: [yt,23]
  • { "yt" => "https://youtube.com", 0 }: [yt,0]
  • { "yt" => "https://youtube.com", 1 }: [yt,1]

In this example, the verifier should additionally reject all messages that fill in the wrong timestamp. Only if a user sends them at the expected time will the verifier accept them.
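Again as a sketch (reusing basename() from above; the real protocol defines limits per message type):

// One message per query and hour: the hour becomes part of the basename.
function hourlyBasename(query, hourUtc) {
  return JSON.stringify([basename(query), hourUtc]);
}

// Verifier side: in addition to duplicate detection, reject messages
// whose claimed hour does not match the current time.
function hourIsPlausible(claimedHour) {
  return claimedHour === new Date().getUTCHours();
}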

All limits will be reset when the group is rotated. In our current implementation, that happens once a day. Group rotation means that new group keys will be generated. Once issuer and verifier have switched to the new keys, clients are forced to join again to get new credentials, as their old ones will no longer be accepted.

Preventing Sybil attacks

The motivation for enforcing daily join operations is to mitigate the effects of Sybil attacks, where an attacker builds up a large number of identities. Forcing join operations combined with traditional rate limits can be used as a defensive mechanism, but it is hard to eliminate the possibility of Sybil attacks completely.

There is also a trade-off between security and acceptance. Some approaches would be great for security, but inconvenient for the user (e.g. by demanding proof of a valid email address when joining for the first time). HPN is not a silver bullet that solves all problems. It still brings value, however, by allowing different strategies for handling otherwise conflicting goals:

  1. Fraud detection (Is it a real user or a bot? Every bit of information helps.)
  2. Anonymous sending (Only the message content counts, and nothing else.)

Meeting both goals at once is impossible, as the requirements conflict. If you invest more in fraud detection, you put anonymity at risk; if you improve anonymity, you open yourself up to additional attacks.

We can eliminate the conflict by separating the steps into a dedicated join operation and a dedicated send operation. During join, we do not need to hide the user's identity, as no data is transferred; during send, we can rely fully on the signatures.

This allows us to implement different strategies for the two operations. For example, in Cliqz, sending is always done through FoxyProxy to hide the IP address, while attempts to join through the proxies will be rejected.

Protecting the user from de-anonymization attacks

Ideally, the protocol should resist attacks from all sides. We have already looked at attacks by the sender (data injection and Sybil attacks), but the protocol should also protect the user in situations where the proxy, issuer, or verifier (or all of them) are malicious and try to de-anonymize the sender. In other words, can you trust the protocol to protect you from attempts by the other actors to de-anonymize you?

Proxy: Attempts to read or change the message

We are not aware of any attacks available to the proxy. Attempts to directly read or modify the message are prevented by the authenticated encryption layer (in our case, AES-GCM-128). What the proxy does see is the size of a request and its response; without padding, statistical attacks would be feasible.

Originally, we used fixed buckets of 16K, which had the advantage that information leakage through size could be ruled out. Switching to dynamic bucket sizes (currently 1K, 2K, …, 32K) introduced such leakage, but we are not aware of any way to exploit it to gain useful information about the message content. As a side note, dynamic bucket sizes were an important performance optimization, reducing the amount of data sent by a factor of 10.
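A sketch of the padding step (the intermediate bucket sizes are an assumption; the post only names 1K, 2K, and 32K):

// Pad the encrypted payload to the next bucket size, so the proxy
// only learns a coarse size class instead of the exact length.
// Power-of-two buckets are assumed here purely for illustration.
const BUCKETS = [1, 2, 4, 8, 16, 32].map((kb) => kb * 1024);

function padToBucket(payload) {
  const size = BUCKETS.find((b) => b >= payload.length);
  if (size === undefined) throw new Error("message too large");
  const padded = new Uint8Array(size); // zero-filled padding
  padded.set(payload);
  return padded; // a real implementation also encodes the true length
}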

Issuer: Creating one group per user

The issuer could attempt to create not only one global group, but one group per joining user. If successful, it would allow message linking, as a successful signature proves that the user is part of the group, and no other user would have been able to pass the verification for that group.

To spot the attack, users can download our public group key over a different channel (e.g. through the Tor Browser), or compare the keys that other users saw. If there is a mismatch, it would be hard for us to hide the fact that the keys differ from the published ones.

As a protective measure, the client automatically detects when already published keys change, and invalidates all keys. The effect is that the client stops sending any data until all invalidated keys have been rotated out, which takes three days.

Issuer: Rejecting joins

Another attack also targets the number of clients in one group. But instead of tricking users into different groups, the issuer could reduce the number of users in a group by preventing other users from joining. In other words, it could let only a specific subset of users join, with the effect that all incoming messages would have to be from users in that subset.

This attack is harder to spot in the beginning. In the long run, it would be visible once the word spreads that some users are unable to join, while others can (and it is not simply a bug in the client).

There is no way to prevent that sort of attempt in the client code. However, the consequence of preventing most users from joining is that we would sacrifice the majority of our data collection. Theoretically, we could single out a specific user, but in doing so, we would lose all data collection during that time. That makes the attack impractical.

Attacks by the verifier (with help from the proxy)

We are not aware of de-anonymization attacks on the cryptographic part of verifying the signature of knowledge itself. An independent attack scenario would involve the proxy and verifier colluding and sharing information. In that scenario, the verifier could correlate messages by network information such as IP addresses.

Although there is no theoretical defence against this attack, in practice it would be difficult to execute. First, both companies would have to violate laws. Second, it is hard to imagine that a proxy provider would agree and risk its reputation in an industry where trust is paramount.

Links to the code

If you can think of more attacks, or find security flaws in the implementation, please get in contact with us (email: privacy [at] cliqz [dot] com).

The Cliqz extension is open source. Most of the cryptographic part of the protocol can be found in the anonymous-credentials library, which is a C library built on top of MIRACL. It gets compiled to WebAssembly, so we can use it inside the Cliqz extension. More information about the protocol itself can be found in the paper.

The server code is not public, but it is a Node.js application that uses the same anonymous-credentials library. However, instead of compiling to WebAssembly, as in the browser, on the server we compile the library to native code.

Final words

The HPN protocol, as described in this blog post, has been in production since the second half of 2018. The migration from the previous protocol (based on blind signatures) was completed in February 2019. The amount of traffic varies over the day (most of our users are in Europe and the United States), but during peak time we get around 5,000 messages per second.

In the new version (hpnv2), we were able to address some limitations that we faced in the old protocol (hpnv1):

  • Sending used to consist of two messages and required interaction with different entities (a blind signer and a custom proxy). In hpnv2, the need for the blind signer is gone (a performance win), but more importantly, it allowed us to send data over non-custom proxies. Our new setup with FoxyProxy would not have been practical with hpnv1.
  • In the send operation, hpnv2 uses elliptic curve cryptography instead of RSA, which reduces the CPU overhead of the Human Web data collection, both on the client and on the server.
  • hpnv2 supports flexible rate limits through basenames (e.g. allow up to 10 messages per hour), while in hpnv1 only duplicate detection was implemented.
  • hpnv1 had an intentional message loss of about 5% as part of the protocol, a statistical defence against certain message-linkage attacks. By switching to DAA, we no longer need to introduce this kind of intentional loss.

Over the last three articles, we presented the Cliqz data collection. This ends our mini-series on anonymized data collection:

  1. Why do we need it? Is Data Collection Evil? Privacy or data, a convenient false dichotomy
  2. How to prevent record linkage in the data itself? Human Web–Collecting data in a socially responsible manner
  3. How to prevent record linkage on the network layer? This article, introducing HPN

We hope you enjoyed it.

Tomorrow, we will switch gears and go straight to search. If you want to know how Cliqz search works, stay tuned!

Footnotes


  1. If you are interested, we open-sourced our prototype implementation, which builds the Tor client with WebAssembly. The resulting code can be bundled within an extension. However, as extensions cannot create TCP sockets, only WebSockets, it is not possible to connect directly to the Tor network. We experimented with connecting through a Tor Bridge (listening for WebSocket connections), but gave up on the idea, as it introduces the possibility of timing attacks. As long as we self-host the bridge, de-anonymizing users by correlating requests is feasible, as we would control both the entry and exit of the network. In that scenario, Tor's anonymity guarantees do not hold. ↩︎

  2. A formal definition can be found in Efficient group signature schemes for large groups, in section 3.3: Signature of Knowledge of Discrete Logarithms. ↩︎