Low Entropy

Entropy and Privacy Analysis

Aggregation is a powerful tool when it comes to providing privacy for users. But analysis that relies on aggregate statistics to measure privacy loss can hide some of the worst effects of a design.

Background

A lot of my time recently has been spent looking at various proposals for improving online advertising. Much of this work is centred on the Private Advertising Technology Community Group in the W3C, where the goal is to find designs that improve advertising while maintaining strong technical protections for privacy.

Deciding whether a design does in fact provide strong privacy protections first requires understanding what that means. That is a large topic on which the conversation is continuing. In this post, my goal is to look at some aspects of how we might critically evaluate the privacy characteristics of proposals.

Limitations of Differential Privacy

A number of designs have been proposed in this space with supporting analysis based on differential privacy. Providing differential privacy involves adding noise to measurements; a tunable parameter (usually called ε) determines how well individual contributions are hidden under that random distribution.
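As a rough illustration of the role ε plays, here is a generic sketch of the Laplace mechanism with invented numbers; it is not a description of any specific proposal:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return a noisy, differentially private estimate of true_value.

    Noise is drawn from a Laplace distribution with scale
    sensitivity / epsilon: a smaller epsilon means more noise
    and stronger protection for any individual contribution.
    """
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A count query that any one person can change by at most 1.
noisy_count = laplace_mechanism(true_value=42, sensitivity=1, epsilon=0.5)
print(noisy_count)
```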

I’m a big fan of differential privacy. But while it provides a good basis for understanding the impact of a proposal, it is recognized that a long-running system needs to release information continuously in order to maintain basic utility.

Continuous release of data means that the protection offered by differential privacy noise can become less effective over time. It is prudent, therefore, to understand how the system operates without the protection afforded by noise. This is particularly relevant where the noise uses a large ε value or is applied to unaggregated outputs, where the effect of the noise can be easier to cancel by looking at multiple output values.
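To make that concrete, here is a small sketch with invented numbers: an adversary that observes many noisy releases of the same underlying value can average them, and the noise largely cancels out.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value, sensitivity, epsilon = 42, 1, 0.5

# One noisy release hides the true value reasonably well...
releases = true_value + rng.laplace(scale=sensitivity / epsilon, size=1000)
print(releases[0])       # a single release: roughly 42, give or take a few

# ...but the mean of many releases of the same value converges on it.
print(releases.mean())   # very close to 42
```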

Information exposure is often expressed using information theoretic statistics like entropy. This note explores how entropy — or any single statistic — is a poor basis for privacy analysis and suggests options for more rigorous analysis.

Information Theory and Privacy

Analysis of Web privacy features often looks at the number of bits of information that a system releases to an adversary. Analyses of this type use the distribution of probabilities over all events as a way of estimating the amount of information that might be provided by a specific event.

In information theory, each event provides information or surprisal, defined by a relationship with the probability of the event:

$$I(x) = -\log_2(P(x))$$

The reason we might use information is that if a feature releases too much information, then people might be individually identified. They might no longer be anonymous. Their activities might be linked to them specifically. The information can be used to form a profile based on their actions or further joined to their identity or identities.

Generally, we consider it a problem when information enables identification of individuals. We might express concern if:

$$2^{I(x)} \gtrsim \text{size of population}$$

Because surprisal is about specific events, it can be a little unwieldy. Surprisal is not useful for reaching a holistic understanding of the system. A statistic that summarizes all potential outcomes is more useful in gaining insight into how the system operates as a whole. A common statistic used in this context is entropy, which provides a mean or expected surprisal across a sampled population:

$$H(X) = \sum_{x \in X} P(x)\,I(x) = -\sum_{x \in X} P(x)\log_2(P(x)) \approx -\frac{1}{N}\sum_{i=1}^{N}\log_2(P(x_i))$$
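A minimal sketch of these two definitions, using illustrative probabilities only:

```python
import math

def surprisal(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

def entropy(probabilities):
    """Expected surprisal, in bits, over a whole distribution."""
    return sum(p * surprisal(p) for p in probabilities if p > 0)

print(surprisal(0.5))        # 1.0 bit for one coin flip outcome
print(entropy([0.5, 0.5]))   # 1.0 bit for the coin flip overall
```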

Entropy has a number of applications. For instance, it can be used to determine an optimal encoding of the information from many events, using entropy coding (such as Huffman or Arithmetic coding).

Using Entropy in Privacy Analysis

The use of specific statistics in privacy analysis is useful to the extent that they provide an understanding of the overall shape of the system. However, simple statistics tend to lose information about exceptional circumstances.

Entropy has real trouble with rare events. Low probability events have high surprisal, but as entropy scales their contribution by their probability, they contribute less to the total entropy than higher probability events.

In general, revealing more information is undesirable from a privacy perspective. On that basis, it might seem obvious that minimizing entropy is desirable. However, this can be shown to be counterproductive for individual privacy, even as that single statistic improves.

An example might help prime intuition. A cohort of 100 people is arbitrarily allocated into two groups. If people are evenly distributed into groups of 50, revealing the group to which a person has been allocated provides just a single bit of information; that is, the surprisal is 1 bit. The total entropy of the system is 1 bit.

An asymmetric allocation produces a different result. If 99 people are allocated to one group and a single person to the other, revealing that someone is in the first group provides almost no information at 0.0145 bits. In contrast, revealing the allocation for the lone person in the second group — which uniquely identifies that person — produces a much larger surprisal of 6.64 bits. Though this is clearly a privacy problem for that person, their privacy loss is not reflected in the total entropy of the system, which at 0.0808 bits is close to zero.
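These numbers can be checked directly; a short, self-contained sketch:

```python
import math

# Even split: each reveal carries 1 bit, and the entropy is also 1 bit.
even = [50 / 100, 50 / 100]
print([-math.log2(p) for p in even])             # [1.0, 1.0]
print(-sum(p * math.log2(p) for p in even))      # 1.0

# 99/1 split: the common case reveals little, the lone person a lot,
# yet the entropy of the whole system stays close to zero.
skewed = [99 / 100, 1 / 100]
print([-math.log2(p) for p in skewed])           # [~0.0145, ~6.64]
print(-sum(p * math.log2(p) for p in skewed))    # ~0.0808
```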

Entropy tells us that the average information revealed for all users is very small. That conclusion about the aggregate is reflected in the entropy statistic, but it hides the disproportionately large impact on the single user who loses the most.

The more asymmetric the information contributed by individuals, the lower the entropy of the overall system.

Limiting analysis to simple statistics, and entropy in particular, can hide privacy problems. Somewhat counterintuitively, the lower the entropy of a system, the more its adverse consequences tend to be concentrated on a minority of participants.

This is not a revelatory insight. It is well known that a single metric is often a poor means of understanding data.

Entropy can provide a misleading intuitive understanding of privacy as it relates to the experience of individual users.

Recommendations

Information entropy remains useful as a means of understanding the overall utility of the information that a system provides. Understanding key statistics as part of a design is valuable. However, entropy in particular is only useful from a perspective that seeks to limit overall utility; it provides almost no information about the experience of individuals.

Understand the Surprisal Distribution

Examining only the mean surprisal offers very little insight into a system. Statistical analysis rarely considers a mean value in isolation. Most statistical treatment takes the shape of the underlying distribution into account.

For privacy analysis, understanding the distribution of surprisal values is useful. Even just looking at percentiles might offer greater insight into the nature of the privacy loss for those who are most adversely affected.
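As a sketch of what that might look like, the per-user probabilities below are invented, echoing the 99/1 example: the mean surprisal looks harmless, while the upper percentiles expose the person who loses the most.

```python
import numpy as np

# Hypothetical probability of the value that each of 100 users reveals:
# 99 users share a common value, one user reveals something unique.
p = np.array([99 / 100] * 99 + [1 / 100])
surprisal_bits = -np.log2(p)

print(surprisal_bits.mean())                         # ~0.08 bits: looks fine on average
print(np.percentile(surprisal_bits, [50, 99, 100]))  # the tail tells a different story
```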

Shortcomings of entropy are shared by related statistics, like Kullback–Leibler divergence or mutual information, which estimate information gain relative to a known distribution. Considering percentiles and other statistics can improve understanding.

Knowing the distribution of surprisal also makes it possible to combine privacy loss metrics. As privacy is affected by multiple concurrent efforts to change the way people use the Web, the interaction of features can be hard to understand. Richer expressions of the effect of changes might allow for joint analysis to be performed. Though it requires assumptions about the extent to which different surprisal distributions might be correlated, analyses that assume either complete independence or perfect correlation could provide insight into the potential extent of privacy loss from combining features.
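The two bounding cases lend themselves to a simple sketch. With hypothetical per-user surprisal values for two separate features, complete independence means the per-user values add, while complete redundancy (perfect correlation) means the combined exposure is no more than the larger of the two:

```python
import numpy as np

# Hypothetical per-user surprisal, in bits, from two separate features.
feature_a = np.array([0.1, 0.1, 0.2, 5.0])
feature_b = np.array([0.3, 4.0, 0.2, 0.5])

combined_independent = feature_a + feature_b           # no shared information: exposure adds
combined_redundant = np.maximum(feature_a, feature_b)  # fully shared information: the larger dominates

print(combined_independent)   # upper bound on per-user exposure
print(combined_redundant)     # lower bound on per-user exposure
```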

For example, it might be useful to consider the interaction of a proposal with extant browser fingerprinting. The users who reveal the most information using the proposal might not be the same users who reveal the most fingerprinting information. Analysis could show that there are no problems or it might help guide further research that would provide solutions.

More relevant to privacy might be understanding the proportion of individuals that are potentially identifiable using a system. A common privacy goal is to maintain a minimum anonymity set size. It might be possible to apply knowledge of a surprisal distribution to estimate, for a population of a given size, whether the anonymity set becomes too small for some users. This information might then guide the creation of safeguards.
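A rough sketch of such an estimate, under the simplifying assumption that each revealed bit halves the set of people a user could be confused with; the population size, threshold, and surprisal values are all invented:

```python
import numpy as np

population = 1_000_000
k_threshold = 1_000   # smallest acceptable anonymity set

# Hypothetical per-user surprisal values, in bits.
surprisal_bits = np.array([2.0, 5.0, 11.0, 18.0])

# Each revealed bit roughly halves the number of people a user
# could be confused with.
anonymity_set = population / 2 ** surprisal_bits

print(anonymity_set)                # [250000, 31250, ~488, ~4]
print(anonymity_set < k_threshold)  # the last two users need safeguards
```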

Consider the Worst Case

A worst-case analysis is worth considering from the perspective of understanding how the system treats the privacy of all those who might be affected. That is, consider the implications for users on the tail of any distribution. Small user populations will effectively guarantee that any result is drawn from the tail of a larger distribution.

Concentrating on cases where information can be attributed to specific individuals might miss privacy problems that arise from people being identifiable as members of small groups. It is worth understanding how likely it is that smaller groups are affected.

The potential for targeting of individuals or small groups might justify disqualification of — or at least adjustments to — a proposal. The Web is for everyone, not just most people.