Principle: Don't use Entropy #4

Open
martinthomson opened this issue May 27, 2022 · 4 comments

Comments

@martinthomson
Collaborator

I've just been neck deep in looking at Topics and the initial analysis there caused me a little discomfort. It took me a while to realize what the problem was.

We often tend to think of privacy loss in different proposals in information theoretic terms. This is useful, but it has its limits.

I've laid out my thinking on the topic in a longer blog posting that concludes that entropy is a poor statistic to rely on for privacy analysis.

@csharrison
Collaborator

Hey Martin, thanks for writing this blog post. Can you clarify exactly what you are proposing here? The blog post covers a lot more than a critique of the entropy statistic (in fact, it argues we shouldn't look at any single statistic).

Are you open, e.g., to using entropy (or a similar non-worst-case measure) in addition to other privacy measures? This is how we are thinking about things like attribution measurement, where entropy/information gain is a useful "at-scale privacy" metric to us, in addition to things like differential privacy, which are worst-case.

@martinthomson
Collaborator Author

I'm not exactly proposing in the positive, but more in the negative :) That is, entropy provides some insight, but it isn't a very useful one unless you understand what it is really saying.

You might also interpret this as a bit of a repudiation of your notion of "at-scale privacy". Sure, there is something to be said for providing protection to people in larger groups and using aggregate measures as the basis for making claims about those groups. However, I'm asserting that anything you say about a population as a whole is not worth much if it means that there are individuals or minorities that might suffer disproportionately as a result.

That is, saying that the average person is affected in a certain way is far less compelling than being able to say that all people are affected in a certain way.

The sorts of things that might be useful include knowing how often someone is identifiable as being in a group of size less than $k$ for $k\in \{1,10, 100, \dots\}$, for example, or how many people might reveal information that exceeds some threshold $h$. That might not suffice, in that those sorts of statistics depend on assumptions about populations and selection, but they do provide more direct insight into the effect on privacy.
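To make the contrast with entropy concrete, here is a minimal sketch of how such statistics might be computed. It assumes an observer sees one group label per user; the function name, the toy population, and the reading of the thresholds as "groups of at most $k$ users" (so that $k=1$ captures uniquely identified users) are all illustrative assumptions, not part of any proposal.

```python
from collections import Counter
from math import log2


def group_size_report(assignments, thresholds=(1, 10, 100)):
    """Summarise what a group label reveals.

    `assignments` maps each user to the group label an observer sees.
    Returns (entropy, max_surprisal, exposed), all hypothetical names.
    """
    sizes = Counter(assignments.values())
    n = len(assignments)
    # Shannon entropy of the label in bits: an average over the population.
    entropy = -sum((s / n) * log2(s / n) for s in sizes.values())
    # Information revealed about a user in the smallest group: the worst case.
    max_surprisal = max(log2(n / s) for s in sizes.values())
    # Fraction of users who sit in a group of at most k users.
    exposed = {
        k: sum(s for s in sizes.values() if s <= k) / n for k in thresholds
    }
    return entropy, max_surprisal, exposed


# Toy population: 998 users share a label, 2 users have a label of their own.
toy = {f"user{i}": "A" for i in range(998)}
toy["user998"] = "B"
toy["user999"] = "C"
entropy, max_surprisal, exposed = group_size_report(toy)
```

In this toy population the entropy is a small fraction of a bit, yet two users are uniquely identified, each revealing about 10 bits. The averaged statistic hides exactly the people the thresholded counts bring out.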

@csharrison
Collaborator

> You might also interpret this as a bit of a repudiation of your notion of "at-scale privacy". Sure, there is something to be said for providing protection to people in larger groups and using aggregate measures as the basis for making claims about those groups. However, I'm asserting that anything you say about a population as a whole is not worth much if it means that there are individuals or minorities that might suffer disproportionately as a result.

I didn't say we'd want to use an "at scale" measure alone. There is value in using both worst-case metrics like differential privacy and at-scale metrics (like information gain, entropy, etc.), which try to measure something like “scaled abuse is infeasible”.

If there were a privacy change that helped reduce at-scale privacy loss but kept the worst-case equal, I think we ought to consider making that change in a PATCG proposal.

A few good properties of at-scale metrics:

  • They help represent whether pervasive tracking on the web can occur (e.g. many people tracked at once).
  • They can help us understand and shape the economics of tracking behaviors. For instance, if it is only possible to track a small number of users, the cost/benefit calculus of deploying tracking methods changes. Note that this can benefit all users by creating an environment that disincentivizes attacks that are useful only if successful on a large scale.

Worst-case notions of privacy are ill-suited for evaluating such benefits because, by definition, they ignore the typical case and focus only on the extremes. I tend to think of worst-case metrics as a sort of “backstop”. It can’t get worse than this, but it can definitely be better if you consider less adversarial scenarios. We should improve both.

> The sorts of things that might be useful include knowing how often someone is identifiable as being in a group of size less than $k$ for $k\in \{1,10, 100, \dots\}$, for example, or how many people might reveal information that exceeds some threshold $h$. That might not suffice, in that those sorts of statistics depend on assumptions about populations and selection, but they do provide more direct insight into the effect on privacy.

I have concerns with these kinds of k-anon style metrics, but perhaps we should hash it out in a different issue on what the privacy measure should be :) and focus this one on the entropy / at-scale debate.

@martinthomson
Collaborator Author

> If there were a privacy change that helped reduce at-scale privacy loss but kept the worst-case equal, I think we ought to consider making that change in a PATCG proposal.

As a motherhood statement, this is easy to agree with. I want to get at the basis for my concerns with this "at scale" metric though, because I think that any consideration of those effects needs to be secondary, and strictly so.

> [...] at-scale metrics [...] help represent whether pervasive tracking on the web can occur (e.g. many people tracked at once).

This is not a claim that I can agree with in the same way as the previous. The same goes for the economic aspect, which I don't consider to be a separate point; whether tracking is pervasive is a direct consequence of the economics. We see extensive tracking of online activity because it has become sufficiently efficient to do so relative to the value it provides.

The problem I see with basing decisions on an at-scale assessment is that any defense against tracking is fragile. Any individual person who has their identity linked across sites suffers an irrevocable loss of privacy. Once two sites have determined that user $u_a$ on site $a$ is the same as user $u_b$ on site $b$, no more information is needed. The two sites - or third parties present on both sites - can link the activities of that user on the two sites forever.

My understanding is that we're more or less resigned to the continuous release of information over time. With that, if a design allows sites to link identities for some users, then sites can accumulate cross-site linkages. Once a user is linked, they can be excluded from the set of users that participate in measurement, progressively improving the ability to link the identity of other users.

We are also committed to providing information that is scaled by site (with pairwise information revealing more information). That allows for transitive linkages to be created. Identity relationships are transitive across sites, so $u_a=u_b$ and $u_b=u_c$ together reveal that $u_a=u_c$ without $a$ and $c$ expending any of their tracking budget to learn that. More indirection adds to fragility, but it also allows for reinforcement of weaker linkages.
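As a rough illustration of how pairwise linkages accumulate, the sketch below uses a union-find structure over `(site, user_id)` pairs. The class name, method names, and identifiers are hypothetical; this only shows that once $a$-$b$ and $b$-$c$ links exist, the $a$-$c$ link comes for free.

```python
class IdentityLinker:
    """Union-find over (site, user_id) pairs; all names are illustrative."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        # Register unseen nodes as their own representative.
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            # Path halving keeps later lookups short.
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node

    def link(self, x, y):
        # Record that two per-site identities belong to the same person.
        self.parent[self.find(x)] = self.find(y)

    def same_user(self, x, y):
        return self.find(x) == self.find(y)


linker = IdentityLinker()
linker.link(("a", "u_a"), ("b", "u_b"))  # learned by sites a and b
linker.link(("b", "u_b"), ("c", "u_c"))  # learned by sites b and c
linked_a_c = linker.same_user(("a", "u_a"), ("c", "u_c"))
unrelated = linker.same_user(("a", "u_a"), ("d", "u_d"))
```

Sites `a` and `c` never exchanged anything directly, yet `linked_a_c` is true, while the unrelated identity on site `d` stays separate.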

Though it might seem like this concern is about being able to target individuals, the concern is more rooted in providing high-fidelity information that crosses sites. Any system that creates clean demarcations between groups of users can be harnessed for tracking over time. Best case, sites aren't able to control how users are allocated into groups, so what information is released comes down to chance. Sites might be able to further refine groupings using external signals, like fingerprints, but this is still dependent on chance. Whether individuals are identifiable as a result will depend on how the population of visitors to the site is composed.

Proposals that don't give sites at least some control over how users are allocated to groupings are rare. For measurement at least, designs that offer no control to sites tend to be pretty useless. Proposals that give sites control over the allocation of users to groups allow for adaptive techniques that can progressively segment populations. Though groupings might be large, progressive divisions allow the identity of all users to be resolved in $\lceil\log_{G}(N)\rceil$ iterations for $N$ users and $G$ groupings. Even when $G$ is small and $N$ large, this doesn't take long. Dividing the population of the planet into just 2 groups each week provides enough information to uniquely identify every single person in 33 weeks, since $2^{33} \approx 8.6 \times 10^9$ exceeds the world population.
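The $\lceil\log_{G}(N)\rceil$ bound is easy to check numerically. The helper below is a small sketch (the function name and the round population figure of 8 billion are assumptions) that uses integer arithmetic to avoid floating-point rounding near exact powers.

```python
def iterations_to_identify(population, groups):
    """Smallest r such that groups**r >= population, i.e. ceil(log_G(N))."""
    if population < 1 or groups < 2:
        raise ValueError("need population >= 1 and groups >= 2")
    rounds, distinguishable = 0, 1
    while distinguishable < population:
        # Each round multiplies the number of distinguishable users by G.
        distinguishable *= groups
        rounds += 1
    return rounds


WORLD_POPULATION = 8_000_000_000  # assumed round figure
weeks_with_two_groups = iterations_to_identify(WORLD_POPULATION, 2)
weeks_with_ten_groups = iterations_to_identify(WORLD_POPULATION, 10)
```

Under these assumptions, two groups per round need 33 rounds and ten groups need 10, which shows how quickly the count drops as $G$ grows.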

This doesn't need to assume that the mechanisms we are considering are dedicated solely to tracking. Depending on the design, this sort of information leakage could be available for use in linking cross-site identity without impeding the use of the information for conversion measurement.

This is something of a redeeming characteristic of PCM: if you want to use PCM for tracking, you really need to dedicate all of its use to that end. Any use of PCM for tracking really eats into whatever you might gain from measuring conversions.

In comparison, the Chrome event-level proposal offers 2 billion identifiers per person, which offers plenty of opportunities to target each person many times. A true conversion can preempt any tracking attempt using trigger priority, simultaneously ensuring that conversion measurement works while performing tracking. Each conversion then just reduces the amount of tracking information that user provides. The filtering feature might make even prioritization unnecessary. The added randomization on each navigation event (p=0.0024 currently) is almost low enough that it can be ignored. Event sources (p=2.5e-6) can also be turned to this end, with 1=real, 0=tracking, with the observation that there is a third state where no report is generated. Careful sites can then run multiple tests to cancel this noise, up to the cap on the rate that reports can be generated.
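To give a feel for why repeated tests can cancel low-probability noise, here is a deliberately simplified sketch. It models each observation as a binary value that is independently flipped with probability `flip_prob` and takes a majority vote. That model is an assumption made for illustration and is not the actual noise mechanism of any proposal. The value 0.0024 is borrowed from the text purely for scale.

```python
from math import comb


def majority_error(flip_prob, trials):
    """Probability that a majority vote over `trials` noisy binary
    observations is wrong, assuming independent flips (a simplification)."""
    if trials % 2 == 0:
        raise ValueError("use an odd number of trials to avoid ties")
    needed = trials // 2 + 1  # flips required to overturn the majority
    return sum(
        comb(trials, k) * flip_prob**k * (1 - flip_prob) ** (trials - k)
        for k in range(needed, trials + 1)
    )


single = majority_error(0.0024, 1)  # one observation: error equals flip_prob
triple = majority_error(0.0024, 3)  # three observations, majority vote
```

Under this toy model, going from one observation to three cuts the error probability by more than two orders of magnitude, which is why a cap on report rate matters more than the per-report noise.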

This leads neatly to my final point. The introduction of meaningful amounts of noise - as introduced by differential privacy - does change things. $\varepsilon=14$ isn't much noise (see footnote 1), so it can almost be ignored, but differential privacy with smaller values of $\varepsilon$ makes it difficult to attribute a single, unambiguous value to an individual. Maybe we are ultimately just talking about time-to-identification either way, but I like to think that we can resolve to pick a value of $\varepsilon$ that is a little better than that.
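For a sense of scale, $\varepsilon$-differential privacy bounds the ratio of output probabilities for two neighbouring inputs by $e^{\varepsilon}$. The short sketch below (the function name is illustrative) evaluates that bound for a few values of $\varepsilon$.

```python
from math import exp


def max_probability_ratio(epsilon):
    """Upper bound e**epsilon on Pr(F(x)=S) / Pr(F(x')=S) under
    epsilon-differential privacy."""
    return exp(epsilon)


ratios = {eps: max_probability_ratio(eps) for eps in (0.1, 1, 14)}
```

At $\varepsilon=14$ the bound is a little over 1.2 million, whereas at $\varepsilon=1$ it is about 2.7, so a single sample says far less about which input produced it.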

Footnotes

  1. $\frac{\Pr(F(x)=S)}{\Pr(F(x')=S)} \le e^{14} \approx 1202604$ is a pretty big ratio of probabilities; you might say that it makes any hypothesis about $x$ over $x'$ fairly easy to advance with just one sample.
