Principle: Don't use Entropy #4
Hey Martin, thanks for writing this blog post. Can you clarify exactly what you are proposing here? The blog post goes into a lot of detail beyond a critique of the entropy statistic (in fact, it argues we shouldn't look at any single statistic). Are you open, for example, to using entropy (or a similar, non-worst-case measure) in addition to other privacy measures? This is how we are thinking about things like attribution measurement, where entropy/information gain is a useful "at-scale privacy" metric for us, in addition to worst-case measures like differential privacy.
I'm not exactly proposing in the positive, but more in the negative :) That is, entropy provides some insight, but it isn't a very useful one unless you understand what it is really saying. You might also interpret this as a bit of a repudiation of your notion of "at-scale privacy". Sure, there is something to be said for providing protection to people in larger groups and using aggregate measures as the basis for making claims about those groups. However, I'm asserting that anything you say about a population as a whole is not worth much if it means that there are individuals or minorities who might suffer disproportionately as a result. That is, saying that the average person is affected in a certain way is far less compelling than being able to say that all people are affected in a certain way. The sort of thing that might be useful is knowing how often someone is identifiable as being in a group of size less than some threshold k.
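To make that concrete, here is a small sketch with toy numbers of my own (nothing from the blog post): two ways of grouping the same population can have nearly the same entropy while leaving very different numbers of people effectively identified.

```python
# Toy comparison: entropy of "which group is a random user in?" versus how many
# users land in groups smaller than some threshold k. Numbers are invented
# purely for illustration.
from math import log2

def entropy_bits(group_sizes):
    """Shannon entropy, in bits, of the group that a uniformly random user falls into."""
    n = sum(group_sizes)
    return -sum((s / n) * log2(s / n) for s in group_sizes)

def users_below(group_sizes, k):
    """Number of users who end up in a group of size less than k."""
    return sum(s for s in group_sizes if s < k)

uniform = [100] * 10            # 1000 users split into 10 groups of 100
skewed = [110] * 9 + [1] * 10   # 1000 users: 9 large groups, 10 people isolated

for name, sizes in (("uniform", uniform), ("skewed", skewed)):
    print(f"{name}: {entropy_bits(sizes):.2f} bits, "
          f"{users_below(sizes, 10)} users in groups smaller than 10")
# uniform: 3.32 bits, 0 users in groups smaller than 10
# skewed:  3.25 bits, 10 users in groups smaller than 10 (all in groups of one)
```

The average statistic barely moves, while ten people go from being hidden in a crowd of a hundred to being uniquely identified; that is the kind of distributional detail a single entropy figure washes out.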
I didn't say we'd want to use an "at-scale" measure alone. There is value in using both worst-case metrics like differential privacy and at-scale metrics (like information gain, entropy, etc.) that try to measure something like "scaled abuse is infeasible". If there were a privacy change that helped reduce at-scale privacy loss but kept the worst case equal, I think we ought to consider making that change in a PATCG proposal. A few good properties of at-scale metrics:
Worst-case notions of privacy are ill-suited to evaluating such benefits because, by definition, they ignore the effect on the typical use case and focus only on the extreme cases. I tend to think of worst-case metrics as a sort of "backstop": it can't get worse than this, but it can definitely be better if you consider less adversarial scenarios. We should improve both.
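For what it's worth, here is a minimal sketch of the two kinds of metric side by side, using k-ary randomized response as a stand-in mechanism (my own toy model, not any specific proposal): the differential privacy parameter captures the worst case, while the mutual information between input and report captures the average, "at-scale" leakage.

```python
# Worst-case vs. average-case leakage for k-ary randomized response:
# with probability 1-p report the true value, otherwise report a uniformly
# random one of the k values. Toy model for illustration only.
from math import log, log2

def rr_metrics(k, p):
    p_true = 1 - p + p / k    # probability the report equals the input
    p_other = p / k           # probability of any particular other value
    epsilon = log(p_true / p_other)              # worst-case (DP) privacy loss
    # For a uniform input the output is uniform too, so H(Y) = log2(k); the
    # average gain is the mutual information H(Y) - H(Y | X).
    h_y_given_x = -(p_true * log2(p_true) + (k - 1) * p_other * log2(p_other))
    return epsilon, log2(k) - h_y_given_x

for k, p in ((8, 0.0024), (8, 0.5), (2, 0.75)):
    eps, bits = rr_metrics(k, p)
    print(f"k={k}, p={p}: worst-case epsilon ~ {eps:.2f}, average gain ~ {bits:.3f} bits")
```

Turning p up shrinks both numbers, but on very different scales: the worst case bounds every individual report, while the average describes the typical one.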
I have concerns with these kinds of k-anon-style metrics, but perhaps we should hash it out in a different issue on what the privacy measure should be :) and focus this one on the entropy / at-scale debate.
As a motherhood statement, this is easy to agree with. I want to get at the basis for my concerns with this "at scale" metric though, because I think that any consideration of those effects needs to be secondary, and strictly so.
This is not a claim that I can agree with in the same way as the previous one. The same goes for the economic aspect, which I don't consider to be a separate point; whether tracking is pervasive is a direct consequence of the economics. We see extensive tracking of online activity because it has become sufficiently efficient relative to the value it provides.

The problem I see with basing decisions on an at-scale assessment is that any defense against tracking is fragile. Any individual person who has their identity linked across sites suffers an irrevocable loss of privacy. Once two sites have determined that a given user is the same person on both, that loss cannot be undone.

My understanding is that we're more or less resigned to the continuous release of information over time. With that, if a design allows sites to link identities for some users, then sites can accumulate cross-site linkages. Once a user is linked, they can be excluded from the set of users that participate in measurement, progressively improving the ability to link the identity of other users.

We are also committed to providing information that is scaled by site (with pairwise information revealing more information). That allows for transitive linkages to be created. Identity relationships are commutative across sites, so a linkage established between one pair of sites can be combined with linkages involving other sites.

Though it might seem like this concern is about being able to target individuals, the concern is more rooted in providing high-fidelity information that crosses sites. Any system that creates clean demarcations between groups of users can be harnessed for tracking over time. In the best case, sites aren't able to control how users are allocated into groups, so what information is released comes down to chance. Sites might be able to further refine groupings using external signals, like fingerprints, but this is still dependent on chance. Whether individuals are identifiable as a result will depend on how the population of visitors to the site is composed.

Proposals that don't give sites at least some control over how users are allocated to groupings are rare. For measurement at least, designs that offer no control to sites tend to be pretty useless. Proposals that give sites control over the allocation of users to groups allow for adaptive techniques that can progressively segment populations. Though groupings might be large, progressive divisions allow the identity of all users to be resolved in a number of rounds that grows only logarithmically with the size of the population.

This doesn't need to assume that the mechanisms we are considering are dedicated solely to tracking. Depending on the design, this sort of information leakage could be available for use in linking cross-site identity without impeding the use of the information for conversion measurement. This is something of a redeeming characteristic of PCM: if you want to use PCM for tracking, you really need to dedicate all of its use to that end. Any use of PCM for tracking eats into whatever you might gain from measuring conversions.

In comparison, the Chrome event-level proposal offers 2 billion identifiers per person, which offers plenty of opportunities to target each person many times. A true conversion can preempt any tracking attempt using trigger priority, simultaneously ensuring that conversion measurement works while performing tracking. Each conversion then just reduces the amount of tracking information that user provides. The filtering feature might make even prioritization unnecessary. The added randomization on each navigation event (p=0.0024 currently) is almost low enough that it can be ignored.
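To put that last number in perspective, here is a back-of-the-envelope sketch (my own simplification, treating the randomization as a simple chance that a planted report is replaced, which is not exactly the Chrome mechanism): at p = 0.0024 a tracker reading back its own identifier is almost never misled.

```python
# Rough sketch only: assume each navigation-source report is randomized with
# probability p, and otherwise delivered exactly as the tracker arranged.
p = 0.0024                       # randomization rate quoted above

print(f"reports delivered as planted: {1 - p:.2%}")       # 99.76%
print(f"randomized reports: about 1 in {round(1 / p)}")   # about 1 in 417
# If the tracker sees the same identifier from two independent reports, the
# chance that both were randomized is p squared:
print(f"both of two reports randomized: {p ** 2:.1e}")    # ~5.8e-06
```

At that rate the noise is a rounding error for anyone operating at scale.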
Event sources (p=2.5e-6) can also be turned to this end, with 1 = real and 0 = tracking, and with the observation that there is a third state where no report is generated. Careful sites can then run multiple tests to cancel this noise, up to the cap on the rate at which reports can be generated.

This leads neatly to my final point. The introduction of meaningful amounts of noise, as introduced by differential privacy, does change things.
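As a sketch of the "run multiple tests" point (my own simplified model: one bit per event source, counted as read wrongly only when the signal is randomized, which is an upper bound): a handful of repeated signals and a majority vote drives an already tiny error rate toward zero.

```python
# Simplified model: each of n repeated signals is independently randomized with
# probability p; the tracker takes a majority vote across the n signals and is
# counted as misled whenever a majority of them were randomized.
from math import comb

def majority_randomized(n, p):
    """Probability that more than half of n independent signals are randomized."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n // 2 + 1, n + 1))

p = 2.5e-6                        # event-source randomization rate
for n in (1, 3, 5):
    print(f"n={n}: misled with probability <= {majority_randomized(n, p):.1e}")
# n=1: 2.5e-06, n=3: ~1.9e-11, n=5: ~1.6e-16
```

At these rates the randomization itself is no real obstacle; it is the rate cap, and ultimately meaningful amounts of noise, that change the picture.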
I've just been neck deep in looking at Topics and the initial analysis there caused me a little discomfort. It took me a while to realize what the problem was.
We tend to think of privacy loss in different proposals in information-theoretic terms. This is useful, but it has its limits.
I've laid out my thinking on the topic in a longer blog post that concludes that entropy is a poor statistic to rely on for privacy analysis.