Impact to Privacy of Usage of Cross Domain Signals in Opaque Processing with DP/K Output Constraints #26
I agree, this is an interesting question. It's definitely one of the areas where this high-level document doesn't get into enough detail to offer an opinion one way or the other. I tried to highlight this sort of grey area when I wrote
In various conversations I've had in the years of Privacy Sandbox development, some people are skeptical of any system which has the ability to build a user profile based on data from many different contexts, even if that pool of data is under the user's control and can only be used for targeting inside an opaque environment:
All this is a long-winded way of saying that I don't think there is consensus on the question you're asking, and the document's ambiguity reflects that.
Interesting thoughts.
Could you please provide an example of when that would be an issue? A maybe naïve assumption is that when clicking on an ad, the user expects the ad information (the product the user clicked on) to be leaked out to the environment, even if that ad information comes from different contexts. For example, if a user, in different contexts, shows interest in healthy drinks and sport, and that user is shown an ad for a healthy sport drink, and clicks on that ad, the user would surely expect to land on a site selling healthy sport drinks.
Examples come naturally from the combination of the "novel inferences" and "click-time leakage" risks. I think the canonical "novel inferences" example is the famous "Target knows you're pregnant" story. If the on-device ad selection model can be based on items browsed or purchases made across many unrelated sites, it facilitates picking an ad which inherently means "the model believes this person is pregnant". The chosen ad might not be for something obviously pregnancy-related at all. If, as that NYTimes article says, Target thinks it's really valuable to get the bit of information that you are pregnant, then they could show you a good coupon for something unrelated, but with a click-through URL that says [Note that the NYTimes article is paywalled. Click here instead to get the story non-paywalled / "gifted" from my subscription... which, of course, means this link now contains tracking information!]
Ok, I see, the ad would be hiding the novel inference, in order to pass the information without the user being in a position to know that the information is being passed...
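To make the click-time leakage concrete, here is a minimal sketch, assuming a toy setup. The landing page URL, the `promo` and `s` parameter names, and both function names are hypothetical; nothing here comes from a real ad API. It only shows that a coupon ad for something unrelated can still carry one inferred bit in its click-through URL.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_click_url(landing_page, model_believes_segment):
    """Build a click-through URL for an innocuous coupon ad.

    The visible creative says nothing about the inference; the
    (hypothetical) 's' parameter carries it as a single bit.
    """
    params = {"promo": "coupon10", "s": "1" if model_believes_segment else "0"}
    return landing_page + "?" + urlencode(params)


def advertiser_reads_click(url):
    """What the landing site can recover once the user clicks."""
    return parse_qs(urlparse(url).query)["s"][0] == "1"


# The opaque selection step decided the segment applies (assumed input).
click_url = build_click_url("https://retailer.example/coupon", True)
learned = advertiser_reads_click(click_url)
```

The user sees a coupon and an ordinary-looking URL, while the landing site learns the inferred bit at click time, tied to whatever first-party identity the user has there.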
Scope
Some of what I'm chewing on here has to do with the wording of the Attestation, but I'm curious about your thoughts on the model more abstractly. So I'll say explicitly that here I'm not asking for comment on the Attestation, what it requires, etc.
Definition
So, thinking about it more, the Novel Inferences thing is interesting. I think that the idea/concept/threat of "Novel Inference" isn't identical (no pun intended) to the idea/concept/threat of "Re-identification across Contexts". It seems like Novel Inference certainly can include a "complete re-identification" between contexts A and B (max inference), but that B learning something that is tied to your identity in A doesn't imply re-identification. So, just definitionally, does that seem right?
Threat Model
I can still see how a user (including me) would want to avoid Novel Inference of particular sets of their characteristics from Context A in Context B, like the pregnancy case or if I don't want my insurance company to know about my arthritis. It would be worse if that Novel Inference (arthritis) could be exported and queried in a permanent and clear way attached to my identity in B (insurance industry)...but even if that was transient (say it prevents me from getting a better insurance quote in my stubborn browser) that is bad. So, then two questions:
Philosophizing
To dive in a little bit, I'd like to kick a hackysack around on the quad (or toss a frisbee, your choice) with you and ask your thoughts on what it means to:
On one side of the line, I definitely see Persisting a Graph of Unique Identifiers to an ACID store with RAID 10 disks and global replication via Quantum Entanglement as quite clearly "joining identities", as you can see the result of the join at your leisure, use it to do further ID-based lookups in different contexts, and know how to repeat it in the future. On the other side of the line, I can see a transient process mixing data attached to each ID into a single list, assuming that list does not contain the IDs or any deterministic derivative of them, and that list never leaves the strongly gated process...that one is tougher. If the output of the join does not contain unique identifiers I couuulld argue you've not joined IDs, re-identification can't happen directly (especially given k-anon output gates), and we've just joined user data. I think that:
Does (1) mean being able to operate on the IDs in a common process in any way? Observing the output of the join, rather than just the input? Can I assume Chrome does not want to take a position on this? :)
Dangerous Philosophizing
OK, this might be going too far from the quad, but like, what does "join", even, like, mean, man? Are we referring to:
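As a rough illustration of the "transient join" side of that line, here is a minimal sketch under stated assumptions. The site-scoped IDs, the interest labels, the threshold `K = 3`, and both function names are invented for the example; this is not how any browser actually implements its k-anonymity gate. Each user's data from two contexts is merged inside a process whose output contains interests only, and an output is released only if at least `K` users produced the same one.

```python
from collections import Counter

K = 3  # hypothetical k-anonymity threshold


def transient_join(records_a, records_b, per_user_pairs):
    """Merge interests attached to site-scoped IDs; emit interests only.

    per_user_pairs holds (id_on_A, id_on_B) for each user. The IDs are
    used as lookup keys and then dropped, so no output contains them.
    """
    merged = []
    for id_a, id_b in per_user_pairs:
        interests = set(records_a.get(id_a, [])) | set(records_b.get(id_b, []))
        merged.append(tuple(sorted(interests)))
    return merged


def k_anon_gate(outputs, k=K):
    """Release only the distinct outputs produced by at least k users."""
    counts = Counter(outputs)
    return sorted(o for o, n in counts.items() if n >= k)


records_a = {"a1": ["running"], "a2": ["running"], "a3": ["running"], "a4": ["arthritis"]}
records_b = {"b1": ["smoothies"], "b2": ["smoothies"], "b3": ["smoothies"], "b4": ["insurance"]}
pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]

released = k_anon_gate(transient_join(records_a, records_b, pairs))
```

In this toy data the common combination passes the gate and the rare one is suppressed. Whether the merge step counts as "joining identities" when no ID survives into the output is exactly the open question.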
First, I certainly agree that "B learning something that is tied to your identity in A doesn't imply re-identification." This is what I was getting at in my privacy model doc when I wrote "A per-first-party identity can only be associated with small amounts of cross-site information" as one of its big three points. Of course this is where all of the hard questions end up — just to quote my 2019 self a little more:
However, in the previous discussion in this issue, I was trying to use the term "novel inference" to mean something a little different: some probabilistic belief about a person that was not being made based on their behavior on any single site, but rather was made only by drawing on information from multiple contexts. That is, it's not about "B learning something that is tied to your identity in A", but rather "B learning something based on your behavior on A1, A2, A3, etc, which was not derivable from your identity on any one of the Ai alone." The fact that it was not previously tied to any of your partitioned-by-site identities is what makes it "novel". Again, this is surely a different question from that of "joining identity". We know that we don't want someone to be able to join identities across sites — once that happens, the game is lost, and the browser no longer has any hope of preserving the user's privacy or restricting the flow of information across contexts. But if identity is not joined, then the browser does have a chance to be opinionated about the cross-context flow of information. All these other questions are trying to figure out what that opinion ought to be.
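One way to picture a belief that is "not derivable from any one of the Ai alone" is a toy log-odds model. This is only a sketch: the evidence weights, the prior, and the 0.5 threshold are made-up numbers, and a real ad selection model would look nothing like this. The point is just that each site's evidence alone leaves the belief below the threshold, while the combination pushes it over.

```python
import math

# Hypothetical log-odds evidence contributed by behavior on each site.
evidence = {"A1": 0.9, "A2": 0.8, "A3": 1.0}
PRIOR_LOG_ODDS = -2.0  # assumed prior: the belief starts out unlikely
THRESHOLD = 0.5        # assumed cutoff for "the model believes this"


def belief(log_odds_terms):
    """Combine independent evidence by summing log-odds, then map to a probability."""
    z = PRIOR_LOG_ODDS + sum(log_odds_terms)
    return 1 / (1 + math.exp(-z))


# Belief available to an observer who sees only one site's behavior.
single = {site: belief([w]) for site, w in evidence.items()}

# Belief available to a process that can read all three contexts.
combined = belief(evidence.values())
```

Here every single-site belief stays under the threshold and only the combined one exceeds it, which is the sense in which the inference is "novel".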
As I've gotten deeper into this I've been pondering something: what would be the impact to this core privacy model if user bidding signals were:
I haven't had the chance to try to work through the math here (some serious cobwebs to dust off for any proof'ing) but I wonder if this would still meet the privacy model laid out here from a "happy path perspective" (meaning impact to "reidentification across context"), with the full understanding that any hacks on that environment would incur a worse privacy loss than if a single-partition-process is hacked.