-
Notifications
You must be signed in to change notification settings - Fork 0
/
ReID_risk_analysis.Rmd
139 lines (101 loc) · 8.09 KB
/
ReID_risk_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
---
title: "Re-identification Risk Analysis"
author: "xsong"
date: "10/19/2018"
output: html_document
---
### Re-identification Risk Analysis ###
This document reports disclosure risks for 4 EHR data sets of Acute Kidney Injury (AKI) patients over 10 years at KU medical center with different observation points, which will be made open to public.
***
#### Determining Key Variables ####
"Key variables", also called "implicit identifier" or "quasi-identifier", are defined as a set of variables that, when considered together, can be used to identify individual patients. Also "key variables" are, to some extent, more accessible to the public or to potential intruders. These "key variables" are usually determined by data privacy specialists. However, when this approach is not feasible, we can only treat all variables to be potential "key variables". In our case, treating all 1,287 variables as key variables would result in much higher risk of containing too many "unique" patterns, even if we assume data being a sample from an underlying population that is at least 10 times larger. So, we attempted to just release a maximal set of variables which: a) are more common among patients (assuming rare features are riskier); and b) their collective "Global risk", or population uniqueness, is below 0.03%.
Note that in [sdcMicro Appendix A], only basic demographics are used as "key variables". We have also tested the scenario of only including age (discretized into 6 groups), gender, race as keys, and got all the re-identification risk measures below 0.001%.
[sdcMicro Appendix A]: https://cran.r-project.org/web/packages/sdcMicro/vignettes/sdc_guidelines.pdf
***
#### Measuring the Re-identification/Disclosure Risk ####
Table1 demonstrates results of 5 disclusure risk measures for all the 4 EHR data sets in request, which are properly scrubbed with selective columns of low disclosure risks.
```{r risk_est, include=F}
source("./R/util.R")
source("./R/eval_re_id_risk.R")
require_libraries(c("tidyr",
"dplyr",
"magrittr",
"sdcMicro",
"h2o",
"scales",
"knitr",
"kableExtra"))
#turn off scientific notation
options(scipen=999)
# #--initialize h2o cluster
# h2o.init(nthreads=-1)
#
# dat_stack<-c()
# for(i in 1:4){
# dat<-read.csv(paste0("./data/Perspective_",i,".csv"))
# out<-sel_safe_col(dat,
# col_incl=c("X0","X1","X2"),
# incl_inc=10,
# uni_thresh=0.0001,
# wt=round(1/0.1))
#
# saveRDS(out,file=paste0("./output/ReID_risk_persp",i,".rda"))
#
# out_summ<-out$final_out$risk_summ %>%
# gather(risk_type,risk_score) %>%
# mutate(release_col = length(out$col_sel),
# perspective=paste0("Perspective_",i))
#
# dat_stack %<>% bind_rows(out_summ)
# }
#
# h2o.shutdown(prompt=F)
#comment out above and uncomment the following when re-calculation of risk is not necessary
dat_stack<-c()
for(i in 1:4){
out<-readRDS(paste0("./output/ReID_risk_persp",i,".rda"))
out_summ<-out$final_out$risk_summ %>%
gather(risk_type,risk_score) %>%
mutate(release_col = length(out$col_sel),
perspective=paste0("Perspective_",i))
dat_stack %<>% bind_rows(out_summ)
}
```
```{r out, echo=F}
nice_tbl<-dat_stack %>%
mutate(risk_score=ifelse(risk_score==0,0.0001,risk_score),
risk_score=ifelse(risk_type %in% c("risk3"),
round(risk_score/2,4),risk_score)) %>%
mutate(risk_perc=paste0(risk_score*100,"%")) %>%
dplyr::select(perspective,release_col,risk_type, risk_perc) %>%
dplyr::filter(risk_type %in% paste0("risk",2:6)) %>%
mutate(risk_type=recode(risk_type,
risk2="Measure 1.1",
risk4="Measure 1.2",
risk3="Measure 2.1",
risk5="Measure 2.2",
risk6="Measure 3")) %>%
spread(risk_type,risk_perc) %>%
dplyr::rename(`# of Released Columns`=release_col)
kable(nice_tbl,
caption="Table1 - Disclosure Risk Estimations of Scrubbed Data Sets") %>%
kable_styling("striped", full_width = F)
```
***
#### Intepreting the Re-identification/Disclosure Risk ####
Three measures for re-identification/disclosure risk, inspired by Statistical Disclosure Control([sdcMicro]), are calculated and reported for all of the data sets. In addition, the number of secure columns for release are also reported.
[sdcMicro]: https://cran.r-project.org/web/packages/sdcMicro/vignettes/sdc_guidelines.pdf
**Measure 1 - Population Uniquness**: It measures the estimated percentage of unique records within the underlying population (e.g. all hospitalization encounters over 10 years) from which the sample is drawn. The idea is that individual record having rare pattern (i.e. combination of variables) can be more easily identified and thus have a higher risk of re-identification/disclosure. Two models were used to estimate the risk assuming two typical distributions of frequency counts for the underlying population, which are negative binomial (Measure 1.1) [see Rinott and Shlomo, 2006] or poisson distributions (Measure 1.2) [see Skinner and Holmes, 1998]. When feature dimension is high, quality of Measure 1.2 tend to drop, which can be more accurate than Measure 1.1 with fewer features.
For example, if an attacker wants to use data set "perspective_1" to re-identify some patients, the chance for them to pick out a record that likely to uniquely belong to a patient is less than `r max(nice_tbl[1,3],nice_tbl[1,4])`. In other words, more than 99.99% records are ambiguous, or "anonymous", enough to not put any particular patient at risk of being disclosed.
***
**Measure 2 - Success Rate**: "Success rate" is defined as the likelihood of correct guess if each sample unique is matched to a randomly chosen individual from the same population cell. This measure is also estimated through two different models, negative binomial (Measure 2.1) [see Rinott and Shlomo, 2006] or poisson (Measure 2.2) [see Skinner and Holmes, 1998].
For example, the average chance for a unqiue or rare record within the data set "perspective_1" being successfully re-identified is less than `r max(nice_tbl[1,5],nice_tbl[1,6])`. In other words, even if an attacker is able to link additional information to records in this data set, the chance for them to be sure that their linkage is correct is less than `r max(nice_tbl[1,5],nice_tbl[1,6])` of the time.
***
**Measure 3 - % Records at risk**: This measure calculates the percentage of observations that are condiered risky and also have relatively higher risk than majority of the cohort. The default lower boundary of individual risk for a record being considered as "risky" is 0.1, while the lower boundary for a record being considered as "riskier than majority" is $2*(median(r)+2*MAD(r))$ where $r$ represents the vector of indivial risks.
Measure 1 and Measure 2 estimates the global disclosure risks, Measure 3 is used to identify if there exists patients with extremely high risk of being disclosed (e.g. patients with very rare clinical pathways or very complete information), which turns out to be all less then `r max(nice_tbl[,7])` in all of the scrubbed data sets.
[see Rinott and Shlomo, 2006]: https://link.springer.com/chapter/10.1007/11930242_8
[see Skinner and Holmes, 1998]: http://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/estimating-the-re-identification-risk-per-record-in-microdata.pdf
***
In conclusion, these 4 data sets do not contain any sensitive patient-level information. The only three demographic features have been properly discretized to increase "anonymity". All the clinical variables have been abstracted to high levels than the raw entries (e.g. diagnosis are grouped according to clinical classification software ([CCS]) categories, medications are grouped at their generic name level). Moreover, we will only release less than 25% (~200-300) columns of each data sets, so as to futher control the global risk of re-identification/disclosure.
[CCS]: https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccsfactsheet.jsp
***