-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathREADME.qmd
159 lines (123 loc) · 6.97 KB
/
README.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
---
format: gfm
default-image-extension: ""
editor_options:
chunk_output_type: console
---
# DQAstats <img src="man/figures/logo.png" align="right" width="120" />
<!-- badges: start -->
```{r}
#| echo: false
#| message: false
#| results: asis
pkg <- desc::desc_get_field("Package")
cat_var <- paste(
badger::badge_lifecycle(),
badger::badge_cran_release(pkg = pkg),
gsub("summary", "worst", badger::badge_cran_checks(pkg = pkg)),
badger::badge_cran_download(pkg = pkg, type = "grand-total", color = "blue"),
badger::badge_cran_download(pkg = pkg, type = "last-month", color = "blue"),
gsub("netlify\\.com", "netlify.app", badger::badge_dependencies(pkg = pkg)),
badger::badge_github_actions(action = utils::URLencode("R CMD Check via {tic}")),
badger::badge_github_actions(action = "lint"),
badger::badge_github_actions(action = "test-coverage"),
badger::badge_codecov(ref = desc::desc_get_urls()),
badger::badge_doi("10.1055/s-0041-1733847", color = "yellow"),
sep = "\n"
)
cat_var |> cat()
```
<!-- badges: end -->
- [DQAstats](#dqastats)
- [Installation](#installation)
- [CRAN Version](#cran-version)
- [Development Version](#development-version)
- [Configuration of the tool](#configuration-of-the-tool)
- [Example](#example)
- [Demo Usage / Deployment Examples](#demo-usage--deployment-examples)
- [Citation](#citation)
- [More Infos](#more-infos)
The R package 'DQAstats' provides core functionalities to perform data quality assessment (DQA) of electronic health record data (EHR).
Currently implemented features are:
- descriptive (univariate) analysis of categorical and continuous variables of a source database and a target database
- checks of the extract-transform-load (ETL) pipeline:
- comparing of distinct values and valid values between the source database and the target database
- identification of missing/duplicate Data based on a comparison of timestamps
- value conformance checks by comparing the resulting statistics to value constraints (given in a meta data repository (MDR))
- 'atemporal plausibility' checks (multivariate)
- 'uniqueness plausibility' checks (multivariate)
The tool provides one main function, `dqa()`, to create a comprehensive PDF document, which presents all statistics and results of the data quality assessment.
Currently supported input data formats / databases:
- CSV files (via R package [`data.table`](https://cran.r-project.org/package=data.table))
- PostgreSQL (via R package [`RPostgres`](https://cran.r-project.org/package=RPostgres))
- ORACLE (via R package [`RJDBC`](https://cran.r-project.org/package=RJDBC))
## Installation
### CRAN Version
`DQAstats` can be installed directly from CRAN with:
```{r}
#| eval: false
install.packages("DQAstats")
```
### Development Version
You can install the latest development version of `DQAstats` with:
```{r}
#| eval: false
install.packages("remotes")
remotes::install_github("miracum/dqa-dqastats")
```
Note: A working LaTeX installation is a prerequisite for using this software (e.g. using the R package [`tinytex`](https://yihui.org/tinytex/))!
:bulb: If you want to run this in a dockerized environment you can use the [`rocker/verse`](https://hub.docker.com/r/rocker/verse/) image which has TeX already installed.
## Configuration of the tool
The configuration of databases, be it CSV files or SQL-based databases, is done with environment variables, which can be set using the base R command `Sys.setenv()`.
A detailed description, which environment variables need to be set for the specific databases can be found [here](https://github.com/miracum/misc-dizutils#db_connection).
## Example
The following code example is intended to provide a minimal working example on how to apply the DQA tool to data. Example data and a corresponding MDR are provided with the R package *DQAstats* (a working LaTeX installation is a prerequisite for using this software, e.g. by using the R package [`tinytex`](https://yihui.org/tinytex/); please refer to the [DQAstats wiki](https://github.com/miracum/dqa-dqastats/wiki/Installation) for further installation instructions).
- Example data: [https://github.com/miracum/dqa-dqastats/tree/master/inst/demo_data](https://github.com/miracum/dqa-dqastats/tree/master/inst/demo_data)
- Example MDR: [https://github.com/miracum/dqa-dqastats/blob/master/inst/demo_data/utilities/MDR/mdr_example_data.csv](https://github.com/miracum/dqa-dqastats/blob/master/inst/demo_data/utilities/MDR/mdr_example_data.csv)
```{r}
#| eval: false
# Load library DQAstats:
library(DQAstats)
# Set environment vars to demo files paths:
Sys.setenv("EXAMPLECSV_SOURCE_PATH" = system.file("demo_data",
package = "DQAstats"))
Sys.setenv("EXAMPLECSV_TARGET_PATH" = system.file("demo_data",
package = "DQAstats"))
# Set path to utilities folder where to find the mdr and template files:
utils_path <- system.file("demo_data/utilities",
package = "DQAstats")
# Execute the DQA and generate a PDF report:
results <- DQAstats::dqa(
source_system_name = "exampleCSV_source",
target_system_name = "exampleCSV_target",
utils_path = utils_path,
mdr_filename = "mdr_example_data.csv",
output_dir = "output/",
parallel = FALSE
)
# The PDF report is stored at "./output/"
```
## Demo Usage / Deployment Examples
You can test the package without needing to install anything except [docker](https://docs.docker.com/get-docker/).
:bulb: For further details, see the Wiki: <https://github.com/miracum/dqa-dqastats/wiki/Deployment>.
## Citation
L.A. Kapsner, J.M. Mang, S. Mate, S.A. Seuchter, A. Vengadeswaran, F. Bathelt, N. Deppenwiese, D. Kadioglu, D. Kraska, and H.-U. Prokosch, Linking a Consortium-Wide Data Quality Assessment Tool with the MIRACUM Metadata Repository, Appl Clin Inform. 12 (2021) 826–835. doi:[10.1055/s-0041-1733847](https://www.thieme-connect.com/products/ejournals/abstract/10.1055/s-0041-1733847).
```bibtex
@article{kapsner2021,
title = {Linking a {{Consortium}}-{{Wide Data Quality Assessment Tool}} with the {{MIRACUM Metadata Repository}}},
author = {Kapsner, Lorenz A. and Mang, Jonathan M. and Mate, Sebastian and Seuchter, Susanne A. and Vengadeswaran, Abishaa and Bathelt, Franziska and Deppenwiese, Noemi and Kadioglu, Dennis and Kraska, Detlef and Prokosch, Hans-Ulrich},
year = {2021},
month = aug,
journal = {Applied Clinical Informatics},
volume = {12},
number = {04},
pages = {826--835},
issn = {1869-0327},
doi = {10.1055/s-0041-1733847},
language = {en}
}
```
## More Infos
* about MIRACUM: [https://www.miracum.org/](https://www.miracum.org/)
* about the Medical Informatics Initiative: [https://www.medizininformatik-initiative.de/index.php/de](https://www.medizininformatik-initiative.de/index.php/de)
* the precursor's version of the MIRACUM DQA tool ([publication](https://doi.org/10.3233/SHTI190834)): [https://gitlab.miracum.org/miracum/dqa/dqa-p21-i2b2](https://gitlab.miracum.org/miracum/dqa/dqa-p21-i2b2)