-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
R scripts for data importing and processing to assess initial feasibility of "starting metformin within 10d of pos. SARS-CoV-2" #2
base: main
Are you sure you want to change the base?
Conversation
Hi @alainamstutz - given that the dataset-definition PR is now approved, do you want to merge the dataset-definition branch to main and rebase this branch before I do my review so I can test the code fully? Or would you rather I provide high level comments on the R code now? |
Hi @venexia - thanks. I merged the dataset-definition back to my main. To get it right re rebasing, I now need to Create a pull request to merge 'main' to 'analysis', correct? here: analysis...main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @alainamstutz. Great work on getting so much code done this week. I have added (hopefully!) helpful comments to the code and highlighted some repetitions that should be removed.
My main feedback is that this script needs to be split up and incorporated as a couple of actions. My current thinking is that might look like this:
- Action 1: Data preprocessing - this would include combining categories, applying the diabetes algorithm (diabetes_algo()), any cleaning
- Action 2: Quality assurance
- Action 3: Eligibility criteria
- Action 4: TTE stetup - this would include stuff like applying the primary endpoint window (28 days) and add a treatment window period (grace period) that can be changed flexibly
- Action 5: Feasibility - this is the calc_n_treated stuff that might not be in the final version of the project in this form
There are a couple of reasons to split this up. Firstly, many things are happening in this script. If any one of them goes wrong and causes an error, you will lose all your progress if they are in one action. Secondly, particularly while you are exploring the feasibility, you may well be making tweaks to actions 4 and 5. There is no need to rerun data preprocessing and QA every time you make a tweak (unless the tweak requires it). Third, for downstream actions, OpenSAFELY reads in all outputs saved in the dependency actions. There is no need for later actions to read in tables detailing how many people were excluded from the study so best to separate them off.
Hope this is clear. Let me know if you want to meet to discuss.
################################################################################ | ||
# 0.0 Import libraries + functions | ||
################################################################################ | ||
library('feather') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feather is from a package called 'arrow' rather than 'feather'.
# POPULATION/DEMOGRAPHIC ---- | ||
cov_cat_age = cut( | ||
cov_num_age, | ||
breaks = c(18, 40, 60, 80, Inf), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be capped at a plausible maximum age - we usually use 110. The data may contain people with implausible ages and they should be coded as missing and removed from analysis.
|
||
|
||
# MAIN ELIGIBILITY - HISTORY OF T2DM ---- | ||
## First, define helper variables needed, esp. for step 5 in diabetes algorithm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Iines 66-77 are all for the diabetes algorithm then I would move this to diabetes_algorithm.R
then the tmp variables will be removed after they are used.
diabetes_algo() %>% | ||
mutate( | ||
## Third, extract T2DM as a separate variable | ||
cov_bin_t2dm = case_when(cov_cat_diabetes == "T2DM" ~ TRUE, TRUE ~ FALSE), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is T2DM included as a covariate in your analysis? I thought this was used for inclusion criteria. If so, the prefix 'cov' is a bit misleading and, if this variable gets left in, may prevent you from using wildcards such as cov_*
to reference all covariates.
cov_bin_t2dm = case_when(cov_cat_diabetes == "T2DM" ~ TRUE, TRUE ~ FALSE), | ||
|
||
|
||
# TREATMENT ---- keep the structure as it is, may want to add more treatment strategies (other OADs) in future |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have done lots of work on covariates, moved to exposures and then moved back to covariates. I would group variables by type. You may even want separate functions that add covariates, add exposures, etc.
# completeness criteria | ||
n_is_alive <- | ||
data_processed %>% | ||
filter(qa_bin_was_alive == TRUE & (qa_date_of_death > baseline_date | is.na(qa_date_of_death))) %>% # additional condition since "qa_bin_was_alive == TRUE" may not cover all (e.g. pos test came out after death) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming of these variables is very confusing. They are being used for eligibility but named as if they are quality assurance. Also, you potentially have repetition for some of them, there is no point defining a new variable qa_bin_is_female_or_male
if you already have cov_cat_sex
or similar in your dataset. Keeping the datasets as small as possible is going to make this run faster.
################################################################################ | ||
n_excluded <- calc_n_excluded(data_processed$grace10) | ||
|
||
data_processed_g10 <- data_processed_g10 %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this - it is repeated below.
filter(cov_bin_metfin_interaction == FALSE) %>% | ||
filter(cov_bin_long_covid == FALSE) | ||
|
||
data_processed <- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the counting, which is currently happening on row 96, should be happening. You can write a function that both filters and counts that is then called in this map command.
names(data_processed) <- c("grace7", "grace8", "grace9", "grace10") | ||
|
||
################################################################################ | ||
# 3 Apply quality assurance criteria |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have left comments on section (4) but they all apply to section (3) as well!
################################################################################ | ||
# 5 Apply treatment window | ||
################################################################################ | ||
n_treated <- calc_n_treated(data_processed$grace10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Optional] These tables would typically be presented in a long rather than a wide format.
P.S. I would also recommend increasing your population size in the dummy data so you have a bit more data to play with! |
Hi @venexia - great comments! Quick one: With separate "Actions" I guess you mean separate yaml actions = separate R files for the 5 actions outlined above? |
Hi Venexia,
Here are initial R scripts supposed to do the following:
I end up with three 1-line tables:
Protocol here: https://docs.google.com/document/d/1Vptfzx-S981m7Th1FIDbAjwe1iUchrXFh0Y6HrVzrDQ/edit#heading=h.fim3pqbbhatr
I am sure I could have structured it a bit simpler since I based it on the (rather complex) sequential trial repo from Linda/John, but if we go down that line, then probably good to have it already.
Also, still I need to add the three 1-line tables to my yaml as an action.
Thanks a lot for your initial review - looking forward to your comments!
Best,
-Alain