
R scripts for data importing and processing to assess initial feasibility of "starting metformin within 10d of pos. SARS-CoV-2" #2

Open — wants to merge 5 commits into main
Conversation

alainamstutz
Contributor

Hi Venexia,
Here are initial R scripts intended to do the following:

  1. Import/extract feather dataset from OpenSAFELY including basic type formatting of variables
  2. Process the data: a) combine some categories, b) apply the diabetes algorithm, c) apply primary endpoint window (28 days) and create final outcome and exposure/treatment variables, d) add a treatment window period (grace period) that can be changed flexibly
  3. Apply the data quality assurance and data completeness criteria
  4. Apply the eligibility criteria
  5. Summarize how many initiated treatment within applied grace period

I end up with three 1-line tables:

  • n_qa_excluded (how many excluded due to which quality assurance and data completeness criteria)
  • n_excluded (how many of those excluded due to which eligibility criteria)
  • n_treated (how many of those treated with metformin within applied treatment window (= grace period, for now at 10 days) and beyond)

Protocol here: https://docs.google.com/document/d/1Vptfzx-S981m7Th1FIDbAjwe1iUchrXFh0Y6HrVzrDQ/edit#heading=h.fim3pqbbhatr

I am sure I could have structured it a bit more simply, since I based it on the (rather complex) sequential trial repo from Linda/John, but if we go down that line, it is probably good to have it already.
Also, I still need to add the three 1-line tables to my yaml as an action.

Thanks a lot for your initial review - looking forward to your comments!
Best,
-Alain

@venexia

venexia commented Sep 27, 2024

Hi @alainamstutz - given that the dataset-definition PR is now approved, do you want to merge the dataset-definition branch to main and rebase this branch before I do my review so I can test the code fully? Or would you rather I provide high level comments on the R code now?

@alainamstutz
Contributor Author

Hi @venexia - thanks. I merged the dataset-definition back to my main. To get the rebasing right, I now need to create a pull request to merge 'main' into 'analysis', correct? Here: analysis...main

@venexia left a comment

Hi @alainamstutz. Great work on getting so much code done this week. I have added (hopefully!) helpful comments to the code and highlighted some repetitions that should be removed.

My main feedback is that this script needs to be split up and incorporated as a couple of actions. My current thinking is that might look like this:

  • Action 1: Data preprocessing - this would include combining categories, applying the diabetes algorithm (diabetes_algo()), any cleaning
  • Action 2: Quality assurance
  • Action 3: Eligibility criteria
  • Action 4: TTE setup - this would include things like applying the primary endpoint window (28 days) and adding a treatment window period (grace period) that can be changed flexibly
  • Action 5: Feasibility - this is the calc_n_treated stuff that might not be in the final version of the project in this form

There are a couple of reasons to split this up. First, many things are happening in this script. If any one of them goes wrong and causes an error, you will lose all your progress if they are in one action. Second, particularly while you are exploring the feasibility, you may well be making tweaks to actions 4 and 5. There is no need to rerun data preprocessing and QA every time you make a tweak (unless the tweak requires it). Third, for downstream actions, OpenSAFELY reads in all outputs saved in the dependency actions. There is no need for later actions to read in tables detailing how many people were excluded from the study, so it is best to separate them off.

Hope this is clear. Let me know if you want to meet to discuss.

################################################################################
# 0.0 Import libraries + functions
################################################################################
library('feather')
Feather is from a package called 'arrow' rather than 'feather'.
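For context, the import would then look something like this (a minimal sketch; the file path is illustrative):

```r
# Feather files are read via the 'arrow' package, not 'feather'.
# The path below is illustrative.
library(arrow)
data_extracted <- read_feather("output/input.feather")
```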

# POPULATION/DEMOGRAPHIC ----
cov_cat_age = cut(
cov_num_age,
breaks = c(18, 40, 60, 80, Inf),
This should be capped at a plausible maximum age - we usually use 110. The data may contain people with implausible ages and they should be coded as missing and removed from analysis.
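A sketch of what the capped version might look like, assuming the variable names from the script (the labels are illustrative):

```r
# Cap age at a plausible maximum (110). With right = FALSE, values
# outside [18, 110) are coded NA by cut() and can then be dropped.
cov_cat_age <- cut(
  cov_num_age,
  breaks = c(18, 40, 60, 80, 110),
  right = FALSE,  # intervals [18,40), [40,60), [60,80), [80,110)
  labels = c("18-39", "40-59", "60-79", "80-110")
)
# afterwards, e.g.: data <- data %>% filter(!is.na(cov_cat_age))
```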



# MAIN ELIGIBILITY - HISTORY OF T2DM ----
## First, define helper variables needed, esp. for step 5 in diabetes algorithm
If lines 66-77 are all for the diabetes algorithm, then I would move this to diabetes_algorithm.R; then the tmp variables will be removed after they are used.

diabetes_algo() %>%
mutate(
## Third, extract T2DM as a separate variable
cov_bin_t2dm = case_when(cov_cat_diabetes == "T2DM" ~ TRUE, TRUE ~ FALSE),
Is T2DM included as a covariate in your analysis? I thought this was used for inclusion criteria. If so, the prefix 'cov' is a bit misleading and, if this variable gets left in, may prevent you from using wildcards such as cov_* to reference all covariates.
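One possible way to do this, assuming an `elig_` prefix for eligibility variables (the prefix and helper names are illustrative, not project conventions):

```r
library(dplyr)

# Rename the eligibility flag so cov_* wildcards only match true covariates
data <- data %>%
  mutate(elig_bin_t2dm = cov_cat_diabetes == "T2DM")

# Covariates can then be selected safely, e.g.:
# data %>% select(starts_with("cov_"))
```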

# TREATMENT ---- keep the structure as it is, may want to add more treatment strategies (other OADs) in future
You have done lots of work on covariates, moved to exposures and then moved back to covariates. I would group variables by type. You may even want separate functions that add covariates, add exposures, etc.

# completeness criteria
n_is_alive <-
data_processed %>%
filter(qa_bin_was_alive == TRUE & (qa_date_of_death > baseline_date | is.na(qa_date_of_death))) %>% # additional condition since "qa_bin_was_alive == TRUE" may not cover all (e.g. pos test came out after death)
The naming of these variables is very confusing. They are being used for eligibility but named as if they are quality assurance. Also, you potentially have repetition for some of them: there is no point defining a new variable qa_bin_is_female_or_male if you already have cov_cat_sex or similar in your dataset. Keeping the datasets as small as possible will make this run faster.
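For instance, the duplicate flag could be dropped in favour of filtering on the existing sex variable directly (a sketch; the variable and level names are assumed):

```r
library(dplyr)

# Count completeness directly from the existing sex variable,
# instead of defining a separate qa_bin_is_female_or_male flag
n_is_female_or_male <- data_processed %>%
  filter(cov_cat_sex %in% c("female", "male")) %>%
  nrow()
```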

################################################################################
n_excluded <- calc_n_excluded(data_processed$grace10)

data_processed_g10 <- data_processed_g10 %>%
Remove this - it is repeated below.

filter(cov_bin_metfin_interaction == FALSE) %>%
filter(cov_bin_long_covid == FALSE)

data_processed <-
This is where the counting, which is currently happening on row 96, should be happening. You can write a function that both filters and counts that is then called in this map command.
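A minimal sketch of such a helper, applied across the grace-period datasets with purrr (the filter conditions follow the script; the function and output names are illustrative):

```r
library(dplyr)
library(purrr)

# Filter and count in one pass, so the exclusion tally is produced
# exactly where the filtering happens
filter_and_count <- function(df) {
  df_filtered <- df %>%
    filter(cov_bin_metfin_interaction == FALSE,
           cov_bin_long_covid == FALSE)
  list(data = df_filtered,
       n_excluded = nrow(df) - nrow(df_filtered))
}

# data_processed is the named list grace7..grace10
results <- map(data_processed, filter_and_count)
```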

names(data_processed) <- c("grace7", "grace8", "grace9", "grace10")

################################################################################
# 3 Apply quality assurance criteria
I have left comments on section (4) but they all apply to section (3) as well!

################################################################################
# 5 Apply treatment window
################################################################################
n_treated <- calc_n_treated(data_processed$grace10)
[Optional] These tables would typically be presented in a long rather than a wide format.
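A 1-row wide table can be reshaped to long with tidyr, for example (a sketch; the output column names are illustrative, and this assumes all columns share a type):

```r
library(tidyr)

# Reshape the 1-row summary table from wide to long:
# one row per criterion, with its count
n_treated_long <- n_treated %>%
  pivot_longer(cols = everything(),
               names_to = "criterion",
               values_to = "n")
```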

@venexia

venexia commented Sep 27, 2024

P.S. I would also recommend increasing your population size in the dummy data so you have a bit more data to play with!

@alainamstutz
Contributor Author

Hi @venexia - great comments! Quick one: with separate "actions" I guess you mean separate yaml actions, i.e. separate R files for the five actions outlined above?
