
R scripts for data importing and processing to assess initial feasibility of "starting metformin within 10d of pos. SARS-CoV-2" #2

Open — wants to merge 5 commits into main
Conversation

alainamstutz
Contributor

Hi Venexia,
Here are initial R scripts intended to do the following:

  1. Import/extract feather dataset from OpenSAFELY including basic type formatting of variables
  2. Process the data: a) combine some categories, b) apply the diabetes algorithm, c) apply primary endpoint window (28 days) and create final outcome and exposure/treatment variables, d) add a treatment window period (grace period) that can be changed flexibly
  3. Apply the data quality assurance and data completeness criteria
  4. Apply the eligibility criteria
  5. Summarize how many initiated treatment within applied grace period

I end up with three 1-line tables:

  • n_qa_excluded (how many excluded due to which quality assurance and data completeness criteria)
  • n_excluded (how many of those excluded due to which eligibility criteria)
  • n_treated (how many of those treated with metformin within applied treatment window (= grace period, for now at 10 days) and beyond)

Protocol here: https://docs.google.com/document/d/1Vptfzx-S981m7Th1FIDbAjwe1iUchrXFh0Y6HrVzrDQ/edit#heading=h.fim3pqbbhatr

I am sure I could have structured it a bit more simply, since I based it on the (rather complex) sequential trial repo from Linda/John, but if we go down that line, it is probably good to have it already.
Also, I still need to add the three 1-line tables to my yaml as an action.

Thanks a lot for your initial review - looking forward to your comments!
Best,
-Alain

@venexia

venexia commented Sep 27, 2024

Hi @alainamstutz - given that the dataset-definition PR is now approved, do you want to merge the dataset-definition branch to main and rebase this branch before I do my review so I can test the code fully? Or would you rather I provide high level comments on the R code now?

@alainamstutz
Contributor Author

Hi @venexia - thanks. I merged the dataset-definition back to my main. To get the rebasing right, I now need to create a pull request to merge 'main' into 'analysis', correct? Here: analysis...main

@venexia left a comment

Hi @alainamstutz. Great work on getting so much code done this week. I have added (hopefully!) helpful comments to the code and highlighted some repetitions that should be removed.

My main feedback is that this script needs to be split up and incorporated as a couple of actions. My current thinking is that might look like this:

  • Action 1: Data preprocessing - this would include combining categories, applying the diabetes algorithm (diabetes_algo()), any cleaning
  • Action 2: Quality assurance
  • Action 3: Eligibility criteria
  • Action 4: TTE setup - this would include things like applying the primary endpoint window (28 days) and adding a treatment window period (grace period) that can be changed flexibly
  • Action 5: Feasibility - this is the calc_n_treated stuff that might not be in the final version of the project in this form

There are a couple of reasons to split this up. First, many things are happening in this script. If any one of them goes wrong and causes an error, you will lose all your progress if they are in one action. Second, particularly while you are exploring the feasibility, you may well be making tweaks to actions 4 and 5. There is no need to rerun data preprocessing and QA every time you make a tweak (unless the tweak requires it). Third, for downstream actions, OpenSAFELY reads in all outputs saved in the dependency actions. There is no need for later actions to read in tables detailing how many people were excluded from the study, so it is best to separate them off.

Hope this is clear. Let me know if you want to meet to discuss.

################################################################################
# 0.0 Import libraries + functions
################################################################################
library('feather')
Feather is from a package called 'arrow' rather than 'feather'.
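For context, the import would then look something like this (a minimal sketch; the file path is illustrative):

```r
# Feather files are read via the 'arrow' package, not 'feather'.
# The path below is illustrative.
library(arrow)
data_extracted <- read_feather("output/input.feather")
```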

# POPULATION/DEMOGRAPHIC ----
cov_cat_age = cut(
cov_num_age,
breaks = c(18, 40, 60, 80, Inf),
This should be capped at a plausible maximum age - we usually use 110. The data may contain people with implausible ages and they should be coded as missing and removed from analysis.
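A sketch of what the capped version might look like, assuming the variable names from the script (the labels are illustrative):

```r
# Cap age at a plausible maximum (110). With right = FALSE, values
# outside [18, 110) are coded NA by cut() and can then be dropped.
cov_cat_age <- cut(
  cov_num_age,
  breaks = c(18, 40, 60, 80, 110),
  right = FALSE,  # intervals [18,40), [40,60), [60,80), [80,110)
  labels = c("18-39", "40-59", "60-79", "80-110")
)
# afterwards, e.g.: data <- data %>% filter(!is.na(cov_cat_age))
```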



# MAIN ELIGIBILITY - HISTORY OF T2DM ----
## First, define helper variables needed, esp. for step 5 in diabetes algorithm
If lines 66-77 are all for the diabetes algorithm, then I would move this to diabetes_algorithm.R; then the tmp variables will be removed after they are used.

diabetes_algo() %>%
mutate(
## Third, extract T2DM as a separate variable
cov_bin_t2dm = case_when(cov_cat_diabetes == "T2DM" ~ TRUE, TRUE ~ FALSE),
Is T2DM included as a covariate in your analysis? I thought this was used for inclusion criteria. If so, the prefix 'cov' is a bit misleading and, if this variable gets left in, may prevent you from using wildcards such as cov_* to reference all covariates.
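One possible way to do this, assuming an `elig_` prefix for eligibility variables (the prefix and helper names are illustrative, not project conventions):

```r
library(dplyr)

# Rename the eligibility flag so cov_* wildcards only match true covariates
data <- data %>%
  mutate(elig_bin_t2dm = cov_cat_diabetes == "T2DM")

# Covariates can then be selected safely, e.g.:
# data %>% select(starts_with("cov_"))
```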

# TREATMENT ---- keep the structure as it is, may want to add more treatment strategies (other OADs) in future
You have done lots of work on covariates, moved to exposures and then moved back to covariates. I would group variables by type. You may even want separate functions that add covariates, add exposures, etc.

# completeness criteria
n_is_alive <-
data_processed %>%
filter(qa_bin_was_alive == TRUE & (qa_date_of_death > baseline_date | is.na(qa_date_of_death))) %>% # additional condition since "qa_bin_was_alive == TRUE" may not cover all (e.g. pos test came out after death)
The naming of these variables is very confusing. They are being used for eligibility but named as if they are quality assurance. Also, you potentially have repetition for some of them: there is no point defining a new variable qa_bin_is_female_or_male if you already have cov_cat_sex or similar in your dataset. Keeping the datasets as small as possible will make this run faster.
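For instance, the duplicate flag could be dropped in favour of filtering on the existing sex variable directly (a sketch; the variable and level names are assumed):

```r
library(dplyr)

# Count completeness directly from the existing sex variable,
# instead of defining a separate qa_bin_is_female_or_male flag
n_is_female_or_male <- data_processed %>%
  filter(cov_cat_sex %in% c("female", "male")) %>%
  nrow()
```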

################################################################################
n_excluded <- calc_n_excluded(data_processed$grace10)

data_processed_g10 <- data_processed_g10 %>%
Remove this - it is repeated below.

filter(cov_bin_metfin_interaction == FALSE) %>%
filter(cov_bin_long_covid == FALSE)

data_processed <-
This is where the counting, which is currently happening on row 96, should be happening. You can write a function that both filters and counts that is then called in this map command.
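A minimal sketch of such a helper, applied across the grace-period datasets with purrr (the filter conditions follow the script; the function and output names are illustrative):

```r
library(dplyr)
library(purrr)

# Filter and count in one pass, so the exclusion tally is produced
# exactly where the filtering happens
filter_and_count <- function(df) {
  df_filtered <- df %>%
    filter(cov_bin_metfin_interaction == FALSE,
           cov_bin_long_covid == FALSE)
  list(data = df_filtered,
       n_excluded = nrow(df) - nrow(df_filtered))
}

# data_processed is the named list grace7..grace10
results <- map(data_processed, filter_and_count)
```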

names(data_processed) <- c("grace7", "grace8", "grace9", "grace10")

################################################################################
# 3 Apply quality assurance criteria
I have left comments on section (4) but they all apply to section (3) as well!

################################################################################
# 5 Apply treatment window
################################################################################
n_treated <- calc_n_treated(data_processed$grace10)
[Optional] These tables would typically be presented in a long rather than a wide format.
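A 1-row wide table can be reshaped to long with tidyr, for example (a sketch; the output column names are illustrative, and this assumes all columns share a type):

```r
library(tidyr)

# Reshape the 1-row summary table from wide to long:
# one row per criterion, with its count
n_treated_long <- n_treated %>%
  pivot_longer(cols = everything(),
               names_to = "criterion",
               values_to = "n")
```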

@venexia

venexia commented Sep 27, 2024

P.S. I would also recommend increasing your population size in the dummy data so you have a bit more data to play with!

@alainamstutz
Contributor Author

Hi @venexia - great comments! Quick one: with separate "actions" I guess you mean separate yaml actions, i.e. separate R files for the five actions outlined above?
