Adding model code

cancer-genomics · Sep 9, 2024 · 3f98c76 · 3f98c76
1 parent 0efeb03
commit 3f98c76
Show file tree

Hide file tree

Showing 24 changed files with 2,335 additions and 0 deletions.
diff --git a/code/00-preprocess/README.md b/code/00-preprocess/README.md
@@ -0,0 +1,64 @@
+# Preprocessing 
+
+A number of exisiting pipelines are applied to the input sequencing files (FASTQ format) to enable downstream analysis. This directory includes the relevant references and scripts to ensure reproducibility.
+
+### ARTEMIS-DELFI Model
+
+The input features used to train the ARTEMIS-DELFI model are generated by aggregating the features extracted by the DELFI pipeline and the ARTEMIS pipeline.
+
+* For the DELFI pipeline, please refer to:
+
+Genome-wide cell-free DNA fragmentation in patients with cancer
+Cristiano et al. Nature, 2019
+
+Detection and characterization of lung cancer using cell-free DNA fragmentomes
+Mathios et al. Nat Comm, 2021
+
+https://github.com/cancer-genomics/reproduce_lucas_wflow/tree/master/code/preprocessing
+
+
+* For the ARTEMIS pipeline, please see:
+
+Genome-wide repeat landscapes in cancer and cell-free DNA
+Annapragada et al. Sci Trans Med, 2024
+
+https://github.com/cancer-genomics/artemis2024 
+
+
+
+### DECIFER Analysis
+
+The decifer analysis relies on two distinct sources of input data. 
+
+(1) Expression profiles for the transcription factors of interest 
+
+In the current manuscript, we predominantly used gene-level expression data from the TCGA-GTEX-TARGET Expression Dataset available from the UCSC Toil RNAseq Recompute Compendium [1]. Next, we appended this dataset with two independent cell-sorted datasets where the transcriptomes of immune (SRP045500) [2] and brain (SRP064454) [3] cell subpopulations are profiled by RNA-sequencing. For these two datasets, the raw fastq files were retrieved from the SRA and processed through an analysis pipeline mirroring the one described by the UCSC Toil RNAseq Recompute Compendium. The resulting aggregated expression matrix, and the associated metadata table were stored in an R object (merged_tpm.rds), which cannot be included in the workflow due to its large size.
+
+The file `extract_expression_stats.r` reads in the expression matrix, and compares the TF expression of each tumor/cell type to whole blood from healthy individuals.
+
+(2) The relative coverage at TFBS calculated from WGS of plasma cfDNA
+
+Here, we calculate the relative coverage at TFBS following the same procedure used in Mathios et al. [4] and Foda et al. [5]. The script generating the TFBS profile from cfDNA fragment objects (which are generated as part of the DELFI pipeline) can be found at:
+
+https://github.com/cancer-genomics/reproduce_liver_final/blob/master/code/tfbs_coverage-v1-01val.R
+
+The TFBS relative coverage tables for the current brain tumor cohort and the liver cohort are not included in the current workflow due to size limitation.
+
+* References:
+
+[1] Toil enables reproducible, open source, big biomedical data analyses.
+Vivian et al. Nat Biotechnol. 2017.
+
+[2] Copy number loss of the interferon gene cluster in melanomas is linked to reduced T cell infiltrate and poor patient prognosis.
+Linsley, P.S. et al. PLoS One, 2014.
+
+[3] Purification and Characterization of Progenitor and Mature Human Astrocytes Reveals Transcriptional and Functional Differences with Mouse.
+Zhang, Y. et al. Neuron, 2016.
+
+[4] Detection and characterization of lung cancer using cell-free DNA fragmentomes
+Mathios et al. Nat Comm, 2021.
+
+[5] Detecting Liver Cancer Using Cell-Free DNA Fragmentomes
+Foda et al. Cancer Disc, 2023.
+
+
diff --git a/code/00-preprocess/extract_expression_stats.r b/code/00-preprocess/extract_expression_stats.r
@@ -0,0 +1,63 @@
+# extract whole blood samples from gtex (healthy individuals)
+
+.libPaths('/dcs04/scharpf/data/pipeline-hub/pipeline-lib/R-libs/3.17-bioc-release')
+
+library(tidyverse)
+library(data.table)
+library(pbapply)
+library(effsize)
+
+#----------------------------------------------#
+# annotate gene ids
+mapping = readRDS('data/04-decifer/gene-id-mapping.rds')
+
+tf.ids = readRDS('data/04-decifer/all_tfs.rds')
+mapping <- mapping %>% filter(gene_name %in% tf.ids)
+map.array = matrix(mapping$gene_name, ncol = 1)
+rownames(map.array) <- mapping$gene_id
+
+#----------------------------------------------#
+# read in tpm values
+expr <- readRDS('data/expression/merged_tpm.rds')
+meta <- expr$meta
+mat <- expr$matrix
+
+mat <- mat[rownames(map.array),]
+#----------------------------------------------#
+# annotate sample sources for aggregation
+s = melt(mat)
+colnames(s) = c('gene_id', 'sample_id', 'value')
+
+s$source = meta[s$sample_id, 'source']
+s$study = meta[s$sample_id, 'study']
+
+s$gene_name = map.array[as.character(s$gene_id),1]
+remove(list =c('expr'))
+#----------------------------------------------#
+
+gene.ids <- rownames(mat)
+types <- meta$source %>% as.character() %>% unique() %>% sort()
+
+master <- data.frame()
+ref.ids <- meta %>% filter(source == 'gtex.whole_blood') %>% rownames()
+
+for (type in setdiff(types, 'gtex.whole_blood')){
+   type.ids <- meta %>% filter(source == type) %>% rownames() %>% as.character()
+   print(type)
+   for (gene in gene.ids){
+       x = log2(mat[gene, ref.ids]+0.001)
+       y = log2(mat[gene, type.ids]+0.001)
+       master <- rbind(master, data.frame(type = type, gene = gene, 
+                                          gtex.whole_blood.mean = mean(x),
+                                          test.mean = mean(y),
+                                          gtex.whole_blood.median = median(x),
+                                          test.median = median(y),
+                                          cohen.d = cohen.d(y, x)$estimate))}}
+
+
+
+
+counts = meta %>% group_by(source) %>% summarize(n = n()) %>% ungroup() %>% arrange(n) %>% data.frame() %>% dplyr::rename(type = source, test.group.size = n)
+
+master = merge(master, counts, by = 'type', all.x = TRUE)
+saveRDS(master, file = 'data/04-decifer/expression-stats-by-type.rds')
diff --git a/code/01-rbrain/README.md b/code/01-rbrain/README.md
@@ -0,0 +1,4 @@
+### Input Data Extraction
+
+The scripts in this directory extract analysis-ready R objects from input metadata level tables.
+
diff --git a/code/01-rbrain/cbc.r b/code/01-rbrain/cbc.r
@@ -0,0 +1,15 @@
+library(tidyverse)
+library(data.table)
+library(openxlsx)
+library(here)
+
+tab = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"),
+                  sheet = 6, startRow = 3)
+
+cbc = tab %>%
+      dplyr::rename(id = Sample.ID,
+                    pathology = Diagnosis.Group,
+                    immune.cells = Immune.Cell.Population,
+                    fraction = `Cell.Fraction.(%)`)
+
+save(cbc, file = here("output", "01-rbrain", "cbc.rda"))
diff --git a/code/01-rbrain/metadata.r b/code/01-rbrain/metadata.r
@@ -0,0 +1,58 @@
+library(tidyverse)
+library(data.table)
+library(openxlsx)
+library(here)
+
+S1 = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"),
+               sheet = 1, startRow = 3)
+
+S2 = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"),
+               sheet = 2, startRow = 3)
+
+
+colnames(S1)[3] = 'Patient.ID'
+S1 = S1 %>% select(-2) %>% filter(grepl('CG', Complete.Sample.ID))
+
+
+base = S1 %>%
+  select(Complete.Sample.ID, Clinical.Condition, Age, Gender, Pathology.Classification, Detailed.Pathology, MRI.Enhancement, Ki67, `Sample.Source*`,
+        `Tumor.Volume.(cm3)`) %>%
+  dplyr::rename(alternate_id = Complete.Sample.ID,
+                age = Age,
+                gender = Gender,
+                pathology._simplified = Pathology.Classification,
+                tumor_subset_pathology = Detailed.Pathology,
+                mri.enhancement = MRI.Enhancement,
+                ki67.simplified = Ki67,
+                source = `Sample.Source*`,
+                volume = `Tumor.Volume.(cm3)`) %>%
+  mutate(tumor.lab.id  = alternate_id,
+         type = ifelse(Clinical.Condition == "Healthy", 'healthy', 'cancer')) %>%
+  select(alternate_id, type, age, gender, pathology._simplified, tumor_subset_pathology, mri.enhancement, ki67.simplified, source, volume)
+
+base$patient_id = base %>%
+                  mutate(p = gsub('P1', 'P', alternate_id)) %>%
+                  separate(p, into = c('root', 'suffix'), '_') %>%
+                  mutate(r = gsub('P$', '', root)) %>%
+                  pull(r)
+
+
+S2 = S2 %>%
+     select(Alternate_ID, `Discovery.Cohort.(Training)`, `Discovery.Cohort.(Held-out)`, `Validation.Cohort`) %>%
+     dplyr::rename(alternate_id = Alternate_ID) %>%
+     mutate(training = ifelse(`Discovery.Cohort.(Training)` == 'YES', TRUE, FALSE),
+         heldout = ifelse(`Discovery.Cohort.(Held-out)` == 'YES', TRUE, FALSE),
+         validation = ifelse(`Validation.Cohort` == 'YES', TRUE, FALSE)) %>%
+     select(alternate_id, training, heldout, validation)
+
+
+data = merge(base, S2, by = 'alternate_id', all.x =TRUE) %>%
+       arrange(alternate_id) %>%
+       mutate(index = seq_along(alternate_id), drop = FALSE, exclude= FALSE) %>%
+       dplyr::relocate(index, .before = 'alternate_id')
+
+metadata = data %>%
+         select(index, alternate_id, patient_id, age, gender, type, training, validation, heldout,exclude, drop,
+         pathology._simplified,tumor_subset_pathology, mri.enhancement, ki67.simplified, volume, source)
+
+save(metadata, file = here("output", "01-rbrain", "metadata.rda"))
diff --git a/code/01-rbrain/plasma_maf.r b/code/01-rbrain/plasma_maf.r
@@ -0,0 +1,41 @@
+library(tidyverse)
+library(data.table)
+library(openxlsx)
+library(here)
+
+ddpcr = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"),
+               sheet = 3, startRow = 3)
+
+ts = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"),
+               sheet = 4, startRow = 3)
+
+
+ddpcr = ddpcr %>%
+        mutate(method = 'ddPCR')  %>%
+        filter(Sample.ID != 'CGCNS49P1') %>%
+        mutate(gene = 'IDH1') %>%
+        dplyr::rename(id = Sample.ID)
+ddpcr$maf = ddpcr$`Mutant.Allele.Fraction.(%)`
+ddpcr$maf[ddpcr$`Mutant.Allele.Fraction.(%)` == "<LOD"] = 1e-6
+ddpcr$maf = as.numeric(ddpcr$maf)
+
+ts_pos = ts %>%
+     filter(`Mutation.Designation` == "Tumor Mutation") %>%
+     select(Sample, Gene.Symbol, `%.Mutant.Reads`) %>%
+     dplyr::rename(id = Sample, gene = Gene.Symbol, maf = `%.Mutant.Reads`) %>%
+     mutate(method = 'Targeted Sequencing')
+
+pos_ids = ts_pos$id
+tested_ids = ts %>% filter(`Mutation.Designation` != "Exclude") %>% pull(Sample)
+neg_ids = setdiff(tested_ids, pos_ids)
+
+ts_neg = data.frame(id = neg_ids) %>%
+     mutate(gene = "-", maf = 1e-6, method = "Targeted Sequencing")
+
+ts = rbind(ts_neg, ts_pos)
+
+plasma_maf = rbind(ts %>% select(id, gene, maf, method),
+             ddpcr %>% select(id, gene, maf, method))
+
+save(plasma_maf, file = here("output", "01-rbrain", "plasma_maf.rda"))
+
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,4 @@
		### Input Data Extraction

		The scripts in this directory extract analysis-ready R objects from input metadata level tables.