Skip to content

Commit

Permalink
Adding model code
Browse files Browse the repository at this point in the history
  • Loading branch information
Niknafs committed Sep 9, 2024
1 parent 0efeb03 commit 3f98c76
Show file tree
Hide file tree
Showing 24 changed files with 2,335 additions and 0 deletions.
64 changes: 64 additions & 0 deletions code/00-preprocess/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
# Preprocessing

A number of exisiting pipelines are applied to the input sequencing files (FASTQ format) to enable downstream analysis. This directory includes the relevant references and scripts to ensure reproducibility.

### ARTEMIS-DELFI Model

The input features used to train the ARTEMIS-DELFI model are generated by aggregating the features extracted by the DELFI pipeline and the ARTEMIS pipeline.

* For the DELFI pipeline, please refer to:

Genome-wide cell-free DNA fragmentation in patients with cancer
Cristiano et al. Nature, 2019

Detection and characterization of lung cancer using cell-free DNA fragmentomes
Mathios et al. Nat Comm, 2021

https://github.com/cancer-genomics/reproduce_lucas_wflow/tree/master/code/preprocessing


* For the ARTEMIS pipeline, please see:

Genome-wide repeat landscapes in cancer and cell-free DNA
Annapragada et al. Sci Trans Med, 2024

https://github.com/cancer-genomics/artemis2024



### DECIFER Analysis

The decifer analysis relies on two distinct sources of input data.

(1) Expression profiles for the transcription factors of interest

In the current manuscript, we predominantly used gene-level expression data from the TCGA-GTEX-TARGET Expression Dataset available from the UCSC Toil RNAseq Recompute Compendium [1]. Next, we appended this dataset with two independent cell-sorted datasets where the transcriptomes of immune (SRP045500) [2] and brain (SRP064454) [3] cell subpopulations are profiled by RNA-sequencing. For these two datasets, the raw fastq files were retrieved from the SRA and processed through an analysis pipeline mirroring the one described by the UCSC Toil RNAseq Recompute Compendium. The resulting aggregated expression matrix, and the associated metadata table were stored in an R object (merged_tpm.rds), which cannot be included in the workflow due to its large size.

The file `extract_expression_stats.r` reads in the expression matrix, and compares the TF expression of each tumor/cell type to whole blood from healthy individuals.

(2) The relative coverage at TFBS calculated from WGS of plasma cfDNA

Here, we calculate the relative coverage at TFBS following the same procedure used in Mathios et al. [4] and Foda et al. [5]. The script generating the TFBS profile from cfDNA fragment objects (which are generated as part of the DELFI pipeline) can be found at:

https://github.com/cancer-genomics/reproduce_liver_final/blob/master/code/tfbs_coverage-v1-01val.R

The TFBS relative coverage tables for the current brain tumor cohort and the liver cohort are not included in the current workflow due to size limitation.

* References:

[1] Toil enables reproducible, open source, big biomedical data analyses.
Vivian et al. Nat Biotechnol. 2017.

[2] Copy number loss of the interferon gene cluster in melanomas is linked to reduced T cell infiltrate and poor patient prognosis.
Linsley, P.S. et al. PLoS One, 2014.

[3] Purification and Characterization of Progenitor and Mature Human Astrocytes Reveals Transcriptional and Functional Differences with Mouse.
Zhang, Y. et al. Neuron, 2016.

[4] Detection and characterization of lung cancer using cell-free DNA fragmentomes
Mathios et al. Nat Comm, 2021.

[5] Detecting Liver Cancer Using Cell-Free DNA Fragmentomes
Foda et al. Cancer Disc, 2023.


63 changes: 63 additions & 0 deletions code/00-preprocess/extract_expression_stats.r
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
# extract whole blood samples from gtex (healthy individuals)

.libPaths('/dcs04/scharpf/data/pipeline-hub/pipeline-lib/R-libs/3.17-bioc-release')

library(tidyverse)
library(data.table)
library(pbapply)
library(effsize)

#----------------------------------------------#
# annotate gene ids
mapping = readRDS('data/04-decifer/gene-id-mapping.rds')

tf.ids = readRDS('data/04-decifer/all_tfs.rds')
mapping <- mapping %>% filter(gene_name %in% tf.ids)
map.array = matrix(mapping$gene_name, ncol = 1)
rownames(map.array) <- mapping$gene_id

#----------------------------------------------#
# read in tpm values
expr <- readRDS('data/expression/merged_tpm.rds')
meta <- expr$meta
mat <- expr$matrix

mat <- mat[rownames(map.array),]
#----------------------------------------------#
# annotate sample sources for aggregation
s = melt(mat)
colnames(s) = c('gene_id', 'sample_id', 'value')

s$source = meta[s$sample_id, 'source']
s$study = meta[s$sample_id, 'study']

s$gene_name = map.array[as.character(s$gene_id),1]
remove(list =c('expr'))
#----------------------------------------------#

gene.ids <- rownames(mat)
types <- meta$source %>% as.character() %>% unique() %>% sort()

master <- data.frame()
ref.ids <- meta %>% filter(source == 'gtex.whole_blood') %>% rownames()

for (type in setdiff(types, 'gtex.whole_blood')){
type.ids <- meta %>% filter(source == type) %>% rownames() %>% as.character()
print(type)
for (gene in gene.ids){
x = log2(mat[gene, ref.ids]+0.001)
y = log2(mat[gene, type.ids]+0.001)
master <- rbind(master, data.frame(type = type, gene = gene,
gtex.whole_blood.mean = mean(x),
test.mean = mean(y),
gtex.whole_blood.median = median(x),
test.median = median(y),
cohen.d = cohen.d(y, x)$estimate))}}




counts = meta %>% group_by(source) %>% summarize(n = n()) %>% ungroup() %>% arrange(n) %>% data.frame() %>% dplyr::rename(type = source, test.group.size = n)

master = merge(master, counts, by = 'type', all.x = TRUE)
saveRDS(master, file = 'data/04-decifer/expression-stats-by-type.rds')
4 changes: 4 additions & 0 deletions code/01-rbrain/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
### Input Data Extraction

The scripts in this directory extract analysis-ready R objects from input metadata level tables.

15 changes: 15 additions & 0 deletions code/01-rbrain/cbc.r
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
library(tidyverse)
library(data.table)
library(openxlsx)
library(here)

tab = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"),
sheet = 6, startRow = 3)

cbc = tab %>%
dplyr::rename(id = Sample.ID,
pathology = Diagnosis.Group,
immune.cells = Immune.Cell.Population,
fraction = `Cell.Fraction.(%)`)

save(cbc, file = here("output", "01-rbrain", "cbc.rda"))
58 changes: 58 additions & 0 deletions code/01-rbrain/metadata.r
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
library(tidyverse)
library(data.table)
library(openxlsx)
library(here)

S1 = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"),
sheet = 1, startRow = 3)

S2 = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"),
sheet = 2, startRow = 3)


colnames(S1)[3] = 'Patient.ID'
S1 = S1 %>% select(-2) %>% filter(grepl('CG', Complete.Sample.ID))


base = S1 %>%
select(Complete.Sample.ID, Clinical.Condition, Age, Gender, Pathology.Classification, Detailed.Pathology, MRI.Enhancement, Ki67, `Sample.Source*`,
`Tumor.Volume.(cm3)`) %>%
dplyr::rename(alternate_id = Complete.Sample.ID,
age = Age,
gender = Gender,
pathology._simplified = Pathology.Classification,
tumor_subset_pathology = Detailed.Pathology,
mri.enhancement = MRI.Enhancement,
ki67.simplified = Ki67,
source = `Sample.Source*`,
volume = `Tumor.Volume.(cm3)`) %>%
mutate(tumor.lab.id = alternate_id,
type = ifelse(Clinical.Condition == "Healthy", 'healthy', 'cancer')) %>%
select(alternate_id, type, age, gender, pathology._simplified, tumor_subset_pathology, mri.enhancement, ki67.simplified, source, volume)

base$patient_id = base %>%
mutate(p = gsub('P1', 'P', alternate_id)) %>%
separate(p, into = c('root', 'suffix'), '_') %>%
mutate(r = gsub('P$', '', root)) %>%
pull(r)


S2 = S2 %>%
select(Alternate_ID, `Discovery.Cohort.(Training)`, `Discovery.Cohort.(Held-out)`, `Validation.Cohort`) %>%
dplyr::rename(alternate_id = Alternate_ID) %>%
mutate(training = ifelse(`Discovery.Cohort.(Training)` == 'YES', TRUE, FALSE),
heldout = ifelse(`Discovery.Cohort.(Held-out)` == 'YES', TRUE, FALSE),
validation = ifelse(`Validation.Cohort` == 'YES', TRUE, FALSE)) %>%
select(alternate_id, training, heldout, validation)


data = merge(base, S2, by = 'alternate_id', all.x =TRUE) %>%
arrange(alternate_id) %>%
mutate(index = seq_along(alternate_id), drop = FALSE, exclude= FALSE) %>%
dplyr::relocate(index, .before = 'alternate_id')

metadata = data %>%
select(index, alternate_id, patient_id, age, gender, type, training, validation, heldout,exclude, drop,
pathology._simplified,tumor_subset_pathology, mri.enhancement, ki67.simplified, volume, source)

save(metadata, file = here("output", "01-rbrain", "metadata.rda"))
41 changes: 41 additions & 0 deletions code/01-rbrain/plasma_maf.r
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
library(tidyverse)
library(data.table)
library(openxlsx)
library(here)

ddpcr = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"),
sheet = 3, startRow = 3)

ts = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"),
sheet = 4, startRow = 3)


ddpcr = ddpcr %>%
mutate(method = 'ddPCR') %>%
filter(Sample.ID != 'CGCNS49P1') %>%
mutate(gene = 'IDH1') %>%
dplyr::rename(id = Sample.ID)
ddpcr$maf = ddpcr$`Mutant.Allele.Fraction.(%)`
ddpcr$maf[ddpcr$`Mutant.Allele.Fraction.(%)` == "<LOD"] = 1e-6
ddpcr$maf = as.numeric(ddpcr$maf)

ts_pos = ts %>%
filter(`Mutation.Designation` == "Tumor Mutation") %>%
select(Sample, Gene.Symbol, `%.Mutant.Reads`) %>%
dplyr::rename(id = Sample, gene = Gene.Symbol, maf = `%.Mutant.Reads`) %>%
mutate(method = 'Targeted Sequencing')

pos_ids = ts_pos$id
tested_ids = ts %>% filter(`Mutation.Designation` != "Exclude") %>% pull(Sample)
neg_ids = setdiff(tested_ids, pos_ids)

ts_neg = data.frame(id = neg_ids) %>%
mutate(gene = "-", maf = 1e-6, method = "Targeted Sequencing")

ts = rbind(ts_neg, ts_pos)

plasma_maf = rbind(ts %>% select(id, gene, maf, method),
ddpcr %>% select(id, gene, maf, method))

save(plasma_maf, file = here("output", "01-rbrain", "plasma_maf.rda"))

Loading

0 comments on commit 3f98c76

Please sign in to comment.