-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
24 changed files
with
2,335 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
# Preprocessing | ||
|
||
A number of exisiting pipelines are applied to the input sequencing files (FASTQ format) to enable downstream analysis. This directory includes the relevant references and scripts to ensure reproducibility. | ||
|
||
### ARTEMIS-DELFI Model | ||
|
||
The input features used to train the ARTEMIS-DELFI model are generated by aggregating the features extracted by the DELFI pipeline and the ARTEMIS pipeline. | ||
|
||
* For the DELFI pipeline, please refer to: | ||
|
||
Genome-wide cell-free DNA fragmentation in patients with cancer | ||
Cristiano et al. Nature, 2019 | ||
|
||
Detection and characterization of lung cancer using cell-free DNA fragmentomes | ||
Mathios et al. Nat Comm, 2021 | ||
|
||
https://github.com/cancer-genomics/reproduce_lucas_wflow/tree/master/code/preprocessing | ||
|
||
|
||
* For the ARTEMIS pipeline, please see: | ||
|
||
Genome-wide repeat landscapes in cancer and cell-free DNA | ||
Annapragada et al. Sci Trans Med, 2024 | ||
|
||
https://github.com/cancer-genomics/artemis2024 | ||
|
||
|
||
|
||
### DECIFER Analysis | ||
|
||
The decifer analysis relies on two distinct sources of input data. | ||
|
||
(1) Expression profiles for the transcription factors of interest | ||
|
||
In the current manuscript, we predominantly used gene-level expression data from the TCGA-GTEX-TARGET Expression Dataset available from the UCSC Toil RNAseq Recompute Compendium [1]. Next, we appended this dataset with two independent cell-sorted datasets where the transcriptomes of immune (SRP045500) [2] and brain (SRP064454) [3] cell subpopulations are profiled by RNA-sequencing. For these two datasets, the raw fastq files were retrieved from the SRA and processed through an analysis pipeline mirroring the one described by the UCSC Toil RNAseq Recompute Compendium. The resulting aggregated expression matrix, and the associated metadata table were stored in an R object (merged_tpm.rds), which cannot be included in the workflow due to its large size. | ||
|
||
The file `extract_expression_stats.r` reads in the expression matrix, and compares the TF expression of each tumor/cell type to whole blood from healthy individuals. | ||
|
||
(2) The relative coverage at TFBS calculated from WGS of plasma cfDNA | ||
|
||
Here, we calculate the relative coverage at TFBS following the same procedure used in Mathios et al. [4] and Foda et al. [5]. The script generating the TFBS profile from cfDNA fragment objects (which are generated as part of the DELFI pipeline) can be found at: | ||
|
||
https://github.com/cancer-genomics/reproduce_liver_final/blob/master/code/tfbs_coverage-v1-01val.R | ||
|
||
The TFBS relative coverage tables for the current brain tumor cohort and the liver cohort are not included in the current workflow due to size limitation. | ||
|
||
* References: | ||
|
||
[1] Toil enables reproducible, open source, big biomedical data analyses. | ||
Vivian et al. Nat Biotechnol. 2017. | ||
|
||
[2] Copy number loss of the interferon gene cluster in melanomas is linked to reduced T cell infiltrate and poor patient prognosis. | ||
Linsley, P.S. et al. PLoS One, 2014. | ||
|
||
[3] Purification and Characterization of Progenitor and Mature Human Astrocytes Reveals Transcriptional and Functional Differences with Mouse. | ||
Zhang, Y. et al. Neuron, 2016. | ||
|
||
[4] Detection and characterization of lung cancer using cell-free DNA fragmentomes | ||
Mathios et al. Nat Comm, 2021. | ||
|
||
[5] Detecting Liver Cancer Using Cell-Free DNA Fragmentomes | ||
Foda et al. Cancer Disc, 2023. | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# extract whole blood samples from gtex (healthy individuals) | ||
|
||
.libPaths('/dcs04/scharpf/data/pipeline-hub/pipeline-lib/R-libs/3.17-bioc-release') | ||
|
||
library(tidyverse) | ||
library(data.table) | ||
library(pbapply) | ||
library(effsize) | ||
|
||
#----------------------------------------------# | ||
# annotate gene ids | ||
mapping = readRDS('data/04-decifer/gene-id-mapping.rds') | ||
|
||
tf.ids = readRDS('data/04-decifer/all_tfs.rds') | ||
mapping <- mapping %>% filter(gene_name %in% tf.ids) | ||
map.array = matrix(mapping$gene_name, ncol = 1) | ||
rownames(map.array) <- mapping$gene_id | ||
|
||
#----------------------------------------------# | ||
# read in tpm values | ||
expr <- readRDS('data/expression/merged_tpm.rds') | ||
meta <- expr$meta | ||
mat <- expr$matrix | ||
|
||
mat <- mat[rownames(map.array),] | ||
#----------------------------------------------# | ||
# annotate sample sources for aggregation | ||
s = melt(mat) | ||
colnames(s) = c('gene_id', 'sample_id', 'value') | ||
|
||
s$source = meta[s$sample_id, 'source'] | ||
s$study = meta[s$sample_id, 'study'] | ||
|
||
s$gene_name = map.array[as.character(s$gene_id),1] | ||
remove(list =c('expr')) | ||
#----------------------------------------------# | ||
|
||
gene.ids <- rownames(mat) | ||
types <- meta$source %>% as.character() %>% unique() %>% sort() | ||
|
||
master <- data.frame() | ||
ref.ids <- meta %>% filter(source == 'gtex.whole_blood') %>% rownames() | ||
|
||
for (type in setdiff(types, 'gtex.whole_blood')){ | ||
type.ids <- meta %>% filter(source == type) %>% rownames() %>% as.character() | ||
print(type) | ||
for (gene in gene.ids){ | ||
x = log2(mat[gene, ref.ids]+0.001) | ||
y = log2(mat[gene, type.ids]+0.001) | ||
master <- rbind(master, data.frame(type = type, gene = gene, | ||
gtex.whole_blood.mean = mean(x), | ||
test.mean = mean(y), | ||
gtex.whole_blood.median = median(x), | ||
test.median = median(y), | ||
cohen.d = cohen.d(y, x)$estimate))}} | ||
|
||
|
||
|
||
|
||
counts = meta %>% group_by(source) %>% summarize(n = n()) %>% ungroup() %>% arrange(n) %>% data.frame() %>% dplyr::rename(type = source, test.group.size = n) | ||
|
||
master = merge(master, counts, by = 'type', all.x = TRUE) | ||
saveRDS(master, file = 'data/04-decifer/expression-stats-by-type.rds') |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
### Input Data Extraction | ||
|
||
The scripts in this directory extract analysis-ready R objects from input metadata level tables. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
library(tidyverse) | ||
library(data.table) | ||
library(openxlsx) | ||
library(here) | ||
|
||
tab = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"), | ||
sheet = 6, startRow = 3) | ||
|
||
cbc = tab %>% | ||
dplyr::rename(id = Sample.ID, | ||
pathology = Diagnosis.Group, | ||
immune.cells = Immune.Cell.Population, | ||
fraction = `Cell.Fraction.(%)`) | ||
|
||
save(cbc, file = here("output", "01-rbrain", "cbc.rda")) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
library(tidyverse) | ||
library(data.table) | ||
library(openxlsx) | ||
library(here) | ||
|
||
S1 = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"), | ||
sheet = 1, startRow = 3) | ||
|
||
S2 = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"), | ||
sheet = 2, startRow = 3) | ||
|
||
|
||
colnames(S1)[3] = 'Patient.ID' | ||
S1 = S1 %>% select(-2) %>% filter(grepl('CG', Complete.Sample.ID)) | ||
|
||
|
||
base = S1 %>% | ||
select(Complete.Sample.ID, Clinical.Condition, Age, Gender, Pathology.Classification, Detailed.Pathology, MRI.Enhancement, Ki67, `Sample.Source*`, | ||
`Tumor.Volume.(cm3)`) %>% | ||
dplyr::rename(alternate_id = Complete.Sample.ID, | ||
age = Age, | ||
gender = Gender, | ||
pathology._simplified = Pathology.Classification, | ||
tumor_subset_pathology = Detailed.Pathology, | ||
mri.enhancement = MRI.Enhancement, | ||
ki67.simplified = Ki67, | ||
source = `Sample.Source*`, | ||
volume = `Tumor.Volume.(cm3)`) %>% | ||
mutate(tumor.lab.id = alternate_id, | ||
type = ifelse(Clinical.Condition == "Healthy", 'healthy', 'cancer')) %>% | ||
select(alternate_id, type, age, gender, pathology._simplified, tumor_subset_pathology, mri.enhancement, ki67.simplified, source, volume) | ||
|
||
base$patient_id = base %>% | ||
mutate(p = gsub('P1', 'P', alternate_id)) %>% | ||
separate(p, into = c('root', 'suffix'), '_') %>% | ||
mutate(r = gsub('P$', '', root)) %>% | ||
pull(r) | ||
|
||
|
||
S2 = S2 %>% | ||
select(Alternate_ID, `Discovery.Cohort.(Training)`, `Discovery.Cohort.(Held-out)`, `Validation.Cohort`) %>% | ||
dplyr::rename(alternate_id = Alternate_ID) %>% | ||
mutate(training = ifelse(`Discovery.Cohort.(Training)` == 'YES', TRUE, FALSE), | ||
heldout = ifelse(`Discovery.Cohort.(Held-out)` == 'YES', TRUE, FALSE), | ||
validation = ifelse(`Validation.Cohort` == 'YES', TRUE, FALSE)) %>% | ||
select(alternate_id, training, heldout, validation) | ||
|
||
|
||
data = merge(base, S2, by = 'alternate_id', all.x =TRUE) %>% | ||
arrange(alternate_id) %>% | ||
mutate(index = seq_along(alternate_id), drop = FALSE, exclude= FALSE) %>% | ||
dplyr::relocate(index, .before = 'alternate_id') | ||
|
||
metadata = data %>% | ||
select(index, alternate_id, patient_id, age, gender, type, training, validation, heldout,exclude, drop, | ||
pathology._simplified,tumor_subset_pathology, mri.enhancement, ki67.simplified, volume, source) | ||
|
||
save(metadata, file = here("output", "01-rbrain", "metadata.rda")) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
library(tidyverse) | ||
library(data.table) | ||
library(openxlsx) | ||
library(here) | ||
|
||
ddpcr = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"), | ||
sheet = 3, startRow = 3) | ||
|
||
ts = read.xlsx(here("data", "Supplementary Tables 20240628.xlsx"), | ||
sheet = 4, startRow = 3) | ||
|
||
|
||
ddpcr = ddpcr %>% | ||
mutate(method = 'ddPCR') %>% | ||
filter(Sample.ID != 'CGCNS49P1') %>% | ||
mutate(gene = 'IDH1') %>% | ||
dplyr::rename(id = Sample.ID) | ||
ddpcr$maf = ddpcr$`Mutant.Allele.Fraction.(%)` | ||
ddpcr$maf[ddpcr$`Mutant.Allele.Fraction.(%)` == "<LOD"] = 1e-6 | ||
ddpcr$maf = as.numeric(ddpcr$maf) | ||
|
||
ts_pos = ts %>% | ||
filter(`Mutation.Designation` == "Tumor Mutation") %>% | ||
select(Sample, Gene.Symbol, `%.Mutant.Reads`) %>% | ||
dplyr::rename(id = Sample, gene = Gene.Symbol, maf = `%.Mutant.Reads`) %>% | ||
mutate(method = 'Targeted Sequencing') | ||
|
||
pos_ids = ts_pos$id | ||
tested_ids = ts %>% filter(`Mutation.Designation` != "Exclude") %>% pull(Sample) | ||
neg_ids = setdiff(tested_ids, pos_ids) | ||
|
||
ts_neg = data.frame(id = neg_ids) %>% | ||
mutate(gene = "-", maf = 1e-6, method = "Targeted Sequencing") | ||
|
||
ts = rbind(ts_neg, ts_pos) | ||
|
||
plasma_maf = rbind(ts %>% select(id, gene, maf, method), | ||
ddpcr %>% select(id, gene, maf, method)) | ||
|
||
save(plasma_maf, file = here("output", "01-rbrain", "plasma_maf.rda")) | ||
|
Oops, something went wrong.