TOPAS-pipeline

Automated (phospho)proteomics processing pipeline for large patient cohorts providing cohort as well as patient-specific insights.

The results of the pipeline can be explored on the web-based TOPAS portal (https://github.com/kusterlab/TOPAS-portal.git). A public instance of the portal is available at https://topas-portal.kusterlab.org/

Supported inputs

- MaxQuant (TMT, LFQ)
- SIMSI-Transfer (TMT)

Runtime overview

Runtimes exclude processing time of SIMSI-Transfer.

| Dataset | #samples | #cores | runtime (h) | max memory (GB) |
|---|---|---|---|---|
| CPTAC UCEC | 170 | 8 | 1.2 | 30 |
| CPTAC BRCA | 170 | 8 | 1.4 | 30 |
| CPTAC LUAD | 250 | 8 | 2 | 40 |
| MTB cohort | 2068 | 8 | | |

Configuration

The pipeline reads its configuration from a JSON file. Examples can be found in config.json (full version) and config_minimal.json (required settings only). Both relative and absolute paths are allowed.

Input parameters:

| Parameter | Required | Description | Example | Default |
|---|---|---|---|---|
| results_folder | yes | Path to the folder where results will be written. | `"results/example_run"` | |
| sample_annotation | yes | Path to the sample annotation file (CSV). | `"example/annotation.csv"` | |
| metadata_annotation | yes | Path to the metadata annotation file (Excel). | `"example/METADATA_UCEC.xlsx"` | |
| raw_file_folders | yes | List of raw file folders for proteomics and phosphoproteomics data. | `["example/raw_fp", "example/raw_pp"]` | |
| data_types | | List of data types to process: "fp" for proteome and "pp" for phosphoproteome. | `["fp", "pp"]` | `["fp", "pp"]` |
| slack_webhook_url | | URL for the Slack webhook. | `""` | `""` |
| **simsi** | | | | |
| run_simsi | | Boolean indicating whether to run SIMSI analysis. | `true` | `true` |
| simsi_folder | yes | Path to the folder for writing SIMSI-Transfer results. | `"results/SIMSI"` | N/A |
| tmt_ms_level | | MS level for TMT quantification. | `"ms2"` | `"ms2"` |
| stringencies | | Stringency value for MaRaCluster. | `10` | `10` |
| tmt_requantify | | Boolean indicating whether to requantify TMT data. | `false` | `false` |
| maximum_pep | | Maximum posterior error probability in percent for peptide ID propagation. | `1` | `1` |
| num_threads | | Number of threads to use for SIMSI-Transfer. | `8` | `8` |
| **preprocessing** | | | | |
| raw_data_location | yes | Path to the folder containing MaxQuant search result folders. | `"example/CPTAC_searches"` | N/A |
| fasta_file | yes | Path to the FASTA file for protein sequences. | `"example/uniprot_human.fasta"` | N/A |
| picked_fdr | | False discovery rate threshold for protein groups. | `0.01` | `0.01` |
| fdr_num_threads | | Number of threads to use in MaxLFQ computation. | `8` | `8` |
| imputation | | Perform data imputation within batch on phosphoproteome level. | `true` | `true` |
| debug | | Run in debug mode. | `false` | `false` |
| run_lfq | | Input is from LFQ experiments. | `false` | `false` |
| normalize_to_reference | | Normalize channel intensities to the reference channel. | `false` | `false` |
| **clinic_proc** | | | | |
| pspFastaFile | yes | Path to the PSP FASTA file. | `"PSP_annotations/Phosphosite_seq.fasta"` | N/A |
| pspKinaseSubstrateFile | yes | Path to the PSP kinase-substrate dataset. | `"PSP_annotations/Kinase_Substrate_Dataset"` | N/A |
| pspAnnotationFile | yes | Path to the PSP phosphorylation site dataset. | `"PSP_annotations/Phosphorylation_site_dataset"` | N/A |
| pspRegulatoryFile | yes | Path to the PSP regulatory sites file. | `"PSP_annotations/Regulatory_sites"` | N/A |
| prot_baskets | yes | Path to the annotation file for TOPAS scores and proteins of interest. | `"TOPASscores_POI_AS_250307.xlsx"` | N/A |
| extra_kinase_annot | | Path to the annotation file with custom kinase-substrate relations. | `""` | `""` |
| **report** | | | | |
| samples_for_report | | Which samples to include in the report. | `"all"` | `"all"` |
| **portal** | | | | |
| update | | Automatically update the TOPAS portal once the run has finished. | `false` | `false` |
| cohort | | Specifies the cohort name that should be updated in the TOPAS portal. | `""` | `""` |
| url | | URL of the TOPAS portal. | `""` | `""` |
| config | | Configuration file for the TOPAS portal. | `""` | `""` |
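
Before starting a long run, it can help to check that every required entry is present. The following is a minimal sketch and not part of the pipeline: it assumes that the group rows of the table (simsi, preprocessing, clinic_proc) correspond to nested JSON objects, and `missing_required` and `load_and_check` are hypothetical helper names.

```python
import json

# Required entries taken from the parameter table. The nesting of section
# keys (e.g. "simsi" -> "simsi_folder") is an assumption about the layout.
REQUIRED = {
    "": ["results_folder", "sample_annotation", "metadata_annotation", "raw_file_folders"],
    "simsi": ["simsi_folder"],
    "preprocessing": ["raw_data_location", "fasta_file"],
    "clinic_proc": [
        "pspFastaFile",
        "pspKinaseSubstrateFile",
        "pspAnnotationFile",
        "pspRegulatoryFile",
        "prot_baskets",
    ],
}


def missing_required(config: dict) -> list[str]:
    """Return the dotted names of required entries that are absent."""
    missing = []
    for section, keys in REQUIRED.items():
        # Top-level keys live in the config itself, the rest in a sub-object.
        block = config if section == "" else config.get(section, {})
        for key in keys:
            if key not in block:
                missing.append(f"{section}.{key}" if section else key)
    return missing


def load_and_check(path: str) -> dict:
    """Load a JSON config and raise if required entries are missing."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    missing = missing_required(config)
    if missing:
        raise ValueError("Missing required config entries: " + ", ".join(missing))
    return config
```

If the layout assumption holds for your file, `load_and_check("config_minimal.json")` returns the parsed configuration or names the entries that still need to be filled in.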

Install webhook for slack (optional)

If you want the pipeline to post update messages (finished runs, error messages) to a Slack channel, follow these steps:

1. Create a new Slack app at https://api.slack.com/apps?new_app=1 and choose the "From scratch" option.
2. Select your Slack workspace and pick an appropriate name for the app, e.g. topas-pipeline.
3. Navigate to Incoming webhooks in the left menu.
4. Set Activate Incoming webhooks to On if it is not already enabled.
5. Click Add New Webhook to Workspace at the bottom of the page.
6. Select the channel you want to post messages in.
7. Copy the generated Webhook URL into your config file as the slack_webhook_url property.

Source: https://api.slack.com/messaging/webhooks
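
To confirm that the webhook works before a pipeline run, you can post a test message yourself. This is a minimal sketch using only the Python standard library. It relies on the fact that Slack incoming webhooks accept a JSON body with a `text` field. The function names are hypothetical, and how the pipeline itself posts its messages is not shown here.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying a plain-text Slack message."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_message(webhook_url: str, text: str = "TOPAS-pipeline webhook test") -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    # urlopen raises urllib.error.HTTPError for non-2xx responses.
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_test_message(...)` with the value of `slack_webhook_url` from your config file should make the test message appear in the selected channel.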

Running the pipeline

With Docker (recommended)

Requirements:

- git
- docker
- make

Clone this repository

git clone https://github.com/kusterlab/TOPAS-pipeline.git

Build the docker image
```
make build
```
Create a config file named config_patients.json in the repository with your configurations (see section Configuration) and run:
```
make docker_all 
```
You can use a custom config file (works only with relative paths) and adjust the memory and cores (default: 300GB, 8 cores):
```
CONFIG_FILE=./path/to/config.json MEMORY_LIMIT=300gb CPU_LIMIT=16 make docker_all
```
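
When the Docker run is launched from a script, the same variables can be passed through the environment. The sketch below is one possible wrapper, not part of the repository: `docker_env` and `run_docker_pipeline` are hypothetical names, and the defaults mirror the ones stated above (300GB, 8 cores). It rejects absolute config paths up front because the Docker run only works with relative ones.

```python
import os
import subprocess
from pathlib import PurePosixPath


def docker_env(config_file: str, memory_limit: str = "300gb", cpu_limit: int = 8) -> dict:
    """Return the environment variables for `make docker_all`."""
    # Paths are checked with POSIX rules, assuming a Linux-style host.
    if PurePosixPath(config_file).is_absolute():
        raise ValueError("CONFIG_FILE must be a relative path for the Docker run")
    return {
        "CONFIG_FILE": config_file,
        "MEMORY_LIMIT": memory_limit,
        "CPU_LIMIT": str(cpu_limit),
    }


def run_docker_pipeline(config_file: str, **limits) -> None:
    """Run `make docker_all` in the current directory with the given settings."""
    env = {**os.environ, **docker_env(config_file, **limits)}
    subprocess.run(["make", "docker_all"], env=env, check=True)
```

For example, `run_docker_pipeline("./path/to/config.json", cpu_limit=16)` corresponds to the shell command above when executed from the repository root.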

With conda and poetry

Requirements:

- git
- Python ">=3.9, <=3.11"
- poetry
- make
- conda

Create and activate a conda environment (the packages from the poetry.lock file are installed in a later step):

conda create --name topas-pipeline python=3.9.12
conda activate topas-pipeline

Clone this repository

git clone https://github.com/kusterlab/TOPAS-pipeline.git

Install dependencies and start a poetry shell
```
poetry install
poetry shell
```
Adjust the file paths in config_minimal.json and run:
```
make all
```

Note that it is also possible to run individual pipeline modules, e.g.:

# run simsi
python -m topas_pipeline.simsi -c config.json

# run whole pipeline following simsi
python -m topas_pipeline.main -c config.json 

# run clinical annotation
python -m topas_pipeline.clinical_process -c config.json
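
If you want to chain selected modules from a script, the commands above can be wrapped as follows. This is a sketch under the assumption that each module only needs the `-c` option shown above. `module_command` and `run_modules` are hypothetical helpers, not functions provided by the pipeline.

```python
import subprocess
import sys


def module_command(module: str, config_path: str) -> list[str]:
    """Build the `python -m <module> -c <config>` command line."""
    # sys.executable keeps the run inside the active conda/poetry environment.
    return [sys.executable, "-m", module, "-c", config_path]


def run_modules(modules: list[str], config_path: str = "config.json") -> None:
    """Run the given modules in order and stop at the first failure."""
    for module in modules:
        subprocess.run(module_command(module, config_path), check=True)
```

For example, `run_modules(["topas_pipeline.simsi", "topas_pipeline.main"])` runs SIMSI first and then the rest of the pipeline. `check=True` raises `subprocess.CalledProcessError` as soon as one module exits with an error.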

Example

An example of the project folder setup and configuration file can be found in the /example folder. See the README in that folder for details.

Integration tests

If problems arise with running the pipeline, check if the integration tests pass.

pytest ./tests/integration_tests/test_simsi.py
pytest ./tests/integration_tests/test_picked_group.py
pytest ./tests/integration_tests/test_clinical_tools.py

Pipeline result files

The pipeline creates a folder with multiple output files:

| Output file | Description | Used on portal |
|---|---|---|
| config.json | Copy of the configuration file in JSON format used for this pipeline run as described above. | |
| sample_annot.tsv | Saved copy of current version of sample annotation/metadata given as input | |
| sample_annot_filtered.tsv | Subset of sample annotation/metadata after filtering out QC failed samples | |
| meta_input_file_{data_type}.tsv | Location per batch of search folder input, raw files and TMT correction factor file | |
| {data_type}_qc_numbers.csv | Per sample count of peptides, median intensities and summed intensities | |
| {data_type}_qc_batch_wise.csv | Per batch median and summed intensities | |
| {data_type}_in_batch_correction_factors.csv | Per sample correction factors for in-batch median centering | |
| {data_type}_ms1_correction_factors.csv | Per batch correction factors for MS1 median centering | |
| evidence.txt | Combined output from SIMSI-Transfer (same format as MQ evidence file) | |
| pickedGeneGroups.txt | Output from Picked Protein Group FDR (using gene-level) | |
| pickedGeneGroups_with_quant.txt | Groups from Picked Protein Group FDR (using gene-level) containing quant using MaxLFQ algorithm | |
| preprocessed_{data_type}.csv | Data matrix with patients as columns and normalized abundances of proteins or phosphopeptides as rows | |
| preprocessed_{data_type}_with_ref.csv | Same as preprocessed_{data_type}.csv but also containing the QC channels | |
| annot_{data_type}.csv | Same as preprocessed_{data_type}.csv but with clinical annotations | |
| annot_{data_type}_with_ref.csv | Same as preprocessed_{data_type}_with_ref.csv but with clinical annotations | |
| {data_type}_measures_rank.tsv | Data matrix with patients as columns and in-cohort rank of proteins or phosphopeptides as rows | |
| {data_type}_measures_fc.tsv | Same as {data_type}_measures_rank.tsv but with fold changes | |
| {data_type}_measures_z.tsv | Same as {data_type}_measures_rank.tsv but with z-scores | |
| {data_type}_measures_p.tsv | Same as {data_type}_measures_rank.tsv but with p-values derived from the z-scores | |
| basket_scores_4th_gen_zscored.tsv | Data matrix with patients as columns and TOPAS scores as rows | |
| subbasket_scores_{topas_rtk}.tsv | Data matrix with patients as columns and TOPAS subscores as rows for each TOPAS RTK | |
| kinase_results/kinase_scores.tsv | Data matrix with patients as columns and TOPAS substrate phosphorylation scores as rows | |
| protein_results/protein_scores.tsv | Data matrix with patients as columns and TOPAS protein phosphorylation scores as rows | |
| Reports/{patient_id}.xlsx | Patient-specific reports | |
| Pipeline_log.txt | Log messages printed by the pipeline | |
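
After a run, a quick check of the results folder can show whether the main outputs were written. The sketch below covers only a subset of the files listed above. It assumes that `{data_type}` is replaced by the values from `data_types` ("fp", "pp") and that the files sit directly in `results_folder`; both are assumptions, and the helper names are hypothetical.

```python
from pathlib import Path

# File names copied from the table of result files (a subset only).
FIXED_FILES = ["config.json", "sample_annot.tsv", "Pipeline_log.txt"]
PER_TYPE_TEMPLATES = [
    "preprocessed_{data_type}.csv",
    "annot_{data_type}.csv",
    "{data_type}_measures_z.tsv",
]


def expected_files(data_types) -> list[str]:
    """Expand the templates for each data type, e.g. "fp" or "pp"."""
    names = list(FIXED_FILES)
    for data_type in data_types:
        names.extend(t.format(data_type=data_type) for t in PER_TYPE_TEMPLATES)
    return names


def missing_outputs(results_folder: str, data_types=("fp", "pp")) -> list[str]:
    """Return the expected file names that are not present in the folder."""
    folder = Path(results_folder)
    return [name for name in expected_files(data_types) if not (folder / name).is_file()]
```

An empty list from `missing_outputs("results/example_run")` means all checked files exist. Any names it returns are a starting point for looking into Pipeline_log.txt.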
