Releases: bbglab/oncodrive3d
Release v1.0.5
Oncodrive3D is a fast and accurate 3D-clustering algorithm for driver gene discovery.
Key Updates and Features
This release addresses bug fixes in the build annotations and plotting modules, introduces enhancements to association plots, updates documentation, and includes general code cleanups.
Bug Fixes
Features in associations plots
- Added FDR correction to the logistic regression analysis of associations between clusters and annotations
- Added association plots to the Nextflow pipeline
- Removed comparative plots
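The release note does not say which FDR procedure is used. As a minimal sketch, assuming the common Benjamini-Hochberg adjustment, the correction of a set of per-annotation p-values could look like this. The function name and the example p-values are hypothetical and are not taken from the Oncodrive3D code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is used here only as an illustration of FDR control;
    the release note does not name the procedure.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, one per annotation tested against cluster membership.
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
```

Each adjusted value is at least as large as its raw p-value, so fewer associations pass a fixed significance cutoff after correction.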
Others
- Documentation update
- Linting
Release v1.0.4
Second release of Oncodrive3D, a fast and accurate 3D-clustering algorithm for driver gene discovery. It identifies mutation-enriched volumes by analyzing missense somatic mutations, leveraging AlphaFold's structural predictions to define residue contacts and mutation profiles to simulate neutral mutagenesis. The tool uses rank-based statistics and can process mutations from duplex sequencing studies, enabling the analysis of both cancer and normal tissue datasets across potentially any organism.
Key Updates and Features
This release mainly updates the README with important information and fixes a bug in the oncodrive3d build-datasets step.
Documentation Updates
- Improved the documentation overall for clarity and usability.
- Added steps to fulfill software requirements, addressing installation failures on older machines lacking updated C libraries.
- Provided detailed information on input and output data formats, including:
  - How to obtain the required input files.
  - In-depth descriptions of the main outputs, including gene-level and residue-level clustering results.
Bug Fixes and Refactoring
- Fixed bug in scripts/datasets/build_datasets.py and scripts/datasets/seq_for_mut_prob.py:
  - Disabled downloading and integrating MANE structures if the --mane flag is not enabled.
  - Removed usage of files related to the MANE downloads when running seq_for_mut_prob.py for a non-MANE human proteome.
- Updated scripts/datasets/utils.py to increase the timeout for sock_read in PyPdl, preventing errors during the download of AlphaFold structures.
- Refactored scripts/main.py by moving the import of specific modules into their corresponding functions for better modularity and efficiency.
Release v1.0.3
First release of Oncodrive3D, a fast and accurate 3D-clustering algorithm for driver gene discovery. It identifies mutation-enriched volumes by analyzing missense somatic mutations, leveraging AlphaFold's structural predictions to define residue contacts and mutation profiles to simulate neutral mutagenesis. The tool uses rank-based statistics and can process mutations from duplex sequencing studies, enabling the analysis of both cancer and normal tissue datasets across potentially any organism.
Key Updates and Features
Packaging and Linting
- Added Python package build using uv.
- Published the package to PyPI, enabling installation via pip install oncodrive3d.
- Updated the Dockerfile.
- Applied code linting to improve code quality and maintainability.
- Added LICENCE.
Nextflow Pipeline Updates
- Restructured the pipeline according to best practices for enhanced performance and maintainability, and moved it to oncodrive3d_pipeline/.
Documentation Updates
- Updated the README file:
  - Added instructions for installation.
  - Added instructions for running the provided Nextflow pipeline.
Bug Fixes, Refactoring, and Others
- Removed preprocessing scripts in build/preprocessing.
- Updated URLs in scripts/datasets/seq_for_mut_prob.py and scripts/plotting/pfam.py to use the January 2024 Ensembl archive.
- Changed output column from Cluster to Clump in the residue-level output (<cohort>.3d_clustering_pos.csv).
- Changed oncodrive3d run input argument from input_maf_path to input_path in scripts/main.py.
- Refactored scripts/datasets/utils.py to improve download functionality and logging.
Pre-release v1.0.2-rc
This is the second pre-release of Oncodrive3D, a fast and accurate 3D-clustering algorithm for driver gene discovery. It identifies mutation-enriched volumes by analyzing missense somatic mutations, leveraging AlphaFold's structural predictions to define residue contacts and mutation profiles to simulate neutral mutagenesis. The tool uses rank-based statistics and can process mutations from duplex sequencing studies, enabling the analysis of both cancer and normal tissue datasets across potentially any organism.
Key Updates and Features
New Modules for Annotation and Plotting:
- Introduced a comprehensive plotting module, including summary plots, gene plots, comparative plots, association plots, and ChimeraX plots.
Nextflow Pipeline:
- Added a minimal Nextflow pipeline to perform 3D clustering analysis across multiple cohorts and generate all relevant plots.
MANE Transcripts Support:
- Built datasets prioritizing MANE AF-predicted structures.
- Tracked transcript IDs from input data, including mismatch, match, or missing status compared to Oncodrive3D datasets.
Mutation Filtering:
- Filtered mutations with wild-type (WT) structure-AA mismatches and genes exceeding a threshold ratio of mapping issues.
- Added an option to disable WT AA mismatch filtering, particularly useful for mouse data where VEP and Uniprot isoform inconsistencies occur.
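The two filtering rules above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the record fields (`gene`, `wt_input`, `wt_structure`), the function name, and the default threshold are all hypothetical.

```python
from collections import defaultdict


def filter_mutations(mutations, max_mismatch_ratio=0.5, ignore_mismatch=False):
    """Drop WT-mismatching mutations and genes with too many mismatches.

    mutations: list of dicts with hypothetical keys 'gene', 'wt_input'
    (wild-type residue reported in the input) and 'wt_structure'
    (wild-type residue found in the structure sequence).
    """
    if ignore_mismatch:
        # Mirrors the option to disable WT AA mismatch filtering.
        return list(mutations)

    total = defaultdict(int)
    mismatched = defaultdict(int)
    for mut in mutations:
        total[mut["gene"]] += 1
        if mut["wt_input"] != mut["wt_structure"]:
            mismatched[mut["gene"]] += 1

    # Genes whose share of mismatching mutations exceeds the threshold.
    dropped_genes = {
        gene for gene in total
        if mismatched[gene] / total[gene] > max_mismatch_ratio
    }
    return [
        mut for mut in mutations
        if mut["gene"] not in dropped_genes
        and mut["wt_input"] == mut["wt_structure"]
    ]


# Hypothetical example: gene "B" has only mismatches and is dropped entirely.
example = [
    {"gene": "A", "wt_input": "R", "wt_structure": "R"},
    {"gene": "A", "wt_input": "R", "wt_structure": "R"},
    {"gene": "A", "wt_input": "G", "wt_structure": "G"},
    {"gene": "A", "wt_input": "L", "wt_structure": "V"},
    {"gene": "B", "wt_input": "K", "wt_structure": "E"},
    {"gene": "B", "wt_input": "P", "wt_structure": "S"},
]
kept = filter_mutations(example)
kept_all = filter_mutations(example, ignore_mismatch=True)
```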
Direct VEP Output Support:
- Enabled direct VEP output processing, allowing filtering of transcripts based on Oncodrive3D-built datasets.
Enhanced Outputs:
- Included processed input mutations (<cohort>.mutations.processed.tsv), missense mutation probabilities (<cohort>.miss_prob.processed.tsv), and Oncodrive3D sequence dataframes (<cohort>.seq_df.processed.tsv).
Mouse Data Support:
- Fully enabled and tested processing of mouse data (mm39) across all steps, including dataset building, annotations, and plotting.
Bug Fixes and Improvements:
- Resolved bug affecting the identification of the most significant volume per gene.
- Changed sorting of position-level results from rank-based (Gene, Rank) to significance-based (Gene, p-value, Score).
- Refactored main.py, offloading unnecessary code to module-specific scripts for better organization.
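The change in sort order for position-level results can be illustrated with a small sketch. The column names (`Gene`, `pval`, `Score`) and the choice of ascending p-value with descending score are assumptions made for the example; the release note only names the sort keys.

```python
# Hypothetical position-level rows; the real output has more columns.
rows = [
    {"Gene": "TP53", "Pos": 248, "pval": 0.010, "Score": 5.1},
    {"Gene": "KRAS", "Pos": 12, "pval": 0.001, "Score": 9.3},
    {"Gene": "TP53", "Pos": 175, "pval": 0.001, "Score": 8.2},
    {"Gene": "TP53", "Pos": 273, "pval": 0.001, "Score": 8.9},
]

# Significance-based order: by gene, then smallest p-value first,
# then (assumed) highest score first to break ties.
rows_sorted = sorted(rows, key=lambda r: (r["Gene"], r["pval"], -r["Score"]))

positions = [(r["Gene"], r["Pos"]) for r in rows_sorted]
```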
Example usage
To run the examples provided, the <input_path> directory should be organized as follows:
<input_path>/
├── vep/
│   ├── <cohort_1>.vep.tsv.gz
│   └── <cohort_2>.vep.tsv.gz
└── mut_profile/
    ├── <cohort_1>.sig.json
    └── <cohort_2>.sig.json
- vep/: Contains the VEP output files for each cohort, compressed as .tsv.gz.
- mut_profile/: Contains the Bgsignature output files (mutation profile in trinucleotide context) for each cohort, saved as .sig.json.
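Before launching the pipeline it can help to check that every cohort has both files. The sketch below pairs files by cohort name using the suffixes described above. The helper name `list_cohorts` is hypothetical, and the demo builds a throwaway directory only to show the expected layout.

```python
import tempfile
from pathlib import Path

VEP_SUFFIX = ".vep.tsv.gz"
PROFILE_SUFFIX = ".sig.json"


def list_cohorts(input_path):
    """Map each cohort name to its (VEP file, mutation profile or None)."""
    input_path = Path(input_path)
    vep = {
        p.name[: -len(VEP_SUFFIX)]: p
        for p in (input_path / "vep").glob("*" + VEP_SUFFIX)
    }
    profiles = {
        p.name[: -len(PROFILE_SUFFIX)]: p
        for p in (input_path / "mut_profile").glob("*" + PROFILE_SUFFIX)
    }
    # Cohorts are driven by the VEP files; a missing profile is reported as None.
    return {name: (vep[name], profiles.get(name)) for name in sorted(vep)}


# Demo on a temporary directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "vep").mkdir()
    (root / "mut_profile").mkdir()
    (root / "vep" / "cohort_1.vep.tsv.gz").touch()
    (root / "vep" / "cohort_2.vep.tsv.gz").touch()
    (root / "mut_profile" / "cohort_1.sig.json").touch()
    cohorts = list_cohorts(root)
    cohort_names = list(cohorts)
    missing_profiles = [name for name, (_, prof) in cohorts.items() if prof is None]
```

In this demo `cohort_2` has no mutation profile, so it appears in `missing_profiles`.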
Human MANE
build_datasets -o <datasets_path> --mane
build_annotations -o <annotations_path> -d <datasets_path>
nextflow run main.nf --indir <input_path> --outdir <output_path> --data_dir <datasets_path> --annotations_dir <annotations_path> --vep_input true --verbose true --plot true --chimerax_plot true --mane true --seed 64 -profile container
Mouse
build_datasets -o <datasets_path> --organism mouse
build_annotations -o <annotations_path> -d <datasets_path> --organism mouse
nextflow run main.nf --indir <input_path> --outdir <output_path> --data_dir <datasets_path> --annotations_dir <annotations_path> --ignore_mapping_issues true --plot true --chimerax_plot true --vep_input true -profile container
Pre-release v1.0.1-rc
This is the first pre-release of Oncodrive3D, a fast and accurate novel 3D-clustering algorithm for driver gene discovery. The approach analyses patterns of observed missense somatic mutations (in cancer or normal tissue) to identify volumes with a higher frequency of mutations than expected under neutral mutagenesis. Oncodrive3D leverages AlphaFold's structure predictions and Predicted Aligned Error (PAE) to construct contact probability maps. If provided, it uses the mutation profile of the cohort to simulate neutral mutagenesis, and it employs rank-based statistics to determine empirical p-values for the volumes of each mutated residue. It can also combine the mutation profile with sequencing depth information when these are provided as a mutability file. This allows the tool to process mutations obtained from duplex sequencing studies, which are commonly used in normal tissue sequencing at the time of this release.
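To make the idea of an empirical p-value concrete, here is a generic sketch that compares one observed score with scores obtained from simulations of neutral mutagenesis. It uses the standard add-one estimator and is not taken from the Oncodrive3D code; the function name and the numbers are hypothetical, and the rank-based procedure used by the tool may differ.

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated scores at least as extreme as the observed one.

    The +1 in numerator and denominator is the usual correction that
    avoids reporting a p-value of exactly zero.
    """
    n_extreme = sum(1 for score in simulated if score >= observed)
    return (n_extreme + 1) / (len(simulated) + 1)


# Hypothetical scores for one volume across nine neutral simulations.
simulated_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
p_value = empirical_p_value(8.5, simulated_scores)
```

With more simulations the smallest attainable p-value shrinks, which is why the number of simulations bounds the resolution of the test.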
Input
- input.maf (required): Mutation Annotation Format (MAF) file annotated with consequences (e.g., by using Ensembl Variant Effect Predictor (VEP)).
- mut_profile.json (optional): Dictionary including the normalized frequencies of mutations (values) in every possible trinucleotide context (keys), such as 'ACA>A', 'ACC>A', and so on.
- mut_config.json (optional): Dictionary including the path and parsing information for the mutability file, which includes information about mutation profile integrated with sequencing depth.
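A quick sanity check of a mutation profile dictionary might look like the sketch below. It assumes that keys follow the 'ACA>A' pattern shown above (reference trinucleotide, then the alternate base for the central position) and that "normalized" means the values sum to 1. Both are assumptions, and the function name is hypothetical.

```python
import math
import re

# Three reference bases, '>', one alternate base.
KEY_PATTERN = re.compile(r"^[ACGT]{3}>[ACGT]$")


def check_mut_profile(profile, tol=1e-6):
    """Return a list of human-readable problems found in the profile."""
    problems = []
    for key in profile:
        if not KEY_PATTERN.match(key):
            problems.append(f"unexpected key format: {key}")
        elif key[1] == key[4]:
            # The alternate base should differ from the central reference base.
            problems.append(f"alternate equals reference: {key}")
    total = sum(profile.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        problems.append(f"frequencies sum to {total}, expected 1")
    return problems


# Hypothetical toy profiles (a real profile covers all contexts).
ok_profile = {"ACA>A": 0.5, "ACC>A": 0.5}
bad_profile = {"ACA>C": 0.7, "AC>A": 0.1}
ok_problems = check_mut_profile(ok_profile)
bad_problems = check_mut_profile(bad_profile)
```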
Output
- cohort_filename.3d_clustering_genes.csv: This is a Comma-Separated Values (CSV) file containing the results of the analysis at the gene level.
- cohort_filename.3d_clustering_pos.csv: This is a Comma-Separated Values (CSV) file containing the results of the analysis at the level of mutated positions.