Sage is a proteomics search engine - a tool that transforms raw mass spectra from proteomics experiments into peptide identifications via database searching & spectral matching. But it's also more than just a search engine - Sage includes a variety of advanced features that make it a one-stop shop: retention time prediction, quantification (both isobaric & LFQ), peptide-spectrum match rescoring, and FDR control.
Sage was designed with cloud computing in mind - massively parallel processing and the ability to directly stream compressed mass spectrometry data to/from AWS S3 enable unprecedented search speeds with minimal cost. (Sage also runs just as well reading local files from your Mac/PC/Linux device.)
Let's not forget to mention that it is incredibly fast, sensitive, 100% free, and open source!
Check out the blog post introducing Sage for more information and full benchmarks!
- Incredible performance out of the box
- Effortlessly cross-platform (Linux/MacOS/Windows), effortlessly parallel (uses all of your CPU cores)
- Fragment indexing strategy allows for blazing fast narrow and open searches (> 500 Da precursor tolerance)
- MS3-TMT quantification (R-squared of 0.999 with Proteome Discoverer)
- Capable of searching for chimeric/co-fragmenting spectra
- Retention time prediction models fit to each LC/MS run
- PSM rescoring using built-in linear discriminant analysis (LDA)
- PEP calculation using a non-parametric model (KDE)
- FDR calculation using target-decoy competition and picked-peptide & picked-protein approaches
- Percolator/Mokapot compatible output
- Configuration by JSON file
- Built-in support for reading gzipped-mzML files
- Support for reading/writing directly from AWS S3
- Label-free quantification: consider all charge states & isotopologues a la FlashLFQ
- Boosts PSM identifications using prediction of retention times with a linear regression model fit to each LC/MS run
- Hand-rolled, 100% pure Rust implementations of Linear Discriminant Analysis and KDE-mixture models for refinement of false discovery rates
- Models demonstrate 1:1 results with scikit-learn, but have increased performance
- No need for a second post-search pipeline step
Sage is distributed as source code, and as a standalone executable file.
Sage can be installed from bioconda:
$ conda install -c bioconda -c conda-forge sage-proteomics
$ sage --help
- Install the Rust programming language compiler
- Download Sage source code via git:
git clone https://github.com/lazear/sage.git
or by zip file
- Compile:
cargo build --release
- Run:
./target/release/sage config.json
Once you have Rust installed, you can copy and paste the following lines into your terminal to complete the above instructions and run Sage on the example mzML provided in the repository (a single scan from PXD016766):
git clone https://github.com/lazear/sage.git
cd sage
cargo run --release tests/config.json
- Visit the Releases website.
- Download the correct pre-compiled binary for your operating system.
- Run:
sage <path/to/config.json>
Sage is capable of natively reading & writing files to AWS S3:
- S3 paths should be specified as `s3://bucket/prefix/key.mzML.gz` for input files, or `s3://bucket/prefix` for an output folder
- See AWS docs for configuring your credentials
- Using S3 may incur data transfer charges as well as multi-part upload request charges.
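As a sketch, S3 locations can be supplied wherever local paths are accepted; the parameter-file field names below (`mzml_paths`, `output_directory`) are assumptions based on the command-line options described later in this document:

```json
{
  "mzml_paths": ["s3://bucket/prefix/key.mzML.gz"],
  "output_directory": "s3://bucket/prefix"
}
```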
$ sage --help
Usage: sage [OPTIONS] <parameters> [mzml_paths]...
🔮 Sage 🧙 - Proteomics searching so fast it feels like magic!
Arguments:
<parameters> The search parameters as a JSON file.
[mzml_paths]... mzML files to analyze. Overrides mzML files listed in the parameter file.
Options:
-f, --fasta <fasta>
The FASTA protein database. Overrides the FASTA file specified in the parameter file.
-o, --output_directory <output_directory>
Where the search and quant results will be written. Overrides the directory specified in the parameter file.
--no-parallel
Turn off parallel file searching. Useful for memory constrained systems or large sets of files.
-h, --help
Print help information
-V, --version
Print version information
Sage is called from the command line and requires a path to a JSON-encoded parameter file as an argument (see below).
Example usage: sage config.json
Some options in the parameter file can be overridden using the command line interface. These are:
- The paths to the raw mzML data
- The path to the database (fasta file)
- The output directory
For example:
# Specify fasta and output dir:
sage -f proteins.fasta -o output_directory config.json
# Specify mzML files:
sage -f proteins.fasta config.json *.mzML
# Specify mzML file located in an S3 bucket
sage config.json s3://my-bucket/YYYY-MM-DD_expt_A_fraction_1.mzML.gz
Running Sage will produce several output files (located in either the current directory, or `output_directory` if that option is specified):
- A record of search parameters (`results.json`) will be created that details input/output paths and all search parameters used for the search
- MS2 search results will be stored as a tab-separated file (`results.sage.tsv`), which can be opened in Excel/Pandas/etc.
- MS3 search results will be stored as a tab-separated file (`quant.tsv`) if the `quant.tmt` option is used in the parameter file
- The majority of parameters are optional - only `database.fasta`, `precursor_tol`, and `fragment_tol` are required. Sage will try to use reasonable defaults for any parameters not supplied
- Tolerances are specified on the experimental m/z values. To perform a -100 to +500 Da open search (mass window applied to precursor), you would use `"da": [-500, 100]`
Using decoy sequences is critical to controlling the false discovery rate in proteomics experiments. Sage can use decoy sequences in the supplied FASTA file, or it can generate internal sequences. Sage reverses tryptic peptides (not proteins), so that the picked-peptide approach to FDR can be used.
If `database.generate_decoys` is set to true (or unspecified), then decoy sequences in the FASTA database matching `database.decoy_tag` will be ignored, and Sage will internally generate decoys. It is critical that you use the proper `decoy_tag` if you are using a FASTA database containing decoys and have internal decoy generation turned on - otherwise Sage will treat the supplied decoys as hits!
Internally generated decoys will have protein accessions matching "{decoy_tag}{accession}", e.g. if `decoy_tag` is "rev_" then a protein accession like "rev_sp|P01234|HUMAN" will be listed in the output file.
Sage can be used from a docker image!
$ docker pull ghcr.io/lazear/sage:master
$ docker run -it --rm -v ${PWD}:/data ghcr.io/lazear/sage:master sage -o /data /data/config.json
# The sage executable is located in /app/sage in the image
`-v ${PWD}:/data` mounts your current directory as `/data` in the docker image. Make sure all the paths in your command and configuration use the location in the image and not your local directory.