Structural variant calling from single-cell Strand-seq data - summarized in a Snakemake pipeline.
For Info on Strand-seq see
- Falconer E et al., 2012 (doi: 10.1038/nmeth.2206)
- Sanders AD et al., 2017 (doi: 10.1038/nprot.2017.029)
This workflow uses Snakemake to execute all steps of MosaiCatcher in order. The starting point are single-cell BAM files from Strand-seq experiments and the final output are SV predictions in a tabular format as well as in a graphical representation. To get to this point, the workflow goes through the following steps:
- Read binning in fixed-width genomic windows of 50kb or 100kb via mosaicatcher
- Normalization of coverage in respect to a reference sample (included)
- Strand state detection (mosaicatcher)
- Haplotype resolution via StrandPhaseR
- Multi-variate segmentation of cells (mosaicatcher)
- Bayesian classification of segmentation to find SVs using mosaiClassifier (included)
- Visualization of results using custom R plots
While there are different options for the installation/execution (see below), the following steps are shared by all of them:
-
Download this pipeline.
git clone https://github.com/friendsofstrandseq/pipeline
-
Configuration. In the config file
Snake.config.json
, specify the reference genome, the chromosomes you would like to analyse. -
Add your data. Create a subdirectory
bam/sampleName/
and place the single-cell BAM files (one file per cell) in there:cd pipeline mkdir -p bam/NA12878/all mkdir -p bam/NA12878/selected cp /path/to/my/all_cells/*.bam bam/NA12878/all cp /path/to/my/all_cells/*.bam.bai bam/NA12878/all cp /path/to/my/selected_cells/*.bam bam/NA12878/selected cp /path/to/my/selected_cells/*.bam.bai bam/NA12878/selected
-
BAM requirements. Make sure that all BAM files belonging to the same sample contain the same read group sample tag (
@RG SM:thisIsTheSampleName
), that they are indexed and have PCR duplicates marked (the latter is especially relevant for single-cell data). -
SNP call set. If available, specify SNV calls (VCF) in
Snake.config.json
. Note that the sample name in the VCF must match the one in the BAM files.
Note: Multiple samples can be run simultaneously. Just create different subfolders
below bam/
. The same settings from the Snake.config.json
config files are
applied to all samples.
A Snakemake version of at least 4.8.0 is required for Singularity support. When only an old Snakemake version is available, remove the
singularity
line inSnakefile
and go for option 2 or 3.
We provide a Docker image of this pipeline, which can be used in Snakemake together with Singularity. This image contains all software (but no data) required to run MosaiCatcher.
-
Singularity required. We tested this with version 2.5.1.
-
Provide SNP call set (optional). External VCF files (if available) should be copied into a subfolder of the pipeline, which can be read from within the image. Accordingly, you need to specify a relative path in
Snake.config-singularity.json
. -
Run Snakemake with
--use-singularity
option. The software inside the Singularity image need to access external data, such as the reference genome. These are specified in a separate config file.We also stripped off the content of the R package BSgenome.Hsapiens.UCSC.hg38 (find it in your local R installation), which need to be made available inside the image by binding these files during execution. This is how the command looks like:
# paths on the host system REF="~/data/refGenomes/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna" R_REF="~/R-lib/3.4.0/BSgenome.Hsapiens.UCSC.hg38/extdata/single_sequences.2bit" snakemake \ --use-singularity \ --singularity-args \ "-B ${REF}:/reference.fa:ro \ -B ${REF}.fai:/reference.fa.fai:ro \ -B ${R_REF}:/usr/local/lib/R/site-library/BSgenome.Hsapiens.UCSC.hg38/extdata/single_sequences.2bit:ro" \ --configfile Snake.config-singularity.json
Note: Currently only hg38 is supported within the singularity inmage.
To install the correct environment, you can use Bioconda.
-
Install MiniConda: In case you do not have Conda yet, it is easiest to just install MiniConda.
-
Create environment:
conda env create -n strandseqnation -f conda-environment.yml source activate strandseqnation
-
Install mosaicatcher and update the file paths pointing to it (and to several R scripts) in
Snake.config.json
. -
Run
snakemake
-
Install required software:
-
Install mosaicatcher (currently you will need the
develop
branch) -
Install BSgenome.Hsapiens.UCSC.hg38 from Bioconductor:
source("https://bioconductor.org/biocLite.R") biocLite('BSgenome.Hsapiens.UCSC.hg38')
-
Install Strand-Phaser. This is no longer installed automatically
-
Install other required R packages
-
-
Set up the configuration of the snakemake pipeline
- Open
Snake.config.json
and specify the path to the executatables (such as Mosaicatcher) and to the R scripts.
- Open
-
Run
snakemake
You can ask Snakemake to submit your jobs to a HPC cluster. We provided a config
file (cluster.json
) for this purpose, yet it might need to be adapted to your
infrastructure. Here is an example command:
snakemake -j 100 \
--cluster-config Snake.cluster.json \
--cluster "sbatch --cpus-per-task {cluster.n} --time {cluster.time} --mem {cluster.mem}"
Further, it is often advisable to increase the time Snakemake waits for the file system via this flag:
--latency-wait 60
In the HPC system this was tested (based on SLURM), Snakemake sometimes does not
recognize if a job was killed on the cluster and hangs up waiting for it to finish.
To overcome this, we provide a small script called cluster_status.py
which can
be passed to Snakemake as shown below. Note that this script might need to be adapted.
--cluster-status cluster_status.py
Finally, of course the cluster mode can be combined with --use-singularity
.
The pipeline will run simple SNV calling using samtools and bcftools on Strand-seq. If you already have
SNV calls, you can avoid that by entering your VCF files into the pipeline.
To so, make sure the files are tabix-indexed
and specifigy them inside the Snake.config.json
file:
"snv_calls" : {
"NA12878" : "path/to/snp/calls.vcf.gz"
},
The provided environment conda-environment.yml
is the result of running:
conda create -y -n strandseqnation snakemake=5.2 r-ggplot2=2.2 bcftools bioconductor-biobase bioconductor-biocgenerics bioconductor-biocinstaller bioconductor-biocparallel bioconductor-biostrings bioconductor-bsgenome bioconductor-bsgenome.hsapiens.ucsc.hg38 bioconductor-delayedarray bioconductor-fastseg bioconductor-genomeinfodb bioconductor-genomeinfodbdata bioconductor-genomicalignments bioconductor-genomicranges bioconductor-iranges bioconductor-rsamtools bioconductor-rtracklayer bioconductor-s4vectors bioconductor-summarizedexperiment bioconductor-xvector bioconductor-zlibbioc filechunkio ftputil htslib pysftp r-data.table r-peer samtools urllib3 aioeasywebdav aiohttp appdirs asn1crypto async-timeout boost boost-cpp bzip2 ca-certificates cairo certifi cffi chardet cmake configargparse cryptography curl docutils dropbox expat fontconfig freetype gettext glib graphite2 gsl harfbuzz icu idna jpeg krb5 libffi libiconv libpng libssh2 libtiff libuv libxml2 multidict ncurses openssl pandas pango paramiko pcre pip pixman psutil pyasn1 pycparser pynacl python python-dateutil pytz pyyaml r-assertthat r-base r-bh r-bitops r-colorspace r-cowplot r-curl r-devtools r-dichromat r-digest r-futile.logger r-futile.options r-git2r r-gridextra r-gtable r-hexbin r-httr r-jsonlite r-labeling r-lambda.r r-lattice r-lazyeval r-magrittr r-mass r-matrix r-matrixstats r-memoise r-mime r-munsell r-openssl r-plyr r-r6 r-rcolorbrewer r-rcpp r-rcurl r-reshape2 r-rlang r-rstudioapi r-scales r-snow r-stringi r-stringr r-tibble r-viridislite r-whisker r-withr r-xml ratelimiter readline requests rhash setuptools six sqlite tk wheel wrapt xz yaml yarl zlib bcrypt intel-openmp libgcc libgcc-ng libgfortran-ng libstdcxx-ng mkl numpy wget r-dplyr scipy whatshap freebayes r-mc2d
conda env export > conda-environment.yml
This avoids some annoying version downgrades that sometimes happen when creating an environment incrementally.