This pipeline makes use of the paralogy resolution (also described as orthology inference) approaches described and implemented by Yang and Smith 2014 here. These approaches have since been adapted for target capture datasets as described in the bioRxiv manuscript here. The original documentation and scripts can be found here.
To simplify running the Yang and Smith paralogy resolution methods on target capture data, I’ve provided a Singularity container containing the Linux distribution Ubuntu 18.04, containing the original scripts required to run the pipeline (including modifications, additions and bug fixes, see below for details), as well additonal new scripts and all the dependencies (IQTree, Clustal Omega, mafft, BioPython, HMMCleaner, trimal). The container is called yang_and_smith.sif
.
To run the paralogy resolution pipeline using this container, I’ve provided a Nextflow script that uses the software in the Singularity container. This pipeline runs all steps with a single command. The pipeline script is called yang-and-smith-rbgv-pipeline.nf
. It comes with an associated config file called yang-and-smith-rbgv.config
. The only input required is a folder containing .fasta
files for each of your target-capture loci, including paralogs, and a .fasta
file containing outgroup sequences (used by some of the paralogy resolution methods, see below). The number of parallel processes running at any time, as well as computing resources given to each process (e.g. number of CPUs, amount of RAM etc) can be configured by the user by modifying the provided config file. The pipeline can be run directly on your local computer, and on an HPC system submitting jobs via a scheduler (e.g. SLURM, PBS, etc).
If you have used the Nextflow pipeline hybpiper-rbgv-pipeline.nf
to run HybPiper, as described here, your main results
folder will contain the following subfolders:
11_paralogs
12_paralogs_noChimeras
See the HybPiper-RBGV wiki entry Output folders and files for a full description of the files in these output folders. Briefly, folder 11_paralogs
contains a fasta file for each gene in your HybPiper target file. Each fasta file contains the 'main' contig selected by HybPiper for each sample. Where HybPiper has detected putative paralog contigs, these sequences are also included; in such cases, the main contig has the fasta header suffix .main
, whereas putative paralogs have the suffix .0
, .1
etc. Folder 12_paralogs_noChimeras
contains the same data, except putative chimeric contigs (see here for an explanation) have been removed.
Tutorial step 1:
Copy folder `11_paralogs` into your current working directory (i.e. the directory containing `yang-and-smith-rbgv-pipeline.nf` and `yang-and-smith-rbgv.config`.
Some of the paralogy resolution methods used in this pipeline require an outgroup sequence for each of your genes. These outgroup sequences can be provided in two ways.
-
Designating one or more taxa in your HybPiper paralog files as outgroups, via the
--internal_outgroups <taxon1,taxon2,taxon3...>
option. For example, if your paralogfasta
files contain sequences from the taxa79686
and79689
, you could designate these sequences as outgroups via--internal_outgroups 79686,79689
. -
Providing a fasta file (e.g.
outgroups.fasta
) containing 'external' outgroup sequences via the option--external_outgroups_file outgroups.fasta
. The sequences in the file should have the same fasta header formatting and gene names as your HybPiper target file. For example, if you have used the Angiosperms353 target file for you HybPiper analysis, and you wish to use sequences from Sesame as your outgroup, youroutgroups.fasta
file might contain the following:>sesame-6995 gtgggatatgaacaaaatccattgagcttgtattactgtta... >sesame-4757 ctggtgcgtcgagcacttctcttgagcagcaacaatggcgg... >sesame-6933 gaagtagatgctgtggtggtggaagcattcgacatatgcac... ...etc
Again, note that the gene identifier following the dash in the fasta headers (e.g. '6995' for header '>sesame-6995') needs to correspond to a gene identifier in your target file.
It's fine if your outgroups.fasta
file contains additional sequences. When running the pipeline (see below) you can optionally provide one or more taxon names using the parameter --external_outgroups <taxon1,taxon2,taxon3...>
, e.g. --outgroups sesame
, and only these taxa will be included as outgroups. If this option isn't provided, all taxa/sequences in the outgroups.fasta
file will be used. You can provide more than one outgroup taxon name using a comma-separated list, e.g. --external_outgroups sesame,taxon2,taxon3
etc.
NOTE: at a minimum, you must provide either 'internal' ingroups via the --internal_outgroups <taxon1,taxon2,taxon3...>
option, or a file of 'external' outgroup sequences via the --external_outgroups_file outgroups.fasta
option.
Tutorial step 2:
Copy the fasta file containing outgroup sequences to your current working directory.
Please see the Wiki entry Running on Linux.
Please see the Wiki entry Running on a Mac. #CJJ to update for yang and smith
Please see the Wiki entry Running on a PC. #CJJ to update for yang and smith
See section Pipeline parameters and options for a full explanation of available parameters and flags. The required parameters are:
--hybpiper_paralogs_directory <directory> Path to folder containing HybPiper paralog fasta files)
...and either
--internal_outgroups <taxon1,taxon2,taxon3...>
A comma-separated list of taxa present in the paralog fasta files
to use as outgroups. Default is none
...and/or
--external_outgroups_file <file> File containing fasta sequences of outgroup sequences for each
gene
Tutorial step 3:
Run the pipeline using the command:
nextflow run yang-and-smith-rbgv-pipeline.nf -c yang-and-smith-rbgv.config -profile slurm -resume --hybpiper_paralogs_directory 11_paralogs --external_outgroups_file outgroups.fasta --outgroups sesame --internal_outgroups taxon1,taxon2,taxon3>
After running the pipeline, output can be found in the folder results
(unless you have changed the name of the default output folder using the --outdir <name>
parameter. This will consist of 23 subfolders. If you're just after the aligned .fasta files for each of your target genes as output by each of the paralogy resolution methods, the three main output folders of interest are probably:
18_alignments_stripped_names_MO_realigned
20_alignments_stripped_names_RT_realigned
22_alignments_stripped_names_MI_realigned
For a full explanation of output folders and files, please see the Wiki entry Output folders and files.
e.g.
- Concatenated IQtree
- Astral
etc.
For details on adapting the pipeline to run on local and HPC computing resources, see here.
Please see the Wiki entry [Bug fixes and changes][].
Please see the Wiki entry [Issues][].
Mandatory arguments:
--hybpiper_paralogs_directory <directory> Path to folder containing HybPiper paralog fasta files.
Optional arguments:
--internal_outgroups <taxon1,taxon2,taxon3...> A comma-separated list of taxa present in the paralog fasta files
to use as outgroups. Default is none
--external_outgroups_file <file> File containing fasta sequences of outgroup sequences for each
gene
--external_outgroups <taxon1,taxon2,taxon3...> A comma-separated list of outgroup taxa to add from the
outgroups_file. Default is all
-profile <profile> Configuration profile to use. Can use multiple (comma separated)
Available: standard (default), slurm
--output_dir Specify the name of the main output results folder.
Default is 'results'
--pool <int> Number of threads for the Python multiprocessing pool. Used in
e.g. alignments and tree-building steps. Default is 1, so
e.g. one alignment will be run at a time during alignment steps.
--threads <int> Number of threads per multiprocessing pool instance. Used for
programs that support multi-threading (e.g. mafft, IQTree).
Default is 1
--no_supercontigs Use this flag if you are processing paralogs from a run of
HybPiper that used the --nosupercontigs flag. Mafft alignments
are re-aligned using clustal omega, which can do a better job in
these cases. Default is off
--process_04_trim_tips_relative_cutoff <float> When pruning long tips during the tree QC stage, provide a branch
length for the maximum imbalance between sister tips allowed.
Default is 0.2
--process_04_trim_tips_absolute_cutoff <float> When pruning long tips during the tree QC stage, provide a branch
length for the maximum allowed tip branch length. Default is 0.4.
--process_06_branch_length_cutoff <float> When pruning long internal branches (putative deep paralogs)
during the tree QC stage, provide a branch length for the maximum
allowed internal branch length. Default is 0.3
--process_06_minimum_taxa <int> After the final tree-pruning step prior to paralogy resolution,
only retain trees with a minimum number of taxa remaining.
Default is 3
--process_09_prune_paralog_MO_minimum_taxa <int>
For the MO method, only process trees with a minimum number of
taxa. Default is 2
--process_10_prune_paralogs_RT_minimum_ingroup_taxa <int>
For the RT method, only process trees with a minumum number of
ingroup taxa. Default is 2
--process_11_prune_paralogs_MI_relative_tip_cutoff <float>
Default is 0.2
--process_11_prune_paralogs_MI_absolute_tip_cutoff <float>
Default is 0.4
--process_11_prune_paralogs_MI_minimum_taxa <int>
Default is 2
Please see the Wiki entry Additional pipeline features and details for further explanation of the parameters above, and general pipeline functionality.
e.g.
- Manually reviewing trees from a preliminary run to select appropriate cut-off values for tree pruning
- Caveats of using these paralogy resolution approaches - relying largely on the fidelity of single-gene trees. Some loci will be better than others (i.e. short, low phylogenetic signal)
etc.