Skip to content

GAP tutorial for containerised Yang and Smith paralogy resolution pipeline

License

Notifications You must be signed in to change notification settings

LarsNauheimer/Yang-and-Smith-paralogy-resolution-tutorial

 
 

Repository files navigation

HybSeq Paralogy Resolution Tutorial

Original orthology inference manuscript, documentation and scripts

This pipeline makes use of the paralogy resolution (also described as orthology inference) approaches described and implemented by Yang and Smith 2014 here. These approaches have since been adapted for target capture datasets as described in the bioRxiv manuscript here. The original documentation and scripts can be found here.

Yang_and_Smith-RBGV: containerised and pipelined using Singularity and Nextflow

To simplify running the Yang and Smith paralogy resolution methods on target capture data, I’ve provided a Singularity container containing the Linux distribution Ubuntu 18.04, containing the original scripts required to run the pipeline (including modifications, additions and bug fixes, see below for details), as well additonal new scripts and all the dependencies (IQTree, Clustal Omega, mafft, BioPython, HMMCleaner, trimal). The container is called yang_and_smith.sif.

To run the paralogy resolution pipeline using this container, I’ve provided a Nextflow script that uses the software in the Singularity container. This pipeline runs all steps with a single command. The pipeline script is called yang-and-smith-rbgv-pipeline.nf. It comes with an associated config file called yang-and-smith-rbgv.config. The only input required is a folder containing .fasta files for each of your target-capture loci, including paralogs, and a .fasta file containing outgroup sequences (used by some of the paralogy resolution methods, see below). The number of parallel processes running at any time, as well as computing resources given to each process (e.g. number of CPUs, amount of RAM etc) can be configured by the user by modifying the provided config file. The pipeline can be run directly on your local computer, and on an HPC system submitting jobs via a scheduler (e.g. SLURM, PBS, etc).

Input data

Paralog sequences

If you have used the Nextflow pipeline hybpiper-rbgv-pipeline.nf to run HybPiper, as described here, your main results folder will contain the following subfolders:

  • 11_paralogs
  • 12_paralogs_noChimeras

See the HybPiper-RBGV wiki entry Output folders and files for a full description of the files in these output folders. Briefly, folder 11_paralogs contains a fasta file for each gene in your HybPiper target file. Each fasta file contains the 'main' contig selected by HybPiper for each sample. Where HybPiper has detected putative paralog contigs, these sequences are also included; in such cases, the main contig has the fasta header suffix .main, whereas putative paralogs have the suffix .0, .1 etc. Folder 12_paralogs_noChimeras contains the same data, except putative chimeric contigs (see here for an explanation) have been removed.

Tutorial step 1:

Copy folder `11_paralogs` into your current working directory (i.e. the directory containing `yang-and-smith-rbgv-pipeline.nf` and `yang-and-smith-rbgv.config`.

Outgroup sequences

Some of the paralogy resolution methods used in this pipeline require an outgroup sequence for each of your genes. These outgroup sequences can be provided in two ways.

  1. Designating one or more taxa in your HybPiper paralog files as outgroups, via the --internal_outgroups <taxon1,taxon2,taxon3...> option. For example, if your paralog fasta files contain sequences from the taxa 79686 and 79689, you could designate these sequences as outgroups via --internal_outgroups 79686,79689.

  2. Providing a fasta file (e.g. outgroups.fasta) containing 'external' outgroup sequences via the option --external_outgroups_file outgroups.fasta. The sequences in the file should have the same fasta header formatting and gene names as your HybPiper target file. For example, if you have used the Angiosperms353 target file for you HybPiper analysis, and you wish to use sequences from Sesame as your outgroup, your outgroups.fasta file might contain the following:

    >sesame-6995
    gtgggatatgaacaaaatccattgagcttgtattactgtta...
    >sesame-4757
    ctggtgcgtcgagcacttctcttgagcagcaacaatggcgg...
    >sesame-6933
    gaagtagatgctgtggtggtggaagcattcgacatatgcac...
    
    ...etc
    

Again, note that the gene identifier following the dash in the fasta headers (e.g. '6995' for header '>sesame-6995') needs to correspond to a gene identifier in your target file.

It's fine if your outgroups.fasta file contains additional sequences. When running the pipeline (see below) you can optionally provide one or more taxon names using the parameter --external_outgroups <taxon1,taxon2,taxon3...>, e.g. --outgroups sesame, and only these taxa will be included as outgroups. If this option isn't provided, all taxa/sequences in the outgroups.fasta file will be used. You can provide more than one outgroup taxon name using a comma-separated list, e.g. --external_outgroups sesame,taxon2,taxon3 etc.

NOTE: at a minimum, you must provide either 'internal' ingroups via the --internal_outgroups <taxon1,taxon2,taxon3...> option, or a file of 'external' outgroup sequences via the --external_outgroups_file outgroups.fasta option.

Tutorial step 2:

Copy the fasta file containing outgroup sequences to your current working directory.

Running the pipeline on Linux

Please see the Wiki entry Running on Linux.

Running the pipeline on a Mac (macOS)

Please see the Wiki entry Running on a Mac. #CJJ to update for yang and smith

Running the pipeline on a PC (Windows)

Please see the Wiki entry Running on a PC. #CJJ to update for yang and smith

Nextflow pipeline options and parameters

See section Pipeline parameters and options for a full explanation of available parameters and flags. The required parameters are:

--hybpiper_paralogs_directory <directory>    Path to folder containing HybPiper paralog fasta files)

...and either

--internal_outgroups <taxon1,taxon2,taxon3...>
                                             A comma-separated list of taxa present in the paralog fasta files
                                             to use as outgroups. Default is none

...and/or

--external_outgroups_file <file>             File containing fasta sequences of outgroup sequences for each
                                             gene

Tutorial step 3:

Run the pipeline using the command:

nextflow run yang-and-smith-rbgv-pipeline.nf -c yang-and-smith-rbgv.config -profile slurm -resume --hybpiper_paralogs_directory 11_paralogs --external_outgroups_file outgroups.fasta --outgroups sesame --internal_outgroups taxon1,taxon2,taxon3>

Output folders and files

After running the pipeline, output can be found in the folder results (unless you have changed the name of the default output folder using the --outdir <name> parameter. This will consist of 23 subfolders. If you're just after the aligned .fasta files for each of your target genes as output by each of the paralogy resolution methods, the three main output folders of interest are probably:

  • 18_alignments_stripped_names_MO_realigned
  • 20_alignments_stripped_names_RT_realigned
  • 22_alignments_stripped_names_MI_realigned

For a full explanation of output folders and files, please see the Wiki entry Output folders and files.

Post-pipeline analyses

e.g.

  • Concatenated IQtree
  • Astral

etc.

General information

For details on adapting the pipeline to run on local and HPC computing resources, see here.

Bug fixes and changes (WIP)

Please see the Wiki entry [Bug fixes and changes][].

Issues still to deal with (WIP)

Please see the Wiki entry [Issues][].

Pipeline parameters and options

Mandatory arguments:

  --hybpiper_paralogs_directory <directory>    Path to folder containing HybPiper paralog fasta files.

Optional arguments:

  --internal_outgroups <taxon1,taxon2,taxon3...>  A comma-separated list of taxa present in the paralog fasta files
                                                  to use as outgroups. Default is none
  --external_outgroups_file <file>                File containing fasta sequences of outgroup sequences for each
                                                  gene
  --external_outgroups <taxon1,taxon2,taxon3...>  A comma-separated list of outgroup taxa to add from the
                                                  outgroups_file. Default is all
  -profile <profile>                              Configuration profile to use. Can use multiple (comma separated)
                                                  Available: standard (default), slurm
  --output_dir                                    Specify the name of the main output results folder.
                                                  Default is 'results'
  --pool <int>                                    Number of threads for the Python multiprocessing pool. Used in
                                                  e.g. alignments and tree-building steps. Default is 1, so
                                                  e.g. one alignment will be run at a time during alignment steps.
  --threads <int>                                 Number of threads per multiprocessing pool instance. Used for
                                                  programs that support multi-threading (e.g. mafft, IQTree).
                                                  Default is 1
  --no_supercontigs                               Use this flag if you are processing paralogs from a run of
                                                  HybPiper that used the --nosupercontigs flag. Mafft alignments
                                                  are re-aligned using clustal omega, which can do a better job in
                                                  these cases. Default is off
  --process_04_trim_tips_relative_cutoff <float>  When pruning long tips during the tree QC stage, provide a branch
                                                  length for the maximum imbalance between sister tips allowed.
                                                  Default is 0.2
  --process_04_trim_tips_absolute_cutoff <float>  When pruning long tips during the tree QC stage, provide a branch
                                                  length for the maximum allowed tip branch length. Default is 0.4.
  --process_06_branch_length_cutoff <float>       When pruning long internal branches (putative deep paralogs)
                                                  during the tree QC stage, provide a branch length for the maximum
                                                  allowed internal branch length. Default is 0.3
  --process_06_minimum_taxa <int>                 After the final tree-pruning step prior to paralogy resolution,
                                                  only retain trees with a minimum number of taxa remaining.
                                                  Default is 3
  --process_09_prune_paralog_MO_minimum_taxa <int>
                                                  For the MO method, only process trees with a minimum number of
                                                  taxa. Default is 2
  --process_10_prune_paralogs_RT_minimum_ingroup_taxa <int>
                                                  For the RT method, only process trees with a minumum number of
                                                  ingroup taxa. Default is 2
  --process_11_prune_paralogs_MI_relative_tip_cutoff <float>
                                                  Default is 0.2
  --process_11_prune_paralogs_MI_absolute_tip_cutoff <float>
                                                  Default is 0.4
  --process_11_prune_paralogs_MI_minimum_taxa <int>    
                                                  Default is 2

Please see the Wiki entry Additional pipeline features and details for further explanation of the parameters above, and general pipeline functionality.

General notes

e.g.

  • Manually reviewing trees from a preliminary run to select appropriate cut-off values for tree pruning
  • Caveats of using these paralogy resolution approaches - relying largely on the fidelity of single-gene trees. Some loci will be better than others (i.e. short, low phylogenetic signal)

etc.

About

GAP tutorial for containerised Yang and Smith paralogy resolution pipeline

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Nextflow 100.0%