Adds CGA documentation

sing-group · Jun 22, 2022 · dc388b8
1 parent c0d9779

Showing 14 changed files with 124 additions and 34 deletions.
diff --git a/README.md b/README.md
@@ -13,7 +13,7 @@ Among other functions, SEDA allows you to:
 - Sort, merge, split, or reformat FASTA files.
 - Use [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) to perform different types of queries.
 - Use [Clustal Omega](http://www.clustal.org/omega/) to perform multiple sequence alignments.
-- Perform gene annotation using different tools: Splign/Compart, ProSplign/ProCompart, or Augustus (as implemented in SAPP).
+- Perform gene annotation using different tools: Splign/Compart, ProSplign/ProCompart, Augustus (as implemented in SAPP), or the [Conserved Genome Annotation (CGA) Pipeline](https://github.com/pegi3s/cga).
 
 ## Debugging
 In case you need see the commands executed by SEDA to run third-party software, just run SEDA with `-Dseda.execution.showcommands=true`.

diff --git a/seda-docs/source/images/operations/cga/1.png b/seda-docs/source/images/operations/cga/1.png
diff --git a/seda-docs/source/images/operations/cga/2.png b/seda-docs/source/images/operations/cga/2.png
diff --git a/seda-docs/source/images/operations/ncbi-rename/1.png b/seda-docs/source/images/operations/ncbi-rename/1.png
diff --git a/seda-docs/source/images/operations/ncbi-rename/2.png b/seda-docs/source/images/operations/ncbi-rename/2.png
diff --git a/seda-docs/source/installation-and-configuration.rst b/seda-docs/source/installation-and-configuration.rst
@@ -180,23 +180,29 @@ Follow the official Docker for Mac installation instructions (https://docs.docke
 Dependencies
 ============
 
-As explained before, some operations require third-party software (e.g. BLAST) in order to work. This section describes the dependencies required by SEDA. If Docker is available, then SEDA can run these software dependencies using Docker images (we recommend using the official iamges provided and maintained by us, although custom images can be used).
-
-+----------------------+------------+-----+-----+--------------------------+
-| BLAST                | 2.6.0      | Yes | Yes | Yes                      |
-+======================+============+=====+=====+==========================+
-| Clustal Omega        | 1.2.4      | Yes | Yes | Yes                      |
-+----------------------+------------+-----+-----+--------------------------+
-| bedtools             | 2.29.2     | Yes | No  | Yes (MacPorts, Homebrew) |
-+----------------------+------------+-----+-----+--------------------------+
-| EMBOSS               | 6.6.0      | Yes | No  | Yes (Native, Homebrew)   |
-+----------------------+------------+-----+-----+--------------------------+
-| Splign/Compart       | N/A        | Yes | No  | No                       |
-+----------------------+------------+-----+-----+--------------------------+
-| ProSplign/ProCompart | N/A        | Yes | No  | No                       |
-+----------------------+------------+-----+-----+--------------------------+
-| SAPP                 | 12/09/2019 | Yes | No  | No                       |
-+----------------------+------------+-----+-----+--------------------------+
+As explained before, some operations require third-party software (e.g. BLAST) in order to work. This section describes the dependencies required by SEDA. If Docker is available, then SEDA can run these software dependencies using Docker images (we recommend using the official images provided and maintained by us, although custom images can be used).
+
++----------------------+------------+---------------+---------------+---------------------------+
+| Tool                 | Version    | Linux         | Windows       | MacOS                     |
++======================+============+===============+===============+===========================+
+| BLAST                | 2.6.0      | Yes           | Yes           | Yes                       |
++----------------------+------------+---------------+---------------+---------------------------+
+| Clustal Omega        | 1.2.4      | Yes           | Yes           | Yes                       |
++----------------------+------------+---------------+---------------+---------------------------+
+| bedtools             | 2.29.2     | Yes           | No            | Yes (MacPorts, Homebrew)  |
++----------------------+------------+---------------+---------------+---------------------------+
+| EMBOSS               | 6.6.0      | Yes           | No            | Yes (Native, Homebrew)    |
++----------------------+------------+---------------+---------------+---------------------------+
+| Splign/Compart       | N/A        | Yes           | No            | No                        |
++----------------------+------------+---------------+---------------+---------------------------+
+| ProSplign/ProCompart | N/A        | Yes           | No            | No                        |
++----------------------+------------+---------------+---------------+---------------------------+
+| SAPP                 | 12/09/2019 | Yes           | No            | No                        |
++----------------------+------------+---------------+---------------+---------------------------+
+| CGA Pipeline         | 1.0.0      | Yes\ :sup:`1` | Yes\ :sup:`1` | Yes\ :sup:`1`             |
++----------------------+------------+---------------+---------------+---------------------------+
+
+:sup:`1` CGA is distributed as an executable Docker image and thus can be used as long as Docker is available.
 
 Compatibility issues
 --------------------
@@ -246,6 +252,11 @@ SAPP
 
 The original SAPP binaries are available here: http://sapp.gitlab.io/installation/. Nevertheless, it is recommended to use the following binaries: http://static.sing-group.org/software/SEDA/dev_resources/sapp.tar.gz. This version is the one included in the official Docker image (https://hub.docker.com/r/singgroup/seda-sapp).
 
+CGA
+---
+
+CGA can only be executed in SEDA using the official Docker image: https://hub.docker.com/r/pegi3s/cga.
+
 .. _ram_memory:
 
 Increasing RAM memory

diff --git a/seda-docs/source/operations.rst b/seda-docs/source/operations.rst
@@ -1170,6 +1170,56 @@ Finally, the remaining options in the configuration panel also allows to choose
 .. figure:: images/operations/sapp/3.png
    :align: center
 
+Conserved Genome Annotation (CGA) Pipeline
+------------------------------------------
+
+This operation allows the execution of the CGA (Conserved Genome Annotation) Pipeline, a Compi pipeline developed by us to efficiently perform CDS annotations by automating the steps that researchers usually follow when performing manual annotations. For further information and references about this method, refer to the official CGA documentation: https://github.com/pegi3s
+
+Each input FASTA file selected in SEDA will be used to launch a new pipeline execution with the specified reference file and configuration parameters.
+
+Configuration
++++++++++++++
+
+First, the *‘CGA Docker Image’* text box specifies the CGA Docker image used to run the pipeline. By default, the official *pegi3s/cga* image is used, and changing it is not recommended.
+
+Second, the mandatory *Reference FASTA file* option selects the FASTA file containing the reference protein sequence used to run the pipeline.
+
+.. figure:: images/operations/cga/1.png
+   :align: center
+
+Then, the following group contains specific parameters of the CGA pipeline to control the annotation process:
+
+- *Max. dist.*: the maximum distance between exons (in this case sequences identified by getorf) from the same gene. It only applies to large genome sequences where there is some chance that two genes with similar features are present.
+- *Intron BP*: Distance around the junction point between two sequences where to look for splicing signals.
+- *Min. CDS size*: Minimum size for CDS to be reported.
+- *Selection criterion*: The selection model to be used:
+
+    - 1. Similarity with reference sequence first, in case of a tie, percentage of gaps relative to reference sequence.
+    - 2. Percentage of gaps relative to reference sequence first, in case of a tie, similarity with reference sequence.
+    - 3. A mixed model with similarity with reference sequence first, but if there are fewer gaps relative to the reference sequence, similarity gets a bonus defined by the user. Currently, a bonus of 20 means 2%.
+
+- *Selection correction*: A bonus percentage times 10 when using the mixed selection model (3). For instance, 20 means 2% bonus. Something with 18% similarity acts as having 20% similarity.
+
+The *Skip pull Docker images* option can be selected to skip the *pull-docker-images* task of the pipeline. It can be used when the pipeline has already been run and the external Docker images it uses have already been downloaded.
+
+The *Results* option is used to specify the CGA result files that must be used as output for each input FASTA file:
+
+- *Predicted CDS (\*.nuc)*: takes the *results/nuc* file produced by the pipeline containing the predicted CDS nucleic acid sequences.
+- *Predicted proteins (\*.pep)*: takes the *results/pep* file produced by the pipeline containing the predicted CDS protein sequences.
+- *Incomplete CDS annotations (\*.results)*: takes the *results/results* file produced by the pipeline containing the DNA sequences being considered before the predict step. This file is useful for manual sequence refinement when there are reasons to believe that a complete annotation was not achieved. There are a number of situations in which this could happen. For instance, the first coding exon could be smaller than 30 bp (the minimum size for an ORF to be reported by getorf). It should, however, be noted that in such cases it would be equally difficult to annotate the gene manually.
+
+Finally, the *Parallel tasks* option specifies the maximum number of parallel tasks that each CGA pipeline execution can run. This number should be less than or equal to the number of available cores.
+
+Test data
++++++++++
+
+This operation can be tested using the test data available here (https://github.com/pegi3s/cga/raw/master/resources/test-data/cga-test-data.zip). First, the *‘input.fasta’* file should be selected using the SEDA *Input* area. Then, the *‘reference.fasta’* file should be selected in the configuration panel of the operation as *Reference FASTA file*. The rest of the configuration should look as in the following image:
+
+.. figure:: images/operations/cga/2.png
+   :align: center
+
+This example took about 21 minutes on a workstation with Ubuntu 18.04.6 LTS, 8 CPUs (Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz), 16GB of RAM and an SSD disk.
+
 getorf (EMBOSS)
 ---------------
 
@@ -2032,7 +2082,9 @@ Input:
 NCBI rename
 -----------
 
-This operation allows replacing NCBI accession numbers in the names of FASTA files by the associated organism name and additional information from the NCBI Taxonomy Browser (https://www.ncbi.nlm.nih.gov/Taxonomy/). An example of a FASTA file could be ‘GCF_000001735.3_TAIR10_cds_from_genomic.fna’. When this file is given to this operation, the organism name associated to the accession number ‘GCF_000001735.3’ is obtained from the NCBI (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001735.3). In this case, the ‘*Arabidopsis thaliana* (thale cress)’ is the associated organism name. The *‘File name’* allows specifying how this name is added to the file name and the *‘Delimiter’* parameter specifies if a separator should be set between the name and the file name. You can choose between one of the following *‘Position’* values:
+This operation allows replacing NCBI accession numbers in the names of FASTA files by the associated organism name and additional information from the NCBI Taxonomy Browser (https://www.ncbi.nlm.nih.gov/Taxonomy/). An example of a FASTA file could be ‘GCF_000001735.3_TAIR10_cds_from_genomic.fna’. When this file is given to this operation, the organism name associated to the accession number ‘GCF_000001735.3’ is obtained from the NCBI (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001735.3). In this case, the ‘*Arabidopsis thaliana* (thale cress)’ is the associated organism name.
+
+The *‘File name’* allows specifying how this name is added to the file name and the *‘Delimiter’* parameter specifies if a separator should be set between the name and the file name. You can choose between one of the following *‘Position’* values:
 
 - *Prefix*: before the actual file name. In the example, with ‘Delimiter’ = ‘_’, the output FASTA would be named ‘Arabidopsis thaliana (thale cress)_GCF_000001735.3_TAIR10_cds_from_genomic.fna’.
 - *Suffix*: after the actual file name.  In the example, with ‘Delimiter’ = ‘_’, the output FASTA would be named ‘GCF_000001735.3_TAIR10_cds_from_genomic.fna_Arabidopsis thaliana (thale cress)’.

diff --git a/seda-plugin-cga/README.md b/seda-plugin-cga/README.md
@@ -1,6 +1,33 @@
-SEDA Clustal Omega plugin
-=========================
+SEDA Conserved Genome Annotation (CGA) Pipeline plugin
+======================================================
 
-This plugin allows the possibility of executing Clustal Omega sequence alignments trough the SEDA Graphical User Interface. 
+This plugin allows executing the [Conserved Genome Annotation (CGA) Pipeline](https://github.com/pegi3s) through the SEDA Graphical User Interface.
 
-![SEDA Clustal Omega Operation Screenshot](seda-screenshot.png)
+![SEDA CGA Screenshot](seda-screenshot.png)
+
+By default, the intermediate files generated by this operation in temporary directories are removed. To keep them (e.g. for debugging purposes or in case of unexpected errors), run SEDA with `-Dseda.cga.keeptemporaryfiles=true`.
+
+For developers
+--------------
+
+The CGA pipeline involves a series of steps implemented in the `CgaPipeline` class. In order to programmatically test this pipeline, the following code can be used with the test data available [here](https://www.sing-group.org/seda/downloads/data/test-data-splign-compart.zip). It took about 21 minutes to complete using a workstation with Ubuntu 18.04.6 LTS, 8 CPUs (Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz), 16GB of RAM and an SSD disk.
+
+```java
+  public static void main(String[] args) throws IOException, InterruptedException {
+    System.setProperty(AbstractBinariesExecutor.SEDA_EXECUTION_SHOW_COMMANDS, "true");
+    DatatypeFactory factory = DatatypeFactory.getDefaultDatatypeFactory();
+
+    SequencesGroup input = factory.newSequencesGroup(new File("input.fasta").toPath());
+    SequencesGroup reference = factory.newSequencesGroup(new File("ref.fasta").toPath());
+
+    CgaCompiPipelineConfiguration config = new CgaCompiPipelineConfiguration(
+      10000, 500, 200, SelectionCriterion.CRITERION_1, 10, false
+    );
+
+    new CgaPipeline(
+      new DefaultDockerCgaBinariesExecutor(),
+      new CgaPipelineParameters(new File("/tmp/seda-cga"), config, ""),
+      input, reference
+    ).run();
+  }
+```
diff --git a/seda-plugin-cga/pom.xml b/seda-plugin-cga/pom.xml
@@ -12,7 +12,7 @@
 
 	<artifactId>seda-plugin-cga</artifactId>
 	<packaging>jar</packaging>
-	<name>SEquence DAtaset builder CGA Omega plugin</name>
+	<name>SEquence DAtaset builder CGA plugin</name>
 
 	<dependencies>
 		<dependency>

diff --git a/seda-plugin-cga/seda-screenshot.png b/seda-plugin-cga/seda-screenshot.png
diff --git a/...in-cga/src/main/java/org/sing_group/seda/cga/execution/CgaCompiPipelineConfiguration.java b/...in-cga/src/main/java/org/sing_group/seda/cga/execution/CgaCompiPipelineConfiguration.java
@@ -37,10 +37,10 @@ public enum SelectionCriterion {
       "Similarity with reference sequence first, in case of a tie, percentage of gaps relative to reference sequence."
     ), CRITERION_2(
       2, "Percentage of gaps",
-      "Percentage of gaps relative to reference sequence first, in case of a tie, similarity with reference sequence"
+      "Percentage of gaps relative to reference sequence first, in case of a tie, similarity with reference sequence."
     ), CRITERION_3(
       3, "Mixed",
-      "A mixed model with similarity with reference sequence first, but if fewer gaps relative to reference sequence similarity gets a bonus defined by the user. Currently, a bonus of 20, means 2%"
+      "A mixed model with similarity with reference sequence first, but if fewer gaps relative to reference sequence similarity gets a bonus defined by the user. Currently, a bonus of 20, means 2%."
     );
 
     private int value;
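
Note on the `SelectionCriterion` descriptions in the hunk above: criteria 1 and 2 read as two orderings of the same pair of keys. The following is a minimal sketch of that reading, not the CGA implementation. The `Candidate` class, the comparator names, and the assumption that higher similarity and a lower gap percentage are both "better" are hypothetical and do not come from the commit.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SelectionCriterionSketch {
  // Hypothetical candidate annotation; both values are percentages.
  static final class Candidate {
    final String name;
    final double similarity;
    final double gaps;

    Candidate(String name, double similarity, double gaps) {
      this.name = name;
      this.similarity = similarity;
      this.gaps = gaps;
    }
  }

  // Assumption: higher similarity is better, so it sorts first.
  static final Comparator<Candidate> BY_SIMILARITY =
    (a, b) -> Double.compare(b.similarity, a.similarity);
  // Assumption: a lower percentage of gaps is better, so it sorts first.
  static final Comparator<Candidate> BY_GAPS =
    (a, b) -> Double.compare(a.gaps, b.gaps);

  // Criterion 1: similarity first, gaps as tie-breaker.
  static final Comparator<Candidate> CRITERION_1 = BY_SIMILARITY.thenComparing(BY_GAPS);
  // Criterion 2: gaps first, similarity as tie-breaker.
  static final Comparator<Candidate> CRITERION_2 = BY_GAPS.thenComparing(BY_SIMILARITY);

  // Returns the candidate that sorts first under the given criterion.
  static Candidate best(List<Candidate> candidates, Comparator<Candidate> criterion) {
    return Collections.min(candidates, criterion);
  }

  public static void main(String[] args) {
    List<Candidate> candidates = Arrays.asList(
      new Candidate("a", 90.0, 5.0),
      new Candidate("b", 90.0, 2.0),
      new Candidate("c", 85.0, 0.0)
    );
    System.out.println(best(candidates, CRITERION_1).name); // prints "b"
    System.out.println(best(candidates, CRITERION_2).name); // prints "c"
  }
}
```

Under criterion 1, candidates "a" and "b" tie on similarity and the gap percentage decides. Under criterion 2, "c" wins on gaps alone despite its lower similarity.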

diff --git a/...ga/src/main/java/org/sing_group/seda/cga/gui/CgaCompiPipelineConfigurationParameters.java b/...ga/src/main/java/org/sing_group/seda/cga/gui/CgaCompiPipelineConfigurationParameters.java
@@ -45,13 +45,13 @@ public class CgaCompiPipelineConfigurationParameters {
   private static final String HELP_MAX_DIST = "<html>Maximum distance between exons (in this case sequences identified by getorf) from the same gene.<br/><br/>"
     + "It only applies to large genome sequences where there is some chance that two genes with similar features are present.</html>";
   private static final String HELP_INTRON_BP = "Distance around the junction point between two sequences where to look for splicing signals.";
-  private static final String HELP_MIN_FULL_NUCLEOTIDE_SIZE = "Minimum size for CDS to be reported";
+  private static final String HELP_MIN_FULL_NUCLEOTIDE_SIZE = "Minimum size for CDS to be reported.";
   private static final String HELP_SELECTION_CRITERION = "<html>The selection model to be used: <ol><li>"
     + CgaCompiPipelineConfiguration.SelectionCriterion.CRITERION_1.getDescription() + "</li><li>"
     + CgaCompiPipelineConfiguration.SelectionCriterion.CRITERION_2.getDescription() + "</li><li>"
     + CgaCompiPipelineConfiguration.SelectionCriterion.CRITERION_3.getDescription() + "</li></ol></html>";
-  private static final String HELP_SELECTION_CORRECTION = "A bonus percentage times 10. For instance, 20 means 2% bonus. Something with 18% similarity acts as having 20% similarity.";
-  private static final String HELP_SKIP_PULL_DOCKER_IMAGES = "Use this flag to skip the pull-docker-images task.";
+  private static final String HELP_SELECTION_CORRECTION = "A bonus percentage times 10 when using the mixed selection model (3). For instance, 20 means 2% bonus. Something with 18% similarity acts as having 20% similarity.";
+  private static final String HELP_SKIP_PULL_DOCKER_IMAGES = "<html>Use this flag to skip the <i>pull-docker-images</i> task.</html>";
 
   private JIntegerTextField maxDist;
   private JIntegerTextField intronBp;
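
Note on `HELP_SELECTION_CORRECTION` in the hunk above: the parameter is expressed as "a bonus percentage times 10", which is easy to misread. The sketch below only illustrates the unit conversion stated in the help text (20 means 2%, so 18% similarity acts as 20%). The class and method names are hypothetical, and when the bonus is actually applied (only for candidates with fewer gaps under the mixed model) is decided by the CGA pipeline itself, not by this code.

```java
public class SelectionCorrectionSketch {
  // Converts the "Selection correction" value (bonus percentage times 10)
  // into percentage points. Hypothetical helper, not part of SEDA or CGA.
  static double bonusPercent(int selectionCorrection) {
    return selectionCorrection / 10.0;
  }

  // Similarity after adding the bonus, following the help text example.
  static double effectiveSimilarity(double similarityPercent, int selectionCorrection) {
    return similarityPercent + bonusPercent(selectionCorrection);
  }

  public static void main(String[] args) {
    System.out.println(bonusPercent(20));              // prints 2.0
    System.out.println(effectiveSimilarity(18.0, 20)); // prints 20.0
  }
}
```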

diff --git a/...rc/main/java/org/sing_group/seda/cga/gui/CgaPipelineTransformationConfigurationPanel.java b/...rc/main/java/org/sing_group/seda/cga/gui/CgaPipelineTransformationConfigurationPanel.java
@@ -54,7 +54,7 @@ public class CgaPipelineTransformationConfigurationPanel extends JPanel {
 
   private static final String HELP_CGA_IMAGE =
     "<html>The CGA Docker image.<br/> By default, the official pegi3s/cga image is used.<br/>"
-      + "It is not recommended to change it.</html>";
+      + "It is not recommended changing it.</html>";
   private static final String HELP_REFERENCE_FASTA = "FASTA file containing the reference sequence.";
   private static final String HELP_CGA_RESULTS = "The CGA results to collect.";
   private static final String HELP_COMPI_TASKS = "The maximum number of parallell tasks that the Compi pipeline may execute.";

diff --git a/seda-plugin-splign-compart/README.md b/seda-plugin-splign-compart/README.md
@@ -1,16 +1,16 @@
 SEDA Splign/Compart plugin
 =================
 
-This plugin allows the possibility of executing the Splign/Compart pipeline trough the SEDA Graphical User Interface. 
+This plugin allows the possibility of executing the Splign/Compart pipeline through the SEDA Graphical User Interface.
 
 ![SEDA Splign/Compart Operation Screenshot](seda-screenshot.png)
 
 By default, the intermediate files generated by this operation in temporary directories are removed. If you need to keep them (e.g. for debugging purposes os in case of unexpected errors), it is possible to keep them by running SEDA with `-Dseda.spligncompart.keeptemporaryfiles=true`.
 
 For developers
-----------------
+--------------
 
-The Splign/Compart pipeline involes a series of steps implemented in the `SplignCompartPipeline` class. In order to programmatically test this pipeline, the following code can be used with the test data available [here](https://www.sing-group.org/seda/downloads/data/test-data-splign-compart.zip).
+The Splign/Compart pipeline involves a series of steps implemented in the `SplignCompartPipeline` class. In order to programmatically test this pipeline, the following code can be used with the test data available [here](https://github.com/pegi3s/cga/raw/master/resources/test-data/cga-test-data.zip).
 
 ```java
   public static void main(String[] args) throws IOException, InterruptedException, ExecutionException {
@@ -32,4 +32,4 @@ The Splign/Compart pipeline involes a series of steps implemented in the `Splign
 
     splignCompartPipeline.splignCompart(targetFileFasta, cdsQueryFileFasta, outputFasta, true);
   }
-```
+```