backport v1.4.1 documentation
hsiaoyi0504 committed Jul 23, 2018
1 parent 2f51054 commit 7b81684
Showing 18 changed files with 634 additions and 14 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
/docs/_build/
.DS_Store
8 changes: 4 additions & 4 deletions README.md
@@ -7,7 +7,7 @@

The [GFF3 format](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md) (Generic Feature Format Version 3) is one of the standard formats to describe and represent genomic features. It is an incredibly flexible, 9-column format, which is easily manipulated by biologists. This flexibility, however, makes it very easy to break the format. We have developed the GFF3toolkit to help identify common problems with GFF3 files; fix 30 of these common problems; sort GFF3 files (which can aid in using down-stream processing programs and custom parsing); merge two GFF3 files into a single, non-redundant GFF3 file; and generate FASTA files from a GFF3 file for many use cases (e.g. feature types beyond mRNA).

[Frequently Asked Questions/FAQ](https://github.com/NAL-i5K/GFF3toolkit/wiki/FAQ)
[Frequently Asked Questions/FAQ](docs/FAQ.md)

## Prerequisite

@@ -46,7 +46,7 @@ When installing gff3tool, if you found the package was built through wheel (bdis

* `gff3_QC` - Detection of GFF format errors (~50 types of errors).
* [gff3_QC readme](docs/gff3_QC.md)
* [gff3_QC full documentation](https://github.com/NAL-i5K/GFF3toolkit/wiki/Detection-of-GFF3-format-errors)
* [gff3_QC full documentation](docs/Detection-of-GFF3-format-errors.md)
* Quick start:
`gff3_QC -g example_file/example.gff3 -f example_file/reference.fa -o error.txt`
* Please refer to [gff3tool/lib/ERROR/ERROR.py](gff3tool/lib/ERROR/ERROR.py) to see the full list of Error codes and the corresponding Error tags.
@@ -55,15 +55,15 @@ When installing gff3tool, if you found the package was built through wheel (bdis

* `gff3_fix` - Correct GFF3 errors detected by gff3_QC.py (30 types of errors).
* [gff3_fix readme](docs/gff3_fix.md)
* [gff3_fix full documentation](https://github.com/NAL-i5K/GFF3toolkit/wiki/gff3_fix.py-documentation/)
* [gff3_fix full documentation](docs/gff3_fix.py-documentation.md)
* Quick start:
`gff3_fix -qc_r error.txt -g example_file/example.gff3 -og corrected.gff3`

### Merge two GFF3 files ([back](#gff3toolkit---python-programs-for-processing-gff3-files))

* `gff3_merge` - Merge two GFF3 files
* [gff3_merge readme](docs/gff3_merge.md)
* [gff3_merge full documentation](https://github.com/NAL-i5K/GFF3toolkit/wiki/Merge-two-GFF3-files)
* [gff3_merge full documentation](docs/Merge-two-GFF3-files.md)
* Quick start:
* Merge the two files with auto-assignment of replace tags (default)
`gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt`
87 changes: 87 additions & 0 deletions docs/Detection-of-GFF3-format-errors.md
@@ -0,0 +1,87 @@
# gff3_QC full documentation

## Background

The GFF3 format is flexible and easy to use for most biologists, but this flexibility also allows many errors to be introduced. This QC program aims to detect over 50 types of formatting errors.

Errors are detected by reviewing three types of feature sets in a GFF3 file, and thus are grouped into three categories (Error category – feature type):
* Intra-model errors (Ema) – multiple features within a model
* Inter-model errors (Emr) – multiple features across models
* Single feature errors (Esf) – each single feature.

In addition, we distinguish between errors that apply to protein-coding genes in the ['canonical' Sequence Ontology style](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md), and errors that apply to 'non-canonical' gene models, i.e. non-coding models, or protein-coding genes that are not modeled with gene, mRNA, CDS and exon features. To perform error-checking on a GFF3 file that contains non-canonical gene models, specify the `-noncg` argument when running the program.

Below we list all errors currently considered by gff3_QC.py, including the error code, the error tag (a brief explanation of the error), and whether the error is checked for non-canonical gene models (when using the `-noncg` argument).

View the [gff3_QC.py readme](https://github.com/NAL-i5K/GFF3toolkit/blob/master/gff3_QC.md) for instructions on how to run the program.
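
The three-letter prefix of each error code identifies its category, so a report can be grouped by category with a simple lookup. The following is a minimal sketch of that idea, not part of gff3_QC.py; the names `CATEGORY_BY_PREFIX`, `error_category`, and `group_by_category` are hypothetical and introduced only for illustration.

```python
# Illustrative sketch only; these names are hypothetical, not GFF3toolkit APIs.
CATEGORY_BY_PREFIX = {
    "Ema": "Intra-model",
    "Emr": "Inter-model",
    "Esf": "Single feature",
}


def error_category(error_code):
    """Return the category name for an error code such as 'Ema0001'."""
    prefix = error_code[:3]
    if prefix not in CATEGORY_BY_PREFIX:
        raise ValueError(f"unrecognized error code prefix: {error_code!r}")
    return CATEGORY_BY_PREFIX[prefix]


def group_by_category(error_codes):
    """Group a sequence of error codes into {category: [codes...]}."""
    grouped = {}
    for code in error_codes:
        grouped.setdefault(error_category(code), []).append(code)
    return grouped


if __name__ == "__main__":
    print(group_by_category(["Ema0001", "Esf0022", "Emr0003", "Esf0018"]))
```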

### Intra-model: Multiple features within a model (Ema)
The error category 'Intra-model' collects formatting errors that can be found by jointly considering multiple features within a gene model, such as gene, mRNA, exon, and CDS features. Errors in this category are given an 'Error_Code' starting with 'Ema'.

|Error_Code|Error_Tag|Checked if non-canonical|
|:------|:------|:-----|
|Ema0001|Parent feature start and end coordinates exceed those of child features|Yes|
|Ema0002|Protein sequence contains internal stop codons|No|
|Ema0003|This feature is not contained within the parent feature coordinates|Yes|
|Ema0004|Incomplete gene feature that should contain at least one mRNA, exon, and CDS|No|
|Ema0005|Pseudogene has invalid child feature type|Yes|
|Ema0006|Wrong phase|No|
|Ema0007|CDS and parent feature on different strands|Yes|
|Ema0008|Warning for distinct isoforms that do not share any regions|No|
|Ema0009|Incorrectly merged gene parent? Isoforms that do not share coding sequences are found|No|

### Inter-model: Multiple features across models (Emr)
The error category 'Inter-model' collects formatting errors that can be found by comparing multiple gene models. Errors in this category are given an 'Error_Code' starting with 'Emr'.

|Error_Code|Error_Tag|Checked if non-canonical|
|:------|:------|:-----|
|Emr0001|Duplicate transcript found|No|
|Emr0002|Incorrectly split gene parent?|No|
|Emr0003|Duplicate ID|Yes|

### Single feature (Esf)
The error category 'Single Feature' collects formatting errors that can be found by searching the GFF3 file line by line. Errors in this category are given an 'Error_Code' starting with 'Esf'.

|Error_Code|Error_Tag|Checked if non-canonical|
|:------|:------|:-----|
|Esf0001|Feature type may need to be changed to pseudogene|Yes|
|Esf0002|Start/Stop is not a valid 1-based integer coordinate|Yes|
|Esf0003|strand information missing|Yes|
|Esf0004|Seqid not found in any ##sequence-region|Yes|
|Esf0005|Start is less than the ##sequence-region start|Yes|
|Esf0006|End is greater than the ##sequence-region end|Yes|
|Esf0007|Seqid not found in the embedded ##FASTA|Yes|
|Esf0008|End is greater than the embedded ##FASTA sequence length|Yes|
|Esf0009|Found Ns in a feature using the embedded ##FASTA|Yes|
|Esf0010|Seqid not found in the external FASTA file|Yes|
|Esf0011|End is greater than the external FASTA sequence length|Yes|
|Esf0012|Found Ns in a feature using the external FASTA|Yes|
|Esf0013|White chars not allowed at the start of a line|Yes|
|Esf0014|"##gff-version" missing from the first line|Yes|
|Esf0015|Expecting certain fields in the feature|Yes|
|Esf0016|##sequence-region seqid may only appear once|Yes|
|Esf0017|Start/End is not a valid integer|Yes|
|Esf0018|Start is not less than or equal to end|Yes|
|Esf0019|Version is not "3"|Yes|
|Esf0020|Version is not a valid integer|Yes|
|Esf0021|Unknown directive|Yes|
|Esf0022|Features should contain 9 fields|Yes|
|Esf0023|escape certain characters|Yes|
|Esf0024|Score is not a valid floating point number|Yes|
|Esf0025|Strand has illegal characters|Yes|
|Esf0026|Phase is not 0, 1, or 2, or not a valid integer|Yes|
|Esf0027|Phase is required for all CDS features|Yes|
|Esf0028|Attributes must escape the percent (%) sign and any control characters|Yes|
|Esf0029|Attributes must contain one and only one equal (=) sign|Yes|
|Esf0030|Empty attribute tag|Yes|
|Esf0031|Empty attribute value|Yes|
|Esf0032|Found multiple attribute tags|Yes|
|Esf0033|Found ", " in a attribute, possible unescaped|Yes|
|Esf0034|attribute has identical values (count, value)|Yes|
|Esf0035|attribute has unresolved forward reference|Yes|
|Esf0036|Value of a attribute contains unescaped ","|Yes|
|Esf0037|Target attribute should have 3 or 4 values|Yes|
|Esf0038|Start/End value of Target attribute is not a valid integer coordinate|Yes|
|Esf0039|Strand value of Target attribute has illegal characters|Yes|
|Esf0040|Value of Is_circular attribute is not "true"|Yes|
|Esf0041|Unknown reserved (uppercase) attribute|Yes|
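
To make the line-by-line nature of this category concrete, here is a rough sketch of how a few of the single-feature conditions listed above could be tested on one tab-separated feature line. It is an assumption-laden illustration, not the logic used by gff3_QC.py: the function name `check_feature_line` is hypothetical, only a handful of the codes are covered, and the accepted strand characters (`+`, `-`, `.`, `?`) are taken from the GFF3 specification.

```python
# Illustrative sketch only; not the gff3_QC.py implementation.
VALID_STRANDS = {"+", "-", ".", "?"}  # strand values allowed by the GFF3 spec


def _is_ascii_integer(text):
    """True if text consists only of ASCII digits (e.g. '42')."""
    return text.isascii() and text.isdigit()


def check_feature_line(line):
    """Return the error codes (from the table above) that a feature line triggers."""
    codes = []
    if line[:1].isspace():
        codes.append("Esf0013")  # whitespace at the start of a line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        codes.append("Esf0022")  # features should contain 9 fields
        return codes  # remaining checks need all nine columns
    _seqid, _source, ftype, start, end, _score, strand, phase, _attrs = fields
    if not (_is_ascii_integer(start) and _is_ascii_integer(end)):
        codes.append("Esf0017")  # start/end is not a valid integer
    elif int(start) > int(end):
        codes.append("Esf0018")  # start is not less than or equal to end
    if strand not in VALID_STRANDS:
        codes.append("Esf0025")  # strand has illegal characters
    if phase == ".":
        if ftype == "CDS":
            codes.append("Esf0027")  # phase is required for CDS features
    elif phase not in {"0", "1", "2"}:
        codes.append("Esf0026")  # phase is not 0, 1, or 2
    return codes


if __name__ == "__main__":
    print(check_feature_line("chr1\t.\tgene\t30\t20\t.\t+\t.\tID=g1"))
```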
33 changes: 33 additions & 0 deletions docs/FAQ.md
@@ -0,0 +1,33 @@
# FAQ

## Q: When running one of the GFF3-toolkit programs, the program fails with a stack trace error.
Usually, this means that there is a problem with the input file. We are working on having each program output error messages with the input file line number. In the meantime, send us your input file and we can help figure out what the problem is.

## Q: What are the licensing terms for this project?
This software/database is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the author's official duties as a United States Government employee and thus cannot be copyrighted. This software/database is freely available to the public for use. The National Agricultural Library and the U.S. Government have not placed any restriction on its use or reproduction. (Please see [LICENCE.md](https://github.com/NAL-i5K/GFF3toolkit/blob/master/LICENCE.md))

## Q: What kind of errors can be detected by gff3_QC.py? (Detection of GFF3 format errors: gff3_QC.py)
Currently, ~50 types of formatting errors can be detected. Errors are detected by reviewing three types of feature sets in a GFF3 file, and thus are grouped into three categories (Error category – feature type):
* Intra-model errors (Ema) – multiple features within a model
* Inter-model errors (Emr) – multiple features across models
* Single feature errors (Esf) – each single feature.

Please view the full documentation of [gff3_QC.py](Detection-of-GFF3-format-errors.md) for the full list of detected error types.

## Q: Why is gff3_QC.py taking so long to run? (Detection of GFF3 format errors: gff3_QC.py)
gff3_QC.py can take a while if your gff3 file is large - please be patient!

## Q: Why does the sorted gff3 file have a different number of lines than the input file? (Sort a GFF3 file: gff3_sort.py)
The program gff3_sort.py ignores all hash-tag (`#`) lines other than `##gff-version 3` and `###` while sorting a GFF3 file. After sorting, the program puts one `###` line between gene models in the output GFF3. The total number of lines in the output file can therefore differ from the input. To compare the number of feature lines, run the following commands:

> `grep -v "#" input.gff | wc -l`
>
> `grep -v "#" sorted.gff | wc -l`

In addition, if your input GFF3 file contains a feature that has two or more parent IDs, the program replicates the feature and lists it under each parent. In that case, the output file has more lines than the input file.
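
For readers who prefer to do this comparison in Python, the sketch below mirrors the `grep -v "#" ... | wc -l` count and also estimates how many extra lines multi-parent features would add after replication. It is only an illustration under stated assumptions: the function names are hypothetical, the file is assumed to be tab-separated with no embedded `##FASTA` section, and the replication estimate assumes one extra copy per additional parent ID.

```python
# Illustrative sketch only; function names are hypothetical, not GFF3toolkit APIs.
def count_lines_without_hash(path):
    """Count lines that contain no '#', like: grep -v "#" path | wc -l"""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if "#" not in line)


def count_extra_parent_copies(path):
    """Estimate extra output lines caused by features with several Parent IDs.

    Assumes each feature is written once per parent, so a feature with
    n parents contributes n - 1 additional lines.
    """
    extra = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # directive or comment line
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 9:
                continue  # not a well-formed feature line
            for attribute in fields[8].split(";"):
                tag, _, value = attribute.partition("=")
                if tag.strip() == "Parent" and value:
                    extra += len(value.split(",")) - 1
    return extra
```

If the two feature-line counts differ by exactly the value returned by `count_extra_parent_copies` for the input file, replication of multi-parent features is a likely explanation.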

## Q: Which codons are considered for translation? (Generate biological sequences from a GFF3 file: gff3_to_fasta.py)
Translation uses the 64 codons of the [standard genetic code](https://www-bimas.cit.nih.gov/molbio/translate/codes.html). Only standard codons and universal stop codons are considered.
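
As a rough illustration of what translating with the 64 standard codons means, the sketch below builds the standard genetic code table and translates a coding sequence. It is not the code used by gff3_to_fasta.py; the names `CODON_TABLE` and `translate` are hypothetical, and rendering unrecognized codons as `X` and dropping a trailing partial codon are assumptions made only for this example.

```python
# Illustrative sketch only; not the gff3_to_fasta.py implementation.
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, in TCAG codon order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons (e.g. 'ATG') to its one-letter amino acid.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate a DNA coding sequence codon by codon.

    Unrecognized codons (for example those containing N) become 'X', and a
    trailing partial codon is ignored; both are choices made for this sketch.
    """
    sequence = sequence.upper()
    usable_length = len(sequence) - len(sequence) % 3
    return "".join(
        CODON_TABLE.get(sequence[i:i + 3], "X")
        for i in range(0, usable_length, 3)
    )


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))  # prints "MA*"
```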

## Q: Why does gff3_merge.py sometimes reject auto-assigned replace tags when the reference model has multiple isoforms? (Merge 2 GFF3 files: gff3_merge.py)
It is possible for a modified model to have multiple isoforms that do not share CDS with each other - for example with partial models due to a poor genome assembly. In this case, the auto-assignment program will assign different replace tags to each isoform, but will then reject these auto-assigned replace tags because it expects isoforms of a gene model to have the same replace tags (see section "Some notes on multi-isoform models", above). You'll need to add the replace tags manually - all isoforms should carry the replace tags of all models to be replaced by the whole gene model.
Binary file added docs/I5KNAL_gff-merge_part1.png
Binary file added docs/I5KNAL_gff-merge_part2.png
Binary file added docs/I5KNAL_gff-merge_part3.png
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SPHINXPROJ = GFF3Toolkit
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)