GenHub is a free open-source software framework for analyzing eukaryotic genomes, computing and reporting a variety of statistics reflecting genome content and organization. GenHub works with user-supplied genomes (in Fasta and GFF3 format), and can also retrieve dozens of reference genomes from NCBI RefSeq and other public databases for comparison.
The interval locus (iLocus) is the primary unit of organization in GenHub. Each iLocus captures the genomic context of a single gene, a group of overlapping genes, or an intergenic region. iLoci provide a detailed and granular representation of the entire genome that is robust to improvements to the assembly and annotation. See our upcoming paper (Standage and Brendel, 2016) for more information.
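As a rough illustration of the iLocus idea, the following minimal sketch partitions a sequence into gene-containing and intergenic intervals. It is a simplification made for this document, not the LocusPocus algorithm: it only merges overlapping genes and fills the gaps, and the function name, the `genic`/`intergenic` labels, and the tuple layout are all hypothetical.

```python
def simple_iloci(seq_length, genes):
    """Partition a sequence into simplified iLoci.

    genes: list of (start, end) tuples, 1-based inclusive coordinates.
    Returns (start, end, kind, gene_count) tuples covering 1..seq_length.
    """
    # Step 1: merge overlapping genes into groups.
    merged = []
    for start, end in sorted(genes):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous group: extend it and count the gene.
            merged[-1][1] = max(merged[-1][1], end)
            merged[-1][2] += 1
        else:
            merged.append([start, end, 1])

    # Step 2: emit each group, plus the gaps between groups.
    iloci = []
    cursor = 1  # first position not yet assigned to an interval
    for start, end, count in merged:
        if start > cursor:
            iloci.append((cursor, start - 1, "intergenic", 0))
        iloci.append((start, end, "genic", count))
        cursor = end + 1
    if cursor <= seq_length:
        iloci.append((cursor, seq_length, "intergenic", 0))
    return iloci


demo = simple_iloci(1000, [(100, 200), (150, 300), (500, 600)])
```

Here the two overlapping genes end up in one interval, the third gene gets its own, and every remaining position falls into an intergenic interval, so the whole sequence is covered.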
See the complete installation instructions if you have not already installed GenHub and its dependencies.
The Fidibus program is the primary user interface of the GenHub package.
It is a companion to the LocusPocus program included in the AEGeAn Toolkit, which computes iLoci from a user-supplied genome annotation in GFF3 format.
Fidibus provides a comprehensive pipeline around LocusPocus, integrating genome and protein sequences and performing additional pre-processing, post-processing, error-checking, and calculation of summary statistics for iLoci and additional genome features.
For a complete listing of program options, execute fidibus -h in your shell.
The most important concepts are discussed below.
If you are analyzing a newly sequenced and/or non-model genome with Fidibus, you will need to provide 3 input files.
- a genomic DNA file: chromosome, scaffold, and/or contig sequences in Fasta format
- a genome annotation file: genes, mRNAs, and related features in GFF3 format
  - unique accessions for each gene and transcript should be provided using the accession or Name attribute (the ID attribute is unsuitable because it is intended for use only within a single GFF3 file and is not guaranteed to be consistent between multiple GFF3 files)
  - genome sequence IDs (GFF3 column 1) must match IDs from the gDNA Fasta file
- a protein file: protein sequences in Fasta format
  - protein sequence IDs must match their corresponding mRNA accessions
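One of the requirements above, that GFF3 column 1 IDs match the gDNA Fasta IDs, can be checked before running the pipeline. The following is a minimal sketch, not part of GenHub: the function names are hypothetical, it assumes a Fasta record ID is the first whitespace-delimited word after `>`, and it does not handle a `##FASTA` section embedded in the GFF3 file.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first word after '>') from Fasta lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gff3_seqids(lines):
    """Collect column-1 sequence IDs from GFF3 feature lines."""
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        ids.add(line.split("\t")[0])
    return ids


def missing_seqids(gff3_lines, fasta_lines):
    """Return GFF3 sequence IDs that have no matching Fasta record."""
    return sorted(gff3_seqids(gff3_lines) - fasta_ids(fasta_lines))


# Small in-memory example; real use would pass open file handles instead.
demo_fasta = [">chr1 first chromosome", "ACGT", ">chr2", "GGCC"]
demo_gff3 = [
    "##gff-version 3",
    "chr1\t.\tgene\t1\t4\t.\t+\t.\tID=g1;Name=GeneA",
    "chr3\t.\tgene\t1\t2\t.\t+\t.\tID=g2;Name=GeneB",
]
unmatched = missing_seqids(demo_gff3, demo_fasta)
```

An empty result means every annotated sequence has a Fasta record. Both functions accept any iterable of lines, so open file handles for the annotation and gDNA files can be passed directly.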
You can also compute statistics for a model organism genome, either simultaneously with or separate from any user-supplied genome.
Run fidibus list to see the list of all supported reference genomes.
You can use a genome's 4-letter label to download and process its data with fidibus.
For example, the label for the budding yeast Saccharomyces cerevisiae is Scer, so you would process its genome like so.
fidibus --workdir=data/ --refr=Scer download prep iloci breakdown stats
If your model organism is not supported by GenHub, but you think it should be, submit a request on our issue tracker at https://github.com/standage/genhub/issues/new. If the genome is in RefSeq, adding support to GenHub is usually trivial. If not, you will need to specify the location from which the genome sequences, genome annotation, and protein sequences can be downloaded.
For each genome you analyze, Fidibus will create a dedicated subdirectory in your specified working directory.
Details about the working directory structure and contents are discussed in the Working directory section below.
Most users will be interested in the various data tables (.tsv files) produced by Fidibus, which can easily be loaded into R, Python, or other data analysis environments for analysis and visualization.
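As an illustration of loading one of these tables in Python, here is a minimal sketch using only the standard library. The table contents and the column names `LocusId` and `Length` are hypothetical stand-ins; check the header line of the actual .tsv file for the real column names.

```python
import csv
import io
import statistics


def column_summary(handle, column):
    """Return (count, mean, median) for a numeric column of a TSV table."""
    reader = csv.DictReader(handle, delimiter="\t")
    values = [float(row[column]) for row in reader]
    return len(values), statistics.mean(values), statistics.median(values)


# Hypothetical stand-in for a table such as Scer.iloci.tsv; the real
# column names may differ, so inspect the header line first.
demo_table = io.StringIO(
    "LocusId\tLength\n"
    "locus1\t1200\n"
    "locus2\t800\n"
    "locus3\t2500\n"
)
count, mean, median = column_summary(demo_table, "Length")
print(count, mean, median)
```

For a real table, replace the in-memory `demo_table` with an open file handle. For larger analyses, a data frame library is likely the more convenient choice.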
The fidibus program provides 7 primary build tasks.
- download: download the reference genome sequence, annotation, and protein sequences from the official source; in the case of user-supplied genomes on the local file system, verify that the specified files exist
- prep: pre-process the primary data, tidying it up so that all data files, regardless of source, are in a common format
- iloci: compute iLoci and extract iLocus sequences
- breakdown: extract sequences and parse annotations for various genome features to facilitate calculating descriptive statistics
- stats: calculate descriptive statistics for various genome features
- cluster: identify putative gene families by clustering iLocus protein products for multiple related genomes
- cleanup: remove intermediate data files to reduce storage needs
The first five tasks have linear dependencies and must be invoked in the order shown above.
The cluster task relies on the breakdown task, and does not require the stats task to be complete before being run.
A special build task, list, is provided for displaying all available reference genomes.
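The ordering rules described above can be written down as a small check. This is a minimal sketch of those two stated rules only, not the way fidibus validates its arguments; the function name and the wording of the messages are hypothetical.

```python
# The five tasks with linear dependencies, in their required order.
LINEAR_TASKS = ["download", "prep", "iloci", "breakdown", "stats"]


def check_task_order(tasks):
    """Return a list of problems found in a requested task sequence."""
    problems = []

    # Rule 1: the linear tasks that were requested must appear in order.
    linear = [task for task in tasks if task in LINEAR_TASKS]
    if linear != sorted(linear, key=LINEAR_TASKS.index):
        problems.append("linear tasks out of order: " + " ".join(linear))

    # Rule 2: cluster relies on breakdown, so it cannot come first.
    if "cluster" in tasks and "breakdown" in tasks:
        if tasks.index("cluster") < tasks.index("breakdown"):
            problems.append("cluster requested before breakdown")

    return problems


ok = check_task_order(["download", "prep", "iloci", "breakdown", "cluster"])
bad = check_task_order(["prep", "download"])
```

Note that the first sequence skips stats entirely, which is allowed because cluster depends only on breakdown.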
Most modern computers, including desktops and laptops, have multiple processors.
When analyzing multiple genomes, the fidibus program can use these processors to speed up computations by processing multiple genomes simultaneously on different threads.
Specify the number of processors you want to dedicate to GenHub with the --numprocs option (or -p for short).
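The general pattern of one genome per worker can be sketched as follows. This illustrates the concept only and is not fidibus's implementation: `process_genome` is a hypothetical placeholder for the per-genome work, and the `numprocs` parameter merely mirrors the name of the command-line option.

```python
from concurrent.futures import ThreadPoolExecutor


def process_genome(label):
    """Hypothetical placeholder for per-genome work; reports the label."""
    return label + ": done"


def process_all(labels, numprocs=2):
    """Process each genome label on a pool of numprocs workers."""
    with ThreadPoolExecutor(max_workers=numprocs) as pool:
        # map() returns results in input order even though the
        # individual genomes may finish in a different order.
        return list(pool.map(process_genome, labels))


results = process_all(["Acep", "Aech", "Cbir", "Cflo"], numprocs=2)
```

Because the unit of parallelism is a whole genome, dedicating more workers than there are genomes would bring no further speedup in a scheme like this one.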
Sometimes the best way to learn is to see some examples.
# Compute iLoci for a user-supplied genome
fidibus --workdir=./ --local --gdna=MyGenome.fasta --gff3=MyAnnotation.gff3 \
--prot=MyProteins.fasta --label=Gnm1 \
prep iloci
# Show all available reference genomes
fidibus list
# Download the budding yeast genome, but do not process
fidibus --workdir=/opt/data/genomes/ --refr=Scer download
# Download and process the Arabidopsis genome
fidibus --workdir=/opt/data/genomes/ --refr=Atha \
download prep iloci breakdown stats cleanup
# Retrieve and pre-process several ant genomes
fidibus --workdir=antgenomes/ --refr=Acep,Aech,Cbir,Cflo,Dqua,Lhum,Pbar,Sinv \
download prep
# Retrieve and process a batch of honeybee genomes
fidibus --workdir=./ --batch=honeybees download prep iloci breakdown stats
All data files produced or downloaded by Fidibus are stored in a working directory.
If the working directory does not already exist, Fidibus will create it.
Specify your desired working directory with the --workdir option (or -w for short).
Each genome data set has a dedicated sub-directory in the working directory, named with a unique (typically four-letter) label. Consider the following example.
fidibus --workdir=demo --refr=Otau,Oluc download prep
This command will download and pre-process genomes for two species (green algae, in this case). The resulting files and directories will be organized as follows.
demo/
├── Oluc/
│   ├── GCF_000092065.1_ASM9206v1_genomic.fna.gz
│   ├── GCF_000092065.1_ASM9206v1_genomic.gff.gz
│   ├── GCF_000092065.1_ASM9206v1_protein.faa.gz
│   ├── Oluc.all.prot.fa
│   ├── Oluc.gdna.fa
│   └── Oluc.gff3
└── Otau/
    ├── GCF_000214015.2_version_050606_genomic.fna.gz
    ├── GCF_000214015.2_version_050606_genomic.gff.gz
    ├── GCF_000214015.2_version_050606_protein.faa.gz
    ├── Otau.all.prot.fa
    ├── Otau.gdna.fa
    └── Otau.gff3
The files beginning with GCF were downloaded directly from the RefSeq database.
The other files comprise genome sequences, genome annotations, and protein sequences that have been pre-processed and are ready for parsing into iLoci.
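Given this layout, a script can confirm that the prep task has produced its three output files before moving on. The sketch below is a hypothetical helper, not part of GenHub; it assumes only the file naming shown in the example above.

```python
from pathlib import Path


def prepped_files(workdir, label):
    """Return expected paths of the pre-processed files for one genome."""
    genome_dir = Path(workdir) / label
    return {
        "gdna": genome_dir / (label + ".gdna.fa"),
        "gff3": genome_dir / (label + ".gff3"),
        "prot": genome_dir / (label + ".all.prot.fa"),
    }


def missing_prepped_files(workdir, label):
    """List expected pre-processed files that are not on disk."""
    return [
        str(path)
        for path in prepped_files(workdir, label).values()
        if not path.is_file()
    ]


expected = prepped_files("demo", "Oluc")
```

Calling `missing_prepped_files("demo", "Oluc")` after the example command above should return an empty list if the download and prep tasks completed.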
Running the complete Fidibus pipeline will produce dozens of data files in each dedicated genome directory.
These include the following.
- unprocessed data files downloaded directly from public databases (in the case of reference genomes)
  - usually compressed
  - start with GCF in the case of RefSeq genomes
- pre-processed genome data (produced by prep task)
  - genome sequences (Xxxx.gdna.fa)
  - genome annotation (Xxxx.gff3)
  - protein sequences (Xxxx.all.prot.fa)
- iLoci
  - locations (Xxxx.iloci.gff3)
  - sequences (Xxxx.iloci.fa)
  - merged iLocus data (Xxxx.miloci.gff3 and Xxxx.miloci.fa)
  - genome annotation, 1 gene model per iLocus (Xxxx.ilocus.mrnas.gff3)
  - non-redundant protein sequences, 1 gene model per iLocus (Xxxx.prot.fa)
- tables of descriptive statistics easily loaded into R/Python data frames for analysis
  - iLoci (Xxxx.iloci.tsv)
  - merged iLoci (Xxxx.miloci.tsv)
  - gene models (Xxxx.pre-mrnas.tsv)
  - mature mRNAs (Xxxx.mrnas.tsv)
  - exons (Xxxx.exons.tsv)
  - introns (Xxxx.introns.tsv)
  - coding sequences (Xxxx.cds.tsv)
- various other intermediate or ancillary files
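The naming conventions listed above are regular enough to sort the contents of a genome directory into categories. The following minimal sketch does so by file name alone. The function and the category names are hypothetical, and the GCF prefix test applies only to RefSeq downloads, as noted in the list.

```python
def classify(filename, label):
    """Assign a file in a genome directory to one of the categories above."""
    prepped = {label + s for s in (".gdna.fa", ".gff3", ".all.prot.fa")}
    ilocus = {
        label + s
        for s in (
            ".iloci.gff3",
            ".iloci.fa",
            ".miloci.gff3",
            ".miloci.fa",
            ".ilocus.mrnas.gff3",
            ".prot.fa",
        )
    }
    if filename.startswith("GCF"):
        return "raw download"  # RefSeq naming only
    if filename in prepped:
        return "pre-processed"
    if filename in ilocus:
        return "iLocus data"
    if filename.startswith(label + ".") and filename.endswith(".tsv"):
        return "statistics table"
    return "other"


demo_names = [
    "GCF_000092065.1_ASM9206v1_genomic.fna.gz",
    "Oluc.gdna.fa",
    "Oluc.prot.fa",
    "Oluc.iloci.tsv",
    "notes.txt",
]
categories = [classify(name, "Oluc") for name in demo_names]
```

Grouping files this way makes it easy to pick out, for example, only the statistics tables for downstream analysis.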
Only a brief summary of each script is provided below. For additional documentation demonstrating how these scripts were used to produce the results reported in (Standage and Brendel, 2016), see https://github.com/BrendelGroup/IntervalLoci.
- pipeline scripts (invoked by Fidibus)
  - genhub-filens.py: report lengths of flanking iiLoci for each giLocus
  - genhub-format-gff3.py: perform various annotation pre-processing tasks
  - genhub-glean-to-gff3.py: convert GLEAN output to GFF3
  - genhub-namedup.py: copy GFF3 ID attributes to Name attributes
  - genhub-stats.py: calculate descriptive statistics for various data types
- post-pipeline scripts (invoked by user)
  - genhub-compact.py: compute (φ, σ) measures of genome compactness
  - genhub-ilocus-summary.py: compute summary table of iLocus data
  - genhub-milocus-summary.py: compute summary table of merged iLocus data
  - genhub-pilocus-summary.py: compute summary table of protein-coding iLocus data
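To make one of these operations concrete, here is a minimal sketch of copying a GFF3 ID attribute to a Name attribute, in the spirit of the one-line description of genhub-namedup.py. It is not the script's actual code. In particular, leaving an existing Name untouched is an assumption, and the function name is hypothetical.

```python
def copy_id_to_name(gff3_line):
    """Add Name=<ID> to a GFF3 feature line that has an ID but no Name."""
    line = gff3_line.rstrip("\n")
    fields = line.split("\t")
    if line.startswith("#") or len(fields) != 9:
        return line  # directive, comment, or not a feature line

    # Column 9 holds semicolon-separated key=value attribute pairs.
    pairs = [item for item in fields[8].split(";") if item]
    values = dict(item.split("=", 1) for item in pairs if "=" in item)

    # Assumption: an existing Name attribute is left as it is.
    if "ID" in values and "Name" not in values:
        pairs.append("Name=" + values["ID"])
        fields[8] = ";".join(pairs)
    return "\t".join(fields)


demo_line = "chr1\t.\tgene\t1\t4\t.\t+\t.\tID=g1"
renamed = copy_id_to_name(demo_line)
```

The function returns the line without its trailing newline, so a caller writing results to a file would add one back.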