Structure of HDF5 data

For all subsequent analyses of ribosome footprinting and RNA-seq datasets, we store summaries of aligned read data in Hierarchical Data Format (HDF5) format. HDF5 allows for rapid access to mapped reads of a particular length to any coding sequence.

To learn more about accessing and manipulating HDF5 files in R, see Bioconductor's HDF5 interface to R and Neon's Introduction to HDF5 Files in R.

HDF5 file architecture

The HDF5 file is organized in the following hierarchy /<Gene>/<Dataset>/reads/data. A snippet from an example HDF5 file is shown below.

                              group              name       otype  dclass       dim
0                                 /           YAL001C   H5I_GROUP                  
1                          /YAL001C 2016_Weinberg_RPF   H5I_GROUP                  
2        /YAL001C/2016_Weinberg_RPF             reads   H5I_GROUP                  
3  /YAL001C/2016_Weinberg_RPF/reads              data H5I_DATASET INTEGER 36 x 3980
4                                 /           YAL002W   H5I_GROUP                  
5                          /YAL002W 2016_Weinberg_RPF   H5I_GROUP                  
6        /YAL002W/2016_Weinberg_RPF             reads   H5I_GROUP                  
7  /YAL002W/2016_Weinberg_RPF/reads              data H5I_DATASET INTEGER 36 x 4322
8                                 /           YAL003W   H5I_GROUP                  
9                          /YAL003W 2016_Weinberg_RPF   H5I_GROUP                  
10       /YAL003W/2016_Weinberg_RPF             reads   H5I_GROUP                  
11 /YAL003W/2016_Weinberg_RPF/reads              data H5I_DATASET INTEGER 36 x 1118

The data table is an integer table with each rows representing a read length and columns representing nucleotide positions. The first row corresponds to reads of length 15 and the last row corresponds to reads of length 50. All reads are mapped to their 5' ends (see below).

Read attributes

The reads group in /<Gene>/<Dataset>/reads/data have several attributes associated with it. These are summary statistics and other information about the gene and dataset within the reads group. The list of attributes are

reads_total : Sum of reads of all lenghts between -25 to +25 of a CDS
buffer_left : Number of nucleotides upstream of the start codon (ATG) - 250nt
buffer_right : Number of nucleotides downstream of the stop codon (TAA/TAG/TGA) - 247nt
start_codon_pos : Positions corresponding to the start codon - (251,252,253)
stop_codon_pos : Positions corresponding to the stop codon (variable)
reads_by_len : Sum of reads between -25 to +25 of a CDS for each length
lengths : Lengths of mapped reads (15-50)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!