CHAPAO (Compressing Alignments using Hierarchical and Probabilistic Approach) is a novel reference-based technique for compressing MSA files. The name of this method is inspired from a bengali word “চাপাও”, which means “compress” or “squeeze”, and is pronounced the same way as CHAPAO (/chɑ:paʊ/). This is to our knowledge the first application of the reference-based technique for compressing MSAs. Unlike conventional reference-based methods where an “extra” sequence (not included in the input sequences to compress) is used as the reference, we used a novel hierarchical referencing technique where a suitable subset of the input sequences in the MSA file is used as reference sequences. CHAPAO offers a substantial improvement in compression gain over the existing best alternate methods for both general purpose compression algorithms (zip, Bzip2, gzip) and special purpose compression algorithm (MFCompress).
Python 3.0 or later
First Name | Uses |
---|---|
compress.py | Python script for compression |
decompress.py | Python script for decompression |
compress.exe | Windows executable for compression |
decompress.exe | Windows executable for decompression |
Windows sequrity system might consider those executables as malwares. If there is any problem in downloading the files, please follow the these alternative links for python script and executables
Input files should be in FASTA or PHYLIP format.
Using Python Script
python3 compress.py PATH_OF_THE_FILE WINDOW_SIZE OVERLAP_AMOUNT
Example:
python3 compress.py /home/Desktop/DATA/avian/chr1_96_s.fasta 30 28
Using EXE
compress.exe PATH_OF_THE_FILE WINDOW_SIZE OVERLAP_AMOUNT
Example:
compress.exe /home/Desktop/DATA/avian/chr1_96_s.fasta 30 28
This will create a folder with extention .chapao which is the output of the compression algorithm. There will be two separate files in the folder, namely metadata.txt and ref.txt.
WINDOW_SIZE and OVERLAP_AMOUNT are two important hyperparameters in our algorithm. With the increase of WINDOW_SIZE, the compression gain generally increases at the cost of more compression time. OVERLAP_AMOUNT should be less than WINDOW_SIZE and its effect is similar to WINDOW_SIZE.
For smaller files (<100MB) WINDOW_SIZE of 40-50 and OVERLAP_AMOUNT of 35-48 will give a high compression ratio within reasonable amounts of time. For larger files (>100MB) WINDOW_SIZE of 5-10 and OVERLAP_AMOUNT of 3-8 should be used to compress files within reasonable amounts of time.
Using Python Script
python3 decompress.py PATH_OF_COMPRESSED_FOLDER OUTPUT_FORMAT(f/p)
OUTPUT_FORMAT = 'f' - to produce decompressed files in FASTA format
OUTPUT_FORMAT = 'p' - to produce decompressed files in PHYLIP format
Example:
python3 decompress.py /home/Desktop/DATA/avian/chr1_96_s.fasta.chapao/ f
This will create the decompressed file named "Decompressed.txt" inside the same folder.
Using EXE
decompress.exe PATH_OF_COMPRESSED_FOLDER OUTPUT_FORMAT(f/p)
Here 'f' and 'p' have the same meaning mentioned above.
Example:
decompress.exe /home/Desktop/DATA/avian/chr1_96_s.fasta.chapao/ f
This will also create the decompressed file named "Decompressed.txt" inside the same folder.
Links of several MSA datasets, which are used in the development and testing of this technique, can be found here.
Abdullah Aman Tutul
Sifat Muhammad Abdullah