An algorithm for producing the whole genome sequence (.fasta format) for Plasmodium ovale species.
Prior to use this pipeline, we suppose you produced a sorted.bam file at this step (for example using Samtools view and sort).
The algorithm was written for Plasmodium ovale species, but can also be used for other Plasmodium species and other organisms.
To use this pipeline, you need the following programs:
- Perl 5
- Python 3.x
- Samtools 1.4
The first step consists in the production of a pileup file using Samtools mpileup that calculates for each position across the genome the depth coverage, the quality and content of the reads:
samtools mpileup -a -f reference_genome.fasta file_sorted.bam > file_sorted.pileup
where reference_genome.fasta is the genome of interest in .fasta format (provided with the parameter -f
, and file_sorted.bam is the sorted bam file previously produced with samtools sort function. The -a
parameter allows to consider positions that were not covered.
The second step allows to formate a table that indicate the number of A, T, G, C and indels for each position of the genome. For that, we developed a perl script, named extract_data_pileup.pl, that takes the previously produced pileup file as an input (option -p
).
perl extract_data_pileup.pl -p file_sorted.pileup -o ./
The algorithm will produce a table, named data_from_pileup.tsv. The table is then simplified to retain only the chrosomosomes, the genomic positions and the major bases (that we named here simplified_data.tsv), using the cut function in Linux. Importantly, we used the IUPAC code to produce the major allele, so other letters than A, T, G and C may appear as the result of mixed alleles.
cut data_from_pileup.tsv -f 1,2,8 > simplified_data.tsv
where data_from_pileup.tsv is the file previously produced, and the -f
option allows to retain columns 1, 2 and 8 of the file.
From the simplified simplified_data.tsv table, we then just concatenated positions to form a fasta sequence. At the end of the program, the total length of the sequence and the number of lacking positions and nucleotides are indicated.
python3 extract_fasta_pileup.py -f simplified_data.tsv -o out.fasta
where -f
is the input file (simplified_data.tsv) and -o
corresponds to the output fasta file (out.fasta).
If you use this pipeline for your own work, please cite:
Development and Optimization of a Selective Whole-Genome Amplification To Study Plasmodium ovale Spp.
DOI: https://doi.org/10.1128/spectrum.00726-22