Scripts used for the study "Cervical and anal HPV whole-genome sequencing reveals high variations in APOBEC3-induced mutations among HPV risk categories in a Togolese key-population"
This repository contains all the scripts that were developed to produce the work "Cervical and anal HPV whole-genome sequencing reveals high variations in APOBEC3-induced mutations among HPV risk categories in a Togolese key-population". You can contact the author if you need some helps to execute or adapt the scripts for your own projects.
Large metadata can be shared on demand. In the data/ subdirectory, you will find the GFF annotation file for most HPV types, the list of HPVs (and corresponding GenBank accession numbers) detected in our samples, the list of HPV reference sequences, and the multiple sequence alignments from GenBank sequences for HPV types 6, 11, 16 and 18.
To use the scripts, you need the following programs/packages:
- R version 4.0.4
- List of R packages: data.table, dplyr, ggplot2, stringr
Before studying the genetic diversity of HPV types, we first applied a common bioinformatics pipeline to produce pileup files (which contain the exhaustive list of bases observed for each position along the genome).
bwa mem hpv_ref_genomes.fasta sample.R1.fastq.gz sample.R2.fastq.gz > sample.sam
samtools view -b -S sample.sam > sample.bam
samtools sort sample.bam -o sample.sorted.bam
samtools index sample.sorted.bam
samtools mpileup -a -f reference_genome.fasta file_sorted.bam > file_sorted.pileup
where sample
is the name of the sample.
For samtools mpileup
, -d 1000000
indicates that we consider a maximum of one million reads for each position of the genome.
To explore the genetic diversity of HPV types and per risk category, you developed the script genetic_diversity_study.R. The script uses as inputs some pileup files and both the HPV_ref_type.txt and reference_observed.txt files. The analysis takes from a few minutes to hours, depending on the number of samples investigated). When the analysis is finished, two plots are generated:
- the total number of mutations per HPV type
- the total number of mutations per risk category
In a second step, the algorithm reads the HPV_annotation.gff file to perform the analysis at the gene level. After a few minutes, two other plots are produced:
- the ratio(c>t) per gene and per risk category
- a focus of the ratio(c>t) on the E6 gene for low and high risk categories
To explore APOBEC3-induced mutations among HPV types and risk categories using our samples, we developed the scripts apobec_genome_study.R (genome level) and apobec_gene_study.R (gene level).
At the genome level, the pipeline is very similar to the one that evaluates the genetic diversity. The major difference is that we focus on C>T mutations, both in TCW and non-TCW motifs. Systematically, we calculate the ratio(c>t) to identify genome sequences that are enriched in APOBEC3-induced mutations rather than random mutations (we suppose here a ratio(c>t) >= 2 as significant). Two plots are produced after a few minutes:
- the ratio(c>t) according to each HPV type
- the ratio(c>t) according to the risk category
At the gene level, a part of the script must be executed for each gene investigated (in the deposited script, we focus on the E1 gene as an example). The pipeline is then the same than the one used at the genome level. Once the analysis is done for each gene (E1, E2, E4, E6, E7, L1 and L2), all the results are combined and a plot that shows the distribution of samples with APOBEC3-induced mutations among genes is produced.
To explore APOBEC3-induced mutations among HPV types 6, 11, 16 and 18, we developed the scripts apobec_genome_genbank.R (genome level) and apobec_gene_genbank.R (gene level).
At the genone level, the script uses one multiple sequence alignment as input (in the deposited script, we focus on HPV 6 as an example, i.e. the hpv6_aligned_clean.fasta file that can be found in the data/ subdirectory). The pipeline counts the number of C>T mutations in the TCW motif. Once the analysis is done for each type investigated (here, HPV types 6, 11, 16 and 18), 100 replicates comparing the number of APOBEC3-induced mutations among a random draw of 50 HPV sequences per type were performed. The script calculates the number of replicates where a signicant difference between two types (in the deposited script, HPVs 6 vs. 16 as an example) is observed. To illustrate the result, a plot based on the last replicate showing the number of APOBEC3-induced mutations for each HPV type is produced.
At the gene level, the strategy is similar to the one used at the genome level. A part of the script must be executed for each gene investigated (in the deposited script, we focus on the E1 and E2 genes as an example). The pipeline is then the same than the one used at the genome level. The algorithm counts, for a given HPV type, the proportion of sequences on a gene that has a ratio(c>t) >= 2, suggesting that the gene is enriched in APOBEC3-induced mutations.
If you use or adapt some scripts for your own work, please cite:
In preparation.