Scripts used for the study "Cervical and anal HPV whole-genome sequencing reveals high variations in APOBEC3-induced mutations among HPV risk categories in a Togolese key-population"

This repository contains all the scripts that were developed to produce the work "Cervical and anal HPV whole-genome sequencing reveals high variations in APOBEC3-induced mutations among HPV risk categories in a Togolese key-population". You can contact the author if you need some helps to execute or adapt the scripts for your own projects.
Large metadata can be shared on demand. In the data/ subdirectory, you will find the GFF annotation file for most HPV types, the list of HPVs (and corresponding GenBank accession numbers) detected in our samples, the list of HPV reference sequences, and the multiple sequence alignments from GenBank sequences for HPV types 6, 11, 16 and 18.

1. Prerequisites

To use the scripts, you need the following programs/packages:

- R version 4.0.4
- List of R packages: data.table, dplyr, ggplot2, stringr

2. Preparing the data

Before studying the genetic diversity of HPV types, we first applied a common bioinformatics pipeline to produce pileup files (which contain the exhaustive list of bases observed for each position along the genome).

bwa mem hpv_ref_genomes.fasta sample.R1.fastq.gz sample.R2.fastq.gz > sample.sam

samtools view -b -S sample.sam > sample.bam

samtools sort sample.bam -o sample.sorted.bam

samtools index sample.sorted.bam

samtools mpileup -a -f reference_genome.fasta file_sorted.bam > file_sorted.pileup

where sample is the name of the sample.
For samtools mpileup, -d 1000000 indicates that we consider a maximum of one million reads for each position of the genome.

3. Genetic diversity of HPV types

To explore the genetic diversity of HPV types and per risk category, you developed the script genetic_diversity_study.R. The script uses as inputs some pileup files and both the HPV_ref_type.txt and reference_observed.txt files. The analysis takes from a few minutes to hours, depending on the number of samples investigated). When the analysis is finished, two plots are generated:
- the total number of mutations per HPV type
- the total number of mutations per risk category

In a second step, the algorithm reads the HPV_annotation.gff file to perform the analysis at the gene level. After a few minutes, two other plots are produced:
- the ratio(c>t) per gene and per risk category
- a focus of the ratio(c>t) on the E6 gene for low and high risk categories

4. APOBEC3-induced mutations in HPV types at the genome and gene levels using our samples

To explore APOBEC3-induced mutations among HPV types and risk categories using our samples, we developed the scripts apobec_genome_study.R (genome level) and apobec_gene_study.R (gene level).

At the genome level, the pipeline is very similar to the one that evaluates the genetic diversity. The major difference is that we focus on C>T mutations, both in TCW and non-TCW motifs. Systematically, we calculate the ratio(c>t) to identify genome sequences that are enriched in APOBEC3-induced mutations rather than random mutations (we suppose here a ratio(c>t) >= 2 as significant). Two plots are produced after a few minutes:
- the ratio(c>t) according to each HPV type
- the ratio(c>t) according to the risk category

At the gene level, a part of the script must be executed for each gene investigated (in the deposited script, we focus on the E1 gene as an example). The pipeline is then the same than the one used at the genome level. Once the analysis is done for each gene (E1, E2, E4, E6, E7, L1 and L2), all the results are combined and a plot that shows the distribution of samples with APOBEC3-induced mutations among genes is produced.

5. APOBEC3-induced mutations in HPV types at the genome and gene levels from GenBank sequences

To explore APOBEC3-induced mutations among HPV types 6, 11, 16 and 18, we developed the scripts apobec_genome_genbank.R (genome level) and apobec_gene_genbank.R (gene level).

At the genone level, the script uses one multiple sequence alignment as input (in the deposited script, we focus on HPV 6 as an example, i.e. the hpv6_aligned_clean.fasta file that can be found in the data/ subdirectory). The pipeline counts the number of C>T mutations in the TCW motif. Once the analysis is done for each type investigated (here, HPV types 6, 11, 16 and 18), 100 replicates comparing the number of APOBEC3-induced mutations among a random draw of 50 HPV sequences per type were performed. The script calculates the number of replicates where a signicant difference between two types (in the deposited script, HPVs 6 vs. 16 as an example) is observed. To illustrate the result, a plot based on the last replicate showing the number of APOBEC3-induced mutations for each HPV type is produced.

At the gene level, the strategy is similar to the one used at the genome level. A part of the script must be executed for each gene investigated (in the deposited script, we focus on the E1 and E2 genes as an example). The pipeline is then the same than the one used at the genome level. The algorithm counts, for a given HPV type, the proportion of sequences on a gene that has a ratio(c>t) >= 2, suggesting that the gene is enriched in APOBEC3-induced mutations.

6. Citation

If you use or adapt some scripts for your own work, please cite:

In preparation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Scripts used for the study "Cervical and anal HPV whole-genome sequencing reveals high variations in APOBEC3-induced mutations among HPV risk categories in a Togolese key-population"

1. Prerequisites

2. Preparing the data

3. Genetic diversity of HPV types

4. APOBEC3-induced mutations in HPV types at the genome and gene levels using our samples

5. APOBEC3-induced mutations in HPV types at the genome and gene levels from GenBank sequences

6. Citation

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
data		data
LICENSE		LICENSE
README.md		README.md
apobec_gene_genbank.R		apobec_gene_genbank.R
apobec_gene_study.R		apobec_gene_study.R
apobec_genome_genbank.R		apobec_genome_genbank.R
apobec_genome_study.R		apobec_genome_study.R
genetic_diversity_study.R		genetic_diversity_study.R

License

Rcoppee/HPV_genetic_diversity

Folders and files

Latest commit

History

Repository files navigation

Scripts used for the study "Cervical and anal HPV whole-genome sequencing reveals high variations in APOBEC3-induced mutations among HPV risk categories in a Togolese key-population"

1. Prerequisites

2. Preparing the data

3. Genetic diversity of HPV types

4. APOBEC3-induced mutations in HPV types at the genome and gene levels using our samples

5. APOBEC3-induced mutations in HPV types at the genome and gene levels from GenBank sequences

6. Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages