Preprint out on MedRxiv: Using viral genomics to estimate undetected infections and extent of superspreading events for COVID-19.
- The number of missing infectious is an important parameter to estimate because it provides information about the scale of the epidemic, which in turn affects resource allocation. Also, it affects how controllable the epidemic is and for how long the epidemic will go on.
- Incidence or prevalence time series are generally the lower bound of the total number of infections due to under-reporting. Viral phylogenies provide can fill in this knowledge gap because viruses continue to evolve as they spread through asymptomatic human populations.
Inference uses the EpiGenMCMC program. See also Li et al. (2017) for more details.
Genomic data are manually downloaded from GISAID and stored in a file named ‘data/sequences/gisaid_cov2020_sequences.fasta’. The accompanying metadata is also manually downloaded and stored in ‘data/sequences/gisaid_cov2020_acknowledgement_table.xls’.
Time series data are from Johns Hopkins github page.
Input: data from Johns Hopkins GitHub repo.
Output: data/time_series/CSSE/*.csv, data/sequences/gisaid_metadata.tsv
./00_download_data.sh
Input: data from Johns Hopkins GitHub repo at data/time_series/CSSE/*.csv, early WHO sitrep (data/time_series/WHO_sitreps_20200121-20200122.tsv), data from Li et al. (2020) NEJM (data/time_series/li2020nejm_wuhan_incidence.tsv), data/sequences/gisaid_metadata.tsv
Output: data/timeseries/summary_{region}timeseries{cumulative|new_cases}.tsv, data/timeseries/timeseries_new_cases.tsv, data/timeseries/timeseries_cumulative_cases.tsv
Rscript 01_curate_time_series.R
Input: data/sequences/gisaid_cov2020_sequences.fasta, data/sequences/gisaid_metadata.tsv
Output: msa/{region}/input_mafft_{analysisid}.fasta, msa/{region}/input_mafft_{analysisid}.fasta
{analysisid} refers to the date of the last data pull.
Rscript 02_filter_seq.R
Input: msa/{region}/input_mafft_{analysisid}.fasta
Output: msa/{region}/msa_{analysisid}.fasta
find msa -maxdepth 1 -mindepth 1 -type d | parallel -j 8 ./03_multi_sequence_alignment.sh {}
Input: msa/{region}/msa_mafft_{analysisid}.fasta
Output: tree/iqtree_{analysisid}.{bionj|boottrees|ckp.gz|contree|iqtree|log|mldist|model.gz|treefile}
find msa -maxdepth 1 -mindepth 1 -type d | sed "s/msa/tree/" | parallel -j 8 ./04_build_ml_tree.sh {}
Compute dates of branching events using TreeTime
Input: tree/{region}/iqtree_{analysisid}.treefile
Output: tree/{region}/iqtree_{analysisid}dates.tsv, tree/{region}/treetime{analysisid}/{ancestral_sequences.fasta|dates.tsv|divergence_tree.nexus|divergence_tree.nexus|molecular_clock.txt|root_to_tip_regression.pdf|sequence_evolution_model.txt|timetree.nexus|timetree.pdf}
ls tree/*/*treefile | xargs -I{} dirname {} | parallel ./05_treetime.sh {}
Tips with divergence more than expected for the sampling time were excluded.
Generate a grid of parameter combinations to determine which starting parameters to use for the EpiGenMCMC algorithm.
In total, 100 different parameter combinations are tested for each country, using just the time series data, using just the phylogenetic data, and using both time series and phylogenetic data.
EpiGenMCMC estimates parameter values for each parameter combination and for each of the 3 time series dataset.
Input: tree/{region}/treetime_{analysisid}/{timetree.nexus|dates.tsv}, data/timeseries/summary_{region}_timeseries_new_cases.tsv,
Output: - epigenmcmc_results/covid19_{analysisid}_commands - this file contains all the bash commands for running EpiGenMCMC
For each analysis, the R Script generates these files: - prefix: epigenmcmc_results/{region}/covid19_*/{region}{both|epi|gen}{runid}_
- suffix: epi_data.txt, gen_data.txt, init_states.txt, mcmc_options.txt, params.txt
Rscript 06_create_EpiGenMCMC_inputs.R
Input: epigenmcmc_results/covid19_{analysisid}commands Output: epigenmcmc/{region}/covid19{analysisid}/*logfile.txt
./epigenmcmc_results/covid19_{analysisid}/commands
Use the initial grid search to set the input parameter values.
Input: epigenmcmc/{region}/covid19_{analysisid}/{logfile}.txt Output: epigenmcmc/{region}/covid19_{analysisid}/inference_commands, epigenmcmc/{region}/covid19_{analysisid}/inference_.txt
Rscript 07_summarize_initial_param_search.R
Input: epigenmcmc/{region}/covid19_{analysisid}/inference_commands Output: epigenmcmc/{region}/covid19_{analysisid}/inference_*_{logfile|trajfile}.txt
./epigenmcmc/covid19_{analysisid}/inference_commands
Summarize MCMC results.
Input: epigenmcmc/{region}/covid19_{analysisid}/inference_*_{logfile|trajfile}.txt Output: epigenmcmc/{figures|tables|files}
Rscript 08_sumarize_results.R
Rscript 09_params_to_simulate.R