A Snakemake pipeline (map_to_ref.sm
) was implemented to map TRs extracted from the assemblies to the referernce genome (hg38/T2T-chm13)
Software requirements:
- Snakemake - for running the pipeline
- TRF - for extracting TRs from the assembly sequences
- minimap2 - for mapping the assembly to the reference
- bwa mem - for aligning the flanking sequences of all TRs to the reference
- samtools - for generating the BAM file of flanking sequence alignments
- Python - the scripting language of all pipeline scripts
Input | Description |
---|---|
assembly FASTA (asm_fa ) |
Haploid scaffold sequences of diploid assembly (indexed by samtools faidx ) |
TRF result (trf_out ) |
.dat output of TRF |
assembly mappings (mappings_tsv ) |
minimap2's mapping of assembly to reference |
reference FASTA (ref_fa ) |
Reference genome fasta (indexed by samtools faidx ) |
Snakemake's config.yaml
(residing in the same directory of map_to_ref.sm
):
scripts_dir: /locaton/pipeline/directory/
ref_name: "hg38"
reference: /path/of/ref_fa
asm=<asm_fa> trf=<trf_out> snakemake -s map_to_ref.sm -c <num_cores>
- The prefix of
trf_out
is used as prefix of all output files generated downstream. We used<sample>.<hap>
as the prefix in our pipeline i.e.trf_out
must be renamed toprefix.dat
before starting the pipeline - The folder of
trf_out
is assumed to the output directory (out_dir
) mappings_tsv
is generated by runningminimap2
and coverting the resulting.PAF
to TSVminimap2 <ref_fa> <asm_fa> -x asm5 -t 42 > <prefix>.paf python screen_scaffolds.py <prefix>.paf <prefix>.mappings.tsv
mappings_tsv
must be geneated and reside inout_dir
before runningmap_to_ref.sm
Output: <prefix>.<ref_name>.sanitized.tsv
Column | Description |
---|---|
1 | chromosome |
2 | start coordinate of TR |
3 | end coordinate of TR |
4 | motif of TR |
5 | assembly origin: :-: |
size | TR size (bp) |
copy number | size/length(motif) |
reference size | TR size in reference genome (bp) |
sample label | prefix (.) as described above |
A test dataset is generated using the paternal haplotype seqeunce of the HPRC sample HG01361 that corresponds to the neighbourhood of the first exon of HTT and the output files generated from our pipeline are provided(HG01361_HTT.tar.gz
)