Skip to content

Latest commit

 

History

History
51 lines (44 loc) · 2.73 KB

USAGE.md

File metadata and controls

51 lines (44 loc) · 2.73 KB

A Snakemake pipeline (map_to_ref.sm) was implemented to map TRs extracted from the assemblies to the referernce genome (hg38/T2T-chm13)

Software requirements:

  • Snakemake - for running the pipeline
  • TRF - for extracting TRs from the assembly sequences
  • minimap2 - for mapping the assembly to the reference
  • bwa mem - for aligning the flanking sequences of all TRs to the reference
  • samtools - for generating the BAM file of flanking sequence alignments
  • Python - the scripting language of all pipeline scripts
Input Description
assembly FASTA (asm_fa) Haploid scaffold sequences of diploid assembly (indexed by samtools faidx)
TRF result (trf_out) .dat output of TRF
assembly mappings (mappings_tsv) minimap2's mapping of assembly to reference
reference FASTA (ref_fa) Reference genome fasta (indexed by samtools faidx)

Snakemake's config.yaml (residing in the same directory of map_to_ref.sm):

scripts_dir: /locaton/pipeline/directory/
ref_name: "hg38"
reference: /path/of/ref_fa
asm=<asm_fa> trf=<trf_out> snakemake -s map_to_ref.sm -c <num_cores>
  • The prefix of trf_out is used as prefix of all output files generated downstream. We used <sample>.<hap> as the prefix in our pipeline i.e. trf_out must be renamed to prefix.dat before starting the pipeline
  • The folder of trf_out is assumed to the output directory (out_dir)
  • mappings_tsv is generated by running minimap2 and coverting the resulting .PAF to TSV
    minimap2 <ref_fa> <asm_fa> -x asm5 -t 42 > <prefix>.paf
    python screen_scaffolds.py <prefix>.paf <prefix>.mappings.tsv
    
  • mappings_tsv must be geneated and reside in out_dir before running map_to_ref.sm

Output: <prefix>.<ref_name>.sanitized.tsv

Column Description
1 chromosome
2 start coordinate of TR
3 end coordinate of TR
4 motif of TR
5 assembly origin: :-:
size TR size (bp)
copy number size/length(motif)
reference size TR size in reference genome (bp)
sample label prefix (.) as described above

A test dataset is generated using the paternal haplotype seqeunce of the HPRC sample HG01361 that corresponds to the neighbourhood of the first exon of HTT and the output files generated from our pipeline are provided(HG01361_HTT.tar.gz)