Count unique reads from multiple samples efficiently.
dupcnt deduplicates reads in two steps.
- Pick out reads that exactly match the HG19/HG38 reference using the FM-index of BWA-MEM2 (~16 GB). A 32-bit integer array counts read occurrences at each reference position, requiring ~24 GB for the human genome (see the first sketch after this list).
- A fixed set of AVL trees is used to find duplicates among the remaining unmatched reads. Reads within a tree share the same first 6 bases, which allows parallelism across the 4096 (4^6) trees. Reads are 2-bit encoded and inserted as nodes into the balanced search trees, each node occupying 40 + (read_length / 4) bytes. For 6 billion reads of length 100 bp (~200X coverage), this needs < 400 GB of memory; since reads without an exact match account for < 50 % of all reads, a mid-range server with 256 GB of RAM is sufficient. Unlike the previously used trie, which could not guarantee correctness within the memory limit, this approach yields exactly accurate results (see the second sketch after this list).
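To make step 1 concrete, here is a minimal C++ sketch of the per-position counting idea. `NaiveIndex` is a placeholder that stands in for the BWA-MEM2 FM-index (plain substring search, so the sketch is self-contained); it is not dupcnt's actual code.

```cpp
// Sketch of step 1: count occurrences of exactly matching reads per
// reference position. NaiveIndex stands in for the BWA-MEM2 FM-index.
#include <cstdint>
#include <string>
#include <vector>

struct NaiveIndex {
    std::string ref;  // reference sequence
    // Return the position of a full-length exact match (first hit),
    // or -1 if the read does not occur in the reference.
    int64_t exact_match(const std::string& read) const {
        size_t pos = ref.find(read);
        return pos == std::string::npos ? -1 : static_cast<int64_t>(pos);
    }
};

class ExactMatchCounter {
public:
    // 4 bytes per reference position, e.g. ~24 GB for the human genome.
    explicit ExactMatchCounter(size_t ref_len) : counts_(ref_len, 0) {}

    // Returns true if the read matched exactly and was counted;
    // unmatched reads are left for the tree stage (step 2).
    bool count(const NaiveIndex& idx, const std::string& read) {
        int64_t pos = idx.exact_match(read);
        if (pos < 0) return false;
        ++counts_[static_cast<size_t>(pos)];
        return true;
    }

    // With a fixed read length, each non-zero position corresponds to one
    // distinct exactly matching read.
    uint64_t distinct_exact_reads() const {
        uint64_t n = 0;
        for (uint32_t c : counts_) n += (c > 0);
        return n;
    }

private:
    std::vector<uint32_t> counts_;  // 32-bit occurrence counter per position
};
```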
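A second sketch illustrates the data layout of step 2: unmatched reads are bucketed by their first 6 bases into 4^6 = 4096 independent trees (buckets can be filled in parallel) and stored 2-bit encoded. For brevity, `std::set` (a red-black tree) stands in for dupcnt's hand-written AVL trees; the real node layout and memory accounting are not reproduced here.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// 2-bit code for a base; non-ACGT characters fall back to 0 ('A').
static uint8_t base_code(char b) {
    switch (b) {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;
    }
}

// Pack a read at 2 bits per base, i.e. roughly read_length / 4 bytes.
static std::vector<uint8_t> encode2bit(const std::string& read) {
    std::vector<uint8_t> enc((read.size() + 3) / 4, 0);
    for (size_t i = 0; i < read.size(); ++i)
        enc[i / 4] |= static_cast<uint8_t>(base_code(read[i]) << ((i % 4) * 2));
    return enc;
}

class TreeDeduplicator {
public:
    TreeDeduplicator() : trees_(4096) {}  // one tree per 6-base prefix (4^6)

    // Insert a read that had no exact reference match; returns true if the
    // read has not been seen before (i.e. a new unique read).
    bool insert(const std::string& read) {
        size_t bucket = 0;
        for (size_t i = 0; i < 6 && i < read.size(); ++i)
            bucket = bucket * 4 + base_code(read[i]);
        return trees_[bucket].insert(encode2bit(read)).second;
    }

    // Total number of distinct unmatched reads across all buckets.
    uint64_t unique_reads() const {
        uint64_t n = 0;
        for (const auto& t : trees_) n += t.size();
        return n;
    }

private:
    // Buckets are independent, so different threads can fill different trees.
    std::vector<std::set<std::vector<uint8_t>>> trees_;
};
```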
dupcnt [options] Sample_1 ... Sample_n
Options
- -i  FM-index (BWA-MEM2 index; use mem2idx to create it if not available)
- -t  Number of threads (16 by default)
- -k  Input batch size (10M)
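A hypothetical invocation using the options above (the index path and sample file names are placeholders; samples are assumed to be FASTQ files):

```
dupcnt -i hg38_index -t 32 sample_1.fq.gz sample_2.fq.gz
```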
Overall, dupcnt runs as fast as the data can be read in, i.e., its throughput is bounded by input I/O.
After each sample is processed, dupcnt prints summary statistics. Note that Number of reads, Exactly matched reads, and Unique reads are cumulative over all samples processed so far.
1 sample(s) processed
Time profile(s): input 3061.82; Match 1465.94; Trie 268.45
Number of reads: 863941088
Exactly matched reads: 693426503 (80.26 %)
Unique reads: 779264875 (90.20 %)
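The percentages are relative to the cumulative Number of reads; for example, 693426503 / 863941088 ≈ 80.26 % exactly matched and 779264875 / 863941088 ≈ 90.20 % unique.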