Note: this tool is available as a Docker container, so the conda-based installation steps below can be skipped.
To get the image, run:
docker pull mjfos2r/tdfps-designer
To launch an interactive shell inside the container, run:
docker run -v $(pwd)/data:/data -it mjfos2r/tdfps-designer
Mount a host directory at /data (here, ./data) so that files written inside the container persist on the host. Paths used in the commands below must be adjusted to point into the mounted volume, /data.
Once the container is running, skip the installation steps and follow the usage instructions that come after them.
TDFPS-Designer runs on Linux systems with a CUDA-capable GPU. Its main functions are as follows:
- Select sequences from the whole k-mer space, or from a given sequence space, to serve as barcodes. The DTW distance between the nanopore signals corresponding to any two selected barcode sequences is greater than a given threshold.
- Demultiplex multi-sample sequencing data generated on Oxford Nanopore Technologies (ONT) platforms. The required inputs are the nanopore signals (format: txt), the barcode sequences, the adapter sequence, and the length of the flanking sequence.
Our experiments show that TDFPS-Designer can customize barcode kits for users and outperforms current state-of-the-art demultiplexing tools on these kits, improving demultiplexing accuracy by approximately 30% for some barcodes.
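The selection criterion in the first function rests on dynamic time warping (DTW). The sketch below is a minimal pure-Python DTW with an absolute-difference cost, followed by a greedy pass that keeps a signal only if its distance to every already-kept signal exceeds a threshold. It is an illustration under stated assumptions, not TDFPS-Designer's CUDA implementation: the names `dtw_distance` and `select_by_threshold`, the cost function, the greedy order, and the toy signals are all hypothetical.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with |x - y| as the local cost."""
    inf = float("inf")
    # prev holds row i-1 of the cumulative cost matrix; column 0 is the boundary.
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of: insertion, deletion, match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def select_by_threshold(signals, threshold):
    """Greedy selection (assumed order: input order).

    Returns the indices of signals whose pairwise DTW distance
    to every previously kept signal is greater than `threshold`.
    """
    kept = []
    for idx, sig in enumerate(signals):
        if all(dtw_distance(sig, signals[k]) > threshold for k in kept):
            kept.append(idx)
    return kept


# Toy signals, made up for illustration only.
toy_signals = [
    [0, 0, 1, 1],
    [0, 1, 1],      # a time-warped copy of the first signal
    [5, 5, 6, 6],
    [5, 6],         # a time-warped copy of the third signal
]
print(select_by_threshold(toy_signals, threshold=10))
```

Raising the threshold makes the kept signals easier to tell apart but shrinks the resulting set, which is the trade-off the `--threshold` option described later controls.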
To run TDFPS-Designer, users must first install 'conda' according to the following two steps:
- Download Anaconda3
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2021.11-Linux-x86_64.sh
- Install conda
bash Anaconda3-2021.11-Linux-x86_64.sh
Once 'conda' is installed, TDFPS-Designer can be deployed with the commands below. It requires five main packages: 'scipy', 'numpy', 'h5py', 'hdf5' and 'pandas'. For convenience, an environment file ('TDFPS_Designer.yaml') is provided so that users can create the environment directly with conda:
conda env create -f TDFPS_Designer.yaml
pip install edlib
pip install ont-fast5-api
pip install pod5
git clone https://github.com/junhaiqi/TDFPSDesigner.git
cd TDFPSDesigner/slow5lib
python3 -m pip install .
If necessary, you can recompile the CUDA program:
bash compile.sh
We provide two modes for designing barcodes: 'k-mer' mode and 'fasta' mode. In 'k-mer' mode, sequences that differ sufficiently from each other are selected from the entire k-mer space to form a barcode set. Users can run the following command:
python selectBarcodeSeq.py --length 10 \
--qsize 10000 \
--outdir test_select_kmer \
--threshold 10 \
--thread-num 8 \
--mode kmer \
--seed 15 \
--training-precison-cutoff 0.95 \
--kit dna-r10-min
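For intuition about `--length`, `--qsize`, and `--seed`: the k-mer space for length 10 holds 4^10 = 1,048,576 sequences, and `--qsize` limits how many of them form the initial candidate pool. The sketch below shows one way such a seeded sample could be drawn. `index_to_kmer` and `sample_kmers` are hypothetical names, and the tool's actual sampling procedure may differ.

```python
import random

BASES = "ACGT"


def index_to_kmer(index, k):
    """Decode an integer in [0, 4**k) into a k-mer (base-4 digits, most significant first)."""
    chars = []
    for _ in range(k):
        index, digit = divmod(index, 4)
        chars.append(BASES[digit])
    return "".join(reversed(chars))


def sample_kmers(k, qsize, seed):
    """Draw `qsize` distinct k-mers uniformly at random, reproducibly for a given seed."""
    space = 4 ** k
    if qsize > space:
        raise ValueError(f"qsize {qsize} exceeds the k-mer space size {space}")
    rng = random.Random(seed)
    # Sampling indices from a range avoids materialising the whole k-mer space.
    return [index_to_kmer(i, k) for i in rng.sample(range(space), qsize)]


pool = sample_kmers(k=10, qsize=10000, seed=15)
print(len(pool), pool[:3])
```

Re-running with the same seed reproduces the same pool, while a different seed gives a different starting pool.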
In 'fasta' mode, barcode sequences are selected from the sequences contained in a fasta file. Users can run the following command:
python selectBarcodeSeq.py --length 10 \
--qsize 10000 \
--outdir test_select_fasta \
--threshold 10 \
--thread-num 8 \
--mode fasta \
--fasta test_select_kmer/first_selected_barcodes.fa \
--seed 15 \
--training-precison-cutoff 0.95 \
--kit dna-r10-min
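Because the file passed to `--fasta` must contain sequences of equal length, it can help to check a file before running the command. The sketch below uses only the standard library. `read_fasta` and `check_equal_length` are hypothetical helpers and are not part of TDFPS-Designer.

```python
def read_fasta(lines):
    """Parse FASTA lines into a {header: sequence} dict (multi-line records allowed).

    Note: records with duplicate headers overwrite each other in this sketch.
    """
    records, header = {}, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def check_equal_length(records):
    """Return the common sequence length, or raise if the lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


# Toy records for illustration; for a real file, pass an open file handle:
#   with open("test_select_kmer/first_selected_barcodes.fa") as fh:
#       records = read_fasta(fh)
example = [">bc1", "ACGTACGTAC", ">bc2", "TTGCATGCAA"]
print(check_equal_length(read_fasta(example)))
```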
The parameter information about 'selectBarcodeSeq.py' is as follows:
-h, --help show this help message and exit
--length LENGTH Specify the length of the designed barcode.
--qsize QSIZE Specify the size of the initially selected sequence space, which is recommended to be more than 100000.
--outdir OUTDIR Specify the output folder, which contains the final barcode sequences.
--seed SEED Specify a random seed to determine the initially selected barcode signal; it has a slight impact on the size of the final barcode set.
--threshold THRESHOLD
Specify a value to control the threshold of the TDFPS algorithm, the recommended value is 0~30.
--thread-num THREAD_NUM
Specify the number of threads.
--mode {kmer,fasta} Specify the selection mode. If the mode is "fasta", then --fasta must be followed by a file in fasta format.
--fasta FASTA Specify a file (format: fasta). The sequences contained in this file must be of the same length. This parameter is required when "--mode" is "fasta"; in other cases, it has no effect.
--kit {dna-r9-min,dna-r9-prom,dna-r10-min,dna-r10-prom}
Specify ONT sequencing kit.
--adapter-seq ADAPTER_SEQ
Specify the ONT adapter sequence used when selecting barcodes again.
--top-flank-seq TOP_FLANK_SEQ
Specify the ONT top flanking sequence used when selecting barcodes again.
--bottom-flank-seq BOTTOM_FLANK_SEQ
Specify the ONT bottom flanking sequence used when selecting barcodes again.
--training-num-each-barcode TRAINING_NUM_EACH_BARCODE
Specify the training number per barcode used when selecting barcodes again.
--training-precison-cutoff TRAINING_PRECISON_CUTOFF
Specify the training precision cutoff used when selecting barcodes again.
--training-recall-cutoff TRAINING_RECALL_CUTOFF
Specify the training recall cutoff used when selecting barcodes again.
--training-f1Score-cutoff TRAINING_F1SCORE_CUTOFF
Specify the training F1-score cutoff used when selecting barcodes again.
--bio-criteria Filter out sequences based on biological criteria: sequences with a GC content lower than 0.4 or greater than 0.6, sequences containing repeated triples, sequences containing GGC, and self-complementary sequences.
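The `--bio-criteria` rules can be written down as a small predicate. The sketch below is one reading of the description above, not the tool's own code, and two interpretations in it are assumptions: a "repeated triple" is taken to mean three identical consecutive bases (such as AAA), and "self-complementary" is taken to mean that a sequence equals its own reverse complement. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_base_triple(seq):
    """Assumed meaning of 'repeated triples': AAA, CCC, GGG or TTT occurs."""
    return any(base * 3 in seq for base in "ACGT")


def is_self_complementary(seq):
    """Assumed meaning: the sequence equals its reverse complement."""
    return seq == seq.translate(COMPLEMENT)[::-1]


def passes_bio_criteria(seq):
    """Return True if the sequence survives all four filters."""
    seq = seq.upper()
    if not seq:
        return False
    return (
        0.4 <= gc_content(seq) <= 0.6
        and not has_base_triple(seq)
        and "GGC" not in seq
        and not is_self_complementary(seq)
    )


# Toy candidates for illustration only.
for candidate in ["ACGTACGTAC", "AAATTCGCGA", "ACGGCATTCA", "ATATATATAT"]:
    print(candidate, passes_bio_criteria(candidate))
```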
Given a folder containing nanopore signals, the barcode sequences, the adapter sequence, and the length of the flanking sequence in the barcode sequence, TDFPS-Designer can complete the whole demultiplexing process. The following is an example command line:
python demultiplexingByNanoporeSinal.py \
--iAF testData/testAdapter.fasta \
--iBF testData/testBarcode.fasta \
--iNS testData/testSigSet \
--iFL 8 \
--oRes test_dem \
--thread-num 8
The parameter information about 'demultiplexingByNanoporeSinal.py' is as follows:
-h, --help show this help message and exit
--iAF IAF Specify an input fasta file, which contains an adapter sequence.
--iBF IBF Specify an input fasta file, which contains barcode sequences with flanking sequence.
--iNS INS Specify an input folder, which contains nanopore signals (.txt) to be demultiplexed.
--oRes ORES Specify an output folder, which contains the results of the demultiplexing.
--iFL IFL Specify the length of the flanking sequence in the barcode sequence. Specify it when the top and tail flanking sequences have the same length, for better detection of the barcode fragment in the nanopore signal.
--kit {dna-r9-min,dna-r9-prom,dna-r10-min,dna-r10-prom}
Specify ONT sequencing kit.
--thread-num THREAD_NUM
Specify the number of threads, which affects the speed of extracting barcode signals.
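One reading of `--iFL` is that each record in the `--iBF` file consists of a top flank, the barcode itself, and a tail flank, with both flanks `--iFL` bases long. Under that assumption, the sketch below splits such a record into its three parts. `split_flanks` is a hypothetical helper, and the example sequence is made up for illustration.

```python
def split_flanks(seq, flank_len):
    """Split a flanked barcode into (top flank, core barcode, tail flank).

    Assumes both flanks have the same length `flank_len`.
    """
    if flank_len < 0 or 2 * flank_len >= len(seq):
        raise ValueError("flank length leaves no barcode core")
    top = seq[:flank_len]
    core = seq[flank_len:len(seq) - flank_len]
    tail = seq[len(seq) - flank_len:]
    return top, core, tail


# Made-up record: 8-base flanks around a 10-base core, matching --iFL 8.
record = "AAGGTTAA" + "ACGTACGTAC" + "CAGCACCT"
print(split_flanks(record, flank_len=8))
```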
If necessary, we can filter barcodes again based on edit distance. The following is a specific command line:
python biSelectBaseEditDistance.py --fasta-file test_Fasta_mode.txt --edit-dist 12 --out-file test_again.fasta
The parameter information about 'biSelectBaseEditDistance.py' is as follows:
--fasta-file FASTA_FILE
It is the fasta file that contains short sequence fragments.
--edit-dist EDIT_DIST
It is an edit distance threshold.
--out-file OUT_FILE It is a fasta file to store the selected barcode sequence.
--thread-num THREAD_NUM
It is the number of threads used to execute the task.
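The edit-distance filter can be pictured as a greedy pass that keeps a sequence only if it is far enough from every sequence kept so far. The sketch below uses a pure-Python Levenshtein distance so that it runs without dependencies; with `edlib` installed, `edlib.align(a, b)["editDistance"]` returns the same quantity. Whether the comparison against `--edit-dist` is strict, and the order in which sequences are visited, are assumptions here, so biSelectBaseEditDistance.py may behave differently. Both function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def filter_by_edit_distance(seqs, min_dist):
    """Keep sequences at edit distance >= min_dist from all kept ones (input order)."""
    kept = []
    for seq in seqs:
        if all(edit_distance(seq, other) >= min_dist for other in kept):
            kept.append(seq)
    return kept


# Toy barcodes for illustration only.
barcodes = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
print(filter_by_edit_distance(barcodes, min_dist=3))
```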
ONT currently provides two signal file formats, POD5 and Fast5. The input of our algorithm is a signal folder that contains txt files of signal data. We provide a simple script (processFast5Pod5.py) that converts the signal data in POD5/Fast5 files into such a folder, so that users can demultiplex with our tool.
If you have a folder (example: pod5_test) containing files in pod5 format, you can convert the signal data to 'out_pod5_test' using the following command:
python processFast5Pod5.py POD5 pod5_test out_pod5_test
If you have a folder containing files in fast5 format (example: fast5_test), you can convert the signal data to 'out_fast5_test' using the following command:
python processFast5Pod5.py Fast5 fast5_test out_fast5_test
We provide two shell scripts that give examples of designing barcodes and demultiplexing with TDFPS-Designer. Users can run the following script to select barcodes in the 10-mer sequence space and then select barcodes based on 'test_select_kmer/first_selected_barcodes.fa':
bash runSelectBarcodeSeq_exmple.sh
In addition, users can run the following script to complete the demultiplexing of 'testData/testSigSet':
bash runDemultiplexingByNanoporeSinal_example.sh
All the data used to test TDFPS-Designer, along with its output files, are in the folder 'testData'. In addition, the folder 'tempoutput' contains intermediate files output by TDFPS-Designer (txt files containing DTW distance matrices).
In the manuscript (TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing), we have introduced all datasets used for evaluating TDFPS-Designer in detail. Users can obtain all the datasets through the following link: https://pan.baidu.com/s/1kFyXBekwkvAw-RbWlN9C1g?pwd=hycl (password: hycl)
In the manuscript (TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing), we finally designed 137 (20bp), 410 (24bp) and 1779 (30bp) barcodes (barcode_kits/*_selected.fa). These barcodes are derived from an already designed initial kit (barcode_kits/*_barcodes.fa). The barcodes in the initial kit ensure the difference in signal (DTW distance describes the difference). The command line for the final kit generation is in "barcode_kits/ex_cmd.sh". It should be noted that the generation of the initial kit and the final kit is automated, and the results generated each time are slightly different because the generation of simulated signals and the initial design of the barcode involve random operations.
Qi, J., Li, Z., Zhang, Yz. et al. TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing. Genome Biol 25, 285 (2024). https://doi.org/10.1186/s13059-024-03423-3