
fix: handle TRF processing for sequences fewer than total processes #46

Open · wants to merge 5 commits into base: master
Conversation

Bluetea577

We welcome feedback and issue reporting for all bioBakery tools through our Discourse site. For users who would like to contribute directly to the tools, we are happy to field PRs that address bug fixes. Please note that the turnaround time on our end may be a bit long, but that does not mean we don't value the contribution! We currently don't accept PRs that add new functionality to tools, but we would be happy to receive your feedback on Discourse.

Also, we will make sure to attribute your contribution in our User’s manual(README.md) and in any associated paper Acknowledgements.

Description

Fix TRF parallel processing for scenarios with fewer sequences than configured processes.

In the current implementation, when the number of input sequences is smaller than the number of configured processes:

  • Empty temporary files are created
  • TRF execution fails
  • Parallel processing breaks for small datasets

Modified process allocation logic in kneaddata/trf_parallel.py (line 69) to:

  • Dynamically adjust processes based on actual sequence count
  • Prevent creation of unnecessary empty files
  • Ensure TRF can be executed correctly for small input sequences

Example log showing low sequence count:

kneaddata.utilities - INFO: READ COUNT: trimmed pair1 : Total reads after trimming ( /path/to/.trimmed.1.fastq ): 25542560.0
kneaddata.utilities - INFO: READ COUNT: trimmed pair2 : Total reads after trimming ( /path/to/.trimmed.2.fastq ): 25542560.0
kneaddata.utilities - INFO: READ COUNT: trimmed orphan1 : Total reads after trimming ( /path/to/.trimmed.single.1.fastq ): 226.0
kneaddata.utilities - INFO: READ COUNT: trimmed orphan2 : Total reads after trimming ( /path/to/.trimmed.single.2.fastq ): 7.0

After the modification, TRF successfully processes the 7 orphan2 sequences, which previously failed when split across 16 processes (nproc=16).
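The arithmetic behind the failure is straightforward: a naive even split of 7 sequences across 16 temporary files assigns zero sequences per file and leaves most files empty (a sketch of the failure mode, not the actual kneaddata code):

```python
nproc = 16
total_sequences = 7

# Each temporary file is supposed to receive an even share...
per_file = total_sequences // nproc        # 0 sequences per file
# ...so at minimum, this many files end up with no sequences at all,
# and TRF fails when invoked on the empty inputs.
empty_files = nproc - total_sequences      # 9
```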


@Bluetea577 (Author)

To ensure efficient resource utilization and avoid generating empty temporary files, the allocation rule was changed from nproc = min(nproc, total_sequences) to nproc = nproc if total_sequences > nproc*10 else 1. Parallel splitting is now used only when each process would receive more than ten sequences; otherwise a single process handles the input, which makes better use of resources and prevents unnecessary file segmentation.
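The updated rule can be sketched as a small helper (the function name is illustrative; in kneaddata the expression is applied inline):

```python
def choose_nproc(nproc, total_sequences):
    # Parallel TRF only pays off when every process receives a
    # sizeable chunk (> 10 sequences each); below that threshold a
    # single process avoids creating empty temporary files.
    return nproc if total_sequences > nproc * 10 else 1

choose_nproc(16, 7)         # 7 <= 160, so fall back to one process
choose_nproc(16, 25542560)  # large input keeps all 16 processes
```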

@Bluetea577 (Author)

Working through the arithmetic, I found that with 64 processes the largest sequence count that can still trigger the error is 4032. I therefore modified the splitting step (line 120) to split according to the per-process quota computed on line 112, and used the condition output_file_number < int(nproc) - 1 to allocate the remainder to the last file. This ensures that small inputs are handled correctly regardless of the number of processes. Committed as "fix: optimize trf file splitting".
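The remainder-to-last-file strategy can be sketched as follows (record handling simplified and the function name illustrative; this is not the actual kneaddata code):

```python
def split_for_trf(records, nproc):
    """Split records into at most nproc evenly sized chunks, rolling
    any remainder into the last chunk so no chunk is ever empty."""
    seqs_per_file = max(1, len(records) // nproc)
    chunks = [[]]
    seqs_written = 0
    for record in records:
        # Mirror of `output_file_number < int(nproc) - 1`: start a new
        # chunk only while we have not yet reached the last allowed one,
        # so the remainder accumulates in the final chunk.
        if seqs_written >= seqs_per_file and len(chunks) < nproc:
            chunks.append([])
            seqs_written = 0
        chunks[-1].append(record)
        seqs_written += 1
    return chunks
```

With 10 records and 4 processes this yields chunk sizes 2, 2, 2, 4: the remainder lands in the last chunk instead of spawning extra or empty files.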

@Bluetea577 (Author)

Bluetea577 commented Feb 13, 2025

I found that when lines_per_file is odd, some files are allocated one line more than others, which can break record boundaries and leave some files empty. To avoid this, I changed the split condition if lines_written >= seqs_per_file and output_file_number < int(nproc) - 1 to count sequences rather than lines, replacing lines_per_file with seqs_per_file, so the check can no longer disagree with record boundaries. Committed as "fix: Use sequences count to prevent inconsistent judgment logic".
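The sequence-count condition can be sketched on 4-line FASTQ records (the record size and function name are assumptions for illustration; this is not the actual kneaddata code):

```python
LINES_PER_RECORD = 4  # assumption: 4-line FASTQ records

def split_by_sequence_count(lines, nproc):
    """Decide file breaks by counting whole records rather than raw
    lines, so an odd line quota can never place a break inside a
    record or leave a file empty."""
    total_seqs = len(lines) // LINES_PER_RECORD
    seqs_per_file = max(1, total_seqs // nproc)
    files = [[]]
    seqs_written = 0
    for i in range(0, len(lines), LINES_PER_RECORD):
        if seqs_written >= seqs_per_file and len(files) < nproc:
            files.append([])
            seqs_written = 0
        # Copy the whole record at once; boundaries stay aligned.
        files[-1].extend(lines[i:i + LINES_PER_RECORD])
        seqs_written += 1
    return files
```

Every output file then holds a whole number of records, regardless of whether the equivalent line quota would have been odd.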
