
fix: handle TRF processing for sequences fewer than total processes #46

Open · wants to merge 5 commits into base: master
Conversation

Bluetea577

We welcome feedback and issue reporting for all bioBakery tools through our Discourse site. For users who would like to contribute directly to the tools, we are happy to field PRs that address bug fixes. Please note that the turnaround time on our end may be a bit long, but that does not mean we don't value the contribution! We currently don't accept PRs that add new functionality to tools, but we would be happy to receive your feedback on Discourse.

Also, we will make sure to attribute your contribution in our User’s manual(README.md) and in any associated paper Acknowledgements.

Description

Fix TRF parallel processing for scenarios with fewer sequences than configured processes.

In the current implementation, when the number of input sequences is smaller than the number of configured processes:

  • Empty temporary files are created
  • TRF execution fails
  • Parallel processing breaks for small datasets

Modified process allocation logic in kneaddata/trf_parallel.py (line 69) to:

  • Dynamically adjust processes based on actual sequence count
  • Prevent creation of unnecessary empty files
  • Ensure TRF can be executed correctly for small input sequences

Example log showing low sequence count:

kneaddata.utilities - INFO: READ COUNT: trimmed pair1 : Total reads after trimming ( /path/to/.trimmed.1.fastq ): 25542560.0
kneaddata.utilities - INFO: READ COUNT: trimmed pair2 : Total reads after trimming ( /path/to/.trimmed.2.fastq ): 25542560.0
kneaddata.utilities - INFO: READ COUNT: trimmed orphan1 : Total reads after trimming ( /path/to/.trimmed.single.1.fastq ): 226.0
kneaddata.utilities - INFO: READ COUNT: trimmed orphan2 : Total reads after trimming ( /path/to/.trimmed.single.2.fastq ): 7.0

After the modification, TRF successfully processes the 7 orphan2 sequences, which previously failed when split across 16 processes (nproc=16).
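The arithmetic behind the failure is straightforward: a naive even split of 7 sequences across 16 temporary files assigns zero sequences per file and leaves most files empty (a sketch of the failure mode, not the actual kneaddata code):

```python
nproc = 16
total_sequences = 7

# Each temporary file is supposed to receive an even share...
per_file = total_sequences // nproc        # 0 sequences per file
# ...so at minimum, this many files end up with no sequences at all,
# and TRF fails when invoked on the empty inputs.
empty_files = nproc - total_sequences      # 9
```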


@Bluetea577 (Author)

To ensure efficient resource utilization and avoid generating empty temporary files, the allocation rule was changed from nproc = min(nproc, total_sequences) to nproc = nproc if total_sequences > nproc*10 else 1. Parallel splitting is now used only when each process would receive more than ten sequences; otherwise a single process handles the input, which makes better use of resources and prevents unnecessary file segmentation.
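The updated rule can be sketched as a small helper (the function name is illustrative; in kneaddata the expression is applied inline):

```python
def choose_nproc(nproc, total_sequences):
    # Parallel TRF only pays off when every process receives a
    # sizeable chunk (> 10 sequences each); below that threshold a
    # single process avoids creating empty temporary files.
    return nproc if total_sequences > nproc * 10 else 1

choose_nproc(16, 7)         # 7 <= 160, so fall back to one process
choose_nproc(16, 25542560)  # large input keeps all 16 processes
```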

@Bluetea577 (Author)

Working through the arithmetic, I found that with 64 processes the largest sequence count that can still trigger the error is 4032. I therefore modified the splitting step (line 120) to split according to the per-process quota computed on line 112, and used the condition output_file_number < int(nproc) - 1 to allocate the remainder to the last file. This ensures that small inputs are handled correctly regardless of the number of processes. Committed as "fix: optimize trf file splitting".
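The remainder-to-last-file strategy can be sketched as follows (record handling simplified and the function name illustrative; this is not the actual kneaddata code):

```python
def split_for_trf(records, nproc):
    """Split records into at most nproc evenly sized chunks, rolling
    any remainder into the last chunk so no chunk is ever empty."""
    seqs_per_file = max(1, len(records) // nproc)
    chunks = [[]]
    seqs_written = 0
    for record in records:
        # Mirror of `output_file_number < int(nproc) - 1`: start a new
        # chunk only while we have not yet reached the last allowed one,
        # so the remainder accumulates in the final chunk.
        if seqs_written >= seqs_per_file and len(chunks) < nproc:
            chunks.append([])
            seqs_written = 0
        chunks[-1].append(record)
        seqs_written += 1
    return chunks
```

With 10 records and 4 processes this yields chunk sizes 2, 2, 2, 4: the remainder lands in the last chunk instead of spawning extra or empty files.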

@Bluetea577 (Author)

Bluetea577 commented Feb 13, 2025

I found that when lines_per_file is odd, some files are allocated one line more than others, which can break record boundaries and leave some files empty. To avoid this, I changed the split condition if lines_written >= seqs_per_file and output_file_number < int(nproc) - 1 to count sequences rather than lines, replacing lines_per_file with seqs_per_file, so the check can no longer disagree with record boundaries. Committed as "fix: Use sequences count to prevent inconsistent judgment logic".
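The sequence-count condition can be sketched on 4-line FASTQ records (the record size and function name are assumptions for illustration; this is not the actual kneaddata code):

```python
LINES_PER_RECORD = 4  # assumption: 4-line FASTQ records

def split_by_sequence_count(lines, nproc):
    """Decide file breaks by counting whole records rather than raw
    lines, so an odd line quota can never place a break inside a
    record or leave a file empty."""
    total_seqs = len(lines) // LINES_PER_RECORD
    seqs_per_file = max(1, total_seqs // nproc)
    files = [[]]
    seqs_written = 0
    for i in range(0, len(lines), LINES_PER_RECORD):
        if seqs_written >= seqs_per_file and len(files) < nproc:
            files.append([])
            seqs_written = 0
        # Copy the whole record at once; boundaries stay aligned.
        files[-1].extend(lines[i:i + LINES_PER_RECORD])
        seqs_written += 1
    return files
```

Every output file then holds a whole number of records, regardless of whether the equivalent line quota would have been odd.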
