No bins generated, bug or feature? #28

Open
Joon-Klaps opened this issue Nov 30, 2023 · 2 comments

Comments

@Joon-Klaps

Joon-Klaps commented Nov 30, 2023

I've been running vRhyme on some of my test data (SRR11140750-test.zip), and it doesn't generate any output bins; some other output files are missing as well.

I'm curious why this is. If vRhyme doesn't find any bins, is that because every sequence represents a distinct viral genome/segment? (In that case I would have expected bins containing a single sequence each.) If so, it would be good to have a warning stating that no sequences were binned. Or is this kind of output unintentional?

Thanks in advance!

Docker container used: quay.io/biocontainers/vrhyme:1.1.0--pyhdfd78af_1
Command used:

vRhyme \
    -i SRR11140750.fa \
    -r SRR11140750_host.unmapped_1.fastq.gz SRR11140750_host.unmapped_2.fastq.gz \
    -o SRR11140750 \
    -t 4 \
    --verbose

Output structure:

$ tree SRR11140750
SRR11140750
├── log_vRhyme_paired_reads.tsv
├── log_vRhyme_SRR11140750.log
├── SRR11140750.circular.tsv
├── vRhyme_bam_files
│   └── SRR11140750_host.unmapped_1.sorted.bam
└── vRhyme_coverage_files
    ├── SRR11140750_host.unmapped_1.coverage.tsv
    ├── vRhyme_coverage_values.tsv
    └── vRhyme_names.txt
2 directories, 7 files

Log file:

Command:  /usr/local/bin/vRhyme -i SRR11140750.fa -r SRR11140750_host.unmapped_1.fastq.gz SRR11140750_host.unmapped_2.fastq.gz -o SRR11140750 -t 4 --verbose

Date:     2023-11-30 (y-m-d)
Start:    17:35:51   (h:m:s)
Program:  vRhyme v1.1.0


Time (min) |  Log                                                   
--------------------------------------------------------------------
0.0           Initializing and validating vRhyme parameters
0.01          Paired end read file(s) identified. Running bowtie2 on 1 set of paired files
              Caution: vRhyme performs optimally with 3+ samples
0.11          Extracting coverage information from BAM files
0.14          Coverage extraction complete. Generating coverage table
0.14          Performing pairwise coverage comparisons
0.14          vRhyme binning complete

Memory usage:       0.18
Runtime (min):      0.14
Bins generated:     0
Binned sequences:   0 (0%)
Input sequences:    42
Binned proteins:    0
Redundant proteins: 0 (0%)
Best iteration:     none
vRhyme score:       none
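
The warning requested in this issue could also be produced on the pipeline side by reading the run summary. The following is a minimal sketch, assuming the log keeps the `Key: value` summary layout shown above for v1.1.0. `parse_vrhyme_summary` and `zero_bin_warning` are hypothetical helper names and are not part of vRhyme.

```python
import re

# Summary block copied from the v1.1.0 log shown above.
EXAMPLE_LOG = """\
Memory usage:       0.18
Runtime (min):      0.14
Bins generated:     0
Binned sequences:   0 (0%)
Input sequences:    42
Binned proteins:    0
Redundant proteins: 0 (0%)
Best iteration:     none
vRhyme score:       none
"""

# A summary line starts at column 0 with a label, then a colon, then a value.
SUMMARY_LINE = re.compile(r"^([A-Za-z][A-Za-z ()]*?):\s+(.*\S)\s*$")


def parse_vrhyme_summary(log_text):
    """Collect 'Key: value' pairs from a vRhyme log (assumed layout)."""
    summary = {}
    for line in log_text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            summary[match.group(1).strip()] = match.group(2)
    return summary


def zero_bin_warning(summary):
    """Return a warning string when no bins were generated, else None."""
    bins = int(summary["Bins generated"])
    inputs = int(summary["Input sequences"])
    if bins == 0:
        return (
            f"Warning: vRhyme generated no bins from {inputs} input "
            "sequences; singleton bins are not written by default."
        )
    return None


if __name__ == "__main__":
    message = zero_bin_warning(parse_vrhyme_summary(EXAMPLE_LOG))
    if message:
        print(message)
```

In a pipeline, `EXAMPLE_LOG` would be replaced by the contents of the run's log file, for example `SRR11140750/log_vRhyme_SRR11140750.log` from the output tree above.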

Output test:

 Python Dependencies
  -------------------
  scikit-learn: Success (v1.2.2)
  numpy: Success (v1.23.5)
  numba: Success (v0.56.4)
  pandas: Success (v2.0.0)
  pysam: Success (v0.21.0)
  networkx: Success (v3.1)


  Program Dependencies
  --------------------
  mmseqs: Success
  samtools: Success
  prodigal: Success
  mash: Success
> nucmer: Not Found! Optional
  bowtie2: Success
  bwa: Success


  Machine Learning Models
  -----------------------
  NN model: Success
  ET model: Success

*Edit: typo

@KrisKieft
Member

By default, vRhyme does not generate any singleton bins. Any sequence that is not binned is either not a virus, a single virus, a fragment without sufficient information to bin, or a sequence that vRhyme failed to bin in error. There are many possible reasons. Are you binning viral sequences or a mix of viral and non-viral? You only have 42 input sequences and 1 sample, so I'd assume there is simply too little information to go on.

@Joon-Klaps
Author

Joon-Klaps commented Apr 4, 2024

Hi @KrisKieft, thanks for the response! These results are from a test dataset containing exclusively viral genomes (all complete covid genomes or fragments of the genome). Depth is low overall, never exceeding 5x. Would it be better if I provided vRhyme with a list of BAM files, each containing only one sequence and all the reads mapped to it? Maybe I'm not entirely following the concept of samples, as it seems counterintuitive to me to determine coverage covariance between sequencing run 1 and sequencing run 2 if run 1 comes from patient A and run 2 comes from patient B.

How can I best feed data to vRhyme from a whole-pipeline perspective (reads -> contigs -> vRhyme), where input samples might not always be related?
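
To make the "samples" question concrete, here is a small illustration of why a single sample carries no co-variation signal. It is only a sketch: it uses a plain Pearson correlation as a stand-in, which is an assumption and not vRhyme's actual pairwise coverage comparison, and the coverage values are invented toy numbers, not taken from SRR11140750.

```python
import math


def pearson(x, y):
    """Pearson correlation of two coverage profiles, or None if undefined."""
    if len(x) != len(y):
        raise ValueError("coverage profiles must cover the same samples")
    n = len(x)
    if n < 2:
        return None  # one sample: a single point has no co-variation
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # flat profile: correlation is undefined
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical mean coverage per contig, one value per sample (toy numbers).
contig_a = [2.0, 10.0, 4.0]
contig_b = [1.0, 5.0, 2.0]   # rises and falls together with contig_a
contig_c = [9.0, 1.0, 6.0]   # moves in the opposite direction

if __name__ == "__main__":
    print("a vs b, 3 samples:", pearson(contig_a, contig_b))
    print("a vs c, 3 samples:", pearson(contig_a, contig_c))
    # With one sample there is only one value per contig to compare.
    print("a vs b, 1 sample:", pearson(contig_a[:1], contig_b[:1]))
```

With three samples, contigs whose coverage rises and falls together can be told apart from ones that do not; with one sample the comparison has nothing to work with, which is consistent with the "3+ samples" caution in the log above.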
