A question regarding native assembly tool prior to scaffolding with ARCS #175

Open
goors-syntezza opened this issue Nov 20, 2024 · 5 comments

@goors-syntezza

Hi All,

I'm not sure this is the right place to ask, but hopefully one of you can advise me. I'm working on assembling a diploid plant genome with an estimated size of ~9 Gbp, from over one billion 100 bp paired-end reads with an insert size of about 900 bp (stLFR reads).

I tried to perform the initial assembly with SOAPdenovo2, but it failed (as far as I can tell, the tool cannot handle this amount of data). I also tried SPAdes (latest version), but it crashed very quickly after filling almost 2 TB of disk workspace (the machine has 750 GB of RAM). As for Celera ('WGS genome assembler'), the documentation is scarce.

Could you kindly offer some insight, and perhaps suggest a tool or strategy for completing this initial assembly stage?

Thank you in advance,
Goor
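
For context on the numbers in the post above, the expected sequencing depth can be worked out as total sequenced bases divided by genome size. This is only a back-of-the-envelope sketch: the post does not say whether "one billion reads" counts individual reads or read pairs, so both interpretations are shown, and the function name is illustrative.

```python
def estimated_coverage(num_reads, read_length_bp, genome_size_bp):
    """Expected depth = total sequenced bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp

# Figures from the post: ~1e9 reads of 100 bp, ~9 Gbp genome.
# Assumption: "one billion reads" may mean single reads or read pairs.
as_reads = estimated_coverage(1_000_000_000, 100, 9e9)
as_pairs = estimated_coverage(2 * 1_000_000_000, 100, 9e9)

print(f"{as_reads:.1f}x if counting reads, {as_pairs:.1f}x if counting pairs")
# prints "11.1x if counting reads, 22.2x if counting pairs"
```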

@lcoombe
Member

lcoombe commented Nov 20, 2024

Hi Goor,

For the initial de novo short read assembly, have you given ABySS (https://github.com/bcgsc/abyss) a try? ABySS uses a Bloom filter de Bruijn graph-based approach, which keeps RAM usage low. We have assembled short read datasets for multiple 20 Gbp spruce genomes using ABySS without any issues.
ABySS won't utilize any linked read information, but you could use it to get your initial assembly, and then run Tigmint/ARCS to correct and scaffold the baseline assembly using the linked read information.

Hope that helps!
Lauren

@goors-syntezza
Author

Hi Lauren,

Thank you for your detailed answer. As a test, I was able to assemble the reads with ABySS for an organism with a 1.5 Gbp genome.

After converting my stLFR reads to BX:Z format, I applied the following syntax:

abyss-pe v=-v k=25 j=7 name=test B=50G in='s1_r1.fq.gz s1_r2.fq.gz' scaffolds

Next I wanted to continue with Tigmint, so I used the syntax:

tigmint-make --trace tigmint metrics draft=test-contigs reads="s1_r1 s1_r2"

For some reason, when I looked at the Tigmint logs, I saw that all reads are treated as single-ended:

[M::process] 808742 single-end sequences; 0 paired-end sequences.

I guess I'm doing something wrong, but I'm not sure as to what.

I would be happy to get an idea as to what is wrong.

Thank you in advance,
Goor

@lcoombe
Member

lcoombe commented Dec 22, 2024

Hi Goor,

For Tigmint, you need to have your input linked reads in a single, interleaved file, and supply that filename as indicated in the usage page: https://github.com/bcgsc/tigmint?tab=readme-ov-file#usage

For interleaving R1 and R2 short read files, I'd suggest using seqtk mergepe - it's a fast and very useful utility.

Hope that helps!
Lauren
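
To illustrate what "a single, interleaved file" means in the reply above, here is a minimal Python sketch of the interleaving step. It is not how seqtk mergepe is implemented; it only shows the expected layout (each R1 record immediately followed by its R2 mate) and adds a name check. The helper names are hypothetical, and the script assumes plain four-line FASTQ records.

```python
import gzip
from itertools import islice, zip_longest

def open_text(path):
    """Open a plain or gzip-compressed file for reading text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")

def fastq_records(handle):
    """Yield FASTQ records as lists of four newline-terminated lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield [line.rstrip("\n") + "\n" for line in record]

def read_name(header):
    """Read name without '@', the comment, or a trailing /1 or /2."""
    name = header[1:].split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name

def interleave(r1_path, r2_path, out_path):
    """Write R1 and R2 records alternately; return the number of pairs."""
    pairs = 0
    with open_text(r1_path) as r1, open_text(r2_path) as r2, open(out_path, "w") as out:
        for rec1, rec2 in zip_longest(fastq_records(r1), fastq_records(r2)):
            if rec1 is None or rec2 is None:
                raise ValueError("R1 and R2 contain different numbers of reads")
            if read_name(rec1[0]) != read_name(rec2[0]):
                raise ValueError(f"name mismatch: {rec1[0].strip()} vs {rec2[0].strip()}")
            out.writelines(rec1 + rec2)
            pairs += 1
    return pairs
```

A call such as `interleave("s1_r1.fq.gz", "s1_r2.fq.gz", "s1.fq")` would write the pairs to a hypothetical `s1.fq`. The "single-end sequences; 0 paired-end sequences" message quoted earlier is consistent with the aligner not finding mates next to each other in one file, which is what this layout provides.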

@goors-syntezza
Author

Hi Lauren,

Thank you for your reply. I used seqtk mergepe as you suggested, then ran Tigmint followed by ARCS. When I compared the assembly statistics of the original ABySS contigs with those of the ARCS output, they were roughly the same. Since there was no improvement, I assume I did something wrong along the way.

My main suspicion is that something went wrong with the barcodes. To the best of my understanding, once the stLFR barcodes have been converted into the BX header format, there is no need to run the LongRanger Basic command. Was I wrong?

Thank you in advance,
Goor
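
For reference, the header conversion described above can be sketched as follows. The stLFR read-name layout used here (`@<read id>#<n>_<n>_<n>/<1 or 2>`, optionally followed by extra fields) is an assumption about the input and should be checked against the actual files; the function name is hypothetical. The target layout is a `BX:Z:` tag in the FASTQ header comment, separated from the read name by whitespace.

```python
import re

# Assumed stLFR header layout: "@<read id>#<n>_<n>_<n>/<mate>" plus optional extra fields.
STLFR_NAME = re.compile(
    r"^@(?P<read>\S+?)#(?P<barcode>\d+_\d+_\d+)(?:/[12])?(?:\s.*)?$"
)

def stlfr_header_to_bx(header):
    """Rewrite one stLFR FASTQ header so the barcode sits in a BX:Z: comment tag."""
    match = STLFR_NAME.match(header.rstrip("\n"))
    if match is None:
        raise ValueError(f"header does not match the assumed stLFR layout: {header!r}")
    # The mate suffix and any trailing fields are dropped; both mates share one name.
    return f"@{match['read']} BX:Z:{match['barcode']}"

print(stlfr_header_to_bx("@V300L1C001R0010000001#123_456_789/1"))
# prints "@V300L1C001R0010000001 BX:Z:123_456_789"
```

If the converted headers carry the barcode anywhere other than a whitespace-separated `BX:Z:` comment (for example, still glued to the read name), downstream tools may not see it.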

@warrenlr
Collaborator

Hi Goor,

It appears that the barcode information isn’t being captured, though I can’t be 100% certain. I strongly recommend taking a subset of reads and inspecting their barcodes for consistency.

If it helps, carefully reviewing and following each step and data transformation outlined in the provided demo could provide valuable insights.

Unfortunately, we’re currently short-staffed and will be for the foreseeable future, so I apologize for not being able to offer additional support at this time.

Best regards,
Rene
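
One way to carry out the suggested inspection of a read subset is sketched below. It scans the first pairs of an interleaved FASTQ and reports how many pairs lack a `BX:Z:` tag, how many have mates with different barcodes, and which barcodes are most common. The function name, the default sample size, and the returned keys are illustrative choices, not part of Tigmint or ARCS.

```python
import gzip
import re
from collections import Counter
from itertools import islice

# Matches a BX:Z: tag that is its own whitespace-separated token in the header.
BX_TAG = re.compile(r"\bBX:Z:(\S+)")

def open_text(path):
    """Open a plain or gzip-compressed file for reading text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")

def barcode_summary(interleaved_path, max_pairs=100_000):
    """Summarise BX:Z: tags over the first max_pairs pairs of an interleaved FASTQ."""
    counts = Counter()
    stats = {"pairs": 0, "missing": 0, "mate_mismatch": 0}
    with open_text(interleaved_path) as handle:
        while stats["pairs"] < max_pairs:
            block = list(islice(handle, 8))  # two 4-line records = one pair
            if len(block) < 8:
                break
            tags = [BX_TAG.search(block[i]) for i in (0, 4)]
            stats["pairs"] += 1
            if not all(tags):
                stats["missing"] += 1
            elif tags[0].group(1) != tags[1].group(1):
                stats["mate_mismatch"] += 1
            else:
                counts[tags[0].group(1)] += 1
    stats["distinct_barcodes"] = len(counts)
    stats["top"] = counts.most_common(5)
    return stats
```

A high `missing` count, or nearly as many distinct barcodes as pairs, would both be consistent with the barcode information not reaching the scaffolder, which would fit the unchanged statistics reported above.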
