Performance comparison with Rust #40
Dear @jonathanBieler, I've spent a lot of time optimizing this library and looking at where the bottlenecks are. The library is very, very efficient. It's hard, no matter the language, to get meaningful (say, 2x) performance improvements on it. Every comparison with other libraries I've seen so far tends to reveal that FASTX does much more rigorous input validation. Nonetheless, let's not pat our backs too hard here. There are still many inefficiencies. No big ones, since those would have been optimized away by now, but lots of small ones that perhaps each add 5% to the running time, and accumulate. Let me try and break it down.

### Direct and indirect libraries in use

There are several places where inefficiencies hide in your code:
For a baseline, I timed your Rust and Julia code against a 1.1 GB FASTQ file on my computer. Rust is timed by taking the minimum time reported by
So 83% slower. Where does that come from?

### Your code
Changing your code to operate on views of the underlying data avoids those allocations. I'll skip BioSequences and FASTX, because both of these are highly optimized; I don't think large gains are to be had there.

### Automa

This is where the actual parsing happens. Automa is insanely optimized, but there are still losses in a few sections:
One problem is that the original author of Automa has been unreachable for six months or so.

### CodecZlib

In my experience, CodecZlib tends to be a bit slower than other gzip implementations. I'm not sure why, though, since it basically just calls a C function directly. I haven't dug too deeply into that. It must be either a problem with the compilation of the C code, or with basic inefficiencies in Julia's IO.

### Basic IO functions

I think quite a bit of time is wasted in the buffering library underneath both Automa and CodecZlib, namely TranscodingStreams.jl. This is a little harder to analyse, but in practical usage, it appears to add a layer of indirection which is not 100% optimized.
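The cost of stacked stream layers can be made concrete with a small Rust sketch (illustrative only; `PassThrough` is a hypothetical adapter, not part of TranscodingStreams.jl): each wrapper adds at least one extra function call, and possibly extra buffering logic, on every read.

```rust
use std::io::{Cursor, Read, Result};

/// A pass-through adapter: does nothing but forward reads.
/// Each layer like this adds an extra call per read -- the kind
/// of indirection a buffering/decompression stack accumulates.
struct PassThrough<R: Read> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for PassThrough<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

fn main() -> Result<()> {
    let data = Cursor::new(vec![b'A'; 1 << 16]);
    // Stack two adapters, as a decompressor over a buffered file would.
    let mut stream = PassThrough {
        inner: PassThrough { inner: data, calls: 0 },
        calls: 0,
    };
    let mut out = Vec::new();
    stream.read_to_end(&mut out)?;
    println!("{} bytes through {} outer read calls", out.len(), stream.calls);
    Ok(())
}
```

Each layer is cheap in isolation; the point is that the calls multiply with every wrapper in the stack.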
Furthermore, it seems that a few Base Julia IO functions are not really 100% optimized either. The losses here are on the margins, but hey, ten places with a 5% loss each make a library over 50% slower.

### Base Julia

Finally, Base Julia is not 100% optimized. Again, it's always small things where one could reasonably argue that this doesn't actually matter. Here are a few examples:
### Conclusion

You can optimize your code to make it faster. And SIMD capabilities could be added to Automa. This cuts the performance difference down to ~30% relative to Rust. As usual, performance dies by a thousand cuts. When TranscodingStreams.jl was implemented, I bet no one thought people would notice, or care about, a 5% performance loss. The Julia core devs certainly thought using

In contrast, if you take the view that creating a single unnecessary branch or instruction in a low-level function is unacceptable, that attitude compounds across the entire stack, through all the packages. And then you're no longer looking at a few percent here and there, but at differences of tens of percent, or even nearing 100%.
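The "thousand cuts" effect is easy to quantify: small per-layer losses compound multiplicatively, so ten layers that each cost 5% end up costing roughly 60% overall (a back-of-the-envelope check, not a measurement from this thread):

```rust
fn main() {
    // Ten independent layers, each adding 5% to the running time,
    // compound multiplicatively rather than additively.
    let additive = 1.0 + 10.0 * 0.05; // naive estimate: 1.50x
    let compounded = 1.05f64.powi(10); // actual: about 1.63x
    println!("additive: {:.2}x, compounded: {:.2}x", additive, compounded);
    assert!(compounded > additive);
}
```

The gap between the two estimates widens quickly as more layers are added, which is why per-layer losses that look negligible in isolation still matter.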
Oh, and to answer your question: No, I don't think it makes sense to create a bare-bones API. There are no fundamental design mistakes in FASTX or BioSequences that make them inefficient. It's downstream packages, and tiny inefficiencies scattered across the entire ecosystem. We should be able to make the existing packages fast; there's no use creating new ones.
Thanks for the very nice answer.
As far as I can tell That said, my example is a bit artificial (although some QC stats are pretty close to this); I'm not sure it would be super useful in practice. Also, I've been using this package and others on rather large datasets and they do the job just fine, but sometimes it seems like tools written in C++ are faster. Last question: I was thinking of maybe doing the same thing for BAM files; do you know if such a comparison has been done already?
That's a good point, we could implement I do think these small benchmarks, like the one you've posted here, are quite useful. They help keep development focused on performance. Without them, we could convince ourselves that parsing FASTQ at 250 MB/s was excellent performance. Right now, I don't think FASTX and its underlying functionality are too far off from being the fastest accurate parser around. Maybe we are 50% slower. Those 50% still matter and should be worked on.

Speed is not the most important aspect of BioJulia, of course. Its main purpose is to be a Swiss Army knife that you can always reach for. So flexibility, modularity and integration with other packages are more important. But we do still need to be fast enough that people won't re-implement BioJulia functionality just to make it faster.

I've done comparisons with BAM myself, but I've not seen other people do them. XAM.jl is in a sad state of affairs. It's not spec-compliant and has a number of sub-optimal design decisions. You'll see I have quite a few issues and PRs on the repo. But it turned out to be a little hard to fix, because once I started fixing some issues, I ended up rewriting large parts of the package, and then I confused myself and became unfocused. I'm still unsure whether it's best to rewrite it all from scratch, or to make a long series of breaking PRs and a new release. One issue I did work on was the performance of XAM.jl's underlying BGZF de/compression library, so I created LibDeflate.jl and CodecBGZF.jl. But XAM.jl hasn't yet migrated to CodecBGZF.jl. That should improve its performance significantly.
Closing as resolved. You're welcome to comment to re-open if you disagree.
I was wondering how fast this package is, so I made a small comparison with Rust's fastq library. On v1.6, for FASTQ files Rust is about 3-4 times faster, while for fastq.gz it's only 1.6x faster (I guess decompression becomes more of a bottleneck). I think I've used every trick to make the Julia code as fast as possible; have I missed something?
While I think the performance is pretty good (1.6x doesn't matter that much), improvements are always welcome. It seems Julia's version is still allocating a lot of memory (1.07 M allocations: 225.188 MiB, 0.11% gc time, for a 100 MB fastq.gz file). To me it seems the Rust library is not really validating or converting anything; it just loads the data into memory and goes through it. While having nicely converted records is nice, maybe a more "bare-bones" API could be provided for when performance is really crucial. What do you think?
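To make the "bare-bones" idea concrete, here is a minimal sketch of that style of processing: load everything into memory and walk the FASTQ records without validating or converting anything. This is hypothetical demonstration code (it assumes exactly four lines per record), not the API of the Rust fastq library or of FASTX:

```rust
/// Count records and sequence bases in an in-memory FASTQ buffer.
/// Nothing is validated or copied: every line is a borrowed slice
/// of the input, matching the "bare-bones" approach described above.
fn count_records_and_bases(data: &[u8]) -> (usize, usize) {
    let mut lines = data.split(|&b| b == b'\n').filter(|l| !l.is_empty());
    let (mut records, mut bases) = (0, 0);
    // A record is assumed to be four lines: header, sequence, '+', quality.
    while let Some(_header) = lines.next() {
        if let Some(seq) = lines.next() {
            records += 1;
            bases += seq.len();
        }
        lines.next(); // '+' separator line
        lines.next(); // quality line
    }
    (records, bases)
}

fn main() {
    let data = b"@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n";
    let (records, bases) = count_records_and_bases(data);
    assert_eq!((records, bases), (2, 10));
    println!("{} records, {} bases", records, bases);
}
```

A real parser would at least check the `@` and `+` markers, validate the alphabet, and handle malformed input; skipping all of that is exactly where the speed difference comes from.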
My Rust code (it ends up being simpler than the Julia code):
And the Julia code: