Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements #267

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

sambrightman
Copy link

I haven't looked at this branch for a few weeks, and you should bear in mind that I've never used Cython before. I've rebased it and it passes all current tests.

My observation was that PyVCF is still rather slow in reading & writing large, real-world VCFs (about 6-8x slower than a simplistic split-index-join approach). The individual commits here should be reasonably clear, and I found:

  • integer instead of string comparisons were worth about 5-10% in both Python and Cython implementations
  • INFO parsing became the bottleneck, and a very naive Cython version made about 25% difference
  • formatting strings for writing was also slow, and a very naive Cython version made about 20% difference

I haven't had much luck with line-profiling to improve things further. One idea might be to lazy-parse the INFO fields – keep them as strings until accessed. They still seem to be a bottleneck even with Cython (large real-world VCFs may contain many annotations, for example).

Downside here is further duplication between Python and Cython, but that seems unavoidable if supporting pure Python remains a priority.

@jamescasbon
Copy link
Owner

Nice work. The larger problem, it seems to me, is that VCF is madder than a box of frogs as a file format. eg it includes at least two incompatible delimited field specs IIRC.

How is tooling support for binary call format these days? Shouldn't that be the target format for performance?

@sambrightman
Copy link
Author

sambrightman commented Jan 7, 2018

Seems to me that it's still worth having the best performance in all use-cases. BCF has been discussed for years, but progress is slow. Shall we either merge or close this? There hasn't been a release for a while either, would probably be useful for people.

@dridk
Copy link

dridk commented Jan 28, 2022

Merged on dridk@9e5de0f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants