I haven't looked at this branch for a few weeks, and you should bear in mind that I've never used Cython before. I've rebased it and it passes all current tests.
My observation was that PyVCF is still rather slow in reading & writing large, real-world VCFs (about 6-8x slower than a simplistic split-index-join approach). The individual commits here should be reasonably clear, and I found:
I haven't had much luck with line-profiling to improve things further. One idea might be to lazy-parse the INFO fields – keep them as strings until accessed. They still seem to be a bottleneck even with Cython (large real-world VCFs may contain many annotations, for example).
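To make the lazy-parsing idea concrete, here is a rough sketch in pure Python. It is not code from this branch: `LazyInfo` is a hypothetical name, and the parsing is deliberately simplified (values stay as lists of strings, flags become `True`, and no type conversion from the header metadata is attempted).

```python
class LazyInfo:
    """Hypothetical wrapper: keep the raw INFO string, parse only on first access."""

    def __init__(self, raw):
        self._raw = raw        # the untouched INFO column from the VCF line
        self._parsed = None    # filled in on first lookup

    def _parse(self):
        parsed = {}
        # "." (or an empty column) means no INFO entries at all
        if self._raw not in ("", "."):
            for entry in self._raw.split(";"):
                if not entry:
                    continue
                key, sep, value = entry.partition("=")
                # Entries without "=" are flags; others may hold comma-separated values.
                # Assumption: values are left as strings rather than typed via the header.
                parsed[key] = value.split(",") if sep else True
        return parsed

    @property
    def parsed(self):
        if self._parsed is None:
            self._parsed = self._parse()
        return self._parsed

    def __getitem__(self, key):
        return self.parsed[key]

    def __contains__(self, key):
        return key in self.parsed

    def __str__(self):
        # Writing a record back out never needs the parsed form
        return self._raw


info = LazyInfo("DP=14;AF=0.5,0.25;DB")
depth = info["DP"]  # parsing happens here, on first access
```

Records that are only read and written back unchanged would never pay the parsing cost, since `str(info)` returns the original string. The trade-off is that any in-place modification of the INFO values would need to invalidate or regenerate the raw string.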
The downside here is further duplication between the Python and Cython code, but that seems unavoidable if supporting pure Python remains a priority.