-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Appetite for a non-backwards compatible upgrade to core APIs? #520
Comments
Interfaces , Factories, etc.. : +1 |
I am in, given the same circumstances. |
I'm very interested in this. I've been dreaming of a major htsjdk rewrite for a while now. I don't know how much time we can dedicate to it though. |
I'm supportive of this though I'm not sure how much we'll be able to contribute to the actual coding. We'll be happy adopters though! And we have opinions and ideas which we can share. When you have a place to make wishlists, I'm happy to add some wishes too. |
I suggest you take a look at the NGS API. It was designed for exactly this purpose. The API itself is format and archive neutral, and supports pluggable back-end engines. A tool can access SRA/SAM/BAM/CRAM/etc. all within the same process. The back-end architecture makes it possible to interface via REST or file or whatever. |
I have some interest in developing a symbolic allele API for BCF/VCF. As it currently stands, I'm already wrapping VariantContext to expose a specialised API for my purposes, but a properly designed htsjdk API would be much more useful in the general case. |
I would be very happy for a cleaner API, and would be happy to participate I hesitate to promise any assistance in the implementation because As I said, I will be happy to participate in the discussion, reviewing PR's 👍 On Mon, Mar 14, 2016 at 9:25 PM, Daniel Cameron [email protected]
|
+1 As a final user of HTSJDK for development, I would like to have interfaces with well-defined contracts.For instance, I was struggling to made a class to work with base qualities in both both |
I'm in. |
@tfenne I'll echo what other members of the GATK dev team have said: we're supportive of the idea and would like to be involved in the design discussions, but we're unsure how much development effort we could actually contribute. If the superiority of the new APIs is compelling enough, we might be able to allocate a few weeks of an engineer's time if our other obligations permit it. From the GATK perspective, I think we're particularly interested in: -Well-designed Read and Variant interfaces, with clear contracts and able to represent reads from sources other than just SAM/BAM (we've made an initial, very imperfect attempt at this in GATK4 with the -A plugin-based architecture for reading/writing records from sources other than just files (HDFS, Google cloud storage, etc.). We've been exploring the use of |
Any news about this? I would like to contribute if this is going to happen... |
@magicDGS I lost steam on this because I realized that:
|
Can we start something anyway ? For example slowly moving SAMRecord to an interface... :
|
I see, @tfenne... even with a new API it will be difficult to integrate into scala. But I think that the idea is great to have a new and clean API, well tested and with interfaces for Read and Variants (as @droazen and @lindenb suggested), and the use of I'm in for the implementation of an evolution of htsjdk, and I guess that the best way of doing it is like in GATK4: a new repository where we can start implementing interfaces and porting the parts of htsjdk with better documentation and unit tests for a higher coverage. I don't know if this idea it's fine for everyone, but I think that htsjdk is great for bioninformaticians and it could be even better with this changes and some effort from the community to improve it. |
I would like to highlight this again. What do you think about open a new htsjdk3 repository in the samtools account? I will definitely contribute! |
I think that this will have much more enthusiasm if it were to be done in On Thu, Mar 10, 2016 at 5:20 PM, Tim Fennell [email protected]
|
I would definitely be more enthusiastic for a scala version if others were on board. |
My problem is that I'm not that familiar with scala. Could a scala API be used in normal java code? |
Not easily. The reason scala is attractive is that is has a lot of language features that remove a ton of bolierplate, but some of them are implemented via the compiler, so if you're writing Java code you lose a lot of the advantage, and probably make life even harder. If you're writing a lot of Java you might like to take a look a Scala though. Fgbio is an OSS project where @nh13 and I do a lot our bioinfomatics in scala, using htsjdk. |
Thanks for the explanation, @tfenne! I've already had a look to scala, but I didn't use it after knowing about the Java 8 lambda support. I will have a look! |
I'd be very interested in an improved htsjdk, and I personally would be happy to move to scala. However, GATK is pretty entrenched in java-land and isn't likely to to adopt a scala htsjdk. If GATK isn't going to make use of it then I don't think that our team will be able to contribute to it. |
If java-8 hadn't come around I think moving GATK to scala would have been a much more plausible thing. Java 8 has just enough features that it makes remaining in java tolerable for most developers here. |
@lbergelson I think tolerable is about right 😉 I have several Java 8 projects I work on and several scala ones too .... at first Java 8 seemed great because it was so much better than Java 7, but after spending more time with Scala, I now barely tolerate working in Java any more. There are just so many more features that you come to rely on over time and resent having to code manually in Java - at least for me! |
@tfenne I don't disagree with you. Java improvements happen at the bare minimum level necessary to keep the language alive for entrenched users with massive legacy code bases. Unfortunately, that definitely applies to the GATK. As much as GATK4 has been improving old code, there's still so much of it that it would be a huge undertaking to bring it into idiomatic scala. From my limited scala experience I get the feeling that working with old java code from scala is not really much better than just using java, you end up having to deal with all the java annoyances anyway. |
I'm working now a lot with the GATK4 framework, and before I was using htsjdk to develop some projects. That's why I would be interested in improving htsjdk to make easier both APIs, and have a nice way of develop bioinformatic software. But if both APIs change to scala (something that I think is kind of unlikely) I will definitely spend more time on it, to be able to use it and contribute. If it is possible to start a new non-backwards compatible java htsjdk with cleaner code and interfaces I'm definitely in. If not, I will continue developing patches and improving it. Thanks for the nice discussion @tfenne and @lbergelson! For sure I will start having a look to scala after this whenever I have some time! |
I would like to bring this to discussion once again as the project governance changed: @jacarey, @tfenne, @lbergelson! As it seems that htsjdk is not going to move to scala (although maybe that is another possibility), what about trying to move the API to a set of interfaces in a different package and starting from there to expand to the rest of code? It could also be done as I pointed out in #896, generating a sub-module called |
I would like to bring attention to this again - what's about opening a new repository ( |
@magicDGS This is actually starting to move internally now. On my end I've started putting together my own set of goals, but I need to coordinate with the other maintainers. Once we've agreed on what we think is a good draft roadmap we'll post it for general comment and alteration. We want to have some starting point to make sure there's agreement on some core changes, as well as some thought on which the projects the maintainers are able to commit time to to working on. About how to do development, my current thought is to include a new API within htsjdk as a parallel package within the same codebase. (possibly published as a separate artifact to avoid people accidentally incorporating it.) That way it's easy to coordinate movement of code, and it can be easy for people to mix / match between parts of the new api and the old. |
@lbergelson - I would appreciate that the community is informed about the direction and timeframe for this, and involved on the dicussion. For example, I think that several users might have their own proposal for core changes. In particular, I think that the major core change for my own projects is to have an interface-based project, with factory/builder classes which might be use to configure downstream projects by plugin a custom implementation to read other formats in concrete tools (e.g., GATK can configure NIO in their factory method for the readers, without changes in the Picard project, or another use case is Spring dependency-injection). I believe that discussion with the community and not only internal will improve the roadmap, and also is beneficial to encourage to contribute in the basecode. Moreover, it will be a pleasure for me to contribute to the htsjdk3 development in any role. About the development process, I am not in favor of incorporating the new API and implementations in the same codebase for several reasons:
Using GATK4 as a reference, I believe that starting a new repository makes more sense in this case, because a complete re-write (or review of every addition) might be better than adding to the same project history changes in both legacy code and new one. I am not suggesting to move to a new organization, but a clean repository with clear governance settings and contribution guidelines will be ideal. |
@lbergelson (and the other maintainers, @jacarey and @tfenne) - here it comes another of my friendly pings in this respect. I would like to inform you that starting in July, I will be able to dedicate lots of time to software development and contribution to open source projects - thus, I would like to reactivate the discussion about the new htsjdk3 to start contibuting with code and port the current one. I would suggest to start a new project as in Hadoop-BAM in the samtools organization named |
@magicDGS I remain interested in contributing to this and helping drive it, but things are pretty complicated for me for the next 3-4 weeks (I'm moving across the country at the beginning of July). I think one way to kick-start this might be to do one or more online meetings (hangout, group slack, whatever) with a combination of a) the maintainers and b) anyone else who is volunteering significant effort towards this (i.e. you and perhaps a small handful of others). I think that would be more conducive to hashing through some of the big questions about, e.g.:
@lbergelson, @jacarey, @magicDGS - how does this sound to you? I'm happy to help coordinate as I think this will be far more productive than hours of back and forth in github issues and email... |
@tfenne - this sounds great for me! I volunteer for putting efford into the new API, so count with me for online meetings. |
@tfenne count me in please :-) |
I sent the email @tfenne - let me know if you got it. |
Calling @akiezun @lbergelson @vadimzalunin @nh13 @yfarjoun @droazen (and anyone else who's interested)!
I'd like to start some discussion to see if there would be appetite in collaborating on a newer, non-backwards compatible API for some or all of the major APIs in htsjdk? By major I mean SAM/BAM/CRAM/etc. and also VCF/BCF, with a focus on the former first.
To be specific, what I'm proposing is that we embark on a path where we:
samtools
andvariant
packages, for a long time, perhaps eversamtools2
orreads
etc.)Point 5 isn't meant to be absolute - i.e. it might make sense to move a bunch of utility classes into shared packages and use them directly, but rather to indicate that I mean this would not be writing wrappers around existing classes (like
SamReader
did).My motivations for this are many-fold:
getReadUnmappedFlag()
instead ofisUnmapped()
) to more cross-cutting design problems (e.g.SAMRecord
being a class not an interface).SAMRecord
that optimize for, e.g., low-memory consumption vs. low-latency access.htsjdk
APIs to work better with other JVM languages, particularly scala. You can use it now (and I do), but I think there are likely ways we could make the APIs easier to consume from scala or elsewhere without making the Java usage worse.I'm willing to dedicate time and effort to this if:
Would anyone else be interested enough to participate and help?
The text was updated successfully, but these errors were encountered: