Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Appetite for a non-backwards compatible upgrade to core APIs? #520

Open
tfenne opened this issue Mar 10, 2016 · 35 comments
Open

Appetite for a non-backwards compatible upgrade to core APIs? #520

tfenne opened this issue Mar 10, 2016 · 35 comments
Assignees

Comments

@tfenne
Copy link
Member

tfenne commented Mar 10, 2016

Calling @akiezun @lbergelson @vadimzalunin @nh13 @yfarjoun @droazen (and anyone else who's interested)!

I'd like to start some discussion to see if there would be appetite in collaborating on a newer, non-backwards compatible API for some or all of the major APIs in htsjdk? By major I mean SAM/BAM/CRAM/etc. and also VCF/BCF, with a focus on the former first.

To be specific, what I'm proposing is that we embark on a path where we:

  1. Leave the old APIs in place in the samtools and variant packages, for a long time, perhaps ever
  2. Create new parallel packages (e.g. samtools2 or reads etc.)
  3. Spend a good amount of time coming up with clean designs for all the major parts (readers, writers, factories, records etc.) before we start implementing anything beyond interfaces/abstract classes.
  4. Implement a battery of tests against the new APIs
  5. Implement all the new APIs by copy/pasting and refactoring code from the old implementations and/or writing new code, but not linking/using classes from the old packages

Point 5 isn't meant to be absolute - i.e. it might make sense to move a bunch of utility classes into shared packages and use them directly, but rather to indicate that I mean this would not be writing wrappers around existing classes (like SamReader did).

My motivations for this are many-fold:

  1. Many of the existing APIs are clunky. This ranges from unwieldy method names (e.g. getReadUnmappedFlag() instead of isUnmapped()) to more cross-cutting design problems (e.g. SAMRecord being a class not an interface).
  2. I would like to have alternative implementations of various classes like SAMRecord that optimize for, e.g., low-memory consumption vs. low-latency access.
  3. I'd like to have uniform access to Files/Streams/URLs regardless of whether the contents is SAM, BAM, CRAM or SRA (currently support for anything other than files is rather type-dependent).
  4. I'd like htsjdk APIs to work better with other JVM languages, particularly scala. You can use it now (and I do), but I think there are likely ways we could make the APIs easier to consume from scala or elsewhere without making the Java usage worse.

I'm willing to dedicate time and effort to this if:

  1. Others also see value in a much cleaner but not backwards compatible API
  2. Others would engage in both discussion, and porting of some of the existing code
  3. The new API would likely, over time, be adopted by GATK & Picard

Would anyone else be interested enough to participate and help?

@tfenne tfenne self-assigned this Mar 10, 2016
@lindenb
Copy link
Contributor

lindenb commented Mar 10, 2016

Interfaces , Factories, etc.. : +1
count me in, please

@nh13
Copy link
Member

nh13 commented Mar 10, 2016

I am in, given the same circumstances.

@lbergelson
Copy link
Member

I'm very interested in this. I've been dreaming of a major htsjdk rewrite for a while now. I don't know how much time we can dedicate to it though.

@akiezun
Copy link
Contributor

akiezun commented Mar 11, 2016

I'm supportive of this though I'm not sure how much we'll be able to contribute to the actual coding. We'll be happy adopters though! And we have opinions and ideas which we can share. When you have a place to make wishlists, I'm happy to add some wishes too.

@kwrodarmer
Copy link

I suggest you take a look at the NGS API. It was designed for exactly this purpose.

The API itself is format and archive neutral, and supports pluggable back-end engines. A tool can access SRA/SAM/BAM/CRAM/etc. all within the same process. The back-end architecture makes it possible to interface via REST or file or whatever.

@d-cameron
Copy link
Contributor

I have some interest in developing a symbolic allele API for BCF/VCF. As it currently stands, I'm already wrapping VariantContext to expose a specialised API for my purposes, but a properly designed htsjdk API would be much more useful in the general case.

@yfarjoun
Copy link
Contributor

I would be very happy for a cleaner API, and would be happy to participate
in the discussion about user-stories and API. I would also be happy to help
with the move of various picard tools over to the new api assuming (of
course) that it indeed results in cleaner and clearer code.

I hesitate to promise any assistance in the implementation because
a) I'm not a software engineer, so I am uncertain of the quality of the
results I might produce
b) Sadly, I doubt that this will be a priority for me, and thusly I am
unsure how much time I would be able to dedicate to this effort

As I said, I will be happy to participate in the discussion, reviewing PR's
and so forth.

👍

On Mon, Mar 14, 2016 at 9:25 PM, Daniel Cameron [email protected]
wrote:

I have some interest in developing a symbolic allele API for BCF/VCF. As
it currently stands, I'm already wrapping VariantContext to expose a
specialised API for my purposes, but a properly designed htsjdk API would
be much more useful in the general case.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#520 (comment)

@magicDGS
Copy link
Member

+1

As a final user of HTSJDK for development, I would like to have interfaces with well-defined contracts.For instance, I was struggling to made a class to work with base qualities in both both FastqRecord and SAMRecord, and Read interface with a contract for getBaseQualities() will be very useful.

@vadimzalunin
Copy link
Contributor

I'm in.
NGS API approach is certainly interesting. Also there is GA4GH attempt to describe reads in a formal way.

@droazen
Copy link
Contributor

droazen commented Mar 16, 2016

@tfenne I'll echo what other members of the GATK dev team have said: we're supportive of the idea and would like to be involved in the design discussions, but we're unsure how much development effort we could actually contribute. If the superiority of the new APIs is compelling enough, we might be able to allocate a few weeks of an engineer's time if our other obligations permit it.

From the GATK perspective, I think we're particularly interested in:

-Well-designed Read and Variant interfaces, with clear contracts and able to represent reads from sources other than just SAM/BAM (we've made an initial, very imperfect attempt at this in GATK4 with the GATKRead interface)

-A plugin-based architecture for reading/writing records from sources other than just files (HDFS, Google cloud storage, etc.). We've been exploring the use of java.nio.file.Path as a pluggable universal resource handle, as NIO plugins already exist for HDFS and GCS as separate projects, and relying on third-party NIO plugins would not compel us to bloat htsjdk's dependencies or make any code changes at all to support new filesystems/protocols.

@magicDGS
Copy link
Member

Any news about this? I would like to contribute if this is going to happen...

@tfenne
Copy link
Member Author

tfenne commented Jul 26, 2016

@magicDGS I lost steam on this because I realized that:

  1. What I really want is an API that works really well in scala
  2. It's hard to make an API in Java that feels anything like as good as a native scala API
  3. I don't think a native scala API would see the uptake a Java one would
  4. I think putting scala wrappers around a Java API would degrade performance too much

@lindenb
Copy link
Contributor

lindenb commented Jul 26, 2016

Can we start something anyway ? For example slowly moving SAMRecord to an interface... :

interface SAMRecordI { ... }
class SAMRecord implements SAMRecordI { }
...

@magicDGS
Copy link
Member

I see, @tfenne... even with a new API it will be difficult to integrate into scala.

But I think that the idea is great to have a new and clean API, well tested and with interfaces for Read and Variants (as @droazen and @lindenb suggested), and the use of java.nio.file.Path instead of File for easier file handling in different filesystems.

I'm in for the implementation of an evolution of htsjdk, and I guess that the best way of doing it is like in GATK4: a new repository where we can start implementing interfaces and porting the parts of htsjdk with better documentation and unit tests for a higher coverage.

I don't know if this idea it's fine for everyone, but I think that htsjdk is great for bioninformaticians and it could be even better with this changes and some effort from the community to improve it.

@magicDGS
Copy link
Member

magicDGS commented Sep 7, 2016

I would like to highlight this again. What do you think about open a new htsjdk3 repository in the samtools account? I will definitely contribute!

@yfarjoun
Copy link
Contributor

yfarjoun commented Sep 7, 2016

I think that this will have much more enthusiasm if it were to be done in
scala.....

On Thu, Mar 10, 2016 at 5:20 PM, Tim Fennell [email protected]
wrote:

Calling @akiezun https://github.com/akiezun @lbergelson
https://github.com/lbergelson @vadimzalunin
https://github.com/vadimzalunin @nh13 https://github.com/nh13
@yfarjoun https://github.com/yfarjoun @droazen
https://github.com/droazen (and anyone else who's interested)!

I'd like to start some discussion to see if there would be appetite in
collaborating on a newer, non-backwards compatible API for some or all of
the major APIs in htsjdk? By major I mean SAM/BAM/CRAM/etc. and also
VCF/BCF, with a focus on the former first.

To be specific, what I'm proposing is that we embark on a path where we:

  1. Leave the old APIs in place in the samtools and variant packages,
    for a long time, perhaps ever
  2. Create new parallel packages (e.g. samtools2 or reads etc.)
  3. Spend a good amount of time coming up with clean designs for all
    the major parts (readers, writers, factories, records etc.) before we start
    implementing anything beyond interfaces/abstract classes.
  4. Implement a battery of tests against the new APIs
  5. Implement all the new APIs by copy/pasting and refactoring code
    from the old implementations and/or writing new code, but not
    linking/using classes from the old packages

Point 5 isn't meant to be absolute - i.e. it might make sense to move a
bunch of utility classes into shared packages and use them directly, but
rather to indicate that I mean this would not be writing wrappers around
existing classes (like SamReader did).

My motivations for this are many-fold:

  1. Many of the existing APIs are clunky. This ranges from unwieldy
    method names (e.g. getReadUnmappedFlag() instead of isUnmapped()) to
    more cross-cutting design problems (e.g. SAMRecord being a class not
    an interface).
  2. I would like to have alternative implementations of various classes
    like SAMRecord that optimize for, e.g., low-memory consumption vs.
    low-latency access.
  3. I'd like to have uniform access to Files/Streams/URLs regardless of
    whether the contents is SAM, BAM, CRAM or SRA (currently support for
    anything other than files is rather type-dependent).
  4. I'd like htsjdk APIs to work better with other JVM languages,
    particularly scala. You can use it now (and I do), but I think there are
    likely ways we could make the APIs easier to consume from scala or
    elsewhere without making the Java usage worse.

I'm willing to dedicate time and effort to this if:

  1. Others also see value in a much cleaner but not backwards
    compatible API
  2. Others would engage in both discussion, and porting of some of the
    existing code
  3. The new API would likely, over time, be adopted by GATK & Picard

Would anyone else be interested enough to participate and help?


Reply to this email directly or view it on GitHub
#520.

@tfenne
Copy link
Member Author

tfenne commented Sep 7, 2016

I would definitely be more enthusiastic for a scala version if others were on board.

@magicDGS
Copy link
Member

magicDGS commented Sep 8, 2016

My problem is that I'm not that familiar with scala. Could a scala API be used in normal java code?

@tfenne
Copy link
Member Author

tfenne commented Sep 8, 2016

Not easily. The reason scala is attractive is that is has a lot of language features that remove a ton of bolierplate, but some of them are implemented via the compiler, so if you're writing Java code you lose a lot of the advantage, and probably make life even harder.

If you're writing a lot of Java you might like to take a look a Scala though. Fgbio is an OSS project where @nh13 and I do a lot our bioinfomatics in scala, using htsjdk.

@magicDGS
Copy link
Member

magicDGS commented Sep 8, 2016

Thanks for the explanation, @tfenne! I've already had a look to scala, but I didn't use it after knowing about the Java 8 lambda support. I will have a look!

@lbergelson
Copy link
Member

I'd be very interested in an improved htsjdk, and I personally would be happy to move to scala. However, GATK is pretty entrenched in java-land and isn't likely to to adopt a scala htsjdk. If GATK isn't going to make use of it then I don't think that our team will be able to contribute to it.

@lbergelson
Copy link
Member

If java-8 hadn't come around I think moving GATK to scala would have been a much more plausible thing. Java 8 has just enough features that it makes remaining in java tolerable for most developers here.

@tfenne
Copy link
Member Author

tfenne commented Sep 8, 2016

@lbergelson I think tolerable is about right 😉 I have several Java 8 projects I work on and several scala ones too .... at first Java 8 seemed great because it was so much better than Java 7, but after spending more time with Scala, I now barely tolerate working in Java any more. There are just so many more features that you come to rely on over time and resent having to code manually in Java - at least for me!

@lbergelson
Copy link
Member

@tfenne I don't disagree with you. Java improvements happen at the bare minimum level necessary to keep the language alive for entrenched users with massive legacy code bases. Unfortunately, that definitely applies to the GATK. As much as GATK4 has been improving old code, there's still so much of it that it would be a huge undertaking to bring it into idiomatic scala. From my limited scala experience I get the feeling that working with old java code from scala is not really much better than just using java, you end up having to deal with all the java annoyances anyway.

@magicDGS
Copy link
Member

magicDGS commented Sep 8, 2016

I'm working now a lot with the GATK4 framework, and before I was using htsjdk to develop some projects. That's why I would be interested in improving htsjdk to make easier both APIs, and have a nice way of develop bioinformatic software. But if both APIs change to scala (something that I think is kind of unlikely) I will definitely spend more time on it, to be able to use it and contribute.

If it is possible to start a new non-backwards compatible java htsjdk with cleaner code and interfaces I'm definitely in. If not, I will continue developing patches and improving it. Thanks for the nice discussion @tfenne and @lbergelson! For sure I will start having a look to scala after this whenever I have some time!

@magicDGS
Copy link
Member

magicDGS commented Jun 9, 2017

I would like to bring this to discussion once again as the project governance changed: @jacarey, @tfenne, @lbergelson! As it seems that htsjdk is not going to move to scala (although maybe that is another possibility), what about trying to move the API to a set of interfaces in a different package and starting from there to expand to the rest of code? It could also be done as I pointed out in #896, generating a sub-module called htsjdk-api to have just the interfaces...

@magicDGS
Copy link
Member

I would like to bring attention to this again - what's about opening a new repository (htsjdk3) in the samtools organization as the one for HadoodBAM non-backwards compatible implementation (https://github.com/nameless-gos/nameless)?

@lbergelson
Copy link
Member

lbergelson commented May 31, 2018

@magicDGS This is actually starting to move internally now. On my end I've started putting together my own set of goals, but I need to coordinate with the other maintainers. Once we've agreed on what we think is a good draft roadmap we'll post it for general comment and alteration. We want to have some starting point to make sure there's agreement on some core changes, as well as some thought on which the projects the maintainers are able to commit time to to working on.

About how to do development, my current thought is to include a new API within htsjdk as a parallel package within the same codebase. (possibly published as a separate artifact to avoid people accidentally incorporating it.) That way it's easy to coordinate movement of code, and it can be easy for people to mix / match between parts of the new api and the old.

@magicDGS
Copy link
Member

magicDGS commented Jun 1, 2018

@lbergelson - I would appreciate that the community is informed about the direction and timeframe for this, and involved on the dicussion. For example, I think that several users might have their own proposal for core changes. In particular, I think that the major core change for my own projects is to have an interface-based project, with factory/builder classes which might be use to configure downstream projects by plugin a custom implementation to read other formats in concrete tools (e.g., GATK can configure NIO in their factory method for the readers, without changes in the Picard project, or another use case is Spring dependency-injection). I believe that discussion with the community and not only internal will improve the roadmap, and also is beneficial to encourage to contribute in the basecode. Moreover, it will be a pleasure for me to contribute to the htsjdk3 development in any role.

About the development process, I am not in favor of incorporating the new API and implementations in the same codebase for several reasons:

  • It will be nice to have a multi-module project for separating artefacts based on functionality (e.g., SAM/BAM vs. trible/tabix - see Proposal: separate HTSJDK in different gradle build artifacts #896 and Proposal to evolve to HTSJDK version 3 (DO NOT MERGE) #928).
  • Coordinates in maven are not concordant with the package names. It will be nice if for htsjdk3 all the classes and packages live in the com.github.samtools package.
  • I am in favor of a almost-complete rewrite of the code to avoid legacy implementation constraints (e.g. IO-handling) and extensibility problems (several of my contributions were to fix problems when trying to implement extensions of factory/builder methods). Also, the package organization can be improved (e.g., htsjdk.samtoolshas a lot of classes, which can be organized in a better way).

Using GATK4 as a reference, I believe that starting a new repository makes more sense in this case, because a complete re-write (or review of every addition) might be better than adding to the same project history changes in both legacy code and new one. I am not suggesting to move to a new organization, but a clean repository with clear governance settings and contribution guidelines will be ideal.

@magicDGS
Copy link
Member

@lbergelson (and the other maintainers, @jacarey and @tfenne) - here it comes another of my friendly pings in this respect. I would like to inform you that starting in July, I will be able to dedicate lots of time to software development and contribution to open source projects - thus, I would like to reactivate the discussion about the new htsjdk3 to start contibuting with code and port the current one.

I would suggest to start a new project as in Hadoop-BAM in the samtools organization named htsjdk3, where we could start discussing the roadmap, governance, code of conduct and design to correct the contract from current API and make it more user-friendly (extensible). Please, count with me for this discussion and to start porting code.

@tfenne
Copy link
Member Author

tfenne commented Jun 22, 2018

@magicDGS I remain interested in contributing to this and helping drive it, but things are pretty complicated for me for the next 3-4 weeks (I'm moving across the country at the beginning of July).

I think one way to kick-start this might be to do one or more online meetings (hangout, group slack, whatever) with a combination of a) the maintainers and b) anyone else who is volunteering significant effort towards this (i.e. you and perhaps a small handful of others). I think that would be more conducive to hashing through some of the big questions about, e.g.:

  1. Where is this work going to take place
  2. How much we care about backwards compatibility etc.
  3. Who's committing how much time and effort

@lbergelson, @jacarey, @magicDGS - how does this sound to you? I'm happy to help coordinate as I think this will be far more productive than hours of back and forth in github issues and email...

@magicDGS
Copy link
Member

@tfenne - this sounds great for me! I volunteer for putting efford into the new API, so count with me for online meetings.

@lindenb
Copy link
Contributor

lindenb commented Jun 25, 2018

@tfenne count me in please :-)

@tfenne
Copy link
Member Author

tfenne commented Jun 27, 2018

@magicDGS & @lindenb happy to include you guys. Would be helpful to be able to send you email outside of GitHub for coordinating some of this. Could you shoot me a quick email (to [email redacted]) so I have your email(s) and can invite you to calls?

@magicDGS
Copy link
Member

I sent the email @tfenne - let me know if you got it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests