Discuss project governance, relationships upstream and down #180
Comments
I will coincidentally be in Boston already that day, so I will definitely go to the workshop. Looking forward to chatting with folks there; thanks for mentioning it, @heuermh! |
Hi, I am Keijo Heljanko, Associate Professor at Aalto University. I originally funded the Hadoop-BAM project (coded by Matti Niemenmaa and Andre Schumacher) at Aalto University, and we deliberately released it under the MIT licence to make it available to the maximum number of Hadoop- and Spark-based NGS processing pipelines. An Apache licence would probably have been even better, but I did not understand that at the time; I, and I think all the developers, also wanted a licence that allows maximum flexibility in the use of the Hadoop-BAM library by different projects.
I am based in Helsinki, Finland, so I will not be able to join you in Boston, but I would love to be involved in developing Hadoop-BAM further. In fact, my PhD student Ilari Maarala created another NGS pipeline using Hadoop-BAM as its base technology, which was just published. The authors are Altti Ilari Maarala, Zurab Bzhalava, Joakim Dillner, Keijo Heljanko, and Davit Bzhalava.
We are currently also working on a Spark-based pan-genomics pipeline, which will eventually require new file formats that it would make sense for Hadoop-BAM to support. We would love to take part in discussions on how to improve Hadoop-BAM and related projects and techniques; having a common codebase with an Apache Incubation project sounds great! Please keep me in the loop. I can be reached by email or by Skype at "keijo.heljanko".
Keijo Heljanko |
Oh, and Ilari's paper is called "ViraPipe: scalable parallel pipeline for viral metagenome analysis from next generation sequencing reads". Here is the GitHub page for the ViraPipe project: |
First of all, thanks for including me in the discussion of Hadoop-BAM governance. I won't be able to join the Codefest, but I would love to contribute to the project in whatever way possible as a downstream API user. I like the idea of extracting common utilities from downstream projects into Hadoop-BAM to have a common framework for working with HDFS and other
I am probably the least familiar with the codebase, but I am interested in learning more about it and contributing to it as much as possible. |
Hi, Also, having a list of downstream projects and contacts from the projects using Hadoop-BAM would be quite useful to discuss the future of libraries handling genomics file formats, and how to contact people potentially working with the same issues. |
As far as I can tell, the list of Winter Codefest attendees has strong representation from the Broad for Cromwell but less so for GATK, and none for htsjdk or Hail. I am available to visit the Broad while in Boston, if that would help. The first part of collaborating is showing up!
@ryan-williams Looking forward to finally meeting you in person!
@kheljanko When I refer to project governance, I'm specifically thinking about the software license, copyright assignment, code of conduct, a Contributor License Agreement (CLA) process, GitHub project administration, documentation hosting, the release process, code signing keys, evolutionary vs. revolutionary changes, etc. We have a lot of this in place with the Big Data Genomics organization and can share our experience.
@magicDGS Thanks for joining the conversation here! I've spent many hours working around problems with htsjdk and have been frustrated by that project's inability to consider revolutionary changes. I hope that whatever process and governance we can put into place here, expanding the scope as necessary (whether through the Apache incubation process or otherwise), will provide a way forward for those changes to happen. |
@heuermh I may be able to attend the codefest, I'm not certain yet. We've been pretty busy preparing the gatk4 release and subsequently following up issues exposed through the launch. We'd definitely like to avoid duplicating effort, so anything that helps avoid fragmentation and duplication would be great. I'm a little worried that creating an Apache project to combine the "best I'm sorry you've had so much trouble with htsjdk. Htsjdk has been very neglected recently, but we're planning on investing a lot more effort into it in the near future. We're in the planning stage of a major revamp which should begin to fix some of the underlying issues, but we can only fix the problems we know about. It would be great if you could file issues describing the problems you've run into. |
@lbergelson Thank you for the reply, this is exactly the conversation I'd like to have. If you aren't able to attend the Codefest, perhaps we might be able to find some other time while I'm out in Boston. |
Just getting back after being on leave for a month... Glad to see this being discussed. Are there any outcomes or discussions from the Codefest that you can share here @heuermh? @kheljanko thanks for sharing the information about the virus paper - sounds interesting! |
I've written a bit about a new Spark API on this ticket: #196. I've now added a page describing the scope and features of the new API. There's also a bit about a home and governance, which it would be good to discuss more. @cmnbroad, @droazen, @fnothaft, @heuermh, @kheljanko, @lbergelson, @magicDGS, @ryan-williams (and others who are interested) - I'd like to propose an online meeting. If you are interested in participating, please fill in this Doodle poll to select a day: https://doodle.com/poll/fikirnp8wwsh6bfa Thanks! |
I'd love to join the discussion, but unfortunately I will be on vacation next week. I'm looking forward to seeing the result of the meeting here, so please keep us updated on this thread. Thanks! |
Thanks to everyone who responded to the Doodle poll! The result is here: https://doodle.com/poll/fikirnp8wwsh6bfa, Apr 10, 2018 at 5pm UK time (GMT+1). Unfortunately, there wasn't a slot that suited everyone, so @magicDGS and @kheljanko won't be able to make it - please let me know if you have anything you'd like to relay to the meeting. Here's the hangout link: https://meet.google.com/amb-zxti-qwe |
I would like to:
|
V short summary of the meeting. There were three technical areas (raised by Frank):
There was general agreement for all of these being in scope for a new project. There may need to be some phasing - e.g. have RDD implementations with existing htsjdk classes, and add others (e.g. dataset) in the future. After discussing governance and hosting the next steps are (summarized by Ryan):
Regarding naming, in the meeting a couple of names were suggested:
I'd also like to put forward the following (in the Spark sequencing vein):
I said I would send out a poll for the name. If you have any ideas or suggestions, please post them here so I can include them in the vote. |
Also
I wouldn't say that any decisions have been made with regards to hosting under the samtools organization or that the software license should be MIT. Part of the trouble with not having project governance is that no process is in place to make decisions. How about this for a proposal:
I will help with any or all of these. |
My opinion about some points brought by @heuermh and the summary by @tomwhite:
Thus, my proposal is a bit different than the one from @heuermh:
|
It seems like I may have misunderstood what was agreed (if only tentatively) in the meeting regarding github org (samtools) and license (MIT). In the meantime, here’s the code I’ve been working on (temporary location): https://github.com/tomwhite/squark. @magicDGS regarding the fate of this project - it still needs to be maintained at least until any replacement has a superset of functionality. I plan to do another Hadoop-BAM release next week with a few changes. |
I've created a nameless Genomics on Apache Spark organization and repository. I'll flesh out the issues there with details from this issue and the email thread from the meeting. |
I'd like to organise another meeting to see if we can make some progress on the new project. I've created a poll at https://doodle.com/poll/k9hc9sgbf9uhue7i to find a time. Please select which times you can make if you are interested in attending. Thanks. |
Hi @tomwhite! I would love to join, but can't make any of the dates due to travel and other commitments. Can we do a time in June? |
@fnothaft Sure - I was thinking of running a monthly meeting at least while we get things set up so I'll add a new poll for June - and go ahead with a meeting this month too (probably next week now). |
Thanks everyone who responded to the poll! The result is here: https://doodle.com/poll/k9hc9sgbf9uhue7i. The time of the meeting is 5pm (GMT+1) on Tue May 22. Here's the hangout link: https://hangouts.google.com/hangouts/_/calendar/dG9tLmUud2hpdGVAZ21haWwuY29t.7grvb1g73s1svncmcg840jt6ee?authuser=0. |
Thanks to everyone who attended the meeting. There was a desire expressed to work together on foundational projects to avoid duplication, starting with a more tightly-scoped focus than perhaps before - i.e. a Spark-native Hadoop-BAM project (Squark). We have actually been working on Hadoop-BAM together over the last few years, so it would be worth seeing how we can continue that with a better defined governance model. Actions: please comment on the governance issues here: https://github.com/nameless-gos/nameless/issues. |
Note that the GitHub organization and repository have changed; the new link is: |
@tomwhite and I would like to submit an abstract on Disq to the BOSC 2019 conference. Please consider adding your contact information and author affiliation to the following shared doc https://docs.google.com/document/d/1by-YA5FQra8CyqMHOwa6278fNM0fZpD8PDD5XxLOIPM/edit?usp=sharing |
I took the liberty of adding @lbergelson in his absence, since he probably won't see this in time. |
Sorry for the ping, closing an old issue. |
@tomwhite
@fnothaft
@ryan-williams
@jacarey
@tfenne
@lbergelson
@cmnbroad
@droazen
@magicDGS
@vdauwera
@cseed
Sorry to @-mention you all here on this issue, but unfortunately I only know some of you by your GitHub handles.
I would like to invite you all to attend, in person or virtually, the OpenBio Winter Codefest 2018 on Thursday Jan 18th and Friday Jan 19th in Boston, MA.
https://www.open-bio.org/wiki/Codefest_Winter_2018
Hadoop-BAM is an upstream dependency shared by GATK4 and ADAM (and possibly also Hail?) and is in need of clarification around project governance, similar to the process recently undertaken by htsjdk (samtools/htsjdk#871). I see the Codefest as a good opportunity to discuss this and then branch out into other areas of possible collaboration between the various projects.
One possible way forward would be to incubate a new project at the Apache Software Foundation, starting from Hadoop-BAM and extending out into bdg-formats, bdg-utils, and ADAM on our side, plus common/utility code extracted from GATK4 and Hail, and perhaps even upstream into htsjdk. We welcome friendly competition on algorithms and analyses, but there is no reason to duplicate effort on the underlying technology stack.
Please feel free to discuss here on this issue or in the Codefest Gitter chat room. We can refine an agenda on the Codefest shared project ideas doc. Hope to meet some of you in Boston!