Re-implement Study View's data binning algorithm using SQL (instead of Java) #117

Open
alisman opened this issue Feb 4, 2025 · 3 comments

Comments

@alisman

alisman commented Feb 4, 2025

Background:
cBioPortal is an open-source platform designed to provide a web interface for exploring, visualizing, and analyzing cancer genomics data, and has grown to be widely used by researchers and clinicians worldwide. The current interface provides comprehensive tools for individual patient data exploration, including mutations, copy number variations, and clinical information as well as cohort exploration, analytics, and cohort comparisons.

The endpoints that drive the histogram charts on the cBioPortal Study View calculate data bins and return them to the frontend for display. To do this, they must fetch the underlying data from the database and run it through custom binning logic written in Java. This performs poorly on large data sets, because every underlying value has to be transferred to the web server and held in its memory. The binning should instead be done in the database query, so that only the bin counts are returned. ClickHouse, the new database we are adopting, provides functions to do this.

Goal:
Optimize the cBioPortal Study View data binning algorithm by replacing the existing logic written in Java and re-implementing it so that the heavy lifting is performed by the database instead.

Approach:
We believe this is possible using the `roundDown` function of ClickHouse (our new OLAP database).

This project requires:

  1. Understanding the specific requirements of binning in cBioPortal (e.g. custom bin definitions).
  2. Meeting these requirements using `roundDown`.
  3. If step 2 proves infeasible, we may resort to ClickHouse's user-defined functions (UDFs).
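
To make the target behavior concrete, here is a minimal Java sketch of the semantics ClickHouse documents for `roundDown(num, arr)`: a value is rounded down to the largest array element not exceeding it, and a value below the lowest element is assigned to the lowest element. The class and method names are hypothetical and this is not the existing cBioPortal binning code; it is only a reference that SQL results could be compared against. It assumes the boundaries are non-empty and sorted in ascending order.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical reference sketch, not cBioPortal code.
final class RoundDownBinningSketch {

    // Returns the largest boundary <= value. Values below the lowest
    // boundary map to the lowest boundary, following the documented
    // behavior of ClickHouse's roundDown(num, arr).
    // Assumption: boundaries is non-empty and sorted ascending.
    static double roundDown(double value, double[] boundaries) {
        double result = boundaries[0];
        for (double boundary : boundaries) {
            if (boundary <= value) {
                result = boundary;
            } else {
                break; // sorted input: no later boundary can match
            }
        }
        return result;
    }

    // Counts how many values fall into each bin, keyed by the bin's
    // lower boundary. Bins that receive no values are absent.
    static Map<Double, Integer> histogram(double[] values, double[] boundaries) {
        Map<Double, Integer> counts = new TreeMap<>();
        for (double value : values) {
            counts.merge(roundDown(value, boundaries), 1, Integer::sum);
        }
        return counts;
    }
}
```

On the database side, a query of roughly the shape `SELECT roundDown(value, [0, 10, 20]) AS bin_start, count() FROM <table> GROUP BY bin_start` would be expected to produce the same counts, so only one row per bin reaches the web server. The table and column names there are placeholders, not the actual schema. Requirements such as open-ended first and last bins or non-numeric values are not covered by this sketch.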

Possible mentors:
@alisman

@Aillie-Ifraeem

Hi @alisman,
I am Aillie. I use Java as my programming language, and I would like to contribute to cBioPortal through the project "Re-implement Study View's data binning algorithm using SQL". (I have also done some work with SQL.)

Could you give me any advice on how I might start? Pardon me, I know it's a very basic question, but I'm still in my learning phase. It's my first time contributing to a GSoC project.

I'm really looking forward to working on this project through GSoC 2025.

@Nitish-Naik

Hello mentor @alisman,
I am Nitish, a Computer Science undergraduate from India. I am interested in contributing to this project, and I have a few questions.

I've read the goal above and understand that you want to optimize the cBioPortal Study View data binning algorithm by replacing the existing logic written in Java.

I'm familiar with Java and SQL.
I'm also familiar with the database that you mentioned (ClickHouse).

Can you please provide more information about this?
Where can I find the data for testing, the documentation, etc.?

In my application, I want to be transparent about my current familiarity. I have good experience with SQL and Java, and I also have experience with data visualization in general. I am genuinely eager to learn and contribute to this project.

I want to highlight my skills and experience in this application.

Eagerly waiting for your reply.
Thank you,
Nitish

@DeepamJha

Hi @alisman,

I'm Deepam Jha, a sophomore engineering student from GGSIPU, India, and I'm excited about contributing to cBioPortal through GSoC 2025. I'm particularly interested in Issue #117: Re-implement Study View’s Data Binning Algorithm Using SQL.

I have experience with Java and SQL and have been exploring ClickHouse, especially its `roundDown` function for binning. I understand that the current Java-based binning logic can be inefficient for large datasets, and that moving this logic to the database should help optimize performance.

To get started, I’d love some guidance:
Could you point me to where the current Java-based binning logic is implemented? Also, are there any constraints or test datasets I should consider while working on the SQL-based binning?

Best,
Deepam Jha
