Re-implement Study View's data binning algorithm using SQL (instead of Java) #117

alisman · 2025-02-04T17:40:54Z

Background:
cBioPortal is an open-source platform that provides a web interface for exploring, visualizing, and analyzing cancer genomics data. It is widely used by researchers and clinicians worldwide. The current interface provides comprehensive tools for exploring individual patient data (mutations, copy number variations, and clinical information), as well as for cohort exploration, analytics, and cohort comparisons.

The endpoints that drive the histogram charts on the cBioPortal Study View calculate data bins and return them to the frontend for display. To do this, they must fetch the underlying data from the database and run it through custom binning logic written in Java. This does not perform well for large data sets. The binning should instead be done in the database query, so that we don't have to return voluminous data and hold it in web server memory. ClickHouse, the new database we are adopting, provides functions to do this.

Goal:
Optimize the cBioPortal Study View data binning algorithm by re-implementing the existing Java logic in SQL, so that the heavy lifting is performed by the database instead of the web server.

Approach:
We believe this is possible using the roundDown function of ClickHouse (our new OLAP database).

This project requires:

1. Understanding the specific requirements of binning in cBioPortal (e.g. custom bin definitions).
2. Meeting these requirements using roundDown.
3. If 2 proves infeasible, we may resort to ClickHouse's user-defined functions.
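
For orientation, here is a minimal Java sketch of the semantics that ClickHouse documents for roundDown(num, arr): the value is rounded down to the nearest element of an ascending boundary array, and values below the lowest boundary are mapped to the lowest boundary. The class name, the helper names, and the table and column names in the example query are hypothetical and are not taken from the cBioPortal code base. The sketch illustrates the idea only and is not the existing Study View binning logic.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: mirrors the documented behaviour of ClickHouse's roundDown(num, arr). */
final class RoundDownSketch {

    // Hypothetical query shape. The table and column names are placeholders,
    // not cBioPortal's actual schema.
    static final String EXAMPLE_SQL =
        "SELECT roundDown(attr_value, [0, 10, 20, 30]) AS bin_start, count() AS n "
      + "FROM hypothetical_numeric_values GROUP BY bin_start ORDER BY bin_start";

    /**
     * Returns the largest boundary that is <= value. If value is below every
     * boundary, the lowest boundary is returned (same rule as ClickHouse).
     * Assumes the boundaries are sorted in ascending order.
     */
    static double roundDown(double value, double[] ascendingBoundaries) {
        if (ascendingBoundaries.length == 0) {
            throw new IllegalArgumentException("at least one boundary is required");
        }
        double result = ascendingBoundaries[0];
        for (double boundary : ascendingBoundaries) {
            if (boundary <= value) {
                result = boundary;
            } else {
                break; // boundaries are ascending, so no later one can match
            }
        }
        return result;
    }

    /** Counts values per bin start, the in-memory analogue of GROUP BY bin_start. */
    static Map<Double, Integer> countPerBin(double[] values, double[] ascendingBoundaries) {
        Map<Double, Integer> counts = new TreeMap<>();
        for (double v : values) {
            counts.merge(roundDown(v, ascendingBoundaries), 1, Integer::sum);
        }
        return counts;
    }
}
```

One point worth checking against requirement 1: because values below the lowest boundary fall into the first bin, any custom bin definition that needs a separate open-ended lower bin would have to handle that case explicitly.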

Possible mentors:
@alisman

Aillie-Ifraeem · 2025-02-28T08:53:20Z

Hi @alisman,
I am Aillie. Java is my main programming language, and I would like to contribute to cBioPortal through the
project: Re-implement Study View's data binning algorithm using SQL. (I have also done some work with SQL.)

Could you give me any advice on how I should start? Pardon me, I know it's a very basic question, but I'm still in my learning phase. It's my first time contributing to GSoC projects.

I'm really looking forward to working on this project through GSoC 2025.

Nitish-Naik · 2025-03-05T03:46:17Z

Hello mentor @alisman,
I am Nitish, a Computer Science undergraduate from India. I am interested in contributing to this project, and I have a few questions.

I've read the goal above and understand that you need to optimize the cBioPortal Study View data binning algorithm by replacing the existing logic written in Java.

I'm familiar with Java and SQL.
I'm also familiar with the database that you mentioned (ClickHouse).

Can you please provide more information about this?
Where can I find the test data, documentation, etc.?

In my application, I want to be transparent about my current familiarity. I have good experience with SQL and Java, and I also have experience with data visualization in general. I am genuinely eager to learn and contribute to this project.

I want to highlight my skills and experience in this application.

Eagerly waiting for your reply.
Thank you,
Nitish

DeepamJha · 2025-03-06T06:04:19Z

Hi @alisman,

I'm Deepam Jha, a sophomore engineering student from GGSIPU, India, and I'm excited about contributing to cBioPortal through GSoC 2025. I'm particularly interested in Issue #117: Re-implement Study View’s Data Binning Algorithm Using SQL.

I have experience with Java and SQL and have been exploring ClickHouse, especially its roundDown function for binning. I understand that the current Java-based binning logic can be inefficient for large datasets, and moving this logic to the database should help optimize performance.

To get started, I’d love some guidance:
Could you point me to where the current Java-based binning logic is implemented? Also, are there any constraints or test datasets I should consider while working on the SQL-based binning?

Best,
Deepam Jha

alisman added GSoC-2025 GSoC 2025 Candidate Projects Size: Medium (175h) Difficulty: Medium labels Feb 4, 2025

ao508 added enhancement cBioPortal Java SQL labels Feb 6, 2025

alisman assigned alisman and unassigned alisman Feb 6, 2025
