Using GraphRAG for Automated Data Analysis with LLMs to Explore Cancer Pharmacogenomics Data #260

cannin · 2025-02-14T20:19:38Z

Background

CellMiner Cross-Database (CellMinerCDB, discover.nci.nih.gov/cellminercdb) allows integration and analysis of molecular and pharmacological data within and across cancer cell line datasets. The software is written in the R programming language using the R Shiny web framework. The database contains -omics data (e.g., gene expression, mutations), drug response data, and sample annotations, stored internally as tables that have been processed to simplify data analysis. The web interface offers a set of analyses with accompanying plots, but this set is limited. The amount of data across individual datasets is also growing. The goal is to create tools that leverage large language models (LLMs) to explore this data from a chat interface.

Goal

The goal of the project is to develop a chat interface using LLM-based AI for exploring the available data. The interface will need to interpret user instructions and generate working code that uses the CellMinerCDB API (rcellminer, see related links); that code will then be run and the response returned to the user. The data within the CellMinerCDB database is amenable to mapping as a graph for use with GraphRAG for data retrieval (e.g., 1: publications describe datasets, genes, drugs, and samples; 2: drugs target proteins).
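As a rough illustration of the graph mapping described above, here is a minimal Python sketch that stores the two example relations as triples and pulls the neighborhood of an entity as retrieval context. All entity names, the relation labels (`DESCRIBES`, `TARGETS`), and the helper functions are hypothetical placeholders, not part of CellMinerCDB or rcellminer; a real implementation would likely use a graph database and a GraphRAG library instead.

```python
from collections import defaultdict, deque

# Hypothetical triples; names are placeholders, not real CellMinerCDB records.
TRIPLES = [
    ("publication:P1", "DESCRIBES", "dataset:D1"),
    ("publication:P1", "DESCRIBES", "drug:drugA"),
    ("publication:P1", "DESCRIBES", "gene:GENE1"),
    ("drug:drugA", "TARGETS", "protein:PROT1"),
]


def build_index(triples):
    """Index every triple under both of its endpoints."""
    index = defaultdict(list)
    for subj, rel, obj in triples:
        index[subj].append((subj, rel, obj))
        index[obj].append((subj, rel, obj))
    return index


def retrieve_context(index, start, max_hops=1):
    """Breadth-first search returning triples within max_hops of start."""
    seen_nodes = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for triple in index.get(node, []):
            if triple not in found:
                found.append(triple)
            subj, _, obj = triple
            other = obj if subj == node else subj
            if other not in seen_nodes:
                seen_nodes.add(other)
                queue.append((other, depth + 1))
    return found


index = build_index(TRIPLES)
for subj, rel, obj in retrieve_context(index, "drug:drugA", max_hops=1):
    print(f"{subj} -[{rel}]-> {obj}")
```

The retrieved triples could be serialized into the LLM prompt as context before asking it to generate analysis code.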

Getting Started

Look at the links; read the CellMinerCDB publication and rcellminer documentation. Try to find similar projects or ideas online using LangChain. Draft a proposal with your ideas.

Difficulty Level: Medium

Simple to start, but a major challenge is getting "working code", which requires advanced prompt engineering and other techniques the candidate needs to identify.
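One technique a candidate might consider for the "working code" problem is a generate, run, and retry loop that feeds the error message back to the model. The sketch below is only an assumption about how that could look: `run_with_retries` and `fake_llm` are hypothetical names, `fake_llm` stands in for a real LLM call, and the generated code here is Python rather than R for simplicity. Running model-generated code with `exec` is unsafe outside a sandbox.

```python
import traceback


def run_with_retries(generate, task, max_attempts=3):
    """Ask `generate` for code, run it, and retry with the error on failure."""
    error = None
    for attempt in range(1, max_attempts + 1):
        code = generate(task, error)  # the previous error is given to the model
        namespace = {}
        try:
            # Assumption: generated code stores its answer in `result`.
            exec(code, namespace)  # NOTE: sandbox this in any real system
            return {"ok": True, "attempts": attempt,
                    "result": namespace.get("result")}
        except Exception:
            error = traceback.format_exc(limit=1)
    return {"ok": False, "attempts": max_attempts, "error": error}


def fake_llm(task, error):
    """Hypothetical stand-in for an LLM: fails once, then returns a fix."""
    if error is None:
        return "result = undefined_name + 1"  # raises NameError when run
    return "result = 1 + 1"


outcome = run_with_retries(fake_llm, "add two numbers")
print(outcome)
```

In a real system the retry prompt would also include the retrieved graph context and the relevant rcellminer documentation.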

Size and Length of Project

medium: 175 hours
12 weeks

Skills

Essential skills: Python
Nice to have skills: R, LangChain, LangGraph, data analysis, LLM APIs (GPT-4, Ollama, etc.), and Neo4j/Cypher

Related links

Potential Mentors

Augustin Luna

Neilblaze · 2025-02-15T08:43:08Z

Hi @cannin, although I understand that NRNB is currently awaiting approval from GSoC admin, this project seems very promising, and I'd love to contribute to this project if it's up for grabs. I'm a grad student in CS located in IST (+05:30). As per the compatibility, I have experience working with most of the mentioned skill sets, including Python (~2 years), and have a good hold on data analysis, LLMs, Prompt Engineering + Designing, RAG, Knowledge Graphs, etc. from my past research and hackathon experiences.

Please let me know the next steps to help me get started. Thanks! 🙏🏼

Also, in the meantime, I'd like to know the motivation behind opting for GraphRAG rather than LightRAG. LightRAG is comparatively much faster, more cost-efficient, and, most importantly, allows incremental updates to graphs without full regeneration. That saves time on every graph update, which is essential for high throughput and consistency of newly inserted data, assuming the chat system will be used by many people at once. Is there a reference design doc that I can look at to get a holistic picture of the system?

xts-Michi · 2025-02-20T10:47:32Z

Hey @cannin,

This project looks super interesting, and I'd love to contribute! I’m a Bioinformatics student with experience in Python and currently learning R, which I should be comfortable with by the time GSoC starts. I’ve worked with LLMs privately and have a strong interest in AI and Machine Learning.
Previously, I contributed to TFpredict, where we processed large biological datasets, so I’m comfortable working with structured data. This project excites me because it combines AI with bioinformatics, which is exactly the kind of challenge I enjoy.
I’d love to get started—are there any tips for familiarizing myself with CellMinerCDB, or should I just explore the repo and docs on my own?

Looking forward to your response!

cannin · 2025-02-20T14:56:06Z

@Neilblaze @xts-Michi I have added a Getting Started section.

Neilblaze · 2025-02-21T07:06:00Z

@cannin Perfect, thanks! 🙏🏼

xts-Michi · 2025-02-24T18:10:00Z

@cannin I was just told by Dr. Dräger that we already met at the COMBINE conference in Stuttgart in 2024. Maybe you remember that we talked about Systems Biology and the TFpredict project I presented?
I would also like to ask what the next steps are for the application and how you could become my mentor. I’m very interested and would love to know the concrete next steps so that I can apply for GSoC.

cannin · 2025-02-24T19:30:44Z

@xts-Michi thanks. check out Getting Started section.

khanspers · 2025-02-27T21:35:19Z

NRNB has been accepted as a mentoring organization for GSoC 2025. The contributor application period is March 24 – April 8. Here are some useful links:

GSoC contributor guide
NRNB project proposal template
Eligibility requirements
Full program timeline

AryanPrakhar · 2025-02-28T22:21:56Z

Hi @cannin,

I've been exploring CellMinerCDB, rcellminer, and GraphRAG, and this project aligns well with my experience. I'm a student at IIT (BHU) Varanasi and have worked on LLM benchmarks (ICLR’25) and multi-agent data-driven scientific discovery (ICML’24), along with open-source contributions through Code4GovTech.

I’ve started drafting my proposal and would appreciate any feedback. Would you be open to reviewing it? Looking forward to your thoughts!

cannin · 2025-03-03T22:44:33Z

@AryanPrakhar I am willing to review an application draft 1-2 times; send it to [email protected] if you think it is advanced enough.

sriramsowmithri9807 · 2025-03-07T02:43:45Z

Hi @cannin @khanspers

I’m really excited about the CellMinerCDB chat interface project and would love to contribute! The idea of using LLMs to simplify complex data exploration is fascinating, and I’m eager to help build a solution that makes the database more accessible.

I’ve started exploring the CellMinerCDB publication, rcellminer docs, and LangChain examples to get up to speed. I’m particularly interested in tackling the challenge of translating user queries into actionable code and leveraging GraphRAG for data retrieval.

Let me know how I can help or if there’s a good starting point to dive in! Looking forward to collaborating.

stellaxu2077 · 2025-03-09T19:54:17Z

Hi @cannin,

This project looks fascinating, and I’d love to contribute! I’m a master’s student in bioinformatics at the University of Copenhagen, with experience in deep generative models for cancer multi-omics data analysis. My current research focuses on deep learning models for RNA-binding site prediction, and I have hands-on experience with PyTorch, NumPy, Pandas, and high-performance computing.

I’m also familiar with GitHub workflows and have worked with protein language models especially those based on transformers. Recently, I started learning about knowledge graphs and their applications in biomedical research, which makes me particularly excited about the GraphRAG approach in this project.

I’d love to learn more about how I can get involved. Looking forward to your response!

cannin added Difficulty: Medium LLM Python R Shiny labels Feb 14, 2025

cannin self-assigned this Feb 14, 2025

khanspers added the Size: 175h label Feb 23, 2025
