Using GraphRAG for Automated Data Analysis with LLMs to Explore Cancer Pharmacogenomics Data #260

Open
cannin opened this issue Feb 14, 2025 · 11 comments

Comments

@cannin

cannin commented Feb 14, 2025

Background

CellMiner Cross-Database (CellMinerCDB, discover.nci.nih.gov/cellminercdb) allows integration and analysis of molecular and pharmacological data within and across cancer cell line datasets. The software is written in the R programming language using the R Shiny web framework. The database contains -omics data (e.g., gene expression, mutations), drug response data, and sample annotations, all stored internally as tables that have been processed to simplify data analysis. The web interface offers a set of analyses with accompanying plots, but that set is limited, and the amount of data across individual datasets keeps growing. The goal is to create tools that leverage large language models (LLMs) to explore this data from a chat interface.

Goal

The goal of the project is to develop a chat interface using LLM-based AI for exploring the available data. The interface will need to interpret user instructions and generate working code that uses the CellMinerCDB API (rcellminer, see related links); that code will then be run and a response returned. The data within the CellMinerCDB database is amenable to mapping as a graph for use with GraphRAG for data retrieval (e.g., 1: publications describe datasets, genes, drugs, and samples; 2: drugs target proteins).
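As a rough illustration of the graph mapping described above, here is a minimal sketch in plain Python. The node identifiers, the relation labels (DESCRIBES, CONTAINS, TARGETS), and the helper names are assumptions made for this example and are not the actual CellMinerCDB schema; the publication and dataset nodes are placeholders. A real implementation would more likely use a graph database such as Neo4j.

```python
from collections import defaultdict

# Illustrative triples only. Node names and relation labels are assumptions,
# not the actual CellMinerCDB schema; "example_*" nodes are placeholders.
TRIPLES = [
    ("publication:example_paper", "DESCRIBES", "dataset:example_dataset"),
    ("dataset:example_dataset", "CONTAINS", "drug:erlotinib"),
    ("dataset:example_dataset", "CONTAINS", "gene:EGFR"),
    ("drug:erlotinib", "TARGETS", "protein:EGFR"),
]


def build_graph(triples):
    """Index triples as an adjacency map: node -> list of (relation, neighbor)."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def retrieve_context(graph, start, max_hops=2):
    """Collect facts reachable from `start` within `max_hops`, breadth-first."""
    facts, frontier, seen = [], [start], {start}
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for relation, neighbor in graph.get(node, []):
                facts.append(f"{node} {relation} {neighbor}")
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return facts


graph = build_graph(TRIPLES)
context = retrieve_context(graph, "dataset:example_dataset")
```

The retrieved facts could be passed to the LLM as context alongside the user's question, so that generated code refers to entities that actually exist in the data.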

Getting Started

Look at the links; read the CellMinerCDB publication and rcellminer documentation. Try to find similar projects or ideas online using LangChain. Draft a proposal with your ideas.

Difficulty Level: Medium

Simple to start, but a major challenge is getting "working code" out of the LLM. That will require advanced prompt engineering and other techniques that the candidate needs to identify.
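One possible technique for the "working code" problem, offered only as a sketch, is to run the generated code and feed any error back into the next generation attempt. Everything below is hypothetical: `generate_with_retries`, `fake_generate`, and `fake_execute` are illustrative names, the generator stands in for a real LLM call, and the executor stands in for a sandboxed runner of rcellminer code.

```python
def generate_with_retries(generate, execute, request, max_attempts=3):
    """Ask `generate` for code, run it with `execute`, and feed errors back on failure."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(request, feedback)
        try:
            return {"attempt": attempt, "code": code, "result": execute(code)}
        except Exception as exc:
            # Pass the error text to the next generation attempt.
            feedback = f"Previous attempt failed with: {exc}"
    raise RuntimeError(
        f"No working code after {max_attempts} attempts; last feedback: {feedback}"
    )


def fake_generate(request, feedback):
    """Stand-in for an LLM call: fails first, succeeds once it sees feedback."""
    return "1 / 0" if feedback is None else "6 * 7"


def fake_execute(code):
    """Stand-in for a sandboxed runner; eval is used here for the demo only."""
    return eval(code, {"__builtins__": {}})


outcome = generate_with_retries(fake_generate, fake_execute, "multiply six by seven")
```

In this demo the first attempt raises a division error and the second succeeds, so `outcome["attempt"]` is 2. A real runner would need proper isolation rather than `eval`.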

Size and Length of Project

  • medium: 175 hours
  • 12 weeks

Skills

  • Essential skills: Python
  • Nice to have skills: R, LangChain, LangGraph, data analysis, LLM APIs (GPT-4, Ollama, etc.), and Neo4j/Cypher

Related links

Potential Mentors

Augustin Luna

@Neilblaze

Neilblaze commented Feb 15, 2025

Hi @cannin, although I understand that NRNB is currently awaiting approval from GSoC admin, this project seems very promising, and I'd love to contribute to this project if it's up for grabs. I'm a grad student in CS located in IST (+05:30). As per the compatibility, I have experience working with most of the mentioned skill sets, including Python (~2 years), and have a good hold on data analysis, LLMs, Prompt Engineering + Designing, RAG, Knowledge Graphs, etc. from my past research and hackathon experiences.

Please let me know the next steps to help me get started. Thanks! 🙏🏼


Also, in the meantime, I'd like to know the motivation behind opting for GraphRAG rather than LightRAG. LightRAG is comparatively much faster, more cost-efficient, and, most importantly, allows incremental updates to graphs without full regeneration. That saves time on every graph update, which is essential for high throughput and consistency of newly inserted data, assuming the chat system will be used by many people at once. Is there a reference design doc that I can look at to get a holistic picture of the system?

@xts-Michi

xts-Michi commented Feb 20, 2025

Hey @cannin,

This project looks super interesting, and I'd love to contribute! I’m a Bioinformatics student with experience in Python and currently learning R, which I should be comfortable with by the time GSoC starts. I’ve worked with LLMs privately and have a strong interest in AI and Machine Learning.
Previously, I contributed to TFpredict, where we processed large biological datasets, so I’m comfortable working with structured data. This project excites me because it combines AI with bioinformatics, which is exactly the kind of challenge I enjoy.
I’d love to get started—are there any tips for familiarizing myself with CellMinerCDB, or should I just explore the repo and docs on my own?

Looking forward to your response!

@cannin
Author

cannin commented Feb 20, 2025

@Neilblaze @xts-Michi I have added a Getting Started section.

@Neilblaze

@cannin Perfect, thanks! 🙏🏼

@xts-Michi

xts-Michi commented Feb 24, 2025

@cannin I was just told by Dr. Dräger that we already met at the COMBINE conference in Stuttgart in 2024. Maybe you remember that we talked about Systems Biology and the TFpredict project I presented?
I would also like to ask what the next steps are for the application and how you could become my mentor. I’m very interested and would love to know the concrete next steps so that I can apply for GSoC.

@cannin
Author

cannin commented Feb 24, 2025

@xts-Michi thanks. check out Getting Started section.

@khanspers
Contributor

NRNB has been accepted as a mentoring organization for GSoC 2025. The contributor application period is March 24 – April 8. Here are some useful links:

GSoC contributor guide
NRNB project proposal template
Eligibility requirements
Full program timeline

@AryanPrakhar

Hi @cannin,

I've been exploring CellMinerCDB, rcellminer, and GraphRAG, and this project aligns well with my experience. I'm a student at IIT (BHU) Varanasi and have worked on LLM benchmarks (ICLR’25) and multi-agent data-driven scientific discovery (ICML’24), along with open-source contributions through Code4GovTech.

I’ve started drafting my proposal and would appreciate any feedback. Would you be open to reviewing it? Looking forward to your thoughts!

@cannin
Author

cannin commented Mar 3, 2025

@AryanPrakhar I am willing to review an application draft 1-2 times; send it to [email protected] if you think it is advanced enough.

@sriramsowmithri9807

Hi @cannin @khanspers

I’m really excited about the CellMinerCDB chat interface project and would love to contribute! The idea of using LLMs to simplify complex data exploration is fascinating, and I’m eager to help build a solution that makes the database more accessible.

I’ve started exploring the CellMinerCDB publication, rcellminer docs, and LangChain examples to get up to speed. I’m particularly interested in tackling the challenge of translating user queries into actionable code and leveraging GraphRAG for data retrieval.

Let me know how I can help or if there’s a good starting point to dive in! Looking forward to collaborating.

@stellaxu2077

Hi @cannin,

This project looks fascinating, and I’d love to contribute! I’m a master’s student in bioinformatics at the University of Copenhagen, with experience in deep generative models for cancer multi-omics data analysis. My current research focuses on deep learning models for RNA-binding site prediction, and I have hands-on experience with PyTorch, NumPy, Pandas, and high-performance computing.

I’m also familiar with GitHub workflows and have worked with protein language models, especially those based on transformers. Recently, I started learning about knowledge graphs and their applications in biomedical research, which makes me particularly excited about the GraphRAG approach in this project.

I’d love to learn more about how I can get involved. Looking forward to your response!
