Using GraphRAG for Automated Data Analysis with LLMs to Explore Cancer Pharmacogenomics Data #260
Comments
Hi @cannin, although I understand that NRNB is currently awaiting approval from GSoC admin, this project seems very promising, and I'd love to contribute if it's up for grabs. I'm a grad student in CS located in IST (+05:30). As for fit, I have experience with most of the listed skills, including Python (~2 years), and a good grasp of data analysis, LLMs, prompt engineering and design, RAG, knowledge graphs, etc. from my past research and hackathon experience. Please let me know the next steps to help me get started. Thanks! 🙏🏼 Also, in the meantime, I'd like to know the motivation for choosing GraphRAG over LightRAG. LightRAG is comparatively much faster, more cost-efficient, and, most importantly, allows incremental updates to the graph without full regeneration. That saves time on every update, which is essential for high throughput and consistency of newly inserted data, assuming the chat system will be used by many people at any given time. Is there a reference design doc that I can refer to for a holistic picture of the system?
Hey @cannin, This project looks super interesting, and I'd love to contribute! I’m a Bioinformatics student with experience in Python and currently learning R, which I should be comfortable with by the time GSoC starts. I’ve worked with LLMs privately and have a strong interest in AI and Machine Learning. Looking forward to your response!
@Neilblaze @xts-Michi I have added a Getting Started section.
@cannin Perfect, thanks! 🙏🏼
@cannin I was just told by Dr. Dräger that we already met at the COMBINE conference in Stuttgart in 2024. Maybe you remember that we talked about Systems Biology and the TFpredict project I presented?
@xts-Michi thanks. check out Getting Started section.
NRNB has been accepted as a mentoring organization for GSoC 2025. The contributor application period is March 24 – April 8. Here are some useful links: GSoC contributor guide
Hi @cannin, I've been exploring CellMinerCDB, rcellminer, and GraphRAG, and this project aligns well with my experience. I'm a student at IIT (BHU) Varanasi and have worked on LLM benchmarks (ICLR’25) and multi-agent data-driven scientific discovery (ICML’24), along with open-source contributions through Code4GovTech. I’ve started drafting my proposal and would appreciate any feedback. Would you be open to reviewing it? Looking forward to your thoughts!
@AryanPrakhar I am willing to review an application draft 1-2 times; [email protected] send if you think it is advanced enough.
I’m really excited about the CellMinerCDB chat interface project and would love to contribute! The idea of using LLMs to simplify complex data exploration is fascinating, and I’m eager to help build a solution that makes the database more accessible. I’ve started exploring the CellMinerCDB publication, rcellminer docs, and LangChain examples to get up to speed. I’m particularly interested in tackling the challenge of translating user queries into actionable code and leveraging GraphRAG for data retrieval. Let me know how I can help or if there’s a good starting point to dive in! Looking forward to collaborating.
Hi @cannin, This project looks fascinating, and I’d love to contribute! I’m a master’s student in bioinformatics at the University of Copenhagen, with experience in deep generative models for cancer multi-omics data analysis. My current research focuses on deep learning models for RNA-binding site prediction, and I have hands-on experience with PyTorch, NumPy, Pandas, and high-performance computing. I’m also familiar with GitHub workflows and have worked with protein language models, especially those based on transformers. Recently, I started learning about knowledge graphs and their applications in biomedical research, which makes me particularly excited about the GraphRAG approach in this project. I’d love to learn more about how I can get involved. Looking forward to your response!
Background
CellMiner Cross-Database (CellMinerCDB, discover.nci.nih.gov/cellminercdb) allows integration and analysis of molecular and pharmacological data within and across cancer cell line datasets. The software is written in the R programming language using the R Shiny web framework. The database contains -omics data (e.g., gene expression, mutations), drug response data, and sample annotations, all stored internally as tables that have been processed to simplify data analysis. The web interface allows a set of analyses to be performed with accompanying plots, but it is limited. The amount of data across individual datasets is also growing. The goal is to create tools that leverage large language models (LLMs) to explore this data from a chat interface.
Goal
The goal of the project is to develop a chat interface using LLM-based AI for exploring the available data. The interface will need to interpret user instructions and generate working code that uses the CellMinerCDB API (rcellminer, see related links); that code will then be run and a response returned. The data within the CellMinerCDB database is amenable to mapping as a graph for use with GraphRAG for data retrieval (e.g., 1: publications describe datasets, genes, drugs, samples; 2: drugs target proteins).
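As an illustration of the graph mapping mentioned above, here is a minimal sketch in plain Python. It is not the GraphRAG library's API and not the project's design; the triples, the entity names (`publication_1`, `drug_D`, `protein_P`, and so on), and the helper functions are hypothetical placeholders. It only shows how "describes" and "targets" relations could be stored and how the neighborhood of an entity might be collected as context for an LLM prompt.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples; every name is a placeholder,
# not real CellMinerCDB content.
TRIPLES = [
    ("publication_1", "describes", "dataset_A"),
    ("publication_1", "describes", "gene_X"),
    ("publication_1", "describes", "drug_D"),
    ("publication_1", "describes", "sample_S"),
    ("drug_D", "targets", "protein_P"),
]


def build_index(triples):
    """Index each triple under both endpoints so edges can be walked either way."""
    index = defaultdict(list)
    for subj, rel, obj in triples:
        index[subj].append((rel, obj))
        index[obj].append((f"inverse_{rel}", subj))
    return index


def neighborhood(index, entity, hops=1):
    """Collect the facts reachable from `entity` within `hops` steps (breadth-first)."""
    seen = {entity}
    frontier = [entity]
    facts = []
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for rel, other in index.get(node, []):
                facts.append((node, rel, other))
                if other not in seen:
                    seen.add(other)
                    next_frontier.append(other)
        frontier = next_frontier
    return facts


INDEX = build_index(TRIPLES)
# Facts one hop away from the placeholder drug; these could be rendered as text
# and placed in the prompt as retrieved context.
context = neighborhood(INDEX, "drug_D", hops=1)
```

A real implementation would presumably derive the triples from the CellMinerCDB tables and publications rather than hard-coding them, and would likely use a graph store instead of an in-memory dictionary.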
Getting Started
Look at the links; read the CellMinerCDB publication and rcellminer documentation. Try to find similar projects or ideas online using LangChain. Draft a proposal with your ideas.
Difficulty Level: Medium
Simple to start, but a major challenge is getting "working code", which will require advanced prompt engineering and other techniques that the candidate needs to identify.
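One technique a candidate might consider for the "working code" problem is a generate, run, and retry loop, where execution errors are fed back to the model. The sketch below is only an illustration under stated assumptions: `fake_generate` is a hypothetical stand-in for an LLM call, and Python's `exec` stands in for running generated R code against rcellminer, which is what the project would actually need.

```python
import traceback


def fake_generate(instruction, feedback=None):
    """Hypothetical stand-in for an LLM call.

    Returns broken code on the first try and corrected code once it
    receives error feedback, to mimic a model repairing its output.
    """
    if feedback is None:
        return "result = undefined_name + 1"
    return "result = 1 + 1"


def run_with_retries(instruction, generate, max_attempts=3):
    """Generate code, execute it, and pass any traceback back to the generator."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(instruction, feedback)
        namespace = {}
        try:
            # Assumption: a real system would run this in a sandboxed process.
            exec(code, namespace)
            return {"ok": True, "attempts": attempt, "result": namespace.get("result")}
        except Exception:
            # Keep the traceback so the next generation attempt can see what failed.
            feedback = traceback.format_exc()
    return {"ok": False, "attempts": max_attempts, "error": feedback}


outcome = run_with_retries("add one and one", fake_generate)
```

Capping the number of attempts keeps a model that never converges from looping indefinitely; the final traceback is returned so the chat interface can report the failure to the user.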
Size and Length of Project
Skills
Related links
Potential Mentors
Augustin Luna