scFM-Bench

Introduction

This is the official code repository of the paper Biology-Driven Insights into the Power of Single-Cell Foundation Models. Until now, we have included 6 methods as shown below:

Model	Paper link	Github
scVI	https://doi.org/10.1038/s41592-018-0229-2	https://github.com/scverse/scvi-tools
Geneformer	https://www.nature.com/articles/s41586-023-06139-9	https://huggingface.co/ctheodoris/Geneformer
scGPT	https://www.nature.com/articles/s41592-024-02201-0	https://github.com/bowang-lab/scGPT
UCE	http://biorxiv.org/lookup/doi/10.1101/2023.11.28.568918	https://github.com/snap-stanford/UCE
scFoundation	https://www.nature.com/articles/s41592-024-02305-7	https://github.com/biomap-research/scFoundation
LangCell	http://arxiv.org/abs/2405.06708	https://github.com/PharMolix/LangCell

Dependencies

Currently the code requires the GPUs supported by flash attention, required for scGPT to run.

GPUs supported by flash attention are:

Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100).
Turing GPUs (T4, RTX 2080)

Packages version

This code has been tested with the following versions of the packages:

Python - tested with 3.8
PyTorch - tested with - 1.13.1+cu117
CUDA - tested with 11.7
lightning - tested with 2.2.0
torch-geometric - tested with 2.6.1
FlashAttention - depends on v1.0.4
scGPT - depends on v0.2.1
Geneformer - depends on commit 2a9eb7f
UCE - depends on commit 7b31528
scFoundation - depends on commit 948a8cc
LangCell - depends on commit a60096b
scvi-tools - depends on v0.16.4
sc_foundation_evals - depends on v0.1.0
scIB - depends on v1.0.5
Islander - depends on commit 7934aa4

Prepare the environment for applying scFMs

You can download the conda-packed file, and then unzip it in ${anaconda_install_dir}/envs (the directory where the anaconda is installed).

mkdir ${anaconda_install_dir}/envs/singlecell
tar -xzvf singlecell.tar.gz -C ${anaconda_install_dir}/envs/singlecell
conda activate singlecell

We modify the codes in UCE and scFoudnation and apply git patch to sync the modifications.

UCE

git clone https://github.com/snap-stanford/UCE
cd UCE
git checkout 7b31528b84e4c8e7a9717c61e3d03ff7559c61af
git apply ../patch/uce_changes.patch

All necessary model files will be downloaded automatically when first running the eval_single_anndata.py script.

scFoundation

git clone https://github.com/biomap-research/scFoundation
mv scFoundation xTrimoGene
cd xTrimoGene
git checkout 948a8ccb950d096148cf03418d870acdcadebd7b
git apply ../patch/scfoundation_changes.patch

Others

The scgpt and geneformer packages have been already installed in our provided conda environment. The codes for applying scGPT, Geneformer and LangCell are located in the scFM-Bench/sc_foundation_evals subfolder.

Download data and checkpoints

Download the datasets and checkpoints of scFMs used in this benchmarking work from zenodo. Please unzip datasets.tar.gz, TISCH.tar.gz and weights.tar.gz in the scFM-Bench/data directory, which looks like this:

├── datasets
│   ├── HLCA_core.h5ad
│   ├── Immune_all_human.h5ad
│   ├── pancreas_scib.h5ad
│   └── Tabula_Sapiens_all.h5ad
├── TISCH
│   ├── Blood
│   │   ├── AEL_GSE142213_CellMetainfo_table.tsv
│   │   ├── AEL_GSE142213_expression.h5
│   │   ├── ALL_GSE132509_CellMetainfo_table.tsv
│   │   ├── ALL_GSE132509_expression.h5
│   │   ├── AML_GSE116256_CellMetainfo_table.tsv
│   │   └── AML_GSE116256_expression.h5
│   ├── Bone
│   │   ├── MM_GSE117156_CellMetainfo_table.tsv
│   │   └── MM_GSE117156_expression.h5
│   ├── Brain
│   │   ├── Glioma_GSE131928_10X_CellMetainfo_table.tsv
│   │   ├── Glioma_GSE131928_10X_expression.h5
│   │   ├── Glioma_GSE138794_CellMetainfo_table.tsv
│   │   ├── Glioma_GSE138794_expression.h5
│   │   ├── Glioma_GSE139448_CellMetainfo_table.tsv
│   │   ├── Glioma_GSE139448_expression.h5
│   │   ├── Glioma_GSE141982_CellMetainfo_table.tsv
│   │   ├── Glioma_GSE141982_expression.h5
│   │   ├── MB_GSE119926_CellMetainfo_table.tsv
│   │   └── MB_GSE119926_expression.h5
│   ├── Eye
│   │   ├── UVM_GSE139829_CellMetainfo_table.tsv
│   │   └── UVM_GSE139829_expression.h5
│   └── preprocess_data.ipynb
└── weights
    ├── Geneformer
    │   ├── default
    │   │   ├── 12L
    │   │   │   ├── config.json
    │   │   │   ├── pytorch_model.bin
    │   │   │   └── training_args.bin
    │   │   └── 6L
    │   │       ├── config.json
    │   │       ├── pytorch_model.bin
    │   │       ├── README.md
    │   │       └── training_args.bin
    │   └── dicts
    │       ├── gene_median_dictionary.pkl
    │       ├── gene_name_id_dict.pkl
    │       └── token_dictionary.pkl
    ├── LangCell
    │   ├── cell_bert
    │   │   ├── config.json
    │   │   └── pytorch_model.bin
    │   ├── cell_proj.bin
    │   ├── config.json
    │   ├── ctm_head.bin
    │   ├── text_bert
    │   │   ├── config.json
    │   │   └── pytorch_model.bin
    │   ├── text_proj.bin
    │   └── tokenizer
    │       └── BiomedBERT
    │           ├── tokenizer_config.json
    │           └── vocab.txt
    ├── scFoundation
    │   └── models.ckpt
    ├── scgpt
    │   └── scGPT_human
    │       ├── args.json
    │       ├── best_model.pt
    │       └── vocab.json
    └── UCE
        └── 33l_8ep_1024t_1280.torch

Note 1: The TISCH datasets shoule be firstly processed via running the codes in data/TISCH/preprocess_data.ipynb.

Note 2: The checkpoint for xTrimoGene (scFoundation) should be moved to the xTrimoGene/model/models directory.

# cd to the scFM-Bench project folder
mv data/weights/scFoundation/models.ckpt xTrimoGene/model/models

Extract pretrained embeddings

Gene embeddings

python 1_extract_gene_embeddings.py

Cell embeddings

Note: Please extract geneformer embeddings before LangCell because the data preprocessing is implemented in the geneformer module.

# for datasets from scib (Pancreas and Immune)
bash scripts/get_cell_embeddings_scib.sh

# for datasets from cellxgene (HLCA and Tabula Sapiens)
bash scripts/get_cell_embeddings_cellxgene.sh

# for datasets that are already processed (TISCH)
bash scripts/get_cell_embeddings_normalized.sh

Downstream tasks

Gene function prediction

The baseline code is from FRoGS. See details in the scFM-Bench/FRoGS subfolder.

Batch integration

The evaluation metrics is based on scvi-tools and scGraph.

scIB metrics

# for datasets from scib
bash scripts/calculate_cluster_metrics.sh

# for datasets from cellxgene
bash scripts/calculate_cluster_metrics_cellxgene.sh

Output files:

clustering_metrics.csv: a csv file contains the results of scIB metrics.
X_umap.npy: a ndarray contains the umap coordinates of cell embeddings.
clustering_umap_batch.png: the umap plot of cell embeddings colored by batch labels.
clustering_umap_celltype.png: the umap plot of cell embeddings colored by cell type labels.

The umap coordinates and png will be saved in the output directory.

scGraph metrics

cd scGraph
python scGraph_cl_ontology.py {dataset_name}

By default, the output files will be saved the output/scGraph directory. For each specific dataset, there are two output files:

{dataset_name}.csv: a csv file contains the results of the original scGraph and our proposed scGraph-OntoRWR metrics (average across all cell types).
{dataset_name}_rwr_detailed.csv: a csv file contains the cell type-specific scGraph-OntoRWR scores.

Cell type annotation

The baseline code is from Onclass. See details in the scFM-Bench/Onclass subfolder.

Cancer cell identification

The baseline code is from SequencingCancerFinder. See details in the scFM-Bench/SequencingCancerFinder subfolder.

Drug sensitivity prediction

The baseline code is from SCAD. See details in the scFM-Bench/DrugSensitivity subfolder.

Acknowledgments

Our implementation uses microsoft's zero-shot-foundation code as a starting point. Thanks for their great work and code, hope readers of interest could check their work, too.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scFM-Bench

Introduction

Dependencies

Prepare the environment for applying scFMs

UCE

scFoundation

Others

Download data and checkpoints

Extract pretrained embeddings

Gene embeddings

Cell embeddings

Downstream tasks

Gene function prediction

Batch integration

scIB metrics

scGraph metrics

Cell type annotation

Cancer cell identification

Drug sensitivity prediction

Acknowledgments

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
DrugSensitivity		DrugSensitivity
FRoGS		FRoGS
OnClass		OnClass
SequencingCancerFinder		SequencingCancerFinder
patch		patch
scGraph		scGraph
sc_foundation_evals		sc_foundation_evals
scripts		scripts
.gitignore		.gitignore
1_extract_gene_embeddings.py		1_extract_gene_embeddings.py
2_extract_cell_embeddings.py		2_extract_cell_embeddings.py
3_cell_clustering.py		3_cell_clustering.py
README.md		README.md
loggers.py		loggers.py

wujialu/scFM-Bench

Folders and files

Latest commit

History

Repository files navigation

scFM-Bench

Introduction

Dependencies

Prepare the environment for applying scFMs

UCE

scFoundation

Others

Download data and checkpoints

Extract pretrained embeddings

Gene embeddings

Cell embeddings

Downstream tasks

Gene function prediction

Batch integration

scIB metrics

scGraph metrics

Cell type annotation

Cancer cell identification

Drug sensitivity prediction

Acknowledgments

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages