This is the official code repository for the paper "Biology-Driven Insights into the Power of Single-Cell Foundation Models". It currently covers 6 methods.
The code currently requires a GPU supported by FlashAttention, which scGPT needs in order to run.
GPUs supported by FlashAttention are:
- Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100).
- Turing GPUs (e.g., T4, RTX 2080).
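The GPU requirement above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: it assumes that the listed architectures correspond to CUDA compute capabilities 7.5 (Turing) through 9.0 (Hopper), and the names `supports_flash_attention` and `check_local_gpu` are hypothetical helpers introduced only for illustration.

```python
from typing import Tuple

# Compute capabilities of the architectures listed above:
# Turing = 7.5, Ampere = 8.0/8.6, Ada = 8.9, Hopper = 9.0.
# Treating this closed range as "supported" is an assumption of this sketch.
MIN_CAPABILITY = (7, 5)
MAX_CAPABILITY = (9, 0)


def supports_flash_attention(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability falls in the listed range."""
    return MIN_CAPABILITY <= tuple(capability) <= MAX_CAPABILITY


def check_local_gpu() -> bool:
    """Check GPU 0 on this machine; returns False when CUDA is unavailable."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        return False
    return supports_flash_attention(torch.cuda.get_device_capability(0))
```

For example, `supports_flash_attention((7, 0))` is `False` for a V100 (Volta), while an A100 reports `(8, 0)` and passes the check.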
Package versions
This code has been tested with the following versions of the packages:
- Python - tested with 3.8
- PyTorch - tested with 1.13.1+cu117
- CUDA - tested with 11.7
- lightning - tested with 2.2.0
- torch-geometric - tested with 2.6.1
- FlashAttention - depends on v1.0.4
- scGPT - depends on v0.2.1
- Geneformer - depends on commit 2a9eb7f
- UCE - depends on commit 7b31528
- scFoundation - depends on commit 948a8cc
- LangCell - depends on commit a60096b
- scvi-tools - depends on v0.16.4
- sc_foundation_evals - depends on v0.1.0
- scIB - depends on v1.0.5
- Islander - depends on commit 7934aa4
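To compare an existing environment against the tested versions listed above, a small check along these lines may help. This is a sketch under stated assumptions: the distribution names used as dictionary keys (`torch`, `lightning`, `torch-geometric`, `scvi-tools`) are guesses at how the packages are registered with pip, only a few packages are covered, and `find_mismatches` is a hypothetical helper rather than something the repository provides.

```python
from importlib import metadata
from typing import Callable, Dict

# Tested versions copied from the list above. The keys are ASSUMED pip
# distribution names; adjust them if your environment uses different ones.
TESTED = {
    "torch": "1.13.1+cu117",
    "lightning": "2.2.0",
    "torch-geometric": "2.6.1",
    "scvi-tools": "0.16.4",
}


def find_mismatches(
    tested: Dict[str, str],
    lookup: Callable[[str], str] = metadata.version,
) -> Dict[str, str]:
    """Map each package that is missing or differs from the tested version to a message."""
    problems = {}
    for name, wanted in tested.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            problems[name] = f"not installed (tested with {wanted})"
            continue
        if found != wanted:
            problems[name] = f"found {found}, tested with {wanted}"
    return problems
```

Calling `find_mismatches(TESTED)` inside the activated environment returns an empty dictionary when every listed package matches. A mismatch does not necessarily mean the code will fail; it only means that combination was not the tested one.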
You can download the conda-packed environment file and extract it into ${anaconda_install_dir}/envs,
where ${anaconda_install_dir} is the directory in which Anaconda is installed.
mkdir ${anaconda_install_dir}/envs/singlecell
tar -xzvf singlecell.tar.gz -C ${anaconda_install_dir}/envs/singlecell
conda activate singlecell
We modified the code of UCE and scFoundation. Apply the provided git patches to sync these modifications.
git clone https://github.com/snap-stanford/UCE
cd UCE
git checkout 7b31528b84e4c8e7a9717c61e3d03ff7559c61af
git apply ../patch/uce_changes.patch
All necessary model files will be downloaded automatically the first time the eval_single_anndata.py script is run.
git clone https://github.com/biomap-research/scFoundation
mv scFoundation xTrimoGene
cd xTrimoGene
git checkout 948a8ccb950d096148cf03418d870acdcadebd7b
git apply ../patch/scfoundation_changes.patch
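The checkout-and-patch steps above are the same for both repositories, so they can be wrapped in one helper that first verifies the patch applies cleanly. This is only a sketch: `patch_commands` and `apply_patch` are hypothetical names, and the extra `git apply --check` dry run is an addition that is not part of the original instructions.

```python
import subprocess
from typing import List


def patch_commands(commit: str, patch_file: str) -> List[List[str]]:
    """Git commands to run inside an already cloned repository."""
    return [
        ["git", "checkout", commit],
        # Dry run: exits with an error if the patch would not apply cleanly.
        ["git", "apply", "--check", patch_file],
        ["git", "apply", patch_file],
    ]


def apply_patch(repo_dir: str, commit: str, patch_file: str) -> None:
    """Run the commands in order inside repo_dir and stop at the first failure."""
    for argv in patch_commands(commit, patch_file):
        subprocess.run(argv, cwd=repo_dir, check=True)
```

For UCE this would be called as `apply_patch("UCE", "7b31528b84e4c8e7a9717c61e3d03ff7559c61af", "../patch/uce_changes.patch")` after cloning, and analogously for xTrimoGene with its commit and `../patch/scfoundation_changes.patch`.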
The scgpt and geneformer packages are already installed in the provided conda environment.
The code for applying scGPT, Geneformer and LangCell is located in the scFM-Bench/sc_foundation_evals subfolder.
Download the datasets and the scFM checkpoints used in this benchmark from Zenodo.
Please unzip datasets.tar.gz, TISCH.tar.gz and weights.tar.gz in the scFM-Bench/data directory, which should then look like this:
├── datasets
│   ├── HLCA_core.h5ad
│   ├── Immune_all_human.h5ad
│   ├── pancreas_scib.h5ad
│   └── Tabula_Sapiens_all.h5ad
├── TISCH
│   ├── Blood
│   │   ├── AEL_GSE142213_CellMetainfo_table.tsv
│   │   ├── AEL_GSE142213_expression.h5
│   │   ├── ALL_GSE132509_CellMetainfo_table.tsv
│   │   ├── ALL_GSE132509_expression.h5
│   │   ├── AML_GSE116256_CellMetainfo_table.tsv
│   │   └── AML_GSE116256_expression.h5
│   ├── Bone
│   │   ├── MM_GSE117156_CellMetainfo_table.tsv
│   │   └── MM_GSE117156_expression.h5
│   ├── Brain
│   │   ├── Glioma_GSE131928_10X_CellMetainfo_table.tsv
│   │   ├── Glioma_GSE131928_10X_expression.h5
│   │   ├── Glioma_GSE138794_CellMetainfo_table.tsv
│   │   ├── Glioma_GSE138794_expression.h5
│   │   ├── Glioma_GSE139448_CellMetainfo_table.tsv
│   │   ├── Glioma_GSE139448_expression.h5
│   │   ├── Glioma_GSE141982_CellMetainfo_table.tsv
│   │   ├── Glioma_GSE141982_expression.h5
│   │   ├── MB_GSE119926_CellMetainfo_table.tsv
│   │   └── MB_GSE119926_expression.h5
│   ├── Eye
│   │   ├── UVM_GSE139829_CellMetainfo_table.tsv
│   │   └── UVM_GSE139829_expression.h5
│   └── preprocess_data.ipynb
└── weights
    ├── Geneformer
    │   ├── default
    │   │   ├── 12L
    │   │   │   ├── config.json
    │   │   │   ├── pytorch_model.bin
    │   │   │   └── training_args.bin
    │   │   └── 6L
    │   │       ├── config.json
    │   │       ├── pytorch_model.bin
    │   │       ├── README.md
    │   │       └── training_args.bin
    │   └── dicts
    │       ├── gene_median_dictionary.pkl
    │       ├── gene_name_id_dict.pkl
    │       └── token_dictionary.pkl
    ├── LangCell
    │   ├── cell_bert
    │   │   ├── config.json
    │   │   └── pytorch_model.bin
    │   ├── cell_proj.bin
    │   ├── config.json
    │   ├── ctm_head.bin
    │   ├── text_bert
    │   │   ├── config.json
    │   │   └── pytorch_model.bin
    │   ├── text_proj.bin
    │   └── tokenizer
    │       └── BiomedBERT
    │           ├── tokenizer_config.json
    │           └── vocab.txt
    ├── scFoundation
    │   └── models.ckpt
    ├── scgpt
    │   └── scGPT_human
    │       ├── args.json
    │       ├── best_model.pt
    │       └── vocab.json
    └── UCE
        └── 33l_8ep_1024t_1280.torch
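After unpacking the archives, it can be worth confirming that the layout matches the tree above before starting any long-running job. The sketch below checks a hand-picked subset of the files shown in the tree; `EXPECTED` and `missing_files` are hypothetical names introduced for this example, and the check should be run before the scFoundation checkpoint is moved as described in Note 2.

```python
from pathlib import Path
from typing import Iterable, List

# A subset of the files from the tree above, relative to scFM-Bench/data.
EXPECTED = [
    "datasets/HLCA_core.h5ad",
    "datasets/Immune_all_human.h5ad",
    "datasets/pancreas_scib.h5ad",
    "datasets/Tabula_Sapiens_all.h5ad",
    "weights/Geneformer/dicts/token_dictionary.pkl",
    "weights/LangCell/cell_bert/pytorch_model.bin",
    "weights/scFoundation/models.ckpt",
    "weights/scgpt/scGPT_human/best_model.pt",
    "weights/UCE/33l_8ep_1024t_1280.torch",
]


def missing_files(data_dir: str, expected: Iterable[str] = EXPECTED) -> List[str]:
    """Return the expected relative paths that are not regular files under data_dir."""
    root = Path(data_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Running `missing_files("data")` from the scFM-Bench project folder returns an empty list when all of the listed files are in place.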
Note 1: The TISCH datasets should first be processed by running the code in data/TISCH/preprocess_data.ipynb.
Note 2: The checkpoint for xTrimoGene (scFoundation) should be moved to the xTrimoGene/model/models directory.
# cd to the scFM-Bench project folder
mv data/weights/scFoundation/models.ckpt xTrimoGene/model/models
python 1_extract_gene_embeddings.py
Note: Please extract Geneformer embeddings before LangCell embeddings, because the data preprocessing is implemented in the geneformer module.
# for datasets from scib (Pancreas and Immune)
bash scripts/get_cell_embeddings_scib.sh
# for datasets from cellxgene (HLCA and Tabula Sapiens)
bash scripts/get_cell_embeddings_cellxgene.sh
# for datasets that are already processed (TISCH)
bash scripts/get_cell_embeddings_normalized.sh
The baseline code is from FRoGS. See details in the scFM-Bench/FRoGS subfolder.
The evaluation metrics are based on scvi-tools and scGraph.
# for datasets from scib
bash scripts/calculate_cluster_metrics.sh
# for datasets from cellxgene
bash scripts/calculate_cluster_metrics_cellxgene.sh
Output files:
- clustering_metrics.csv: a CSV file containing the results of the scIB metrics.
- X_umap.npy: an ndarray containing the UMAP coordinates of the cell embeddings.
- clustering_umap_batch.png: the UMAP plot of the cell embeddings colored by batch labels.
- clustering_umap_celltype.png: the UMAP plot of the cell embeddings colored by cell type labels.
The UMAP coordinates and PNG files will be saved in the output directory.
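The CSV and ndarray outputs can be loaded back for further analysis. The following is a minimal sketch, assuming the CSV has a header row (its column names are not documented here, so none are hard-coded) and that the files sit wherever the scripts wrote them; `load_metrics` and `load_umap` are hypothetical helper names.

```python
import csv
from typing import Dict, List


def load_metrics(csv_path: str) -> List[Dict[str, str]]:
    """Read clustering_metrics.csv into a list of row dictionaries keyed by header."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def load_umap(npy_path: str):
    """Load the UMAP coordinates saved as X_umap.npy."""
    import numpy as np  # lazy import; NumPy is needed only for this step

    return np.load(npy_path)
```

Values returned by `load_metrics` are strings, so convert them with `float(...)` before doing arithmetic on the scores.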
cd scGraph
python scGraph_cl_ontology.py {dataset_name}
By default, the output files will be saved in the output/scGraph directory. For each dataset, there are two output files:
- {dataset_name}.csv: a CSV file containing the results of the original scGraph metric and our proposed scGraph-OntoRWR metric (averaged across all cell types).
- {dataset_name}_rwr_detailed.csv: a CSV file containing the cell type-specific scGraph-OntoRWR scores.
The baseline code is from Onclass. See details in the scFM-Bench/Onclass subfolder.
The baseline code is from SequencingCancerFinder. See details in the scFM-Bench/SequencingCancerFinder subfolder.
The baseline code is from SCAD. See details in the scFM-Bench/DrugSensitivity subfolder.
Our implementation uses Microsoft's zero-shot-foundation code as a starting point. We thank the authors for their great work and code, and encourage interested readers to check out their work as well.