A framework for processing and analyzing Electronic Health Records (EHR) data using BERT-based models.
COREBEHRT helps researchers and data scientists preprocess EHR data, train models, and generate outcomes for downstream clinical predictions and analyses.
- End-to-end EHR Pipeline: Tools for data ingestion, cleaning, and feature extraction.
- BERT-based Modeling: Pretraining on massive EHR corpora followed by task-specific finetuning.
- Cohort Management: Flexible inclusion/exclusion logic, temporal alignment, outcome definition.
- Scalable: Designed to run both locally and on cloud infrastructure (Azure).
- Built-in Validation: Cross-validation and out-of-time evaluation strategies.
Below is a high-level overview of the most important directories:
- main: Primary pipeline scripts (create_data, pretrain, finetune, etc.)
- modules: Core implementation of model architecture and data processing (detailed overview)
- configs: YAML configuration files for each pipeline stage
- functional: Pure utility functions supporting module operations (detailed overview)
- azure: Cloud deployment and execution utilities (azure instructions)
For running tests and pipelines, create and activate a virtual environment, then install the required dependencies:
python -m venv .venv
source .venv/bin/activate
(.venv) pip install -r requirements.txt
Below is a high-level description of the steps in the COREBEHRT pipeline. For detailed configuration options, see the main README. The pipeline can be run from the root directory by executing the following commands:
(.venv) python -m corebehrt.main.create_data
(.venv) python -m corebehrt.main.prepare_training_data --config_path corebehrt/configs/prepare_pretrain.yaml
(.venv) python -m corebehrt.main.pretrain
(.venv) python -m corebehrt.main.create_outcomes
(.venv) python -m corebehrt.main.select_cohort
(.venv) python -m corebehrt.main.prepare_training_data --config_path corebehrt/configs/prepare_finetune.yaml
(.venv) python -m corebehrt.main.finetune_cv
(.venv) python -m corebehrt.main.select_cohort --config_path corebehrt/configs/select_cohort_held_out.yaml
(.venv) python -m corebehrt.main.prepare_training_data --config_path corebehrt/configs/prepare_held_out.yaml
(.venv) python -m corebehrt.main.evaluate_finetune --config_path corebehrt/configs/evaluate_finetune.yaml
Before using COREBEHRT, you need to convert your raw healthcare data into the MEDS (Medical Event Data Standard) format. We provide a companion tool, ehr2meds, to help with this conversion:
- Converts source data (e.g., hospital EHR dumps, registry data) into MEDS
- Performs code normalization and standardization
- Provides configuration options for handling different data sources
- Includes validation to ensure data quality
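As a rough illustration, MEDS represents data as one row per clinical event. The field names below follow the public MEDS schema (`subject_id`, `time`, `code`); the codes and values are invented, and the validation helper is a sketch of the kind of quality check performed, not part of ehr2meds:

```python
# Hypothetical MEDS-style event rows: one row per clinical event.
# Codes and timestamps are invented for illustration.
meds_events = [
    {"subject_id": 1, "time": "2020-01-01T08:00:00", "code": "DOB"},
    {"subject_id": 1, "time": "2020-03-15T10:30:00", "code": "ICD10/E11.9"},
    {"subject_id": 1, "time": "2020-03-15T10:30:00", "code": "ATC/A10BA02"},
]

def validate_events(events):
    """Minimal quality check: every event needs a subject, a timestamp, and a code."""
    required = {"subject_id", "time", "code"}
    return all(required <= event.keys() for event in events)
```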
- Goal: Convert MEDS into tokenized features suitable for model training.
- Key Tasks:
- Vocabulary Mapping: Translates raw medical concepts (e.g., diagnoses, procedures) into numerical tokens.
- Temporal Alignment: Converts timestamps into relative positions (e.g., hours or days from an index date).
- Background Variables: Incorporates static features such as age, gender, or other demographics.
- Efficient Output: Produces a structured parquet format that can be rapidly loaded in subsequent steps.
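The two central transformations can be sketched as follows. This is an illustrative simplification, not COREBEHRT's actual implementation; the special tokens and helper names are assumptions:

```python
from datetime import datetime

def build_vocab(codes, special_tokens=("[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]")):
    """Vocabulary mapping: assign each medical concept a numerical token id."""
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    for code in codes:
        if code not in vocab:
            vocab[code] = len(vocab)
    return vocab

def tokenize(codes, timestamps, vocab, index_date):
    """Map codes to token ids and timestamps to relative positions (days from index date)."""
    token_ids = [vocab.get(c, vocab["[UNK]"]) for c in codes]
    positions = [(t - index_date).days for t in timestamps]
    return token_ids, positions
```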
- Goal: Train a ModernBERT model via masked language modeling.
- Key Tasks:
- Large-scale self-supervised training on EHR sequences
- Embedding temporal relationships between medical events
- Saving checkpoints for downstream finetuning
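The masking step at the heart of masked language modeling can be sketched like this (a simplified illustration, not COREBEHRT's trainer; the 15% mask probability and the `-100` ignore-label convention follow common BERT practice):

```python
import random

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15, rng=None):
    """Replace a random fraction of tokens with [MASK]; the model must recover them."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_token_id)
            labels.append(tok)        # loss is computed at masked positions
        else:
            inputs.append(tok)
            labels.append(-100)       # conventionally ignored by the loss
    return inputs, labels
```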
- Goal: Generate outcomes from the formatted data for supervised learning.
- Key Tasks:
- Search for specific concepts (medications, diagnoses, procedures) in the data
- Optionally create exposure definitions for more complex study designs
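The concept search amounts to scanning each patient's events for target codes. A minimal sketch, assuming events as `(subject_id, time, code)` tuples and taking the first match per patient as the outcome date (names are illustrative, not COREBEHRT's API):

```python
def find_outcomes(events, target_codes):
    """Return the timestamp of each subject's first event matching a target code."""
    outcomes = {}
    # Sort by subject and time so the first match is the earliest occurrence.
    for subject_id, time, code in sorted(events, key=lambda e: (e[0], e[1])):
        if code in target_codes and subject_id not in outcomes:
            outcomes[subject_id] = time
    return outcomes
```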
- Goal: Define the study population
- Key Tasks:
- Apply inclusion/exclusion criteria (e.g., age, prior outcomes)
- Generate index dates for each patient
- Produce folds and test set for cross-validation
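The selection and splitting logic can be sketched as below. The age criterion, fold scheme, and field name `age_at_index` are illustrative assumptions, not the actual configuration options:

```python
def select_cohort(patients, min_age=18, n_folds=2, test_frac=0.25):
    """Filter by an inclusion criterion, then split into CV folds and a test set."""
    eligible = [p for p in patients if p["age_at_index"] >= min_age]
    n_test = int(len(eligible) * test_frac)
    test, train = eligible[:n_test], eligible[n_test:]
    # Round-robin assignment into k folds for cross-validation.
    folds = [train[i::n_folds] for i in range(n_folds)]
    return folds, test
```

In practice the splits would be randomized and stratified; this sketch only shows the shape of the output.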
- Goal: Adapt the pretrained model for specific binary outcomes
- Key Tasks:
- K-fold cross-validation
- Early stopping and evaluation on the test set
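Early stopping can be illustrated with a minimal sketch (not COREBEHRT's trainer): finetuning halts once the validation loss fails to improve for `patience` consecutive epochs:

```python
def early_stopping_epochs(val_losses, patience=2):
    """Return how many epochs run before validation loss stagnates for `patience` epochs."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch + 1   # stopped early after this epoch
    return len(val_losses)         # ran to completion
```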
For a detailed overview of the pipeline, see the main README.
For running COREBEHRT on Azure cloud infrastructure using SDK v2, refer to the Azure guide. This includes:
- Configuration setup for Azure
- Data store management
- Job execution in the cloud
- Environment preparation
We welcome contributions! Please see our Contributing Guidelines for details on:
- Code style and formatting
- Testing requirements
- Pull request process
- Issue reporting
This project is licensed under the MIT License - see the LICENSE file for details.
If you use COREBEHRT in your research, please cite the following paper: