A framework for processing and analyzing Electronic Health Records (EHR) data using BERT-based models.
COREBEHRT helps researchers and data scientists preprocess EHR data, train models, and generate outcomes for downstream clinical predictions and analyses.
- End-to-end EHR Pipeline: Tools for data ingestion, cleaning, and feature extraction.
- BERT-based Modeling: Pretraining on massive EHR corpora followed by task-specific finetuning.
- Cohort Management: Flexible inclusion/exclusion logic, temporal alignment, outcome definition.
- Scalable: Designed to run both locally and on cloud infrastructure (Azure).
- Built-in Validation: Cross-validation and out-of-time evaluation strategies.
Below is a high-level overview of the most important directories:
- main: Primary pipeline scripts (create_data, pretrain, finetune, etc.)
- modules: Core implementation of model architecture and data processing (detailed overview)
- configs: YAML configuration files for each pipeline stage
- functional: Pure utility functions supporting module operations (detailed overview)
- azure: Cloud deployment and execution utilities (azure instructions)
For running tests and pipelines, create and activate a virtual environment, then install the required dependencies:
python -m venv .venv
source .venv/bin/activate
(.venv) pip install -r requirements.txt
Below is a high-level description of the steps in the COREBEHRT pipeline. For detailed configuration options, see the main README. The pipeline can be run from the root directory by executing the following commands:
(.venv) python -m corebehrt.main.create_data
(.venv) python -m corebehrt.main.prepare_training_data --config_path corebehrt/configs/prepare_pretrain.yaml
(.venv) python -m corebehrt.main.pretrain
(.venv) python -m corebehrt.main.create_outcomes
(.venv) python -m corebehrt.main.select_cohort
(.venv) python -m corebehrt.main.prepare_training_data --config_path corebehrt/configs/prepare_finetune.yaml
(.venv) python -m corebehrt.main.finetune_cv
(.venv) python -m corebehrt.main.select_cohort --config_path corebehrt/configs/select_cohort_held_out.yaml
(.venv) python -m corebehrt.main.prepare_training_data --config_path corebehrt/configs/prepare_held_out.yaml
(.venv) python -m corebehrt.main.evaluate_finetune --config_path corebehrt/configs/evaluate_finetune.yaml
Before using COREBEHRT, you need to convert your raw healthcare data into the MEDS (Medical Event Data Standard) format. We provide a companion tool, ehr2meds, to help with this conversion:
- Converts source data (e.g., hospital EHR dumps, registry data) into MEDS
- Performs code normalization and standardization
- Provides configuration options for handling different data sources
- Includes validation to ensure data quality
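As a rough illustration, MEDS represents data as one row per clinical event. The field names below follow the public MEDS schema (`subject_id`, `time`, `code`); the codes and values are invented, and the validation helper is a sketch of the kind of quality check performed, not part of ehr2meds:

```python
# Hypothetical MEDS-style event rows: one row per clinical event.
# Codes and timestamps are invented for illustration.
meds_events = [
    {"subject_id": 1, "time": "2020-01-01T08:00:00", "code": "DOB"},
    {"subject_id": 1, "time": "2020-03-15T10:30:00", "code": "ICD10/E11.9"},
    {"subject_id": 1, "time": "2020-03-15T10:30:00", "code": "ATC/A10BA02"},
]

def validate_events(events):
    """Minimal quality check: every event needs a subject, a timestamp, and a code."""
    required = {"subject_id", "time", "code"}
    return all(required <= event.keys() for event in events)
```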
- Goal: Convert MEDS into tokenized features suitable for model training.
- Key Tasks:
- Vocabulary Mapping: Translates raw medical concepts (e.g., diagnoses, procedures) into numerical tokens.
- Temporal Alignment: Converts timestamps into relative positions (e.g., hours or days from an index date).
- Background Variables: Incorporates static features such as age, gender, or other demographics.
- Efficient Output: Produces a structured parquet format that can be rapidly loaded in subsequent steps.
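The two central transformations can be sketched as follows. This is an illustrative simplification, not COREBEHRT's actual implementation; the special tokens and helper names are assumptions:

```python
from datetime import datetime

def build_vocab(codes, special_tokens=("[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]")):
    """Vocabulary mapping: assign each medical concept a numerical token id."""
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    for code in codes:
        if code not in vocab:
            vocab[code] = len(vocab)
    return vocab

def tokenize(codes, timestamps, vocab, index_date):
    """Map codes to token ids and timestamps to relative positions (days from index date)."""
    token_ids = [vocab.get(c, vocab["[UNK]"]) for c in codes]
    positions = [(t - index_date).days for t in timestamps]
    return token_ids, positions
```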
- Goal: Train a ModernBERT model via masked language modeling.
- Key Tasks:
- Large-scale self-supervised training on EHR sequences
- Embedding temporal relationships between medical events
- Saving checkpoints for downstream finetuning
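The masking step at the heart of masked language modeling can be sketched like this (a simplified illustration, not COREBEHRT's trainer; the 15% mask probability and the `-100` ignore-label convention follow common BERT practice):

```python
import random

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15, rng=None):
    """Replace a random fraction of tokens with [MASK]; the model must recover them."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_token_id)
            labels.append(tok)        # loss is computed at masked positions
        else:
            inputs.append(tok)
            labels.append(-100)       # conventionally ignored by the loss
    return inputs, labels
```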
- Goal: Generate outcomes from the formatted data for supervised learning.
- Key Tasks:
- Search for specific concepts (medications, diagnoses, procedures) in the data
- Optionally create exposure definitions for more complex study designs
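The concept search amounts to scanning each patient's events for target codes. A minimal sketch, assuming events as `(subject_id, time, code)` tuples and taking the first match per patient as the outcome date (names are illustrative, not COREBEHRT's API):

```python
def find_outcomes(events, target_codes):
    """Return the timestamp of each subject's first event matching a target code."""
    outcomes = {}
    # Sort by subject and time so the first match is the earliest occurrence.
    for subject_id, time, code in sorted(events, key=lambda e: (e[0], e[1])):
        if code in target_codes and subject_id not in outcomes:
            outcomes[subject_id] = time
    return outcomes
```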
- Goal: Define the study population
- Key Tasks:
- Apply inclusion/exclusion criteria (e.g., age, prior outcomes)
- Generate index dates for each patient
- Produce folds and test set for cross-validation
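The selection and splitting logic can be sketched as below. The age criterion, fold scheme, and field name `age_at_index` are illustrative assumptions, not the actual configuration options:

```python
def select_cohort(patients, min_age=18, n_folds=2, test_frac=0.25):
    """Filter by an inclusion criterion, then split into CV folds and a test set."""
    eligible = [p for p in patients if p["age_at_index"] >= min_age]
    n_test = int(len(eligible) * test_frac)
    test, train = eligible[:n_test], eligible[n_test:]
    # Round-robin assignment into k folds for cross-validation.
    folds = [train[i::n_folds] for i in range(n_folds)]
    return folds, test
```

In practice the splits would be randomized and stratified; this sketch only shows the shape of the output.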
- Goal: Adapt the pretrained model for specific binary outcomes
- Key Tasks:
- K-fold cross-validation
- Early stopping and evaluation on the test set
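Early stopping can be illustrated with a minimal sketch (not COREBEHRT's trainer): finetuning halts once the validation loss fails to improve for `patience` consecutive epochs:

```python
def early_stopping_epochs(val_losses, patience=2):
    """Return how many epochs run before validation loss stagnates for `patience` epochs."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch + 1   # stopped early after this epoch
    return len(val_losses)         # ran to completion
```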
For a detailed overview of the pipeline, see the main README.
For running COREBEHRT on Azure cloud infrastructure using SDK v2, refer to the Azure guide. This includes:
- Configuration setup for Azure
- Data store management
- Job execution in the cloud
- Environment preparation
We welcome contributions! Please see our Contributing Guidelines for details on:
- Code style and formatting
- Testing requirements
- Pull request process
- Issue reporting
This project is licensed under the MIT License - see the LICENSE file for details.
If you use COREBEHRT in your research, please cite the following paper: