Welcome to ImputeGAP

ImputeGAP is a comprehensive Python library for imputation of missing values in time series data. It implements user-friendly APIs to easily visualize, analyze, and repair your own time series datasets. The library supports a diverse range of imputation methods and modular missing data simulation catering to datasets with varying characteristics. ImputeGAP includes extensive customization options, such as automated hyperparameter tuning, benchmarking, explainability, downstream evaluation, and compatibility with popular time series frameworks.

In detail, the package provides:

  • Access to commonly used datasets in time series research (Datasets).
  • Configurable contamination module that simulates real-world missingness patterns.
  • Automated preprocessing with built-in methods for normalizing time series.
  • Parameterized state-of-the-art time series imputation algorithms.
  • Modular tools to analyze the behavior of these algorithms and assess their impact on key downstream tasks in time series analysis.
  • Experiment benchmarking, fostering research reproducibility in time series.
  • Fine-grained analysis of the impact of time series features on imputation results.
  • Plug-and-play integration of new datasets and algorithms in various languages such as Python, C++, Matlab, Java, and R.

Families of Algorithms

Algorithms Table

| Family | Algorithm | Venue -- Year |
|---|---|---|
| Matrix Completion | CDRec [1] | KAIS -- 2020 |
| Matrix Completion | TRMF [8] | NeurIPS -- 2016 |
| Matrix Completion | GROUSE [3] | PMLR -- 2016 |
| Matrix Completion | ROSL [4] | CVPR -- 2014 |
| Matrix Completion | SoftImpute [6] | JMLR -- 2010 |
| Matrix Completion | SVT [7] | SIAM J. OPTIM -- 2010 |
| Matrix Completion | SPIRIT [5] | VLDB -- 2005 |
| Matrix Completion | IterativeSVD [2] | BIOINFORMATICS -- 2001 |
| Pattern Search | TKCM [11] | EDBT -- 2017 |
| Pattern Search | ST-MVL [9] | IJCAI -- 2016 |
| Pattern Search | DynaMMo [10] | KDD -- 2009 |
| Machine Learning | IIM [12] | ICDE -- 2019 |
| Machine Learning | XGBI [13] | KDD -- 2016 |
| Machine Learning | Mice [14] | Statistical Software -- 2011 |
| Machine Learning | MissForest [15] | BioInformatics -- 2011 |
| Deep Learning | BITGraph [32] | ICLR -- 2024 |
| Deep Learning | BayOTIDE [30] | PMLR -- 2024 |
| Deep Learning | MPIN [25] | PVLDB -- 2024 |
| Deep Learning | MissNet [27] | KDD -- 2024 |
| Deep Learning | PriSTI [26] | ICDE -- 2023 |
| Deep Learning | GRIN [29] | ICLR -- 2022 |
| Deep Learning | HKMF-T [31] | TKDE -- 2021 |
| Deep Learning | DeepMVI [24] | PVLDB -- 2021 |
| Deep Learning | MRNN [22] | IEEE Trans on BE -- 2019 |
| Deep Learning | BRITS [23] | NeurIPS -- 2018 |
| Deep Learning | GAIN [28] | ICML -- 2018 |
| Statistics | KNNImpute | - |
| Statistics | Interpolation | - |
| Statistics | Min Impute | - |
| Statistics | Zero Impute | - |
| Statistics | Mean Impute | - |
| Statistics | Mean Impute By Series | - |

System Requirements

ImputeGAP is compatible with Python >= 3.10 (except 3.13) and requires a Unix-compatible environment.

To create and set up an environment with Python 3.12, please refer to the installation guide.
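The compatibility constraint can be checked before installing. The following is a minimal sketch: the helper name `is_supported` is hypothetical and not part of ImputeGAP, and treating every platform other than native Windows as Unix-compatible is an assumption.

```python
import sys

def is_supported(version=None, platform=None):
    """Return True for Python >= 3.10, excluding 3.13, on a Unix-like platform."""
    major, minor = (version or sys.version_info)[:2]
    platform = platform or sys.platform
    python_ok = (major, minor) >= (3, 10) and (major, minor) != (3, 13)
    # Assumption: "Unix-compatible" is approximated as "not native Windows".
    unix_ok = not platform.startswith("win")
    return python_ok and unix_ok

print("Supported environment:", is_supported())
```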


Installation

To install the latest version of ImputeGAP from PyPI, run the following command:

pip install imputegap

Alternatively, you can install the library from source:

git clone https://github.com/eXascaleInfolab/ImputeGAP
cd ./ImputeGAP
pip install -e .

Loading and Preprocessing

ImputeGAP comes with several time series datasets. You can find them inside the submodule ts.datasets.

As an example, we start by using eeg-alcohol, a standard dataset of EEG recordings from individuals with a genetic predisposition to alcoholism. The dataset contains measurements from 64 electrodes placed on the subjects' scalps, sampled at 256 Hz (3.9-ms epoch) for 1 second. The dataset therefore consists of 64 series, each containing 256 values.

Example Loading

You can find this example in the file runner_loading.py.

from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the TimeSeries()
ts = TimeSeries()
print(f"ImputeGAP datasets : {ts.datasets}")

# load the timeseries from file or from the code
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# plot a subset of time series
ts.plot(input_data=ts.data, nbr_series=9, nbr_val=100, save_path="./imputegap/assets")

# print a subset of time series
ts.print(nbr_series=6, nbr_val=20)
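To make the `z_score` normalizer more concrete, here is a minimal pure-Python sketch of z-score normalization applied to a single series. It illustrates the technique only: the function name `z_score` is hypothetical, and whether ImputeGAP normalizes each series separately or the whole dataset at once is not stated in this document.

```python
import math

def z_score(series):
    """Rescale a list of floats to zero mean and unit (population) standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        # A constant series has no spread; map it to zeros instead of dividing by zero.
        return [0.0] * n
    return [(x - mean) / std for x in series]

print(z_score([2.0, 4.0, 6.0]))
```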

Contamination

We now describe how to simulate missing values in the loaded dataset. ImputeGAP implements eight different missingness patterns. You can find them inside the module ts.patterns.

For more details, please refer to the documentation on this page.

Example Contamination

You can find this example in the file runner_contamination.py.

As an example, we show how to contaminate the eeg-alcohol dataset with the MCAR pattern:

from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the TimeSeries() object
ts = TimeSeries()
print(f"Missingness patterns : {ts.patterns}")

# load and normalize the timeseries
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# contaminate the time series with MCAR pattern
ts_m = ts.Contamination.missing_completely_at_random(ts.data, rate_dataset=0.2, rate_series=0.4, block_size=10, seed=True)

# [OPTIONAL] plot the contaminated time series
ts.plot(ts.data, ts_m, nbr_series=9, subplot=True, save_path="./imputegap/assets")
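For intuition about what block-wise MCAR contamination does, the following simplified sketch removes random blocks of values from a subset of series. It is not the library's implementation. The function `mcar_blocks` is hypothetical, the sketch takes an integer seed, and the parameter meanings are assumptions: `rate_dataset` is read as the fraction of series to contaminate, `rate_series` as the fraction of values removed per selected series, and `block_size` as the length of each missing block. Blocks are placed on block-aligned slots to keep them from overlapping.

```python
import math
import random

def mcar_blocks(data, rate_dataset=0.2, rate_series=0.4, block_size=10, seed=42):
    """Return a copy of `data` (a list of series) with NaN blocks at random positions."""
    rng = random.Random(seed)
    out = [list(series) for series in data]
    n_selected = math.ceil(len(out) * rate_dataset)
    for idx in rng.sample(range(len(out)), n_selected):
        series = out[idx]
        n_blocks = int(len(series) * rate_series) // block_size
        # Pick distinct block-aligned slots uniformly at random (no overlap).
        slots = rng.sample(range(len(series) // block_size), n_blocks)
        for slot in slots:
            start = slot * block_size
            series[start:start + block_size] = [float("nan")] * block_size
    return out

data = [[float(i) for i in range(100)] for _ in range(10)]
contaminated = mcar_blocks(data)
missing_per_series = [sum(math.isnan(v) for v in s) for s in contaminated]
print(missing_per_series)
```

With these settings, two of the ten series are selected and each loses four blocks of ten values; the input data is left unchanged.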

Imputation

In this section, we illustrate how to impute the contaminated time series. Our library implements five families of imputation algorithms: Statistical, Machine Learning, Matrix Completion, Deep Learning, and Pattern Search methods. You can find the list of algorithms inside the module ts.algorithms.

Example Imputation

You can find this example in the file runner_imputation.py.

Imputation can be performed using either default values or user-defined values. To specify the parameters, please use a dictionary in the following format:

params = {"param_1": 42.1, "param_2": "some_string", "param_3": True}

Let's illustrate the imputation using the CDRec Algorithm from the Matrix Completion family.

from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the TimeSeries() object
ts = TimeSeries()
print(f"Imputation algorithms : {ts.algorithms}")

# load and normalize the timeseries
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# contaminate the time series
ts_m = ts.Contamination.missing_completely_at_random(ts.data)

# impute the contaminated series
imputer = Imputation.MatrixCompletion.CDRec(ts_m)
imputer.impute()

# compute and print the imputation metrics
imputer.score(ts.data, imputer.recov_data)
ts.print_results(imputer.metrics)

# plot the recovered time series
ts.plot(input_data=ts.data, incomp_data=ts_m, recov_data=imputer.recov_data, nbr_series=9, subplot=True, save_path="./imputegap/assets")
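The scoring step compares the recovered values against the ground truth. As a rough illustration, the sketch below computes RMSE and MAE over the positions that are missing in the contaminated data. The helper `masked_errors` is hypothetical, and restricting the error to the missing positions is an assumption about how such metrics are commonly defined, not a description of what `imputer.score` does internally.

```python
import math

def masked_errors(truth, incomplete, recovered):
    """RMSE and MAE over entries that are NaN in `incomplete` (lists of series)."""
    diffs = [
        t - r
        for t_row, m_row, r_row in zip(truth, incomplete, recovered)
        for t, m, r in zip(t_row, m_row, r_row)
        if math.isnan(m)
    ]
    if not diffs:
        raise ValueError("no missing values to evaluate")
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return {"RMSE": rmse, "MAE": mae}

nan = float("nan")
truth = [[1.0, 2.0, 3.0, 4.0]]
incomplete = [[1.0, nan, nan, 4.0]]
recovered = [[1.0, 2.5, 2.0, 4.0]]
result = masked_errors(truth, incomplete, recovered)
print(result)
```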

Parameterization

The Optimizer component manages algorithm configuration and hyperparameter tuning. To invoke the tuning process, set the optimization option in the call to impute() instead of passing user-defined parameters. The option is a dictionary containing the ground truth, the chosen optimizer, and the optimizer's options. Several search algorithms are available, including those provided by Ray Tune.

Example Auto-ML

You can find this example in the file runner_optimization.py.

Let's illustrate the imputation using the CDRec algorithm and Ray Tune AutoML:

from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the TimeSeries() object
ts = TimeSeries()
print(f"AutoML Optimizers : {ts.optimizers}")

# load and normalize the timeseries
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# contaminate and impute the time series
ts_m = ts.Contamination.missing_completely_at_random(ts.data)
imputer = Imputation.MatrixCompletion.CDRec(ts_m)

# use Ray Tune to fine tune the imputation algorithm
imputer.impute(user_def=False, params={"input_data": ts.data, "optimizer": "ray_tune"})

# compute and print the imputation metrics
imputer.score(ts.data, imputer.recov_data)
ts.print_results(imputer.metrics)

# plot the recovered time series
ts.plot(input_data=ts.data, incomp_data=ts_m, recov_data=imputer.recov_data, nbr_series=9, subplot=True, save_path="./imputegap/assets", display=True)

# save hyperparameters
utils.save_optimization(optimal_params=imputer.parameters, algorithm=imputer.algorithm, dataset="eeg-alcohol", optimizer="ray_tune")
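Conceptually, tuning against the ground truth means trying candidate parameter values, imputing with each one, scoring the result, and keeping the best. The sketch below shows that loop with an exhaustive search over a toy imputer (the mean of the k nearest observed values). It uses neither Ray Tune nor CDRec, and every name in it is hypothetical.

```python
import math

def neighbor_mean_impute(series, k):
    """Fill each NaN with the mean of the k nearest observed values in time."""
    observed = [(i, v) for i, v in enumerate(series) if not math.isnan(v)]
    out = list(series)
    for i, v in enumerate(series):
        if math.isnan(v):
            nearest = sorted(observed, key=lambda p: abs(p[0] - i))[:k]
            out[i] = sum(val for _, val in nearest) / len(nearest)
    return out

def rmse_on_missing(truth, incomplete, recovered):
    """RMSE restricted to the positions that are NaN in `incomplete`."""
    errs = [(t - r) ** 2 for t, m, r in zip(truth, incomplete, recovered) if math.isnan(m)]
    return math.sqrt(sum(errs) / len(errs))

def tune(truth, incomplete, candidates):
    """Exhaustive search: return (best_k, best_rmse) over the candidate values."""
    scored = [
        (rmse_on_missing(truth, incomplete, neighbor_mean_impute(incomplete, k)), k)
        for k in candidates
    ]
    best_rmse, best_k = min(scored)
    return best_k, best_rmse

nan = float("nan")
truth = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
incomplete = [0.0, 1.0, 2.0, nan, 4.0, 5.0, 6.0]
best = tune(truth, incomplete, candidates=[1, 2, 4])
print(best)
```

A real optimizer replaces the exhaustive loop with a smarter search strategy, but the role of the ground truth in scoring each candidate is the same.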

Explainer

ImputeGAP provides insights into an algorithm's behavior by identifying the features that most affect the imputation results. It trains a regression model to predict imputation results across various methods and uses SHapley Additive exPlanations (SHAP) to reveal how different time series features influence the model's predictions.

Example Explainer

You can find this example in the file runner_explainer.py.

Let’s illustrate the explainer using the CDRec Algorithm and MCAR missingness pattern:

from imputegap.recovery.manager import TimeSeries
from imputegap.recovery.explainer import Explainer
from imputegap.tools import utils

# initialize the TimeSeries() object
ts = TimeSeries()

# load and normalize the timeseries
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# configure the explanation
shap_values, shap_details = Explainer.shap_explainer(input_data=ts.data, 
                                                     extractor="pycatch", 
                                                     pattern="missing_completely_at_random", 
                                                     file_name=ts.name,
                                                     algorithm="CDRec")

# print the impact of each feature
Explainer.print(shap_values, shap_details)
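SHAP builds on Shapley values, which attribute a model's prediction to its input features. The toy sketch below computes exact Shapley values by averaging each feature's marginal contribution over all feature orderings. It is meant only to convey the idea: `shapley_values` and `toy_model` are hypothetical, the feature names are illustrative, and the Explainer itself relies on SHAP rather than on this brute-force computation.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    contrib = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for j in order:
            current[j] = x[j]          # reveal feature j
            new = predict(current)
            contrib[j] += new - prev   # marginal contribution of j in this ordering
            prev = new
    return [c / len(orderings) for c in contrib]

# Toy regression model standing in for a trained predictor of imputation results.
def toy_model(features):
    trend, seasonality, noise = features
    return 2.0 * trend + 3.0 * seasonality + trend * noise

x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term `trend * noise` is split evenly between the two features involved.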

Downstream

ImputeGAP includes a dedicated module for systematically evaluating the impact of data imputation on downstream tasks. Currently, forecasting is the primary supported task, with plans to expand to additional applications in the future. The example below demonstrates how to define the forecasting task and specify Prophet as the predictive model.

Example Downstream

You can find this example in the file runner_downstream.py.

Below is an example of how to call the downstream process for the model Prophet by defining a dictionary for the evaluator and selecting the model:

from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the TimeSeries() object
ts = TimeSeries()
print(f"ImputeGAP downstream models for forecasting : {ts.downstream_models}")

# load and normalize the timeseries
ts.load_series(utils.search_path("chlorine"))
ts.normalize(normalizer="min_max")

# contaminate the time series
ts_m = ts.Contamination.missing_percentage(ts.data, rate_series=0.8)

# define and impute the contaminated series
imputer = Imputation.MatrixCompletion.CDRec(ts_m)
imputer.impute()

# compute and print the downstream results
downstream_config = {"task": "forecast", "model": "prophet"}
imputer.score(ts.data, imputer.recov_data, downstream=downstream_config)
ts.print_results(imputer.downstream_metrics, algorithm=imputer.algorithm)
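The idea behind downstream evaluation can be shown with a toy forecaster: fit the same model on the original history and on the imputed history, then compare both forecasts against the true future. The sketch below uses a moving-average forecaster instead of Prophet. All names are hypothetical, and the evaluation protocol shown is an assumption, not a description of the library's module.

```python
def moving_average_forecast(history, horizon, window=3):
    """Forecast every future step as the mean of the last `window` observations."""
    level = sum(history[-window:]) / window
    return [level] * horizon

def forecast_mae(history, future):
    """Mean absolute error of the forecast built from `history` against `future`."""
    pred = moving_average_forecast(history, len(future))
    return sum(abs(p - f) for p, f in zip(pred, future)) / len(future)

future = [10.0, 10.0]
original = [8.0, 9.0, 10.0, 10.0, 10.0]
imputed = [8.0, 9.0, 10.0, 7.0, 10.0]  # index 3 was missing and poorly recovered

mae_original = forecast_mae(original, future)
mae_imputed = forecast_mae(imputed, future)
print("MAE with original history:", mae_original)
print("MAE with imputed history: ", mae_imputed)
```

The gap between the two errors is one way to quantify how much an imputation error propagates into the downstream task.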

Benchmark

ImputeGAP can serve as a common test-bed for comparing the effectiveness and efficiency of time series imputation algorithms [33]. Users have full control over the benchmark by customizing various parameters, including the list of datasets to evaluate, the algorithms to compare, the choice of optimizer to fine-tune the algorithms on the chosen datasets, the missingness patterns, and the range of missing rates.

Example Benchmark

You can find this example in the file runner_benchmark.py.

The benchmarking module can be utilized as follows:

from imputegap.recovery.benchmark import Benchmark

save_dir = "./analysis"
nbr_run = 2

datasets = ["eeg-alcohol", "eeg-reading"]

optimizer = {"optimizer": "ray_tune", "options": {"n_calls": 1, "max_concurrent_trials": 1}}
optimizers = [optimizer]

algorithms = ["MeanImpute", "CDRec", "STMVL", "IIM", "MRNN"]

patterns = ["missing_completely_at_random"]

# missing rates to evaluate (named "rates" to avoid shadowing the built-in range)
rates = [0.05, 0.1, 0.2, 0.4, 0.6, 0.8]

# launch the analysis
list_results, sum_scores = Benchmark().eval(algorithms=algorithms, datasets=datasets, patterns=patterns, x_axis=rates, optimizers=optimizers, save_dir=save_dir, runs=nbr_run)
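To get a feeling for the size of such a benchmark before launching it, the experiment grid can be enumerated up front. The sketch below is a back-of-the-envelope count. It assumes that every run repeats each combination of dataset, pattern, algorithm, and missing rate, and it ignores the additional trials performed by the optimizer.

```python
from itertools import product

datasets = ["eeg-alcohol", "eeg-reading"]
algorithms = ["MeanImpute", "CDRec", "STMVL", "IIM", "MRNN"]
patterns = ["missing_completely_at_random"]
rates = [0.05, 0.1, 0.2, 0.4, 0.6, 0.8]
nbr_run = 2

# One entry per (dataset, pattern, algorithm, rate) combination.
configurations = list(product(datasets, patterns, algorithms, rates))
total_imputations = len(configurations) * nbr_run

print(f"{len(configurations)} configurations, {total_imputations} imputations in total")
# → 60 configurations, 120 imputations in total
```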

Integration

To add your own imputation algorithm in Python or C++, please refer to the detailed integration guide.


Articles

Mourad Khayati, Alberto Lerner, Zakhar Tymchenko, Philippe Cudré-Mauroux: Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series. Proc. VLDB Endow. 13(5): 768-782 (2020)

Mourad Khayati, Quentin Nater, Jacques Pasquier: ImputeVIS: An Interactive Evaluator to Benchmark Imputation Techniques for Time Series Data. Proc. VLDB Endow. 17(12): 4329-4332 (2024)


Core Contributors

  • Quentin Nater
  • Mourad Khayati

Citing

If you use ImputeGAP in your research, please cite the library:

@software{ImputeGAP_2025,
    author = {Nater, Quentin and Khayati, Mourad},
    license = {MIT},
    title = {{ImputeGAP Library}},
    url = {https://github.com/eXascaleInfolab/ImputeGAP},
    year = {2025}
}

References

[1]: Mourad Khayati, Philippe Cudré-Mauroux, Michael H. Böhlen: Scalable recovery of missing blocks in time series with high and low cross-correlations. Knowl. Inf. Syst. 62(6): 2257-2280 (2020)

[2]: Olga G. Troyanskaya, Michael N. Cantor, Gavin Sherlock, Patrick O. Brown, Trevor Hastie, Robert Tibshirani, David Botstein, Russ B. Altman: Missing value estimation methods for DNA microarrays. Bioinform. 17(6): 520-525 (2001)

[3]: Dejiao Zhang, Laura Balzano: Global Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation. AISTATS 2016: 1460-1468

[4]: Xianbiao Shu, Fatih Porikli, Narendra Ahuja: Robust Orthonormal Subspace Learning: Efficient Recovery of Corrupted Low-Rank Matrices. CVPR 2014: 3874-3881

[5]: Spiros Papadimitriou, Jimeng Sun, Christos Faloutsos: Streaming Pattern Discovery in Multiple Time-Series. VLDB 2005: 697-708

[6]: Rahul Mazumder, Trevor Hastie, Robert Tibshirani: Spectral Regularization Algorithms for Learning Large Incomplete Matrices. J. Mach. Learn. Res. 11: 2287-2322 (2010)

[7]: Jian-Feng Cai, Emmanuel J. Candès, Zuowei Shen: A Singular Value Thresholding Algorithm for Matrix Completion. SIAM J. Optim. 20(4): 1956-1982 (2010)

[8]: Hsiang-Fu Yu, Nikhil Rao, Inderjit S. Dhillon: Temporal Regularized Matrix Factorization for High-dimensional Time Series Prediction. NIPS 2016: 847-855

[9]: Xiuwen Yi, Yu Zheng, Junbo Zhang, Tianrui Li: ST-MVL: Filling Missing Values in Geo-Sensory Time Series Data. IJCAI 2016: 2704-2710

[10]: Lei Li, James McCann, Nancy S. Pollard, Christos Faloutsos: DynaMMo: mining and summarization of coevolving sequences with missing values. KDD 2009: 507-516

[11]: Kevin Wellenzohn, Michael H. Böhlen, Anton Dignös, Johann Gamper, Hannes Mitterer: Continuous Imputation of Missing Values in Streams of Pattern-Determining Time Series. EDBT 2017: 330-341

[12]: Aoqian Zhang, Shaoxu Song, Yu Sun, Jianmin Wang: Learning Individual Models for Imputation (Technical Report). CoRR abs/2004.03436 (2020)

[13]: Tianqi Chen, Carlos Guestrin: XGBoost: A Scalable Tree Boosting System. KDD 2016: 785-794

[14]: Patrick Royston, Ian R. White: Multiple Imputation by Chained Equations (MICE): Implementation in Stata. Journal of Statistical Software 45(4): 1-20 (2011)

[15]: Daniel J. Stekhoven, Peter Bühlmann: MissForest - non-parametric missing value imputation for mixed-type data. Bioinform. 28(1): 112-118 (2012)

[22]: Jinsung Yoon, William R. Zame, Mihaela van der Schaar: Estimating Missing Data in Temporal Data Streams Using Multi-Directional Recurrent Neural Networks. IEEE Trans. Biomed. Eng. 66(5): 1477-1490 (2019)

[23]: Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, Yitan Li: BRITS: Bidirectional Recurrent Imputation for Time Series. NeurIPS 2018: 6776-6786

[24]: Parikshit Bansal, Prathamesh Deshpande, Sunita Sarawagi: Missing Value Imputation on Multidimensional Time Series. Proc. VLDB Endow. 14(11): 2533-2545 (2021)

[25]: Xiao Li, Huan Li, Hua Lu, Christian S. Jensen, Varun Pandey, Volker Markl: Missing Value Imputation for Multi-attribute Sensor Data Streams via Message Propagation (Extended Version). CoRR abs/2311.07344 (2023)

[26]: Mingzhe Liu, Han Huang, Hao Feng, Leilei Sun, Bowen Du, Yanjie Fu: PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation. ICDE 2023: 1927-1939

[27]: Kohei Obata, Koki Kawabata, Yasuko Matsubara, Yasushi Sakurai: Mining of Switching Sparse Networks for Missing Value Imputation in Multivariate Time Series. KDD 2024: 2296-2306

[28]: Jinsung Yoon, James Jordon, Mihaela van der Schaar: GAIN: Missing Data Imputation using Generative Adversarial Nets. ICML 2018: 5675-5684

[29]: Andrea Cini, Ivan Marisca, Cesare Alippi: Multivariate Time Series Imputation by Graph Neural Networks. CoRR abs/2108.00298 (2021)

[30]: Shikai Fang, Qingsong Wen, Yingtao Luo, Shandian Zhe, Liang Sun: BayOTIDE: Bayesian Online Multivariate Time Series Imputation with Functional Decomposition. ICML 2024

[31]: Liang Wang, Simeng Wu, Tianheng Wu, Xianping Tao, Jian Lu: HKMF-T: Recover From Blackouts in Tagged Time Series With Hankel Matrix Factorization. IEEE Trans. Knowl. Data Eng. 33(11): 3582-3593 (2021)

[32]: Xiaodan Chen, Xiucheng Li, Bo Liu, Zhijun Li: Biased Temporal Convolution Graph Network for Time Series Forecasting with Missing Values. ICLR 2024

[33]: Mourad Khayati, Alberto Lerner, Zakhar Tymchenko, Philippe Cudré-Mauroux: Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series. Proc. VLDB Endow. 13(5): 768-782 (2020)

[34]: Mourad Khayati, Quentin Nater, Jacques Pasquier: ImputeVIS: An Interactive Evaluator to Benchmark Imputation Techniques for Time Series Data. Proc. VLDB Endow. 17(12): 4329-4332 (2024)
