
ScalableXplain - Efficient XAI Tool for Scalable Machine Learning Models


  • Project Lead(s) / Mentor(s)

    1. Dr. Uthayasanker Thayasivam
    2. Dr. Sumanaruban Rajadurai
  • Contributor(s)

    1. Inuka Ampavila
    2. Saadha Salim
    3. Bojitha Liyanage

Useful Links


Summary

ScalableXplain is a unified, efficient, and scalable Explainable AI (XAI) library designed to work seamlessly with both single-node and distributed machine learning environments. The project aims to make interpretation of complex models practical at scale—bridging the gap between model accuracy and human understanding.

This tool integrates multiple explanation techniques, including SHAP (KernelSHAP and TreeSHAP), LIME, and IMM (Iterative Mistake Minimization), under a single interface that automatically detects the runtime environment (e.g., pandas or PySpark). With distributed support via Apache Spark, ScalableXplain empowers practitioners to interpret large-scale models trained on massive datasets, without compromising on performance or usability.
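To make the single-node side concrete, here is a minimal sketch using the open-source shap package directly, i.e. one of the techniques ScalableXplain wraps; this is not the ScalableXplain API itself, and it only assumes scikit-learn and shap are installed.

```python
# Minimal single-node TreeSHAP sketch using the open-source `shap` package,
# one of the techniques ScalableXplain wraps. This is NOT the ScalableXplain
# interface itself, just the underlying idea on a small dataset.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 4)), columns=["f1", "f2", "f3", "f4"])
y = 3.0 * X["f1"] - 2.0 * X["f2"] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # TreeSHAP for tree ensembles
shap_values = explainer.shap_values(X)  # one attribution per feature per row
shap.summary_plot(shap_values, X)       # beeswarm-style summary plot
```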


Description

ScalableXplain is developed to address a key challenge in modern machine learning workflows: how to generate meaningful explanations at scale. While traditional XAI tools work well on small datasets, they struggle with the size and complexity of modern pipelines. ScalableXplain introduces a modular and extensible library that supports:

  • Local (single-node) explainers using NumPy, pandas, and scikit-learn
  • Distributed explainers using PySpark and SynapseML
  • Seamless switching between environments through unified APIs (sketched below)
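As an illustration of what automatic backend detection can look like, the snippet below routes a call to a local or a distributed explainer depending on the DataFrame type. This is a minimal sketch with hypothetical helper names, not the library's actual dispatch code.

```python
# Minimal sketch of automatic backend detection; the helper names below are
# hypothetical stand-ins, not confirmed ScalableXplain internals.
import pandas as pd

def detect_backend(df) -> str:
    """Return 'spark' for a PySpark DataFrame, 'local' for a pandas DataFrame."""
    try:
        from pyspark.sql import DataFrame as SparkDataFrame
        if isinstance(df, SparkDataFrame):
            return "spark"
    except ImportError:
        pass  # PySpark not installed; only the local backend is available
    if isinstance(df, pd.DataFrame):
        return "local"
    raise TypeError(f"Unsupported data container: {type(df)!r}")

def run_local_explainer(model, df, method):
    ...  # placeholder: e.g. call shap / lime on a single node

def run_distributed_explainer(model, df, method):
    ...  # placeholder: e.g. call a SynapseML / Spark-based explainer

def explain(model, df, method="kernelshap"):
    """Single entry point that dispatches to the detected backend."""
    if detect_backend(df) == "spark":
        return run_distributed_explainer(model, df, method)
    return run_local_explainer(model, df, method)
```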

Distributed Iterative Mistake Minimization

D-IMM is a scalable and interpretable algorithm for explaining k-means clustering results with decision trees. Built for distributed environments, D-IMM extends the original Iterative Mistake Minimization (IMM) algorithm to handle large-scale datasets with millions of instances efficiently using Apache Spark. D-IMM is a novel algorithm introduced in this package through our research, building on the IMM algorithm presented in https://arxiv.org/abs/2002.12538.


🔍 Overview

Traditional clustering methods like k-means are powerful but hard to interpret—especially on large datasets. D-IMM bridges this gap by constructing human-readable decision trees that approximate the original clustering assignments. It provides global, post-hoc explanations that scale seamlessly with data volume and dimensionality.
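A toy, single-node sketch of the underlying IMM idea (not the distributed D-IMM code) may help: at each tree node, pick the axis-aligned threshold that separates the k-means centers while sending as few points as possible to the opposite side of their own assigned center.

```python
# Toy, single-node sketch of the IMM split criterion (not the D-IMM Spark code).
import numpy as np

def best_imm_split(X, labels, centers):
    """Return (feature, threshold, mistakes) for the split that separates at
    least two centers while separating the fewest points from their own
    assigned center ("mistakes")."""
    best = None
    for f in range(X.shape[1]):
        # candidate thresholds: midpoints between sorted center coordinates,
        # so every candidate puts at least one center on each side
        cvals = np.sort(np.unique(centers[:, f]))
        for lo, hi in zip(cvals[:-1], cvals[1:]):
            t = (lo + hi) / 2.0
            point_left = X[:, f] <= t
            center_left = centers[labels, f] <= t  # side of each point's own center
            mistakes = int(np.sum(point_left != center_left))
            if best is None or mistakes < best[2]:
                best = (f, t, mistakes)
    return best

# Example: three well-separated 2-D clusters labelled by their generating center
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.8, size=(100, 2)) for c in centers])
labels = np.repeat([0, 1, 2], 100)
print(best_imm_split(X, labels, centers))  # e.g. a split near x0 = 2.5 or x1 = 2.5
```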


✨ Key Features

  • Scalable to 10M+ records using distributed Spark execution.
  • Faithful explanations that minimize mismatches with k-means.
  • Histogram-based binning for fast, repeatable split evaluations (sketched after this list).
  • Distributed mistake counting and node refinement loop.
  • Produces interpretable global decision trees.
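The histogram-based binning and mistake counting mentioned above can be sketched as follows. This is a simplified, single-node illustration of the idea, not the actual Scala implementation: points and their assigned centers are bucketed once per feature, a small joint histogram is built, and mistake counts for every bin-boundary threshold are read off that histogram instead of rescanning the raw data per candidate split.

```python
# Simplified single-node illustration of histogram-based mistake counting; in
# D-IMM the per-bin counts would be computed per Spark partition and summed.
import numpy as np

def histogram_mistake_counts(X, labels, centers, feature, n_bins=32):
    """(threshold, mistakes) for every bin-boundary threshold on one feature."""
    col = X[:, feature]
    edges = np.linspace(col.min(), col.max(), n_bins + 1)
    interior = edges[1:-1]
    point_bin = np.digitize(col, interior)                       # 0 .. n_bins-1
    center_bin = np.digitize(centers[labels, feature], interior)
    center_bin = np.clip(center_bin, 0, n_bins - 1)              # precautionary clip

    # Joint counts: H[i, j] = points whose value falls in bin i and whose
    # assigned center falls in bin j along this feature.
    H = np.zeros((n_bins, n_bins), dtype=np.int64)
    np.add.at(H, (point_bin, center_bin), 1)

    results = []
    for b in range(n_bins - 1):                          # threshold = right edge of bin b
        left_pts_right_ctr = H[: b + 1, b + 1 :].sum()   # point left, its center right
        right_pts_left_ctr = H[b + 1 :, : b + 1].sum()   # point right, its center left
        results.append((edges[b + 1], int(left_pts_right_ctr + right_pts_left_ctr)))
    return results
```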

📈 Performance Highlights

  • Achieves up to 3.2× speedup compared to single-node IMM.
  • Preserves or improves clustering fidelity (mistake %, surrogate cost).
  • Demonstrates linear scalability with increasing Spark executors.
  • Tested on real-world datasets like HIGGS (11M points) and SUSY.

🛠 Built With

  • Apache Spark 3.5.x
  • Scala 2.12
  • Java 17
  • Compatible with PySpark via wrapper interface

Project Phases

  1. Exploration & Benchmarking

    • Surveyed existing explanation techniques and frameworks
    • Ran initial experiments on SHAP, LIME, and IMM for synthetic and benchmark datasets
  2. Research on a scalable algorithm equivalent to IMM

    • Researched existing algorithms for threshold tree building in Apache Spark
    • Introduced histogram-based candidate split discovery and histogram-based mistake calculation to the IMM algorithm, making it scalable and efficient
  3. Experiments and Testing

    • Tested the novel algorithm on a set of large-scale datasets to verify scalability and the validity of its results
  4. Implementation of Package

    • Developed unified wrapper classes for SHAP, LIME, and IMM
    • Implemented both single-node and distributed versions
    • Created automatic backend detection and dispatch
  5. Optimization & Visualization

    • Added visual support: SHAP bar plots, beeswarm plots, LIME text highlights
    • Implemented efficient histogram-based mistake calculations for IMM
    • Optimized runtime and memory for large datasets using Spark
  6. Integration & Packaging

    • Integrated Scala-based D-IMM using a Py4J bridge (a bridging sketch follows this list)
    • Packaged the system as a pip-installable module
    • Added command-line utilities and Jupyter notebook demos
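For phase 6, a hedged sketch of the Py4J bridging pattern is shown below. The jar name, package path, class name, and method are hypothetical placeholders, since the actual ScalableXplain entry points are not documented in this README; only the general PySpark/Py4J mechanism (`spark._jvm`, `df._jdf`) is standard.

```python
# Hedged sketch of calling a Scala D-IMM implementation from PySpark through
# the Py4J gateway. All names marked "hypothetical" are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dimm-demo")
    .config("spark.jars", "scalablexplain-dimm.jar")  # hypothetical jar name
    .getOrCreate()
)

df = spark.read.parquet("points.parquet")  # hypothetical input dataset

# PySpark already maintains a Py4J gateway to the JVM; Scala/Java classes on
# the classpath can be reached through `spark._jvm`.
jvm = spark._jvm
dimm = jvm.com.scalablexplain.dimm.DistributedIMM(df._jdf, 10)  # hypothetical class, k = 10
tree = dimm.fitAndExplain()                                     # hypothetical method
print(tree)
```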

More References

  1. SynapseML - Microsoft's distributed machine learning toolkit (https://github.com/microsoft/SynapseML)
  2. Original IMM paper - Iterative Mistake Minimization (https://arxiv.org/abs/2002.12538)

License

Apache License 2.0

Code of Conduct

Please read our code of conduct document here.
