Merge pull request #7 from transferwise/regularization
Add Regularization and tidy up
Showing 7 changed files with 308 additions and 337 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,13 +1,18 @@ | ||
# Repository created from the dev portal | ||
|
||
Owner: data-scientists | ||
|
||
Slack channels: #shap-select | ||
|
||
## Table of Contents | ||
|
||
- [Overview](#overview) | ||
|
||
## Overview | ||
`shap-select` implements a heuristic to do fast feature selection for tabular regression and classification models. | ||
|
||
The basic idea is running a linear or logistic regression of the target on the Shapley values on the validation set, | ||
discarding the features with negative coefficients, and ranking/filtering the rest according to their | ||
statistical significance. For motivation and details, see the [example notebook](https://github.com/transferwise/shap-select/blob/main/docs/Quick%20feature%20selection%20through%20regression%20on%20Shapley%20values.ipynb) | ||
|
||
A library for feature selection for gradient boosting models using regression on feature Shapley values | ||
Earlier packages using Shapley values for feature selection exist; the advantages of this one are: | ||
* Regression on the **validation set** to combat overfitting | ||
* A single pass regression, not an iterative approach | ||
* A single intuitive hyperparameter for feature selection: statistical significance | ||
* Bonferroni correction for multiclass classification | ||
## Usage | ||
```python | ||
from shap_select import shap_select | ||
# Here model is any model supported by the shap library, fitted on a different (train) dataset | ||
selected_features_df = shap_select(model, X_val, y_val, task="multiclass", threshold=0.05) | ||
``` |
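The heuristic described in the Overview can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the package's implementation. It assumes a plain linear model, for which (under feature independence) the Shapley value of feature `j` is `coef_j * (x_j - mean(x_j))`. It uses NumPy only instead of `shap` and `statsmodels`, and it approximates p-values with a normal distribution. All names (`make_data`, `ols_with_pvalues`, `selected`) are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Two informative features and one pure-noise feature (synthetic data).
    X = rng.normal(size=(n, 3))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)
    return X, y

def ols_with_pvalues(X, y):
    # Ordinary least squares with an intercept. Two-sided p-values use a
    # normal approximation to the t distribution (an assumption of this sketch).
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (len(y) - A.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    t_stats = beta / se
    p = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return beta[1:], p[1:]  # drop the intercept

X_train, y_train = make_data(500)
X_val, y_val = make_data(500)

# 1. Fit the model on the training set.
coef_train, _ = ols_with_pvalues(X_train, y_train)

# 2. Shapley values on the validation set (closed form for a linear model,
#    with the training mean as the background).
shap_val = coef_train * (X_val - X_train.mean(axis=0))

# 3. Regress the validation target on the Shapley values.
coef_shap, p_shap = ols_with_pvalues(shap_val, y_val)

# 4. Drop features with negative coefficients, keep the significant ones.
threshold = 0.05
selected = (coef_shap > 0) & (p_shap < threshold)
print(selected)
```

A well-fitted informative feature gets a coefficient near 1 on its Shapley values, even when its original effect is negative. A feature the model merely overfitted to has no systematic relationship with the validation target, so its coefficient tends to be statistically insignificant or negative.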
docs/Quick feature selection through regression on Shapley values.ipynb (530 changes: 228 additions & 302 deletions)
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
pandas | ||
scikit_learn | ||
scipy | ||
shap | ||
statsmodels |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
from .select import score_features | ||
from .select import shap_select |