Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SequentialFeatureSelection Early Stopping Criterion #886

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

aldder
Copy link

@aldder aldder commented Feb 2, 2022

Description

According to some studies reported on papers (like this: https://www.researchgate.net/publication/220804144_Overfitting_in_Wrapper-Based_Feature_Subset_Selection_The_Harder_You_Try_the_Worse_it_Gets), the feature selection methodologies known as Wrapper suffer from overfitting as the number of explored states increases.
A method to reduce this overfitting is to use automatic stop criteria (early stop, as the one most known for neural networks).
In this PR I have implemented the criterion of early stopping for the class SequentialFeatureSelector.

One parameter has been added in during instantiation of the object:

early_stop_rounds : int (default 0)
    Enable early stopping criterion when > 0, this value determines the
    number of iterations after which, if no performance boost has been
    seen, execution is stopped.
    Used only when `k_features == 'best'` or `k_features == 'parsimonious'`

Code Example:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

np.random.seed(0)
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# add some noise in order to have features to discard
X_iris_with_noise = np.concatenate(
    (X_iris,
    np.random.randn(X_iris.shape[0], X_iris.shape[1])),
    axis=1)

knn = KNeighborsClassifier()
sfs = SFS(
    estimator=knn,
    k_features='best',
    forward=True,
    early_stop_rounds=0,
    verbose=0)

sfs.fit(X_iris_with_noise, y_iris)
plot_sfs(sfs.get_metric_dict());

1

sfs = SFS(
    estimator=knn,
    k_features='best',
    forward=True,
    early_stop_rounds=2,
    verbose=0)

sfs.fit(X_iris_with_noise, y_iris)
plot_sfs(sfs.get_metric_dict());

... Performances not improved for 2 rounds. Stopping now!

2

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

@rasbt
Copy link
Owner

rasbt commented Feb 2, 2022

Thanks for the PR! I agree that overfitting can become an issue. Currently, there is the option

  • k_features="parsimonious"

which will select the smallest feature set within 1 standard error of the best feature set, which helps with this.

I like adding an early_stop option, but I have a few suggestions / concerns regarding the API:

1)

I think that the two parameters

early_stop and early_stop_rounds can be consolidated into a single one. E.g.,

                 if self.early_stop and k != k_to_select:
                     if k_score <= best_score:

could be

                 if self.early_stop_rounds and k != k_to_select:
                     if k_score <= best_score:

What I mean is instead of having

  • early_stop : bool (default: False)
  • early_stop_rounds : int (default 3)

this could be simplified to

  • early_stop_rounds : int (default 0)

2)

The second concern I have is that if a user selects e.g., k_features=(1, 3), early_stop_rounds=3, it's not necessarily guaranteed that there will be 1 to 3 features, which can be confusing.

I wonder if it makes sense to allow early_stopping_rounds only for k_features='best' and k_features='parsimonious', which both explore the whole feature subset size space.

E.g.,

  • if k_features='best' is selected with early_stop_rounds=0, it will evaluate and select the best feature subset in the range 1 to m.
  • if k_features='best' is selected with early_stop_rounds=2, it will evaluate and select the best feature subset in the range 1 to m but it may stop early, which means for forward selection, it may not explore the whole feature subset sizes up to m.

If a user selects e.g., k_features=3 and early_stop_rounds=2 we could

  • a) raise an error saying that f"Early stopping is not supported with fixed feature subset sizes."
  • b) raise a warning saying f"Early stopping may lead to feature subset sizes that are different from k_features={k_features}."

What are your thoughts?

@aldder
Copy link
Author

aldder commented Feb 2, 2022

Thanks for your suggestions, I agree with your points.
I will edit this PR with the followings:

  • have only one parameter early_stop_rounds : int >= 0
  • enable early stopping only if early_stop_rounds > 0 and k_features in ['best', 'parsimonious']

@rasbt
Copy link
Owner

rasbt commented Feb 2, 2022

Sounds great! Looking forward to it!

@pep8speaks
Copy link

pep8speaks commented Feb 2, 2022

Hello @aldder! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-02-03 10:37:22 UTC

if not isinstance(early_stop_rounds, int) or early_stop_rounds < 0:
raise ValueError('Number of early stopping round should be '
'an integer value greater than or equal to 0.'
'Got %d' % early_stop_rounds)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ...Got %d' % early_stop_rounds might not work if early_stop_rounds is not an integer. Maybe it's better to replace it with

...Got %s' % early_stop_rounds

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right!

# early stop
if self.early_stop_rounds \
and k != k_to_select \
and self.k_features in {'best', 'parsimonious'}:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of the check here, i would maybe change it to raising a ValueError in the top of the function if self.k_features is not in {'best', 'parsimonious'} and self.early_stop_rounds. This way the user is aware, and we only have to perform the check once.

Copy link
Author

@aldder aldder Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rasbt Do you prefer having this check on top of fit function or during initialization?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, totally lost track and missed your comment! Sorry! Regarding your question, I think fit might be better to keep it more consistent with scikit-learn behavior.

aldder added a commit to aldder/mlxtend that referenced this pull request Feb 4, 2022
@jimmy927
Copy link

jimmy927 commented Oct 2, 2023

What is the status on this ?

I use the k_features="parsimonious" on my model.
But it continues to add more and more features even after it is obvious the model will not improve, and after that it will select one of the very early model anywys.

I think this PR could get my runtime from 10 days into hours ;-)

@rasbt
Copy link
Owner

rasbt commented Oct 14, 2023

Thanks for the ping, and I need to look into this some time -- sorry, haven't had a chance recently due to too many other commitments. Unfortunately, I currently don't have a timeline for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants