SequentialFeatureSelection Early Stopping Criterion #886
base: master
Conversation
Thanks for the PR! I agree that overfitting can become an issue. Currently, there is the option `k_features='parsimonious'`, which will select the smallest feature set within 1 standard error of the best feature set, which helps with this. I like the idea of adding an early-stopping criterion as well.

1) I think that the two parameters `early_stop` and `early_stop_rounds` could be merged into one. What I mean is, instead of having

    if self.early_stop and k != k_to_select:
        if k_score <= best_score:

this could be simplified to

    if self.early_stop_rounds and k != k_to_select:
        if k_score <= best_score:

so that `early_stop_rounds=0` disables early stopping and no separate boolean flag is needed.

2) The second concern I have is that if a user selects a fixed number of features, e.g., `k_features=3`, early stopping could terminate the search before that number is reached. I wonder if it makes sense to allow early stopping only for `k_features='best'` or `'parsimonious'`.

What are your thoughts?
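The single-parameter simplification discussed in point 1) could be sketched as follows (the helper name `should_stop` is hypothetical, for illustration only):

```python
# Sketch of the suggestion in point 1): a single `early_stop_rounds`
# parameter, where 0 means "early stopping disabled", replaces the
# separate boolean `early_stop` flag plus a round count.
def should_stop(early_stop_rounds, rounds_without_improvement):
    # 0 is falsy, so the whole check is skipped when early stopping is off.
    return bool(early_stop_rounds) and \
        rounds_without_improvement >= early_stop_rounds

print(should_stop(0, 100))  # False: disabled, no matter how stale
print(should_stop(3, 3))    # True: three non-improving rounds reached
```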
Thanks for your suggestions, I agree with your points.

Sounds great! Looking forward to it!
    if not isinstance(early_stop_rounds, int) or early_stop_rounds < 0:
        raise ValueError('Number of early stopping round should be '
                         'an integer value greater than or equal to 0.'
                         'Got %d' % early_stop_rounds)
I think `...Got %d' % early_stop_rounds` might not work if `early_stop_rounds` is not an integer. Maybe it's better to replace it with `...Got %s' % early_stop_rounds`.
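This point can be demonstrated with a small standalone snippet:

```python
# Demonstrates the reviewer's point: the %d conversion requires a number,
# so formatting the error message itself raises a TypeError when the
# invalid `early_stop_rounds` value is, e.g., a string. %s accepts anything.
early_stop_rounds = 'abc'  # an invalid, non-integer user input
try:
    msg = 'Got %d' % early_stop_rounds   # fails: %d needs a number
except TypeError:
    msg = 'Got %s' % early_stop_rounds   # works for any object
print(msg)  # Got abc
```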
Right!
    # early stop
    if self.early_stop_rounds \
            and k != k_to_select \
            and self.k_features in {'best', 'parsimonious'}:
Instead of the check here, I would maybe change it to raising a ValueError at the top of the function if `self.early_stop_rounds` is set and `self.k_features` is not in {'best', 'parsimonious'}. This way the user is aware, and we only have to perform the check once.
@rasbt Do you prefer having this check at the top of the `fit` function or during initialization?
Ahh, totally lost track and missed your comment! Sorry! Regarding your question, I think `fit` might be better, to keep it more consistent with scikit-learn behavior.
What is the status on this? I use `k_features="parsimonious"` on my model. I think this PR could get my runtime from 10 days down to hours ;-)

Thanks for the ping, and I need to look into this some time -- sorry, haven't had a chance recently due to too many other commitments. Unfortunately, I currently don't have a timeline for this.
Description
According to studies reported in the literature (e.g., https://www.researchgate.net/publication/220804144_Overfitting_in_Wrapper-Based_Feature_Subset_Selection_The_Harder_You_Try_the_Worse_it_Gets), wrapper-based feature selection methods suffer from overfitting as the number of explored states increases.
A way to reduce this overfitting is to use an automatic stopping criterion (early stopping, best known from neural network training).
In this PR I have implemented an early-stopping criterion for the SequentialFeatureSelector class.
One parameter, `early_stop_rounds`, has been added to the instantiation of the object:
Code Example:
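A self-contained sketch of the proposed behavior (this toy `sequential_forward_select` and its scoring function are illustrative stand-ins, not the PR's actual mlxtend code):

```python
def sequential_forward_select(score_fn, n_features, early_stop_rounds=0):
    """Toy forward selection with the proposed early-stopping criterion:
    stop after `early_stop_rounds` consecutive rounds without improvement
    (0 disables early stopping). Not the actual mlxtend implementation."""
    selected, remaining = [], list(range(n_features))
    best_score, stale_rounds = float('-inf'), 0
    while remaining:
        # greedily add the candidate feature that maximizes the score
        cand = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(cand)
        remaining.remove(cand)
        round_score = score_fn(selected)
        if round_score > best_score:
            best_score, stale_rounds = round_score, 0
        else:
            stale_rounds += 1
            if early_stop_rounds and stale_rounds >= early_stop_rounds:
                break
    return selected, best_score

# Toy score: only features 0-2 help; each extra feature hurts slightly.
def score(feats):
    return sum(1.0 for f in feats if f < 3) - 0.1 * sum(1 for f in feats if f >= 3)

selected, best = sequential_forward_select(score, n_features=10,
                                           early_stop_rounds=2)
print(selected, best)  # [0, 1, 2, 3, 4] 3.0
```

The search stops after two non-improving rounds instead of exhausting all ten features, which is the runtime saving the PR aims for.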
Pull Request Checklist
- [ ] Added a note about the modification or contribution to the `./docs/sources/CHANGELOG.md` file (if applicable)
- [ ] Added appropriate unit test functions in the `./mlxtend/*/tests` directories (if applicable)
- [ ] Modified documentation in the corresponding Jupyter Notebook under `mlxtend/docs/sources/` (if applicable)
- [ ] Ran `PYTHONPATH='.' pytest ./mlxtend -sv` and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., `PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv`)
- [ ] Checked for style issues by running `flake8 ./mlxtend`