
Add catboost integration tests #17931

Merged

Conversation

Contributor

@Matt711 Matt711 commented Feb 6, 2025

Description

Part of #17490. This PR adds back the catboost integration tests, which were originally added in #17267 but were later removed due to ABI incompatibility between the version of numpy that catboost is compiled against and the version of numpy installed in the test environment. This PR restores the tests and pins a compatible numpy version in the catboost tests.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 added feature request New feature or request non-breaking Non-breaking change labels Feb 6, 2025

copy-pr-bot bot commented Feb 6, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@Matt711
Contributor Author

Matt711 commented Feb 6, 2025

/ok to test

@github-actions github-actions bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Feb 6, 2025
@Matt711
Contributor Author

Matt711 commented Feb 6, 2025

/ok to test

@Matt711
Contributor Author

Matt711 commented Feb 6, 2025

/ok to test

common:
  - output_types: conda
    packages:
      # TODO: Remove numpy pinning once https://github.com/catboost/catboost/issues/2671 is resolved
Contributor Author

See this paragraph from the numpy 2 release notes:

Breaking changes to the NumPy ABI. As a result, binaries of packages
that use the NumPy C API and were built against a NumPy 1.xx release
will not work with NumPy 2.0. On import, such packages will see an
ImportError with a message about binary incompatibility.
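
For illustration, the resulting pin in the conda dependency list looks roughly like the sketch below. The exact version bounds are an assumption here (the real values live in the PR diff, not in this excerpt); the intent is just to keep numpy below 2.0 until catboost ships builds against the new ABI.

common:
  - output_types: conda
    packages:
      # TODO: Remove numpy pinning once https://github.com/catboost/catboost/issues/2671 is resolved
      # NOTE: the upper bound below is illustrative, not copied from the PR
      - numpy<2.0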

@Matt711
Contributor Author

Matt711 commented Feb 7, 2025

/ok to test

Contributor Author

@Matt711 Matt711 left a comment

For the reviewer: These were just for testing. I'll remove them before I merge.

@Matt711
Contributor Author

Matt711 commented Feb 7, 2025

/ok to test

@Matt711
Contributor Author

Matt711 commented Feb 7, 2025

@Matt711 Matt711 marked this pull request as ready for review February 7, 2025 07:35
@Matt711 Matt711 requested review from a team as code owners February 7, 2025 07:35
Member

@jameslamb jameslamb left a comment

Giving you a ci-codeowners / packaging-codeowners approval because the description says that this is just bringing back tests that already used to exist, and that's a net gain for test coverage here.

But please do see my suggestions about more thoroughly testing the CatBoost integration.

def classification_data():
    X, y = make_classification(
        n_samples=100, n_features=10, n_classes=2, random_state=42
    )
Member

@jameslamb jameslamb Feb 7, 2025

make_classification() returns a dataset that has only continuous features.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100, n_features=10, n_classes=2, random_state=42
)
X
array([[-1.14052601,  1.35970566,  0.86199147,  0.84609208,  0.60600995,
        -1.55662917,  1.75479418,  1.69645637, -1.28042935, -2.08192941],
...

For catboost in particular, I strongly suspect you'll get better effective test coverage of this integration by including some categorical features.

Encoding and decoding categorical features is critical to how CatBoost works (docs), and there are lots of things that have to go exactly right when providing pandas-like categorical input. Basically, everything here: https://pandas.pydata.org/docs/user_guide/categorical.html

I really think you should provide an input dataset that has some categorical features, ideally in 2 forms:

  • integer-type columns
  • pandas.Categorical-type columns

And ideally with varying cardinality.

You could consider adapting this code used in xgboost's tests: https://github.com/dmlc/xgboost/blob/105aa4247abb3ce787be2cef2f9beb4c24b30049/demo/guide-python/categorical.py#L29

And here are some docs on how to tell CatBoost which features are categorical: https://catboost.ai/docs/en/concepts/python-usages-examples#class-with-array-like-data-with-numerical,-categorical-and-embedding-features
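
Here is a minimal sketch of what such a fixture could look like. It is not the code from this PR: the column names, cardinalities, and model parameters are illustrative, and it assumes catboost, pandas, numpy, and scikit-learn are installed.

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

rng = np.random.default_rng(42)
X, y = make_classification(
    n_samples=1_000, n_features=10, n_classes=2, random_state=42
)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
# An integer-typed categorical column (low cardinality).
df["cat_int"] = rng.integers(0, 4, size=len(df))
# A pandas Categorical column (higher cardinality).
df["cat_pd"] = pd.Categorical(
    rng.choice([f"level_{i}" for i in range(20)], size=len(df))
)

model = CatBoostClassifier(iterations=50, verbose=0)
# cat_features tells CatBoost which columns to treat as categorical.
model.fit(df, y, cat_features=["cat_int", "cat_pd"])

Listing the columns in cat_features by name exercises CatBoost's own categorical encoding path rather than leaving those columns to be treated as numeric.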

@pytest.fixture
def classification_data():
    X, y = make_classification(
        n_samples=100, n_features=10, n_classes=2, random_state=42
Member

Suggested change
-        n_samples=100, n_features=10, n_classes=2, random_state=42
+        n_samples=1_000, n_features=10, n_classes=2, random_state=42

You may want to use slightly more data, here and in regression_data(). There are some types of encoding and data-access bugs that only show up in certain CatBoost codepaths, which are exercised when there are enough splits per tree.

I've seen this before in LightGBM and XGBoost... someone will write a test that fits on a very small dataset and it'll look like nothing went wrong, only to later find that actually the dataset was so small that the model was just a collection of decision stumps (no splits), and so the test could never catch issues like "this encoding doesn't preserve NAs" or "these outputs are different because of numerical precision issues".
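
One cheap way to guard against that failure mode, sketched below assuming catboost's tree_count_ attribute and get_tree_leaf_counts() method behave as documented, is to assert after fitting that at least one tree actually split:

# Hypothetical post-fit sanity check, not taken from this PR.
# A tree with exactly one leaf made no splits, so a model whose trees
# all have one leaf is just a collection of constant predictors.
leaf_counts = model.get_tree_leaf_counts()  # number of leaves per tree
assert model.tree_count_ > 0
assert max(leaf_counts) > 1, "no tree ever split; the dataset may be too small"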

@Matt711
Contributor Author

Matt711 commented Feb 8, 2025

Giving you a ci-codeowners / packaging-codeowners approval because the description says that this is just bringing back tests that already used to exist, and that's a net gain for test coverage here.

But please do see my suggestions about more thoroughly testing the CatBoost integration.

Thanks for the suggestions on improving these tests @jameslamb! This library is new to me, so I appreciate the time you took to investigate some of the APIs. I think what's in this PR is a good starting point, but I agree with your suggestions, so I'll include them in a follow-up PR. I'll also ask others offline who are more familiar with CatBoost/XGBoost for their suggestions.

@Matt711 Matt711 marked this pull request as draft February 10, 2025 20:23
@Matt711 Matt711 marked this pull request as ready for review February 10, 2025 20:26
@Matt711 Matt711 requested a review from mroeschke February 10, 2025 20:26
@Matt711
Contributor Author

Matt711 commented Feb 11, 2025

/merge

@rapids-bot rapids-bot bot merged commit e6b1c0f into rapidsai:branch-25.04 Feb 13, 2025
108 of 109 checks passed
rapids-bot bot pushed a commit that referenced this pull request Mar 31, 2025
Follow-up to #17931. This PR tests catboost with categorical features and increases the size of the test data we fit our catboost models on.

Authors:
  - Matthew Murray (https://github.com/Matt711)
  - Ray Douglass (https://github.com/raydouglass)
  - Marco Edward Gorelli (https://github.com/MarcoGorelli)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #18126