
Add catboost integration tests #17931

Open
Matt711 wants to merge 6 commits into branch-25.04

Conversation

Matt711
Contributor

@Matt711 Matt711 commented Feb 6, 2025

Description

Part of #17490. This PR adds back the catboost integration tests, which were originally added in #17267 but were later removed due to an ABI incompatibility between the version of numpy that catboost is compiled against and the version of numpy installed in the test environment. This PR adds the tests back and pins a compatible numpy version in the catboost tests.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 added feature request New feature or request non-breaking Non-breaking change labels Feb 6, 2025

copy-pr-bot bot commented Feb 6, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@Matt711
Contributor Author

Matt711 commented Feb 6, 2025

/ok to test

@github-actions github-actions bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Feb 6, 2025
@Matt711
Contributor Author

Matt711 commented Feb 6, 2025

/ok to test

@Matt711
Contributor Author

Matt711 commented Feb 6, 2025

/ok to test

common:
- output_types: conda
packages:
# TODO: Remove numpy pinning once https://github.com/catboost/catboost/issues/2671 is resolved
Contributor Author

See this paragraph from the numpy 2 release

Breaking changes to the NumPy ABI. As a result, binaries of packages
that use the NumPy C API and were built against a NumPy 1.xx release
will not work with NumPy 2.0. On import, such packages will see an
ImportError with a message about binary incompatibility.
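
For context, the pin referenced in the diff above would take roughly this shape in the catboost section of dependencies.yaml (the exact version bound here is illustrative, not quoted from the PR):

```yaml
common:
  - output_types: conda
    packages:
      # TODO: Remove numpy pinning once https://github.com/catboost/catboost/issues/2671 is resolved
      # Keep the test environment on the NumPy 1.x ABI that catboost wheels are built against.
      - numpy>=1.23,<2.0.0a0
```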

@Matt711
Contributor Author

Matt711 commented Feb 7, 2025

/ok to test

Contributor Author

@Matt711 Matt711 left a comment

For the reviewer: These were just for testing. I'll remove before I merge.

@@ -41,6 +41,7 @@ jobs:
- pandas-tests
- pandas-tests-diff
- telemetry-setup
- third-party-integration-tests-cudf-pandas
Contributor Author
Suggested change
- third-party-integration-tests-cudf-pandas

Comment on lines +360 to +373
third-party-integration-tests-cudf-pandas:
needs: wheel-build-cudf
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/[email protected]
with:
build_type: pull-request
arch: "amd64"
branch: ${{ inputs.branch }}
date: ${{ inputs.date }}
sha: ${{ inputs.sha }}
node_type: "gpu-v100-latest-1"
container_image: "rapidsai/ci-conda:latest"
run_script: |
ci/cudf_pandas_scripts/third-party-integration/test.sh python/cudf/cudf_pandas_tests/third_party_integration_tests/dependencies.yaml
Contributor Author

Suggested change
third-party-integration-tests-cudf-pandas:
needs: wheel-build-cudf
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/[email protected]
with:
build_type: pull-request
arch: "amd64"
branch: ${{ inputs.branch }}
date: ${{ inputs.date }}
sha: ${{ inputs.sha }}
node_type: "gpu-v100-latest-1"
container_image: "rapidsai/ci-conda:latest"
run_script: |
ci/cudf_pandas_scripts/third-party-integration/test.sh python/cudf/cudf_pandas_tests/third_party_integration_tests/dependencies.yaml

@@ -262,7 +278,7 @@ dependencies:
packages:
- pip
- pip:
- ibis-framework[pandas]
- ibis-framework[pandas]==9.5.*
Contributor Author

Suggested change
- ibis-framework[pandas]==9.5.*
- ibis-framework[pandas]

Contributor Author

See #17945

@Matt711
Contributor Author

Matt711 commented Feb 7, 2025

/ok to test

@Matt711
Contributor Author

Matt711 commented Feb 7, 2025

@Matt711 Matt711 marked this pull request as ready for review February 7, 2025 07:35
@Matt711 Matt711 requested review from a team as code owners February 7, 2025 07:35
Member

@jameslamb jameslamb left a comment


Giving you a ci-codeowners / packaging-codeowners approval because the description says that this is just bringing back tests that already used to exist, and that's a net gain for test coverage here.

But please do see my suggestions about more thoroughly testing the CatBoost integration.

@@ -0,0 +1,128 @@
# Copyright (c) 2023-2025, NVIDIA CORPORATION.
Member

Suggested change
# Copyright (c) 2023-2025, NVIDIA CORPORATION.
# Copyright (c) 2025, NVIDIA CORPORATION.

This is a brand new file, shouldn't the copyright date only be 2025? Or was it copied from somewhere else?

model = CatBoostRegressor(iterations=10, verbose=0)
model.fit(X.values, y.values)
predictions = model.predict(X.values)
return predictions
Member

Sorry in advance, I'm not that familiar with these tests, but I'm surprised to see pytest test cases with a return statement. What is the interaction between these test cases and this marker a few lines up?

pytestmark = pytest.mark.assert_eq(fn=assert_catboost_equal)

Did you mean for there to be some kind of testing assertion here? Or does that custom marker somehow end up invoking that function and comparing the output of the test case with pandas inputs to its output with cudf inputs?
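
For reference, a comparison helper of the shape that marker appears to expect might look like the following. This is a hypothetical sketch, not the actual `assert_catboost_equal` from the PR; it assumes the harness calls the function with the test's return value captured once under pandas and once under cudf.pandas:

```python
import numpy as np


def assert_catboost_equal(expect, got, rtol=1e-7, atol=0.0):
    """Compare two CatBoost prediction arrays elementwise.

    `expect` is the test function's return value under plain pandas,
    `got` the return value of the same test under cudf.pandas.
    """
    expect = np.asarray(expect)
    got = np.asarray(got)
    # Shapes must match before an elementwise comparison makes sense.
    assert expect.shape == got.shape, "prediction shapes differ"
    # Allow small floating-point differences between the two runs.
    np.testing.assert_allclose(expect, got, rtol=rtol, atol=atol)
```

Under that reading, the `return predictions` in each test is the payload the marker's `fn=` callback compares, which would explain the absence of inline assertions.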

def classification_data():
X, y = make_classification(
n_samples=100, n_features=10, n_classes=2, random_state=42
)
Member

@jameslamb jameslamb Feb 7, 2025

make_classification() returns a dataset that has only continuous features.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100, n_features=10, n_classes=2, random_state=42
)
X
array([[-1.14052601,  1.35970566,  0.86199147,  0.84609208,  0.60600995,
        -1.55662917,  1.75479418,  1.69645637, -1.28042935, -2.08192941],
...

For catboost in particular, I strongly suspect you'll get better effective test coverage of this integration by including some categorical features.

Encoding and decoding categorical features is critical to how CatBoost works (docs), and there are lots of things that have to go exactly right when providing pandas-like categorical input. Basically, everything here: https://pandas.pydata.org/docs/user_guide/categorical.html

I really think you should provide an input dataset that has some categorical features, ideally in 2 forms:

  • integer-type columns
  • pandas.categorical type columns

And ideally with varying cardinality.

You could consider adapting this code used in xgboost's tests: https://github.com/dmlc/xgboost/blob/105aa4247abb3ce787be2cef2f9beb4c24b30049/demo/guide-python/categorical.py#L29

And here are some docs on how to tell CatBoost which features are categorical: https://catboost.ai/docs/en/concepts/python-usages-examples#class-with-array-like-data-with-numerical,-categorical-and-embedding-features
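
A fixture along the lines suggested above might build its frame like this. This is a sketch under stated assumptions: the helper name, column names, and cardinalities are all illustrative, and the CatBoost fitting step is shown only in the comment since the point here is the data shape:

```python
import numpy as np
import pandas as pd


def make_mixed_classification(n_samples=1_000, seed=42):
    """Toy binary-classification frame mixing continuous and
    categorical features, per the review suggestion."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame(
        {
            # continuous features
            "f0": rng.normal(size=n_samples),
            "f1": rng.normal(size=n_samples),
            # integer-coded categorical, low cardinality
            "cat_int": rng.integers(0, 3, size=n_samples),
            # pandas.Categorical column, higher cardinality
            "cat_pd": pd.Categorical(
                rng.choice(list("abcdefgh"), size=n_samples)
            ),
        }
    )
    # Label depends on both a continuous and a categorical feature,
    # so splits on the categorical columns actually matter.
    y = (X["f0"] + (X["cat_int"] == 1)).gt(0).astype(int)
    # Columns CatBoost should treat as categorical, e.g.
    # CatBoostClassifier(...).fit(X, y, cat_features=cat_features)
    cat_features = ["cat_int", "cat_pd"]
    return X, y, cat_features
```

Passing `cat_features` through to `fit()` is what exercises CatBoost's categorical encoding paths that a purely continuous `make_classification()` dataset never touches.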

@pytest.fixture
def classification_data():
X, y = make_classification(
n_samples=100, n_features=10, n_classes=2, random_state=42
Member

Suggested change
n_samples=100, n_features=10, n_classes=2, random_state=42
n_samples=1_000, n_features=10, n_classes=2, random_state=42

You may want to use slightly more data, here and in regression_data(). There are some types of encoding and data-access bugs that only show up in certain codepaths in CatBoost that are exercised when there are enough splits per tree.

I've seen this before in LightGBM and XGBoost... someone will write a test that fits on a very small dataset and it'll look like nothing went wrong, only to later find that actually the dataset was so small that the model was just a collection of decision stumps (no splits), and so the test could never catch issues like "this encoding doesn't preserve NAs" or "these outputs are different because of numerical precision issues".

Labels
cudf.pandas Issues specific to cudf.pandas feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: In Progress
2 participants