
Add catboost integration tests #17931

Merged

Conversation

Contributor

@Matt711 Matt711 commented Feb 6, 2025

Description

Part of #17490. This PR adds back the catboost integration tests, which were originally added in #17267 but were later removed due to ABI incompatibility between the version of numpy that catboost is compiled against and the version of numpy installed in the test environment. This PR restores the tests and pins a compatible numpy version in the catboost tests.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 added feature request New feature or request non-breaking Non-breaking change labels Feb 6, 2025

copy-pr-bot bot commented Feb 6, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@Matt711
Contributor Author

Matt711 commented Feb 6, 2025

/ok to test

@github-actions github-actions bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Feb 6, 2025
@Matt711
Contributor Author

Matt711 commented Feb 6, 2025

/ok to test

@Matt711
Contributor Author

Matt711 commented Feb 6, 2025

/ok to test

common:
  - output_types: conda
    packages:
      # TODO: Remove numpy pinning once https://github.com/catboost/catboost/issues/2671 is resolved
Contributor Author

See this paragraph from the numpy 2 release notes:

Breaking changes to the NumPy ABI. As a result, binaries of packages
that use the NumPy C API and were built against a NumPy 1.xx release
will not work with NumPy 2.0. On import, such packages will see an
ImportError with a message about binary incompatibility.
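
For illustration, the resulting pin in the conda dependency list looks roughly like the sketch below. The exact version bounds are an assumption here (the real values live in the PR diff, not in this excerpt); the intent is just to keep numpy below 2.0 until catboost ships builds against the new ABI.

common:
  - output_types: conda
    packages:
      # TODO: Remove numpy pinning once https://github.com/catboost/catboost/issues/2671 is resolved
      # NOTE: the upper bound below is illustrative, not copied from the PR
      - numpy<2.0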

@Matt711
Contributor Author

Matt711 commented Feb 7, 2025

/ok to test

Contributor Author

@Matt711 Matt711 left a comment

For the reviewer: These were just for testing. I'll remove them before I merge.

@Matt711
Contributor Author

Matt711 commented Feb 7, 2025

/ok to test

@Matt711
Contributor Author

Matt711 commented Feb 7, 2025

@Matt711 Matt711 marked this pull request as ready for review February 7, 2025 07:35
@Matt711 Matt711 requested review from a team as code owners February 7, 2025 07:35
Member

@jameslamb jameslamb left a comment

Giving you a ci-codeowners / packaging-codeowners approval because the description says that this is just bringing back tests that already used to exist, and that's a net gain for test coverage here.

But please do see my suggestions about more thoroughly testing the CatBoost integration.

def classification_data():
    X, y = make_classification(
        n_samples=100, n_features=10, n_classes=2, random_state=42
    )
Member

@jameslamb jameslamb Feb 7, 2025

make_classification() returns a dataset that has only continuous features.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100, n_features=10, n_classes=2, random_state=42
)
X
array([[-1.14052601,  1.35970566,  0.86199147,  0.84609208,  0.60600995,
        -1.55662917,  1.75479418,  1.69645637, -1.28042935, -2.08192941],
...

For catboost in particular, I strongly suspect you'll get better effective test coverage of this integration by including some categorical features.

Encoding and decoding categorical features is critical to how CatBoost works (docs), and there are lots of things that have to go exactly right when providing pandas-like categorical input. Basically, everything here: https://pandas.pydata.org/docs/user_guide/categorical.html

I really think you should provide an input dataset that has some categorical features, ideally in 2 forms:

  • integer-type columns
  • pandas.Categorical-type columns

And ideally with varying cardinality.

You could consider adapting this code used in xgboost's tests: https://github.com/dmlc/xgboost/blob/105aa4247abb3ce787be2cef2f9beb4c24b30049/demo/guide-python/categorical.py#L29

And here are some docs on how to tell CatBoost which features are categorical: https://catboost.ai/docs/en/concepts/python-usages-examples#class-with-array-like-data-with-numerical,-categorical-and-embedding-features
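
Here is a minimal sketch of what such a fixture could look like. It is not the code from this PR: the column names, cardinalities, and model parameters are illustrative, and it assumes catboost, pandas, numpy, and scikit-learn are installed.

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

rng = np.random.default_rng(42)
X, y = make_classification(
    n_samples=1_000, n_features=10, n_classes=2, random_state=42
)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
# An integer-typed categorical column (low cardinality).
df["cat_int"] = rng.integers(0, 4, size=len(df))
# A pandas Categorical column (higher cardinality).
df["cat_pd"] = pd.Categorical(
    rng.choice([f"level_{i}" for i in range(20)], size=len(df))
)

model = CatBoostClassifier(iterations=50, verbose=0)
# cat_features tells CatBoost which columns to treat as categorical.
model.fit(df, y, cat_features=["cat_int", "cat_pd"])

Listing the columns in cat_features by name exercises CatBoost's own categorical encoding path rather than leaving those columns to be treated as numeric.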

@pytest.fixture
def classification_data():
    X, y = make_classification(
        n_samples=100, n_features=10, n_classes=2, random_state=42
Member

Suggested change
-        n_samples=100, n_features=10, n_classes=2, random_state=42
+        n_samples=1_000, n_features=10, n_classes=2, random_state=42

You may want to use slightly more data, here and in regression_data(). There are some types of encoding and data-access bugs that only show up in certain CatBoost codepaths, which are exercised when there are enough splits per tree.

I've seen this before in LightGBM and XGBoost... someone will write a test that fits on a very small dataset and it'll look like nothing went wrong, only to later find that actually the dataset was so small that the model was just a collection of decision stumps (no splits), and so the test could never catch issues like "this encoding doesn't preserve NAs" or "these outputs are different because of numerical precision issues".
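
One cheap way to guard against that failure mode, sketched below assuming catboost's tree_count_ attribute and get_tree_leaf_counts() method behave as documented, is to assert after fitting that at least one tree actually split:

# Hypothetical post-fit sanity check, not taken from this PR.
# A tree with exactly one leaf made no splits, so a model whose trees
# all have one leaf is just a collection of constant predictors.
leaf_counts = model.get_tree_leaf_counts()  # number of leaves per tree
assert model.tree_count_ > 0
assert max(leaf_counts) > 1, "no tree ever split; the dataset may be too small"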

@Matt711
Contributor Author

Matt711 commented Feb 8, 2025

Giving you a ci-codeowners / packaging-codeowners approval because the description says that this is just bringing back tests that already used to exist, and that's a net gain for test coverage here.

But please do see my suggestions about more thoroughly testing the CatBoost integration.

Thanks for the suggestions on improving these tests @jameslamb! This library is new to me, so I appreciate the time you took to investigate some of the APIs. I think what's in this PR is a good starting point, but I agree with your suggestions, so I'll include them in a follow-up PR. I'll also ask others offline who are more familiar with CatBoost/XGBoost for their suggestions.

@Matt711 Matt711 marked this pull request as draft February 10, 2025 20:23
@Matt711 Matt711 marked this pull request as ready for review February 10, 2025 20:26
@Matt711 Matt711 requested a review from mroeschke February 10, 2025 20:26
@Matt711
Contributor Author

Matt711 commented Feb 11, 2025

/merge

@rapids-bot rapids-bot bot merged commit e6b1c0f into rapidsai:branch-25.04 Feb 13, 2025
108 of 109 checks passed
rapids-bot bot pushed a commit that referenced this pull request Mar 31, 2025
Follow-up to #17931. This PR tests catboost with categorical features and increases the size of the test data we fit our catboost models on.

Authors:
  - Matthew Murray (https://github.com/Matt711)
  - Ray Douglass (https://github.com/raydouglass)
  - Marco Edward Gorelli (https://github.com/MarcoGorelli)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #18126