
Improve test coverage in the catboost integration tests #18126

Merged

Conversation

@Matt711 Matt711 (Contributor) commented Feb 27, 2025

Description

Follow-up to #17931. This PR tests catboost with categorical features and increases the size of the test data we fit our catboost models on.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Feb 27, 2025
@Matt711 Matt711 requested a review from a team as a code owner February 27, 2025 22:46
@github-actions github-actions bot added the Python (Affects Python cuDF API.) and cudf.pandas (Issues specific to cudf.pandas) labels Feb 27, 2025
@vyasr vyasr (Contributor) commented Feb 28, 2025

@jameslamb you had some thoughts on the last PR; any interest in taking another stab at reviewing this one? ;)

@jameslamb jameslamb (Member) left a comment

Thanks for the @, and for doing this follow-up PR!

I left a few more suggestions for your consideration. Leaving a non-blocking review... they're all things that shouldn't block a merge and that you can choose whether or not to do based on how much time you want to invest in this.

"categorical_feature": rng.choice(["A", "B", "C"], size=100),
"target": rng.integers(0, 2, size=100),
"fea_1": rng.standard_normal(1000),
"fea_2": rng.choice([0, 1, 2, 3], size=1000),
@jameslamb jameslamb (Member) commented on the lines above:

Choosing these values randomly like this is a risk. The categorical features may be noise to catboost, meaning they wouldn't be chosen for any splits. And if they aren't chosen for any splits, then this test wouldn't be able to catch bugs in encoding/decoding that lead to the wrong representation (because predict() would effectively ignore the values in these columns).

It's for concerns like this that the xgboost example code I shared tries to create categorical features that have some relationship to the target: https://github.com/dmlc/xgboost/blob/105aa4247abb3ce787be2cef2f9beb4c24b30049/demo/guide-python/categorical.py#L40-L45

For these tests, I think the easiest way to do that would be something like this (see the sketch below):

  • use sklearn.datasets.make_classification() to generate the initial dataset (features there are not totally random... they have some relationship to the target)
  • to get an int categorical feature, replace one of those continuous columns with a discretized version (say, split it into deciles with np.quantile() and np.digitize() or something)... e.g. values in the lowest 10% coded as 0, the next 10-20% as 1, etc.
  • to get a category feature, repeat that procedure (but with a different feature's values, and using a different cardinality), but instead of ints use strings like "low", "medium", "high" (or whatever you want) and convert to category dtype

And then in the test, assert that all the features were used at least once. I don't know the exact API for this in catboost... in xgboost / lightgbm it is a property on the estimator called .feature_importances_, like in this example: https://stackoverflow.com/a/79435055/3986677

I know this seems like a lot, but I think it's important. When given a pandas-like input to .predict(), catboost will try to do the same types of transformation as it did on the training data. There's potential for bugs in this integration there that would only be caught at predict() time, and only if the model actually uses the categorical features.
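
A rough sketch of that whole recipe (column names here are hypothetical, and the final check assumes catboost's `get_feature_importance()`, which may not be the exact API):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# make_classification() features are informative rather than pure noise,
# so the model should actually split on them.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=4, n_redundant=0, random_state=42
)

# Int categorical feature: discretize one continuous column into deciles,
# coding the lowest 10% as 0, the next 10-20% as 1, etc.
decile_edges = np.quantile(X[:, 0], np.linspace(0.1, 0.9, 9))
int_cat = np.digitize(X[:, 0], decile_edges)

# String categorical feature: same idea on a different column with lower
# cardinality, mapped to strings and converted to category dtype.
tercile_edges = np.quantile(X[:, 1], [1 / 3, 2 / 3])
str_cat = pd.Categorical(
    np.array(["low", "medium", "high"])[np.digitize(X[:, 1], tercile_edges)]
)

df = pd.DataFrame(
    {
        "fea_1": X[:, 2],
        "fea_2": X[:, 3],
        "int_cat": int_cat,
        "str_cat": str_cat,
        "target": y,
    }
)

# After fitting a model on df, assert every feature was used at least once,
# e.g. (hypothetical; the exact catboost API may differ):
#     assert (model.get_feature_importance() > 0).all()
```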

@Matt711 Matt711 (Contributor Author) replied:

> to get a category feature, repeat that procedure (but with a different feature's values, and using a different cardinality), but instead of ints use strings like "low", "medium", "high" (or whatever you want) and convert to category dtype

I was unable to add this feature. When doing so I ran into this closed catboost issue, catboost/catboost#934, which IMO should still be open since several others have run into it.

@trivialfis trivialfis (Member) left a comment

I'm not familiar with catboost, but overall it would be useful to have tests that check the consistency of the predict function, as James suggested. For instance: what happens if the test dataset has fewer categories? What happens if there are null values? What happens if the test dataset has wrong categories? Just to make sure catboost produces the same results as it does with plain pandas inputs.
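
For example, the edge-case inputs could look something like this (a hypothetical sketch; the fitted `model` and an `assert_predictions_match` helper that compares predictions on cudf.pandas-backed input against plain pandas are assumed, not part of this PR):

```python
import pandas as pd

base = pd.DataFrame(
    {
        "fea_1": [0.1, -0.5, 1.2, 0.3],
        "cat": pd.Categorical(["low", "medium", "high", "low"]),
    }
)

# Fewer categories than at training time.
fewer = base[base["cat"] != "high"].copy()

# Null values in a categorical column.
nulls = base.copy()
nulls.loc[0, "cat"] = None

# A category never seen at training time.
unseen = base.copy()
unseen["cat"] = unseen["cat"].cat.add_categories(["???"])
unseen.loc[0, "cat"] = "???"

# Hypothetical check: predictions on each variant should match between
# cudf.pandas-backed input and plain pandas input, e.g.
#     for variant in (fewer, nulls, unseen):
#         assert_predictions_match(model, variant)
```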

Matt711 and others added 5 commits March 19, 2025 19:58
There appears to be a conda solve conflict between `cuspatial` &
`custreamz`/`libcudf_kafka`.

I think it ultimately stems from `lz4-c` pinnings. `cuspatial`'s
dependencies require `lz4-c >=1.10.0,<1.11.0a0`, but `libcudf_kafka`'s
`librdkafka >=2.5.0,<2.6.0a0` requires `lz4-c >=1.9.3,<1.10.0a0`.

Bumping `librdkafka` to `>=2.8.0` should resolve this conflict.
…#18337)

Update `test_narwhals.sh` to install pytest-env and hypothesis

In Narwhals these are pulled in by using `--group test`, but that
requires uv (which cuDF doesn't use)

Demo of this working:
https://www.kaggle.com/code/marcogorelli/narwhals-pre-release-test-cudf/notebook?scriptVersionId=228712396

---------

Co-authored-by: Edoardo Abati <[email protected]>
Co-authored-by: Vyas Ramasubramani <[email protected]>
@Matt711 Matt711 marked this pull request as draft March 26, 2025 16:11

copy-pr-bot bot commented Mar 26, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@Matt711 Matt711 changed the base branch from branch-25.04 to branch-25.06 March 26, 2025 16:12

copy-pr-bot bot commented Mar 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Matt711 Matt711 marked this pull request as ready for review March 26, 2025 16:24
@Matt711 Matt711 (Contributor Author) commented Mar 26, 2025

/ok to test

@vyasr vyasr (Contributor) left a comment

Looks like a solid improvement. I'll trust the other reviewers' expertise in judging whether this is a sufficient test for what we need to verify works with catboost.

@Matt711 Matt711 (Contributor Author) commented Mar 31, 2025

/merge

@rapids-bot rapids-bot bot merged commit a3d717e into rapidsai:branch-25.06 Mar 31, 2025
110 of 111 checks passed
Labels
cudf.pandas (Issues specific to cudf.pandas), improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), Python (Affects Python cuDF API.)
Projects
Status: Done
6 participants