Add new nvtext tokenized minhash API #17944

Merged
37 commits merged into rapidsai:branch-25.04 from the tokenized-minhash branch
Feb 27, 2025

@davidwendt davidwendt commented Feb 7, 2025

Description

Adds a new minhash API that computes minhash values over ngrams of the row elements in a lists column of strings.

std::unique_ptr<cudf::column> minhash_ngrams(
  cudf::lists_column_view const& input,
  cudf::size_type ngrams,
  uint32_t seed,
  cudf::device_span<uint32_t const> parameter_a,
  cudf::device_span<uint32_t const> parameter_b,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);

Each row of the input column is expected to be a list of words (strings). Each row is hashed using a sliding window of words (ngrams), and the existing permuted minhash algorithm is then reused to produce the minhash values.
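
For readers unfamiliar with the approach, the following is a minimal host-side sketch of the idea for a single row. It is not the libcudf implementation. The names minhash_ngrams_row and hash_ngram are hypothetical, FNV-1a is a stand-in for whatever hash the library uses, and the permutation form (a * h + b) mod (2^61 - 1), truncated to 32 bits, is an assumption about what "permuted" means here. How rows shorter than the ngram width are handled is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Stand-in hash (32-bit FNV-1a mixed with a seed) over `count` consecutive words.
// Assumption: the PR text does not say which hash function is used.
// Note: words are hashed back to back with no separator.
inline std::uint32_t hash_ngram(std::vector<std::string> const& words,
                                std::size_t first,
                                std::size_t count,
                                std::uint32_t seed)
{
  std::uint32_t h = 2166136261u ^ seed;
  for (std::size_t i = first; i < first + count; ++i) {
    for (unsigned char c : words[i]) {
      h ^= c;
      h *= 16777619u;
    }
  }
  return h;
}

// Hypothetical per-row reference: slides a window of `ngrams` words over the row,
// hashes each window once, and keeps the minimum permuted hash per (a, b) pair.
inline std::vector<std::uint32_t> minhash_ngrams_row(std::vector<std::string> const& words,
                                                     std::size_t ngrams,
                                                     std::uint32_t seed,
                                                     std::vector<std::uint32_t> const& a,
                                                     std::vector<std::uint32_t> const& b)
{
  assert(ngrams > 0 && a.size() == b.size());
  constexpr std::uint64_t prime = (1ULL << 61) - 1;  // assumed Mersenne prime modulus
  std::vector<std::uint32_t> result(a.size(), std::numeric_limits<std::uint32_t>::max());

  // Assumption: a row with fewer words than `ngrams` is treated as one window.
  std::size_t const width = std::min(ngrams, words.size());
  if (width == 0) { return result; }  // empty row: every slot stays at the max value

  for (std::size_t first = 0; first + width <= words.size(); ++first) {
    std::uint64_t const hv = hash_ngram(words, first, width, seed);
    for (std::size_t k = 0; k < a.size(); ++k) {
      // a[k] * hv + b[k] cannot overflow 64 bits because each operand is below 2^32.
      std::uint64_t const p = (static_cast<std::uint64_t>(a[k]) * hv + b[k]) % prime;
      result[k] = std::min(result[k], static_cast<std::uint32_t>(p & 0xFFFFFFFFull));
    }
  }
  return result;
}
```

In this sketch the output has one value per (a, b) pair, and two rows that contain the same set of ngrams produce identical values, which is the property that makes minhash useful for similarity estimation.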

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the labels 2 - In Progress, libcudf, strings, improvement, and non-breaking on Feb 7, 2025
@davidwendt self-assigned this on Feb 7, 2025

copy-pr-bot bot commented Feb 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions bot added the labels Python and pylibcudf on Feb 11, 2025
@davidwendt added the label 3 - Ready for Review and removed the label 2 - In Progress on Feb 11, 2025
@davidwendt
Contributor Author

/ok to test

@davidwendt davidwendt marked this pull request as ready for review February 24, 2025 15:38
@davidwendt davidwendt requested review from a team as code owners February 24, 2025 15:38
Member

@ayushdg ayushdg left a comment

Was able to test this PR on larger scale datasets and the results seem to be in line with what we would expect for the given data. I'll run more experiments in the future and open issues if I observe unexpected behavior, but this should be good to go and works for our needs. Thanks!

Member

@PointKernel PointKernel left a comment

Very clean code, looks great!

Member

@mhaseeb123 mhaseeb123 left a comment

C++ changes LGTM

@davidwendt
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 4fda491 into rapidsai:branch-25.04 Feb 27, 2025
113 checks passed
@davidwendt davidwendt deleted the tokenized-minhash branch February 27, 2025 20:42
Labels
3 - Ready for Review, improvement, libcudf, non-breaking, pylibcudf, Python, strings
Projects
Status: Done
7 participants