Add new nvtext tokenized minhash API #17944

davidwendt · 2025-02-07T00:19:53Z

Description

Adds a new minhash API that, given a lists column of strings, computes minhash values over ngrams of each row's elements.

std::unique_ptr<cudf::column> minhash_ngrams(
  cudf::lists_column_view const& input,
  cudf::size_type ngrams,
  uint32_t seed,
  cudf::device_span<uint32_t const> parameter_a,
  cudf::device_span<uint32_t const> parameter_b,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);

The input column is expected to contain rows of words (strings). Each row is hashed using a sliding window of words (ngrams), and the existing permuted minhash algorithm is then reused to produce the minhash values.
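
As a rough illustration of the approach described above, here is a minimal host-side C++ sketch for a single row. It is not the libcudf implementation: the helper names hash_window and minhash_ngrams_row are hypothetical, FNV-1a stands in for the real hash function, and the Mersenne-prime modulus, the way the seed is mixed in, and the handling of rows shorter than ngrams are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Stand-in 32-bit hash (FNV-1a) over a window of words, mixed with a seed.
// Assumption: chosen only for brevity; this is not the hash used by libcudf.
inline std::uint32_t hash_window(std::vector<std::string> const& words,
                                 std::size_t first,
                                 std::size_t count,
                                 std::uint32_t seed)
{
  std::uint32_t h = 2166136261u ^ seed;
  for (std::size_t w = first; w < first + count; ++w) {
    for (unsigned char c : words[w]) {
      h ^= c;
      h *= 16777619u;
    }
  }
  return h;
}

// Hypothetical host-side analogue of minhash_ngrams for one row of words.
// Returns one minhash value per (parameter_a[i], parameter_b[i]) pair.
inline std::vector<std::uint32_t> minhash_ngrams_row(
  std::vector<std::string> const& words,
  std::size_t ngrams,
  std::uint32_t seed,
  std::vector<std::uint32_t> const& parameter_a,
  std::vector<std::uint32_t> const& parameter_b)
{
  assert(ngrams > 0);
  assert(parameter_a.size() == parameter_b.size());

  // Assumed modulus for the permutation step: the Mersenne prime 2^61 - 1.
  constexpr std::uint64_t prime = (1ULL << 61) - 1;

  std::vector<std::uint32_t> result(parameter_a.size(),
                                    std::numeric_limits<std::uint32_t>::max());
  if (words.empty()) { return result; }

  // Assumption: a row with fewer words than `ngrams` is hashed as one window.
  std::size_t const width   = std::min(ngrams, words.size());
  std::size_t const windows = words.size() - width + 1;

  for (std::size_t first = 0; first < windows; ++first) {
    // Hash the current sliding window of `width` consecutive words.
    std::uint64_t const hv = hash_window(words, first, width, seed);
    for (std::size_t i = 0; i < parameter_a.size(); ++i) {
      // Permute the window hash: (a * h + b) mod prime, truncated to 32 bits.
      // a, h, and b are all below 2^32, so the 64-bit arithmetic cannot overflow.
      std::uint64_t const permuted = (parameter_a[i] * hv + parameter_b[i]) % prime;
      result[i] = std::min(result[i], static_cast<std::uint32_t>(permuted));
    }
  }
  return result;
}
```

For a row such as {"the", "quick", "brown", "fox"} with ngrams = 2, the sketch hashes the three word pairs and keeps, for each (a, b) pair, the smallest permuted value. The output therefore has one value per parameter pair regardless of row length.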

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-02-07T00:19:56Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

davidwendt · 2025-02-12T14:44:43Z

/ok to test

davidwendt · 2025-02-19T12:32:21Z

/ok to test

davidwendt · 2025-02-21T00:19:40Z

/ok to test

davidwendt · 2025-02-21T16:05:35Z

/ok to test

davidwendt · 2025-02-24T13:39:08Z

/ok to test

ayushdg

I was able to test this PR on larger-scale datasets, and the results seem to be in line with what we would expect for the given data. I'll run more experiments in the future and open issues if I observe unexpected behavior, but this should be good to go and works for our needs. Thanks!

PointKernel

Very clean code, looks great!

cpp/src/text/minhash.cu

mhaseeb123

C++ changes LGTM

davidwendt · 2025-02-27T20:42:36Z

/merge

Add new nvtext tokenized minhash API

491de6b

davidwendt added the 2 - In Progress, libcudf, strings, improvement, and non-breaking labels Feb 7, 2025

davidwendt self-assigned this Feb 7, 2025

davidwendt added 12 commits February 6, 2025 19:25

Merge branch 'branch-25.04' into tokenized-minhash

efe51d8

Merge branch 'branch-25.04' into tokenized-minhash

580f38e

fix merge conflict

c1e805a

fix kernel; add minhash64_ngrams, gtest

9e1472b

Merge branch 'tokenized-minhash' of github.com:davidwendt/cudf into tokenized-minhash

d31be75

Merge branch 'branch-25.04' into tokenized-minhash

8c8c7aa

Merge branch 'branch-25.04' into tokenized-minhash

8b09c5e

Merge branch 'branch-25.04' into tokenized-minhash

49ad728

Merge branch 'branch-25.04' into tokenized-minhash

e5afa36

Merge branch 'branch-25.04' into tokenized-minhash

66027a9

add python interfaces

f7e570b

Merge branch 'branch-25.04' into tokenized-minhash

937c90f

github-actions bot added the Python and pylibcudf labels Feb 11, 2025

davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Feb 11, 2025

davidwendt added 3 commits February 11, 2025 17:08

fix doxygen, add more gtests

1c2d408

Merge branch 'branch-25.04' into tokenized-minhash

15d14ec

Merge branch 'tokenized-minhash' of github.com:davidwendt/cudf into tokenized-minhash

b91d80b

GregoryKimball requested a review from shrshi February 11, 2025 22:40

Merge branch 'branch-25.04' into tokenized-minhash

ce4e25f

davidwendt added 2 commits February 19, 2025 07:26

change Column to ColumnBase

d235959

change thrust::get to cuda::std::get

d7db2dd

davidwendt added 5 commits February 19, 2025 07:38

Merge branch 'tokenized-minhash' of github.com:davidwendt/cudf into tokenized-minhash

ad7e7a4

Merge branch 'branch-25.04' into tokenized-minhash

827093f

Merge branch 'branch-25.04' into tokenized-minhash

504fe94

add sliced gtest

a392ac1

Merge branch 'branch-25.04' into tokenized-minhash

e7d77c9

davidwendt added 2 commits February 21, 2025 10:37

Merge branch 'branch-25.04' into tokenized-minhash

32038d1

fix pylibcudf pytest

5954543

davidwendt added 2 commits February 24, 2025 08:19

Merge branch 'tokenized-minhash' of github.com:davidwendt/cudf into tokenized-minhash

6e5a567

Merge branch 'branch-25.04' into tokenized-minhash

00bf07d

davidwendt marked this pull request as ready for review February 24, 2025 15:38

davidwendt requested review from a team as code owners February 24, 2025 15:38

davidwendt requested review from Matt711, brandon-b-miller, mhaseeb123 and PointKernel February 24, 2025 15:38

ayushdg approved these changes Feb 26, 2025

Matt711 approved these changes Feb 26, 2025

PointKernel approved these changes Feb 26, 2025

cpp/src/text/minhash.cu (resolved)

mhaseeb123 approved these changes Feb 26, 2025

shrshi approved these changes Feb 26, 2025

rapids-bot bot merged commit 4fda491 into rapidsai:branch-25.04 Feb 27, 2025
113 checks passed

davidwendt deleted the tokenized-minhash branch February 27, 2025 20:42
