-
Notifications
You must be signed in to change notification settings - Fork 934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new nvtext tokenized minhash API #17944
Add new nvtext tokenized minhash API #17944
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was able to test this PR on larger scale datasets and the results seem to be in line with what we would expect for the given data. I'll run more experiments in the future and open issues if I observe unexpected behavior, but this should be good to go and works for our needs. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very clean code, looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++ changes LGTM
/merge |
Description
Creates a new minhash API that works on ngrams of row elements given a list column of strings.
The input column is expected to be rows of words (strings) and each row is hashed using a sliding window of words (ngrams) and then the permuted algorithm is re-used to produce the minhash values.
Checklist