Add nvtext::tokenized_to_tensor API #17932
Draft
Description
Adds the new nvtext::tokenized_to_tensor API for compatibility with the existing subword tokenizer. The subword tokenizer is to be replaced by a normalizer, a wordpiece tokenizer, and this new API, which together provide everything needed to match what the subword tokenizer currently produces.
The wordpiece tokenizer created in #17600 produces tokens for an input strings column as a list column of integers. That list column can be passed to this new API to produce the existing tensor data structure, which contains the token-ids (truncated or strided) as well as an attention mask and metadata.
The nvtext::tokenizer_result is defined here: cudf/cpp/include/nvtext/subword_tokenize.hpp (line 77 in 4323ae4)
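To make the intended output shape concrete, here is a minimal host-side sketch of turning per-row token lists into fixed-width tensor rows with an attention mask and metadata. It is not the libcudf implementation and does not use the nvtext API: tensor_sketch, to_tensor_sketch, and all parameter names are hypothetical, the data lives in std::vector rather than in device columns, and the stride semantics (stride is the step between consecutive windows of one input row) and the metadata layout (input row, first token index, last token index) are assumptions made for illustration that may differ from what this PR implements.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for the tensor layout; NOT nvtext::tokenizer_result.
struct tensor_sketch {
  std::vector<uint32_t> token_ids;       // rows * sequence_length values, zero padded
  std::vector<uint32_t> attention_mask;  // 1 for a real token, 0 for padding
  std::vector<uint32_t> metadata;        // 3 values per tensor row (assumed layout):
                                         // input row, first token index, last token index
  uint32_t rows{0};
  uint32_t sequence_length{0};
};

// Hypothetical helper: converts one token-id list per input row into tensor rows.
// Assumption: with do_truncate == false, each input row is split into windows of
// sequence_length tokens whose start positions are `stride` tokens apart.
tensor_sketch to_tensor_sketch(std::vector<std::vector<uint32_t>> const& tokens,
                               uint32_t sequence_length,
                               uint32_t stride,
                               bool do_truncate)
{
  tensor_sketch out;
  out.sequence_length = sequence_length;
  // Reject parameters this sketch does not handle.
  if (sequence_length == 0 || stride == 0 || stride > sequence_length) { return out; }

  for (uint32_t row = 0; row < tokens.size(); ++row) {
    auto const& ids = tokens[row];
    auto const size = static_cast<uint32_t>(ids.size());
    uint32_t start  = 0;
    while (true) {
      // Number of real tokens that fit in this window.
      auto const count = std::min(sequence_length, size - start);
      for (uint32_t i = 0; i < sequence_length; ++i) {
        bool const valid = i < count;
        out.token_ids.push_back(valid ? ids[start + i] : 0u);
        out.attention_mask.push_back(valid ? 1u : 0u);
      }
      out.metadata.push_back(row);
      out.metadata.push_back(start);
      out.metadata.push_back(count == 0 ? start : start + count - 1);
      ++out.rows;
      // Stop after one window when truncating, or once the row is exhausted.
      if (do_truncate || start + count >= size) { break; }
      start += stride;
    }
  }
  return out;
}
```

Under these assumptions, calling to_tensor_sketch with the rows {1,2,3,4,5} and {6,7}, a sequence length of 4, and a stride of 2 yields three tensor rows when do_truncate is false (the first input row spills into a second, overlapping window) and two tensor rows when do_truncate is true.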
Checklist