Add nvtext::tokenized_to_tensor API #17932

davidwendt · 2025-02-06T17:06:16Z

Description

Adds a new nvtext::tokenized_to_tensor API for compatibility with the existing subword tokenizer. The subword tokenizer is to be replaced by a normalizer, a wordpiece tokenizer, and this new API, which together provide everything needed to match what the subword tokenizer currently produces.

The wordpiece tokenizer added in #17600 produces tokens for an input strings column as a lists column of integers. That lists column can be passed to this new API to produce the existing tensor data structure: token-ids that are either truncated or strided, plus an attention mask and metadata.

nvtext::tokenizer_result nvtext::tokenized_to_tensor(
  cudf::lists_column_view const& input,
  cudf::size_type max_sequence_length,
  cudf::size_type stride,
  bool do_truncate,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);
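
To make the truncated versus strided behavior concrete, here is a minimal host-side sketch. It is not the libcudf implementation and does not call any nvtext or cudf function. The names tensor_sketch, to_tensor_sketch, and source_row are hypothetical, and the windowing rule (each additional output row starts stride tokens after the previous one, with zero padding) is an assumption based on how the existing subword tokenizer is described, not something this PR confirms. The real metadata layout is also not reproduced; source_row is only a stand-in.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for the tensor output.
// This is NOT nvtext::tokenizer_result.
struct tensor_sketch {
  std::vector<int32_t> token_ids;       // output rows * max_sequence_length, zero padded
  std::vector<int32_t> attention_mask;  // 1 for a real token, 0 for padding
  std::vector<int32_t> source_row;      // input row that produced each output row
};

// Assumed semantics (illustration only):
//  - do_truncate == true : one output row per input row, keeping the first
//    max_sequence_length tokens
//  - do_truncate == false: rows longer than max_sequence_length spill into
//    additional output rows, each starting `stride` tokens after the previous one
tensor_sketch to_tensor_sketch(std::vector<std::vector<int32_t>> const& input,
                               int max_sequence_length,
                               int stride,
                               bool do_truncate)
{
  assert(max_sequence_length > 0 && stride > 0);
  tensor_sketch out;
  for (std::size_t row = 0; row < input.size(); ++row) {
    auto const& tokens = input[row];
    int const n        = static_cast<int>(tokens.size());
    int start          = 0;
    while (true) {
      // [start, end) is the window of real tokens for this output row
      int const end = std::min(start + max_sequence_length, n);
      for (int i = 0; i < max_sequence_length; ++i) {
        bool const real = (start + i) < end;
        out.token_ids.push_back(real ? tokens[start + i] : 0);
        out.attention_mask.push_back(real ? 1 : 0);
      }
      out.source_row.push_back(static_cast<int32_t>(row));
      // stop after one row when truncating, or once the window reaches the last token
      if (do_truncate || end == n) { break; }
      start += stride;
    }
  }
  return out;
}
```

Under these assumptions, an input row of five tokens with max_sequence_length 4 and stride 2 yields two output rows when do_truncate is false (the second row repeats two tokens from the first) and a single four-token row when do_truncate is true.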

The nvtext::tokenizer_result is defined in cudf/cpp/include/nvtext/subword_tokenize.hpp (line 77 in 4323ae4):

struct tokenizer_result {

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-02-06T17:06:21Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

davidwendt · 2025-02-06T17:09:06Z

/ok to test

Add nvtext::tokenized_to_tensor API

2a8d8ce

davidwendt added the following labels Feb 6, 2025: 2 - In Progress (Currently a work in progress), libcudf (Affects libcudf (C++/CUDA) code.), improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)

davidwendt self-assigned this Feb 6, 2025

Merge branch 'branch-25.04' into list-to-tensor-api

0421247
