Workaround thrust-copy-if limit in wordpiece-tokenizer #12168

davidwendt · 2022-11-16T16:57:05Z

Description

Works around a limitation in thrust::copy_if in nvtext's wordpiece-tokenizer: the call fails if the input-iterator range spans more than int-max elements.

Found existing thrust issue: NVIDIA/cccl#747
This PR calls thrust::copy_if in chunks when the iterator range can span more than int-max.
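For readers unfamiliar with the pattern, here is a minimal host-side sketch of the chunking idea. It is not the code from this PR: it uses std::copy_if as a stand-in for thrust::copy_if so that it runs without a GPU, and the names `chunked_copy_if` and `keep_valid` are hypothetical. The assumption is simply that each individual call must see no more than `chunk_size` elements, and that each call resumes writing where the previous one stopped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper (not the PR's code): applies copy_if to [first, last)
// in pieces of at most `chunk_size` elements, so that no single call sees a
// range longer than the limit. Requires random-access iterators.
// Returns the end of the written output.
template <typename InputIt, typename OutputIt, typename Predicate>
OutputIt chunked_copy_if(InputIt first, InputIt last, OutputIt out,
                         Predicate pred, std::int64_t chunk_size)
{
  assert(chunk_size > 0);
  std::int64_t remaining = std::distance(first, last);
  while (remaining > 0) {
    std::int64_t const count = std::min(remaining, chunk_size);
    // Each call continues writing where the previous one stopped, so the
    // output stays contiguous and in input order.
    out = std::copy_if(first, first + count, out, pred);
    first += count;
    remaining -= count;
  }
  return out;
}

// Hypothetical usage: keep the non-negative values, `chunk_size` at a time.
std::vector<int> keep_valid(std::vector<int> const& ids, std::int64_t chunk_size)
{
  std::vector<int> result(ids.size());
  auto const end = chunked_copy_if(ids.begin(), ids.end(), result.begin(),
                                   [](int id) { return id >= 0; }, chunk_size);
  result.resize(static_cast<std::size_t>(std::distance(result.begin(), end)));
  return result;
}
```

The result does not depend on the chunk size, which is what makes the workaround safe: the chunk boundary only decides how many calls are made, not which elements are kept or in what order.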

Found while working on #12079

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

codecov · 2022-11-17T06:28:16Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.02@7426a06).
Patch has no changes to coverable lines.

Additional details and impacted files

@@               Coverage Diff               @@
##             branch-23.02   #12168   +/-   ##
===============================================
  Coverage                ?   88.26%           
===============================================
  Files                   ?      137           
  Lines                   ?    22586           
  Branches                ?        0           
===============================================
  Hits                    ?    19935           
  Misses                  ?     2651           
  Partials                ?        0


davidwendt · 2022-11-17T13:14:08Z

rerun tests

cpp/src/text/subword/wordpiece_tokenizer.cu

vyasr

LGTM. A couple of minor non-blocking suggestions.

cpp/src/text/subword/wordpiece_tokenizer.cu

davidwendt · 2022-11-28T13:15:30Z

@gpucibot merge

davidwendt added 3 commits November 14, 2022 18:09

Workaround thrust-copy-if limit in wordpiece-tokenizer

da29003

Merge branch 'branch-23.02' into fix-wpt-thrust-copy-if

232cef7

Merge branch 'branch-23.02' into fix-wpt-thrust-copy-if

7fe8c1d

davidwendt added the bug, 2 - In Progress, libcudf, and non-breaking labels Nov 16, 2022

davidwendt self-assigned this Nov 16, 2022

Merge branch 'branch-23.02' into fix-wpt-thrust-copy-if

5f241a3

davidwendt added 2 commits November 17, 2022 16:32

Merge branch 'branch-23.02' into fix-wpt-thrust-copy-if

de9e123

Merge branch 'branch-23.02' into fix-wpt-thrust-copy-if

eacff43

davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Nov 18, 2022

davidwendt marked this pull request as ready for review November 18, 2022 14:05

davidwendt requested a review from a team as a code owner November 18, 2022 14:05

davidwendt requested review from vyasr and ttnghia and removed request for a team November 18, 2022 14:05

ttnghia reviewed Nov 18, 2022

cpp/src/text/subword/wordpiece_tokenizer.cu (outdated, resolved)

simplify setting contiguous_token_ids

3d896d3

ttnghia approved these changes Nov 18, 2022


davidwendt added 4 commits November 21, 2022 07:44

Merge branch 'branch-23.02' into fix-wpt-thrust-copy-if

8dfae66

Merge branch 'branch-23.02' into fix-wpt-thrust-copy-if

b448b16

Merge branch 'branch-23.02' into fix-wpt-thrust-copy-if

beaefe9

Merge branch 'branch-23.02' into fix-wpt-thrust-copy-if

2b35035

vyasr approved these changes Nov 23, 2022

cpp/src/text/subword/wordpiece_tokenizer.cu (outdated, resolved)

cpp/src/text/subword/wordpiece_tokenizer.cu (outdated, resolved)

davidwendt added 3 commits November 23, 2022 13:52

Merge branch 'branch-23.02' into fix-wpt-thrust-copy-if

75c2497

remove unneeded statement

508f885

Merge branch 'branch-23.02' into fix-wpt-thrust-copy-if

80b800b

rapids-bot bot merged commit 82b646e into rapidsai:branch-23.02 Nov 28, 2022

davidwendt deleted the fix-wpt-thrust-copy-if branch November 28, 2022 13:15
