* Improve performance in all tokenization modes (up to 2x faster)
* Fix missing space escaping within protected sequences in "none" and "space" tokenization modes
* Fix a regression introduced in 1.20 where `segment_alphabet_*` options behave differently on characters that appear in multiple Unicode scripts (e.g. some Japanese characters can belong to both Hiragana and Katakana scripts and should not trigger a segmentation)
* Fix a regression introduced in 1.21 where a joiner is incorrectly placed when using `preserve_segmented_tokens` and the word is segmented by both a `segment_*` option and BPE
* Fix incorrect tokenization when using `support_prior_joiners` and some joiners are within protected sequences
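Several of these fixes concern joiner placement. The joiner is the marker (`￭` by default in OpenNMT Tokenizer) attached to a token to record that it must be glued to its neighbor during detokenization, so a misplaced joiner produces a wrong reconstruction of the original text. As a simplified sketch of the concept only, not the library's implementation, joiner-aware detokenization can be illustrated like this:

```python
JOINER = "\uffed"  # "￭", the default joiner marker in OpenNMT Tokenizer

def detokenize(tokens):
    """Rejoin tokens, inserting a space only between tokens
    that carry no joiner on the adjoining side."""
    out = []
    glue_next = True  # never insert a space before the first token
    for tok in tokens:
        glue_left = tok.startswith(JOINER)
        glue_right = tok.endswith(JOINER)
        core = tok.strip(JOINER)  # drop joiner markers on both ends
        if not (glue_next or glue_left):
            out.append(" ")
        out.append(core)
        glue_next = glue_right
    return "".join(out)

print(detokenize(["Hello", "World", "￭!"]))   # -> Hello World!
print(detokenize(["seg", "￭ment", "￭ed"]))    # -> segmented
```

This shows why the fixes above matter: if a joiner is incorrectly placed (or dropped) when a word is split by a `segment_*` option or by BPE, the detokenized output gains or loses a space relative to the input.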