I know this is for Japanese, but it would be nice if some romaji words were tokenized consistently.
The string "hello golf2" is tokenized into:
hello
golf
2
which is fine. But when I tokenize "golf2 hello" using com.atilika.kuromoji.unidic.Tokenizer (the same happens with unidic.kanaaccent, but not with the other tokenizers), I get:
g
o
l
f
2
hello
It would be nice if the second case were tokenized like the first. In the meantime, I might handle this with a user dictionary.
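The user-dictionary workaround mentioned above could look roughly like this. Kuromoji's user dictionary is a CSV file where each line gives the surface form, its segmentation, the reading, and a part-of-speech label; the reading and POS tag below are illustrative assumptions, not tested values:

```
golf2,golf2,ゴルフツー,カスタム名詞
```

If I read the 0.9.0 API correctly, the dictionary would then be attached through the builder, e.g. new Tokenizer.Builder().userDictionary("userdict.txt").build(), so that "golf2" comes out as a single token regardless of where it appears in the input.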
Tested with version 0.9.0.