I know this is for Japanese, but it would be nice if some romaji words were tokenized consistently.
The string "hello golf2" is tokenized into:
hello
golf
2
which is fine. But when I tokenize "golf2 hello" using com.atilika.kuromoji.unidic.Tokenizer (the same happens with unidic.kanaaccent, but not with the other tokenizers), I get:
g
o
l
f
2
hello
It would be nice if the second case were tokenized like the first. In the meantime, I might handle this with a user dictionary.
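The user-dictionary workaround mentioned above could look roughly like this. Kuromoji's user dictionary is a CSV file where each line gives the surface form, its segmentation, the reading, and a part-of-speech label; the reading and POS tag below are illustrative assumptions, not tested values:

```
golf2,golf2,ゴルフツー,カスタム名詞
```

If I read the 0.9.0 API correctly, the dictionary would then be attached through the builder, e.g. new Tokenizer.Builder().userDictionary("userdict.txt").build(), so that "golf2" comes out as a single token regardless of where it appears in the input.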
Tested with version 0.9.0.