Improve pinyin fuzzy segmentation algorithm
Previously, we blindly chose the segmentation whose next match was
longer. This proved wrong in the case of "sangeren".

That input should produce "san ge ren", "sang er en", and "sang e ren".
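
To see where those three readings come from, here is a minimal sketch that
exhaustively enumerates the splits of an input over a small syllable table.
The table `kSyllables` and the helpers `enumerate` and `segmentations` are
hypothetical names for illustration; this is a brute-force enumeration over a
toy subset of pinyin, not the graph-based parser in libime.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy syllable table: a hypothetical subset, not libime's pinyin map.
const std::set<std::string> kSyllables = {"san", "sang", "ge", "e",
                                          "er",  "en",   "ren"};

// Collect every way to split input[pos..] into syllables from the table.
// `current` holds the syllables chosen so far; finished splits go to `out`.
void enumerate(const std::string &input, std::size_t pos,
               std::vector<std::string> &current,
               std::vector<std::string> &out) {
    assert(pos <= input.size());
    if (pos == input.size()) {
        // Whole input consumed: join the chosen syllables with spaces.
        std::string joined;
        for (const auto &s : current) {
            if (!joined.empty()) {
                joined += ' ';
            }
            joined += s;
        }
        out.push_back(joined);
        return;
    }
    // Try every prefix length, shortest first, and recurse on the rest.
    for (std::size_t len = 1; pos + len <= input.size(); ++len) {
        const std::string piece = input.substr(pos, len);
        if (kSyllables.count(piece) != 0) {
            current.push_back(piece);
            enumerate(input, pos + len, current, out);
            current.pop_back();
        }
    }
}

std::vector<std::string> segmentations(const std::string &input) {
    std::vector<std::string> current;
    std::vector<std::string> out;
    enumerate(input, 0, current, out);
    return out;
}
```

With this toy table, `segmentations("sangeren")` yields "san ge ren",
"sang e ren", and "sang er en". A parser that always prefers one branch at
"san" versus "sang" necessarily loses at least one of them.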

Instead, we change the check to be:
if (current + next match) is valid and the next match is a complete
pinyin, make it an acceptable option, unless (current, next match) is
actually an inner fuzzy split, which is handled separately below.

For example:
1. For "sangeren", both "sang" and "san" are produced, since the next
match after "san", which is "ge", is a complete pinyin.
2. For "hua", only "hua" is produced, since "hu a" is an inner fuzzy split.
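
The two examples can be traced through the tuple comparison with a small
sketch. `MatchInfo`, `SplitDecision`, and `decide` are hypothetical names, and
the fields only model what the comparison reads from a longest-match result.
The `keepShort` condition (`compare <= compareAlt`) is an assumption: the diff
is cut off after the first `if`, so it is inferred from the statement that
both "sang" and "san" are produced when the tuples tie.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Hypothetical stand-in for a longest-match result.
struct MatchInfo {
    bool valid;            // the match is at least some pinyin
    std::size_t size;      // number of characters matched
    bool isCompletePinyin; // the match is a full syllable, not a partial one
};

struct SplitDecision {
    bool keepLong;  // keep the split after the full current syllable
    bool keepShort; // keep the split one character earlier (assumed branch)
};

// lhsSize is the length of the current syllable (e.g. 4 for "sang").
// next is the match following the full syllable; nextAlt is the match that
// starts one character earlier (e.g. "ge" after "san").
SplitDecision decide(std::size_t lhsSize, const MatchInfo &next,
                     const MatchInfo &nextAlt) {
    assert(lhsSize > 0);
    const std::size_t matchSizeAlt = lhsSize - 1 + nextAlt.size;
    // std::tuple compares lexicographically: validity first, then whether
    // the split covers more than the current syllable, then completeness.
    const std::tuple<bool, bool, bool> compare(next.valid, true,
                                               next.isCompletePinyin);
    const std::tuple<bool, bool, bool> compareAlt(
        nextAlt.valid, matchSizeAlt > lhsSize, nextAlt.isCompletePinyin);
    return {compare >= compareAlt, compare <= compareAlt};
}
```

For "sangeren", `decide(4, {true, 2, true}, {true, 2, true})` models "sang" +
"er" against "san" + "ge": the alternative covers 5 characters, the tuples
tie, and both splits are kept. For a "hua"-like case, "hu" + "a" covers only
the same 3 characters as "hua", so the middle element is false and only the
longer syllable is kept by this check.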

This may produce "extra" segments. For example, "sanger" yields
"san" "ge" "r", where "r" is a partial pinyin. That is still
reasonable, since a partial pinyin match is considered fuzzy and
carries a penalty score.

Users may even benefit from such a segmentation, since "san ge r" seems
to be the most likely reading.

Fix #87
wengxt committed Dec 6, 2024
1 parent 118dc5f commit 1a3b857
Showing 2 changed files with 9 additions and 7 deletions.
14 changes: 7 additions & 7 deletions src/libime/pinyin/pinyinencoder.cpp
@@ -233,20 +233,20 @@ PinyinEncoder::parseUserPinyin(std::string userPinyin,
                               fuzzyFlags, pinyinMap);
 auto nextMatchAlt = longestMatch(iter + str.size() - 1, end,
                                  fuzzyFlags, pinyinMap);
-auto matchSize = str.size() + nextMatch.match.size();
 auto matchSizeAlt =
     str.size() - 1 + nextMatchAlt.match.size();

-// comparator is (validPinyin, wholeMatchSize,
+// comparator is (validPinyin, whole size>= lhs pinyin,
 // isCompletePinyin) validPinyin means it's at least some
 // pinyin, instead of things startsWith i,u,v. Since
 // longestMatch will now treat string startsWith iuv a whole
 // segment, we need to compare validity before the length.
-// Always prefer longer match and complete pinyin match.
-std::tuple<bool, size_t, bool> compare(
-    nextMatch.valid, matchSize, nextMatch.isCompletePinyin);
-std::tuple<bool, size_t, bool> compareAlt(
-    nextMatchAlt.valid, matchSizeAlt,
+// If whole size is equal to lhs pinyin, then it should be
+// handled by inner segement flag.
+std::tuple<bool, bool, bool> compare(
+    nextMatch.valid, true, nextMatch.isCompletePinyin);
+std::tuple<bool, bool, bool> compareAlt(
+    nextMatchAlt.valid, matchSizeAlt > str.size(),
     nextMatchAlt.isCompletePinyin);

 if (compare >= compareAlt) {
2 changes: 2 additions & 0 deletions test/testpinyinencoder.cpp
@@ -233,6 +233,8 @@ int main() {
     check("zhuna", PinyinFuzzyFlag::Inner, {"zhu", "na"});
     check("zhuna", PinyinFuzzyFlag::Inner, {"zhun", "a"});

+    check("sangeren", PinyinFuzzyFlag::Inner, {"san", "ge", "ren"});
+
     {
         PinyinCorrectionProfile profile(BuiltinPinyinCorrectionProfile::Qwerty);
         auto graph = PinyinEncoder::parseUserPinyin(