-
-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(common/core/web): error from early fat-finger termination due to OS interruptions #5479
Conversation
if(alternates.length == 0) { | ||
alternates = undefined; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The true core of the fix. This PR's other changes are simply to promote robustness at other levels, just in case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(From our slack discussion)
- I'm not entirely clear from the PR description what the root cause of the problem that we are fixing?
- How do we get into the situation where alternates.length == 0?
- Why is a zero length alternates array a problem?
- Are there other code paths where we might need to do the same thing as you are doing here, or is the extra test in languageProcessor.ts that you've made in this PR sufficient to cover all cases? In which case, is this even needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The root cause: if the OS pauses the thread evaluating the script between the time we start our timer (when we save TIMEOUT_THRESHOLD
) and the first check against the timer, the JS thread could be "out of time" when the OS allows it to resume. So, from the JS thread's perspective... before a single iteration of the loop may occur, with near-zero thread-execution time.
Should that occur, we will have initialized the alternates
array, but never been able to put anything within it. Hence, alternates.length == 0
.
Why would that be a problem? Because that isn't possible to optimally handle within the lm-layer's worker thread. Other changes in this PR can at least prevent a crash, but it'll rob the predictive-text engine of all information about the current keystroke, resulting in poorer suggestions. Preventing it here allows us to at least preserve info about the actually-typed keystroke.
This is the only place where the .alternates
property is set. Even then, the extra test there serves as a stop-gap to perform a similar prevention & fix check for any other paths that may exist in the future. (It's one of those "extra robustness checks" I referred to above.)
While we could just rely on the languageProcessor.ts check... I'd still like to handle it at the most likely 'source' of the error - the OS-interrupted timer.
While there are other possible ways to have a zero-length array without timing out early b/c of the OS, they're pretty niche and unlikely. For the sil_euro_latin
keyboard, they're also impossible. (Requires missing rules and/or rules with BEEP
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work on tracking this elusive little bug down! A few questions for my understanding...
if(alternates.length == 0) { | ||
alternates = undefined; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(From our slack discussion)
- I'm not entirely clear from the PR description what the root cause of the problem that we are fixing?
- How do we get into the situation where alternates.length == 0?
- Why is a zero length alternates array a problem?
- Are there other code paths where we might need to do the same thing as you are doing here, or is the extra test in languageProcessor.ts that you've made in this PR sufficient to cover all cases? In which case, is this even needed?
let transform = transcription.transform; | ||
var promise = this.currentPromise = this.lmEngine.predict(transcription.alternates || transcription.transform, context); | ||
var promise = this.currentPromise = this.lmEngine.predict(alternates || transcription.transform, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternates
will never be falsy per L326-332 above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pending user testing
I, uh... since there's no easy repro for the original error, it's not super-clear exactly how to test this. Guess it's mostly a "make sure @jahorton didn't break anything"? |
Title changed to be slightly more user-readable. I mean, it's still not perfect, but at least it's a degree less technical. |
Yeah, given I assume we will be porting this back to 14.0-stable, good idea just to do an acceptance test right? It is touching pretty core keystroke processing... I don't foresee any issues but then we never do, do we? 🤣 |
In that case... maaaybe we delay to do a round of it whenever we feel like pushing up a new version? Acceptance testing for web is quite costly. Though, given the nature of this specific case... we should be able to drop the cost by running it against a single platform. There's nothing platform-specific here. |
Co-authored-by: Marc Durdin <[email protected]>
I think we want to push this out fairly soon given we are seeing many crash reports stemming from it.
Agreed. Chrome on something? |
User Testing@keymanapp/testers Platform: Chrome emulation / AndroidHow the test was done:
Utilize the "Test unminified Keymanweb" testing page to ensure the following:
Utilize the "Prediction - robust testing" testing page for the following:
|
@MakaraSok - Clarifications needed: TEST_5: Your note there is not part of the test. That test PASSED. Those concerns may be valid, but that's for something completely separate and unrelated to this PR. (No part of the test said to attempt loading a lexical model with the keyboard.) Secondly... that's a Keyman for Android screenshot. Not exactly Chrome emulation of an Android device (as in, outside of the Keyman app), but I'll let that slide - I'm not super-particular about the test environment. TEST_7: Just to confirm - no Spanish keyboard popped up after attempting to add it using the third keyboard-adding option on the page, "Add a keyboard by language name(s)"? It won't be present by default, hence why it's in the "attempt to add" section of the tests. |
Finally, TEST_CRITICAL: that's the "Issues" area, not the "Console" area. I did make sure to point out exactly which area I meant in that screenshot above. Those "issues" are fine and do not fail this test. As there's only a yellow entry and no red entries in the actual main console area, this test also PASSES. |
Good to know; I'll make note of this for the Web acceptance-testing instructions, which is where those items were sourced from.
That's fair; I've noticed the items on the "issues" tab before and totally understand that. Unfortunately, they're unrelated, previously-noted, and not easy to resolve at this time. I actually tried pretty hard to have you ignore the "Issues" tab, but apparently didn't succeed. |
Changes in this pull request will be available for download in Keyman version 15.0.89-alpha |
Fixes #5467.
No wonder it was an elusive little bugger. I knew something smelled like a race condition as I started investigating... turns out it's probably a race condition caused by the interaction of standard OS context-swapping (or garbage collection) with that of a timer based on the system clock. Hence why a direct repro is nigh-impossible to find or construct.
I should note that some of the Sentry issues do indicate occasions where they arose before 14.0.277, which was when #5352 landed. That said, the lion's share of the reports for the tagged issues are indeed since that release, which is why it's become as prominent now as it is.
So, while it's not exactly perfect to just give up immediately once control returns to the engine... at least we've noticed that the correction algorithm itself is fairly capable on its own. So, to keep things responsive, even though the delay is not in our algorithm but rather due to external pressure from the OS or the browser, we'll simply maintain the current logic, which thus bypasses fat-finger calculations as a result of the externally-caused delay.
User Testing
@keymanapp/testers
Much of this is taken from standard KMW acceptance testing and adapted toward the aspects of KMW modified by this PR, though we'll only worry about testing against a single platform here.
Platform: Chrome emulation / Android
Ensure that you see the "Console" tab while performing these tests.
Console tab:
It should resemble Windows'
CMD
prompt and the macOS Terminal.Utilize the "Test unminified Keymanweb" testing page to ensure the following:
sil_ipa
. (For our currently-deprecated IPA keyboard.)ŋ
should be a subkey ofn
- ensure it works.khmer_angkor
keyboard, you'll want to load that one by keyboard name for these tests.spanish
.Utilize the "Prediction - robust testing" testing page for the following:
.
to output'
and then presse
. The two characters should not combine.p
and SHIFT + long-pressg
displays properly and that the subkeys produce the expected output.L
(shift layer)k
(default layer)v
(default layer)Love
andLive
, along with one other random suggestion.