fix(common/core/web): error from early fat-finger termination due to OS interruptions #5479

jahorton · 2021-07-22T01:35:11Z

No wonder it was an elusive little bugger. I knew something smelled like a race condition as I started investigating... turns out it's probably a race condition caused by standard OS context-swapping (or garbage collection) interacting with a timer based on the system clock. That's why a direct repro is nigh-impossible to find or construct.

I should note that some of the Sentry issues do indicate occasions where they arose before 14.0.277, which was when #5352 landed. That said, the lion's share of the reports for the tagged issues are indeed since that release, which is why it's become as prominent now as it is.

So, while it's not exactly ideal to just give up immediately once control returns to the engine... we've at least noticed that the correction algorithm is fairly capable on its own. To keep things responsive, we'll simply maintain the current logic, even though the delay comes from external pressure (the OS or the browser) rather than from our algorithm. As a result, fat-finger calculations are bypassed for any keystroke that hits such an externally-caused delay.

User Testing

@keymanapp/testers

Much of this is taken from standard KMW acceptance testing and adapted toward the aspects of KMW modified by this PR, though we'll only worry about testing against a single platform here.

Platform: Chrome emulation / Android

Ensure that you see the "Console" tab while performing these tests.

TEST_CRITICAL: If any errors (red text entries in the "Console" area) or warnings appear during these tests, fail this test entry and report a screenshot of them.

Console tab:

It should resemble Windows' CMD prompt and the macOS Terminal.

Utilize the "Test unminified Keymanweb" testing page to ensure the following:

TEST_1: KMW properly handles input to the controls.
Attempt to add the following keyboards:
- TEST_2: By keyboard name: sil_ipa. (For our currently-deprecated IPA keyboard.)
  - TEST_3: ŋ should be a subkey of n - ensure it works.
- TEST_4: By language code: "km".
  - Note that if "km" doesn't return the khmer_angkor keyboard, you'll want to load that one by keyboard name for these tests.
  - TEST_5: Type in the following sequence: ស, ុ, ្រ (subkey of រ), ក. You should see ស្រុក.
  - TEST_6: Continuing from the last test, hit backspace in sequence. As this is a reorder (keyboard) rule test, you should see:
    - ស្រុ
    - ស្រ
    - ស
- TEST_7: By language name: spanish.

Utilize the "Prediction - robust testing" testing page for the following:

Swap to the "English - EuroLatin (SIL)" keyboard.
- TEST_8: Use long-press . to output ' and then press e. The two characters should not combine.
- TEST_9: Ensure that long-press p and SHIFT + long-press g displays properly and that the subkeys produce the expected output.
- TEST_10: Type the following sequence, pressing near the center of the key each time:
  - L (shift layer)
  - k (default layer)
  - v (default layer)
  - Expected result: you should see suggestions for Love and Live, along with one other random suggestion.

jahorton · 2021-07-22T01:37:03Z

common/core/web/input-processor/src/text/inputProcessor.ts

+                if(alternates.length == 0) {
+                  alternates = undefined;
+                }


The true core of the fix. This PR's other changes are simply to promote robustness at other levels, just in case.

(From our Slack discussion)

I'm not entirely clear from the PR description what the root cause of the problem we're fixing is.

How do we get into the situation where alternates.length == 0?

Why is a zero length alternates array a problem?

Are there other code paths where we might need to do the same thing as you are doing here, or is the extra test in languageProcessor.ts that you've made in this PR sufficient to cover all cases? In which case, is this even needed?

The root cause: if the OS pauses the thread evaluating the script between the time we start our timer (when we save TIMEOUT_THRESHOLD) and the first check against the timer, the JS thread could be "out of time" when the OS allows it to resume. So, from the JS thread's perspective, the timer expires before a single iteration of the loop can occur, despite near-zero thread-execution time.

Should that occur, we will have initialized the alternates array, but never been able to put anything within it. Hence, alternates.length == 0.
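A minimal sketch of that sequence, under stated assumptions: `collectAlternates`, the `Alternate` shape, the injectable `Clock`, and the 33 ms budget are all hypothetical and are not taken from inputProcessor.ts. Only the saved threshold and the empty-array check come from this PR. The simulated clock jumps forward after its first reading, standing in for an OS pause between starting the timer and the first timer check.

```typescript
// Hypothetical shape for a fat-finger alternate; the real Keyman types differ.
interface Alternate {
  sample: string; // output of the candidate key
  p: number;      // probability assigned to it
}

type Clock = () => number; // returns "current time" in ms; injectable for the demo

function collectAlternates(
  candidates: Alternate[],
  budgetMs: number,
  now: Clock
): Alternate[] | undefined {
  // The timer starts here, when the threshold is saved.
  const timeoutThreshold = now() + budgetMs;
  const alternates: Alternate[] = [];

  for (const candidate of candidates) {
    // First timer check. If the thread was suspended since the line above,
    // this can already be true on the very first pass.
    if (now() >= timeoutThreshold) {
      break;
    }
    alternates.push(candidate);
  }

  // Same idea as the check added in this PR: never hand on an empty array.
  return alternates.length == 0 ? undefined : alternates;
}

// Clock that reads 0 once, then 50: simulates a 50 ms pause after the timer starts.
function stalledClock(): Clock {
  let reads = 0;
  return () => (reads++ === 0 ? 0 : 50);
}

const candidates: Alternate[] = [
  { sample: 'k', p: 0.7 },
  { sample: 'l', p: 0.3 }
];

const interrupted = collectAlternates(candidates, 33, stalledClock());
const uninterrupted = collectAlternates(candidates, 33, () => 0);

console.log(interrupted);           // undefined
console.log(uninterrupted?.length); // 2
```

With the stalled clock, the loop exits before evaluating any candidate, which is the zero-iteration case described above.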

Why would that be a problem? Because an empty array can't be handled optimally within the lm-layer's worker thread. Other changes in this PR can at least prevent a crash, but an empty array still robs the predictive-text engine of all information about the current keystroke, resulting in poorer suggestions. Preventing it here allows us to at least preserve info about the actually-typed keystroke.

This is the only place where the .alternates property is set. Even so, the extra test in languageProcessor.ts serves as a stop-gap, performing a similar prevention & fix check for any other paths that may exist in the future. (It's one of those "extra robustness checks" I referred to above.)
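To illustrate what a consumer-side stop-gap of this kind could look like, here is a hedged sketch. The function name `toPredictionInput` and the simplified `Transform` / `Distribution` types are assumptions made for illustration; this is not the code in languageProcessor.ts or the worker. It falls back to the actually-typed keystroke whenever the alternates are missing or empty.

```typescript
// Simplified, hypothetical types for illustration only.
interface Transform {
  insert: string;     // text to insert
  deleteLeft: number; // characters to delete before the caret
}

interface ProbabilityMass<T> {
  sample: T;
  p: number;
}

type Distribution<T> = ProbabilityMass<T>[];

// Returns a non-empty distribution: the given alternates when usable,
// otherwise a single entry for the keystroke that was actually typed.
function toPredictionInput(
  alternates: Distribution<Transform> | null | undefined,
  transform: Transform
): Distribution<Transform> {
  if (!alternates || alternates.length === 0) {
    return [{ sample: transform, p: 1.0 }];
  }
  return alternates;
}

const typed: Transform = { insert: 'k', deleteLeft: 0 };
const other: Transform = { insert: 'l', deleteLeft: 0 };

const fromEmpty = toPredictionInput([], typed);
const fromMissing = toPredictionInput(undefined, typed);
const passthrough = toPredictionInput(
  [{ sample: typed, p: 0.6 }, { sample: other, p: 0.4 }],
  typed
);

console.log(fromEmpty.length, fromMissing.length, passthrough.length); // 1 1 2
```

The design point is that the guard costs one length check per keystroke and keeps the typed key available to the engine even if some future code path produces an empty array.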

While we could just rely on the languageProcessor.ts check... I'd still like to handle it at the most likely 'source' of the error - the OS-interrupted timer.

While there are other possible ways to have a zero-length array without timing out early b/c of the OS, they're pretty niche and unlikely. For the sil_euro_latin keyboard, they're also impossible. (Requires missing rules and/or rules with BEEP.)

mcdurdin

Good work on tracking this elusive little bug down! A few questions for my understanding...

mcdurdin · 2021-07-22T02:15:15Z

common/core/web/input-processor/src/text/inputProcessor.ts

+                if(alternates.length == 0) {
+                  alternates = undefined;
+                }


(From our Slack discussion)

I'm not entirely clear from the PR description what the root cause of the problem we're fixing is.

How do we get into the situation where alternates.length == 0?

Why is a zero length alternates array a problem?

Are there other code paths where we might need to do the same thing as you are doing here, or is the extra test in languageProcessor.ts that you've made in this PR sufficient to cover all cases? In which case, is this even needed?

mcdurdin · 2021-07-22T02:15:32Z

common/core/web/input-processor/src/text/prediction/languageProcessor.ts

      let transform = transcription.transform;
-      var promise = this.currentPromise = this.lmEngine.predict(transcription.alternates || transcription.transform, context);
+      var promise = this.currentPromise = this.lmEngine.predict(alternates || transcription.transform, context);


alternates will never be falsy per L326-332 above?
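For background on the `||` in this diff: in JavaScript and TypeScript an empty array is truthy, so `||` alone falls back only when the left side is `undefined` or `null`, never when it is `[]`. A small illustration, with hypothetical helper names (`pickWithOr`, `pickWithLengthCheck`) that do not appear in the PR:

```typescript
// Falls back only when `alternates` is undefined/null; [] is truthy and wins.
function pickWithOr<T>(alternates: T[] | undefined, fallback: T): T[] | T {
  return alternates || fallback;
}

// Falls back when `alternates` is missing OR has no entries.
function pickWithLengthCheck<T>(alternates: T[] | undefined, fallback: T): T[] | T {
  return alternates && alternates.length > 0 ? alternates : fallback;
}

const empty: string[] = [];

const viaOr = pickWithOr(empty, 'typed-key');             // the empty array itself
const viaCheck = pickWithLengthCheck(empty, 'typed-key'); // 'typed-key'
const viaOrMissing = pickWithOr(undefined, 'typed-key');  // 'typed-key'

console.log(viaOr === empty, viaCheck, viaOrMissing);
```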

mcdurdin

LGTM pending user testing

jahorton · 2021-07-22T03:46:25Z

> LGTM pending user testing

I, uh... since there's no easy repro for the original error, it's not super-clear exactly how to test this. Guess it's mostly a "make sure @jahorton didn't break anything"?

jahorton · 2021-07-22T03:49:33Z

Title changed to be slightly more user-readable. I mean, it's still not perfect, but at least it's a degree less technical.

mcdurdin · 2021-07-22T03:51:19Z

> I, uh... since there's no easy repro for the original error, it's not super-clear exactly how to test this. Guess it's mostly a "make sure @jahorton didn't break anything"?

Yeah, given I assume we will be porting this back to 14.0-stable, good idea just to do an acceptance test right? It is touching pretty core keystroke processing... I don't foresee any issues but then we never do, do we? 🤣

jahorton · 2021-07-22T03:54:33Z

> > I, uh... since there's no easy repro for the original error, it's not super-clear exactly how to test this. Guess it's mostly a "make sure @jahorton didn't break anything"?
>
> Yeah, given I assume we will be porting this back to 14.0-stable, good idea just to do an acceptance test right? It is touching pretty core keystroke processing... I don't foresee any issues but then we never do, do we? 🤣

In that case... maaaybe we delay to do a round of it whenever we feel like pushing up a new version? Acceptance testing for web is quite costly.

Though, given the nature of this specific case... we should be able to drop the cost by running it against a single platform. There's nothing platform-specific here.

mcdurdin · 2021-07-22T04:51:51Z

> In that case... maaaybe we delay to do a round of it whenever we feel like pushing up a new version? Acceptance testing for web is quite costly.

I think we want to push this out fairly soon given we are seeing many crash reports stemming from it.

> Though, given the nature of this specific case... we should be able to drop the cost by running it against a single platform. There's nothing platform-specific here.

Agreed. Chrome on something?

MakaraSok · 2021-07-26T11:58:06Z

jahorton · 2021-07-26T22:47:55Z

@MakaraSok - Clarifications needed:

TEST_5: Your note there is not part of the test. That test PASSED. Those concerns may be valid, but that's for something completely separate and unrelated to this PR. (No part of the test said to attempt loading a lexical model with the keyboard.)

Secondly... that's a Keyman for Android screenshot. Not exactly Chrome emulation of an Android device (as in, outside of the Keyman app), but I'll let that slide - I'm not super-particular about the test environment.

TEST_7: Just to confirm - no Spanish keyboard popped up after attempting to add it using the third keyboard-adding option on the page, "Add a keyboard by language name(s)"? It won't be present by default, hence why it's in the "attempt to add" section of the tests.

jahorton · 2021-07-26T22:52:08Z

Finally, TEST_CRITICAL: that's the "Issues" area, not the "Console" area. I did make sure to point out exactly which area I meant in that screenshot above. Those "issues" are fine and do not fail this test.

As there's only a yellow entry and no red entries in the actual main console area, this test also PASSES.

MakaraSok · 2021-07-27T01:10:00Z

Thanks a lot for the clarifications. TEST_5, TEST_7, and TEST_CRITICAL have all PASSED.

> TEST_5: Your note there is not part of the test. That test PASSED. Those concerns may be valid, but that's for something completely separate and unrelated to this PR. (No part of the test said to attempt loading a lexical model with the keyboard.)
>
> Secondly... that's a Keyman for Android screenshot. Not exactly Chrome emulation of an Android device (as in, outside of the Keyman app), but I'll let that slide - I'm not super-particular about the test environment.

I misread the instructions. My fault.
Understood. Gonna read more carefully next time.

> TEST_7: Just to confirm - no Spanish keyboard popped up after attempting to add it using the third keyboard-adding option on the page, "Add a keyboard by language name(s)"? It won't be present by default, hence why it's in the "attempt to add" section of the tests.

I now see the intention of the instructions vividly. The line should have read as: "try and add spanish to the 'Add a keyboard by language name(s)' field".

> Finally, TEST_CRITICAL: that's the "Issues" area, not the "Console" area. I did make sure to point out exactly which area I meant in that screenshot above. Those "issues" are fine and do not fail this test.
>
> As there's only a yellow entry and no red entries in the actual main console area, this test also PASSES.

The screenshot of the test result should have been focused on the tab:

The screenshot provided in the instructions was informative, but it shows only one issue, while there were a dozen during testing. That is why the details of what was seen were reported: they were thought to be "warnings of some kind" that may not be expected.

jahorton · 2021-07-27T01:30:23Z

> I now see the intention of the instructions vividly. The line should have read as: "try and add spanish to the 'Add a keyboard by language name(s)' field".

Good to know; I'll make note of this for the Web acceptance-testing instructions, which is where those items were sourced from.

> The screenshot provided in the instructions was informative, but it shows only one issue, while there were a dozen during testing. That is why the details of what was seen were reported: they were thought to be "warnings of some kind" that may not be expected.

That's fair; I've noticed the items on the "issues" tab before and totally understand that. Unfortunately, they're unrelated, previously-noted, and not easy to resolve at this time. I actually tried pretty hard to have you ignore the "Issues" tab, but apparently didn't succeed.

keyman-server · 2021-07-27T18:01:32Z

Changes in this pull request will be available for download in Keyman version 15.0.89-alpha

jahorton added 3 commits July 21, 2021 11:36

fix(common/core/web): adds empty-array check for alternates

b414455

fix(common/models): adds worker null guard for predict

52c3356

fix(common/core/web): fixes likely root of problem

1c8689b

jahorton added this to the A15S9 milestone Jul 22, 2021

github-actions bot added common/ common/core/ common/web/ fix web/ labels Jul 22, 2021

jahorton commented Jul 22, 2021

mcdurdin reviewed Jul 22, 2021

jahorton added 2 commits July 22, 2021 10:36

change(common/core/web): adjustments per review

fd940cb

change(common/core/web): removes conditional replacement of a now always-truthy

1125b36

mcdurdin reviewed Jul 22, 2021

common/predictive-text/worker/model-compositor.ts (resolved)

mcdurdin approved these changes Jul 22, 2021

jahorton changed the title ~~fix(common/core/web): empty alternates array due to OS context swapping~~ fix(common/core/web): error from early fat-finger termination due to OS interruptions Jul 22, 2021

docs(common/models): applies PR suggestion

9d35b26

Co-authored-by: Marc Durdin <[email protected]>

mcdurdin modified the milestones: A15S9, A15S10 Jul 26, 2021

jahorton mentioned this pull request Jul 27, 2021

fix(common/core/web): error from early fat-finger termination due to OS interruptions 🍒 #5491

Merged

11 tasks

jahorton merged commit fc1251b into master Jul 27, 2021

jahorton deleted the fix/common/core/web/empty-alternates branch July 27, 2021 01:42

mcdurdin added the has-user-test label Feb 12, 2022

jahorton mentioned this pull request Aug 4, 2022

fix(web): enhanced timer for prediction algorithm #7037

Merged
