Collect all test results and handle cancelled tests properly #58

sfc-gh-kmakino · 2021-10-06T23:41:05Z

This PR addresses 3 issues:

When multiple tests start simultaneously, started can exceed max_runs. In that case, the agent should wait for all started tests to complete rather than stop at max_runs and ignore the jobs that are still running.
When agents die, or are considered dead because they stopped heartbeating, we should ignore their cancelled tests, since we don't know the results.
When enough agents are already running to serve the available ensembles, the other agents should time out.
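
To illustrate the first point, here is a minimal sketch of a completion check that tolerates overshoot. It is not the code in this PR: `EnsembleCounters`, `may_start_new_run`, and `is_complete` are hypothetical names, and the counters are assumed to be plain integers read from the ensemble's state.

```python
from dataclasses import dataclass


@dataclass
class EnsembleCounters:
    """Hypothetical snapshot of an ensemble's counters (not Joshua's real schema)."""
    started: int
    ended: int
    max_runs: int


def may_start_new_run(c: EnsembleCounters) -> bool:
    # No new run may start once the run budget has been reached or exceeded.
    return c.started < c.max_runs


def is_complete(c: EnsembleCounters) -> bool:
    # The ensemble is complete only when the budget has been reached AND every
    # started run has reported, even if concurrent starts pushed started past
    # max_runs.
    return c.started >= c.max_runs and c.ended >= c.started
```

With started=12, ended=10, max_runs=10, this check reports the ensemble as not complete, so the two outstanding results are still collected.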

ammolitor · 2021-10-07T16:05:31Z

Given that the unit tests failed (in the GitHub Actions build), I think this needs another look.

sfc-gh-kmakino · 2021-10-07T16:23:16Z

@ammolitor Can you tell what failed? build.sh works totally fine locally here.

sfc-gh-kmakino · 2021-10-08T06:30:46Z

@ammolitor CI passed. It would be great if you could take another quick look. Thanks!

sfc-gh-kmakino · 2021-10-13T15:52:34Z

I just realized the scaler needs to be aware of this change. Converting this to a draft for now.

sfc-gh-anoyes · 2021-11-03T23:49:54Z

joshua/joshua_model.py

+    # TODO(qhoang) let's try this but there must be a better way
+    # When an agent is cancelled, it has already incremented the __started__ counter
+    # but will never get to increment the __ended__ counter
+    _decrement(tr, ensemble_id, "started")


This needs to be idempotent, see https://apple.github.io/foundationdb/developer-guide.html#transactions-with-unknown-results
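
One way to get there, as a rough sketch only: record a per-run marker in the same transaction as the decrement, so a retry after an unknown commit result finds the marker and skips the decrement. The example below uses a plain dict as a stand-in for the key-value store, and `cancel_run_once`, `run_id`, and the key layout are all hypothetical, not Joshua's actual schema or the FoundationDB API.

```python
def cancel_run_once(store: dict, ensemble_id: str, run_id: str) -> bool:
    """Decrement the started counter at most once per cancelled run.

    `store` is an in-memory stand-in for the database. In a real transaction,
    the marker check, the marker write, and the decrement would all happen
    inside the same transaction so they commit or fail together.
    """
    marker = ("cancelled", ensemble_id, run_id)  # hypothetical key layout
    if marker in store:
        # A previous attempt already committed; retrying must be a no-op.
        return False
    store[marker] = True
    counter = ("count", ensemble_id, "started")  # hypothetical key layout
    store[counter] = store.get(counter, 0) - 1
    return True
```

Calling it twice for the same run leaves the counter decremented once, which is the property a retry loop needs when the first commit's outcome is unknown. The same marker would also cover several agents racing to take over the same dead agent's run.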

sfc-gh-kmakino requested a review from ammolitor October 6, 2021 23:41

sfc-gh-kmakino added 4 commits October 7, 2021 19:12

Always wait for the result once started

2038cbb

Ignore cancelled tests

bd7f54e

Do not record cancelled tests

c01defe

Timeout agent if no ensembles to run

435b390

sfc-gh-kmakino force-pushed the kaomakino/overshoot2 branch from 9c0aa5f to 435b390 Compare October 8, 2021 02:13

Fix dead agent test

130b644

sfc-gh-kmakino marked this pull request as draft October 13, 2021 15:52

sfc-gh-kmakino and others added 3 commits October 29, 2021 14:17

Merge branch 'FoundationDB:main' into kaomakino/overshoot2

f4aa838

fixed concurrency bug where _decrement is called multiple times by agents attempting to take over a dead agent's run

fae6f0d

attempt to fix started counter

5ed130b

sfc-gh-anoyes reviewed Nov 3, 2021
