Ensure exceptions during CatchUpHook are properly handled #4908

mhsdesign · 2024-02-21T17:40:41Z

Currently any faulty catchup implementation can prevent a replay.
As they are rather optional, maybe the should not crash the whole system?

flow cr:replayAll       
Replaying events for all projections of Content Repository "default" ...
The node type "Foo:Document.Starship" is not available

  Type: Neos\ContentRepository\Core\SharedModel\Exception\NodeTypeNotFoundException
  Code: 1316598370
  File: Packages/Neos/Neos.ContentRepository.Core/Classes/NodeType/NodeTypeManager.
        php
  Line: 126

The text was updated successfully, but these errors were encountered:

bwaidelich · 2024-04-15T15:27:35Z

This can hopefully be simplified a lot with #4992

mhsdesign · 2024-12-09T15:46:57Z

While working on the subscription engine #5321 we thought about multiple ways of reacting to errors.

One implementation was that errors should rollback the projection and put it into error state and the hooks "around" the catchup will do a full rollback (the behaviour of onBeforeEvent and onAfterEvent should equal that to the current 9.0 state of using checkpoints as its in the same transaction)

onBeforeCatchUp: Its important that no errors are thrown, as they will cause the catchup to directly halt with a {@see CatchUpFailed} exception.
onBeforeEvent: Any errors will cause the transaction being rolled back, and the projection goes into {@see SubscriptionStatus::ERROR} state.
onAfterEvent: Any errors will cause the transaction being rolled back, and the projection goes into {@see SubscriptionStatus::ERROR} state.
onAfterCatchUp: The full transaction is rolled back, no projections or status or positions are updated

The idea was that for a failed catchup hook we will try to make it look like that the event was never processed. The other chained hooks would not get to react but that would be fine as after fixing the hook and running reactivateSubscription all hooks are run again. Also the position is rolled back in the database in the onAfterEvent case so that when fixing the catchup hook the event is not applied twice which likely causes the projection to fail then.

The exact behaviour can be read from this test https://github.com/neos/neos-development-collection/blob/caa70bf4a33486e28aa7e2e302a8799ce263f53c/Neos.ContentRepository.BehavioralTests/Tests/Functional/Subscription/CatchUpHookErrorTest.php

But the behaviour to have the hooks like onAfterEvent that tightly around the projection so that errors are interpreted as projection error and thus putting the projection into error state, setting the position to the previous event and rolling it back was the reason for a lot of complexity and the necessity of save-points to ensure exactly once delivery.

Ultimately we got to the agreement pretty fast that onAfterCatchUp should be responsible for the main work of the catchup hoo once the projections is already updated and persisted.
That lead to the rollback concept to brittle. We dont expect the tight hooks (onAfterCatchUp) to do more critical work, that the onAfterCatchUp hook - ideally onAfterCatchUp should only gather data and not fail during it.

So implementing a sophisticated rollback mechanism here - which only works if the catchup uses the same dbal instance (and for that matter a database at all - emails might already be sent) - is overkill. We cant prevent the onAfterCatchUp hook to fail and doing a rollback here would be silly and impossible with batching (as the transaction is already committed) - so we dont want it for the less critical hooks either.

The solution is to collect the errors - ignore them for now - and return them via the ProcessResult. The DelegatingCatchupHook introduces a little bit of complexity here because we still want to ensure that other hooks in the chain are run while one hook fails, but that is easily possible by running all chained hooks and then throwing an error with all sub errors attached if necessary (implementation).

Collecting errors also fixes two more things:
For one we dont absolutely need save points for exactly once delivery in projections as this case is much less likely to happen now - projections only get into error state if they fail themselves.
Also since catchup hooks are now kindof just ignored in error case the projection is up to date and the editor can continue to work. Each change will lead to an exception that is rethrown via ContentRepository::handle but that will not be fatal.

That leaves us with the following error definition:

onBeforeCatchUp: Note that any errors thrown will be collected and the catchup will start as normal.
The collect errors will be returned and rethrown by the content repository.
onBeforeEvent: Note that any errors thrown will be collected and the catchup will continue as normal.
The collect errors will be returned and rethrown by the content repository.
onAfterEvent: Note that any errors thrown will be collected and the catchup will continue as normal.
The collect errors will be returned and rethrown by the content repository.
onAfterCatchUp: The projection and their new status and position are already persisted.
Note that any errors thrown will be collected and the catchup will finish as normal.
The collect errors will be returned and rethrown by the content repository.

The exact behaviour can once again be read from this test: https://github.com/neos/neos-development-collection/blob/6b59fdbfe0d324b035935c1d2d051e223cd95ae3/Neos.ContentRepository.BehavioralTests/Tests/Functional/Subscription/CatchUpHookErrorTest.php

Not stopping the full catchup - during a replay - and possibly collecting thousands of errors is the only downside.
The solution is that we implemented batching, which is also needed to split up the transaction into smaller chunks.
Now with batching we decided that a new batch will only be started if there havent been any errors. This will allow the replay case to stop in a reasonable amount of time - after "only" say 500 events.

Further a new hook will be introduced which is explained as follows:

onAfterBatchCompleted: The projection and their new status and position are already persisted.
Note that any errors thrown will be collected but no further batch is started.
The collect errors will be returned and rethrown by the content repository.

All other hooks can be described like

Note that any errors thrown will be collected and the current catchup batch will be finished as normal.

mhsdesign added this to Neos 9.0 Release Board Mar 24, 2024

mhsdesign added Bug 9.0 labels Mar 24, 2024

bwaidelich mentioned this issue Apr 15, 2024

Replace CatchUpHooks with ContentRepositoryHooks #4992

Closed

mhsdesign mentioned this issue Oct 16, 2024

Split-up "Rework CR CatchUp mechanism" #5285

Open

3 tasks

mhsdesign self-assigned this Nov 26, 2024

This was referenced Dec 9, 2024

FEATURE: Subscription Engine #5321

Open

TASK: Overhaul CatchUpHook error behaviour; At least once delivery for ERROR projections #5392

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure exceptions during CatchUpHook are properly handled #4908

Ensure exceptions during CatchUpHook are properly handled #4908

mhsdesign commented Feb 21, 2024

bwaidelich commented Apr 15, 2024

mhsdesign commented Dec 9, 2024 •

edited

Loading

Ensure exceptions during CatchUpHook are properly handled #4908

Ensure exceptions during CatchUpHook are properly handled #4908

Comments

mhsdesign commented Feb 21, 2024

bwaidelich commented Apr 15, 2024

mhsdesign commented Dec 9, 2024 • edited Loading

mhsdesign commented Dec 9, 2024 •

edited

Loading