Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure exceptions during CatchUpHook are properly handled #4908

Open
mhsdesign opened this issue Feb 21, 2024 · 2 comments · May be fixed by #5321
Open

Ensure exceptions during CatchUpHook are properly handled #4908

mhsdesign opened this issue Feb 21, 2024 · 2 comments · May be fixed by #5321
Assignees

Comments

@mhsdesign
Copy link
Member

Currently any faulty catchup implementation can prevent a replay.
As they are rather optional, maybe the should not crash the whole system?

flow cr:replayAll       
Replaying events for all projections of Content Repository "default" ...
The node type "Foo:Document.Starship" is not available

  Type: Neos\ContentRepository\Core\SharedModel\Exception\NodeTypeNotFoundException
  Code: 1316598370
  File: Packages/Neos/Neos.ContentRepository.Core/Classes/NodeType/NodeTypeManager.
        php
  Line: 126
@bwaidelich
Copy link
Member

This can hopefully be simplified a lot with #4992

@mhsdesign
Copy link
Member Author

mhsdesign commented Dec 9, 2024

While working on the subscription engine #5321 we thought about multiple ways of reacting to errors.

One implementation was that errors should rollback the projection and put it into error state and the hooks "around" the catchup will do a full rollback (the behaviour of onBeforeEvent and onAfterEvent should equal that to the current 9.0 state of using checkpoints as its in the same transaction)

  • onBeforeCatchUp: Its important that no errors are thrown, as they will cause the catchup to directly halt with a {@see CatchUpFailed} exception.
  • onBeforeEvent: Any errors will cause the transaction being rolled back, and the projection goes into {@see SubscriptionStatus::ERROR} state.
  • onAfterEvent: Any errors will cause the transaction being rolled back, and the projection goes into {@see SubscriptionStatus::ERROR} state.
  • onAfterCatchUp: The full transaction is rolled back, no projections or status or positions are updated

The idea was that for a failed catchup hook we will try to make it look like that the event was never processed. The other chained hooks would not get to react but that would be fine as after fixing the hook and running reactivateSubscription all hooks are run again. Also the position is rolled back in the database in the onAfterEvent case so that when fixing the catchup hook the event is not applied twice which likely causes the projection to fail then.

The exact behaviour can be read from this test https://github.com/neos/neos-development-collection/blob/caa70bf4a33486e28aa7e2e302a8799ce263f53c/Neos.ContentRepository.BehavioralTests/Tests/Functional/Subscription/CatchUpHookErrorTest.php


But the behaviour to have the hooks like onAfterEvent that tightly around the projection so that errors are interpreted as projection error and thus putting the projection into error state, setting the position to the previous event and rolling it back was the reason for a lot of complexity and the necessity of save-points to ensure exactly once delivery.

Ultimately we got to the agreement pretty fast that onAfterCatchUp should be responsible for the main work of the catchup hoo once the projections is already updated and persisted.
That lead to the rollback concept to brittle. We dont expect the tight hooks (onAfterCatchUp) to do more critical work, that the onAfterCatchUp hook - ideally onAfterCatchUp should only gather data and not fail during it.

So implementing a sophisticated rollback mechanism here - which only works if the catchup uses the same dbal instance (and for that matter a database at all - emails might already be sent) - is overkill. We cant prevent the onAfterCatchUp hook to fail and doing a rollback here would be silly and impossible with batching (as the transaction is already committed) - so we dont want it for the less critical hooks either.

The solution is to collect the errors - ignore them for now - and return them via the ProcessResult. The DelegatingCatchupHook introduces a little bit of complexity here because we still want to ensure that other hooks in the chain are run while one hook fails, but that is easily possible by running all chained hooks and then throwing an error with all sub errors attached if necessary (implementation).

Collecting errors also fixes two more things:
For one we dont absolutely need save points for exactly once delivery in projections as this case is much less likely to happen now - projections only get into error state if they fail themselves.
Also since catchup hooks are now kindof just ignored in error case the projection is up to date and the editor can continue to work. Each change will lead to an exception that is rethrown via ContentRepository::handle but that will not be fatal.

That leaves us with the following error definition:

  • onBeforeCatchUp: Note that any errors thrown will be collected and the catchup will start as normal.
    The collect errors will be returned and rethrown by the content repository.
  • onBeforeEvent: Note that any errors thrown will be collected and the catchup will continue as normal.
    The collect errors will be returned and rethrown by the content repository.
  • onAfterEvent: Note that any errors thrown will be collected and the catchup will continue as normal.
    The collect errors will be returned and rethrown by the content repository.
  • onAfterCatchUp: The projection and their new status and position are already persisted.
    Note that any errors thrown will be collected and the catchup will finish as normal.
    The collect errors will be returned and rethrown by the content repository.

The exact behaviour can once again be read from this test: https://github.com/neos/neos-development-collection/blob/6b59fdbfe0d324b035935c1d2d051e223cd95ae3/Neos.ContentRepository.BehavioralTests/Tests/Functional/Subscription/CatchUpHookErrorTest.php

Not stopping the full catchup - during a replay - and possibly collecting thousands of errors is the only downside.
The solution is that we implemented batching, which is also needed to split up the transaction into smaller chunks.
Now with batching we decided that a new batch will only be started if there havent been any errors. This will allow the replay case to stop in a reasonable amount of time - after "only" say 500 events.

Further a new hook will be introduced which is explained as follows:

  • onAfterBatchCompleted: The projection and their new status and position are already persisted.
    Note that any errors thrown will be collected but no further batch is started.
    The collect errors will be returned and rethrown by the content repository.

All other hooks can be described like

Note that any errors thrown will be collected and the current catchup batch will be finished as normal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In Progress 🚧
Development

Successfully merging a pull request may close this issue.

2 participants