Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEATURE: Subscription Engine #5321

Open
wants to merge 158 commits into
base: 9.0
Choose a base branch
from

Conversation

bwaidelich
Copy link
Member

@bwaidelich bwaidelich commented Oct 24, 2024

Related: #4746
Resolves: #4152
Resolves: #4908

migration to migrate the checkpoints to subscribers

./flow migrateevents:migrateCheckpointsToSubscriptions

TODO

  • SubscriptionEngine::reset() (i.e. projection replay)
  • Tests
    • spying testing catchup is correctly applied for example
    • parallel tests to ensure subscription locking works
  • Fine tuning
  • inline docs
    • subscription status docs
  • Credit patchlevel (are inline doc comments for all classes underneath Neos\ContentRepository\Core\Subscription enough?)
  • Optional: Allow to target specific groups/subscriptions via CLI commands (the SubscriptionEngine already supports this)
  • Postgres support for the engine -> remove platform options, simple string fields.
  • use DateTimeImmutable::ATOM for date formatting instead? But it clashes with datetime_immutable (use Types::DATETIMETZ_IMMUTABLE in schema) -> not possible as DATETIMETZ is not supported by mysql
  • use Status->value instead of name field!!!
  • wrap projection->setUp in transaction to ensure errors (like in a migration) will be fully rolled back -> Cannot be rolled back
  • discuss if the setup code is multi thread safe? Because it doesnt use a transaction? What if there is a catchup ongoing?
    • discoverNewSubscriptions doesnt have a transaction?
  • ensure the NEW subscriptions are never catched up! (dont reset NEW subscriptions, and dont reset detached because that fails)
  • todo will the CatchupHadErrors exception disturb the codeflow from dbal repos and doctrine orm things that are rolled back?
  • refine logging exceptions to only first throwable
    • context aware log files? Development ...
    • todo add sequence number and event type to catchup error?
    • overhaul getIteratorAggregate for errors collection? Ensure errors are logged correctly by the throwable storage!
  • if replay fails with error denote that its only a partially made replay because of batching
  • Document deprecations / changes:
    • Interface of catchup hook different
    • projectionReplayAll deprecated (use subscription commands instead)
    • content repository does not contain setup and status, use content repository maintainer instead
    • catchup --until for debugging was removed
  • questions
    • should a detachment of a subscriber be noticed in the ProcessResult as error? Should the cr throw?
    • TODO pass the error subscription status to onAfterCatchUp, so that in case of an error it can be prevented that mails f.x. will be sent?
    • introduce custom exception if catchup failed in the cr, maybe use the CatchUpFailed ... but rather rename that and also SubscriptionEngineAlreadyProcessingException to CatchUpError and then have a ErrorDuringCatchUp?
    • rename booting to booted? -> we agree that the ING is not a state, as it needs to be rather was.. but we dont have a better name
    • rename subscription status to state?
    • Make subscription now api and return it in the status collection as its not mutable?
    • do we need to prepare the option to deactivate catchup hooks at runtime?

Related: #4746
@github-actions github-actions bot added the 9.0 label Oct 24, 2024
@bwaidelich bwaidelich changed the title WIP CatchUp Subscription Engine Oct 24, 2024
@bwaidelich bwaidelich changed the title CatchUp Subscription Engine WIP: CatchUp Subscription Engine Oct 24, 2024
@mhsdesign
Copy link
Member

I had a quick look and id love to help with the wiring and sticking it all together :) We should probably really get a simple draft running of the catchups just getting their projection state first :) As that will require some work already

mhsdesign added a commit to mhsdesign/neos-development-collection that referenced this pull request Oct 27, 2024
…n they are registered for

A catchup doesn't have access to the full content repository, as it would allow full recursion via handle and accessing other projections
state is not safe as the other projection might not be behind - the order is undefined.

This will make it possible to catchup projections from outside of the cr instance as proposed here: neos#5321
mhsdesign added a commit to mhsdesign/neos-development-collection that referenced this pull request Nov 2, 2024
…n they are registered for

A catchup doesn't have access to the full content repository, as it would allow full recursion via handle and accessing other projections
state is not safe as the other projection might not be behind - the order is undefined.

This will make it possible to catchup projections from outside of the cr instance as proposed here: neos#5321
@mhsdesign
Copy link
Member

Okay i thought further about our little content graph projection vs projection states vs event handlers dilemma and i think the solution is not necessary exposing just the $projection on ProjectionEventHandler ... as that would also undo the explicit graph projection wiring from #5272 but instead pass a "bigger" object than "just" the Subscribers into the content repository factory.

in my eyes this object, being build by the content repository registry in some way, must reflect

  • what is guaranteed to be the ContentGraphProjectionInterface and its state ContentGraphReadModelInterface, not part of some collection but distinct
  • what are the additional projection states
  • what are all the subscribers (with their catchup-hooks) that are just directly passed to the subscription engine without doing anything further with it. (e.g. no public accessors on the ProjectionEventHandler for anything)

that could look like:

final readonly class ContentRepositoryGraphProjectionAndSubscribers
{
    public function __construct(
        public ContentGraphProjectionInterface $contentGraphProjection,
        public Subscribers $subscribers, // must contain a subscriber for the $contentGraphProjection
        public ProjectionStates $additionalProjectionStates, // must not contain the $contentGraphProjection state
    ) {
    }
}

or maybe a little more explicit so the factories dont have to deal with all the logic and we have control over the subscription ids:

final readonly class ProjectionsAndCatchupHooksBetterVersionZwo
{
    public function __construct(
        public ContentGraphProjectionInterface $contentGraphProjection,
        private Projections $additionalProjections,
        private Subscribers $additionalSubscriber,
        private array $catchUpHooksByProjectionClass
    ) {
    }

    public function getSubscribers(): Subscribers
    {
        $subscribers = iterator_to_array($this->additionalSubscriber);

        $subscribers[] = new Subscriber(
            SubscriptionId::fromString('contentGraphProjection'),
            SubscriptionGroup::fromString('default'),
            RunMode::FROM_BEGINNING,
            new ProjectionEventHandler(
                $this->contentGraphProjection,
                $this->getCatchUpHooksForProjectionClass(ContentGraphProjectionInterface::class),
            ),
        );
        
        foreach ($this->additionalProjections as $projection) {
            $subscribers[] = new Subscriber(
                SubscriptionId::fromString(substr(strrchr($projection::class, '\\'), 1)),
                SubscriptionGroup::fromString('default'),
                RunMode::FROM_BEGINNING,
                new ProjectionEventHandler(
                    $projection,
                    $this->getCatchUpHooksForProjectionClass($projection::class),
                ),
            );
        }
        
        return Subscribers::fromArray($subscribers);
    }
    
    public function getAdditionalProjectionStates(): ProjectionStates
    {
        return ProjectionStates::fromArray(array_map(
            fn ($projection) => $projection->getState(),
            iterator_to_array($this->additionalProjections)
        ));
    }

    private function getCatchUpHooksForProjectionClass(string $projectionClass): ?CatchUpHookInterface
    {
        return $this->catchUpHooksByProjectionClass[$projectionClass] ?? null;
    }
}

but for things that will belong to the future ProjectionService like, replayProjection, replayAllProjections, resetAllProjections we might still need to expose all projections here, unless the subscription engine will lear that itself: $this->subscriptionEngine->reset()

@bwaidelich
Copy link
Member Author

@mhsdesign thanks for your input!
Ands in a class name always make me suspicious..

That's my current draft of the ContentRepositoryFactory constructor:

public function __construct(
    private readonly ContentRepositoryId $contentRepositoryId,
    EventStoreInterface $eventStore,
    NodeTypeManager $nodeTypeManager,
    ContentDimensionSourceInterface $contentDimensionSource,
    Serializer $propertySerializer,
    private readonly UserIdProviderInterface $userIdProvider,
    private readonly ClockInterface $clock,
    SubscriptionStoreInterface $subscriptionStore,
    ContentGraphProjectionFactoryInterface $contentGraphProjectionFactory,
    CatchUpHookFactoryInterface $contentGraphCatchUpHookFactory,
    private readonly ContentRepositorySubscribersFactoryInterface $additionalSubscribersFactory,
) {
// ...
}

@mhsdesign
Copy link
Member

As discussed that looks good ❤️ my idea had a flaw because i assumed the projection instance could be build by the registry which it CANNOT because we need factory dependencies.... and the thing with iterating over the event handlers to fetch their state via getState or something is weird but okay in as that projectionState is now a little tunnel through space as well :) So definitely okay to do that little quirk.

neos-bot pushed a commit to neos/contentrepository-core that referenced this pull request Nov 4, 2024
…n they are registered for

A catchup doesn't have access to the full content repository, as it would allow full recursion via handle and accessing other projections
state is not safe as the other projection might not be behind - the order is undefined.

This will make it possible to catchup projections from outside of the cr instance as proposed here: neos/neos-development-collection#5321
neos-bot pushed a commit to neos/contentrepositoryregistry that referenced this pull request Nov 4, 2024
…n they are registered for

A catchup doesn't have access to the full content repository, as it would allow full recursion via handle and accessing other projections
state is not safe as the other projection might not be behind - the order is undefined.

This will make it possible to catchup projections from outside of the cr instance as proposed here: neos/neos-development-collection#5321
neos-bot pushed a commit to neos/neos that referenced this pull request Nov 4, 2024
…n they are registered for

A catchup doesn't have access to the full content repository, as it would allow full recursion via handle and accessing other projections
state is not safe as the other projection might not be behind - the order is undefined.

This will make it possible to catchup projections from outside of the cr instance as proposed here: neos/neos-development-collection#5321
# Conflicts:
#	Neos.ContentRepository.Core/Classes/ContentRepository.php
#	Neos.ContentRepository.Core/Classes/Factory/ContentRepositoryFactory.php
#	Neos.ContentRepositoryRegistry/Classes/ContentRepositoryRegistry.php
#	Neos.ContentRepositoryRegistry/Classes/Service/ProjectionService.php
#	Neos.ContentRepositoryRegistry/Classes/Service/ProjectionServiceFactory.php
#	phpstan-baseline.neon
Previously the `CatchUpHookInterface` contained a `onBeforeBatchCompleted` which was explained as follows:

This hook is called directly before the database lock is RELEASED

It can happen that this method is called multiple times, even without
having seen Events in the meantime.

If there exist more events which need to be processed, the database lock
is directly acquired again after it is released.

----------

`onAfterBatchCompleted` behaves similar just that it runs when the lock was released. This allows us to run all the tasks that `onAfterCatchUp` could also do (using possibly transactions and commiting work)

We dont ensure that `onAfterBatchCompleted` is only called if there are more events to handle, as this would complicate the code and technically without acquiring a lock there is never a guarantee for that, so hooks might as well encounter the case more often and have to deal with that.
@mhsdesign
Copy link
Member

mhsdesign commented Dec 9, 2024

okay so the long missed update. On the 3rd of December Basti, Denny and me discussed if going down road #5383 would be the right choice or what Basti said:

I had deliberately opted for less transactions and no additional (file based) locks

in short the result:

EDIT: Basti, Sebastian and me discussed this matter further on the 4th and we agreed that this is the solution.
Especially the catchup error case behaviour which is extensively documented here #4908 (comment) was discussed.

-> implemented via #5392

@mhsdesign
Copy link
Member

mhsdesign commented Dec 10, 2024

The return of the at least once delivery (for errors)

Today Christian, Denny, Bastian and me rediscussed the use of save-points (as described here).

As written above the save-point will only be used for REAL projection errors now and never rolled back if catchup errors occur.
With this change, the save-points are less important because a real projection error should better be thrown at the start before any statements, and even if some statements were issued and a full rollback is done its unlikely that a reactivateSubscription helps that case by applying the event again.

Then i thought that we could use transactions and save-points to wrap the setup call to ensure any migration - if it fails - can be rolled back, but i oversaw that CREATE TABLE can not run in a transaction after all. So the setup case doesnt justify the use of savepoints because its not possible there.

So with all these conditions changed - and with the fact that there is no central dbal connection concept in the core and thus any calls to a global createSavepoint method are code smell (and a real transactional method on the projection too much complexity) - and also in light that save-points are a less known and possibly battle tested feature, we want to dial back from a rollback in ERROR case.

Of course would it be more "pretty" if the apply is rolled back from the projection as some events will definitely crash the projection if reapplied via reactivateSubscription.
Currently ANY projection error should rather be treated with a full replay anyway.

Similarly to that doctrine migrations (`flow doctrine:migrate`) are also cannot be rollback in case of an error

So we cannot wrap projection->setUp in transaction and errors (like in a migration) will be directly commited. Thus, it must be acted with care.
see #5321 (comment)

the save-point will only be used for REAL projection errors now and never rolled back if catchup errors occur.

With that change in code the save-points are less important because a real projection error should better be thrown at the start before any statements, and even if some statements were issued and a full rollback is done its unlikely that a reactivateSubscription helps that case.

Instead, to repair projections you should replay
@mhsdesign
Copy link
Member

The decision to use batching for a replay but not for a full catchup (see #5321 (comment)) was rediscussed.

The thesis was that we should not differentiate between the two - to avoid having problems like why does a work and not b.
Also if we acknowledge that a transaction could go too big, we should prevent that in both cases then.

But we agreed on keeping that decision for the following reasons:

  • booting is triggered explicitly and if any errors occur they could be handled by the user
    • if a publish catchup fails if it is batched, the catchup will only be done partially and the Neos Ui cannot issue new commands to the cr. Someone would need to catchup (via backend module) until the position is up to date again and ignore any errors for each batch. -> this is not really nice UX for the editor.
  • for a catchup the size can only get huge with when a publish with thousand of nodes is run
    • in that case there will be a big simulation first using a big transaction
    • doing any batching here in fear that the transaction will be too big is too late: if that would be the case the simulation would have already failed

But we also should keep in mind:

  • publish only issues one content stream was forked event which is the most expensive part (and in a replay there are multiple)
  • transactions have no limit but possibly only get slower as they have to write the state to temporary files

bwaidelich and others added 2 commits December 10, 2024 20:12
…point-simplication

TASK: Overhaul CatchUpHook error behaviour; At least once delivery for ERROR projections
@mhsdesign mhsdesign marked this pull request as ready for review December 11, 2024 08:27
…ption

Adds test for these cases

Additionally, we have to introduce a `FakeCatchUpHookFactory2`
to have two hooks on one projection and prevent:

> a CatchUpHookFactory of type "Neos\ContentRepository\TestSuite\Fakes\FakeCatchUpHookFactory" already exists in this set
This way we reduce complexity as we dont have to pass too many parameters by reference.
Also it makes the case for `onBeforeCatchUp` (the first iteration) more explicit

We dont need to worry about rollbacks as no errors should ever be thrown uncatched during catchup.
…" was not invoked

this does not happen under normal circumstances but is technically possible as we release and reaquire a lock during batching.
As the exception log and `CatchUpHadErrors` only accept one previous exception and to reduce possible bloat if to many exceptions are thrown.
Instead, the exceptions should additionally be logged (next commit).
without batching we had invalid state because the position is none (0) while the event is applied and persisted an the projection.
as we use SubscriptionStatus::from
Copy link
Member

@mhsdesign mhsdesign left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensure exceptions during CatchUpHook are properly handled Consider using a single table for checkpoints
3 participants