feat(satp-hermes): add crash recovery & rollback protocol #3491
I will review this PR
f9014b0
to
0de9744
Compare
@Yogesh01000100 please rebase with satp-dev (should not have conflicts)
0de9744
to
4c0124d
Compare
@Yogesh01000100 please include documentation and tests, and update the description, as discussed.
ce9a179
to
24b8eaf
Compare
24b8eaf
to
728e7cb
Compare
@Yogesh01000100 could you please squash the commits and rebase with the latest version of satp-dev prior to merge?
1a55673
to
21ad772
Compare
Generally looks very good, but there are some changes to be done prior to merging.
Summarizing my comments:
- Add other authors to the commit
- Incorporate feedback from the logging process (namely un-hardcoding logs and adding more information)
- Implement RollbackState (for example, it should report at any moment how many steps remain to be rolled back, what has already been rolled back, the estimated time to completion, etc.)
- Please add tests that support the new feature
- Please add comprehensive documentation on this feature. Example: the README of SATP should have a section on how to run Docker Compose, with several example configurations.
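To make the RollbackState request concrete, here is a minimal sketch of what such a structure could track. Every name in it (`RollbackState`, `RollbackLogEntry`, `remainingSteps`, `estimatedMsToCompletion`) and every field is an assumption for illustration, not the type that was eventually merged in this PR.

```typescript
// Hypothetical shape: field names are illustrative, not taken from the PR.
interface RollbackLogEntry {
  stage: string; // e.g. "stage2"
  action: string; // what was undone
  success: boolean;
}

interface RollbackState {
  sessionId: string;
  totalSteps: number; // steps planned for the whole rollback
  rolledBack: RollbackLogEntry[]; // what was rolled back already
  startedAtMs: number; // wall-clock start of the rollback
}

// How many more steps are to be rolled back at this moment.
function remainingSteps(state: RollbackState): number {
  return Math.max(0, state.totalSteps - state.rolledBack.length);
}

// Naive estimate: average time per completed step times the remaining steps.
// Returns undefined until at least one step has completed.
function estimatedMsToCompletion(
  state: RollbackState,
  nowMs: number,
): number | undefined {
  const done = state.rolledBack.length;
  if (done === 0) return undefined;
  const avgMsPerStep = (nowMs - state.startedAtMs) / done;
  return avgMsPerStep * remainingSteps(state);
}
```

The estimate is deliberately simple; a real implementation might weight steps by stage, since ledger operations are likely slower than local cleanup.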
packages/cactus-plugin-satp-hermes/src/main/typescript/core/recovery/crash-manager.ts
packages/cactus-plugin-satp-hermes/src/main/typescript/core/recovery/crash-recovery-handler.ts
packages/cactus-plugin-satp-hermes/src/main/typescript/core/recovery/crash-recovery-handler.ts
...us-plugin-satp-hermes/src/main/typescript/core/recovery/rollback/stage0-rollback-strategy.ts
...us-plugin-satp-hermes/src/main/typescript/core/recovery/rollback/stage1-rollback-strategy.ts
...us-plugin-satp-hermes/src/main/typescript/core/recovery/rollback/stage2-rollback-strategy.ts
...us-plugin-satp-hermes/src/main/typescript/core/recovery/rollback/stage3-rollback-strategy.ts
...us-plugin-satp-hermes/src/main/typescript/generated/proto/cacti/satp/v02/common/health_pb.ts
49e1135
to
fb703b4
Compare
fb703b4
to
b30ccb5
Compare
Review how sessionData is being used, and take a look at the Stage 3 question.
Please document the new code as well. The rest is being documented in this PR:
#3619
packages/cactus-plugin-satp-hermes/src/main/typescript/core/recovery/crash-manager.ts
packages/cactus-plugin-satp-hermes/src/main/typescript/core/recovery/crash-manager.ts
packages/cactus-plugin-satp-hermes/src/main/typescript/core/recovery/crash-manager.ts
...us-plugin-satp-hermes/src/main/typescript/core/recovery/rollback/stage0-rollback-strategy.ts
...us-plugin-satp-hermes/src/main/typescript/core/recovery/rollback/stage3-rollback-strategy.ts
packages/cactus-plugin-satp-hermes/src/test/typescript/unit/recovery/logging.test.ts
packages/cactus-plugin-satp-hermes/src/test/typescript/unit/recovery/services.test.ts
13e0302
to
2896426
Compare
f0e50ef
to
cb24d53
Compare
cb24d53
to
d14f178
Compare
d14f178
to
4eef528
Compare
packages/cactus-plugin-satp-hermes/src/main/typescript/gol/gateway-orchestrator.ts
packages/cactus-plugin-satp-hermes/src/main/typescript/core/recovery/crash-manager.ts
packages/cactus-plugin-satp-hermes/src/main/proto/cacti/satp/v02/crash_recovery.proto
packages/cactus-plugin-satp-hermes/src/main/typescript/blo/dispatcher.ts
d6ffbca
to
1405923
Compare
d73a5eb
to
e16e84d
Compare
I'll leave some comments:
As discussed 3 months ago: @Yogesh01000100 please include documentation and tests, and update the description, as discussed.
Add other authors to the commit
Please include the CrashStatus and LocalLog types in the OpenAPI spec and import them where needed.
packages/cactus-plugin-satp-hermes/src/main/typescript/plugin-satp-hermes-gateway.ts
packages/cactus-plugin-satp-hermes/src/main/typescript/plugin-satp-hermes-gateway.ts
packages/cactus-plugin-satp-hermes/src/test/typescript/integration/recovery.test.ts
About the OpenAPI spec: it has both a tpl.json and a .json file, and I'm a bit unsure which one to update.
b094409
to
222d088
Compare
We need to consider this carefully. If sessionData remains as it is, we must handle it with care and clearly differentiate between the client and server sides of the gateway. I designed the sessionData this way to ensure that a gateway can act as both a client and server to itself.
@yogesh please address this concern
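One way to picture the concern above is a session object that keeps separate client-side and server-side data, so that a gateway acting as both client and server to itself never mixes the two. This is only a sketch under assumptions: `SessionData`, `Role`, and `Session` below are hypothetical and simplified, and are not the actual SATP session types.

```typescript
// Hypothetical, simplified types; not the actual SATP session implementation.
interface SessionData {
  id: string;
  lastSequenceNumber: number;
}

type Role = "client" | "server";

class Session {
  private client?: SessionData;
  private server?: SessionData;

  constructor(id: string, roles: Role[]) {
    // Each side gets its own copy, so updates never leak across roles.
    if (roles.includes("client")) {
      this.client = { id, lastSequenceNumber: 0 };
    }
    if (roles.includes("server")) {
      this.server = { id, lastSequenceNumber: 0 };
    }
  }

  // Callers must always say which side they mean.
  get(role: Role): SessionData {
    const data = role === "client" ? this.client : this.server;
    if (!data) {
      throw new Error(`session has no ${role} data`);
    }
    return data;
  }
}
```

Forcing every access through an explicit role makes recovery and rollback code state which side it is repairing, instead of silently picking whichever data happens to be present.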
Please include the CrashStatus and LocalLog types in the OpenAPI spec and import them where needed.
About the OpenAPI spec: it has both a tpl.json and a .json file, and I'm a bit unsure which one to update.
Please check package.json to see which one is used for generation and for what purpose.
Yogesh, can you confirm this has been addressed?
222d088
to
503658c
Compare
Please resolve these issues; they are important and can cause problems in the future.
packages/cactus-plugin-satp-hermes/src/main/typescript/gol/crash-manager.ts
packages/cactus-plugin-satp-hermes/src/main/typescript/gol/crash-manager.ts
503658c
to
b56fe20
Compare
There are problems that need to be fixed.
packages/cactus-plugin-satp-hermes/src/main/typescript/core/crash-management/client-service.ts
packages/cactus-plugin-satp-hermes/src/main/typescript/core/crash-management/server-service.ts
packages/cactus-plugin-satp-hermes/src/main/typescript/core/crash-management/client-service.ts
...n-satp-hermes/src/main/typescript/core/crash-management/rollback/stage0-rollback-strategy.ts
...n-satp-hermes/src/main/typescript/core/crash-management/rollback/stage0-rollback-strategy.ts
...n-satp-hermes/src/main/typescript/core/crash-management/rollback/stage1-rollback-strategy.ts
...n-satp-hermes/src/main/typescript/core/crash-management/rollback/stage2-rollback-strategy.ts
...n-satp-hermes/src/main/typescript/core/crash-management/rollback/stage2-rollback-strategy.ts
...n-satp-hermes/src/main/typescript/core/crash-management/rollback/stage3-rollback-strategy.ts
packages/cactus-plugin-satp-hermes/src/main/typescript/core/satp-session.ts
Added a commit fixing several issues. @Yogesh01000100 could you please take a look at the tests and double-check that everything works?
6efaa6f
to
523284f
Compare
There are some issues that need attention. If there is any issue you cannot resolve, please leave a TODO comment with an explanation everywhere it is needed.
packages/cactus-plugin-satp-hermes/src/main/proto/cacti/satp/v02/common/session.proto
...n-satp-hermes/src/main/typescript/core/crash-management/rollback/stage0-rollback-strategy.ts
...n-satp-hermes/src/main/typescript/core/crash-management/rollback/stage1-rollback-strategy.ts
packages/cactus-plugin-satp-hermes/src/main/typescript/core/satp-session.ts
packages/cactus-plugin-satp-hermes/src/main/typescript/core/satp-utils.ts
523284f
to
c5eaa19
Compare
1. Implemented recovery & rollback using RPC-based message handlers.
2. Added rollback strategies for all SATP stages.
3. Integrated database log management for recovery and rollback.
4. Added cron jobs for scheduled crash detection and recovery initiation.

Co-authored-by: Rafael Belchior <[email protected]>
Co-authored-by: Carlos Amaro <[email protected]>
Signed-off-by: Yogesh01000100 <[email protected]>

chore(satp-hermes): improve DB management
Signed-off-by: Rafael Belchior <[email protected]>

chore(satp-hermes): crash recovery architecture
Signed-off-by: Rafael Belchior <[email protected]>

fix(recovery): enhance crash recovery and rollback implementation
Signed-off-by: Yogesh01000100 <[email protected]>

refactor(recovery): consolidate logic and improve SATP message handling
Signed-off-by: Yogesh01000100 <[email protected]>

feat(recovery): add rollback implementations
Signed-off-by: Yogesh01000100 <[email protected]>

fix: correct return types and inits
Signed-off-by: Yogesh01000100 <[email protected]>

fix: add unit tests and resolve rollbackstate
Signed-off-by: Yogesh01000100 <[email protected]>

feat: add function processing logs from g2
Signed-off-by: Yogesh01000100 <[email protected]>

feat: add cron schedule for periodic crash checks
Signed-off-by: Yogesh01000100 <[email protected]>

fix: resolve rollback condition and add tests
Signed-off-by: Yogesh01000100 <[email protected]>

feat: add orchestrator communication layer using connect-RPC
Signed-off-by: Yogesh01000100 <[email protected]>

feat: add rollback protocol rpc
Signed-off-by: Yogesh01000100 <[email protected]>

fix: handle server log synchronization
Signed-off-by: Yogesh01000100 <[email protected]>

fix: resolve gol errors, add unit tests
Signed-off-by: Yogesh01000100 <[email protected]>

fix: handle server-side rollback
Signed-off-by: Yogesh01000100 <[email protected]>

fix: resolve networkId in rollback strategies
Signed-off-by: Yogesh01000100 <[email protected]>
c5eaa19
to
43367c9
Compare
LGTM
Description
This PR addresses issue #3114 by implementing core components for crash recovery and rollback protocols. The changes enhance fault tolerance and ensure consistent recovery during failures.
Key Changes
1. CrashManager
Introduced a CrashManager class responsible for managing crash detection, recovery, and rollback processes.
Key functionalities include periodic crash checks scheduled with node-schedule.
2. Protocol Services
Updated crash_recovery.proto to define:
- RecoverMessage, RecoverUpdateMessage, and RecoverSuccessMessage for crash recovery.
- RollbackMessage and RollbackAckMessage for rollback processes.
3. Recovery & Rollback Strategies
Implemented recovery & rollback strategies for all SATP protocol stages, ensuring the ability to revert to a consistent state upon failure.
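The per-stage strategies can be pictured as a small strategy-pattern lookup. The sketch below is an assumption-laden illustration: the `RollbackStrategy` interface, the `strategies` table, the `rollback` helper, and the action strings are all hypothetical placeholders, and only the existence of one strategy per stage (0 to 3) comes from the files reviewed in this PR.

```typescript
// Hypothetical sketch: interface and action descriptions are placeholders.
type Stage = 0 | 1 | 2 | 3;

interface RollbackStrategy {
  // Returns a description of each compensating action performed.
  execute(sessionId: string): string[];
}

// One strategy per SATP stage; the bodies stand in for real ledger operations.
const strategies: Record<Stage, RollbackStrategy> = {
  0: { execute: (id) => [`${id}: placeholder stage 0 compensation`] },
  1: { execute: (id) => [`${id}: placeholder stage 1 compensation`] },
  2: { execute: (id) => [`${id}: placeholder stage 2 compensation`] },
  3: { execute: (id) => [`${id}: placeholder stage 3 compensation`] },
};

// Pick the strategy matching the stage the session was in when it failed.
function rollback(sessionId: string, failedAt: Stage): string[] {
  return strategies[failedAt].execute(sessionId);
}
```

Keeping the stage-to-strategy mapping in one table means a new stage or a changed compensation only touches its own entry, and the caller never branches on the stage itself.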
4. Crash Detection and Handling
Added mechanisms to: