feat(satp-hermes): add crash recovery & rollback protocol
1. Implemented recovery & rollback using RPC-based message handlers.
2. Added rollback strategies for all SATP stages.
3. Integrated database log management for recovery and rollback.
4. Added cron jobs for scheduled crash detection and recovery initiation.

Co-authored-by: Rafael Belchior <[email protected]>
Co-authored-by: Carlos Amaro <[email protected]>
Signed-off-by: Yogesh01000100 <[email protected]>

chore(satp-hermes): improve DB management

Signed-off-by: Rafael Belchior <[email protected]>

chore(satp-hermes): crash recovery architecture

Signed-off-by: Rafael Belchior <[email protected]>

fix(recovery): enhance crash recovery and rollback implementation

Signed-off-by: Yogesh01000100 <[email protected]>

refactor(recovery): consolidate logic and improve SATP message handling

Signed-off-by: Yogesh01000100 <[email protected]>

feat(recovery): add rollback implementations

Signed-off-by: Yogesh01000100 <[email protected]>

fix: correct return types and inits

Signed-off-by: Yogesh01000100 <[email protected]>

fix: add unit tests and resolve rollbackstate

Signed-off-by: Yogesh01000100 <[email protected]>

feat: add function processing logs from g2

Signed-off-by: Yogesh01000100 <[email protected]>

feat: add cron schedule for periodic crash checks

Signed-off-by: Yogesh01000100 <[email protected]>

fix: resolve rollback condition and add tests

Signed-off-by: Yogesh01000100 <[email protected]>

feat: add orchestrator communication layer using connect-RPC

Signed-off-by: Yogesh01000100 <[email protected]>

feat: add rollback protocol rpc

Signed-off-by: Yogesh01000100 <[email protected]>

fix: handle server log synchronization

Signed-off-by: Yogesh01000100 <[email protected]>

fix: resolve gol errors, add unit tests

Signed-off-by: Yogesh01000100 <[email protected]>

fix: handle server-side rollback

Signed-off-by: Yogesh01000100 <[email protected]>

fix: resolve networkId in rollback strategies

Signed-off-by: Yogesh01000100 <[email protected]>
Yogesh01000100 authored and RafaelAPB committed Jan 10, 2025
1 parent 109b7b4 commit 1737df4
Showing 43 changed files with 6,478 additions and 57 deletions.
49 changes: 43 additions & 6 deletions packages/cactus-plugin-satp-hermes/README.md
@@ -57,6 +57,17 @@ The sequence diagram of SATP is pictured below.

![satp-sequence-diagram](https://i.imgur.com/SOdXFEt.png)

### Crash Recovery Integration
The crash recovery protocol ensures session consistency across all stages of SATP. Each session's state, logs, hashes, timestamps, and signatures are stored and recovered using the following mechanisms:

1. **Session Logs**: A persistent log storage mechanism ensures crash-resilient state recovery.
2. **Consistency Checks**: Ensures all messages and actions are consistent across both gateways and the connected ledgers.
3. **Stage Recovery**: Recovers interrupted sessions by validating logs, hashes, timestamps, and signatures to maintain protocol integrity.
4. **Rollback Operations**: In the event of a timeout or irrecoverable failure, rollback messages ensure that the state is reverted to what it was before the current stage.
5. **Logging & Proofs**: The database is leveraged for state consistency and proof accountability across gateways.

Refer to the [Crash Recovery Sequence](https://datatracker.ietf.org/doc/html/draft-belchior-satp-gateway-recovery) for more details.
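
The log-based mechanisms above can be illustrated with a small sketch. This is not the plugin's implementation: the `LogEntry` shape only loosely follows the `logs` table columns (`sessionId`, `type`, `key`, `operation`), while the `timestamp`, `data`, and `hash` fields and the helpers `hashEntry`, `makeEntry`, and `lastValidEntry` are assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical log entry; timestamp, data and hash are assumed fields.
interface LogEntry {
  sessionId: string;
  type: string;
  key: string;
  operation: string;
  timestamp: number;
  data: string;
  hash: string;
}

// The hash covers every field except the hash itself.
function hashEntry(e: Omit<LogEntry, "hash">): string {
  return createHash("sha256")
    .update([e.sessionId, e.type, e.key, e.operation, e.timestamp, e.data].join("|"))
    .digest("hex");
}

// Builds an entry and attaches its hash before it is persisted.
function makeEntry(e: Omit<LogEntry, "hash">): LogEntry {
  return { ...e, hash: hashEntry(e) };
}

// Returns the most recent entry of a session whose stored hash still matches
// its contents, or undefined if the session has no valid entry.
function lastValidEntry(logs: LogEntry[], sessionId: string): LogEntry | undefined {
  return logs
    .filter((l) => l.sessionId === sessionId && hashEntry(l) === l.hash)
    .sort((a, b) => b.timestamp - a.timestamp)[0];
}
```

A recovering gateway could resume from the entry returned by `lastValidEntry`, skipping any entry whose contents no longer match its hash.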

### Application-to-Gateway API (API Type 1)
We

@@ -76,17 +87,33 @@ There are Client and Server Endpoints for each type of message detailed in the S
- CommitFinalV1Response
- TransferCompleteV1Request
- ClientV1Request
### Crash Recovery Endpoints
There are Client and Server gRPC Endpoints for the recovery and rollback messages:

There are also defined the endpoints for the crash recovery procedure (there is still missing the endpoint to receive the Rollback message):
- RecoverV1Message
- RecoverUpdateV1Message
- RecoverUpdateAckV1Message
- RecoverSuccessV1Message
- RollbackV1Message
- **Recovery Messages:**
- `RecoverV2Message`
- `RecoverV2SuccessMessage`
- `RecoverUpdateMessage`
- **Rollback Messages:**
- `RollbackV2Message`
- `RollbackAckMessage`
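
The rollback exchange can be sketched as a request and acknowledgement pair. The following is a minimal illustration only, not the plugin's code: the field names mirror `RollbackMessage` and `RollbackAckMessage` in camelCase, `sessionId` and `stage` are assumed fields, and the per-stage actions, the `Signer` callback, and the helper names are hypothetical.

```typescript
// Shapes loosely mirroring the RollbackMessage / RollbackAckMessage protobufs.
// sessionId and stage are assumed fields; signatures are opaque strings here.
interface RollbackMessage {
  sessionId: string;
  stage: string;
  success: boolean;
  actionsPerformed: string[];
  proofs: string[];
  clientSignature: string;
}

interface RollbackAckMessage {
  sessionId: string;
  stage: string;
  success: boolean;
  actionsPerformed: string[];
  proofs: string[];
  serverSignature: string;
}

// Placeholder for a real signing function.
type Signer = (payload: string) => string;

// Hypothetical compensating actions per stage, for illustration only.
const clientRollbackActions: Record<string, string[]> = {
  "0": [],
  "1": [],
  "2": ["UNLOCK_ASSET"],
  "3": ["REISSUE_ASSET", "UNLOCK_ASSET"],
};

// Client side: look up the actions for the stage and sign the message body.
function buildRollbackMessage(sessionId: string, stage: string, sign: Signer): RollbackMessage {
  const actions = clientRollbackActions[stage];
  const body = {
    sessionId,
    stage,
    success: actions !== undefined, // unknown stage: nothing was rolled back
    actionsPerformed: actions ?? [],
    proofs: [] as string[],
  };
  return { ...body, clientSignature: sign(JSON.stringify(body)) };
}

// Server side: report its own rollback actions and sign the acknowledgement.
function acknowledgeRollback(
  msg: RollbackMessage,
  serverActions: string[],
  sign: Signer,
): RollbackAckMessage {
  const body = {
    sessionId: msg.sessionId,
    stage: msg.stage,
    success: msg.success,
    actionsPerformed: serverActions,
    proofs: [] as string[],
  };
  return { ...body, serverSignature: sign(JSON.stringify(body)) };
}
```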

## Use case
Alice and Bob, in blockchains A and B, respectively, want to make a transfer of an asset from one to the other. Gateway A represents the gateway connected to Alice's blockchain. Gateway B represents the gateway connected to Bob's blockchain. Alice and Bob will run SATP, which will execute the transfer of the asset from blockchain A to blockchain B. The above endpoints will be called in sequence. Notice that the asset will first be locked on blockchain A and a proof is sent to the server-side. Afterward, the asset on the original blockchain is extinguished, followed by its regeneration on blockchain B.

### Role of Crash Recovery in SATP
In SATP, crash recovery ensures that asset transfers remain consistent and fault-tolerant across distributed ledgers. Key features include:
- **Session Recovery**: Gateways synchronize state using recovery messages, ensuring continuity after failures.
- **Rollback**: For irrecoverable errors, rollback procedures ensure safe reversion to previous states.
- **Fault Resilience**: Enables recovery from crashes while maintaining the integrity of ongoing transfers.

These features enhance reliability in scenarios where network or gateway disruptions occur during asset transfers.
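
The commit message mentions cron jobs for scheduled crash detection. A rough sketch of that idea follows, under stated assumptions: a session counts as crashed when its last log entry is older than a timeout, and recovery is attempted a bounded number of times before falling back to rollback. All names here (`SessionSnapshot`, `detectCrashed`, `chooseAction`, `startCrashCheck`) are hypothetical, and `setInterval` stands in for a cron scheduler so that the example has no external dependencies.

```typescript
// Hypothetical view of a session as seen by the crash checker.
interface SessionSnapshot {
  sessionId: string;
  lastLogTimestamp: number; // milliseconds since epoch
}

// A session is treated as crashed when it has logged nothing for timeoutMs.
function detectCrashed(sessions: SessionSnapshot[], now: number, timeoutMs: number): string[] {
  return sessions
    .filter((s) => now - s.lastLogTimestamp > timeoutMs)
    .map((s) => s.sessionId);
}

type Action = "RECOVER" | "ROLLBACK";

// Try recovery first; once the attempts are exhausted, fall back to rollback.
function chooseAction(attempts: number, maxAttempts: number): Action {
  return attempts < maxAttempts ? "RECOVER" : "ROLLBACK";
}

// Runs the check periodically and returns a function that stops it.
function startCrashCheck(
  getSessions: () => SessionSnapshot[],
  onCrash: (sessionId: string) => void,
  intervalMs: number,
  timeoutMs: number,
): () => void {
  const timer = setInterval(() => {
    for (const id of detectCrashed(getSessions(), Date.now(), timeoutMs)) {
      onCrash(id);
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Keeping `detectCrashed` and `chooseAction` as pure functions makes the decision logic testable without waiting for the scheduler to fire.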

### Future Work

- **Single-Gateway Topology Enhancement**
The crash recovery and rollback mechanisms are implemented for configurations where client and server data are handled separately. For single-gateway setups, where client and server data coexist in one session, fetching a single log, as the current implementation does, may not suffice. Reconstructing the session after a crash would require `recoverSessions()` to fetch multiple logs (X logs) so that client-specific and server-specific data can be differentiated and handled accurately.

## Running the tests

[A test of the entire protocol with manual calls to the methods, i.e. without ledger connectors and Open API.](https://github.com/hyperledger/cactus/blob/2e94ef8d3b34449c7b4d48e37d81245851477a3e/packages/cactus-plugin-satp-hermes/src/test/typescript/integration/satp.test.ts)
@@ -109,6 +136,16 @@ Alice and Bob, in blockchains A and B, respectively, want to make a transfer of

[A test with a backup gateway resuming the protocol after the client gateway crashed.](https://github.com/hyperledger/cactus/tree/main/packages/cactus-plugin-satp-hermes/src/test/typescript/integration/backup-gateway-after-client-crash.test.ts)


### Crash Recovery Tests
- [Stage 1 Recovery Test](src/test/typescript/integration/recovery/recovery-stage-1.test.ts)
- [Stage 2 Recovery Test](src/test/typescript/integration/recovery/recovery-stage-2.test.ts)
- [Stage 3 Recovery Test](src/test/typescript/integration/recovery/recovery-stage-3.test.ts)
- [Stage 0 Rollback Test](src/test/typescript/integration/rollback/rollback-stage-0.test.ts)
- [Stage 1 Rollback Test](src/test/typescript/integration/rollback/rollback-stage-1.test.ts)
- [Stage 2 Rollback Test](src/test/typescript/integration/rollback/rollback-stage-2.test.ts)
- [Stage 3 Rollback Test](src/test/typescript/integration/rollback/rollback-stage-3.test.ts)

For developers that want to test separate steps/phases of the SATP protocol, please refer to [these](https://github.com/hyperledger/cactus/blob/2e94ef8d3b34449c7b4d48e37d81245851477a3e/packages/cactus-plugin-satp-hermes/src/test/typescript/unit/) test files (client and server side along with the recovery procedure).

## Usage
8 changes: 8 additions & 0 deletions packages/cactus-plugin-satp-hermes/package.json
@@ -40,6 +40,10 @@
{
"name": "Bruno Mateus",
"url": "https://github.com/brunoffmateus"
},
{
"name": "Yogesh D",
"url": "https://github.com/Yogesh01000100"
}
],
"main": "dist/lib/main/typescript/index.js",
@@ -79,6 +83,8 @@
"preinstall": "curl -L https://foundry.paradigm.xyz | bash && foundryup",
"pretsc": "npm run generate-sdk",
"start-gateway": "node ./dist/lib/main/typescript/plugin-satp-hermes-gateway-cli.js",
"test:integration": "npx jest ./src/test/typescript/integration",
"test:unit": "npx jest ./src/test/typescript/unit",
"tsc": "tsc --project ./tsconfig.json",
"watch": "tsc --build --watch",
"db:setup": "bash -c 'npm run db:destroy || true && run-s db:start db:migrate db:seed'",
@@ -136,6 +142,7 @@
"jsonc": "2.0.0",
"knex": "2.4.0",
"kubo-rpc-client": "3.0.1",
"node-schedule": "2.1.1",
"npm-run-all": "4.1.5",
"openzeppelin-solidity": "3.4.2",
"pg": "8.13.1",
@@ -163,6 +170,7 @@
"@types/fs-extra": "11.0.4",
"@types/google-protobuf": "3.15.12",
"@types/node": "18.18.2",
"@types/node-schedule": "2.1.7",
"@types/pg": "8.11.10",
"@types/swagger-ui-express": "4.1.6",
"@types/tape": "4.13.4",
@@ -6,11 +6,11 @@ import { Knex } from "knex";
const envPath = process.env.ENV_PATH;
dotenv.config({ path: envPath });

const config: { [key: string]: Knex.Config } = {
development: {
export const knexRemoteInstance: { [key: string]: Knex.Config } = {
default: {
client: "sqlite3",
connection: {
filename: path.resolve(__dirname, ".dev.remote-" + uuidv4() + ".sqlite3"),
filename: path.resolve(__dirname, `.dev.remote-${uuidv4()}.sqlite3`),
},
migrations: {
directory: path.resolve(__dirname, "migrations"),
@@ -31,5 +31,3 @@ const config: { [key: string]: Knex.Config } = {
},
},
};

export default config;
6 changes: 2 additions & 4 deletions packages/cactus-plugin-satp-hermes/src/knex/knexfile.ts
@@ -6,8 +6,8 @@ import { Knex } from "knex";
const envPath = process.env.ENV_PATH;
dotenv.config({ path: envPath });

const config: { [key: string]: Knex.Config } = {
development: {
export const knexLocalInstance: { [key: string]: Knex.Config } = {
default: {
client: "sqlite3",
connection: {
filename: path.resolve(__dirname, `.dev.local-${uuidv4()}.sqlite3`),
@@ -34,5 +34,3 @@ const config: { [key: string]: Knex.Config } = {
},
},
};

export default config;
@@ -2,7 +2,7 @@ import { Knex } from "knex";

export function up(knex: Knex): Knex.SchemaBuilder {
return knex.schema.createTable("logs", (table) => {
table.string("sessionID").notNullable();
table.string("sessionId").notNullable();
table.string("type").notNullable();
table.string("key").notNullable().primary();
table.string("operation").notNullable();
@@ -80,6 +80,7 @@ message SessionData {
cacti.satp.v02.common.MessageType phase_error = 68;
bool recovered_tried = 69;
SATPMessages satp_Messages = 70;
Type role = 71;
}

enum State {
@@ -93,6 +94,12 @@ enum State {
STATE_RECOVERING = 7;
}

enum Type {
UNKNOWN = 0;
CLIENT = 1;
SERVER = 2;
}

message SATPMessages {
Stage0Messages stage0 = 1;
Stage1Messages stage1 = 2;
@@ -9,7 +9,7 @@ service CrashRecovery {

// step RPCs
rpc RecoverV2Message(RecoverMessage) returns (RecoverUpdateMessage);
rpc RecoverV2SuccessMessage(RecoverSuccessMessage) returns (google.protobuf.Empty);
rpc RecoverV2SuccessMessage(RecoverSuccessMessage) returns (RecoverSuccessMessageResponse);
rpc RollbackV2Message(RollbackMessage) returns (RollbackAckMessage);
}

@@ -21,15 +21,15 @@ message RecoverMessage {
bool is_backup = 5;
string new_identity_public_key = 6;
int64 last_entry_timestamp = 7;
string sender_signature = 8;
string client_signature = 8;
}

message RecoverUpdateMessage {
string session_id = 1;
string message_type = 2;
string hash_recover_message = 3;
repeated LocalLog recovered_logs = 4;
string sender_signature = 5;
repeated persistLogEntry recovered_logs = 4;
string server_signature = 5;
}

message RecoverSuccessMessage {
@@ -38,7 +38,13 @@ message RecoverSuccessMessage {
string hash_recover_update_message = 3;
bool success = 4;
repeated string entries_changed = 5;
string sender_signature = 6;
string client_signature = 6;
}

message RecoverSuccessMessageResponse {
string session_id = 1;
bool received = 2;
string server_signature = 3;
}

message RollbackMessage {
@@ -47,7 +53,7 @@ message RollbackMessage {
bool success = 3;
repeated string actions_performed = 4;
repeated string proofs = 5;
string sender_signature = 6;
string client_signature = 6;
}

message RollbackAckMessage {
@@ -56,10 +62,10 @@ message RollbackAckMessage {
bool success = 3;
repeated string actions_performed = 4;
repeated string proofs = 5;
string sender_signature = 6;
string server_signature = 6;
}

message LocalLog {
message persistLogEntry {
string session_id = 1;
string type = 2;
string key = 3;
@@ -74,16 +80,14 @@ message RollbackLogEntry {
string stage = 2;
string timestamp = 3;
string action = 4; // action performed during rollback
string status = 5; // status of rollback (e.g., SUCCESS, FAILED)
string details = 6; // Additional details or metadata about the rollback
string status = 5;
string details = 6;
}

message RollbackState {
string session_id = 1;
string current_stage = 2;
int32 steps_remaining = 3;
repeated RollbackLogEntry rollback_log_entries = 4;
string estimated_time_to_completion = 5;
string status = 6; // Overall status (e.g., IN_PROGRESS, COMPLETED, FAILED)
string details = 7; // Additional metadata or information
repeated RollbackLogEntry rollback_log_entries = 3;
string status = 4; // Overall status (e.g., IN_PROGRESS, COMPLETED, FAILED)
string details = 5;
}
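
To make the `RecoverV2Message` step more concrete, here is a hedged TypeScript sketch of how a server might answer a `RecoverMessage` with a `RecoverUpdateMessage`: it returns the log entries newer than the client's `last_entry_timestamp`, together with a hash of the request. It is an illustration, not the plugin's handler. The interfaces use camelCase versions of the proto field names; `sessionId` on `RecoverMessage`, `timestamp` on the log entry, the `"RECOVER_UPDATE"` message type string, and the `sign` callback are all assumptions.

```typescript
import { createHash } from "node:crypto";

// Subset of persistLogEntry; timestamp is an assumed field.
interface PersistLogEntry {
  sessionId: string;
  type: string;
  key: string;
  timestamp: number;
}

// sessionId is assumed; the remaining fields follow the proto definition.
interface RecoverMessage {
  sessionId: string;
  isBackup: boolean;
  newIdentityPublicKey: string;
  lastEntryTimestamp: number;
  clientSignature: string;
}

interface RecoverUpdateMessage {
  sessionId: string;
  messageType: string;
  hashRecoverMessage: string;
  recoveredLogs: PersistLogEntry[];
  serverSignature: string;
}

// Builds the server's reply: logs of the same session that are newer than the
// client's last known entry, oldest first, plus a hash of the request.
function buildRecoverUpdate(
  req: RecoverMessage,
  serverLogs: PersistLogEntry[],
  sign: (payload: string) => string,
): RecoverUpdateMessage {
  const recoveredLogs = serverLogs
    .filter((l) => l.sessionId === req.sessionId && l.timestamp > req.lastEntryTimestamp)
    .sort((a, b) => a.timestamp - b.timestamp);
  const hashRecoverMessage = createHash("sha256")
    .update(JSON.stringify(req))
    .digest("hex");
  const body = {
    sessionId: req.sessionId,
    messageType: "RECOVER_UPDATE", // hypothetical label
    hashRecoverMessage,
    recoveredLogs,
  };
  return { ...body, serverSignature: sign(JSON.stringify(body)) };
}
```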
@@ -41,7 +41,7 @@ export async function getHealthCheckService(
const status = manager.healthCheck();

const res: HealthCheckResponse = {
status: status
status: status,
};

log.debug(res);
@@ -1,4 +1,3 @@
import { Logger } from "@hyperledger/cactus-common";
import { TransactRequest, TransactResponse } from "../../public-api";
import { SATPManager } from "../../gol/satp-manager";
import { populateClientSessionData } from "../../core/session-utils";
@@ -12,7 +11,6 @@ import { GatewayOrchestrator } from "../../gol/gateway-orchestrator";
import { GatewayIdentity } from "../../core/types";
import { SATP_VERSION } from "../../core/constants";
import { getStatusService } from "../admin/get-status-handler-service";
import { log } from "console";

// todo
export async function executeTransact(
@@ -26,7 +24,7 @@ export async function executeTransact(
label: fnTag,
level: logLevel,
});

logger.info(`${fnTag}, executing transaction endpoint`);

//TODO check input for valid strings...