Commit

Merge pull request #11 from SaladTechnologies/0.5.1
0.5.1 - Better handling of transient network failures
shawnrushefsky authored Feb 18, 2025
2 parents 98ab22f + d40c451 commit 6768597
Showing 5 changed files with 44 additions and 11 deletions.
4 changes: 2 additions & 2 deletions package-lock.json

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
 {
   "name": "kelpie",
-  "version": "0.5.0",
+  "version": "0.5.1",
   "description": "A worker binary to coordinate long running jobs on salad. Works with Kelpie API",
   "main": "dist/index.js",
   "scripts": {
4 changes: 2 additions & 2 deletions readme.md
@@ -74,7 +74,7 @@ CMD ["/kelpie"]

 When running the image, you will need additional configuration in the environment:

-- AWS/Cloudflare Credentials: Provide `AWS_ACCESS_KEY_ID`, etc to enable the kelpie worker to upload and download from your bucket storage. We use the s3 compatability api, so any s3-compatible storage should work.
+- AWS/Cloudflare Credentials: Provide `AWS_ACCESS_KEY_ID`, etc to enable the kelpie worker to upload and download from your bucket storage. We use the s3 compatibility api, so any s3-compatible storage should work.
 - `KELPIE_API_URL`: the root URL for the coordination API, e.g. kelpie.saladexamples.com
 - `KELPIE_API_KEY`: Your api key for the coordination API, issued by Salad for use with kelpie. NOT your Salad API Key.
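
The configuration list above lends itself to a startup check that fails fast when a variable is absent. The following is a minimal sketch and is not code from kelpie: `findMissingVars`, `REQUIRED_VARS`, `Env`, and `exampleEnv` are hypothetical names, and only the three variable names the readme spells out are checked (the readme's "etc" is left unexpanded).

```typescript
// Variable names copied from the readme. The readme's "etc" is deliberately
// not expanded, so this list is incomplete by design.
const REQUIRED_VARS: string[] = [
  "AWS_ACCESS_KEY_ID",
  "KELPIE_API_URL",
  "KELPIE_API_KEY",
];

// A plain object stands in for process.env so the sketch has no Node typings dependency.
type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function findMissingVars(env: Env, required: string[] = REQUIRED_VARS): string[] {
  return required.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}

// Example: KELPIE_API_KEY is not set, so it is the only name reported.
const exampleEnv: Env = {
  AWS_ACCESS_KEY_ID: "example-access-key",
  KELPIE_API_URL: "kelpie.saladexamples.com",
};
const missing = findMissingVars(exampleEnv); // ["KELPIE_API_KEY"]
```

In a real worker one would presumably pass `process.env` and exit with a clear message when the returned list is non-empty.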

@@ -108,7 +108,7 @@ This is optional, and only required if you want to use Kelpie's autoscaling feat
 Kelpie uses the Salad API to [start](https://docs.salad.com/reference/saladcloud-api/container_groups/start-a-container-group), [stop](https://docs.salad.com/reference/saladcloud-api/container_groups/stop-a-container-group), and [scale](https://docs.salad.com/reference/saladcloud-api/container_groups/update-a-container-group) your container group in response to job volume.

 In your container group configuration, you will provide the docker image url, the hardware configuration needed by your job, and the environment variables detailed above.
-You do not need to enable Container Gateway, or Job Queues, and you do not need to configure probes.
+You do not need to enable Container Gateway, or Job Queues, and you do not need to configure probes.
 While Salad does offer built-in logging, it is still recommended to connect an [external logging service](https://docs.salad.com/products/sce/container-groups/external-logging/external-logging) for more advanced features.

 Once your container group is deployed, and you've verified that the node starts and runs successfully, you'll want to retrieve the container group ID from the [Salad API](https://docs.salad.com/api-reference/container_groups/get-a-container-group). You will use this ID when submitting jobs to the Kelpie API.
6 changes: 3 additions & 3 deletions src/api.ts
@@ -123,15 +123,15 @@ export async function reportFailed(jobId: string, log: Logger): Promise<void> {
   );
   numFailures++;
   if (numFailures >= maxJobFailures) {
-    await reallocateMe(log);
+    await reallocateMe("Kelpie: Max Job Failures Exceeded", log);
   }
 }

-export async function reallocateMe(log: Logger): Promise<void> {
+export async function reallocateMe(reason: string, log: Logger): Promise<void> {
   try {
     log.info("Reallocating container via IMDS");
     await imds.metadata.reallocateContainer({
-      reason: "Kelpie: Max Job Failures Exceeded",
+      reason,
     });
   } catch (e: any) {
     log.error(`Failed to reallocate container via IMDS: ${e.message}`);
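
The change above moves the reason string out of `reallocateMe` and into its callers, so each call site can report why the node is being reallocated. The following is a minimal, self-contained sketch of that shape rather than the project's code: `ReallocateClient`, `fakeClient`, `quietLogger`, and the injected `client` parameter are hypothetical stand-ins for `imds.metadata` and the real `Logger`, and the idle timeout value is illustrative.

```typescript
// Hypothetical stand-ins for the Logger type and the IMDS client used in src/api.ts.
interface Logger {
  info(msg: string): void;
  error(msg: string): void;
}

interface ReallocateClient {
  reallocateContainer(body: { reason: string }): Promise<void>;
}

// Mirrors the new signature: the caller supplies the reason. The client is
// injected here (an assumption of this sketch) so the function can be exercised offline.
async function reallocateMe(
  reason: string,
  log: Logger,
  client: ReallocateClient
): Promise<void> {
  try {
    log.info("Reallocating container via IMDS");
    await client.reallocateContainer({ reason });
  } catch (e: unknown) {
    // As in the diff, a failed reallocation is logged rather than rethrown.
    const message = e instanceof Error ? e.message : String(e);
    log.error(`Failed to reallocate container via IMDS: ${message}`);
  }
}

// A fake client that records each reason instead of calling a real service.
const recordedReasons: string[] = [];
const fakeClient: ReallocateClient = {
  reallocateContainer: async ({ reason }) => {
    recordedReasons.push(reason);
  },
};
const quietLogger: Logger = { info: () => {}, error: () => {} };

const maxTimeWithNoWorkMs = 60000; // illustrative value, not taken from the source

// The two call sites from the diff, each with its own reason.
const done: Promise<void> = (async () => {
  await reallocateMe("Kelpie: Max Job Failures Exceeded", quietLogger, fakeClient);
  await reallocateMe(
    `Kelpie: Max idle time exceeded: ${maxTimeWithNoWorkMs}`,
    quietLogger,
    fakeClient
  );
})();
```

Passing the reason as an argument keeps `reallocateMe` free of knowledge about why it was called, which is what lets the idle-timeout path in `src/index.ts` reuse it.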
39 changes: 36 additions & 3 deletions src/index.ts
@@ -166,7 +166,10 @@ async function main() {
         /**
          * If there are no uploads in progress, we can reallocate the instance.
          */
-        await reallocateMe(baseLogger);
+        await reallocateMe(
+          `Kelpie: Max idle time exceeded: ${maxTimeWithNoWorkMs}`,
+          baseLogger
+        );
         break;
       }
       baseLogger.info("No work available, sleeping for 10 seconds...");
@@ -183,12 +186,42 @@ async function main() {

     const directoryWatchers: DirectoryWatcher[] = [];

-    heartbeatManager.startHeartbeat(work.heartbeat_interval, async () => {
+    /**
+     * The heartbeat endpoint may return a status of "canceled" if the job has been cancelled,
+     * in which case we should stop the job and ask for a new one.
+     */
+    const onJobCancel = async () => {
       await Promise.all(
         directoryWatchers.map((watcher) => watcher.stopWatching())
       );
       commandExecutor.interrupt();
-    });
+    };
+
+    const handleHeartbeatError = async (e: any) => {
+      /**
+       * This occurs if a heartbeat fails config.maxRetries times, meaning the machine
+       * has lost communication with kelpie api
+       * */
+      log.error(`Heartbeat error: ${e.message}`);
+
+      /**
+       * If the heartbeat throws an error, we should restart it.
+       * This is because the error is likely due to a network issue which
+       * may be transient, and the job is still running. This way,
+       * the job can continue to run and the heartbeat will be re-established.
+       *
+       * The alternative is to abort the job or reallocate the instance, but this is
+       * not ideal because the job is still running and may complete successfully.
+       */
+      await heartbeatManager.stopHeartbeat();
+      await heartbeatManager
+        .startHeartbeat(work.heartbeat_interval, onJobCancel)
+        .catch(handleHeartbeatError);
+    };
+
+    heartbeatManager
+      .startHeartbeat(work.heartbeat_interval, onJobCancel)
+      .catch(handleHeartbeatError);

     /**
      * This block is event-driven, triggered by file changes in configured directories.
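
The heartbeat change above restarts the heartbeat when it fails instead of aborting the job. The sketch below shows the same restart-on-error idea written as a loop rather than the recursive `.catch` used in the diff. It is a sketch under assumptions, not the project's implementation: `Heartbeat`, `keepHeartbeatAlive`, and `flaky` are hypothetical names, the `maxRestarts` cap is an addition of this sketch (the diff restarts without a limit), and it assumes that `start()` resolves on a normal stop and rejects once its own retries are exhausted. The diff only confirms that the returned promise can reject.

```typescript
// Hypothetical minimal shape of a heartbeat manager.
// ASSUMPTION: start() resolves when the heartbeat ends normally and rejects
// after its internal retries are exhausted.
interface Heartbeat {
  start(): Promise<void>;
  stop(): Promise<void>;
}

// Restarts the heartbeat after each failure, up to maxRestarts times.
// Resolves with the number of restarts that were needed.
async function keepHeartbeatAlive(
  heartbeat: Heartbeat,
  logError: (msg: string) => void,
  maxRestarts: number // cap added by this sketch; not present in the diff
): Promise<number> {
  let restarts = 0;
  for (;;) {
    try {
      await heartbeat.start();
      return restarts; // heartbeat ended without error
    } catch (e: unknown) {
      const message = e instanceof Error ? e.message : String(e);
      logError(`Heartbeat error: ${message}`);
      if (restarts >= maxRestarts) throw e;
      restarts++;
      // Tear down the failed heartbeat before starting a fresh one.
      // The job itself is not interrupted.
      await heartbeat.stop();
    }
  }
}

// A fake heartbeat that fails twice (a simulated transient outage), then succeeds.
let attempts = 0;
const flaky: Heartbeat = {
  start: async () => {
    attempts++;
    if (attempts <= 2) throw new Error("network unreachable");
  },
  stop: async () => {},
};

const errorLog: string[] = [];
const restartCount: Promise<number> = keepHeartbeatAlive(
  flaky,
  (msg) => {
    errorLog.push(msg);
  },
  5
);
```

With the fake above, the promise resolves after two restarts and two logged errors. A cap trades the diff's "never give up" behavior for a bounded failure mode; whether that trade is appropriate depends on how long outages are expected to last relative to job duration.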
