0.5.1 - Better handling of transient network failures #11

Merged 7 commits on Feb 18, 2025

@shawnrushefsky (Collaborator) commented Feb 18, 2025

  • Adds clearer reasons when invoking the IMDS reallocate endpoint.
  • Handles the case where the heartbeat manager throws an error. This is typically caused by a transient inability to reach the Kelpie API: a router got unplugged, the internet went out on a node, or there was an error affecting the Kelpie API or its dependent services on Cloudflare. The worker now keeps trying to restart the heartbeat until the node dies or the connection is restored.
  • But what if my job already got handed out again during the disconnection? The heartbeat endpoint returns a status of "canceled" if a heartbeat is attempted against a job that is held by another machine id. The Kelpie worker then treats the job as a normal canceled job, interrupting the job process and asking for another job.
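
For readers who want a picture of the behavior described above, here is a rough sketch of a heartbeat loop that retries through transient failures and stops the job on a "canceled" status. It is an illustration only, not the actual Kelpie worker code: `run_heartbeat`, `send_heartbeat`, `cancel_job`, `should_stop`, the interval values, and the use of `ConnectionError` as the transient error are all assumptions. Only the "canceled" status string comes from the description.

```python
import time
from typing import Callable


def run_heartbeat(
    send_heartbeat: Callable[[], str],
    cancel_job: Callable[[], None],
    interval_s: float = 30.0,
    retry_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    should_stop: Callable[[], bool] = lambda: False,
) -> str:
    """Hypothetical heartbeat loop; all names here are illustrative.

    send_heartbeat: posts one heartbeat and returns the job status string.
    cancel_job: interrupts the running job process.
    """
    while not should_stop():
        try:
            status = send_heartbeat()
        except ConnectionError:
            # Assumed stand-in for a transient failure to reach the API.
            # Keep retrying instead of giving up on the heartbeat.
            sleep(retry_delay_s)
            continue

        if status == "canceled":
            # The job is now held by another machine id (or was canceled),
            # so stop working on it and let the caller ask for another job.
            cancel_job()
            return "canceled"

        sleep(interval_s)
    return "stopped"
```

Injecting `sleep` and `send_heartbeat` keeps the sketch easy to exercise without a network: a fake sender that raises `ConnectionError` a few times and then returns "canceled" is enough to walk through both paths.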

@SaladTechnologies SaladTechnologies deleted a comment from github-actions bot Feb 18, 2025

🚀 Download the latest release candidate 🚀


@rxsalad rxsalad left a comment

Let's go!

@shawnrushefsky shawnrushefsky merged commit 6768597 into main Feb 18, 2025
1 check passed
@shawnrushefsky shawnrushefsky deleted the 0.5.1 branch February 18, 2025 22:23