Store timeline locations on storage controller (schema, insert, timeline creation and passthrough to the majority) #9011
Comments
github-merge-queue bot pushed a commit that referenced this issue on Dec 13, 2024
## Problem

We want to extract the safekeeper HTTP client into a separate crate for use in the storage controller and neon_local. However, many types used in the API are internal to safekeeper.

## Summary of changes

Move them to the safekeeper_api crate. No functional changes.

ref #9011
github-merge-queue bot pushed a commit that referenced this issue on Jan 17, 2025
Add an endpoint to obtain the utilization of a safekeeper. Future changes to the storage controller can use this endpoint to find the most suitable safekeepers for newly created timelines, analogously to how it's done for pageservers already. Initially we just want to assign by timeline count, then we can iterate from there. Part of #9011
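To illustrate the "assign by timeline count" idea from the commit message above, here is a minimal sketch. The `SafekeeperUtilization` struct, its fields, and `pick_least_loaded` are hypothetical names chosen for this example; the actual utilization response and the scheduler in the storage controller may look quite different.

```rust
/// Hypothetical utilization report for one safekeeper.
/// The real endpoint's response may carry different or additional fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SafekeeperUtilization {
    pub node_id: u64,
    pub timeline_count: u64,
}

/// Pick up to `n` safekeepers with the lowest timeline count.
/// Ties are broken by node id so the result is deterministic.
pub fn pick_least_loaded(reports: &[SafekeeperUtilization], n: usize) -> Vec<u64> {
    let mut sorted: Vec<&SafekeeperUtilization> = reports.iter().collect();
    sorted.sort_by_key(|r| (r.timeline_count, r.node_id));
    sorted.into_iter().take(n).map(|r| r.node_id).collect()
}
```

If fewer than `n` safekeepers report in, the function simply returns all of them; a real scheduler would presumably also have to filter out offline or decommissioned nodes before ranking.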
This was referenced Jan 18, 2025
github-merge-queue bot pushed a commit that referenced this issue on Feb 13, 2025
In #9011, we want to schedule timelines to safekeepers. In order to do such scheduling, we need information about how utilized a safekeeper is and if it's available or not. Therefore, send constant heartbeats to the safekeepers and try to figure out if they are online or not. Includes some code from #10440.
github-merge-queue bot pushed a commit that referenced this issue on Feb 18, 2025
…0863) Preparations for a successor of #10440:

* Move `pull_timeline` to `safekeeper_api` and add it to `SafekeeperClient`. We want to do `pull_timeline` on any creations that we couldn't do initially.
* Add a `SafekeeperGeneration` type instead of relying on a type alias. We now want to maintain a safekeeper-specific generation number in the storcon database. A separate type is important to make it impossible to mix it up with the tenant's pageserver-specific generation number. We absolutely want to avoid that for correctness reasons. If someone mixes up a safekeeper and pageserver id (both use the `NodeId` type), that's bad, but at least there are no wrong generations flying around.

Part of #9011
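The difference between a type alias and a dedicated type can be sketched as follows. This is an illustration only: the inner `u32`, the method names, `PageserverGeneration`, and `is_stale` are assumptions made for this example, not the definitions used in the repository.

```rust
/// A type alias adds no safety: it is interchangeable with any other `u32`.
pub type AliasedGeneration = u32;

/// Newtype wrapper (hypothetical shape): a distinct type the compiler
/// refuses to mix up with other generation numbers.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct SafekeeperGeneration(u32);

impl SafekeeperGeneration {
    pub const fn new(value: u32) -> Self {
        Self(value)
    }

    /// Returns the following generation.
    pub fn incremented(self) -> Self {
        Self(self.0 + 1)
    }

    pub fn into_inner(self) -> u32 {
        self.0
    }
}

/// Hypothetical stand-in for the pageserver-specific generation.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct PageserverGeneration(u32);

/// Only accepts safekeeper generations. Passing a `PageserverGeneration`
/// here is a compile-time error rather than a silent correctness bug.
pub fn is_stale(current: SafekeeperGeneration, desired: SafekeeperGeneration) -> bool {
    current < desired
}
```

With the alias, a call such as `is_stale(pageserver_gen, safekeeper_gen)` would compile because both arguments are plain integers; with the newtype it is rejected at compile time.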
github-merge-queue bot pushed a commit that referenced this issue on Feb 19, 2025
This PR does the following things:

* The initial heartbeat round blocks the storage controller from becoming online again. If all safekeepers are unresponsive, this can cause storage controller startup to be very slow. The original intent of #10583 was that heartbeats don't affect normal functionality of the storage controller, so add a short timeout to prevent them from impeding storcon functionality.
* Fix the URL of the utilization endpoint.
* Don't send heartbeats to safekeepers which are decommissioned.

Part of #9011

context: https://neondb.slack.com/archives/C033RQ5SPDH/p1739966807592589
github-merge-queue bot pushed a commit that referenced this issue on Feb 19, 2025
Doing this to help debugging offline safekeepers. Part of #9011
This was referenced Feb 20, 2025
github-merge-queue bot pushed a commit that referenced this issue on Feb 21, 2025
Return an empty JSON response in the `scheduling_policy` handler. This prevents errors of the form:

```
Error: receive body: error decoding response body: EOF while parsing a value at line 1 column 0
```

when setting the scheduling policy via the `storcon_cli`.

Part of #9011.
github-merge-queue bot pushed a commit that referenced this issue on Feb 21, 2025
Safekeepers only respond to requests with the per-token scope, or the `safekeeperdata` JWT scope. Therefore, add infrastructure in the storage controller for safekeeper JWTs. Also, rename the ambiguous `jwt_token` to `pageserver_jwt_token`. Part of #9011 Related: neondatabase/cloud#24727
Tasks
There is a subtlety in handling timeline creations and deletions on safekeepers. We don't want to block the operation if one safekeeper is down, but we also don't want uncreated or undeleted timelines left behind. So it seems we should track these in tables like `sk_pending_timeline_creations` and `sk_pending_timeline_deletions`, and have a background task working on them.
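The pending-operations idea can be sketched with an in-memory model. Everything here is hypothetical: `PendingOp`, `PendingOps`, and `reconcile_once` are names invented for this example, and a single set stands in for the two proposed database tables. The point is only to show the shape of the approach: record the operations that could not be applied, then let a background pass retry them until they succeed.

```rust
use std::collections::BTreeSet;

/// Whether the outstanding operation is a creation or a deletion.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum PendingKind {
    Create,
    Delete,
}

/// One row of a (hypothetical) pending-operations table.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct PendingOp {
    pub safekeeper_id: u64,
    pub timeline_id: String,
    pub kind: PendingKind,
}

/// In-memory stand-in for the pending creation/deletion tables.
#[derive(Debug, Default)]
pub struct PendingOps {
    ops: BTreeSet<PendingOp>,
}

impl PendingOps {
    /// Record an operation that could not be applied right away.
    pub fn record(&mut self, op: PendingOp) {
        self.ops.insert(op);
    }

    pub fn len(&self) -> usize {
        self.ops.len()
    }

    pub fn is_empty(&self) -> bool {
        self.ops.is_empty()
    }

    /// One pass of the background task: try every pending operation and
    /// keep only those that still fail. Returns how many completed.
    pub fn reconcile_once<F>(&mut self, mut try_apply: F) -> usize
    where
        F: FnMut(&PendingOp) -> bool,
    {
        let before = self.ops.len();
        self.ops.retain(|op| !try_apply(op));
        before - self.ops.len()
    }
}
```

The foreground request would call `record` for each safekeeper that did not respond and return once a majority has succeeded; a periodic task would then call `reconcile_once` with a closure that performs the actual safekeeper request. In a real implementation the set would be persisted so that pending work survives a storage controller restart.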