Store timeline locations on storage controller (schema, insert, timeline creation and passthrough to the majority) #9011

Open
9 of 19 tasks
jcsp opened this issue Sep 16, 2024 · 0 comments
jcsp commented Sep 16, 2024


There is a subtlety in handling timeline creations and deletions on safekeepers. We don't want to block the operation if one safekeeper is down, but we also don't want to leave uncreated or undeleted timelines behind. So it seems we should track these in tables like sk_pending_timeline_creations and sk_pending_timeline_deletions, and have a background task working through them.
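
As a rough illustration of the idea above, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: the `PendingRow` layout, the function names, the majority rule for success, and the `online` set standing in for real safekeeper HTTP calls. It is not the storage controller's actual schema or code.

```rust
use std::collections::{BTreeSet, VecDeque};

/// Hypothetical kinds of per-safekeeper work that may be left over.
#[derive(Debug, Clone, PartialEq, Eq)]
enum PendingOp {
    Create,
    Delete,
}

/// One row of a hypothetical pending-operations table.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PendingRow {
    safekeeper_id: u64,
    timeline_id: String,
    op: PendingOp,
}

/// Try `op` on every safekeeper. The call succeeds once a majority
/// acknowledges (an assumed rule), and returns the rows that a
/// background task still has to retry.
fn apply_with_quorum(
    safekeepers: &[u64],
    timeline_id: &str,
    op: PendingOp,
    online: &BTreeSet<u64>, // stand-in for real requests to safekeepers
) -> Result<Vec<PendingRow>, String> {
    let mut pending = Vec::new();
    let mut acked = 0;
    for &sk in safekeepers {
        if online.contains(&sk) {
            acked += 1;
        } else {
            pending.push(PendingRow {
                safekeeper_id: sk,
                timeline_id: timeline_id.to_string(),
                op: op.clone(),
            });
        }
    }
    let quorum = safekeepers.len() / 2 + 1;
    if acked >= quorum {
        Ok(pending)
    } else {
        Err(format!(
            "only {} of {} safekeepers acknowledged",
            acked,
            safekeepers.len()
        ))
    }
}

/// One pass of the background task: retry every row and keep only
/// the rows whose safekeeper is still unreachable.
fn reconcile_once(queue: &mut VecDeque<PendingRow>, online: &BTreeSet<u64>) {
    queue.retain(|row| !online.contains(&row.safekeeper_id));
}
```

With three safekeepers and one of them down, `apply_with_quorum` returns `Ok` with a single pending row, and a later `reconcile_once` pass drops that row once the safekeeper is reachable again.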

@arssher arssher changed the title Store timeline locations on storage controller (schema, insert, timeline creation passthrough to the majority) Store timeline locations on storage controller (schema, insert, timeline creation and passthrough to the majority) Sep 30, 2024
@arssher arssher self-assigned this Dec 2, 2024
github-merge-queue bot pushed a commit that referenced this issue Dec 13, 2024
## Problem

We want to extract safekeeper http client to separate crate for use in
storage controller and neon_local. However, many types used in the API
are internal to safekeeper.

## Summary of changes

Move them to safekeeper_api crate. No functional changes.

ref #9011
@jcsp jcsp assigned arpad-m and unassigned koivunej Jan 6, 2025
github-merge-queue bot pushed a commit that referenced this issue Jan 17, 2025
Add an endpoint to obtain the utilization of a safekeeper. Future
changes to the storage controller can use this endpoint to find the most
suitable safekeepers for newly created timelines, analogously to how
it's done for pageservers already.

Initially we just want to assign by timeline count, then we can iterate
from there.

Part of #9011
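
A minimal sketch of what "assign by timeline count" could look like. The `SafekeeperUtilization` struct, its fields, and `pick_safekeepers` are hypothetical names chosen for illustration; the real utilization response may carry different data.

```rust
/// Hypothetical utilization snapshot; only the timeline count is used here.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SafekeeperUtilization {
    node_id: u64,
    timeline_count: u64,
}

/// Return the ids of the `n` safekeepers with the fewest timelines,
/// or `None` if fewer than `n` candidates are available.
fn pick_safekeepers(candidates: &[SafekeeperUtilization], n: usize) -> Option<Vec<u64>> {
    if candidates.len() < n {
        return None;
    }
    let mut sorted: Vec<&SafekeeperUtilization> = candidates.iter().collect();
    // Sort by timeline count, breaking ties by node id so the result is deterministic.
    sorted.sort_by_key(|u| (u.timeline_count, u.node_id));
    Some(sorted.into_iter().take(n).map(|u| u.node_id).collect())
}
```

Sorting on a single scalar keeps the first iteration simple; other utilization signals could later be folded into the sort key.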
github-merge-queue bot pushed a commit that referenced this issue Jan 22, 2025
Add APIs for timeline creation and deletion to the safekeeper client
crate. Going to be used later in #10440.

Split off from #10440.

Part of #9011
github-merge-queue bot pushed a commit that referenced this issue Feb 13, 2025
In #9011, we want to schedule timelines to safekeepers. In order to do
such scheduling, we need information about how utilized a safekeeper is
and if it's available or not.

Therefore, send constant heartbeats to the safekeepers and try to figure
out if they are online or not.

Includes some code from #10440.
github-merge-queue bot pushed a commit that referenced this issue Feb 13, 2025
There was a typo in the name of the utilization endpoint URL, fix it.
Also, ensure that the heartbeat mechanism actually works.

Related: #10583, #10429

Part of #9011
github-merge-queue bot pushed a commit that referenced this issue Feb 18, 2025
…0863)

Preparations for a successor of #10440: 

* Move `pull_timeline` to `safekeeper_api` and add it to
`SafekeeperClient`. We want to run `pull_timeline` for any creations that
we couldn't do initially.
* Add a `SafekeeperGeneration` type instead of relying on a type alias.
We now want to maintain a safekeeper-specific generation number in the
storcon database. A separate type makes it impossible to mix it up with
the tenant's pageserver-specific generation number, which we absolutely
want to avoid for correctness reasons. If someone mixes up a safekeeper
id and a pageserver id (both use the `NodeId` type), that's bad, but at
least no wrong generations end up flying around.

part of #9011
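
To illustrate why a separate type helps where a type alias would not, here is a small sketch. Only the name `SafekeeperGeneration` comes from the commit message; the inner `u32`, the methods, `PageserverGeneration`, and `is_stale` are assumptions for illustration.

```rust
/// Newtype for the safekeeper-specific generation number.
/// The wrapped integer type and the methods below are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct SafekeeperGeneration(u32);

/// Hypothetical stand-in for the pageserver-specific generation.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct PageserverGeneration(u32);

impl SafekeeperGeneration {
    fn new(value: u32) -> Self {
        Self(value)
    }

    /// The generation that follows this one.
    fn next(self) -> Self {
        Self(self.0 + 1)
    }

    fn into_inner(self) -> u32 {
        self.0
    }
}

/// True if `incoming` is older than what is already stored.
fn is_stale(stored: SafekeeperGeneration, incoming: SafekeeperGeneration) -> bool {
    incoming < stored
}

// With `type SafekeeperGeneration = u32;` the call below would compile
// silently. With the newtypes it is rejected by the compiler:
//
//     is_stale(SafekeeperGeneration::new(2), PageserverGeneration(1));
```

The cost is a little conversion boilerplate at the database boundary, in exchange for the compiler catching mixed-up generations.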
github-merge-queue bot pushed a commit that referenced this issue Feb 19, 2025
This PR does the following things:

* The initial heartbeat round blocks the storage controller from
becoming online again. If all safekeepers are unresponsive, this can
cause storage controller startup to be very slow. The original intent of
#10583 was that heartbeats don't affect normal functionality of the
storage controller. So add a short timeout to prevent it from impeding
storcon functionality.

* Fix the URL of the utilization endpoint.

* Don't send heartbeats to safekeepers which are decommissioned.

Part of #9011

context: https://neondb.slack.com/archives/C033RQ5SPDH/p1739966807592589
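
The short-timeout idea from the first bullet above can be sketched with plain threads and a channel. This is only an illustration under stated assumptions: `heartbeat_round` and the `probe` callback are hypothetical, and the real storage controller is asynchronous and would not be structured this way.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Probe every node on its own thread and return the ids that answered
/// positively within `timeout`. Nodes that have not answered by the
/// deadline are simply treated as offline for this round, so a slow or
/// dead node cannot delay the caller beyond `timeout`.
fn heartbeat_round<F>(nodes: &[u64], timeout: Duration, probe: F) -> Vec<u64>
where
    F: Fn(u64) -> bool + Send + Clone + 'static,
{
    let (tx, rx) = mpsc::channel();
    for &node in nodes {
        let tx = tx.clone();
        let probe = probe.clone();
        thread::spawn(move || {
            let ok = probe(node);
            // The receiver may already be gone if the deadline passed.
            let _ = tx.send((node, ok));
        });
    }
    // Drop the original sender so the channel closes once all probes finish.
    drop(tx);

    let deadline = Instant::now() + timeout;
    let mut online = Vec::new();
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok((node, true)) => online.push(node),
            Ok((_, false)) => {}
            // Either the deadline passed or every probe has reported.
            Err(_) => break,
        }
    }
    online.sort();
    online
}
```

Here the round returns as soon as every probe has reported or the deadline passes, whichever comes first; late answers are ignored until the next round.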
github-merge-queue bot pushed a commit that referenced this issue Feb 19, 2025
Doing this to help debugging offline safekeepers.

Part of #9011
github-merge-queue bot pushed a commit that referenced this issue Feb 21, 2025
Return an empty json response in the `scheduling_policy` handler.

This prevents errors of the form:

```
Error: receive body: error decoding response body: EOF while parsing a value at line 1 column 0
```

when setting the scheduling policy via the `storcon_cli`.

part of #9011.
github-merge-queue bot pushed a commit that referenced this issue Feb 21, 2025
Safekeepers only respond to requests with the per-token scope, or the
`safekeeperdata` JWT scope. Therefore, add infrastructure in the storage
controller for safekeeper JWTs. Also, rename the ambiguous `jwt_token`
to `pageserver_jwt_token`.

Part of #9011
Related: neondatabase/cloud#24727