Store timeline locations on storage controller (schema, insert, timeline creation and passthrough to the majority) #9011

jcsp · 2024-09-16T13:34:56Z

Tasks

add timelines schema per the rfc
timeline_create: accept cplane selected safekeepers in schema, but fail if set (via feature)
timeline_create: return storcon selected safekeepers in schema (via feature)
Extract public sk types to safekeeper_api #10137
Extract safekeeper http client to separate crate. #10140
create timelines on safekeepers (via feature)
timeline delete endpoint should delete timeline from storcon db and safekeepers
neon_local support
Add safekeeper utilization endpoint #10429
Expose safekeeper APIs for creation and deletion #10478
storcon: timeline table, deletion and creation #10440
Fix utilization URL and ensure heartbeats work #10811
move pull_timeline to safekeeper_api and add SafekeeperGeneration #10863
storcon: sk heartbeat fixes #10891
storcon: use the SchedulingPolicy enum in SafekeeperPersistence #10897
Add pg_lsn postgres data type diesel-rs/diesel#4499
storcon: log all safekeepers marked as offline #10898
https://github.com/neondatabase/cloud/issues/24727
storcon: infrastructure for safekeeper specific JWT tokens #10905
There is a subtlety in handling timeline creations and deletions on safekeepers. We don't want to block the operation if one safekeeper is down, but we also don't want to leave uncreated or undeleted timelines behind. So it seems we should track these in tables like sk_pending_timeline_creations and sk_pending_timeline_deletions, and have a background task work through them.
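A minimal in-memory sketch of that idea, for illustration only. The names `PendingOp`, `PendingOps` and `reconcile_once` are hypothetical and not taken from the storcon code; a real implementation would persist the rows in the database tables named above rather than in a `VecDeque`.

```rust
use std::collections::VecDeque;

/// Hypothetical kind of operation still owed to a safekeeper.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PendingOpKind {
    Create,
    Delete,
}

/// Hypothetical row of a pending-operations table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PendingOp {
    pub safekeeper_id: u64,
    pub timeline_id: String,
    pub kind: PendingOpKind,
}

/// In-memory stand-in for the pending creation/deletion tables.
#[derive(Default)]
pub struct PendingOps {
    queue: VecDeque<PendingOp>,
}

impl PendingOps {
    pub fn push(&mut self, op: PendingOp) {
        self.queue.push_back(op);
    }

    pub fn len(&self) -> usize {
        self.queue.len()
    }

    pub fn is_empty(&self) -> bool {
        self.queue.is_empty()
    }

    /// One pass of the background task: try every queued op exactly once.
    /// Ops that fail (e.g. the safekeeper is down) are re-queued for a later pass.
    /// Returns the number of ops that completed in this pass.
    pub fn reconcile_once<F>(&mut self, mut apply: F) -> usize
    where
        F: FnMut(&PendingOp) -> Result<(), String>,
    {
        let mut done = 0;
        for _ in 0..self.queue.len() {
            if let Some(op) = self.queue.pop_front() {
                match apply(&op) {
                    Ok(()) => done += 1,
                    Err(_) => self.queue.push_back(op),
                }
            }
        }
        done
    }
}
```

The user-facing create or delete can return once a majority has acknowledged, while the remaining safekeepers stay in the queue until a later `reconcile_once` pass succeeds.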

## Problem

We want to extract the safekeeper http client to a separate crate for use in storage controller and neon_local. However, many types used in the API are internal to safekeeper.

## Summary of changes

Move them to the safekeeper_api crate. No functional changes.

ref #9011

Add an endpoint to obtain the utilization of a safekeeper. Future changes to the storage controller can use this endpoint to find the most suitable safekeepers for newly created timelines, analogously to how it's done for pageservers already. Initially we just want to assign by timeline count, then we can iterate from there. Part of #9011
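As a rough sketch of what "assign by timeline count" could look like: the struct fields and the `pick_least_loaded` function below are assumptions made for illustration, not the actual response type of the utilization endpoint or the storage controller's scheduler.

```rust
/// Hypothetical per-safekeeper view built from the utilization endpoint
/// plus availability information.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SafekeeperUtilization {
    pub node_id: u64,
    pub timeline_count: u64,
    pub available: bool,
}

/// Pick `want` available safekeepers with the fewest timelines.
/// Ties are broken by node id so the result is deterministic.
/// Returns `None` if fewer than `want` safekeepers are available.
pub fn pick_least_loaded(candidates: &[SafekeeperUtilization], want: usize) -> Option<Vec<u64>> {
    let mut usable: Vec<&SafekeeperUtilization> =
        candidates.iter().filter(|s| s.available).collect();
    if usable.len() < want {
        return None;
    }
    usable.sort_by_key(|s| (s.timeline_count, s.node_id));
    Some(usable.iter().take(want).map(|s| s.node_id).collect())
}
```

Sorting by a single counter keeps the first iteration simple; other signals could later be folded into the sort key without changing the callers.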

Add APIs for timeline creation and deletion to the safekeeper client crate. Going to be used later in #10440. Split off from #10440. Part of #9011

In #9011, we want to schedule timelines to safekeepers. In order to do such scheduling, we need information about how utilized a safekeeper is and if it's available or not. Therefore, send constant heartbeats to the safekeepers and try to figure out if they are online or not. Includes some code from #10440.

There was a typo in the name of the utilization endpoint URL, fix it. Also, ensure that the heartbeat mechanism actually works. Related: #10583, #10429 Part of #9011

…0863) Preparations for a successor of #10440:

* Move `pull_timeline` to `safekeeper_api` and add it to `SafekeeperClient`. We want to do `pull_timeline` on any creations that we couldn't do initially.
* Add a `SafekeeperGeneration` type instead of relying on a type alias. We now want to maintain a safekeeper-specific generation number in the storcon database. A separate type is important to make it impossible to mix it up with the tenant's pageserver-specific generation number. We absolutely want to avoid that for correctness reasons. If someone mixes up a safekeeper and pageserver id (both use the `NodeId` type), that's bad, but at least no wrong generations are flying around.

Part of #9011
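To illustrate why a separate type helps more than a type alias, here is a small newtype sketch. It is not the definition from the PR: the inner `u32`, the method names and the `PageserverGeneration` stand-in are all assumptions.

```rust
/// Sketch of a safekeeper-specific generation number as a newtype.
/// The inner integer width is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct SafekeeperGeneration(u32);

impl SafekeeperGeneration {
    pub const fn new(value: u32) -> Self {
        Self(value)
    }

    /// The generation that follows this one.
    pub fn next_generation(self) -> Self {
        Self(self.0 + 1)
    }

    pub fn into_inner(self) -> u32 {
        self.0
    }
}

/// Hypothetical stand-in for the tenant's pageserver generation.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct PageserverGeneration(u32);

impl PageserverGeneration {
    pub const fn new(value: u32) -> Self {
        Self(value)
    }
}

/// Accepts only safekeeper generations. Passing a `PageserverGeneration`
/// here is a compile error, which a plain `type X = u32` alias would not give us.
pub fn is_stale(seen: SafekeeperGeneration, current: SafekeeperGeneration) -> bool {
    seen < current
}
```

With an alias, both generations would be the same underlying integer type and could be swapped silently; with the newtype, the mistake is rejected at compile time.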

This PR does the following things:

* The initial heartbeat round blocks the storage controller from becoming online again. If all safekeepers are unresponsive, this can cause storage controller startup to be very slow. The original intent of #10583 was that heartbeats don't affect normal functionality of the storage controller. So add a short timeout to prevent it from impeding storcon functionality.
* Fix the URL of the utilization endpoint.
* Don't send heartbeats to safekeepers which are decommissioned.

Part of #9011

context: https://neondb.slack.com/archives/C033RQ5SPDH/p1739966807592589
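The idea of bounding the initial heartbeat round can be sketched with plain threads and a channel. This is an illustration only: the function name, the `probe` callback and the thread-per-node approach are assumptions, and the real storage controller is async and structured differently.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Probe every node in parallel, but wait at most `deadline` overall.
/// Nodes that have not answered by then are reported as offline (`false`),
/// so a hung safekeeper cannot delay startup indefinitely.
pub fn initial_heartbeat_round<F>(node_ids: &[u64], probe: F, deadline: Duration) -> Vec<(u64, bool)>
where
    F: Fn(u64) -> bool + Send + Clone + 'static,
{
    let (tx, rx) = mpsc::channel();
    for &id in node_ids {
        let tx = tx.clone();
        let probe = probe.clone();
        // Detached worker: if it outlives the deadline, its result is dropped.
        thread::spawn(move || {
            let ok = probe(id);
            let _ = tx.send((id, ok));
        });
    }
    drop(tx);

    let start = Instant::now();
    // Everyone starts out as offline until a response arrives.
    let mut status: Vec<(u64, bool)> = node_ids.iter().map(|&id| (id, false)).collect();
    let mut remaining = node_ids.len();
    while remaining > 0 {
        let left = deadline.saturating_sub(start.elapsed());
        match rx.recv_timeout(left) {
            Ok((id, ok)) => {
                if let Some(entry) = status.iter_mut().find(|entry| entry.0 == id) {
                    entry.1 = ok;
                }
                remaining -= 1;
            }
            // Timed out or all workers are gone: stop waiting.
            Err(_) => break,
        }
    }
    status
}
```

Later heartbeat rounds can still flip a node back to online; the timeout only bounds how long startup waits for the first answers.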

Doing this to help debugging offline safekeepers. Part of #9011

Return an empty json response in the `scheduling_policy` handler. This prevents errors of the form:

```
Error: receive body: error decoding response body: EOF while parsing a value at line 1 column 0
```

when setting the scheduling policy via the `storcon_cli`.

Part of #9011.

Safekeepers only respond to requests with the per-token scope, or the `safekeeperdata` JWT scope. Therefore, add infrastructure in the storage controller for safekeeper JWTs. Also, rename the ambiguous `jwt_token` to `pageserver_jwt_token`. Part of #9011 Related: neondatabase/cloud#24727

jcsp mentioned this issue Sep 16, 2024

Epic: safekeepers dynamic membership change (Phase 2) #8614

Open

jcsp assigned koivunej Sep 16, 2024

arssher changed the title ~~Store timeline locations on storage controller (schema, insert, timeline creation passthrough to the majority)~~ Store timeline locations on storage controller (schema, insert, timeline creation and passthrough to the majority) Sep 30, 2024

arssher self-assigned this Dec 2, 2024

arssher mentioned this issue Dec 13, 2024

Extract public sk types to safekeeper_api #10137

Merged

jcsp assigned arpad-m and unassigned koivunej Jan 6, 2025

arpad-m mentioned this issue Jan 16, 2025

Add safekeeper utilization endpoint #10429

Merged

This was referenced Jan 18, 2025

storcon: timeline table, deletion and creation #10440

Draft

Expose safekeeper APIs for creation and deletion #10478

Merged

github-merge-queue bot pushed a commit that referenced this issue Jan 22, 2025

Expose safekeeper APIs for creation and deletion (#10478)

c60b913

arpad-m mentioned this issue Jan 30, 2025

storcon: track safekeepers in memory, send heartbeats to them #10583

Merged

arpad-m mentioned this issue Feb 13, 2025

Fix utilization URL and ensure heartbeats work #10811

Merged

arpad-m mentioned this issue Feb 18, 2025

move pull_timeline to safekeeper_api and add SafekeeperGeneration #10863

Merged

arpad-m mentioned this issue Feb 19, 2025

storcon: sk heartbeat fixes #10891

Merged

arpad-m mentioned this issue Feb 19, 2025

storcon: log all safekeepers marked as offline #10898

Merged

github-merge-queue bot pushed a commit that referenced this issue Feb 19, 2025

storcon: log all safekeepers marked as offline (#10898)

787b98f

This was referenced Feb 20, 2025

Return a json response in scheduling_policy handler #10904

Merged

storcon: infrastructure for safekeeper specific JWT tokens #10905

Merged
