S3 proxy #11357
base: main
Conversation
8063 tests run: 7480 passed, 189 failed, 394 skipped (full report)

Failures on Postgres 17
Failures on Postgres 16
Failures on Postgres 15
Failures on Postgres 14

Flaky tests (14)
Postgres 17
Postgres 16
Postgres 15
Postgres 14

Test coverage report is not available.
The comment gets automatically updated with the latest test results.
cd2a64e at 2025-04-03T17:52:48.479Z :recycle:
Force-pushed 0f0e654 to a5cdbd8
I think a lot of the feedback is around the fact that this doesn't use the http-utils stuff. Would be nice to keep to the "standard" if possible.
> Unrelated changes: gate a rename_noreplace feature and disable it in remote_storage so that s3proxy can be built with musl

Can we do a separate PR?
s3proxy/src/main.rs (Outdated)

```diff
@@ -0,0 +1,650 @@
+use anyhow::{Context, Result, anyhow, bail};
+use axum::body::{Body, Bytes};
```
Why axum specifically? Storage services use libs/http-utils/src/server.rs. Generally, the http-utils crate can do a lot of the boilerplate for you.

Let's take the example of the storage controller:

- This is the set-up code (neon/storage_controller/src/main.rs, lines 448 to 498 in 4bb7087):

```rust
let auth = secrets
    .public_key
    .map(|jwt_auth| Arc::new(SwappableJwtAuth::new(jwt_auth)));

let router = make_router(service.clone(), auth, build_info)
    .build()
    .map_err(|err| anyhow!(err))?;
let http_service =
    Arc::new(http_utils::RequestServiceBuilder::new(router).map_err(|err| anyhow!(err))?);

let api_shutdown = CancellationToken::new();

// Start HTTP server
let http_server_task: OptionFuture<_> = match args.listen {
    Some(http_addr) => {
        let http_listener = tcp_listener::bind(http_addr)?;
        let http_server =
            http_utils::server::Server::new(Arc::clone(&http_service), http_listener, None)?;
        tracing::info!("Serving HTTP on {}", http_addr);
        Some(tokio::task::spawn(http_server.serve(api_shutdown.clone())))
    }
    None => None,
}
.into();

// Start HTTPS server
let https_server_task: OptionFuture<_> = match args.listen_https {
    Some(https_addr) => {
        let https_listener = tcp_listener::bind(https_addr)?;
        let resolver = ReloadingCertificateResolver::new(
            &args.ssl_key_file,
            &args.ssl_cert_file,
            *args.ssl_cert_reload_period,
        )
        .await?;

        let server_config = rustls::ServerConfig::builder()
            .with_no_client_auth()
            .with_cert_resolver(resolver);
        let tls_acceptor = tokio_rustls::TlsAcceptor::from(Arc::new(server_config));

        let https_server =
            http_utils::server::Server::new(http_service, https_listener, Some(tls_acceptor))?;
        tracing::info!("Serving HTTPS on {}", https_addr);
        Some(tokio::task::spawn(https_server.serve(api_shutdown.clone())))
    }
    None => None,
}
.into();
```

- This is where the handlers are defined (neon/storage_controller/src/http.rs, line 1836 in 4bb7087):

```rust
pub fn make_router(
```
I'm not necessarily opposed to axum, but it's nice to have things homogeneous.
We do have services in axum, e.g. the compute_ctl API, so, as a compute service, I'd lean toward axum as well.
The argument for http-utils is homogeneity. What's the argument for axum?
axum is hyper 1.0, while http-utils is hyper 0.x, though Dmitrii said it should be fairly easy to upgrade http-utils to hyper 1.0 once we are there.
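For illustration, a minimal sketch of what an axum (hyper 1.0) service entrypoint looks like; the route and port here are hypothetical, not taken from this PR:

```rust
use axum::{Router, routing::get};

#[tokio::main]
async fn main() {
    // Hypothetical health-check route; the real s3proxy routes differ.
    let app = Router::new().route("/health", get(|| async { "ok" }));

    // axum 0.7+ serves on top of hyper 1.0 via tokio's TcpListener.
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```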
s3proxy/src/main.rs (Outdated)

```rust
// TODO what if multiple instances try to write a file?
// random delay before writing? Instances will be ~3 per region + dedicated for computes
```
Do we actually need many instances of this per region? If so, why?
If load is a concern, we could have some simple static sharding to avoid the consistency problems around multiple writers.
Probably because of link saturation. I'd like ~3 instances (DaemonSet?) per region, but I don't have a strong opinion.
If we want multiple instances, how do we deal with consistency? I proposed a naive sharding approach, but I'm curious to hear your thoughts.
Instances are stateless, so what are your concerns with consistency?
> Do we actually need many instances of this per region? If so, why?

I was thinking about starting with 3 instances right away for the purpose of zero-downtime restarts and releases, i.e. doing a rolling restart of the ReplicaSet, giving the LB/ingress time to discover/deregister targets, so that no requests hit a 503.

> If we want multiple instances, how do we deal with consistency? I proposed a naive sharding approach, but I'm curious to hear your thoughts.

What's the naive sharding approach? I'm also curious to hear your ideas. Re consistency, my mental model was kind of 'the last request wins'; is that how the S3 API works? So be 'dumb' in this sense and just have the same concurrency handling as the S3 API does. Is that bad?

In reality, we don't expect concurrent requests, though, as the consumers are compute_ctl, which will only do a single request to the same object at a time, and cplane, which will only do DELETE requests. So I guess the only guarantee we need here is that unexpected concurrent requests are handled in an expected way, and 'the last request wins' would work, but maybe I'm missing something.

> I'd like ~3 instances (DaemonSet?)

A DaemonSet is a very specific thing: a deployment that starts a pod on every k8s node. Here it's more like a generic ReplicaSet. Commenting just to avoid someone else's confusion.
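For the record, a minimal sketch of the naive static-sharding idea being discussed: deterministically route each key to one instance so a given object only ever has one writer. The function name and key format are hypothetical, not from this PR:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical static sharding: map a key to one of `num_instances` proxy
/// instances, so a given object only ever has a single writer.
/// (A real deployment would want a hash that is stable across binaries.)
fn shard_for_key(key: &str, num_instances: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % num_instances
}

fn main() {
    // e.g. ~3 instances per region, as discussed above
    assert!(shard_for_key("tenant-123/lfc-state", 3) < 3);
}
```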
Please no ReplicaSet directly :-) use Deployments instead
> Please no ReplicaSet directly :-) use Deployments instead

Yeah, sure thing.
Force-pushed 397075d to 1b335ba
Possibly related: #11363
Force-pushed 1b335ba to 5ac1d4a
FWIW, I think this PR should not update the AWS SDK or the BehaviourVersion. Both changes are things we should do sooner or later, but they both seem orthogonal to what this PR tries to achieve, so they should probably get their own PRs. Also, note that an earlier attempt to update the AWS SDK caused issues in staging which we haven't found a fix for, so such an upgrade is sadly non-trivial (as you need to find a fix for those issues). See: https://github.com/neondatabase/cloud/issues/24208
Yes, that's completely reasonable, so I've reverted the changes. The current bugs seem to be tied to the current SDK.
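For context, a hedged sketch of where the BehaviourVersion shows up in aws-config setup; pinning an explicit version rather than latest() is one way to keep such upgrades deliberate. This is illustrative, not code from this PR:

```rust
use aws_config::BehaviorVersion;

#[tokio::main]
async fn main() {
    // Pinning an explicit behavior version keeps SDK behavior stable across
    // crate upgrades; BehaviorVersion::latest() would opt into new defaults.
    let config = aws_config::defaults(BehaviorVersion::v2023_11_09())
        .load()
        .await;
    let _s3 = aws_sdk_s3::Client::new(&config);
}
```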
Force-pushed eee701e to f4bc6ea
Force-pushed 0b654f6 to ba87a75
Scripts for testing s3proxy with real S3 (minio):

```sh
#!/bin/sh
mkdir -p ${HOME}/minio/data
docker run \
  -p 9000:9000 \
  -p 9001:9001 \
  --user $(id -u):$(id -g) \
  --name minio1 \
  -e "MINIO_ROOT_USER=ROOTUSER" \
  -e "MINIO_ROOT_PASSWORD=CHANGEME123" \
  -v ${HOME}/minio/data:/data \
  quay.io/minio/minio server /data --console-address ":9001"
```
Launch proxy:

```sh
#!/bin/sh
export AWS_ACCESS_KEY_ID=ROOTUSER
export AWS_SECRET_ACCESS_KEY=CHANGEME123
export AWS_ENDPOINT_URL=http://172.17.0.2:9000
export RUST_LOG=debug
export REMOTE_STORAGE_S3_BUCKET=bucket
export REMOTE_STORAGE_S3_REGION="us-west-1"
export ENABLE_REAL_S3_REMOTE_STORAGE=1
#export RUST_BACKTRACE=1
cd ~/neon/s3proxy
cargo t --release
```
Force-pushed 9f2f6c8 to ae2481c
Force-pushed fe733b6 to a3c5900
Force-pushed a3c5900 to 6c248bc
Service targeted at storing and retrieving LFC prewarm data. It can also be used for proxying S3 access for Postgres extensions like pg_mooncake.

Requests must include a Bearer JWT token. The token is validated using a PEM file (which should be passed in infra/).

Note: the app is not tolerant of extra trailing slashes; see the delete_prefix test in app.rs for comments.

Resolves: https://github.com/neondatabase/cloud/issues/26342

Unrelated changes: gate a rename_noreplace feature and disable it in remote_storage so that s3proxy can be built with musl.
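For illustration, a minimal sketch of what bearer-token validation against a PEM public key can look like. The jsonwebtoken crate, the EdDSA algorithm, and the claim shape are assumptions here, not necessarily what this PR ships:

```rust
use jsonwebtoken::{Algorithm, DecodingKey, Validation, decode};
use serde::Deserialize;

// Hypothetical claims; the real s3proxy token contents may differ.
#[derive(Deserialize)]
struct Claims {
    exp: usize,
}

fn validate_bearer(header: &str, pem: &[u8]) -> anyhow::Result<Claims> {
    // Strip the "Bearer " prefix from the Authorization header value.
    let token = header
        .strip_prefix("Bearer ")
        .ok_or_else(|| anyhow::anyhow!("missing Bearer prefix"))?;
    // Assumes an EdDSA public key in PEM form; the actual key type may differ.
    let key = DecodingKey::from_ed_pem(pem)?;
    let data = decode::<Claims>(token, &key, &Validation::new(Algorithm::EdDSA))?;
    Ok(data.claims)
}
```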