Add OpenTelemetry Phase 2 Design: Comprehensive Tracing, Logging, and Enhanced Metrics #196
base: main
Conversation
You might find this useful: https://blog.ferretdb.io/otel-context-propagation-in-ferretdb/
## Background

Phase 1 successfully implemented basic metrics for monitoring cluster availability and traffic shifting during failover events. While these metrics provide valuable operational insights, they represent only one pillar of observability.
but the gw works as a thin layer between the CP and pg and is unaware of any failover events, since those are controlled by the CP, not the gw
We propose comprehensive OpenTelemetry instrumentation that extends our foundation with:
1. **Distributed Tracing**: Trace the lifecycle of requests as they flow through the gateway and to backend clusters
the gw is unaware of the cluster layout; the only assumption it makes is that there is a local pg instance
- Initial request reception
- Authentication and authorization
- Request parsing and validation
- Backend selection (primary vs secondary)
the gw is unaware of primary or secondary instances and works with the backend as the local postgres instance
- Authentication and authorization
- Request parsing and validation
- Backend selection (primary vs secondary)
- Query transformation
we're not performing any query transformation; at most we form a payload for pg, poll the result, and transform the pg response into a mongo-compatible response
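To make the discussion concrete, here is a minimal tracing sketch for the phases the reviewer describes (build a pg payload, execute it against the local postgres instance, map the response back to a Mongo-compatible reply); the tracer name, span names, and phase boundaries are illustrative, not the design's final API:

```rust
use opentelemetry::global;
use opentelemetry::trace::{Span, TraceContextExt, Tracer};
use opentelemetry::Context;

// Illustrative span layout for a single request, assuming the global tracer
// provider has already been initialized elsewhere.
fn trace_request_phases() {
    let tracer = global::tracer("docdb_gateway");

    // Parent span covering the whole request.
    let request_span = tracer.start("gateway.handle_request");
    let cx = Context::current_with_span(request_span);

    // Child span: translate the incoming wire request into a pg payload.
    let mut build_span = tracer.start_with_context("gateway.build_pg_payload", &cx);
    // ... build the payload here ...
    build_span.end();

    // Child span: execute the payload against the local postgres instance.
    let mut pg_span = tracer.start_with_context("gateway.execute_pg", &cx);
    // ... run the query and poll the result here ...
    pg_span.end();

    // Child span: map the pg response to a Mongo-compatible reply.
    let mut reply_span = tracer.start_with_context("gateway.build_mongo_reply", &cx);
    // ... build the reply here ...
    reply_span.end();

    // The parent span ends when the context holding it is dropped.
}
```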
R: AsyncRead + AsyncWrite + Unpin + Send,
{
    // Start a server span for the incoming request
    let ctx = connection_context.telemetry.start_request_span(
since we have a lot of contexts depending on the current level of processing (e.g. Service/Connection/Request contexts), I'm afraid naming it ctx will add to the confusion while reading the code. let's use request_span instead
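For illustration only, the suggested naming would read roughly like this (the `start_request_span` call is taken from the excerpt above; its arguments and return type are assumed):

```rust
// Suggested rename: `request_span` stays visually distinct from the existing
// Service/Connection/Request level contexts. Arguments are elided, as in the
// design excerpt above.
let request_span = connection_context
    .telemetry
    .start_request_span(/* ... */);
```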
1. **Request Duration Histogram** (`docdb_gateway_request_duration`): Measures the duration of requests processed by the gateway
2. **Error Counter** (`docdb_gateway_errors_total`): Tracks errors by error type and operation
3. **Connection Pool Metrics** (`docdb_connection_pool_size`, `docdb_connection_wait_time`): Monitor connection pool health
what will be the signal for connection_wait_time, given that the deadpool crate we use for the connection pool does not provide this info and only guarantees eventual consistency of the existing Status?
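One possible answer, sketched here rather than prescribed: approximate the wait time by timing `pool.get()` itself, since deadpool's `Status` snapshot is only eventually consistent. This assumes `deadpool_postgres` and a recent `opentelemetry` crate where `Histogram::record` takes the value and an attribute slice; the function and metric wiring are illustrative:

```rust
use std::time::Instant;

use deadpool_postgres::{Client, Pool, PoolError};
use opentelemetry::metrics::Histogram;

// Approximate `docdb_connection_wait_time` by timing pool.get(). The measured
// value includes both queueing for a free connection and creating a new one.
async fn get_client_timed(
    pool: &Pool,
    wait_time: &Histogram<f64>,
) -> Result<Client, PoolError> {
    let started = Instant::now();
    let client = pool.get().await?;
    wait_time.record(started.elapsed().as_secs_f64(), &[]);
    Ok(client)
}
```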
```rust
// Record with current trace context to enable correlation
let ctx = Context::current();
```
same as above, we already have a lot of contexts; let's come up with a better name for the telemetry one
1. **Maintain Tracer Instance**: For creating and managing trace spans
2. **Hold Meter Provider**: For registering and recording metrics
3. **Context Management**: Extract and propagate context across boundaries
which context exactly are we talking about? is it:
- W3C Trace Context
- Mongo wire context
- Local context
- the existing contexts for the Service/Connection/Request levels?
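If "Context Management" means the W3C Trace Context, a minimal extraction sketch with the standard propagator could look like the following; how the `traceparent` value actually reaches the gateway over the Mongo wire protocol is left open, and the `HashMap` carrier below is purely illustrative (assumes the `opentelemetry` and `opentelemetry_sdk` crates):

```rust
use std::collections::HashMap;

use opentelemetry::propagation::TextMapPropagator;
use opentelemetry_sdk::propagation::TraceContextPropagator;

// Extract a W3C Trace Context from key/value pairs carried with the request.
// HashMap<String, String> implements the Extractor trait, so it can stand in
// for whatever carrier the gateway ends up using (headers, a comment field, ...).
fn extract_trace_context(carrier: &HashMap<String, String>) -> opentelemetry::Context {
    let propagator = TraceContextPropagator::new();
    propagator.extract(carrier)
}

// Illustrative usage with a hard-coded traceparent value.
fn example() {
    let mut carrier = HashMap::new();
    carrier.insert(
        "traceparent".to_string(),
        "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".to_string(),
    );
    let parent_cx = extract_trace_context(&carrier);
    // `parent_cx` can now serve as the parent when starting the request span.
    let _ = parent_cx;
}
```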
```rust
async fn emit_request_event(&self, ctx: &ConnectionContext, header: &Header, request: Option<&Request<'_>>, ...) {
    // Extract trace context from request
    let context = extract_context_from_request(header);
```
nit: trace_context
2. **Request Processing**:
   - Authentication & authorization
   - Request parsing and validation
   - Request routing decisions
can you explain what you mean by routing decisions? routing is done either through pg or through the SLB
This PR adds the design document for Phase 2 of our OpenTelemetry instrumentation strategy. Building on the foundational metrics implemented in Phase 1, this design outlines a comprehensive approach to telemetry that includes distributed tracing and structured logging alongside enhanced metrics.
Key Features:
Implementation Details:
The design ensures that all three pillars of observability will work together cohesively, giving both users and developers improved visibility into gateway operations, particularly during failover events and regional transitions.