
Add OpenTelemetry Phase 2 Design: Comprehensive Tracing, Logging, and Enhanced Metrics #196


Open · wants to merge 2 commits into main

Conversation


@udsmicrosoft udsmicrosoft commented May 21, 2025

This PR adds the design document for Phase 2 of our OpenTelemetry instrumentation strategy. Building on the foundational metrics implemented in Phase 1, this design outlines a comprehensive approach to telemetry that includes distributed tracing and structured logging alongside enhanced metrics.

Key Features:

  • Distributed tracing for end-to-end request visibility through the gateway
  • Context propagation to link metrics, traces, and logs
  • Structured logging with trace correlation
  • Enhanced metrics with additional operational indicators
  • Vendor-neutral telemetry export via OTLP

Implementation Details:

  • Rust-based integration with the tracing ecosystem
  • Enhanced TelemetryProvider that maintains backward compatibility
  • Documentation of key integration points across the gateway codebase
  • Configuration options for flexible deployment
  • Example Docker Compose setup for the OpenTelemetry Collector with Jaeger and Prometheus

The design ensures that all three pillars of observability will work together cohesively, giving both users and developers improved visibility into gateway operations, particularly during failover events and regional transitions.
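
As a rough illustration of how the three pillars are meant to line up at the instrumentation level, the sketch below uses the `tracing` crate to open a request span and emit a structured log event inside it, so that an OpenTelemetry bridge can stamp the log record with the span's trace context. The function and field names here are illustrative placeholders, not the gateway's actual API:

```rust
use tracing::{info, info_span};

// Illustrative only: the function and field names are placeholders,
// not the gateway's actual API.
fn handle_find_command(collection: &str) {
    // One span per request flowing through the gateway; fields become
    // span attributes when exported via the OpenTelemetry bridge.
    let request_span = info_span!("gateway.request", operation = "find", collection);
    let _guard = request_span.enter();

    // A structured log event emitted inside the span: a log/trace bridge
    // (e.g. tracing-opentelemetry) can attach the trace and span IDs,
    // which is the trace correlation called out above.
    info!(target: "gateway", "dispatching request to local backend");
}
```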

@udsmicrosoft udsmicrosoft changed the title Add OpenTelemetry Phase 2 Design: Comprehensive Tracing, Logging, and Enhanced Metric Add OpenTelemetry Phase 2 Design: Comprehensive Tracing, Logging, and Enhanced Metrics May 21, 2025
@AlekSi
Contributor

AlekSi commented May 21, 2025

You might find this useful: https://blog.ferretdb.io/otel-context-propagation-in-ferretdb/

@udsmicrosoft udsmicrosoft marked this pull request as ready for review June 4, 2025 19:59
@udsmicrosoft udsmicrosoft requested a review from a team as a code owner June 4, 2025 19:59

## Background

Phase 1 successfully implemented basic metrics for monitoring cluster availability and traffic shifting during failover events. While these metrics provide valuable operational insights, they represent only one pillar of observability.
Member

But the gw works as a thin layer between the CP and pg and is unaware of any failover events, since failover is controlled by the CP, not the gw.


We propose comprehensive OpenTelemetry instrumentation that extends our foundation with:

1. **Distributed Tracing**: Trace the lifecycle of requests as they flow through the gateway and to backend clusters
Member

The gw is unaware of the cluster layout; the only assumption it makes is that there is a local pg instance.

- Initial request reception
- Authentication and authorization
- Request parsing and validation
- Backend selection (primary vs secondary)
Member

The gw is unaware of primary or secondary instances and works with the backend as the local postgres instance.

- Query transformation
Member

We're not performing any query transformation; at most we form a payload for pg, poll the result, and transform the pg response into a Mongo-compatible response.
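
A minimal, hypothetical sketch of how the stages described above could map onto nested spans with the `tracing` crate. All function, type, and span names are illustrative stubs, not the gateway's actual API:

```rust
use tracing::{info_span, instrument, Instrument};

// Stand-in types for the sketch only.
struct Parsed;
struct Rows;
struct Response;

// Placeholder stubs; the real gateway functions are not shown in this PR.
fn parse_message(_raw: &[u8]) -> Parsed { Parsed }
async fn query_postgres(_p: &Parsed) -> Rows { Rows }
fn transform_response(_r: Rows) -> Response { Response }

#[instrument(name = "gateway.handle_request", skip(raw_message))]
async fn handle_request(raw_message: &[u8]) -> Response {
    // Parse and validate the incoming wire message.
    let parsed = info_span!("request.parse").in_scope(|| parse_message(raw_message));

    // Form the payload for the local pg instance and await the result.
    let rows = query_postgres(&parsed)
        .instrument(info_span!("backend.query"))
        .await;

    // Transform the pg response into a Mongo-compatible response.
    info_span!("response.transform").in_scope(|| transform_response(rows))
}
```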

```rust
R: AsyncRead + AsyncWrite + Unpin + Send,
{
    // Start a server span for the incoming request
    let ctx = connection_context.telemetry.start_request_span(
```
Member

Since we already have a lot of contexts depending on the current level of processing (e.g. Service/Connection/Request contexts), I'm afraid naming it `ctx` will add to the confusion while reading the code. Let's use `request_span` instead.


1. **Request Duration Histogram** (`docdb_gateway_request_duration`): Measures the duration of requests processed by the gateway
2. **Error Counter** (`docdb_gateway_errors_total`): Tracks errors by error type and operation
3. **Connection Pool Metrics** (`docdb_connection_pool_size`, `docdb_connection_wait_time`): Monitor connection pool health
Member

What will be the signal for `connection_wait_time`, given that the deadpool crate we use for the connection pool does not provide this info and only guarantees eventual consistency of the existing Status?
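
For reference, a minimal sketch of how the request-duration histogram and error counter listed above might be registered and recorded, assuming a builder-style `opentelemetry` Meter API (the exact builder methods, e.g. `.init()`, differ between crate versions; `GatewayMetrics` and its methods are illustrative names, not existing gateway types):

```rust
use std::time::Duration;
use opentelemetry::metrics::{Counter, Histogram, Meter};
use opentelemetry::KeyValue;

// Illustrative wrapper; structure and names are assumptions, not existing code.
pub struct GatewayMetrics {
    request_duration: Histogram<f64>,
    errors_total: Counter<u64>,
}

impl GatewayMetrics {
    pub fn new(meter: &Meter) -> Self {
        Self {
            // Histogram of request latency in seconds, labelled by operation.
            request_duration: meter
                .f64_histogram("docdb_gateway_request_duration")
                .init(),
            // Counter of errors, labelled by error type and operation.
            errors_total: meter.u64_counter("docdb_gateway_errors_total").init(),
        }
    }

    pub fn record_request(&self, elapsed: Duration, operation: &str) {
        self.request_duration.record(
            elapsed.as_secs_f64(),
            &[KeyValue::new("operation", operation.to_string())],
        );
    }

    pub fn record_error(&self, error_type: &str, operation: &str) {
        self.errors_total.add(
            1,
            &[
                KeyValue::new("error_type", error_type.to_string()),
                KeyValue::new("operation", operation.to_string()),
            ],
        );
    }
}
```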


```rust
// Record with current trace context to enable correlation
let ctx = Context::current();
```
Member

Same as above: we already have a lot of contexts, so let's come up with a better name for the telemetry context.


1. **Maintain Tracer Instance**: For creating and managing trace spans
2. **Hold Meter Provider**: For registering and recording metrics
3. **Context Management**: Extract and propagate context across boundaries
Member

Which context exactly are we talking about here? Is it:

  • W3C Trace Context
  • Mongo wire context
  • Local context
  • the existing contexts for the Service/Connection/Request levels?
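
If it is the W3C Trace Context, a minimal sketch of the extraction step, assuming the standard `opentelemetry` propagation API and a plain `HashMap` carrier built from whatever headers or metadata the gateway can read off the request (the carrier construction itself is an open question and not shown):

```rust
use std::collections::HashMap;
use opentelemetry::{global, Context};

// Illustrative: `carrier` would hold traceparent/tracestate-style entries
// extracted from the incoming request; how the gateway obtains them is TBD.
fn extract_trace_context(carrier: &HashMap<String, String>) -> Context {
    // Uses the globally configured propagator (e.g. the W3C TraceContextPropagator).
    global::get_text_map_propagator(|propagator| propagator.extract(carrier))
}
```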

```rust
async fn emit_request_event(&self, ctx: &ConnectionContext, header: &Header, request: Option<&Request<'_>>, ...) {
    // Extract trace context from request
    let context = extract_context_from_request(header);
```
Member

nit: trace_context

2. **Request Processing**:
- Authentication & authorization
- Request parsing and validation
- Request routing decisions
Member

Can you explain what you mean by routing decisions? Routing is done either through pg or through the SLB.
