
Add OpenTelemetry Phase 2 Design: Comprehensive Tracing, Logging, and Enhanced Metrics #196


Open · wants to merge 2 commits into main

Conversation


@udsmicrosoft udsmicrosoft commented May 21, 2025

This PR adds the design document for Phase 2 of our OpenTelemetry instrumentation strategy. Building on the foundational metrics implemented in Phase 1, this design outlines a comprehensive approach to telemetry that includes distributed tracing and structured logging alongside enhanced metrics.

Key Features:

  • Distributed tracing for end-to-end request visibility through the gateway
  • Context propagation to link metrics, traces, and logs
  • Structured logging with trace correlation
  • Enhanced metrics with additional operational indicators
  • Vendor-neutral telemetry export via OTLP

Implementation Details:

  • Rust-based integration with the tracing ecosystem
  • Enhanced TelemetryProvider that maintains backward compatibility
  • Documentation of key integration points across the gateway codebase
  • Configuration options for flexible deployment
  • Example Docker Compose setup for the OpenTelemetry Collector with Jaeger and Prometheus

The design ensures that all three pillars of observability will work together cohesively, giving both users and developers improved visibility into gateway operations, particularly during failover events and regional transitions.
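
As a rough illustration of how the three pillars are meant to line up at the instrumentation level, the sketch below uses the `tracing` crate to open a request span and emit a structured log event inside it, so that an OpenTelemetry bridge can stamp the log record with the span's trace context. The function and field names here are illustrative placeholders, not the gateway's actual API:

```rust
use tracing::{info, info_span};

// Illustrative only: the function and field names are placeholders,
// not the gateway's actual API.
fn handle_find_command(collection: &str) {
    // One span per request flowing through the gateway; fields become
    // span attributes when exported via the OpenTelemetry bridge.
    let request_span = info_span!("gateway.request", operation = "find", collection);
    let _guard = request_span.enter();

    // A structured log event emitted inside the span: a log/trace bridge
    // (e.g. tracing-opentelemetry) can attach the trace and span IDs,
    // which is the trace correlation called out above.
    info!(target: "gateway", "dispatching request to local backend");
}
```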

@udsmicrosoft udsmicrosoft changed the title Add OpenTelemetry Phase 2 Design: Comprehensive Tracing, Logging, and Enhanced Metric Add OpenTelemetry Phase 2 Design: Comprehensive Tracing, Logging, and Enhanced Metrics May 21, 2025
@AlekSi
Contributor

AlekSi commented May 21, 2025

You might find this useful: https://blog.ferretdb.io/otel-context-propagation-in-ferretdb/

@udsmicrosoft udsmicrosoft marked this pull request as ready for review June 4, 2025 19:59
@udsmicrosoft udsmicrosoft requested a review from a team as a code owner June 4, 2025 19:59

## Background

Phase 1 successfully implemented basic metrics for monitoring cluster availability and traffic shifting during failover events. While these metrics provide valuable operational insights, they represent only one pillar of observability.
Member

But the gw works as a thin layer between the CP and pg and is unaware of any failover events, since failover is controlled by the CP, not the gw.


We propose comprehensive OpenTelemetry instrumentation that extends our foundation with:

1. **Distributed Tracing**: Trace the lifecycle of requests as they flow through the gateway and to backend clusters
Member

The gw is unaware of the cluster layout; the only assumption it makes is that there is a local pg instance.

- Initial request reception
- Authentication and authorization
- Request parsing and validation
- Backend selection (primary vs secondary)
Member

The gw is unaware of primary or secondary instances and works with the backend as the local postgres instance.

- Query transformation
Member

We're not performing any query transformation; at most we form a payload for pg, poll the result, and transform the pg response into a Mongo-compatible response.
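
A minimal, hypothetical sketch of how the stages described above could map onto nested spans with the `tracing` crate. All function, type, and span names are illustrative stubs, not the gateway's actual API:

```rust
use tracing::{info_span, instrument, Instrument};

// Stand-in types for the sketch only.
struct Parsed;
struct Rows;
struct Response;

// Placeholder stubs; the real gateway functions are not shown in this PR.
fn parse_message(_raw: &[u8]) -> Parsed { Parsed }
async fn query_postgres(_p: &Parsed) -> Rows { Rows }
fn transform_response(_r: Rows) -> Response { Response }

#[instrument(name = "gateway.handle_request", skip(raw_message))]
async fn handle_request(raw_message: &[u8]) -> Response {
    // Parse and validate the incoming wire message.
    let parsed = info_span!("request.parse").in_scope(|| parse_message(raw_message));

    // Form the payload for the local pg instance and await the result.
    let rows = query_postgres(&parsed)
        .instrument(info_span!("backend.query"))
        .await;

    // Transform the pg response into a Mongo-compatible response.
    info_span!("response.transform").in_scope(|| transform_response(rows))
}
```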

```rust
R: AsyncRead + AsyncWrite + Unpin + Send,
{
    // Start a server span for the incoming request
    let ctx = connection_context.telemetry.start_request_span(
```
Member

Since we already have a lot of contexts depending on the current level of processing (e.g. Service/Connection/Request contexts), I'm afraid naming it `ctx` will add to the confusion while reading the code. Let's use `request_span` instead.


1. **Request Duration Histogram** (`docdb_gateway_request_duration`): Measures the duration of requests processed by the gateway
2. **Error Counter** (`docdb_gateway_errors_total`): Tracks errors by error type and operation
3. **Connection Pool Metrics** (`docdb_connection_pool_size`, `docdb_connection_wait_time`): Monitor connection pool health
Member

What will be the signal for `connection_wait_time`, given that the deadpool crate we use for the connection pool does not provide this info and only guarantees eventual consistency of the existing Status?
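
For reference, a minimal sketch of how the request-duration histogram and error counter listed above might be registered and recorded, assuming a builder-style `opentelemetry` Meter API (the exact builder methods, e.g. `.init()`, differ between crate versions; `GatewayMetrics` and its methods are illustrative names, not existing gateway types):

```rust
use std::time::Duration;
use opentelemetry::metrics::{Counter, Histogram, Meter};
use opentelemetry::KeyValue;

// Illustrative wrapper; structure and names are assumptions, not existing code.
pub struct GatewayMetrics {
    request_duration: Histogram<f64>,
    errors_total: Counter<u64>,
}

impl GatewayMetrics {
    pub fn new(meter: &Meter) -> Self {
        Self {
            // Histogram of request latency in seconds, labelled by operation.
            request_duration: meter
                .f64_histogram("docdb_gateway_request_duration")
                .init(),
            // Counter of errors, labelled by error type and operation.
            errors_total: meter.u64_counter("docdb_gateway_errors_total").init(),
        }
    }

    pub fn record_request(&self, elapsed: Duration, operation: &str) {
        self.request_duration.record(
            elapsed.as_secs_f64(),
            &[KeyValue::new("operation", operation.to_string())],
        );
    }

    pub fn record_error(&self, error_type: &str, operation: &str) {
        self.errors_total.add(
            1,
            &[
                KeyValue::new("error_type", error_type.to_string()),
                KeyValue::new("operation", operation.to_string()),
            ],
        );
    }
}
```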


```rust
// Record with current trace context to enable correlation
let ctx = Context::current();
```
Member

Same as above: we already have a lot of contexts, so let's come up with a better name for the telemetry context.


1. **Maintain Tracer Instance**: For creating and managing trace spans
2. **Hold Meter Provider**: For registering and recording metrics
3. **Context Management**: Extract and propagate context across boundaries
Member

Which context exactly are we talking about here? Is it:

  • W3C Trace Context
  • Mongo wire context
  • Local context
  • the existing contexts for the Service/Connection/Request levels?
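
If it is the W3C Trace Context, a minimal sketch of the extraction step, assuming the standard `opentelemetry` propagation API and a plain `HashMap` carrier built from whatever headers or metadata the gateway can read off the request (the carrier construction itself is an open question and not shown):

```rust
use std::collections::HashMap;
use opentelemetry::{global, Context};

// Illustrative: `carrier` would hold traceparent/tracestate-style entries
// extracted from the incoming request; how the gateway obtains them is TBD.
fn extract_trace_context(carrier: &HashMap<String, String>) -> Context {
    // Uses the globally configured propagator (e.g. the W3C TraceContextPropagator).
    global::get_text_map_propagator(|propagator| propagator.extract(carrier))
}
```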

```rust
async fn emit_request_event(&self, ctx: &ConnectionContext, header: &Header, request: Option<&Request<'_>>, ...) {
    // Extract trace context from request
    let context = extract_context_from_request(header);
```
Member

nit: trace_context

2. **Request Processing**:
- Authentication & authorization
- Request parsing and validation
- Request routing decisions
Member

Can you explain what you mean by routing decisions? Routing is done either through pg or through the SLB.
