stats/opentelemetry: Introduce Tracing API #7852
base: master
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #7852 +/- ##
==========================================
+ Coverage 81.80% 82.03% +0.22%
==========================================
Files 375 378 +3
Lines 37978 38233 +255
==========================================
+ Hits 31068 31363 +295
+ Misses 5609 5568 -41
- Partials 1301 1302 +1
|
@@ -34,7 +34,10 @@ import ( | |||
|
|||
"github.com/prometheus/client_golang/prometheus/promhttp" | |||
"go.opentelemetry.io/otel/exporters/prometheus" | |||
"go.opentelemetry.io/otel/propagation" |
let's remove the examples from this PR
gcp/observability/go.mod
Outdated
@@ -1,6 +1,6 @@ | |||
module google.golang.org/grpc/gcp/observability | |||
|
|||
go 1.22 | |||
go 1.22.7 |
Why is this changed? I think the OpenTelemetry plugin is already stabilised now (#7759).
clientconn.go
Outdated
@@ -604,6 +605,9 @@ type ClientConn struct { | |||
idlenessMgr *idle.Manager | |||
metricsRecorderList *stats.MetricsRecorderList | |||
|
|||
// Tracks if there was a delay in name resolution. |
nit: To track
interop/xds/go.sum
Outdated
@@ -77,6 +77,8 @@ google.golang.org/genproto/googleapis/api v0.0.0-20241015192408-796eee8c2d53 h1: | |||
google.golang.org/genproto/googleapis/api v0.0.0-20241015192408-796eee8c2d53/go.mod h1:riSXTwQ4+nqmPGtobMFyW5FqVAmIs0St6VPp4Ug7CE4= | |||
google.golang.org/genproto/googleapis/rpc v0.0.0-20241015192408-796eee8c2d53 h1:X58yt85/IXCx0Y3ZwN6sEIKZzQtDEYaBWrDvErdXrRE= | |||
google.golang.org/genproto/googleapis/rpc v0.0.0-20241015192408-796eee8c2d53/go.mod h1:GX3210XPVPUjJbTUbvwI8f2IpZDMZuPJWDzDuebbviI= | |||
google.golang.org/grpc/stats/opentelemetry v0.0.0-20241028142157-ada6787961b3 h1:hUfOButuEtpc0UvYiaYRbNwxVYr0mQQOWq6X8beJ9Gc= |
same. all these go.mod and go.sum files shouldn't need to change
@@ -119,22 +134,46 @@ func (h *clientStatsHandler) streamInterceptor(ctx context.Context, desc *grpc.S | |||
} | |||
|
|||
startTime := time.Now() | |||
ctx, span := h.createCallTraceSpan(ctx, method) |
i think we should just not call createCallTraceSpan if tracing is disabled
Ack
// in the context provided if created. | ||
func (h *clientStatsHandler) createCallTraceSpan(ctx context.Context, method string) (context.Context, trace.Span) { | ||
var span trace.Span | ||
if !isTracingDisabled(h.options.TraceOptions) { |
We should just not call this when tracing is disabled. It should probably just panic if the tracer is not there instead of silently suppressing it, because there will eventually be failures later in the RPC lifecycle.
if info.NameResolutionDelay { | ||
callSpan.AddEvent("Delayed name resolution complete") | ||
} | ||
ctx = trace.ContextWithSpan(ctx, callSpan) |
can be one line ctx, ti = h.traceTagRPC(trace.ContextWithSpan(ctx, callSpan), info)
h.processRPCEvent(ctx, rs, ri.ai) | ||
} | ||
if !isTracingDisabled(h.options.TraceOptions) { | ||
populateSpan(ctx, rs, ri.ai.ti) |
Maybe we should also make this h.populateSpan
stats/opentelemetry/trace.go
Outdated
// current span, message counters for sent and received messages (used for | ||
// generating message IDs), and the number of previous RPC attempts for the | ||
// associated call. | ||
type attemptTraceSpan struct { |
It should be just attemptInfo, similar to callInfo, and it can have a field traceSpan
stats/opentelemetry/opentelemetry.go
Outdated
@@ -181,6 +205,8 @@ type attemptInfo struct { | |||
|
|||
pluginOptionLabels map[string]string // pluginOptionLabels to attach to metrics emitted | |||
xdsLabels map[string]string | |||
|
|||
ti *attemptTraceSpan |
Same. Change to attemptInfo. We shouldn't restrict it to traces: attemptInfo can hold traces, metrics, etc. together as separate fields.
@aranjans please update this with the latest main branch to get rid of all go.mod and go.sum changes.
@purnesh42H Thanks for your review. I have addressed all your comments and this PR is ready for another pass.
Some more non-test comments. Will review tests in next pass
clientconn.go
Outdated
@@ -604,6 +605,9 @@ type ClientConn struct { | |||
idlenessMgr *idle.Manager | |||
metricsRecorderList *stats.MetricsRecorderList | |||
|
|||
// Track if there was a delay in name resolution. |
nit: To track
stats/handlers.go
Outdated
NameResolutionDelay bool | ||
// IsTransparentRetry indicates whether the stream is undergoing a | ||
// transparent retry. | ||
IsTransparentRetry bool |
this is also valid on client side only?
Also, it looks like it's already being passed to HandleRPC https://github.com/grpc/grpc-go/blob/master/stream.go#L428. Do we still have to track it? We only need to add it while populating the span, which can be done directly under the Begin case.
Yeah, this was already handled earlier, but it somehow came back while rebasing onto my earlier commits. In populateSpan, we were already using the value from the Begin handler here.
// in the context provided if created. | ||
func (h *clientStatsHandler) createCallTraceSpan(ctx context.Context, method string) (context.Context, trace.Span) { | ||
if h.options.TraceOptions.TracerProvider == nil { | ||
panic("tracing is required but the TracerProvider is not set. Ensure that tracing is enabled in the options.") |
No need to panic forcefully. You can just return ctx, nil and log an error.
If you don't check at all, it will panic anyway
stream.go
Outdated
@@ -416,8 +417,10 @@ func (cs *clientStream) newAttemptLocked(isTransparent bool) (*csAttempt, error) | |||
method := cs.callHdr.Method | |||
var beginTime time.Time | |||
shs := cs.cc.dopts.copts.StatsHandlers | |||
nameResolutionDelayed := false |
Why do you have to assign twice? Can you assign cs.nameResolutionDelayed directly?
@@ -119,22 +138,47 @@ func (h *clientStatsHandler) streamInterceptor(ctx context.Context, desc *grpc.S | |||
} | |||
|
|||
startTime := time.Now() | |||
|
|||
var span trace.Span |
this should be a pointer as it will be passed around a lot
@@ -85,8 +100,12 @@ func (h *clientStatsHandler) unaryInterceptor(ctx context.Context, method string | |||
} | |||
|
|||
startTime := time.Now() | |||
var span trace.Span |
same. should be pointer.
)) | ||
h.clientMetrics.callDuration.Record(ctx, callLatency, attrs) | ||
func (h *clientStatsHandler) perCallTracesAndMetrics(ctx context.Context, err error, startTime time.Time, ci *callInfo, ts trace.Span) { | ||
s := status.Convert(err) |
this is only required for tracing? so should be inside tracing condition
otelattribute.String("grpc.status", canonicalString(status.Code(err))), | ||
)) | ||
h.clientMetrics.callDuration.Record(ctx, callLatency, attrs) | ||
func (h *clientStatsHandler) perCallTracesAndMetrics(ctx context.Context, err error, startTime time.Time, ci *callInfo, ts trace.Span) { |
Maybe we should have a docstring for this method. It can be simple, like "perCallTracesAndMetrics records per call trace spans and metrics."
} | ||
} | ||
|
||
// createCallTraceSpan creates a call span if tracing is enabled, which will be put |
creates a call span to put in the provided context using the provided TracerProvider. If the TracerProvider is nil, it returns the context as is.
@aranjans you should link the gRFC and make your description more concise.
@purnesh42H I have addressed all the comments and updated the description to link the gRFC proposal.
@@ -68,6 +74,15 @@ func (h *clientStatsHandler) initializeMetrics() { | |||
rm.registerMetrics(metrics, meter) | |||
} | |||
|
|||
func (h *clientStatsHandler) initializeTracing() { | |||
if h.options.TraceOptions.TracerProvider == nil || h.options.TraceOptions.TextMapPropagator == nil { |
we can just call isTracingDisabled here?
@@ -85,8 +100,12 @@ func (h *clientStatsHandler) unaryInterceptor(ctx context.Context, method string | |||
} | |||
|
|||
startTime := time.Now() | |||
var span *trace.Span | |||
if !isTracingDisabled(h.options.TraceOptions) { |
How about we pass the method to perCallTracesAndMetrics and create the trace span there if tracing is not disabled? We are already checking whether tracing is disabled in perCallTracesAndMetrics.
We need to create the call span here because we need to add the event for "name resolution delay", and this info is only available in RPCTagInfo. One alternative is to have a struct for nameResolutionDelay as a key of context metadata, but I don't think that would be a good idea.
Based on offline discussion, going ahead with earlier approach.
@@ -119,22 +138,50 @@ func (h *clientStatsHandler) streamInterceptor(ctx context.Context, desc *grpc.S | |||
} | |||
|
|||
startTime := time.Now() | |||
|
|||
var span *trace.Span | |||
if !isTracingDisabled(h.options.TraceOptions) { |
same suggestion here
// provided TraceProvider. If TraceProvider is nil, it returns context as is. | ||
func (h *clientStatsHandler) createCallTraceSpan(ctx context.Context, method string) (context.Context, *trace.Span) { | ||
if h.options.TraceOptions.TracerProvider == nil { | ||
logger.Error("tracing is required but the TracerProvider is not set. Ensure that tracing is enabled in the options.") |
error can be simply "TracerProvider is not supplied/provided in trace options"
return setRPCInfo(ctx, &rpcInfo{ | ||
ai: &attemptInfo{ | ||
startTime: time.Now(), | ||
xdsLabels: labels.TelemetryLabels, |
I think it's fine to populate startTime: time.Now(), xdsLabels: labels.TelemetryLabels, method: info.FullMethodName before and set the trace stuff inside the if, so that you don't duplicate them below
stats/opentelemetry/opentelemetry.go
Outdated
@@ -84,6 +88,16 @@ type MetricsOptions struct { | |||
pluginOption otelinternal.PluginOption | |||
} | |||
|
|||
// TraceOptions are the tracing options for OpenTelemetry instrumentation. | |||
type TraceOptions struct { | |||
// TracerProvider provides Tracers that are used by instrumentation code to |
// TracerProvider is the OpenTelemetry tracer which is required to record traces/trace spans for instrumentation
ctx, ai = h.traceTagRPC(ctx, info) | ||
return setRPCInfo(ctx, &rpcInfo{ | ||
ai: &attemptInfo{ | ||
startTime: time.Now(), |
same comment here regarding assigning common fields before and flip the condition
ai: attemptInfo{...}
if tracingDisabled(...) {
return setRPCInfo(ctx, ri)
}
stats/opentelemetry/trace.go
Outdated
"google.golang.org/grpc/status" | ||
) | ||
|
||
// traceTagRPC populates context with a new span, and serializes information |
// traceTagRPC populates provided context with a new span using the TextMapPropagator supplied in trace options and internal itracing.carrier. It creates a new outgoing carrier which serializes information about this span into gRPC Metadata, if TextMapPropagator is provided in the trace options. If TextMapPropagator is not provided, it returns the context as is.
stats/opentelemetry/trace.go
Outdated
} | ||
} | ||
|
||
// traceTagRPC populates context with new span data, with a parent based on the |
// traceTagRPC populates context with new span data using the TextMapPropagator supplied in trace options and internal itracing.Carrier. It creates a new incoming carrier which extracts an existing span context (if present) by deserializing from provided context. If a valid span context is extracted, it is set as the parent of the new span; otherwise the new span remains the root span. If TextMapPropagator is not provided in the trace options, it returns the context as is.
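The parent-or-root decision described in this doc comment can be illustrated with a small stdlib-only sketch. It assumes, purely for illustration, that the configured propagator uses the W3C Trace Context `traceparent` format and that incoming metadata is a plain map. The PR itself delegates extraction to whatever TextMapPropagator the user supplies, and `spanContext`, `extract`, and `startServerSpan` are hypothetical names.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// spanContext is a hypothetical, reduced stand-in for a span context.
type spanContext struct {
	traceID string // 32 lowercase hex characters
	spanID  string // 16 lowercase hex characters
}

// isHexID reports whether s is exactly n lowercase hex characters and not all zeros.
func isHexID(s string, n int) bool {
	if len(s) != n || s != strings.ToLower(s) || strings.Trim(s, "0") == "" {
		return false
	}
	_, err := hex.DecodeString(s)
	return err == nil
}

// extract reads a version-00 "traceparent" value from the carrier. ok is
// false when the entry is absent or malformed.
func extract(carrier map[string]string) (sc spanContext, ok bool) {
	parts := strings.Split(carrier["traceparent"], "-")
	if len(parts) != 4 || parts[0] != "00" || len(parts[3]) != 2 {
		return spanContext{}, false
	}
	if !isHexID(parts[1], 32) || !isHexID(parts[2], 16) {
		return spanContext{}, false
	}
	return spanContext{traceID: parts[1], spanID: parts[2]}, true
}

// startServerSpan returns the new span's context and whether it is a root
// span. A valid extracted context becomes the parent, so the new span joins
// its trace; otherwise the span starts a new trace.
func startServerSpan(carrier map[string]string, newTraceID, newSpanID string) (spanContext, bool) {
	if parent, ok := extract(carrier); ok {
		return spanContext{traceID: parent.traceID, spanID: newSpanID}, false
	}
	return spanContext{traceID: newTraceID, spanID: newSpanID}, true
}

func main() {
	md := map[string]string{"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
	sc, root := startServerSpan(md, "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331")
	fmt.Println(sc.traceID, root)
}
```

In real code the IDs would come from the tracer rather than being passed in; they are parameters here only to keep the sketch deterministic.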
@purnesh42H I have addressed all your comments and updated the PR. Kindly review it.
Overview
This pull request implements OpenTelemetry tracing support in the gRPC-Go library as outlined in proposal A72. It lets gRPC calls be traced with OpenTelemetry and provides a migration path from OpenCensus tracing.
Key Features
OpenTelemetry Tracing API: Introduces a new API for enabling and configuring OpenTelemetry tracing within gRPC. This includes the addition of TraceOptions in the Options struct to allow users to specify their TracerProvider.
Context Propagation: Implements context propagation between gRPC clients and servers using OpenTelemetry's TextMapPropagator. This ensures that trace context is correctly passed along with RPC calls.
Migration Path: Provides a clear migration path from OpenCensus to OpenTelemetry, allowing users to transition their tracing implementations without breaking existing functionality. This includes support for both cross-process and in-binary migration scenarios (this will be added in a separate PR).
Tracing Information: Captures detailed tracing information during the RPC lifecycle, including events for outbound and inbound messages, retries, and load balancer delays. This information is essential for monitoring and debugging distributed systems.
RELEASE NOTES: