stats/opentelemetry: Introduce Tracing API #7852

Open
wants to merge 17 commits into
base: master

Conversation

aranjans
Contributor

@aranjans aranjans commented Nov 18, 2024

Overview

This pull request implements OpenTelemetry tracing support in gRPC-Go, as outlined in proposal A72. It provides a framework for tracing gRPC calls with OpenTelemetry and a migration path from OpenCensus tracing.

Key Features

  • OpenTelemetry Tracing API: Introduces a new API for enabling and configuring OpenTelemetry tracing within gRPC. It adds TraceOptions to the Options struct so that users can specify their TracerProvider.

  • Context Propagation: Implements context propagation between gRPC clients and servers using OpenTelemetry's TextMapPropagator. This ensures that trace context is correctly passed along with RPC calls.

  • Migration Path: Provides a clear migration path from OpenCensus to OpenTelemetry, allowing users to transition their tracing implementations without breaking existing functionality. This includes support for both cross-process and in-binary migration scenarios (this will be added in a separate PR).

  • Tracing Information: Captures detailed tracing information during the RPC lifecycle, including events for outbound and inbound messages, retries, and load balancer delays. This information is essential for monitoring and debugging distributed systems.
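
For readers skimming the feature list, here is a minimal sketch of the configuration shape it describes. The TracerProvider and TextMapPropagator interfaces below are local stand-ins for the OpenTelemetry types (so the snippet compiles with the standard library alone), and tracingEnabled, fakeProvider and fakePropagator are hypothetical names used only for illustration. The sketch assumes tracing is treated as enabled only when both fields are set, which matches the nil checks discussed later in the review but is not confirmed as the final behavior.

```go
package main

import "fmt"

// TracerProvider is a local stand-in for the OpenTelemetry interface of the
// same name (the real one lives in go.opentelemetry.io/otel/trace).
type TracerProvider interface {
	TracerName() string // hypothetical method, for illustration only
}

// TextMapPropagator is a local stand-in for the OpenTelemetry propagator
// interface (the real one lives in go.opentelemetry.io/otel/propagation).
type TextMapPropagator interface {
	Fields() []string
}

// TraceOptions mirrors the two fields discussed in this PR.
type TraceOptions struct {
	TracerProvider    TracerProvider
	TextMapPropagator TextMapPropagator
}

// Options is a trimmed-down sketch of the plugin options.
type Options struct {
	TraceOptions TraceOptions
}

// tracingEnabled is a hypothetical helper: tracing counts as enabled only
// when both the provider and the propagator are set.
func tracingEnabled(o Options) bool {
	return o.TraceOptions.TracerProvider != nil && o.TraceOptions.TextMapPropagator != nil
}

// fakeProvider and fakePropagator are placeholders for real implementations.
type fakeProvider struct{}

func (fakeProvider) TracerName() string { return "grpc-go" }

type fakePropagator struct{}

func (fakePropagator) Fields() []string { return []string{"grpc-trace-bin"} }

func main() {
	var unset Options
	set := Options{TraceOptions: TraceOptions{
		TracerProvider:    fakeProvider{},
		TextMapPropagator: fakePropagator{},
	}}
	fmt.Println(tracingEnabled(unset), tracingEnabled(set))
}
```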

RELEASE NOTES:

  • stats/opentelemetry: Added OpenTelemetry tracing support in gRPC-Go, enabling enhanced observability and a migration path from OpenCensus.

@aranjans aranjans added this to the 1.69 Release milestone Nov 18, 2024
@aranjans aranjans added the Type: Feature (New features or improvements in behavior) and Area: Observability (Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability) labels Nov 18, 2024
codecov bot commented Nov 18, 2024

Codecov Report

Attention: Patch coverage is 84.19244% with 46 lines in your changes missing coverage. Please review.

Project coverage is 82.03%. Comparing base (c63aeef) to head (8e78478).
Report is 4 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| stats/opentelemetry/trace.go | 77.77% | 13 Missing and 5 partials ⚠️ |
| ...entelemetry/internal/tracing/custom_map_carrier.go | 68.29% | 11 Missing and 2 partials ⚠️ |
| stats/opentelemetry/client_metrics.go | 86.48% | 7 Missing and 3 partials ⚠️ |
| stats/opentelemetry/grpc_trace_bin_propagator.go | 87.50% | 4 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7852      +/-   ##
==========================================
+ Coverage   81.80%   82.03%   +0.22%     
==========================================
  Files         375      378       +3     
  Lines       37978    38233     +255     
==========================================
+ Hits        31068    31363     +295     
+ Misses       5609     5568      -41     
- Partials     1301     1302       +1     
| Files with missing lines | Coverage Δ |
|---|---|
| clientconn.go | 92.42% <100.00%> (+0.27%) ⬆️ |
| stats/opentelemetry/opentelemetry.go | 76.82% <100.00%> (+0.95%) ⬆️ |
| stats/opentelemetry/server_metrics.go | 90.81% <100.00%> (+1.43%) ⬆️ |
| stream.go | 81.47% <100.00%> (+0.03%) ⬆️ |
| stats/opentelemetry/grpc_trace_bin_propagator.go | 87.50% <87.50%> (ø) |
| stats/opentelemetry/client_metrics.go | 86.34% <86.48%> (-1.59%) ⬇️ |
| ...entelemetry/internal/tracing/custom_map_carrier.go | 68.29% <68.29%> (ø) |
| stats/opentelemetry/trace.go | 77.77% <77.77%> (ø) |

... and 25 files with indirect coverage changes

@purnesh42H purnesh42H self-assigned this Nov 19, 2024
@@ -34,7 +34,10 @@ import (

"github.com/prometheus/client_golang/prometheus/promhttp"
"go.opentelemetry.io/otel/exporters/prometheus"
"go.opentelemetry.io/otel/propagation"
Contributor

let's remove the examples from this PR

@@ -1,6 +1,6 @@
module google.golang.org/grpc/gcp/observability

-go 1.22
+go 1.22.7
Contributor

Why is this changed? I think the OpenTelemetry plugin is already stabilised now (#7759).

clientconn.go Outdated
@@ -604,6 +605,9 @@ type ClientConn struct {
idlenessMgr *idle.Manager
metricsRecorderList *stats.MetricsRecorderList

// Tracks if there was a delay in name resolution.
Contributor

nit: To track

@@ -77,6 +77,8 @@ google.golang.org/genproto/googleapis/api v0.0.0-20241015192408-796eee8c2d53 h1:
google.golang.org/genproto/googleapis/api v0.0.0-20241015192408-796eee8c2d53/go.mod h1:riSXTwQ4+nqmPGtobMFyW5FqVAmIs0St6VPp4Ug7CE4=
google.golang.org/genproto/googleapis/rpc v0.0.0-20241015192408-796eee8c2d53 h1:X58yt85/IXCx0Y3ZwN6sEIKZzQtDEYaBWrDvErdXrRE=
google.golang.org/genproto/googleapis/rpc v0.0.0-20241015192408-796eee8c2d53/go.mod h1:GX3210XPVPUjJbTUbvwI8f2IpZDMZuPJWDzDuebbviI=
google.golang.org/grpc/stats/opentelemetry v0.0.0-20241028142157-ada6787961b3 h1:hUfOButuEtpc0UvYiaYRbNwxVYr0mQQOWq6X8beJ9Gc=
Contributor

same. all these go.mod and go.sum files shouldn't need to change

@@ -119,22 +134,46 @@ func (h *clientStatsHandler) streamInterceptor(ctx context.Context, desc *grpc.S
}

startTime := time.Now()
ctx, span := h.createCallTraceSpan(ctx, method)
Contributor

i think we should just not call createCallTraceSpan if tracing is disabled

Contributor Author

@aranjans aranjans Nov 20, 2024

Ack

// in the context provided if created.
func (h *clientStatsHandler) createCallTraceSpan(ctx context.Context, method string) (context.Context, trace.Span) {
var span trace.Span
if !isTracingDisabled(h.options.TraceOptions) {
Contributor

We should just not call this when tracing is disabled. It should probably panic if the tracer is not there instead of silently suppressing it, because there will eventually be failures later in the RPC lifecycle.

if info.NameResolutionDelay {
callSpan.AddEvent("Delayed name resolution complete")
}
ctx = trace.ContextWithSpan(ctx, callSpan)
Contributor

can be one line ctx, ti = h.traceTagRPC(trace.ContextWithSpan(ctx, callSpan), info)

h.processRPCEvent(ctx, rs, ri.ai)
}
if !isTracingDisabled(h.options.TraceOptions) {
populateSpan(ctx, rs, ri.ai.ti)
Contributor

Maybe we should also make this h.populateSpan.

// current span, message counters for sent and received messages (used for
// generating message IDs), and the number of previous RPC attempts for the
// associated call.
type attemptTraceSpan struct {
Contributor

It should be just attemptInfo similar to callInfo and it can have a field traceSpan

@@ -181,6 +205,8 @@ type attemptInfo struct {

pluginOptionLabels map[string]string // pluginOptionLabels to attach to metrics emitted
xdsLabels map[string]string

ti *attemptTraceSpan
Contributor

Same. Change to attemptInfo. We shouldn't restrict it to traces: attemptInfo can hold traces, metrics, etc. together as separate fields.
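
One way to read this suggestion is a single per-attempt struct that carries metrics state and trace state side by side. The sketch below is an assumption about what that could look like, not the PR's final layout: span is a local stand-in for the OpenTelemetry trace.Span interface, the trace-related field names are derived from the attemptTraceSpan doc comment quoted above, and nextSentMsgID is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// span is a local stand-in for the OpenTelemetry trace.Span interface.
type span interface {
	End()
}

// attemptInfo groups everything tracked for one RPC attempt.
type attemptInfo struct {
	// Fields used for metrics.
	startTime          time.Time
	method             string
	pluginOptionLabels map[string]string
	xdsLabels          map[string]string

	// Fields used for tracing (names are illustrative).
	traceSpan           span   // nil when tracing is disabled
	countSentMsg        uint32 // used to generate outbound message IDs
	countRecvMsg        uint32 // used to generate inbound message IDs
	previousRPCAttempts uint32
}

// nextSentMsgID is a hypothetical helper that returns the sequence number
// for the next outbound message event.
func (ai *attemptInfo) nextSentMsgID() uint32 {
	ai.countSentMsg++
	return ai.countSentMsg
}

func main() {
	ai := &attemptInfo{startTime: time.Now(), method: "/pkg.Service/Method"}
	fmt.Println(ai.traceSpan == nil, ai.nextSentMsgID(), ai.nextSentMsgID())
}
```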

@purnesh42H
Contributor

@aranjans please update this with latest main branch to get rid of all go.mod and go.sum changes

@aranjans aranjans assigned purnesh42H and unassigned aranjans Nov 22, 2024
@aranjans
Contributor Author

@purnesh42H Thanks for your review, I have addressed all your comments and this PR is ready for another pass.

Contributor

@purnesh42H purnesh42H left a comment

Some more non-test comments. Will review tests in next pass

clientconn.go Outdated
@@ -604,6 +605,9 @@ type ClientConn struct {
idlenessMgr *idle.Manager
metricsRecorderList *stats.MetricsRecorderList

// Track if there was a delay in name resolution.
Contributor

nit: To track

NameResolutionDelay bool
// IsTransparentRetry indicates whether the stream is undergoing a
// transparent retry.
IsTransparentRetry bool
Contributor

Is this also valid on the client side only?

Also, it looks like it's already being passed to HandleRPC: https://github.com/grpc/grpc-go/blob/master/stream.go#L428. Do we still have to track it? We only need to add it while populating the span, which can be done directly under the Begin case.

Contributor Author

@aranjans aranjans Nov 23, 2024

Yeah, this was already handled earlier, but it somehow came back while rebasing onto my earlier commits. In populateSpan, we were already reading it from the Begin handler.

// in the context provided if created.
func (h *clientStatsHandler) createCallTraceSpan(ctx context.Context, method string) (context.Context, trace.Span) {
if h.options.TraceOptions.TracerProvider == nil {
panic("tracing is required but the TracerProvider is not set. Ensure that tracing is enabled in the options.")
Contributor

No need to panic forcefully. You can just return ctx, nil and log an error.

Contributor

If you don't check at all, it will panic anyway

stream.go Outdated
@@ -416,8 +417,10 @@ func (cs *clientStream) newAttemptLocked(isTransparent bool) (*csAttempt, error)
method := cs.callHdr.Method
var beginTime time.Time
shs := cs.cc.dopts.copts.StatsHandlers
nameResolutionDelayed := false
Contributor

Why do you have to assign twice? Can you assign cs.nameResolutionDelayed directly?

@@ -119,22 +138,47 @@ func (h *clientStatsHandler) streamInterceptor(ctx context.Context, desc *grpc.S
}

startTime := time.Now()

var span trace.Span
Contributor

this should be a pointer as it will be passed around a lot

@@ -85,8 +100,12 @@ func (h *clientStatsHandler) unaryInterceptor(ctx context.Context, method string
}

startTime := time.Now()
var span trace.Span
Contributor

same. should be pointer.

))
h.clientMetrics.callDuration.Record(ctx, callLatency, attrs)
func (h *clientStatsHandler) perCallTracesAndMetrics(ctx context.Context, err error, startTime time.Time, ci *callInfo, ts trace.Span) {
s := status.Convert(err)
Contributor

this is only required for tracing? so should be inside tracing condition

otelattribute.String("grpc.status", canonicalString(status.Code(err))),
))
h.clientMetrics.callDuration.Record(ctx, callLatency, attrs)
func (h *clientStatsHandler) perCallTracesAndMetrics(ctx context.Context, err error, startTime time.Time, ci *callInfo, ts trace.Span) {
Contributor

Maybe we should have a docstring for this method. It can be simple, like "perCallTracesAndMetrics records per call trace spans and metrics."

}
}

// createCallTraceSpan creates a call span if tracing is enabled, which will be put
Contributor

creates a call span to put in the provided context using provided TraceProvider. If TraceProvider is nil, it returns context as is.

@purnesh42H purnesh42H assigned aranjans and unassigned purnesh42H Nov 22, 2024
@purnesh42H purnesh42H changed the title from "Implement A72: OpenTelemetry Tracing" to "stats/opentelemetry: Introduce Tracing API" Nov 22, 2024
@purnesh42H
Contributor

@aranjans you should link the gRFC and make your description more concise

@aranjans
Contributor Author

aranjans commented Nov 23, 2024

@purnesh42H I have addressed all the comments and updated the description to link the gRFC proposal.
Feel free to close the threads that are now resolved.

@aranjans aranjans assigned purnesh42H and unassigned aranjans Nov 23, 2024
@@ -68,6 +74,15 @@ func (h *clientStatsHandler) initializeMetrics() {
rm.registerMetrics(metrics, meter)
}

func (h *clientStatsHandler) initializeTracing() {
if h.options.TraceOptions.TracerProvider == nil || h.options.TraceOptions.TextMapPropagator == nil {
Contributor

we can just call isTracingDisabled here?

@@ -85,8 +100,12 @@ func (h *clientStatsHandler) unaryInterceptor(ctx context.Context, method string
}

startTime := time.Now()
var span *trace.Span
if !isTracingDisabled(h.options.TraceOptions) {
Contributor

How about we pass the method to perCallTracesAndMetrics and create the trace span there if tracing is not disabled? We already check whether tracing is disabled in perCallTracesAndMetrics.

Contributor Author

We need to create the call span here because we need to add the event for "name resolution delay", and this info is only available in RPCTagInfo. One alternative is to carry nameResolutionDelay in the context under a dedicated key, but I don't think that would be a good idea.

Based on offline discussion, going ahead with earlier approach.

@@ -119,22 +138,50 @@ func (h *clientStatsHandler) streamInterceptor(ctx context.Context, desc *grpc.S
}

startTime := time.Now()

var span *trace.Span
if !isTracingDisabled(h.options.TraceOptions) {
Contributor

same suggestion here

// provided TraceProvider. If TraceProvider is nil, it returns context as is.
func (h *clientStatsHandler) createCallTraceSpan(ctx context.Context, method string) (context.Context, *trace.Span) {
if h.options.TraceOptions.TracerProvider == nil {
logger.Error("tracing is required but the TracerProvider is not set. Ensure that tracing is enabled in the options.")
Contributor

The error can simply be "TracerProvider is not supplied/provided in trace options".

return setRPCInfo(ctx, &rpcInfo{
ai: &attemptInfo{
startTime: time.Now(),
xdsLabels: labels.TelemetryLabels,
Contributor

I think it's fine to populate startTime: time.Now(), xdsLabels: labels.TelemetryLabels, method: info.FullMethodName beforehand and set the trace fields inside the if, so that you don't duplicate them below.

@@ -84,6 +88,16 @@ type MetricsOptions struct {
pluginOption otelinternal.PluginOption
}

// TraceOptions are the tracing options for OpenTelemetry instrumentation.
type TraceOptions struct {
// TracerProvider provides Tracers that are used by instrumentation code to
Contributor

// TracerProvider is the OpenTelemetry tracer which is required to record traces/trace spans for instrumentation

ctx, ai = h.traceTagRPC(ctx, info)
return setRPCInfo(ctx, &rpcInfo{
ai: &attemptInfo{
startTime: time.Now(),
Contributor

same comment here regarding assigning common fields before and flip the condition

ai: attemptInfo{...}
if tracingDisabled(...) {
return setRPCInfo(ctx, ri)
}

"google.golang.org/grpc/status"
)

// traceTagRPC populates context with a new span, and serializes information
Contributor

// traceTagRPC populates the provided context with a new span using the TextMapPropagator supplied in trace options and the internal itracing.Carrier. It creates a new outgoing carrier which serializes information about this span into gRPC Metadata, if TextMapPropagator is provided in the trace options. If TextMapPropagator is not provided, it returns the context as is.

}
}

// traceTagRPC populates context with new span data, with a parent based on the
Contributor

// traceTagRPC populates context with new span data using the TextMapPropagator supplied in trace options and internal itracing.Carrier. It creates a new incoming carrier which extracts an existing span context (if present) by deserializing from provided context. If valid span context is extracted, it is set as parent of the new span otherwise new span remains the root span. If TextMapPropagator is not provided in the trace options, it returns context as is.
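
To make the inject/extract round trip described in these two docstrings concrete, here is a small sketch of a metadata-backed carrier. It only imitates the Get/Set/Keys shape of an OpenTelemetry TextMapCarrier. The metadataCarrier, spanContext, inject and extract names are hypothetical, and the header handling is a deliberately simplified take on the W3C traceparent layout with no validation, not the propagator used by the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// metadataCarrier adapts a metadata-style map to Get/Set/Keys accessors.
type metadataCarrier struct{ md map[string][]string }

func (c metadataCarrier) Get(key string) string {
	if v := c.md[key]; len(v) > 0 {
		return v[0]
	}
	return ""
}

func (c metadataCarrier) Set(key, value string) { c.md[key] = []string{value} }

func (c metadataCarrier) Keys() []string {
	keys := make([]string, 0, len(c.md))
	for k := range c.md {
		keys = append(keys, k)
	}
	return keys
}

// spanContext holds the identifiers that cross the process boundary.
type spanContext struct{ traceID, spanID string }

const headerKey = "traceparent"

// inject is what the client side does: serialize the span into metadata.
func inject(sc spanContext, c metadataCarrier) {
	c.Set(headerKey, "00-"+sc.traceID+"-"+sc.spanID+"-01")
}

// extract is what the server side does: recover the parent, if any. A false
// result means the new server span would be a root span.
func extract(c metadataCarrier) (spanContext, bool) {
	parts := strings.Split(c.Get(headerKey), "-")
	if len(parts) != 4 {
		return spanContext{}, false
	}
	return spanContext{traceID: parts[1], spanID: parts[2]}, true
}

func main() {
	carrier := metadataCarrier{md: map[string][]string{}}
	inject(spanContext{traceID: "4bf92f3577b34da6a3ce929d0e0e4736", spanID: "00f067aa0ba902b7"}, carrier)
	parent, ok := extract(carrier)
	fmt.Println(parent.traceID, parent.spanID, ok)
}
```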

@purnesh42H purnesh42H assigned aranjans and unassigned purnesh42H Nov 25, 2024
@aranjans
Contributor Author

@purnesh42H I have addressed all your comments, and updated the PR. Kindly review the PR.

@aranjans aranjans assigned purnesh42H and unassigned aranjans Nov 30, 2024
Labels
Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability Type: Feature New features or improvements in behavior
2 participants