Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement tail sampling on the client #4160

Merged
merged 4 commits into from
May 17, 2024
Merged

Conversation

philipp-spiess
Copy link
Member

@philipp-spiess philipp-spiess commented May 13, 2024

Closes CODY-1367

We currently implement open telemetry tail-sampling only on the OTel Collector. This causes a bunch of issues, though:

  • We noticed "congestion" since we log a lot of spans and only sample a few of them. In the past, OTel collector was discarding events because of this for larger periods of time.
  • We can only send the traces to our dedicated OTel collector since otherwise it will be too noisy. This prevented us from enabling traces on enterprise instances in any useful way.

This PR attempts to implement tail-sampling like we do it on the OTel collector on the client. Since there is no Tail Sampler, this is done by extending the HTTP SpanExporter. For every span that is exported, we will now look at the root span and only record if if it was manually marked as sampled.

Since spans can be flushed out of order (sub-spans can be flushed before the parent span is sent), theres a bit of bookkeeping that queues spans until their parent was received (or up to 1min in total).

Test plan

  • To test this, I made some changes in the source code, mainly:
    • Comment out the check in CodyTraceExport line 17 for this.isTracingEnabled
    • Add a console.log to see which traces are flushed in CodyTraceExport on line 58
  • Then, trigger things that would collect traces like autocomplete or chat and see if there's something logged:
Screenshot 2024-05-13 at 17 49 27

@bobheadxi
Copy link
Member

bobheadxi commented May 13, 2024

For every span that is exported, we will now look at the root span and only record if if it was manually marked as sampled.

@philipp-spiess at least in the Go SDK, it's possible to set the trace is sampled flag https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#sampling that automatically applies to all spans in a trace and controls whether the underlying processor receives the span or not, which sounds like the behaviour you describe here (i.e. there is a native "should sample" flag that does not require looking at the root span). This also works over the wire if both ends are configured correctly, which I commented on here: sourcegraph/zoekt#761 (comment)

This is also how we implement should trace "trace policy" in-Sourcegraph as of sourcegraph/sourcegraph#58068, PR also describes how this works in Go a bit

@bobheadxi
Copy link
Member

The native sampling flag behaviour sounds like pretty much what you want here, but I may be missing some nuance - feel free to ping me on Slack to discuss if that might be more effective!

@philipp-spiess
Copy link
Member Author

@bobheadxi I'll look into it but one problem I remember is that for autocomplete we only know towards the end of the trace if we will sample it so all sub-spans that are already flushed need to wait for that, hence the tail sampling approach.

If we can do this with TraceFlags too than that should help a ton yep

@philipp-spiess
Copy link
Member Author

@bobheadxi I tried (via span.spanContext().traceFlags = TraceFlags.SAMPLED) but I couldn't get this to work the way I want to. Any ideas what else I can try? 🤔

  • When I just make this change, everything else is sampled by default too
  • When I use it in combination with TraceIdRatioBasedSampler set to 0 (i.e nothing is sampled) it won't sample any trace
  • I guess I can use it with a custom Sampler but this has the issue that some spans may be flushed before we make the sampling decision (I tried this early with the manual sampled field and it was not working reliably).

@bobheadxi
Copy link
Member

bobheadxi commented May 14, 2024

I tried (via span.spanContext().traceFlags = TraceFlags.SAMPLED) but I couldn't get this to work the way I want to. Any ideas what else I can try? 🤔

It should be a read-only attribute, so the should-trace flag must be set by a custom Sampler implementation, Go version: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/tracer/policy_sampler.go?L23:1-44:2 i.e. wrap TraceIdRatioBasedSampler with your own sampling decider

Node version: https://open-telemetry.github.io/opentelemetry-js/interfaces/_opentelemetry_sdk_trace_base.Sampler.html

we only know towards the end of the trace

Curious how this works, as your custom sampler in this PR interrogates the root span 🤔 do we set the attribute only at the end of the root span?

Copy link
Member

@bobheadxi bobheadxi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some suggestions on native mechanism: #4160 (comment) - if this is not critical, I recommend investigating if we can use some semblance of the native mechanisms instead, as the bookkeeping done in this exporter seems a bit risky 😁

Approving to unblock regardless as this is just telemetry and if we need this to improve the state of trace export traffic, then feel free to ship this as-is :)

@philipp-spiess
Copy link
Member Author

Curious how this works, as your custom sampler in this PR interrogates the root span 🤔 do we set the attribute only at the end of the root span?

@bobheadxi Yeah. I only wanted to sample autocomplete requests that also end up being shown to users. We do a similar thing for telemetry, because we don't really care about those which aren't shown (these won't be noticed by a user anyhow and we only care about their cost which we can capture from the gateway traces).

So what we do here is we start a trace and then do all the fancy shmancy to build the autocomplete result and only if the request was not interrupted in the mean time (i.e by another keystroke or some other things) will it be marked as "suggested" and also sampled. That's the inherent issue I had with custom Sampler, the decision on wether something samples need to be made ahead of time.

I'll try this out for now, my main motivation behind this PR is that we reduce the load of traces being sent from the Cody Clients to the OTel collectors so if we run into issues we can continue to move the sampling decision into the otel collector but it would be nice if the client doesn't log every single I/O operation going on 😬

@philipp-spiess philipp-spiess merged commit 7b7f0ff into main May 17, 2024
21 checks passed
@philipp-spiess philipp-spiess deleted the ps/tail-sampling-otel branch May 17, 2024 12:20
@bobheadxi
Copy link
Member

Sounds good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants