Starfish Tracing Model #83

Draft · wants to merge 6 commits into main
Conversation

@mitsuhiko (Member) commented Mar 29, 2023

This RFC proposes a new tracing model for Sentry's performance product to better
address future product improvements. The goal of this model is to allow
storing entire traces without gaps, to support dynamic sampling and span indexing,
and to extract metrics from spans pre-sampling. This is an
evolution of the current transaction-based approach.

Rendered RFC

@mitsuhiko mitsuhiko marked this pull request as draft March 29, 2023 11:19
@AbhiPrasad (Member) left a comment


Exciting!

We should probably add some notes about cardinality (in span name, span tags, extracted metrics, etc.); I feel that's necessary information for making the best decisions around this.

provide at least the following pieces of information:

* `op`: defines the core operation that the span represents (eg: `db.query`)
* `description`: the most significant description of what this operation is (eg: the database query).
Member

I would heavily be in favor of renaming this to `name`.

For those unfamiliar, we currently have an inconsistency in the Sentry schema where transactions have a name attribute and spans have a description (functionally identical).

I think most users are familiar with name as an attribute (it's also closer to OTel and other vendors), and it would require less of a delta for us to adopt.

We even alias span descriptions to names in the UI:

[screenshot: the span description displayed as the span name in the UI]

Member Author

I think if we want to get rid of `description` as the primary field then we need to rethink how we do the UX around some of those pieces. We currently stash into `description` what others, for instance, stash into `http.route` etc.

Member

I don't think we need to get rid of description; we would simply call it `name` instead of `description`. It can work exactly the same way.

* `segment_id`: a span that is part of a segment, always refers back to it.
* `is_segment`: when set to `true` this span is a segment.
* `start_time` and `end_time` to give the span time information.
* `tags`: a key/value pair of arbitrary tags that are set per-span. Conceptionally however spans
Member

Currently there are a couple of ways to store arbitrary data on the existing schema:

on transactions you have tags, contexts and extras (inherited from error events)

on spans you have tags and data.

As per our spec, tags are restricted to string keys -> string values and have strict size limits, while data is restricted to string keys but can have arbitrary values (with more liberal size limits).

This has led to most of the SDKs storing arbitrary values in span data, and to the OpenTelemetry conversion setting span attributes (OTel's arbitrary span data) in the data field as well.

Span data is now also used for semantic conventions, similar to how OpenTelemetry semantic conventions use OTel span attributes. I would like to keep this behaviour. For example, the `span.data` entry with the key `blocked_main_thread` is used in the file I/O performance issue detector.

I would like an arbitrary place to store data to continue to exist, so that means we either move to saying this is data, or we update the limitations on tags.

Member Author

I agree that we should keep this. I won't have time to focus on the span schema yet, so if you want, maybe you can sketch out directly on the RFC what the schema should look like?
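
For reference, a rough sketch (purely illustrative and not part of the RFC; field names and types are assumptions drawn from the excerpts and the tags/data discussion above) of what such a span schema could look like:

```typescript
// Hypothetical sketch of a span schema combining the fields quoted in this
// thread (op/description, segment_id, is_segment, timestamps, tags) with the
// tags-vs-data split discussed above. Not the RFC's final schema.
interface Span {
  trace_id: string;            // trace this span belongs to
  span_id: string;             // unique id of this span
  parent_id?: string;          // optional parent, possibly from a different service
  segment_id?: string;         // segment this span is part of
  is_segment: boolean;         // true if this span itself is a segment
  op: string;                  // core operation, e.g. "db.query"
  description: string;         // most significant description (or "name", per this thread)
  start_time: number;          // unix timestamps
  end_time: number;
  tags?: Record<string, string>;    // string -> string, strict size limits, indexed
  data?: Record<string, unknown>;   // arbitrary values, e.g. blocked_main_thread
  measurements?: Record<string, { value: number; unit?: string }>; // e.g. cache.result_size
}
```

The tags/data split here mirrors the current spec: size-limited string tags for indexing, plus a looser data bag for semantic conventions.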

* `parent_id`: a span optionally points back to a parent trace which could be from a different
service
* `segment_id`: a span that is part of a segment, always refers back to it.
* `is_segment`: when set to `true` this span is a segment.
Member

If segment is meant to be either a transaction or an interaction, should we build it into the spec here?

So `segment: "transaction" | "interaction" | null`, where `null` says that the span is not a segment.
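
A minimal sketch of that alternative, assuming the field names quoted above (not an agreed-upon schema):

```typescript
// Replace the boolean `is_segment` with a single nullable discriminator.
// All names here are assumptions for illustration only.
type SegmentKind = "transaction" | "interaction" | null;

interface SpanSegmentFields {
  segment_id?: string;  // points back to the segment this span belongs to
  segment: SegmentKind; // null means this span is not itself a segment
}
```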

mitsuhiko and others added 2 commits March 29, 2023 18:59
Co-authored-by: Mark Story <[email protected]>
Co-authored-by: Abhijeet Prasad <[email protected]>
Comment on lines +37 to +38
* Enable a path that allows clients to send 100% of spans outwards at least to a local aggregator,
preferably a remote relay


Which types of aggregation do you have in mind here?
Is that the usual metrics extraction or something else?


* `op`: defines the core operation that the span represents (eg: `db.query`)
* `description`: the most significant description of what this operation is (eg: the database query).
The description also gets processed and cleaned in the process.


How do we expect to use this?
When we experimented with span search and when we built suspect spans, one of the major issues was that all the useful information was stored unstructured in this field, so all queries would have been regexes. It did not end well.

@nikhars, if this is going to be a searchable field you may want to look into ngram indexes.

establishing of a connection it could attach a `socket.connect_time` or similar
numbers. Other examples might be things like `cache.result_size` etc.

Measurements are ingested broken down by segment and span level tags before dynamic


I am a little concerned about breaking down by all tags and letting these be arbitrary.
We have plenty of unique values in tags today.
Spans currently represent, on average, 25x the volume of transactions. If the cardinality is extreme (e.g. unique values in tags being propagated), pre-aggregation will be worthless independently of the storage used, so we need to be careful here.
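
A back-of-the-envelope illustration of the concern (all numbers are assumptions, not measurements):

```typescript
// If measurements are pre-aggregated broken down by every span-level tag,
// the number of series grows with the product of the tag cardinalities.
const tagCardinalities = [50, 100, 20]; // e.g. release, endpoint, device tags
const measurementsPerSpan = 4;          // e.g. duration, connect_time, ...

const seriesUpperBound =
  measurementsPerSpan * tagCardinalities.reduce((a, b) => a * b, 1);

console.log(seriesUpperBound); // 400000 potential series from just three tags
```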

@dcramer (Member) commented Mar 31, 2023

I've not been able to review this in depth but I wanted to escalate the convo we're having in Slack.

I do not believe a session should be a trace, or vice versa. To me these are different constructs. I, as a user, expect a trace to represent a request cycle. That is, when I take an action, a trace occurs, and it ends when that action fully completes. That said, there are problems in all models.

To illustrate what I mean, I want to call out several cases that begin root traces:

  • I curl an API endpoint, it starts when the edge receives the request
  • I load a SPA, it starts when the browser initiates the request
  • I load Slack, on my Desktop, it starts when I load it
  • I start a match in League of Legends - or maybe I open the game client?

When do these traces end?

  • API endpoint is straightforward. Commonly it'll end when the response is returned (unless some async work happens)
  • SPA could end when the rendering completes, however, there may still be things going on. What happens with that instrumentation?
  • Slack, it could end when the app has fully loaded, but what about all the other tasks going on? Should background tasks be their own trace? Then they're sessions again. Should whoever's pushing data to Slack initiate the trace? You lose context.

My pov is all models are trash. You lose data if you make traces behave as expected, which is what I originally described here. Traces were not designed for long-lived applications and, frankly speaking, they do not work for them.

Should we throw it all away, call it something else, and just go with it? E.g. we could just call it Session Tracing and explicitly say that a trace, generally speaking, will live for an entire session? This has its own problems.

I load a SPA, or Slack, and keep it open for days. All those requests are a single trace.

  • How do I sample? Sampling an entire session is prohibitively expensive, thus a session still needs bounds.
  • How do I make sense of the trace data? There's an infinite number of segments/transactions inside of it. Segments solve for this but it's still an immense amount of complexity.
  • How do I make sense of interactions?

I'm sure there are more things I've forgotten to write up here, but the problem does not seem to have a right answer.

@mattjohnsonpint (Contributor) commented Apr 1, 2023

If we're really considering a new model, IMHO this would be a good time to tackle the time synchronization problem. Timestamps aren't enough, because each clock that generates them is synchronized independently. It's all too easy for a "distributed trace" to show nonsensical timestamps when viewed as a whole, such as server responses that complete before the request has been sent. I have some ideas here we could experiment with. Would this be welcome as part of this new model?

* capture entire traces
* browser tabs = a trace
* index and extract metrics on a span level
* clients send 100% of metrics
Member

Maybe this is in the TODO, but how does this proposal help with the "100% sending" or "memory-pressure/payload size" issues listed above? What about the current CPU overhead?

It is helpful for me to remember that our own sampling limits, even after dynamic sampling has been implemented, are often set extremely low for performance reasons: https://github.com/getsentry/sentry/blob/cc47128efd5a00e66b2475acc318ff57860fdf94/src/sentry/utils/sdk.py#L35

@JoshFerge (Member), Apr 3, 2023

Even with a local relay, I don't understand how we could ever have metrics sampled enough to be useful for SLAs/alerts without severely degrading the performance of medium-and-above throughput processes. It seems to me the only way to accomplish this is to collect metrics directly in the process.

Member

Ah, maybe this is what

extract metrics from spans pre-sampling.

is going to address?

@maggiepint

Sorry for the rather drive-by comments, but since I have a vested interest in better UI tracing, a couple of thoughts.

I want to note from experience that most SLO initiatives fail to meaningfully improve reliability in organizations. This is generally because SLOs are measured over individual services and endpoints, and that aggregation doesn't connect meaningfully to user experience. That's really the same problem @dcramer is calling out here.

There's a lot of opportunity to build a better model, but I think you'd have to think in somewhat platform-specific terms. If I live just in React in my head (though I think conceptually this fits the whole web platform), the meaningful measurements I would want for my users are:

  • Route/url change
  • Interactions that load new stuff inline (infinite scroll, drawers, modals, etc)
  • Events to server (like form posts)
  • Events from server (server sent events)

When you think about it - nearly all of these events will have some 'main' component/DOM element associated with them. Figuring out what that main component is is hard because they nest, but logically:

  • router loads
  • New suspense or error boundary loads (well-designed modals and drawers will have these; for less well-designed stuff this is tough)
  • Button or link clicks/form posts

All mark the start of a new transaction and most likely the end of the previous one. They also likely represent a logical 'feature' in terms an average product manager would understand.

Server-sent events are a wild card because they represent an almost parallel thread to your main UI thread - but I do feel they could almost always be looked at separately as they are async and at least shouldn't represent real-time interaction (the chat scenario being the weird counterexample that must always exist - but then each chat exchange can likely be thought of as its own transaction as well).

This might be possible through auto-instrumentation if you are willing to go into the frameworks (historically y'all are). I think you end up with sort of a 'transaction type' and then a 'transaction boundary' following this schema - where the 'transaction type' is 'route | suspense | button/form' and the boundary is the name of the associated thing (DOM ID/name, component name, etc.); a rough sketch follows below.
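
A rough sketch of that 'transaction type' / 'transaction boundary' idea; every name here is hypothetical rather than a proposed Sentry schema:

```typescript
// Hypothetical shape for UI transactions keyed by the boundary that started them.
type UITransactionType = "route" | "suspense" | "button/form";

interface UITransaction {
  type: UITransactionType; // which kind of UI event started the transaction
  boundary: string;        // name of the associated thing: DOM id, component name, etc.
  start_time: number;      // when the boundary became active
  end_time?: number;       // likely closed when the next boundary or route change starts
}

// Example: an inline drawer loading behind a suspense boundary
const drawerLoad: UITransaction = {
  type: "suspense",
  boundary: "SettingsDrawer",
  start_time: Date.now(),
};
```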

@weyert commented Apr 5, 2023

You might find the OpenTelemetry Client Instrumentation SIG group interesting:
https://docs.google.com/document/d/16Vsdh-DM72AfMg_FIt9yT9ExEWF4A_vRbQ3jRNBe09w/edit#heading=h.yplevr950565

From a quick scan I didn't see Sentry mentioned in it yet. I think it would be great to consider joining the group and solving this together.

@maggiepint You can have composite SLOs and give different weights to the underlying ones; you can set the compositeWeight in an OpenSLO definition, which allows for an end-to-end user journey.

number of spans of a certain category per segment and have these counters be
promoted into the segment.

## Batches
Member

I have an interest in a tracing model that actually works for batched workloads.

I'm trying to figure out how a tracing story could work for arroyo, and so far it seems that one would both:

  1. want a transaction wrapped around a function that operates on a batch
  2. want to be able to correlate that transaction to the origin of each message in that batch

So that, to me, implies that one transaction may have multiple parent traces.

That's the only solution I could make sense of, but it sucks for a lot of reasons, so I'm curious whether we have a better plan here.
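
For illustration only, one way this could be expressed is a segment that carries explicit references back to each originating trace. The `links` field below is hypothetical (loosely modeled on OpenTelemetry-style span links) and not something the RFC specifies:

```typescript
// A batch-processing segment that keeps a reference back to the originating
// trace of every message in the batch. Values are placeholders.
interface SpanLink {
  trace_id: string;
  span_id: string;
}

interface BatchSegment {
  trace_id: string;  // the consumer's own trace
  span_id: string;
  is_segment: true;
  op: string;        // e.g. "queue.process_batch"
  links: SpanLink[]; // one entry per message origin in the batch
}

const batch: BatchSegment = {
  trace_id: "trace-of-the-consumer",
  span_id: "span-1",
  is_segment: true,
  op: "queue.process_batch",
  links: [
    { trace_id: "trace-of-message-1", span_id: "producer-span-1" },
    { trace_id: "trace-of-message-2", span_id: "producer-span-2" },
  ],
};
```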
