Starfish Tracing Model #83

Draft · wants to merge 6 commits into main
Conversation

@mitsuhiko (Member) commented Mar 29, 2023

This RFC proposes a new tracing model for Sentry's performance product to better
address future product improvements. The goal of this model is to allow
storing entire traces without gaps, to support dynamic sampling and span indexing,
and to extract metrics from spans pre-sampling. This is an
evolution of the current transaction-based approach.

Rendered RFC

@mitsuhiko mitsuhiko marked this pull request as draft March 29, 2023 11:19
@AbhiPrasad (Member) left a comment


Exciting!

We should probably add some notes about cardinality (in span name, span tags, extracted metrics, etc.); I feel that's necessary information for making the best decisions around this.

provide at least the following pieces of information:

* `op`: defines the core operation that the span represents (eg: `db.query`)
* `description`: the most significant description of what this operation is (eg: the database query).
Member

I would heavily be in favor of renaming this to `name`.

For those unfamiliar, we currently have an inconsistency in the Sentry schema where transactions have a name attribute and spans have a description (functionally identical).

I think most users are familiar with name as an attribute (it's also closer to OTel and other vendors), and it would require less of a delta for us to adopt.

We even alias span descriptions to names in the UI:

[screenshot: the span description displayed as the span name in the UI]

Member Author

I think if we want to get rid of `description` as the primary field then we need to rethink how we do the UX around some of those pieces. We currently stash into `description` what others, for instance, stash into `http.route` etc.

Member

I don't think we need to get rid of description; we would simply call it `name` instead of `description`. It can work exactly the same way.

* `segment_id`: a span that is part of a segment, always refers back to it.
* `is_segment`: when set to `true` this span is a segment.
* `start_time` and `end_time` to give the span time information.
* `tags`: a key/value pair of arbitrary tags that are set per-span. Conceptionally however spans
Member

Currently there are a couple of ways to store arbitrary data on the existing schema:

on transactions you have tags, contexts and extras (inherited from error events)

on spans you have tags and data.

As per our spec, tags are restricted to string keys -> string values and have strict size limits, while data is restricted to string keys but can have arbitrary values (with more liberal size limits).

This has led to most of the SDKs storing arbitrary values in span data, and to the OpenTelemetry conversion setting span attributes (OTel's arbitrary span data) in the data field as well.

Span data is now also used for semantic conventions, similar to how OpenTelemetry semantic conventions use OTel span attributes. I would like to keep this behaviour. For example, the `span.data` entry with the key `blocked_main_thread` is used in the file I/O performance issue detector.

I would like an arbitrary place to store data to continue to exist, so that means we either move to saying this is data, or we update the limitations on tags.

Member Author

I agree that we should keep this. I won't have time to focus on the span schema yet, so if you want, maybe you can sketch out directly on the RFC what the schema should look like?
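
For reference, a rough sketch (purely illustrative and not part of the RFC; field names and types are assumptions drawn from the excerpts and the tags/data discussion above) of what such a span schema could look like:

```typescript
// Hypothetical sketch of a span schema combining the fields quoted in this
// thread (op/description, segment_id, is_segment, timestamps, tags) with the
// tags-vs-data split discussed above. Not the RFC's final schema.
interface Span {
  trace_id: string;            // trace this span belongs to
  span_id: string;             // unique id of this span
  parent_id?: string;          // optional parent, possibly from a different service
  segment_id?: string;         // segment this span is part of
  is_segment: boolean;         // true if this span itself is a segment
  op: string;                  // core operation, e.g. "db.query"
  description: string;         // most significant description (or "name", per this thread)
  start_time: number;          // unix timestamps
  end_time: number;
  tags?: Record<string, string>;    // string -> string, strict size limits, indexed
  data?: Record<string, unknown>;   // arbitrary values, e.g. blocked_main_thread
  measurements?: Record<string, { value: number; unit?: string }>; // e.g. cache.result_size
}
```

The tags/data split here mirrors the current spec: size-limited string tags for indexing, plus a looser data bag for semantic conventions.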

* `parent_id`: a span optionally points back to a parent trace which could be from a different
service
* `segment_id`: a span that is part of a segment, always refers back to it.
* `is_segment`: when set to `true` this span is a segment.
Member

If segment is meant to be either a transaction or an interaction, should we build it into the spec here?

So `segment: "transaction" | "interaction" | null`, where `null` says that the span is not a segment.
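
A minimal sketch of that alternative, assuming the field names quoted above (not an agreed-upon schema):

```typescript
// Replace the boolean `is_segment` with a single nullable discriminator.
// All names here are assumptions for illustration only.
type SegmentKind = "transaction" | "interaction" | null;

interface SpanSegmentFields {
  segment_id?: string;  // points back to the segment this span belongs to
  segment: SegmentKind; // null means this span is not itself a segment
}
```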

mitsuhiko and others added 2 commits March 29, 2023 18:59
Co-authored-by: Mark Story <[email protected]>
Co-authored-by: Abhijeet Prasad <[email protected]>
Comment on lines +37 to +38
* Enable a path that allows clients to send 100% of spans outwards at least to a local aggregator,
preferably a remote relay


Which types of aggregation do you have in mind here?
Is that the usual metrics extraction or something else?


* `op`: defines the core operation that the span represents (eg: `db.query`)
* `description`: the most significant description of what this operation is (eg: the database query).
The description also gets processed and cleaned in the process.


How do we expect to use this?
When we experimented with span search and when we built suspect spans, one of the major issues was that all the useful information was stored unstructured in this field, so all queries would have been regexes. It did not end well.

@nikhars, if this is going to be a searchable field you may want to look into ngram indexes.

establishing of a connection it could attach a `socket.connect_time` or similar
numbers. Other examples might be things like `cache.result_size` etc.

Measurements are ingested broken down by segment and span level tags before dynamic


I am a little concerned about breaking down by all tags and letting these be arbitrary.
We have plenty of unique values in tags today.
Spans currently represent, on average, 25x the volume of transactions. If the cardinality is extreme (e.g. unique values in tags being propagated), pre-aggregation will be worthless independently of the storage used, so we need to be careful here.
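
A back-of-the-envelope illustration of the concern (all numbers are assumptions, not measurements):

```typescript
// If measurements are pre-aggregated broken down by every span-level tag,
// the number of series grows with the product of the tag cardinalities.
const tagCardinalities = [50, 100, 20]; // e.g. release, endpoint, device tags
const measurementsPerSpan = 4;          // e.g. duration, connect_time, ...

const seriesUpperBound =
  measurementsPerSpan * tagCardinalities.reduce((a, b) => a * b, 1);

console.log(seriesUpperBound); // 400000 potential series from just three tags
```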

@dcramer (Member) commented Mar 31, 2023

I've not been able to review this in depth but I wanted to escalate the convo we're having in Slack.

I do not believe a session should be a trace, or vice versa. To me these are different constructs. I, as a user, expect a trace to represent a request cycle. That is, when I take an action, a trace occurs, and it ends when that action fully completes. That said, there are problems in all models.

To illustrate what I mean, I want to call out several cases that begin root traces:

  • I curl an API endpoint, it starts when the edge receives the request
  • I load a SPA, it starts when the browser initiates the request
  • I load Slack, on my Desktop, it starts when I load it
  • I start a match in League of Legends - or maybe I open the game client?

When do these traces end?

  • API endpoint is straightforward. Commonly it'll end when the response is returned (unless some async work happens)
  • SPA could end when the rendering completes, however, there may still be things going on. What happens with that instrumentation?
  • Slack, it could end when the app has fully loaded, but what about all the other tasks going on? Should background tasks be their own trace? Then they're sessions again. Should whoever's pushing data to Slack initiate the trace? You lose context.

My pov is all models are trash. You lose data if you make traces behave as expected, which is what I originally described here. Traces were not designed for long-lived applications and, frankly speaking, they do not work for them.

Should we throw it all away, call it something else, and just go with it? E.g. we could just call it Session Tracing and explicitly say that a trace, generally speaking, will live for an entire session? This has its own problems.

I load a SPA, or Slack, and keep it open for days. All those requests are a single trace.

  • How do I sample? Sampling an entire session is prohibitively expensive, thus a session still needs bounds.
  • How do I make sense of the trace data? There's an infinite number of segments/transactions inside of it. Segments solve for this but it's still an immense amount of complexity.
  • How do I make sense of interactions?

I'm sure there are more things I've forgotten to write up here, but the problem does not seem to have a right answer.

@mattjohnsonpint (Contributor) commented Apr 1, 2023

If we're really considering a new model, IMHO this would be a good time to tackle the time synchronization problem. Timestamps aren't enough, because each clock that generates them is synchronized independently. It's all too easy for a "distributed trace" to show nonsensical timestamps when viewed as a whole, such as server responses that complete before the request has been sent. I have some ideas here we could experiment with. Would this be welcome as part of this new model?

* capture entire traces
* browser tabs = a trace
* index and extract metrics on a span level
* clients send 100% of metrics
Member

Maybe this is in the TODO, but how does this proposal help with the "100% sending" or "memory-pressure/payload size" issues listed above? What about the current CPU overhead?

It is helpful for me to remember that our own sampling limits, even after dynamic sampling has been implemented, are often set extremely low for performance reasons: https://github.com/getsentry/sentry/blob/cc47128efd5a00e66b2475acc318ff57860fdf94/src/sentry/utils/sdk.py#L35

@JoshFerge (Member), Apr 3, 2023

Even with a local relay, I don't understand how we could ever have metrics sampled enough to be useful for SLAs/alerts without severely degrading the performance of medium-and-above throughput processes. It seems to me the only way to accomplish this is to collect metrics directly in the process.

Member

Ah, maybe this is what

extract metrics from spans pre-sampling.

is going to address?

@maggiepint

Sorry for the rather drive-by comments, but since I have a vested interest in better UI tracing, a couple of thoughts.

I want to note from experience that most SLO initiatives fail to meaningfully improve reliability in organizations. This is generally because SLOs are measured over individual services and endpoints, and that aggregation doesn't connect meaningfully to user experience. That's really the same problem @dcramer is calling out here.

There's a lot of opportunity to build a better model, but I think you'd have to think in somewhat platform-specific terms. If I live just in React in my head (though I think conceptually this fits the whole web platform), the meaningful measurements I would want for my users are:

  • Route/url change
  • Interactions that load new stuff inline (infinite scroll, drawers, modals, etc)
  • Events to server (like form posts)
  • Events from server (server sent events)

When you think about it - nearly all of these events will have some 'main' component/DOM element associated with them. Figuring out what that main component is is hard because they nest, but logically:

  • router loads
  • New suspense or error boundary loads (well-designed modals and drawers will have these; for less well-designed stuff this is tough)
  • Button or link clicks/form posts

All mark the start of a new transaction and most likely the end of the previous one. They also likely represent a logical 'feature' in terms an average product manager would understand.

Server-sent events are a wild card because they represent an almost parallel thread to your main UI thread - but I do feel they could almost always be looked at separately as they are async and at least shouldn't represent real-time interaction (the chat scenario being the weird counterexample that must always exist - but then each chat exchange can likely be thought of as its own transaction as well).

This might be possible through auto-instrumentation if you are willing to go into the frameworks (historically y'all are). I think you end up with sort of a 'transaction type' and then a 'transaction boundary' following this schema - where the 'transaction type' is 'route | suspense | button/form' and the boundary is the name of the associated thing (DOM ID/name, component name, etc.); a rough sketch follows below.
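
A rough sketch of that 'transaction type' / 'transaction boundary' idea; every name here is hypothetical rather than a proposed Sentry schema:

```typescript
// Hypothetical shape for UI transactions keyed by the boundary that started them.
type UITransactionType = "route" | "suspense" | "button/form";

interface UITransaction {
  type: UITransactionType; // which kind of UI event started the transaction
  boundary: string;        // name of the associated thing: DOM id, component name, etc.
  start_time: number;      // when the boundary became active
  end_time?: number;       // likely closed when the next boundary or route change starts
}

// Example: an inline drawer loading behind a suspense boundary
const drawerLoad: UITransaction = {
  type: "suspense",
  boundary: "SettingsDrawer",
  start_time: Date.now(),
};
```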

@weyert commented Apr 5, 2023

You might find the OpenTelemetry Client Instrumentation SIG group interesting:
https://docs.google.com/document/d/16Vsdh-DM72AfMg_FIt9yT9ExEWF4A_vRbQ3jRNBe09w/edit#heading=h.yplevr950565

From a quick scan I didn't see Sentry mentioned in it yet. I think it would be great to consider joining the group and solving this together.

@maggiepint You can have composite SLOs and give different weights to the underlying ones; you can set the compositeWeight in an OpenSLO definition, which allows for an end-to-end user journey.

number of spans of a certain category per segment and have these counters be
promoted into the segment.

## Batches
Member

I have an interest in a tracing model that actually works for batched workloads.

I'm trying to figure out how a tracing story could work for arroyo, and so far it seems that one would both:

  1. want a transaction wrapped around a function that operates on a batch
  2. want to be able to correlate that transaction to the origin of each message in that batch

So that, to me, implies that one transaction may have multiple parent traces.

That's the only solution I could make sense of, but it sucks for a lot of reasons, so I'm curious whether we have a better plan here.
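
For illustration only, one way this could be expressed is a segment that carries explicit references back to each originating trace. The `links` field below is hypothetical (loosely modeled on OpenTelemetry-style span links) and not something the RFC specifies:

```typescript
// A batch-processing segment that keeps a reference back to the originating
// trace of every message in the batch. Values are placeholders.
interface SpanLink {
  trace_id: string;
  span_id: string;
}

interface BatchSegment {
  trace_id: string;  // the consumer's own trace
  span_id: string;
  is_segment: true;
  op: string;        // e.g. "queue.process_batch"
  links: SpanLink[]; // one entry per message origin in the batch
}

const batch: BatchSegment = {
  trace_id: "trace-of-the-consumer",
  span_id: "span-1",
  is_segment: true,
  op: "queue.process_batch",
  links: [
    { trace_id: "trace-of-message-1", span_id: "producer-span-1" },
    { trace_id: "trace-of-message-2", span_id: "producer-span-2" },
  ],
};
```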
