-
-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combined Dynamic Sampling #111
Open
jan-auer
wants to merge
3
commits into
main
Choose a base branch
from
combined-dynamic-sampling
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,151 @@ | ||
- Start Date: 2023-08-29 | ||
- RFC Type: feature | ||
- RFC PR: <https://github.com/getsentry/rfcs/pull/111> | ||
- RFC Status: draft | ||
|
||
# Summary | ||
|
||
Reduce the number of profiles and replays dropped through dynamic sampling by | ||
changing how trace sampling decisions are computed. | ||
|
||
# Motivation | ||
|
||
Profiling and Replays are two Sentry features that heavily depend on connected | ||
traces with transactions and spans. These spans provide vital debugging | ||
information to the point where both features are considered incomplete without | ||
an attached trace. | ||
|
||
In practice, all of this data has to be sampled on the client (SDK) so that the | ||
SDK does not incur too much overhead on the measured application and does not | ||
saturate the network link to Sentry. For this, the SDK features three sample | ||
rates: | ||
|
||
1. `tracesSampleRate`: Defines the percentage of traces captured and sent to | ||
Sentry. | ||
2. `profilesSampleRate`: Defines the percentage of **traces** for which a | ||
profile will be started and sent to Sentry. | ||
3. `replaysSessionSampleRate`: Defines the percentage of **sessions** for which | ||
a replay will be started and sent to Sentry. Sessions usually outlive traces. | ||
|
||
Since independent sample rates multiply, this creates a situation at low sample | ||
rates for Replays where the overlap between sampled traces and sampled Replays | ||
is too low. Since server-side sampling additionally drops traces, this leaves a | ||
vanishingly small number of replays with traces stored. | ||
|
||
Profiles have addressed this by coupling the client-side sample decision to | ||
traces. The profile sample rate is a subset of the traces sample rate, so | ||
profiling runs only in sampled traces. This ensures that 100% of profiles have a | ||
trace attached. However, traces with profiles are still subject to server-side | ||
sampling. Since the server-side sample decision is decoupled from the client, | ||
the number of stored traces with profiles can still be too low, therefore. | ||
|
||
Finally, there is no connection between sampled replays and sampled profiles. A | ||
trace with a profile does not have a higher chance of having a replay captured | ||
or vice versa. | ||
|
||
# Objective | ||
|
||
The goal of this RFC is to increase the affinity of replays and profiles to | ||
stored (indexed) traces so that the likelihood of capturing traces with **both** | ||
replays and profiles is maximized. | ||
|
||
At the same time, the impact on effective dynamic sampling rates should be | ||
minimized so that the number of stored events still closely tracks the target | ||
fidelity. | ||
|
||
# Background | ||
|
||
Sampling on SDKs and the server is each based on two components: | ||
|
||
1. A random number between `0.0` and `1.0`, called "sample seed". | ||
2. A cutoff percentage, called "sample rate". | ||
|
||
The sampler compares the seed to the rate: If the seed is lower, the data is | ||
sampled, otherwise it is dropped. | ||
|
||
Sampling can be performed multiple times in series, either with the same rate or | ||
with different rates. The outcome of this depends primarily on how the sample | ||
seed is handled: | ||
|
||
- Using different seeds effectively multiplies the sample rates. For example | ||
_10% x 5% = 0.5%_. This is what happens between client-side and server-side | ||
sampling today. | ||
- Reusing the same seed repeats the sampling decision. This is how dynamic | ||
sampling operates in ingestion without a need for synchronization between | ||
transactions or spans. The result is the lowest of all the applied sample | ||
rates. | ||
|
||
The second option introduced above is also referred to as "coupled sampling", | ||
since the sample decisions are coupled to each other. It can also be applied to | ||
correlated data of different kinds each with their own sample rate. This creates | ||
an interesting effect: Data sampled with the lower sample rate always has the | ||
other kind of data available, since the decision is coupled. | ||
|
||
![coupled-sampling](https://github.com/getsentry/rfcs/assets/1433023/5cf1dca7-b2af-42e2-83ca-231687b12e37) | ||
|
||
# Coupled Sampling Decision | ||
|
||
This RFC proposes to couple the client-side sample rates of sparse data | ||
(profiles and replays) between each other and with server-side dynamic sampling: | ||
|
||
1. When the trace starts, the SDK computes a sample seed for all sparse data. | ||
2. Profiles and replays are recorded if their respective sample rates are higher | ||
than the sparse data sample seed. | ||
3. The sparse data sample seed is placed on the DSC and propagated to all | ||
connected SDKs, as well as sent in Envelope headers to Sentry. | ||
4. If available, Sentry uses the sparse data sample seed to compute the | ||
server-side dynamic sampling decision instead of the trace ID. If the seed is | ||
not present in the DSC, the sever applies dynamic sampling like before. | ||
|
||
With this strategy, profiles and replays are sampled in common traces as much as | ||
the sample rates permit, and dynamic sampling will prefer to store traces with | ||
all data available. | ||
|
||
## Sub-Trace Sampling | ||
|
||
Sparse data that is existentially bound to traces is only collected if the trace | ||
is sampled. Therefore, the SDK first always computes the trace sampling | ||
decision. Only if that decision is positive, the SDK proceeds to check the | ||
sparse sample seed against the respective sample rate. In practice, this makes | ||
sample rates for spase data are "relative" to the traces sample rate. | ||
|
||
Profiles already follow this model. To support coupled sampling for Replays, the | ||
same shall be applied to sampling of Replay sessions: The replay sampling | ||
decision should become coupled to the one of the trace, and the sample rate is | ||
relative to the traces sample rate. | ||
|
||
**TBD:** Decision on how to update SDK options. | ||
|
||
## Implications on Trace Sampling | ||
|
||
The trace sampling decision is intentionally **not** based off the sparse data | ||
sample seed. SDKs continue to use the existing sampling and propagation strategy | ||
for trace sampling decisions. The sample seed for traces does not have to be | ||
synchronized between SDKs, since in this case the sampling decision suffices. | ||
|
||
Keeping these seeds separate is important so that server-side dynamic sampling | ||
still computes an independent sample decision for traces. | ||
|
||
# Drawbacks | ||
|
||
The sample seed directly determines the server-side sampling decision. If a | ||
client tampers with the seed, the result can be significant over or under | ||
sampling. | ||
|
||
# Unresolved questions | ||
|
||
1. How should the `replaysSessionSampleRate` SDK option be handled following | ||
this RFC? It can stay absolute, in which case the SDK must compute the factor | ||
to the traces sample rate. If it changes or a new option is introduced, users | ||
need to update their SDK options. | ||
2. How should the `replaysOnErrorSampleRate` behave in comparison to the above? | ||
3. Replays are sampled across entire sessions. See the [RFC | ||
0109](https://github.com/getsentry/rfcs/pull/109)) for a discussion on | ||
session sampling. If session sampling is adopted, the spare data sample seed | ||
should be propagated across the entire session. | ||
|
||
# Discarded Options | ||
|
||
- Creating a Replay bias constrains the ability to apply other forms of dynamic | ||
sampling, leads to lower quality traces, and in practice causes significant | ||
over-sampling. |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section will be revisited. Following a conversation with @mydea, we will keep the replay sample rate completely independent of trace sampling (however subject to RFC 109). I'll write up a revision of this section that takes necessary changes into account.