
Introduce additional types of exceptions next to mechanism.handled exceptions #10

Open · ueman wants to merge 23 commits into main
Conversation

@ueman commented Sep 17, 2022

This RFC suggests a feature which introduces additional types of exceptions next to mechanism.handled.

Currently, exceptions which cause the running software to exit are marked as handled: false. This isn't enough for SDKs where an exception can be unhandled but at the same time doesn't cause the software to exit. The aim of this RFC is to fix that.

Rendered RFC

Disclaimer: I'm not a Sentry employee. However, this is an issue I would really love to see fixed, and I was directed here to write up my ideas. I hope that's okay.

This PR also looks a bit related to this issue: getsentry/relay#306

@marandaneto changed the title from "Introduce additional types of exceptions next to unhandled exceptions" to "Introduce additional types of exceptions next to mechanism.handled exceptions" on Sep 18, 2022
@marandaneto marked this pull request as draft on September 18, 2022 07:36
ueman and others added 5 commits September 18, 2022 09:36
Co-authored-by: Manoel Aranda Neto <[email protected]>
@marandaneto (Contributor)

Thanks for the detailed message, @lobsterkatie.
@smeubank and Daniel Khan will discuss this in the next planning session.


Currently, there's an unhandled label on the issue's page but it's only highlighted for process termination errors (if `handled: false`).

In order to propagate those exception types, the exception mechanism needs to be adapted:
Review comment (Member):

I like this option more because it is backwards compatible.
But I would not add another boolean: the new "process_terminated" variable is only needed because "handled" is a boolean, which is not very future proof.
So I would add a number/enum kind of type with a more general name like error_state, which could then have values like process_terminated, timeout, deadlock, caught-fire, ...
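
For illustration, a minimal sketch of what such an enum-based mechanism could look like. All names here (error_state, the value list, the interface shape) are assumptions made up for this sketch, not part of the current Sentry protocol:

```ts
// Hypothetical shape of the exception mechanism if the boolean `handled` flag
// were complemented by a more general enum, as suggested above.
// All names are illustrative, not part of the current Sentry protocol.
type ErrorState =
  | 'handled'             // caught and reported by user code
  | 'unhandled'           // reached a global handler, but the process kept running
  | 'process_terminated'  // the error terminated the process (a "real" crash)
  | 'timeout'
  | 'deadlock';

interface ExceptionMechanism {
  type: string;             // e.g. 'onerror', 'onunhandledrejection'
  handled?: boolean;        // kept for backwards compatibility
  error_state?: ErrorState; // the proposed, more future-proof field
}
```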

@antonpirker (Member)

One general question: How can we send data to Sentry when the process terminates? Is this even possible? (Can we write the error to disk fast enough to send it on the next start?)

@antonpirker (Member)

And @lobsterkatie, could this new thing maybe catch real browser crashes via https://web.dev/reporting-api/?

@marandaneto (Contributor)

One general question: How can we send data to Sentry when the process terminates? Is this even possible? (Can we write the error to disk fast enough to send it on the next start?)

yes
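
For context, a minimal sketch of the general "persist on crash, flush on next start" pattern that answer refers to. This is not Sentry's actual implementation; the file path and the sendToSentry helper are placeholders:

```ts
import * as fs from 'fs';

// Placeholder transport; a real SDK would POST the event to its ingest endpoint.
function sendToSentry(event: unknown): void {
  console.log('sending previously persisted event', event);
}

const PENDING_FILE = '/tmp/pending-crash-event.json'; // illustrative path

// 1. When the process is about to die, persist the error synchronously so the
//    write completes before the process exits.
process.on('uncaughtException', (err) => {
  fs.writeFileSync(
    PENDING_FILE,
    JSON.stringify({ message: err.message, stack: err.stack, timestamp: Date.now() })
  );
  process.exit(1);
});

// 2. On the next start, flush anything a previous crash left behind.
if (fs.existsSync(PENDING_FILE)) {
  const pending = JSON.parse(fs.readFileSync(PENDING_FILE, 'utf8'));
  sendToSentry(pending);
  fs.unlinkSync(PENDING_FILE);
}
```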

@cleptric self-requested a review on September 30, 2022 13:06
@lobsterkatie (Member)

And @lobsterkatie can this new thing catch real browser crashes: web.dev/reporting-api maybe?

Not really. Or, rather, technically yes, but not in all browsers, and not with really any useful data. The problem is, it's not a hook like onerror is, it's just the ability to set a URL to receive a set data payload. (It's the same way CSP reports work.) And in the case of a crash, that data payload you get is pretty underwhelming:

[screenshot: the crash report payload received via the Reporting API]

(You can test it out by using this demo site.)

@smeubank (Member)

Transferring questions from Eran Arkin:

A few questions:

  1. What splitting (crashed/not crashed) will allow the user to achieve? I understand it will impact the health metrics, but will it let them fix issues more easily? To triage issues quicker? Alert the right people?
  2. Do we have an estimate on what % of our customer base this impacts split by free/paid? Are there workarounds today that people are using that we can gauge the interest level?
  3. I’m not sure what’s the cost of piping this change to the UI (i.e., filters, search, alerts, etc.). So that’s something we will need to budget.

Points 1 & 3, I think, developers here can answer from experience working with and on Sentry.

Point 2: the % of the customer base affected would be all platforms where the boolean doesn't really apply. In theory, they would all be impacted.

@marandaneto (Contributor)

  1. What splitting (crashed/not crashed) will allow the user to achieve? I understand it will impact the health metrics, but will it let them fix issues more easily? To triage issues quicker? Alert the right people?

This is answered in the RFC already, but I can elaborate more:
You'd be able to prioritize issues better (crashes > unhandled > handled) and ignore less important issues (cut noise).
Because issue importance would be explicit, you could triage quicker and alert only on the important ones, instead of being alerted for something not so important, or not being alerted at all.

  3. I’m not sure what’s the cost of piping this change to the UI (i.e., filters, search, alerts, etc.). So that’s something we will need to budget.

The changes on the SDK side should be very quick, but I can't estimate filters, search, alerts, etc.; that's not part of my scope.

@ueman (Author) commented Oct 24, 2022

  1. What splitting (crashed/not crashed) will allow the user to achieve? I understand it will impact the health metrics, but will it let them fix issues more easily? To triage issues quicker? Alert the right people?

Yep. Currently the problem is that our app basically never crashes (as in the app quits; that's how the platform works), so there's no easy way to differentiate between exceptions which we caught ourselves vs. those that were unhandled but didn't crash the application (as in quitting the app). And those are the ones we're interested in fixing.

  2. Do we have an estimate on what % of our customer base this impacts split by free/paid? Are there workarounds today that people are using that we can gauge the interest level?

No clue, since I'm not working for Sentry but for a company which is a paying enterprise customer. The workaround is filtering based on the mechanism with which the exception was reported, but that requires that the SDK actually adds a mechanism in those cases, and it also requires a probably unrealistically good understanding of the SDK and UI. I just happen to have contributed a lot of code to Sentry SDKs, so I'm not a good example.
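
For illustration, a minimal sketch of the kind of mechanism-based filtering such a workaround relies on, here done client-side with the JS SDK's beforeSend hook. The specific type value checked ('generic') is what the JS SDK commonly attaches to manual captures, but treat it as an assumption; other SDKs use different values, and the drop policy shown is just an example:

```ts
import * as Sentry from '@sentry/browser';

Sentry.init({
  dsn: '__YOUR_DSN__', // placeholder DSN
  beforeSend(event) {
    // The mechanism the SDK attached (e.g. 'onerror', 'onunhandledrejection',
    // or 'generic' for manual captures) is currently the main hint for telling
    // "caught by us" apart from "unhandled but non-fatal".
    const mechanism = event.exception?.values?.[0]?.mechanism;

    // Example policy (an assumption, not a recommendation): drop manually
    // captured, handled errors to cut noise, and keep everything else.
    if (mechanism && mechanism.type === 'generic' && mechanism.handled) {
      return null;
    }
    return event;
  },
});
```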

Since Sentry is just starting to push into areas where this makes a difference (the Flutter, Unity and React Native SDKs are relatively new compared to the pure Android and iOS SDKs), this is also a growth opportunity and not just an improvement opportunity.

@lobsterkatie (Member)

  1. What splitting (crashed/not crashed) will allow the user to achieve? I understand it will impact the health metrics, but will it let them fix issues more easily? To triage issues quicker? Alert the right people?

I'd echo everything @marandaneto and @ueman have said, and add that for JS, what the customer gains is data which isn't totally bogus, which feels like a good thing in and of itself. I/we would have fixed the data long ago if there weren't UI/product consequences, but even then it would have been only semi-accurate, because the reality is that there really is a distinction between crashes and unhandled errors. If we were to fix the booleans to be the correct booleans tomorrow, people's crash rates on their releases would skyrocket, because all sorts of things we're incorrectly marking as handled would get marked as unhandled and therefore count as crashing the app, even though that's not reflective of reality. (There's the added complication I detailed above of what exactly counts as a crash in JS, especially on the browser side, but at least if there were three options, there'd be the possibility of reporting accurate data, and then it'd just be on us SDK devs to figure out what to send when.)

  2. Do we have an estimate on what % of our customer base this impacts split by free/paid? Are there workarounds today that people are using that we can gauge the interest level?
    Point 2: the % of the customer base affected would be all platforms where the boolean doesn't really apply. In theory they would all be impacted.

There's no real workaround here. As for impact, I don't know the total numbers, but the mere fact that it affects JS SDK users already means we're talking about a plurality of our customers, both paid and unpaid. I wouldn't be surprised if adding in all of the other affected platforms pushed us past 50%.

@lobsterkatie (Member) commented Oct 27, 2022

FWIW, I wrote up the JS-specific parts of this issue here: getsentry/sentry-javascript#6073

The JS SDK team chatted and agreed that as a first step, we will start sending improved (even if not perfect) data somewhere else (TBD where) in the event, so that a) the scale of the problem can be quantified, and b) any backend or UI folks who do eventually pick up that end of the work have example data to work with. (This will likely be somewhere viewable in the JSON but not directly in the UI for the moment, since for now it's really just analytics.)

I'll update here once exactly where and exactly what we'll send is hammered out. Once we do that, maybe other SDKs can do a similar thing.

@szokeasaurusrex (Member)

I agree in general with your idea of having three separate categories to split errors recorded manually by users, errors recorded by Sentry auto-instrumentation, and errors which crash the program.

However, I think we should be taking a step back here to reconsider our terminology and how we indicate error severity to users. The current handled/unhandled terminology is somewhat misleading in my opinion, since at least in the Python SDK that I am most familiar with, it does not actually indicate whether the exception gets caught, but rather whether the exception was reported manually by the user or by the SDK itself via an instrumentation. Furthermore, the UI visually indicates "unhandled" exceptions as more severe than "handled" ones, and I question whether this is necessarily the case.

For this reason, I propose changing the handled/unhandled terminology to something more meaningful, and I also propose creating a separate measure of error severity.

Capturing mechanism

For the capturing mechanism, we should have two categories to separate manually and automatically captured errors. These would replace the current labels of "handled" and "unhandled," respectively.

  • Manually captured: Errors which users record via Sentry.capture* methods.
  • Automatically captured: Errors which the Sentry SDK recorded automatically.

Unlike the current “handled” and “unhandled” categories, automatically captured errors would not necessarily be marked as higher severity than manually captured ones.

Severity levels

To indicate error severity, we should have a separate concept of error severity. I propose three error severity levels as follows, but am open to other suggestions:

  • Low: Errors which are less likely to be caused by a bug, and more likely to be caused by user error, for example a 404 error returned by a web server. This category would be the default for manually captured errors, but users would be able to set a high severity on a per-event basis, instead.
  • High: Errors which likely indicate a bug in the code or some other more serious issue that developers need to fix, for example a 500 error returned by a web server. This category would likely be the default for most types of automatically captured errors, but users would be able to set a low severity on a per-event or per-error-type basis, instead.
  • Critical: Errors which crashed the process. This category would be reserved for crashes; i.e. users would be unable to manually mark an error as critical, and errors which crash the process would always be marked as critical regardless of configurations.

This separate severity level mechanism would be more flexible, since users would have the option to set a different severity level for certain errors or configure a different default level for certain classes of errors, instead of tying the severity level to how the error was captured.
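
To make the proposal above concrete, a minimal sketch of how the two dimensions could be modeled. All names here (capture, severity, and their values) are hypothetical and only illustrate the idea; they are not part of any Sentry schema:

```ts
// Hypothetical model for the proposal above: capture mechanism and severity are
// two independent dimensions instead of a single handled/unhandled boolean.
type CaptureMechanism = 'manually_captured' | 'automatically_captured';
type Severity = 'low' | 'high' | 'critical'; // 'critical' reserved for crashes

interface ProposedErrorClassification {
  capture: CaptureMechanism;
  severity: Severity;
}

// A 404 the user reports manually, defaulting to low severity:
const manual404: ProposedErrorClassification = {
  capture: 'manually_captured',
  severity: 'low',
};

// A crash detected by the SDK: always critical, regardless of configuration.
const processCrash: ProposedErrorClassification = {
  capture: 'automatically_captured',
  severity: 'critical',
};
```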

@Lms24 (Member) commented Sep 18, 2023

I generally like the idea of moving away from handled/unhandled. Two things:

For the capturing mechanism, we should have two categories to separate manually and automatically captured errors. These would replace the current labels of "handled" and "unhandled," respectively.

For the mechanism, we already have the type field. Maybe the way to go is to standardize what values we set for the respective cases and leave handled unset by default.

we should have a separate concept of error severity

I guess severity levels are hard to define well across different platforms. For example, in browser JS we can't detect a full crash, and it basically never happens. Also, besides HTTP error responses (which not all of our SDKs capture), it's probably hard to distinguish between high and low reliably.

Whatever we do here, if we want to solve this for the entire product, we need to coordinate this change with how we set the session status and, based on that, how we calculate release health. Maybe the key here is adapting the calculation to a certain platform, so that it makes (somewhat) sense for web, mobile and backend/server projects respectively.

@cleptric assigned cleptric and unassigned lobsterkatie and therealarkin on May 6, 2024