SIGTRAP Android freeze #1982

Closed
LeviWilliams opened this issue Nov 13, 2023 · 51 comments
Labels
bug Something isn't working

Comments

@LeviWilliams

Description

Recently we upgraded our Skia version from 0.1.197 to 0.1.214 and now we are seeing a bunch of "SIGTRAP: Trace/breakpoint trap" in production on Android devices with a variety of versions. Currently on react-native 0.72.3 if helpful.
Screenshot 2023-11-13 at 3 23 19 PM.

We reproduced it a couple of times on a Pixel 6, and the app just freezes. I understand this error is vague, and we currently have no leads as to why it happens after updating the library. If we are able to provide a repro, we will.

Let us know if there are any ideas on what we can try, thanks for the help as always.

Version

0.1.214

Steps to reproduce

Snack, code example, screenshot, or link to a repository

@LeviWilliams LeviWilliams added the bug Something isn't working label Nov 13, 2023
@wcandillon
Contributor

We would definitely need a reproducible example and also a sense of the APIs which are used (Reanimated version, Skia animations, etc)

@laurens-lamberts

Hi @wcandillon, @LeviWilliams,

We also experience this crash in production. So much - unfortunately - that we received a warning from Google regarding 'Android vitals bad behavior'. Our app will become less discoverable and receive a warning at the store page if this crash is not resolved soon.

We currently have no clue when/where specifically this crash occurs. Therefore we cannot provide a reproducible example at the moment.
For users who experience the crash, we notice that it occurs only once per app update. Subsequent sessions are not affected most of the time.

For our app, in the last 7 days 1.5k users experienced 1.7k crashes of the SIGTRAP type, originated from librnskia.so.

Two crashes occur, both reported as SIGTRAP, each accounting for about 50% of the total occurrences:

[split_config.arm64_v8a.apk!librnskia.so] SkTDPQueue<GrGpuResource*, &GrResourceCache::CompareTimestamp(GrGpuResource* const&, GrGpuResource* const&), &GrResourceCache::AccessResourceIndex(GrGpuResource* const&)>::remove(GrGpuResource*)

and

[split_config.arm64_v8a.apk!librnskia.so] GrResourceCache::notifyARefCntReachedZero(GrGpuResource*, GrIORef<GrGpuResource>::LastRemovedRef)
SIGTRAP

Some more details

100% foreground crashes, spread across Android versions and devices in line with usage.

Versions used in the context of the above;
Skia: 0.1.221
Reanimated: 3.5.4
React-native: 0.72.7

With Skia 0.1.210: No significant crashes
With Skia 0.1.214: many SIGSEGV crashes, and also some SIGTRAP crashes
With Skia 0.1.221: many SIGTRAP crashes, no more SIGSEGV crashes.

Hope this helps tracking down the issue.
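
The version history above bounds where the regression was introduced. As a minimal sketch (the helper names are illustrative and not part of react-native-skia), the window can be derived mechanically from such observations:

```typescript
// Crash observations reported above for one app (not an official compatibility table).
type Observation = { version: string; crashes: boolean };

const observations: Observation[] = [
  { version: "0.1.210", crashes: false },
  { version: "0.1.214", crashes: true },
  { version: "0.1.221", crashes: true },
];

// Compare dotted numeric versions such as "0.1.210" part by part.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Returns the newest version without crashes and the oldest version with crashes.
function regressionWindow(obs: Observation[]): { lastGood?: string; firstBad?: string } {
  const sorted = [...obs].sort((x, y) => compareVersions(x.version, y.version));
  const good = sorted.filter((o) => !o.crashes).map((o) => o.version);
  const bad = sorted.filter((o) => o.crashes).map((o) => o.version);
  return { lastGood: good[good.length - 1], firstBad: bad[0] };
}

const regressionResult = regressionWindow(observations);
console.log(regressionResult);
```

With these three data points the window is 0.1.210 to 0.1.214. Adding observations from other apps only narrows it if those apps exercise the same code paths.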

@wcandillon
Contributor

@laurens-lamberts Thank you for this precious data. I hope that we can get this sorted out as soon as possible. I need to review things more carefully on my side, but from 0.1.210 to 0.1.214, the only update that I am seeing in the native code is the Skia version upgrade. I could do another upgrade to see if this helps (alternatively, we could downgrade as well to see if that helps).

The error seems to be Skia specific, I will investigate this a bit deeper and let you know if I find anything.

@wcandillon
Contributor

Strangely enough, I cannot find any relevant change from 0.1.214 to 0.1.220.

Going forward, we will set up some kind of release program to check whether we introduce such regressions in releases.

We have a RN Skia client of approximately the same scale as you running 0.1.213 (has no crash reports). I will contact them about the issue and see if there is a way to maybe just try to upgrade to m119 in an isolated manner.
This client is not using any of the recently deprecated APIs.
The holidays may make things a bit slower there but I will report back.

@laurens-lamberts I suggest we do the following:

  • I'll keep doing surface level investigation to see if we can get any interesting lead
  • Let's upgrade the app to use Skia m121 and also not use the deprecated APIs, and see if the crash still occurs.
  • If it does, we will make a custom package that is the latest RN Skia but uses m116 (I believe it wouldn't be an excessive amount of work to make such a package).
  • If that fixes the issue, we will discuss how to proceed.

@laurens-lamberts

Thanks a lot @wcandillon, we're really happy with your support proposal on this issue.
It means a lot to us, and we are motivated to help tracking down the issue.

Due to the high impact when issues arise during deployment in the Christmas / New Year period, we will postpone our next releases to January. For the upcoming release of our project we upgraded to the following library versions (all latest):

"react-native": "0.73.1",
"react-native-reanimated": "3.6.1",
"@shopify/react-native-skia": "0.1.230",

If any new versions of the above packages appear before our release, we will update to ensure having the latest of all.

We always phase our releases, so as soon as we have insight into crash rates we will share them with you. This will likely be at the end of January / beginning of February.

For my information, where in the react-native-skia library do I find the reference to the internal Skia version number (like m121)?

@wcandillon
Contributor

This is where you can find the Skia version used: https://github.com/Shopify/react-native-skia/blob/main/.gitmodules
In the built package, I don't believe this information is available; that's something we could potentially add if you would find it useful.

I will continue to investigate this a bit and also do the upgrade to m121 and we can tackle this more aggressively after the holidays. I think that we are lucky to have a Skia client that uses 113 at scale but with only non-deprecated API, that will give us a lot of information once/when they deploy 114 and above.

@laurens-lamberts

Yes, that's great. Looking forward to hearing their experiences with later versions.
Thanks for showing me where to find the Skia version used. Maybe we can use that in combination with the release notes of skia to troubleshoot some issues in the future.

@espenjanson

Any updates on this, or ways to resolve it, @laurens-lamberts @wcandillon @LeviWilliams? Anything we can do to help? We just had to downgrade Skia to 196 because of thousands of crashes in production due to this error. It would be awesome to be able to upgrade, since we want to move on to RN 0.73 (which according to the release notes is not fully supported until 213) 🙏

@wcandillon
Contributor

@espenjanson Yes anything that would help to reproduce the issue or more details on the conditions of the crash would be extremely useful. I'm surprised you are on 196 because 197 notoriously fixes a crash related to animations.

We have been coordinating with @laurens-lamberts to find the root cause of the issue but without success yet. We have a large client who's running the latest version of Skia without any crashes (this same client had a large amount of crashes in production with 196). This means that the issue is likely related to a particular API but we haven't been able to identify it yet.

@wcandillon
Contributor

@espenjanson could you send me a list of Skia APIs and components you are using? You can do it privately as well by email.

@espenjanson

espenjanson commented Jan 18, 2024

@wcandillon thanks for quick response. We'd love to help in any way we can. Any chance you could provide the package.json (or at least parts of it, such as react-native and reanimated version and perhaps other libraries that could affect skia)?

If you want to, we can send you a minimal functioning app project with our crashing package.json and all the components we have that use Skia. Can put together a zip or a repo, whatever works better for you. If needed, we can also provide more detailed stack traces from Google Play and Sentry.

Will put the team on this immediately. Thanks a million for paying attention to this!

@Nodonisko
Contributor

@wcandillon We have exactly the same issue, with quite a large number of crashes with the exact same error messages, which started to appear after we updated Skia in January 2024.

[split_config.arm64_v8a.apk!librnskia.so] GrResourceCache::notifyARefCntReachedZero(GrGpuResource*, GrIORef<GrGpuResource>::LastRemovedRef)
SIGTRAP
[split_config.arm64_v8a.apk!librnskia.so] SkTDPQueue<GrGpuResource*, &GrResourceCache::CompareTimestamp(GrGpuResource* const&, GrGpuResource* const&), &GrResourceCache::AccessResourceIndex(GrGpuResource* const&)>::remove(GrGpuResource*)
SIGTRAP

My guess is that it's not possible to create a 100% reproducible example because the crash is quite random, but it's probably related to Skia + animations. I will try to build an app that uses the same features as our production app and runs those animations in a forever loop, and hope it crashes after some time. I will also try mounting and unmounting our components very quickly in a forever loop.

I will let you know if I find something.

@wcandillon
Contributor

@Nodonisko is this on the latest version? It looks like this may have been fixed after the latest Skia version upgrade.

@laurens-lamberts

Hi @wcandillon,

We are live on 1.1.0 of react-native-skia and still experience the crash. Is this already using the latest Skia version?
80% of our Android crashes are the SIGTRAP one from librnskia.so, and it drops our crash-free rate to 98.23 at the moment. iOS is very stable. 99.91% crash-free for us.
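
To put those two figures together, here is a rough back-of-the-envelope sketch. It assumes that removing a share of crashes shrinks the crash-affected percentage by the same share, which ignores users or sessions that hit more than one crash type, so the result is only an estimate:

```typescript
// Figures quoted above: 98.23% crash-free on Android, 80% of crashes are the SIGTRAP.
const crashFreePercent = 98.23;
const sigtrapShare = 0.8;

// Assumption: the crash-affected percentage scales linearly with the number of crashes.
function projectedCrashFree(currentPercent: number, shareRemoved: number): number {
  const affected = 100 - currentPercent;
  return 100 - affected * (1 - shareRemoved);
}

const projected = projectedCrashFree(crashFreePercent, sigtrapShare);
console.log(projected.toFixed(2));
```

Under that assumption, fixing the SIGTRAP alone would bring the Android crash-free rate to roughly 99.65%, still below the 99.91% reported for iOS.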

@wcandillon
Contributor

I'm slowly formulating a plan to tackle this issue.
As long as we cannot reproduce the issue, this would require us to deploy an unreleased version of RN Skia to a segment of users to see whether it solves the issue or not. Would this be reasonable?

Is there a sense of which screen the crash is happening on? This would allow us to narrow down the API/code that might be faulty.

@Nodonisko
Contributor

@wcandillon We are not on the latest version; ours is about two months old. We will update it, but it will take us another month to test it in production.

In the meantime, we were quite lucky and one of our testers managed to catch the crash on video. It's not very helpful, but at least we know on which screen it's happening. Sadly, it's a screen full of Skia components and animations :D

I will try to prepare a standalone app from that screen so we can try to reproduce it in a more isolated environment. I hope to have this done today or tomorrow.

screen-20240527-115456.mp4

@wcandillon
Contributor

wcandillon commented May 28, 2024 via email

@Nodonisko
Contributor

Sadly, it's the production version.

@Nodonisko
Contributor

Nodonisko commented May 29, 2024

So I spent most of yesterday trying to reproduce the issue. I created a special version of our home screen that runs animations in a loop, mounts and unmounts components, etc., and let it run on two different Android devices for 30 minutes a few times. I also tried both production and debug builds. So far I have not gotten a single crash...

I also noticed that there is one single crash in our Google Play console with nearly the same signature, [libhwui.so] GrResourceCache::notifyARefCntReachedZero(GrGpuResource*, GrIORef<GrGpuResource>::LastRemovedRef). It's not SIGTRAP but SIGSEGV, and it actually has a stack trace:

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
pid: 0, tid: 16686 >>> io.trezor.suite <<<

backtrace:
  #00  pc 0x0000000000558aac  /system/lib64/libhwui.so (GrResourceCache::notifyARefCntReachedZero(GrGpuResource*, GrIORef<GrGpuResource>::LastRemovedRef)+364)
  #01  pc 0x0000000000562074  /system/lib64/libhwui.so (GrTextureProxy::~GrTextureProxy()+96)
  #02  pc 0x000000000056220c  /system/lib64/libhwui.so (virtual thunk to GrTextureProxy::~GrTextureProxy()+40)
  #03  pc 0x0000000000686fdc  /system/lib64/libhwui.so (SkImage_Gpu::~SkImage_Gpu()+24)
  #04  pc 0x0000000000299924  /system/lib64/libhwui.so (android::uirenderer::AutoBackendTextureRelease::unref(bool)+108)
  #05  pc 0x000000000029a110  /system/lib64/libhwui.so (android::uirenderer::DeferredLayerUpdater::destroyLayer()+188)
  #06  pc 0x000000000029ac14  /system/lib64/libhwui.so (android::uirenderer::DeferredLayerUpdater::detachSurfaceTexture()+28)
  #07  pc 0x0000000000284014  /system/lib64/libhwui.so (std::__1::__function::__func<decltype(fp()) android::uirenderer::WorkQueue::runSync<android::uirenderer::renderthread::RenderProxy::dumpProfileInfo(int, int)::$_28>(android::uirenderer::renderthread::RenderProxy::dumpProfileInfo(int, int)::$_28&&)::'lambda'(), std::__1::allocator<decltype(fp()) android::uirenderer::WorkQueue::runSync<android::uirenderer::renderthread::RenderProxy::dumpProfileInfo(int, int)::$_28>(android::uirenderer::renderthread::RenderProxy::dumpProfileInfo(int, int)::$_28&&)::'lambda'()>, void ()>::operator()() (.2a0230ca9784b3ed733f337e97c21a2e)+92)
  #08  pc 0x0000000000274bac  /system/lib64/libhwui.so (android::uirenderer::WorkQueue::process()+588)
  #09  pc 0x00000000002951ac  /system/lib64/libhwui.so (android::uirenderer::renderthread::RenderThread::threadLoop()+416)
  #10  pc 0x0000000000013414  /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+424)
  #11  pc 0x00000000000ba598  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208)
  #12  pc 0x0000000000053f3c  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+68)

I am not sure if it's related or it's some completely unrelated random crash with similar signature, but it's suspicious.

Any ideas what we can try next? We can try to deploy a special version of Skia from @wcandillon in our next release if that could help, but it will take another month to roll this out to our users.

I will also try to narrow down in which version the crash first occurred.

@Nodonisko
Contributor

So, according to our Sentry, the crash first appeared after the update from 0.1.188 to 0.1.216.

I also found this issue in the Skia issue tracker: https://issues.skia.org/issues/333423686

@wcandillon
Contributor

@Nodonisko Does https://issues.skia.org/issues/333423686 look like it could be related? It wasn't clear to me from looking at the bug report.

@Nodonisko @laurens-lamberts @LeviWilliams I would like to find some way to reproduce the issue (even in release mode) and/or pin down the scenario in which the error happens. I wrote this example that stresses the Skia APIs, and I also tested it in release mode: https://github.com/user-attachments/files/15758252/StressTest.zip Please let me know how I could update the test scenario to better match your circumstances.

In #2396, we are experiencing a clear race condition which we are currently investigating and that might shed some light on what is happening.

@Nodonisko
Contributor

Nodonisko commented Jun 26, 2024

@wcandillon The error in https://issues.skia.org/issues/333423686 looks very similar to ours, but I'm not sure it's related.

About the stress test: I ran it on my two devices in debug mode, with no crash so far.

I also went through the Sentry data, and it seems the crash very often (though not exclusively) happens when the user navigates to a new screen, both when the previous screen is unmounted (like a tab change) and when a new screen is pushed onto the stack. This leads me to think that it probably happens when some Skia component is mounted.

@Nodonisko
Contributor

We just released a new version of our app with Skia 1.3.7, and this issue still persists. My colleague just got this crash when he made a hover gesture over our graph (Revolut-style graph animation).

@alexnaiman

Hello @wcandillon @Nodonisko ! Any updates here?

We're still experiencing both issues on react-native-skia version 1.3.9.

[split_config.arm64_v8a.apk!librnskia.so] SkTDPQueue<GrGpuResource*, &GrResourceCache::CompareTimestamp(GrGpuResource* const&, GrGpuResource* const&), &GrResourceCache::AccessResourceIndex(GrGpuResource* const&)>::remove(GrGpuResource*)
[split_config.arm64_v8a.apk!librnskia.so] GrResourceCache::notifyARefCntReachedZero(GrGpuResource*, GrIORef<GrGpuResource>::LastRemovedRef)

Any workaround or solution to help us mitigate these would be greatly appreciated, as we’re currently seeing hundreds of crashes related to both problems

Could we consider downgrading to version 196/210? @espenjanson /@laurens-lamberts mentioned in this thread that it helped them, though I don’t see this as a sustainable long-term fix, even if it does work. (Also, @espenjanson / @laurens-lamberts , did it actually resolve the issues? Have you encountered any more problems since?)

other relevant information:

"react-native": "0.74.2",
"react-native-reanimated": "3.12.1",

Skia API used:
These APIs were in use even before we started seeing the crashes in the Play Console:

  • Canvas
  • LinearGradient
  • Rect
  • vec
  • Oval
  • Blur

These APIs were used on the latest release, the one where we started seeing the crashes. We created a new component that is used on several lists (FlashList, FlatList, SectionList). Each list has somewhere around 10-50 items

  • Blur
  • BlurMask
  • Canvas
  • Circle
  • Mask
  • Image
  • SkImage
  • Skia.Data.fromBytes
  • Image.MakeImageFromEncoded

@tmgrask

tmgrask commented Nov 19, 2024

We are observing the same pair of crashes reported above. Like other commenters above, the crash is rare enough that we have not been able to reproduce it consistently (or at all?) during development.

[split_config.arm64_v8a.apk!librnskia.so] SkTDPQueue<GrGpuResource*, &GrResourceCache::CompareTimestamp(GrGpuResource* const&, GrGpuResource* const&), &GrResourceCache::AccessResourceIndex(GrGpuResource* const&)>::remove(GrGpuResource*)
[split_config.arm64_v8a.apk!librnskia.so] GrResourceCache::notifyARefCntReachedZero(GrGpuResource*, GrIORef<GrGpuResource>::LastRemovedRef)

react-native-skia has been a joy to program with, but this crash is a bit of a show stopper as it is happening frequently enough to raise our crash rate above the "bad behaviour" threshold in Google Play.

We have been running react-native-skia version 1.3.13 and the issue persists after rolling out an optimistic upgrade to 1.5.2. A downgrade to 0.1.210 (a crash-less version reported above) is not possible without considerable effort to replace our use of newer react-native-skia features like ParagraphBuilder.

@wcandillon
Contributor

@tmgrask We are actually working on a fix for this at the moment. We found a serious threading issue on Android. These crashes are hard to reproduce, so I cannot guarantee a fix 100%. However, the indication that 0.1.210 is crashless for you is a good piece of data for us to check whether our current assumptions are correct.

@wcandillon
Contributor

The 0.1.210 data point is not necessarily consistent with some of our assumptions, but we think that #2749 is fixing substantial threading issues with OpenGL/Skia. We are working on confirming/testing everything.

@tmgrask

tmgrask commented Nov 19, 2024

Good to know, I will follow that threading PR.

the indication that 0.1.210 is crashless for you is good piece of data for us

Just to be clear: we have not rolled out a version with 0.1.210, so I cannot personally attest to it being crash-free. I was only investigating it as a downgrade target because of this comment above: #1982 (comment)

With Skia 0.1.210: No significant crashes
With Skia 0.1.214: many SIGSEGV crashes, and also some SIGTRAP crashes
With Skia 0.1.221: many SIGTRAP crashes, no more SIGSEGV crashes.

Thanks for the quick update and for the work on this great package!

@wcandillon
Contributor

wcandillon commented Nov 19, 2024

published 1.5.5 could you check to see if it helps? We fixed an important thread-safety issue there but these crashes can be hard to reproduce so I'd be curious to know if it helped.

@Nodonisko
Contributor

Thanks @wcandillon we will test it and let you know but I don't think we will have enough data earlier than end of December or early January due to our release process.

@joacub

joacub commented Nov 20, 2024

Still happening, now even more. This was not happening to me before this update:

Image

@wcandillon
Contributor

@joacub Any chance you would have an example that could help us reproduce the crash?

@joacub

joacub commented Nov 20, 2024

expo: 52
"@shopify/react-native-skia": "1.5.5"
"react-native-reanimated": "3.16.1"

return (
  <Canvas style={{ flex: 1, position: 'absolute', left: 0, right: -1, top: 0, bottom: 0 }}>
    <Rect x={0} y={0} width={width} height={height}>
      <LinearGradientSkia start={start} end={end} colors={colors} />
    </Rect>
  </Canvas>
);

Downgrading Skia to 1.5.4 works; the latest changes you made are what is causing this.

@wcandillon
Contributor

can you give me more of a snippet? what are width, height, start, end?

@joacub

joacub commented Nov 20, 2024

can you give me more of a snippet? what are width, height, start, end?

That does not matter. I already tested with the same snippet as you have in the demos and got the same behavior; you can use the same example the Skia docs have for linear gradient.

@wcandillon
Contributor

@joacub are you saying you are able to consistently reproduce the issue? You could help me reproduce it on my side? I would be also curious to know when it happens (e.g entering/leaving a particular screen for instance)

@joacub

joacub commented Nov 20, 2024

@joacub are you saying you are able to consistently reproduce the issue? You could help me reproduce it on my side? I would be also curious to know when it happens (e.g entering/leaving a particular screen for instance)

Yes, it always happens on some Android devices. It does not seem related to the Android version but rather to specific devices: on a Samsung Galaxy A12 with Android 12 it happens every time, right at the beginning, on the first render.

Skia version 1.5.4 is working; it started happening in 1.5.5.

@tmgrask

tmgrask commented Nov 20, 2024

Hi @wcandillon, I did some fuzz testing of our app using react-native-skia 1.5.4 and 1.5.6 in AWS device farm. Unfortunately it appears as though the SIGTRAP crashes remain, and 1.5.6 has introduced a new even more frequent crash (and faster, it prevents the first screen with skia on it from rendering at all). Here is what I see from device farm:

Both versions have SIGTRAP crashes that originate in librnskia.so:

1.5.4: "Exerciser detected crash: Native crash: Trap (8)"
Example from logs:

signal 5 (SIGTRAP), code 1 (TRAP_BRKPT), fault addr 0x71aaf54f9c
x0 000000727505cd18 x1 0000000000000055 x2 0000000000000020 x3 0000000000000002
...
backtrace:
#00 pc 000000000064df9c /data/app/<my.app>/base.apk!librnskia.so (offset 0x2cce000) (BuildId: 45a10a38b12eb6fcb888c570a64460536107a05e)

1.5.6: "Exerciser detected crash: Native crash: Trap (6)"
Example from logs:

signal 5 (SIGTRAP), code 1 (TRAP_BRKPT), fault addr 0x6e07400b28
...
Backtrace:
#00 pc 000000000063eb28 /data/app/~~KkTWmNEB2iGgU3BJaVms1Q==/<my.app>/base.apk!librnskia.so (offset 0x2b8f000) (BuildId: ec64f9b748e967e24d1aa66564062dae70ff3012)

Additionally, 1.5.6 has a new error that shows up quite frequently (more frequently than the SIGTRAP, at least in device farm):

"Exerciser detected crash: java.lang.RuntimeException: Failed to create window surface (12)"

Info RNSkia EGL Error: Bad Alloc (12291) in /Users/<me>/checkout/tmgrask/<myapp>/node_modules/@shopify/react-native-skia/android/cpp/rnskia-android/gl/Display.h:89
...
Error AndroidRuntime java.lang.RuntimeException: Failed to create window surface // 
at com.shopify.reactnative.skia.SkiaDomView.surfaceSizeChanged(Native Method) // 
at com.shopify.reactnative.skia.SkiaBaseView.onSurfaceTextureSizeChanged(Unknown Source:35) // 
at android.view.TextureView.onSizeChanged(TextureView.java:379) // 
at android.view.View.sizeChange(View.java:24679) // 
at android.view.View.setFrame(View.java:24612) // 
at android.view.View.layout(View.java:24469) // 
at com.shopify.reactnative.skia.SkiaBaseView.onLayout(Unknown Source:52) // 
at android.view.View.layout(View.java:24472) // 
at android.view.ViewGroup.layout(ViewGroup.java:6772) // 
at com.facebook.react.uimanager.v.j(SourceFile:1) // 
at com.facebook.react.uimanager.v.updateLayout(Unknown Source:93) // 
at com.swmansion.reanimated.layoutReanimation.ReanimatedNativeHierarchyManager.updateLayout(Unknown Source:1) // 
at com.facebook.react.uimanager.l1$u.d(SourceFile:1) // 
at com.facebook.react.uimanager.l1$a.run(Unknown Source:141) // 
at com.facebook.react.uimanager.l1.T(SourceFile:1) // 
at com.facebook.react.uimanager.l1.w(SourceFile:1) // 
at com.facebook.react.uimanager.l1$j.doFrameGuarded(Unknown Source:31) // 
at com.facebook.react.uimanager.j.doFrame(Unknown Source:0) // 
at com.facebook.react.modules.core.k$a.doFrame(Unknown Source:46) // 
at android.view.Choreographer$CallbackRecord.run(Choreographer.java:1008) // 
at android.view.Choreographer.doCallbacks(Choreographer.java:809) // 
at android.view.Choreographer.doFrame(Choreographer.java:740) // 
at android.view.Choreographer$FrameDisplayEventReceiver.run(Choreographer.java:995) // 
at android.os.Handler.handleCallback(Handler.java:938) // 
at android.os.Handler.dispatchMessage(Handler.java:99) // 
at android.os.Looper.loop(Looper.java:246) // 
at android.app.ActivityThread.main(ActivityThread.java:8429) // 
at java.lang.reflect.Method.invoke(Native Method) // 
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:596) // 
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1130) //

(I included that info log because it always precedes the crash, and sounds bad)

edit: I see this new error is also reported in #2754

@wcandillon
Contributor

wcandillon commented Nov 20, 2024 via email

@tmgrask

tmgrask commented Nov 20, 2024

Our app uses many react-native-skia components and APIs, but there is really just 1 main screen where they are all combined, so I cannot pinpoint a specific one. Many of the react-native-skia components are animated via react-native-reanimated. From the videos of the fuzz testing, the crashes are not happening at the same time/with the same elements in view. I have also been trying to reproduce the issue in a controlled environment with no success. I will let you know if I stumble on any more relevant info.

Through the Google Play console I can see the SIGTRAP crashes are happening in production on Android versions 8 through 14, so Android API versions 28->34. I don't think we have enough data to say 35 is not also affected.
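
For reference when reading version ranges like the one above, here is a small lookup of Android releases to API levels, based on the Android platform version documentation. Android 8.0 is API 26 and Android 9 is API 28, so a range starting at Android 8 starts below API 28. The helper names are illustrative only:

```typescript
// Android release to API level mapping (Android 8.0 through 15).
const apiLevels: { release: string; api: number }[] = [
  { release: "8.0", api: 26 },
  { release: "8.1", api: 27 },
  { release: "9", api: 28 },
  { release: "10", api: 29 },
  { release: "11", api: 30 },
  { release: "12", api: 31 },
  { release: "12L", api: 32 },
  { release: "13", api: 33 },
  { release: "14", api: 34 },
  { release: "15", api: 35 },
];

// Look up the API level for a release name; throws for releases not in the table.
function apiFor(release: string): number {
  for (const entry of apiLevels) {
    if (entry.release === release) return entry.api;
  }
  throw new Error("unknown Android release: " + release);
}

// Inclusive API-level range covered by a span of Android releases.
function apiRange(fromRelease: string, toRelease: string): [number, number] {
  return [apiFor(fromRelease), apiFor(toRelease)];
}

const androidRange = apiRange("8.0", "14");
console.log(androidRange);
```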

@wcandillon
Contributor

v1.5.7 is published. I would love to get a read on whether it improves Android stability.

@tmgrask

tmgrask commented Nov 21, 2024

From early testing it looks like you have solved the new (#2754 ) crash in 1.5.7, nice work!

Unfortunately, I still see the SIGTRAP crash in fuzz testing. I have been unable to find a way to reproduce locally but I'm still looking.

signal 5 (SIGTRAP), code 1 (TRAP_BRKPT), fault addr 0x732d4c5b24
...
#00 pc 000000000063eb24 /data/app/my.app-O4RQNt26HPQZi7MdAuk8vA==/base.apk!librnskia.so (offset 0x2b8f000) (BuildId: 03f9b41b4ada98c2a64fb4f7aaacc53c50c09bcb)

@wcandillon
Contributor

wcandillon commented Nov 21, 2024

Continuing to investigate the issue, #2761 will help a lot as well. I'm curious to see if it reduces the crashes. However if your app is crossing thread boundaries, you will now get null for some images in places where you might have gotten a result (but setting the direct context to a bad state).
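
If image-producing calls can now return null in more situations, calling code may want to handle that case explicitly rather than assume a result. The following is only a sketch: `Decoder` is a hypothetical stand-in for any call that may return null (for example `Skia.Image.MakeImageFromEncoded`, whose return type is nullable), and the fake decoders exist only so the snippet runs without the library:

```typescript
// Hypothetical stand-in for a decoder that may return null.
type Decoder<I> = (bytes: Uint8Array) => I | null;

type DecodeResult<I> = { ok: true; image: I } | { ok: false; reason: string };

// Wraps a nullable decode call so callers must handle the failure branch.
function safeDecode<I>(decode: Decoder<I>, bytes: Uint8Array): DecodeResult<I> {
  const image = decode(bytes);
  if (image === null) {
    return { ok: false, reason: "decoder returned null" };
  }
  return { ok: true, image };
}

// Fake decoders for illustration only; a real app would call into Skia here.
const alwaysNull: Decoder<string> = () => null;
const fakeDecoder: Decoder<string> = (bytes) => `image(${bytes.length} bytes)`;

const failed = safeDecode(alwaysNull, new Uint8Array(4));
const decoded = safeDecode(fakeDecoder, new Uint8Array(4));
console.log(failed.ok, decoded.ok);
```

A component can then render a placeholder when `ok` is false instead of passing a null image further down.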

@tmgrask

tmgrask commented Nov 22, 2024

@wcandillon I am able to very inconsistently reproduce the crash in our app by running the Android exerciser monkey against a release build on an emulator. I almost hesitate to share this with you because I worry it will waste your time to chase this very rare repro.

Our app source code is now public, I made a branch with the monkey script that (rarely) causes the crash: https://github.com/Psiphon-Inc/conduit/tree/skia-crash-repro

The app uses react-native-skia in many places, but src/components/ConduitOrbToggle.tsx is a good place to look at some of the more involved usage.

steps for very rare repro:

  1. Pixel 3a XL emulator on API 30 (maybe others will crash as well, based on our data they should)
  2. Stub in configs and build the app (branch uses mocked backend)
touch android/app/src/main/res/raw/psiphon_config
touch android/app/src/main/res/raw/embedded_server_entries

npm run build-release
adb install android/app/build/outputs/apk/release/app-release.apk
  3. Run the monkey script until it fails.
./repro.sh
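
The contents of repro.sh are not shown in this thread. As a minimal sketch of the "run until it fails" idea, the loop below repeats an attempt and stops at the first log that contains a native crash header. The `Attempt` callback is a hypothetical hook: in a real harness it would launch the monkey run and return the captured log text, while here a fake runner is used so the snippet runs without a device:

```typescript
// One attempt returns the captured log text for that run.
type Attempt = (attemptNumber: number) => string;

// Matches the header line of a native crash dump, e.g.
// "signal 5 (SIGTRAP), code 1 (TRAP_BRKPT), fault addr 0x7582307b28".
const SIGNAL_LINE = /signal (\d+) \((SIG[A-Z]+)\), code (-?\d+) \(([A-Z_]+)\)/;

// Repeats the attempt until a crash header appears or maxAttempts is reached.
function runUntilCrash(
  attempt: Attempt,
  maxAttempts: number
): { attempt: number; signal: string } | null {
  for (let n = 1; n <= maxAttempts; n++) {
    const match = SIGNAL_LINE.exec(attempt(n));
    if (match) {
      return { attempt: n, signal: match[2] ?? "UNKNOWN" };
    }
  }
  return null;
}

// Fake runner for illustration: only the 7th attempt produces a crash log.
const fakeRun: Attempt = (n) =>
  n === 7
    ? "// signal 5 (SIGTRAP), code 1 (TRAP_BRKPT), fault addr 0x7582307b28"
    : "no crash in this run";

const firstCrash = runUntilCrash(fakeRun, 10);
console.log(firstCrash);
```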

I just ran it 10 times and got it once, on the 7th time:

// signal 5 (SIGTRAP), code 1 (TRAP_BRKPT), fault addr 0x7582307b28
...
// backtrace:
//       #00 pc 000000000063eb28  /data/app/~~bapVmLH_hs5eKgZ49e6hPw==/ca.psiphon.conduit-E5x5fHyNUSNXT2MaBr8VFg==/base.apk!librnskia.so (offset 0x2b8f000) (BuildId: ec64f9b748e967e24d1aa66564062dae70ff3012)

I have yet to see it work from an npm run android dev build, metro can't keep up with the monkey spam.

Not exactly the most reproducible reproduction!
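
Without symbols, the unsymbolicated frames posted in this thread can still be compared by BuildId and program counter to check whether different reports hit the same instruction in the same binary. This is a sketch with illustrative helper names; the frame lines are copied from the reports above, and the source labels are only descriptions:

```typescript
type Frame = { pc: string; library: string; buildId: string };

// Matches lines like "#00 pc 000000000063eb28 /path/base.apk!librnskia.so (offset 0x2b8f000) (BuildId: ...)".
const FRAME = /#\d+\s+pc\s+([0-9a-f]+)\s+(\S+)\s+\(offset 0x[0-9a-f]+\)\s+\(BuildId: ([0-9a-f]+)\)/;

function parseFrame(line: string): Frame | null {
  const m = FRAME.exec(line);
  if (!m) return null;
  return { pc: (m[1] ?? "").replace(/^0+/, ""), library: m[2] ?? "", buildId: m[3] ?? "" };
}

// Groups report labels by "shortBuildId@pc" so identical crash sites end up together.
function groupBySite(reports: { source: string; line: string }[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const r of reports) {
    const f = parseFrame(r.line);
    if (!f) continue;
    const key = f.buildId.slice(0, 8) + "@0x" + f.pc;
    groups[key] = (groups[key] ?? []).concat(r.source);
  }
  return groups;
}

const crashSites = groupBySite([
  { source: "device farm, 1.5.4", line: "#00 pc 000000000064df9c /data/app/<my.app>/base.apk!librnskia.so (offset 0x2cce000) (BuildId: 45a10a38b12eb6fcb888c570a64460536107a05e)" },
  { source: "device farm, 1.5.6", line: "#00 pc 000000000063eb28 /data/app/~~KkTWmNEB2iGgU3BJaVms1Q==/<my.app>/base.apk!librnskia.so (offset 0x2b8f000) (BuildId: ec64f9b748e967e24d1aa66564062dae70ff3012)" },
  { source: "fuzz test, 1.5.7", line: "#00 pc 000000000063eb24 /data/app/my.app-O4RQNt26HPQZi7MdAuk8vA==/base.apk!librnskia.so (offset 0x2b8f000) (BuildId: 03f9b41b4ada98c2a64fb4f7aaacc53c50c09bcb)" },
  { source: "emulator monkey repro", line: "#00 pc 000000000063eb28  /data/app/~~bapVmLH_hs5eKgZ49e6hPw==/ca.psiphon.conduit-E5x5fHyNUSNXT2MaBr8VFg==/base.apk!librnskia.so (offset 0x2b8f000) (BuildId: ec64f9b748e967e24d1aa66564062dae70ff3012)" },
]);
console.log(crashSites);
```

On these four lines, the 1.5.6 device farm crash and the emulator repro share both BuildId and pc, which suggests the emulator repro is hitting the same site that the device farm run reported.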

@wcandillon
Contributor

Thank you @tmgrask, this is huge. I am trying it now. On your side, if you use debuggable in buildTypes.release in build.gradle, do you get more information about the crash? I'm very excited to get to the bottom of this.

@wcandillon
Contributor

@tmgrask I was able to reproduce the crash, and I am now trying to reproduce it with debug symbols, maybe adding some try/catches at the JNI layer to see if an unhandled exception happens. I will also publish 1.5.9, which improves stability on Android.

Any chance you could contribute the orb as a standalone example in https://github.com/Shopify/react-native-skia/tree/main/apps/paper/src?

I didn't know about this monkey command but it's very useful :) I want to add it to the CI on both Skia and also RN WebGPU. Hope we can get to the bottom of this quickly.

@tmgrask

tmgrask commented Nov 24, 2024

I tried debuggable true (and did a bit of research about symbolicating these crashes), but ultimately have not yet come up with any more info about the crash. This type of Android native debugging is new to me, I also didn't know about the monkey command until a couple days ago. Wish I could be more helpful in digging into the native crash. I will continue to run the new builds you publish through my testing harness, at least until we figure this one out.

Upgrading to 1.5.9 things look a little different:

Here is a new crash I just saw locally (./repro.sh) that is explicitly originating from react-native-skia, and appears to be related to changes made in 1.5.9:

// signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x5
// Cause: null pointer dereference
//     x0  0000000000000000  x1  b400006fddff3f18  x2  0000007fe328bea8  x3  0000000000000028
//     x4  0000000000000010  x5  0000000000000000  x6  0000000000000000  x7  0000007fe328b3e1
//     x8  b400006fddff3f10  x9  b400007000000005  x10 b4000070fe039480  x11 0000000000000040
//     x12 0000000000000000  x13 00000b368ff8a736  x14 001ee5798cf1b100  x15 0000000029aaaaab
//     x16 0000000000000001  x17 0000006dacebd11c  x18 0000007243fba000  x19 0000000000000005
//     x20 00000072435e5000  x21 b40000701e0db658  x22 c0c0c0c0c0c0c0c1  x23 0000000000000055
//     x24 0000000000000030  x25 0000007fe328bfc0  x26 0000000000000000  x27 0000000000000001
//     x28 000000000000003e  x29 0000007fe328bf90
//     lr  0000006dacebd160  sp  0000007fe328bf70  pc  0000006dacebd160  pst 0000000080001000
//
// backtrace:
//       #00 pc 0000000000272160  /data/app/~~ioLWZazCaYhbZ1ScgYbuQg==/ca.psiphon.conduit-tw2QTRmYHGBKE1a67_-yKg==/base.apk!librnskia.so (offset 0x2b8f000) (void std::__ndk1::__invoke_void_return_wrapper<void>::__call<RNSkia::RNSkView::requestRedraw()::'lambda'()&>(RNSkia::RNSkView::requestRedraw()::'lambda'()&)+68) (BuildId: fc15d99eba007fa8d87de71a776b8032c41b2df5)
//       #01 pc 000000000027ae04  /data/app/~~ioLWZazCaYhbZ1ScgYbuQg==/ca.psiphon.conduit-tw2QTRmYHGBKE1a67_-yKg==/base.apk!librnskia.so (offset 0x2b8f000) (MainThreadDispatcher::processMessages()+360) (BuildId: fc15d99eba007fa8d87de71a776b8032c41b2df5)
//       #02 pc 000000000027ac70  /data/app/~~ioLWZazCaYhbZ1ScgYbuQg==/ca.psiphon.conduit-tw2QTRmYHGBKE1a67_-yKg==/base.apk!librnskia.so (offset 0x2b8f000) (MainThreadDispatcher::MainThreadDispatcher()::'lambda'(int, int, void*)::__invoke(int, int, void*)+48) (BuildId: fc15d99eba007fa8d87de71a776b8032c41b2df5)
//       #03 pc 0000000000019dac  /system/lib64/libutils.so (android::Looper::pollInner(int)+916) (BuildId: d1aa3b02347f658128fc75fb371856b9)
//       #04 pc 00000000000199b0  /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+112) (BuildId: d1aa3b02347f658128fc75fb371856b9)
//       #05 pc 0000000000110f74  /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44) (BuildId: dc13c3ae89f2044ec9e55326de275210)
//       #06 pc 000000000020fadc  /system/framework/arm64/boot-framework.oat (art_jni_trampoline+140) (BuildId: 339e94a38e629aea381d1192e0258d731e293228)
//       #07 pc 000000000063ec40  /system/framework/arm64/boot-framework.oat (android.os.MessageQueue.next+192) (BuildId: 339e94a38e629aea381d1192e0258d731e293228)
//       #08 pc 000000000063b6f8  /system/framework/arm64/boot-framework.oat (android.os.Looper.loop+744) (BuildId: 339e94a38e629aea381d1192e0258d731e293228)
//       #09 pc 00000000003fce90  /system/framework/arm64/boot-framework.oat (android.app.ActivityThread.main+752) (BuildId: 339e94a38e629aea381d1192e0258d731e293228)
//       #10 pc 00000000001337e8  /apex/com.android.art/lib64/libart.so (art_quick_invoke_static_stub+568) (BuildId: b628ec1e4df42966356fcd82bcb1136d)
//       #11 pc 00000000001a8a94  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+228) (BuildId: b628ec1e4df42966356fcd82bcb1136d)
//       #12 pc 00000000005556f8  /apex/com.android.art/lib64/libart.so (art::InvokeMethod(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jobject*, _jobject*, unsigned long)+1364) (BuildId: b628ec1e4df42966356fcd82bcb1136d)
//       #13 pc 00000000004d4f04  /apex/com.android.art/lib64/libart.so (art::Method_invoke(_JNIEnv*, _jobject*, _jobject*, _jobjectArray*)+52) (BuildId: b628ec1e4df42966356fcd82bcb1136d)
//       #14 pc 00000000000896f4  /apex/com.android.art/javalib/arm64/boot.oat (art_jni_trampoline+180) (BuildId: 13577ce71153c228ecf0eb73fc39f45010d487f8)
//       #15 pc 000000000088eed8  /system/framework/arm64/boot-framework.oat (com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run+136) (BuildId: 339e94a38e629aea381d1192e0258d731e293228)
//       #16 pc 0000000000897608  /system/framework/arm64/boot-framework.oat (com.android.internal.os.ZygoteInit.main+2280) (BuildId: 339e94a38e629aea381d1192e0258d731e293228)
//       #17 pc 00000000001337e8  /apex/com.android.art/lib64/libart.so (art_quick_invoke_static_stub+568) (BuildId: b628ec1e4df42966356fcd82bcb1136d)
//       #18 pc 00000000001a8a94  /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+228) (BuildId: b628ec1e4df42966356fcd82bcb1136d)
//       #19 pc 0000000000554134  /apex/com.android.art/lib64/libart.so (art::JValue art::InvokeWithVarArgs<art::ArtMethod*>(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, art::ArtMethod*, std::__va_list)+448) (BuildId: b628ec1e4df42966356fcd82bcb1136d)
//       #20 pc 00000000005545e8  /apex/com.android.art/lib64/libart.so (art::JValue art::InvokeWithVarArgs<_jmethodID*>(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jmethodID*, std::__va_list)+92) (BuildId: b628ec1e4df42966356fcd82bcb1136d)
//       #21 pc 0000000000438b1c  /apex/com.android.art/lib64/libart.so (art::JNI<true>::CallStaticVoidMethodV(_JNIEnv*, _jclass*, _jmethodID*, std::__va_list)+656) (BuildId: b628ec1e4df42966356fcd82bcb1136d)
//       #22 pc 0000000000099424  /system/lib64/libandroid_runtime.so (_JNIEnv::CallStaticVoidMethod(_jclass*, _jmethodID*, ...)+124) (BuildId: dc13c3ae89f2044ec9e55326de275210)
//       #23 pc 00000000000a08b0  /system/lib64/libandroid_runtime.so (android::AndroidRuntime::start(char const*, android::Vector<android::String8> const&, bool)+836) (BuildId: dc13c3ae89f2044ec9e55326de275210)
//       #24 pc 0000000000003580  /system/bin/app_process64 (main+1336) (BuildId: 3254c0fd94c1b04edc39169c6c635aac)
//       #25 pc 0000000000049450  /apex/com.android.runtime/lib64/bionic/libc.so (__libc_init+108) (BuildId: 22c2fa8a4f6044df2cbd42d53d857c5f)
//

requestRedraw was part of the 1.5.9 changes according to https://github.com/Shopify/react-native-skia/blame/ba1db84d5f5403831aa3a59cbe2fcdf118ceac74/packages/skia/cpp/rnskia/RNSkView.h#L173

When I put an APK with 1.5.9 through the device farm fuzz test, there is another new crash that happens quite a bit. The backtrace doesn't explicitly mention rnskia, so it is perhaps a red herring, but it is noteworthy that I do not observe this crash in a build with 1.5.8.

// signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------
// Abort message: 'Pure virtual function called!'
//     x0  0000000000000000  x1  000000000000155b  x2  0000000000000006  x3  0000007fe328bd40
//     x4  fefefefefefefeff  x5  fefefefefefefeff  x6  fefefefefefefeff  x7  7f7f7f7f7f7f7f7f
//     x8  00000000000000f0  x9  3a60ae15ff2a7b2d  x10 0000000000000000  x11 ffffffc0fffffbdf
//     x12 0000000000000001  x13 00000b2fafba4d7c  x14 00175db2a232a900  x15 0000000029aaaaab
//     x16 00000072404e4c80  x17 00000072404c6430  x18 0000007243fba000  x19 000000000000155b
//     x20 000000000000155b  x21 00000000ffffffff  x22 0000007fe328be70  x23 0000007fe328beb0
//     x24 0000007fe328bf60  x25 0000007fe328bfc0  x26 0000000000000000  x27 0000000000000001
//     x28 000000000000003f  x29 0000007fe328bdc0
//     lr  0000007240479e60  sp  0000007fe328bd20  pc  0000007240479e8c  pst 0000000000001000
//
// backtrace:
//       #00 pc 000000000004de8c  /apex/com.android.runtime/lib64/bionic/libc.so (abort+164) (BuildId: 22c2fa8a4f6044df2cbd42d53d857c5f)
//       #01 pc 000000000009dd84  /data/app/~~ioLWZazCaYhbZ1ScgYbuQg==/ca.psiphon.conduit-tw2QTRmYHGBKE1a67_-yKg==/base.apk!libc++_shared.so (offset 0x2df000) (BuildId: 982d68842b3bd6a164609be09a533324b1f28526)

SIGTRAP crashes are less frequent (but still happening). However, I have a feeling the SIGABRT crash above may be masking them, since it happens relatively frequently.
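For context on the 'Pure virtual function called!' abort message: in C++ it generally means a virtual call reached a base class whose function has no implementation, typically because the object was being constructed or destroyed (or had already been destroyed) at the time. The cause in this build has not been established; the sketch below only illustrates the dispatch rule, using made-up `Surface` and `GpuSurface` types and a non-pure base function so that it runs safely.

```cpp
#include <string>

// Illustrative types only; they do not exist in react-native-skia.
struct Surface {
  explicit Surface(std::string *trace) : trace_(trace) {}
  // By the time this destructor runs, the derived part is already gone,
  // so name() dispatches to Surface::name, not to the derived override.
  virtual ~Surface() { *trace_ += name(); }
  virtual std::string name() const { return "Surface"; }
  std::string *trace_;
};

struct GpuSurface : Surface {
  using Surface::Surface;
  // Here the object is still a GpuSurface, so the override is used.
  ~GpuSurface() override { *trace_ += name() + ">"; }
  std::string name() const override { return "GpuSurface"; }
};
```

If `Surface::name` were declared pure (`= 0`) and reached through such a call path, the runtime would abort instead of returning "Surface". A dangling pointer to a partially destroyed object can trigger the same path.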

@wcandillon
Contributor

wcandillon commented Nov 24, 2024

@tmgrask I published v1.5.10 which fixes the dangling pointer issue.
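For readers following along, here is a minimal sketch of the general pattern behind this kind of crash and one common guard. It is not the actual v1.5.10 change, which is not shown in this thread. `Dispatcher` and `View` are simplified stand-ins (assumptions) for a main-thread queue and a view whose redraw lambda may run after the view is destroyed.

```cpp
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Stand-in for a main-thread message queue (not the library's class).
class Dispatcher {
public:
  void post(std::function<void()> task) { tasks_.push(std::move(task)); }

  // Runs every queued task in order, like a looper callback would.
  void processMessages() {
    while (!tasks_.empty()) {
      auto task = std::move(tasks_.front());
      tasks_.pop();
      task();
    }
  }

private:
  std::queue<std::function<void()>> tasks_;
};

// Stand-in for a view; it must be owned by a std::shared_ptr.
class View : public std::enable_shared_from_this<View> {
public:
  explicit View(int *redrawCount) : redrawCount_(redrawCount) {}

  void requestRedraw(Dispatcher &dispatcher) {
    // Capturing `this` directly would dangle if the view is destroyed
    // before the queued task runs. A weak_ptr lets the task check first.
    std::weak_ptr<View> weakSelf = shared_from_this();
    dispatcher.post([weakSelf]() {
      if (auto self = weakSelf.lock()) {
        ++(*self->redrawCount_); // the view is still alive: redraw
      }
      // Otherwise the view is gone and the task does nothing.
    });
  }

private:
  int *redrawCount_;
};
```

With this guard, a redraw queued just before the view is released becomes a no-op instead of dereferencing freed memory.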

Now testing the monkey script. The screen animation looks beautiful btw 😯

@wcandillon
Contributor

@tmgrask I ran the monkey script 10 times with no crash. That's a good sign, no? On the previous version I feel like I could reproduce it pretty quickly.

@tmgrask

tmgrask commented Nov 24, 2024

@wcandillon 1.5.10 looks way better!

Overall, device farm fuzz SIGTRAP rate is way down:

| rn-skia | SIGTRAP | SIGABRT (EGL_BAD_ALLOC) |
| --- | --- | --- |
| 1.5.6 | 6 | 0 |
| 1.5.7 | 7 | 1 |
| 1.5.8 | 8 | 0 |
| 1.5.9 | 4 | 0 |
| 1.5.10 | 1 | 1 |

I am unsure if the SIGABRT EGL_BAD_ALLOC error is relevant, but it does reference SkRect, EGL, and maybe some other rn-skia internals. I'll paste it below.

I did repro SIGTRAP locally once, so it is still possible, but it does indeed seem much less common.

This improvement is enough for me to ship an update to users, to hopefully alleviate our Google Play crash rate issue.

SIGABRT (EGL_BAD_ALLOC):

signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
Abort message: 'Encountered EGL error 12291 EGL_BAD_ALLOC during rendering'
    x0   0000000000000000  x1   0000000000004bce  x2   0000000000000006  x3   0000000000000008
    x4   7500000000000000  x5   7500000000000000  x6   7500000000000000  x7   00000000000080f5
    x8   0000000000000083  x9   3feabd8fdba74ee5  x10  0000000000000000  x11  0000000000000001
    x12  ffffffffffffffff  x13  0000000000000002  x14  ffffffffffffffff  x15  7500000000000000
    x16  00000078fa18f2f8  x17  00000078fa130410  x18  0000000000000000  x19  0000000000004bae
    x20  0000000000004bce  x21  0000000000000003  x22  00000078e74512a8  x23  00000078d48100c8
    x24  0000000000000001  x25  0000000000000001  x26  00000078d3c76328  x27  00000078f9caf738
    x28  7fffffffffffffff  x29  00000078d3c75c00  x30  00000078fa0e2bd4
    sp   00000078d3c75bc0  pc   00000078fa130418  pstate 0000000060000000
backtrace:
    #00 pc 000000000006b418  /system/lib64/libc.so (tgkill+8)
    #01 pc 000000000001dbd0  /system/lib64/libc.so (abort+88)
    #02 pc 0000000000007f44  /system/lib64/liblog.so (__android_log_assert+304)
    #03 pc 000000000005030c  /system/lib64/libhwui.so (_ZN7android10uirenderer12renderthread10EglManager11swapBuffersERKNS1_5FrameERK6SkRect+588)
    #04 pc 000000000004dd2c  /system/lib64/libhwui.so (_ZN7android10uirenderer12renderthread14OpenGLPipeline11swapBuffersERKNS1_5FrameEbRK6SkRectPNS0_9FrameInfoEPb+120)
    #05 pc 000000000004b720  /system/lib64/libhwui.so (_ZN7android10uirenderer12renderthread13CanvasContext4drawEv+228)
    #06 pc 000000000004ede4  /system/lib64/libhwui.so (_ZN7android10uirenderer12renderthread13DrawFrameTask3runEv+184)
    #07 pc 00000000000559d8  /system/lib64/libhwui.so (_ZN7android10uirenderer12renderthread12RenderThread10threadLoopEv+348)
    #08 pc 000000000001160c  /system/lib64/libutils.so (_ZN7android6Thread11_threadLoopEPv+280)
    #09 pc 00000000000fcde0  /system/lib64/libandroid_runtime.so (_ZN7android14AndroidRuntime15javaThreadShellEPv+136)
    #10 pc 0000000000067dac  /system/lib64/libc.so (_ZL15__pthread_startPv+36)
    #11 pc 000000000001f324  /system/lib64/libc.so (__start_thread+68)

> The screen animation looks beautiful btw 😯

Thanks! Your videos from beautiful Zurich Switzerland were very helpful!

@wcandillon
Contributor

wcandillon commented Nov 25, 2024

This is very exciting. Marking it as fixed by v1.5.10.

Is it possible that the "Encountered EGL error 12291 EGL_BAD_ALLOC during rendering" error happens in Android's own Skia, not RN Skia? It looks that way based on the stack trace. Anyway, let's keep this in a separate issue if needed.
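One way to check which component a frame belongs to is to demangle the symbol names in the backtrace. Below is a small sketch using `abi::__cxa_demangle` from `<cxxabi.h>` (available with GCC and Clang toolchains); the `demangle` wrapper is a hypothetical helper written for this comment.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Hypothetical helper: returns the demangled form of an Itanium C++ ABI
// symbol, or the input unchanged when demangling fails.
inline std::string demangle(const std::string &mangled) {
  int status = 0;
  // With a null output buffer, __cxa_demangle allocates the result with
  // malloc, so it has to be released with free.
  char *out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || out == nullptr) {
    return mangled;
  }
  std::string result(out);
  std::free(out);
  return result;
}
```

Applied to frame #03 of the EGL_BAD_ALLOC trace, this resolves to `EglManager::swapBuffers` in the `android::uirenderer::renderthread` namespace, which sits in libhwui.so, the system renderer, rather than in librnskia.so.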

I love the monkey script, and I really want to add it to the CI. If you want to contribute an example to the example app, it would get tested there, which would be fun.
