
Storage calls seem to result in main thread deadlocks #277

Closed · tristan-warner-smith opened this issue Nov 23, 2023 · 14 comments

@tristan-warner-smith commented Nov 23, 2023

Describe the bug

  • We've updated from the previous Analytics-iOS SDK to Analytics-Swift and we're experiencing regular app hangs.
  • During a hang, pausing execution shows three threads accessing Storage functions that use the blocking syncQueue.sync.
  • We see:
    • Storage.write as a result of a track call where we're making a call to Analytics.track(name:properties:) from the main thread
    • Storage.remove from HttpClient.startBatchUpload > SegmentDestination.flush > Storage.remove
    • Storage.write as a result of another track call, this one from the sync.segment.com serial queue

See attached images.

To Reproduce

  • The behaviour often manifests when switching to other apps and back, but it also often hangs the app immediately on start, leaving it unusable.

Our initialisation is:

let analytics = Analytics(
    configuration: Configuration(
        writeKey: Config.shared.segmentSourceKey
    )
    .trackApplicationLifecycleEvents(true) // automatic lifecycle events (Application Opened, Backgrounded, etc.)
    .flushAt(3)                            // flush once 3 events are queued
    .flushInterval(10)                     // or every 10 seconds, whichever comes first
)
analytics.add(plugin: FirebaseDestination())

let engage = TwilioEngage { _, _ in }
analytics.add(plugin: engage)
analytics.add(plugin: CustomizeSegmentTrackCalls())

Initialisation is lazy on the first track event we fire.
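
For illustration, the lazy setup is roughly of this shape (the AnalyticsService wrapper below is illustrative, not our exact code):

import Segment // plus the Firebase destination / Twilio Engage plugin modules used above

final class AnalyticsService {
    static let shared = AnalyticsService()

    // `lazy` defers the configuration and plugin setup shown above until the
    // first access, i.e. until the first track event we fire.
    lazy var analytics: Analytics = {
        let analytics = Analytics(
            configuration: Configuration(writeKey: Config.shared.segmentSourceKey)
                .trackApplicationLifecycleEvents(true)
                .flushAt(3)
                .flushInterval(10)
        )
        analytics.add(plugin: FirebaseDestination())
        analytics.add(plugin: TwilioEngage { _, _ in })
        analytics.add(plugin: CustomizeSegmentTrackCalls())
        return analytics
    }()
}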

Expected behavior

  • The app should never hang.

Screenshots
Main thread Storage.write

Other Storage calls

Platform (please complete the following information):

  • Library Version in use: 1.4.8
  • Platform being tested: iOS 17
  • Integrations in use: SegmentFirebase, Twilio Engage
  • main commit c4bb71dea0b38c179b2d87b56ee096a06ce2ea86

Additional context

  • The workaround we've had to adopt in the very short term, before you address this bug, is to fork the repo and make Storage.write call syncQueue.async rather than syncQueue.sync (a simplified sketch follows after this list). This is obviously not ideal, since we aren't domain experts in your codebase, but in our testing this change removes all the main-thread blocking issues we encountered and still seems to send Segment events as expected.
  • Both track calls we make are early on in the app's initialisation.
  • We can share a video walking through the stack trace at the point in time of the images above, though we'll want to share that privately.
  • Sometimes it may just be two calls, not three, but the end result is the same.
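
To show the shape of our fork's change, here is a simplified sketch of the pattern (not the SDK's actual Storage implementation):

import Foundation

// Simplified sketch of the pattern only, not the SDK's Storage code.
final class StorageSketch {
    private let syncQueue = DispatchQueue(label: "com.example.storage.sync") // serial queue

    // Upstream behaviour: the caller (sometimes the main thread) blocks until the write completes.
    func writeBlocking(_ data: Data, to url: URL) {
        syncQueue.sync {
            try? data.write(to: url, options: .atomic)
        }
    }

    // Our fork's behaviour: the caller returns immediately; the serial queue still
    // executes the writes one at a time, in the order they were enqueued.
    func writeNonBlocking(_ data: Data, to url: URL) {
        syncQueue.async {
            try? data.write(to: url, options: .atomic)
        }
    }
}
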
@bsneed (Contributor) commented Nov 23, 2023

Thanks @tristan-warner-smith, I'll have a look at this and see if I can reproduce. If you could give me an example of the track calls you're sending, that would be helpful. There were some instances where large IO on the main thread sent to the storage class could result in something that takes long enough for the OS to kill the app. You might pull main and see if that improves things for you in the meantime.

However, your current workaround will almost certainly result in corrupt data given time. Those syncs are there on purpose to synchronize writes to a single file from potentially multiple threads.

@tristan-warner-smith (Author) commented:

> However, your current workaround will almost certainly result in corrupt data given time. Those syncs are there on purpose to synchronize writes to a single file from potentially multiple threads.

  • I assumed as much. The issue is that we now have a release in production that we get no telemetry for and no Crashlytics crash reports from, and we know that when this issue occurs the app is completely unusable.

Our users depend on us for immediate relief from pain, stress, etc., so we can't wait for perfect in this case.

If you can think of any better workarounds @bsneed that don't jeopardise our app's usability, we'd really appreciate it.

These are similar events emitted roughly on app start:

Name: abcd_abcdefghijk_abcdefj
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"is_first_session": true,
	"platform": "iOS",
	"trialStart": false,
	"clicked_branch_link": false,
	"appVersion": "3.20.0"
]

Name: "abcdefghij_abcde"
Parameters: [
	"trialStart": false,
	"platform": "iOS",
	"appVersion": "3.20.0",
	"userId": "abcdefghijklmnopqrstuvwxyz01"
]

Name: "abcdefghijklmno_abcde"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"variant_id": "baseline",
	"appVersion": "3.20.0",
	"event_timestamp": "2023-11-23T22:49:46.669Z",
	"trialStart": false,
	"experiment_id": "ab_abcdef_abcdefgh_abc",
	"platform": "iOS"
]

Name: "abcdefghijklmno_abcde"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"experiment_id": "ab_abcde_ab_abcd",
	"appVersion": "3.20.0",
	"event_timestamp": "2023-11-23T22:49:46.669Z",
	"trialStart": false,
	"variant_id": "baseline",
	"platform": "iOS"
]

Name: "abcdefghijklmno_abcde"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"experiment_id": "ab_abcd_abcdefghijkl_abc_ab",
	"platform": "iOS",
	"trialStart": false,
	"variant_id": "baseline",
	"event_timestamp": "2023-11-23T22:49:46.669Z",
	"appVersion": "3.20.0"
]

@bsneed (Contributor) commented Nov 23, 2023

Thanks @tristan-warner-smith! To clarify, I was just sharing information with you. You should always do (as you are already) what's best for you and your customers and I would never suggest otherwise. We'll dive into this on Monday, but since you have a repro scenario it'd be useful to check out main.

@johncDepop commented:

Could you outline a potential scenario that represents "However, your current workaround will almost certainly result in corrupt data given time"? I.e. is it about ordering of events / losing events?

@bsneed (Contributor) commented Nov 24, 2023

@johncDepop it's about two threads potentially trying to write to a JSON file at the same time. I.e. thread1 is mid-way through writing an event, thread2 writes an event in the middle, and you might end up with {myEvent="someThin{myEvent="someOtherThing"},g"} instead of {myEvent="someThing"},{myEvent="someOtherThing"},
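
To sketch the idea (a toy example with illustrative names, not the SDK's Storage code): funnelling every append through one serial queue keeps each record whole, because a record written from one thread can never land mid-way through a record written from another.

import Foundation

// Toy sketch of serialized event appends; the class, label, and file path are illustrative.
final class SerializedEventFile {
    private let queue = DispatchQueue(label: "com.example.eventfile.serial") // serial by default
    private let url: URL

    init(url: URL) {
        self.url = url
        _ = FileManager.default.createFile(atPath: url.path, contents: Data())
    }

    // Every append runs on the same serial queue, so writes from different threads
    // execute one at a time and never interleave mid-record.
    func append(_ record: String) {
        queue.async {
            if let handle = try? FileHandle(forWritingTo: self.url) {
                defer { try? handle.close() }
                _ = try? handle.seekToEnd()
                try? handle.write(contentsOf: Data(record.utf8))
            }
        }
    }
}

// Usage: two callers on different threads still produce whole, ordered records.
let file = SerializedEventFile(url: URL(fileURLWithPath: NSTemporaryDirectory() + "events.jsonl"))
DispatchQueue.global().async { file.append("{\"myEvent\":\"someThing\"},\n") }
DispatchQueue.global().async { file.append("{\"myEvent\":\"someOtherThing\"},\n") }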

@moaahmad commented:

Hi @bsneed, just to add to the context @tristan-warner-smith posted earlier, I thought it might be useful to add the following screenshots of the app hanging when we background and foreground the app. We've updated our fork to be in sync with main and have addressed an instance of potential large IO on the main thread being sent to the storage class on app launch, but we still see this issue. Can you please advise?

Screenshot 2023-11-27 at 09 08 52
Screenshot 2023-11-27 at 09 06 36
Screenshot 2023-11-27 at 09 08 32

@bsneed (Contributor) commented Nov 29, 2023

It appears I was incorrect regarding the sync->async thing causing issues. I've built some tests and it's working as it should despite the move to async. Change coming soon. I do think the deadlock there is a symptom of something else, such as a large amount of data being written, etc.

@bsneed (Contributor) commented Nov 29, 2023

OK, I stand corrected. In theory it'd work on a serial queue when mixing sync/async, but once I make the change, the thread sanitizer identifies half a dozen deadlocks. Unfortunately, I won't be able to do anything about this until I have a reproduction scenario. Given that it's all around file IO, you might look into the content of your events.

bsneed closed this as completed Nov 29, 2023
@tristan-warner-smith (Author) commented:

> It appears I was incorrect regarding the sync->async thing causing issues. I've built some tests and it's working as it should despite the move to async. Change coming soon. I do think the deadlock there is a symptom of something else, such as a large amount of data being written, etc.

I shared the exact amount of data we were passing through Segment above (subjectively it doesn't seem like a lot of event data); we've subsequently reduced it down to just two events tracked initially and had the same result. Note that the only related change we made to our codebase was upgrading from the previous iOS SDK to this one.

Beyond that, we're using a SwiftUI-based project and initialising a Segment singleton lazily with the first track call.
We use the Firebase + Firestore SDK 10.17; it likely does some writes on init, but we obviously don't control that.

We'll attempt to create a repro scenario when we can prioritise the time for it, but the sample project you linked seemed to only have pre-app-start synchronous event dispatch (in the app delegate), with no post-app-start events and no async scenarios from what I saw, so it doesn't seem comprehensive enough to represent the real world. We mostly interact with the Segment SDK via unstructured Tasks rather than DispatchQueue, so if there are specific thread or QoS requirements for using the SDK, please let us know, as in some cases we'll be calling from the MainActor.
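
For reference, the call pattern looks roughly like this (the view, property, event name, and properties are illustrative only):

import SwiftUI
import Segment

// Illustrative only: how we typically reach the SDK from SwiftUI, via an
// unstructured Task started from a main-thread context.
struct CheckInButton: View {
    let analytics: Analytics // the lazily created instance from the setup earlier in this thread

    var body: some View {
        Button("Check in") {
            // The button action runs on the main thread; we hop into an unstructured Task here.
            Task {
                analytics.track(
                    name: "abcd_abcdefghijk_abcdefj",
                    properties: ["platform": "iOS", "appVersion": "3.20.0"]
                )
            }
        }
    }
}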

If you can try to replicate the environment described above, you may see the issue. Let us know if you make any progress; we appreciate all the help you've given us so far in onboarding to the new product. It's really key to stability for our users.

@tristan-warner-smith (Author) commented:

@bsneed I'd appreciate you revisiting this bug report. We tried updating to the latest 1.5.5 release, given the mention of concurrency fixes to Sovran-Swift, but the issue we mentioned above still manifests.

That is, the main thread deadlocks here, where the Storage.syncQueue seems to run at the same QoS as DispatchQueue.main, and, as you know, calling DispatchQueue.main.sync results in the deadlock we're seeing.

On 1.5.5 we're now exclusively seeing this issue after the app has been backgrounded for ~2 minutes and brought back to the foreground.
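
For completeness, the deadlock shape described is the classic one sketched below (a standalone illustration, not the SDK's code):

import Dispatch

// Standalone illustration of the deadlock shape, not SDK code.
// A .sync call waits for its closure to finish; if the target queue can only
// run that closure after the caller returns, both sides wait forever.
func deadlocksWhenCalledFromTheMainThread() {
    DispatchQueue.main.sync {        // main can't run this until the caller returns,
        print("never reached")       // and the caller can't return until this runs
    }
}

// The indirect version looks the same in a stack trace: the main thread blocks in
// serialQueue.sync { ... } while the work ahead of it on serialQueue is itself
// waiting, directly or indirectly, on the main thread.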

[screenshot attached]

@bsneed (Contributor) commented Mar 1, 2024

I've got a PR coming in the next week or so that removes this whole storage mechanism and replaces it with something quite a bit simpler and less error prone. I'll reopen this until that lands and we can revisit.

@bsneed (Contributor) commented Mar 19, 2024

Closing this unless something else comes up. The connected PR went out with 1.5.6.

bsneed closed this as completed Mar 19, 2024
@tristan-warner-smith (Author) commented:

We've been able to migrate back to main with these changes. We've only been live two days but haven't noticed any hangs on app start so far. We're experiencing the network failures mentioned in a separate issue, but our Segment analytics seem to be unimpacted so far 🤞

@bsneed (Contributor) commented Mar 26, 2024

That's great to hear! Looking into the other one in tandem.
