
Storage calls seem to result in main thread deadlocks #277

Closed · tristan-warner-smith opened this issue Nov 23, 2023 · 14 comments

@tristan-warner-smith commented Nov 23, 2023

Describe the bug

  • We've updated from the previous Analytics-iOS SDK to Analytics-Swift and we're experiencing regular app hangs.
  • During a hang, pausing execution shows three threads accessing Storage functions that use the blocking syncQueue.sync.
  • We see:
    • Storage.write as a result of a track call where we're making a call to Analytics.track(name:properties:) from the main thread
    • Storage.remove from HttpClient.startBatchUpload > SegmentDestination.flush > Storage.remove
    • Storage.write as a result of another track call, this one from the sync.segment.com serial queue

See attached images.

To Reproduce

  • The behaviour often manifests when switching to other apps and back, but it also often hangs the app immediately on start, leaving it unusable.

Our initialisation is:

let analytics = Analytics(
    configuration: Configuration(
        writeKey: Config.shared.segmentSourceKey
    )
    .trackApplicationLifecycleEvents(true) // automatic lifecycle events (Application Opened, Backgrounded, etc.)
    .flushAt(3)                            // flush once 3 events are queued
    .flushInterval(10)                     // or every 10 seconds, whichever comes first
)
analytics.add(plugin: FirebaseDestination())

let engage = TwilioEngage { _, _ in }
analytics.add(plugin: engage)
analytics.add(plugin: CustomizeSegmentTrackCalls())

Initialisation is lazy on the first track event we fire.
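
For illustration, the lazy setup is roughly of this shape (the AnalyticsService wrapper below is illustrative, not our exact code):

import Segment // plus the Firebase destination / Twilio Engage plugin modules used above

final class AnalyticsService {
    static let shared = AnalyticsService()

    // `lazy` defers the configuration and plugin setup shown above until the
    // first access, i.e. until the first track event we fire.
    lazy var analytics: Analytics = {
        let analytics = Analytics(
            configuration: Configuration(writeKey: Config.shared.segmentSourceKey)
                .trackApplicationLifecycleEvents(true)
                .flushAt(3)
                .flushInterval(10)
        )
        analytics.add(plugin: FirebaseDestination())
        analytics.add(plugin: TwilioEngage { _, _ in })
        analytics.add(plugin: CustomizeSegmentTrackCalls())
        return analytics
    }()
}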

Expected behavior

  • The app should never hang.

Screenshots
Main thread Storage.write

Other Storage calls

Platform (please complete the following information):

  • Library Version in use: 1.4.8
  • Platform being tested: iOS 17
  • Integrations in use: SegmentFirebase, Twilio Engage
  • main commit c4bb71dea0b38c179b2d87b56ee096a06ce2ea86

Additional context

  • The workaround we've had to adopt in the very short term, before you address this bug, is to fork the repo and make Storage.write call syncQueue.async rather than syncQueue.sync (a simplified sketch follows after this list). This is obviously not ideal, since we aren't domain experts in your codebase, but in our testing this change removes all the main-thread blocking issues we encountered and still seems to send Segment events as expected.
  • Both track calls we make are early on in the app's initialisation.
  • We can share a video walking through the stack trace at the point in time of the images above, though we'll want to share that privately.
  • Sometimes it may just be two calls, not three, but the end result is the same.
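
To show the shape of our fork's change, here is a simplified sketch of the pattern (not the SDK's actual Storage implementation):

import Foundation

// Simplified sketch of the pattern only, not the SDK's Storage code.
final class StorageSketch {
    private let syncQueue = DispatchQueue(label: "com.example.storage.sync") // serial queue

    // Upstream behaviour: the caller (sometimes the main thread) blocks until the write completes.
    func writeBlocking(_ data: Data, to url: URL) {
        syncQueue.sync {
            try? data.write(to: url, options: .atomic)
        }
    }

    // Our fork's behaviour: the caller returns immediately; the serial queue still
    // executes the writes one at a time, in the order they were enqueued.
    func writeNonBlocking(_ data: Data, to url: URL) {
        syncQueue.async {
            try? data.write(to: url, options: .atomic)
        }
    }
}
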
@bsneed (Contributor) commented Nov 23, 2023

Thanks @tristan-warner-smith, I'll have a look at this and see if I can reproduce. If you could give me an example of the track calls you're sending, that would be helpful. There were some instances where large IO on the main thread sent to the storage class could result in something that takes long enough for the OS to kill the app. You might pull main and see if that improves things for you in the meantime.

However, your current workaround will almost certainly result in corrupt data given time. Those syncs are there on purpose to synchronize writes to a single file from potentially multiple threads.

@tristan-warner-smith (Author) commented:

> However, your current workaround will almost certainly result in corrupt data given time. Those syncs are there on purpose to synchronize writes to a single file from potentially multiple threads.

  • I assumed as much. The issue is that we now have a release in production that we get no telemetry for and no Crashlytics crash reports from, and we know that when this issue occurs the app is completely unusable.

Our users depend on us for immediate relief from pain, stress, etc., so we can't wait for perfect in this case.

If you can think of any better workarounds @bsneed that don't jeopardise our app's usability, we'd really appreciate it.

These are similar events emitted roughly on app start:

Name: abcd_abcdefghijk_abcdefj
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"is_first_session": true,
	"platform": "iOS",
	"trialStart": false,
	"clicked_branch_link": false,
	"appVersion": "3.20.0"
]

Name: "abcdefghij_abcde"
Parameters: [
	"trialStart": false,
	"platform": "iOS",
	"appVersion": "3.20.0",
	"userId": "abcdefghijklmnopqrstuvwxyz01"
]

Name: "abcdefghijklmno_abcde"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"variant_id": "baseline",
	"appVersion": "3.20.0",
	"event_timestamp": "2023-11-23T22:49:46.669Z",
	"trialStart": false,
	"experiment_id": "ab_abcdef_abcdefgh_abc",
	"platform": "iOS"
]

Name: "abcdefghijklmno_abcde"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"experiment_id": "ab_abcde_ab_abcd",
	"appVersion": "3.20.0",
	"event_timestamp": "2023-11-23T22:49:46.669Z",
	"trialStart": false,
	"variant_id": "baseline",
	"platform": "iOS"
]

Name: "abcdefghijklmno_abcde"
Parameters: [
	"userId": "abcdefghijklmnopqrstuvwxyz01",
	"experiment_id": "ab_abcd_abcdefghijkl_abc_ab",
	"platform": "iOS",
	"trialStart": false,
	"variant_id": "baseline",
	"event_timestamp": "2023-11-23T22:49:46.669Z",
	"appVersion": "3.20.0"
]

@bsneed (Contributor) commented Nov 23, 2023

Thanks @tristan-warner-smith! To clarify, I was just sharing information with you. You should always do (as you are already) what's best for you and your customers and I would never suggest otherwise. We'll dive into this on Monday, but since you have a repro scenario it'd be useful to check out main.

@johncDepop commented:

Could you outline a potential scenario that represents "However, your current workaround will almost certainly result in corrupt data given time"? I.e. is it about ordering of events / losing events?

@bsneed (Contributor) commented Nov 24, 2023

@johncDepop it's about two threads potentially trying to write to a JSON file at the same time. I.e. thread1 is mid-way through writing an event, thread2 writes an event in the middle, and you might end up with {myEvent="someThin{myEvent="someOtherThing"},g"} instead of {myEvent="someThing"},{myEvent="someOtherThing"},
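
To sketch the idea (a toy example with illustrative names, not the SDK's Storage code): funnelling every append through one serial queue keeps each record whole, because a record written from one thread can never land mid-way through a record written from another.

import Foundation

// Toy sketch of serialized event appends; the class, label, and file path are illustrative.
final class SerializedEventFile {
    private let queue = DispatchQueue(label: "com.example.eventfile.serial") // serial by default
    private let url: URL

    init(url: URL) {
        self.url = url
        _ = FileManager.default.createFile(atPath: url.path, contents: Data())
    }

    // Every append runs on the same serial queue, so writes from different threads
    // execute one at a time and never interleave mid-record.
    func append(_ record: String) {
        queue.async {
            if let handle = try? FileHandle(forWritingTo: self.url) {
                defer { try? handle.close() }
                _ = try? handle.seekToEnd()
                try? handle.write(contentsOf: Data(record.utf8))
            }
        }
    }
}

// Usage: two callers on different threads still produce whole, ordered records.
let file = SerializedEventFile(url: URL(fileURLWithPath: NSTemporaryDirectory() + "events.jsonl"))
DispatchQueue.global().async { file.append("{\"myEvent\":\"someThing\"},\n") }
DispatchQueue.global().async { file.append("{\"myEvent\":\"someOtherThing\"},\n") }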

@moaahmad commented:

Hi @bsneed, just to add to the context @tristan-warner-smith posted earlier, I thought it might be useful to add the following screenshots of the app hanging when we background and foreground the app. We've updated our fork to be in sync with main and have addressed an instance of potential large IO on the main thread being sent to the storage class on app launch, but we still see this issue. Can you please advise?

Screenshot 2023-11-27 at 09 08 52
Screenshot 2023-11-27 at 09 06 36
Screenshot 2023-11-27 at 09 08 32

@bsneed (Contributor) commented Nov 29, 2023

It appears I was incorrect regarding the sync->async thing causing issues. I've built some tests and it's working as it should despite the move to async. Change coming soon. I do think the deadlock there is a symptom of something else, such as a large amount of data being written, etc.

@bsneed (Contributor) commented Nov 29, 2023

OK, I stand corrected. In theory it'd work on a serial queue when mixing sync/async, but once I make the change, the thread sanitizer identifies half a dozen deadlocks. Unfortunately, I won't be able to do anything about this until I have a reproduction scenario. Given that it's all around file IO, you might look into the content of your events.

bsneed closed this as completed Nov 29, 2023
@tristan-warner-smith (Author) commented:

> It appears I was incorrect regarding the sync->async thing causing issues. I've built some tests and it's working as it should despite the move to async. Change coming soon. I do think the deadlock there is a symptom of something else, such as a large amount of data being written, etc.

I shared the exact amount of data we were passing through Segment above (subjectively it doesn't seem like a lot of event data); we've subsequently reduced it down to just two events tracked initially and had the same result. Note that the only related change we made to our codebase was upgrading from the previous iOS SDK to this one.

Beyond that, we're using a SwiftUI-based project and initialising a Segment singleton lazily with the first track call.
We use the Firebase + Firestore SDK 10.17; it likely does some writes on init, but we obviously don't control that.

We'll attempt to create a repro scenario when we can prioritise the time for it, but the sample project you linked seemed to only have pre-app-start synchronous event dispatch (in the app delegate), with no post-app-start events and no async scenarios from what I saw, so it doesn't seem comprehensive enough to represent the real world. We mostly interact with the Segment SDK via unstructured Tasks rather than DispatchQueue, so if there are specific thread or QoS requirements for using the SDK, please let us know, as in some cases we'll be calling from the MainActor.
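
For reference, the call pattern looks roughly like this (the view, property, event name, and properties are illustrative only):

import SwiftUI
import Segment

// Illustrative only: how we typically reach the SDK from SwiftUI, via an
// unstructured Task started from a main-thread context.
struct CheckInButton: View {
    let analytics: Analytics // the lazily created instance from the setup earlier in this thread

    var body: some View {
        Button("Check in") {
            // The button action runs on the main thread; we hop into an unstructured Task here.
            Task {
                analytics.track(
                    name: "abcd_abcdefghijk_abcdefj",
                    properties: ["platform": "iOS", "appVersion": "3.20.0"]
                )
            }
        }
    }
}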

If you can try to replicate the environment described above, you may see the issue. Let us know if you make any progress; we appreciate all the help you've given us so far in onboarding to the new product. It's really key to stability for our users.

@tristan-warner-smith (Author) commented:

@bsneed I'd appreciate you revisiting this bug report. We tried updating to the latest 1.5.5 release, given the mention of concurrency fixes to Sovran-Swift, but the issue we mentioned above still manifests.

That is, the main thread deadlocks here, where the Storage.syncQueue seems to run at the same QoS as DispatchQueue.main, and, as you know, calling DispatchQueue.main.sync results in the deadlock we're seeing.

On 1.5.5 we're now exclusively seeing this issue after the app has been backgrounded for ~2 minutes and brought back to the foreground.
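
For completeness, the deadlock shape described is the classic one sketched below (a standalone illustration, not the SDK's code):

import Dispatch

// Standalone illustration of the deadlock shape, not SDK code.
// A .sync call waits for its closure to finish; if the target queue can only
// run that closure after the caller returns, both sides wait forever.
func deadlocksWhenCalledFromTheMainThread() {
    DispatchQueue.main.sync {        // main can't run this until the caller returns,
        print("never reached")       // and the caller can't return until this runs
    }
}

// The indirect version looks the same in a stack trace: the main thread blocks in
// serialQueue.sync { ... } while the work ahead of it on serialQueue is itself
// waiting, directly or indirectly, on the main thread.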

[screenshot attached]

@bsneed (Contributor) commented Mar 1, 2024

I've got a PR coming in the next week or so that removes this whole storage mechanism and replaces it with something quite a bit simpler and less error prone. I'll reopen this until that lands and we can revisit.

@bsneed (Contributor) commented Mar 19, 2024

Closing this unless something else comes up. The connected PR went out with 1.5.6.

bsneed closed this as completed Mar 19, 2024
@tristan-warner-smith (Author) commented:

We've been able to migrate back to main with these changes. We've only been live two days but haven't noticed any hangs on app start so far. We're experiencing the network failures mentioned in a separate issue, but our Segment analytics seem to be unimpacted so far 🤞

@bsneed (Contributor) commented Mar 26, 2024

That's great to hear! Looking into the other one in tandem.
