InferenceSession - Catastrophic Error or Unspecified Error is thrown #22815

Open
saddam213 opened this issue Nov 12, 2024 · 7 comments
Labels
api:CSharp (issues related to the C# API), ep:DML (issues related to the DirectML execution provider)

Comments

@saddam213

saddam213 commented Nov 12, 2024

Describe the issue

Version 1.19.0
Sometimes, when starting an InferenceSession, a "Catastrophic Error" or "Unspecified Error" exception is thrown.

After that, no other session will work until the application is restarted.

New Unrelated Issue from Version 1.20.0
[ErrorCode:Fail] Trying to add a domain to DomainToVersion map, but the domain is already exist with version range (1, 1000). domain: "com.microsoft.extensions"

This error is new in 1.20.0 and occurs at random, like the other two errors, but it appears to be unrelated to them (see the comments below). I upgraded to 1.20.0 to see whether the first two errors were resolved; they were not, and the upgrade introduced this new one.

To reproduce

new InferenceSession("Model.onnx") with a known working model

This is extremely hard to replicate, but we are getting plenty of error reports. In most cases it happens on the first run after a system reboot; sometimes it just happens randomly.

Urgency

Urgent, live application that has started failing globally

Platform

Windows

OS Version

10 & 11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.0

ONNX Runtime API

C#

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

1.19.0

github-actions bot added the api:CSharp and ep:DML labels Nov 12, 2024
@saddam213
Author

saddam213 commented Nov 12, 2024

Bit more context:

I am the developer of Amuse.ai. Our app has been out for about a year, running DirectML inference without issue.

A few months back we upgraded from 1.18.1 to 1.19.0, and then we started getting a few error reports of "Catastrophic Error" when users tried to load a model.

However, it is now 10-20 reports a day, so it's somehow getting worse. A Windows update, perhaps?

After upgrading to 1.20.0 we now also get this new error. I'm actually hoping it's the root cause of "Catastrophic Error", because I can't find that string anywhere in OnnxRuntime.

@saddam213
Author

2024-11-13 09:19:13.0746439 [E:onnxruntime:, inference_session.cc:2118 onnxruntime::InferenceSession::Initialize::<lambda_a18664140bfa1274480334618139aa6c>::operator ()] Exception during initialization: D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(576)\onnxruntime.DLL!00007FFA288EE903: (caller: 00007FFA2886E449) Exception(1) tid(87c) 8000FFFF Catastrophic failure

2024-11-13 09:19:13.9065553 [E:onnxruntime:, inference_session.cc:2118 onnxruntime::InferenceSession::Initialize::<lambda_a18664140bfa1274480334618139aa6c>::operator ()] Exception during initialization: D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(576)\onnxruntime.DLL!00007FFA288EE903: (caller: 00007FFA2886E449) Exception(2) tid(1a78) 80004005 Unspecified error
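
For triage: the two log lines above differ only in the HRESULT. 8000FFFF is the standard COM code E_UNEXPECTED ("Catastrophic failure") and 80004005 is E_FAIL ("Unspecified error"). Both are generic codes, and both are raised from the same site, DmlGraphFusionHelper.cpp(576). Below is a minimal Python sketch for pulling those fields out of collected crash logs. The regex and the function name `parse_dml_failure` are assumptions based only on the two lines above, not on any documented ORT log format.

```python
import re

# Standard COM HRESULT values; only the two seen in this thread are listed.
KNOWN_HRESULTS = {
    0x8000FFFF: "E_UNEXPECTED",
    0x80004005: "E_FAIL",
}

# Assumed layout, inferred from the two log lines above:
#   ...<file>.cpp(<line>)\<module>!<addr>: (caller: <addr>) Exception(<n>) tid(<hex>) <HRESULT> <message>
LOG_PATTERN = re.compile(
    r"(?P<file>[^\\\s]+\.cpp)\((?P<line>\d+)\)"
    r".*?Exception\(\d+\) tid\((?P<tid>[0-9A-Fa-f]+)\) "
    r"(?P<hresult>[0-9A-Fa-f]{8}) (?P<message>.+)$"
)


def parse_dml_failure(log_line):
    """Return source file, line, HRESULT, and message, or None if no match."""
    match = LOG_PATTERN.search(log_line.strip())
    if match is None:
        return None
    code = int(match.group("hresult"), 16)
    return {
        "file": match.group("file"),
        "line": int(match.group("line")),
        "hresult": code,
        "name": KNOWN_HRESULTS.get(code, "UNKNOWN"),
        "message": match.group("message"),
    }


# First log line from this comment, split across raw strings for readability.
SAMPLE = (
    r"2024-11-13 09:19:13.0746439 [E:onnxruntime:, inference_session.cc:2118 "
    r"onnxruntime::InferenceSession::Initialize::<lambda_a18664140bfa1274480334618139aa6c>::operator ()] "
    r"Exception during initialization: "
    r"D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(576)"
    r"\onnxruntime.DLL!00007FFA288EE903: (caller: 00007FFA2886E449) "
    r"Exception(1) tid(87c) 8000FFFF Catastrophic failure"
)

print(parse_dml_failure(SAMPLE))
```

Grouping reports by (file, line, name) would show whether every failure comes from this one call site or from several.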

@skottmckay
Contributor

AFAIK "com.microsoft.extensions" is used by onnxruntime-extensions. The extensions have to be manually registered by calling SessionOptions.RegisterOrtExtensions. If you call that multiple times you'll get an error about the DomainToVersion map.

However that seems completely unrelated to any DML issues.

@saddam213
Author

saddam213 commented Nov 13, 2024

> AFAIK "com.microsoft.extensions" is used by onnxruntime-extensions. The extensions have to be manually registered by calling SessionOptions.RegisterOrtExtensions. If you call that multiple times you'll get an error about the DomainToVersion map.
>
> However that seems completely unrelated to any DML issues.

OK, then that error is a new one and unrelated to the other two.

I was hoping this new exception was the cause, but it just looks like a brand-new issue that bricks OnnxRuntime. Sigh.

We are unable to roll back to 1.18.1, as Flux and SD3-Large models do not run on the lower opset.

saddam213 changed the title from `[ErrorCode:Fail] Trying to add a domain to DomainToVersion map, but the domain is already exist with version range (1, 1000). domain: "com.microsoft.extensions"` to `InferenceSession - Catastrophic Error or Unspecified Error is thrown` on Nov 13, 2024
@saddam213
Author

saddam213 commented Nov 13, 2024

It seems to be system dependent: some systems do it, some don't. We have about 3000 concurrent active users, and maybe 4% face this issue.

I only have one laptop that does it, and only sometimes, with no rhyme or reason: same OS, same everything.

There is no state stored by the app that would affect DirectML initialization; it just seems to be a race condition inside the DML EP during initialization.

@fdwr
Contributor

fdwr commented Nov 16, 2024

Some debugging questions:

  • Do you notice it on any particular GPU or driver version range? You mentioned, even for ORT 1.19.1, that "its somehow getting worse? windows update?". DirectML.dll in System32\ hasn't been updated for a while, as DirectML.dll is matched with the version of onnxruntime.dll, but driver updates could be a possibility.
  • If you keep the same version of ORT but use an older version of DirectML (https://www.nuget.org/packages/Microsoft.AI.DirectML), do the failures go away?
  • Is the model proprietary? If so, are there parts of the model that could be shared for repro purposes if the model weights were zeroed?
  • Do you get any more diagnostic information with the DML debug layer running? See RUNTIME_EXCEPTION, 80070057 The parameter is incorrect in v1.17.3 #20464 (comment).

@saddam213
Author

saddam213 commented Nov 16, 2024

This issue seems to occur across various GPUs and driver types—I haven't identified a clear pattern in the crash reports. Both AMD and NVIDIA GPUs appear to trigger this error.

I have tested almost every combination of DirectML.dll (1.15.x) and onnxruntime.dll + Microsoft.ML.OnnxRuntime.dll (1.19.x). However, I do know that 1.14.1 and 1.18.0 work reliably without any failures.

The issue does not appear to be model-dependent. It occurs with the first model the app attempts to load, and the exception is thrown instantly. It doesn't even seem to reach IO/disk: even something as simple as a tokenizer, which has been used successfully thousands of times before, will throw the error.

I’ve been diagnosing this issue for several months and held off reporting it earlier to ensure it wasn’t an issue in our application. After replicating the issue with a simple few lines of code outside of OnnxStack and Amuse, I am confident that this is not a bug in our codebase or models.

This feels environmental to me due to the way the error presents itself in some cases. In about 90% of reports, the crash occurs the first time the app is started after a reboot.

At one point, I was able to replicate this issue consistently on a test laptop, where it occurred after every reboot. Below are the tests I conducted:

Test 1 - Amuse

  1. Install App → Start App → Load Model → OK
  2. Restart PC
  3. Start App → Load Model → Fail
  4. Start App → Load Model → OK
  5. No further failures until the next PC reboot.

Test 2 - Amuse

  1. Install App
  2. Restart PC
  3. Start App → Load Model → Fail
  4. Start App → Load Model → OK
  5. No further failures until the next PC reboot.

In this test, the app was never started before the reboot and had no state—just fresh files copied to disk.

Test 3 - Debug App

  1. Start AppA → Load Model → OK
  2. Start AppB → Load Model → OK
  3. Restart PC
  4. Start AppA → Load Model → Fail
  5. Start AppB → Load Model → OK
  6. Start AppA → Load Model → OK
  7. No further failures until the next PC reboot.

The debug app was a simple .NET console application that opened a new model InferenceSession. I tested with multiple models, and the issue seemed to happen regardless of the model used. Additionally, the order of starting AppA or AppB did not matter: the first app would fail, and the second would succeed. This rules out issues with Amuse, OnnxStack, OnnxRuntime.Extensions, or the Self-Contained build.
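
The reboot pattern in the three tests above can be checked mechanically against collected reports. The sketch below assumes a purely hypothetical event log (a chronologically ordered list of the strings "reboot", "load_ok", and "load_fail"); neither that format nor the function name comes from Amuse or OnnxRuntime. It counts how many load failures were the first load after a reboot versus any other load.

```python
def failures_after_reboot(events):
    """Split load failures into (first load after a reboot, all other loads).

    `events` is a hypothetical, chronologically ordered list containing
    the strings "reboot", "load_ok", or "load_fail".
    """
    first_after_reboot = 0
    other = 0
    pending_reboot = False  # True from a reboot until the next load attempt

    for event in events:
        if event == "reboot":
            pending_reboot = True
            continue
        if event not in ("load_ok", "load_fail"):
            raise ValueError(f"unknown event: {event!r}")
        if event == "load_fail":
            if pending_reboot:
                first_after_reboot += 1
            else:
                other += 1
        # Any load attempt, successful or not, consumes the reboot flag.
        pending_reboot = False

    return first_after_reboot, other


# Test 3 above: AppA OK, AppB OK, reboot, AppA fail, AppB OK, AppA OK.
test3 = ["load_ok", "load_ok", "reboot", "load_fail", "load_ok", "load_ok"]
print(failures_after_reboot(test3))
```

A high share of first-after-reboot failures across real reports would support the environmental explanation; a significant share of other failures would point elsewhere.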

Interestingly, the test laptop eventually stopped exhibiting this behavior and has not done so again, regardless of how many times I reboot. This intermittent behavior suggests a strange race condition.

I’ve also occasionally encountered this issue during development in Visual Studio—about 2-3 times per week out of thousands of model loads.

OnnxRuntime 1.20.0 exhibits the same behavior, but it also produces the new exception reported above. That exception seems to be related to OnnxRuntime.Extensions, but it happens at the same time, so it seems noteworthy.

Currently, we are in the process of rolling back to OnnxRuntime 1.18.0, and this will hopefully roll out to users in the next few releases. We anticipate having more definitive information from that version soon.

I will set up the debugging environment as you suggested and hope to capture more details if I’m lucky.
