InferenceSession - Catastrophic Error or Unspecified Error is thrown #22815

saddam213 · 2024-11-12T17:38:42Z

Describe the issue

Version 1.19.0
Sometimes when starting an InferenceSession, one of two exceptions is thrown: Catastrophic Error or Unspecified Error.

Once this happens, no other session will work until the application is restarted.

New Unrelated Issue from Version 1.20.0
[ErrorCode:Fail] Trying to add a domain to DomainToVersion map, but the domain is already exist with version range (1, 1000). domain: "com.microsoft.extensions"

This is new in 1.20.0 and happens at random like the other two errors, but per the comments below it seems to be unrelated. I upgraded to 1.20.0 to see whether the first two errors were resolved; they were not, and the upgrade introduced this new one.

To reproduce

new InferenceSession("Model.onnx") with a known working model

This is extremely hard to replicate, but we are getting plenty of error reports. In most cases it happens the first time after a system reboot; sometimes it just happens randomly.

Urgency

Urgent, live application that has started failing globally

Platform

Windows

OS Version

10 & 11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.0

ONNX Runtime API

C#

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

1.19.0


saddam213 · 2024-11-12T17:53:55Z

Bit more context:

I am the developer of Amuse.ai. Our app has been out for about a year, running DirectML inference without issue.

A few months back we upgraded from 1.18.1 to 1.19.0, and then we started getting a few error reports of "Catastrophic Error" when the user tried to load a model.

However, it is now 10-20 reports a day, so it's somehow getting worse. Windows Update?

After upgrading to 1.20.0 we now also get this new error. I was actually hoping it's the root cause of Catastrophic Error, because I can't find that string anywhere in OnnxRuntime.

saddam213 · 2024-11-12T20:20:17Z

2024-11-13 09:19:13.0746439 [E:onnxruntime:, inference_session.cc:2118 onnxruntime::InferenceSession::Initialize::<lambda_a18664140bfa1274480334618139aa6c>::operator ()] Exception during initialization: D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(576)\onnxruntime.DLL!00007FFA288EE903: (caller: 00007FFA2886E449) Exception(1) tid(87c) 8000FFFF Catastrophic failure

2024-11-13 09:19:13.9065553 [E:onnxruntime:, inference_session.cc:2118 onnxruntime::InferenceSession::Initialize::<lambda_a18664140bfa1274480334618139aa6c>::operator ()] Exception during initialization: D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(576)\onnxruntime.DLL!00007FFA288EE903: (caller: 00007FFA2886E449) Exception(2) tid(1a78) 80004005 Unspecified error
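
For reference, the two codes at the end of these log lines are generic COM HRESULT values rather than ONNX Runtime error strings: 0x8000FFFF is E_UNEXPECTED (which Windows renders as "Catastrophic failure") and 0x80004005 is E_FAIL ("Unspecified error"). Below is a minimal sketch of pulling the code out of such a line when triaging crash reports. The HResultLog class, the regular expression, and the three-entry lookup table are illustrative assumptions, not part of ONNX Runtime.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

static class HResultLog
{
    // A few well-known HRESULTs; extend as needed (hypothetical helper table).
    private static readonly Dictionary<uint, string> Known = new()
    {
        [0x8000FFFF] = "E_UNEXPECTED",   // "Catastrophic failure"
        [0x80004005] = "E_FAIL",         // "Unspecified error"
        [0x80070057] = "E_INVALIDARG",   // "The parameter is incorrect"
    };

    // Assumes the code is the 8 hex digits right after "tid(...)",
    // as in the two log lines above.
    private static readonly Regex Pattern =
        new(@"tid\([0-9a-fA-F]+\)\s+([0-9a-fA-F]{8})\b");

    // Returns the symbolic name, the raw hex code if unknown, or null if no code is found.
    public static string? Decode(string logLine)
    {
        Match m = Pattern.Match(logLine);
        if (!m.Success) return null;

        uint code = uint.Parse(m.Groups[1].Value,
                               NumberStyles.HexNumber,
                               CultureInfo.InvariantCulture);
        return Known.TryGetValue(code, out string? name) ? name : $"0x{code:X8}";
    }
}
```

Applied to the first log line above, Decode returns "E_UNEXPECTED"; a line without a tid(...) marker returns null.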

skottmckay · 2024-11-13T08:35:18Z

AFAIK "com.microsoft.extensions" is used by onnxruntime-extensions. The extensions have to be manually registered by calling SessionOptions.RegisterOrtExtensions. If you call that multiple times you'll get an error about the DomainToVersion map.

However that seems completely unrelated to any DML issues.
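
One way to avoid the double registration described above is to funnel every call through a guard that remembers which objects have already been registered. The sketch below is a generic helper, not an ONNX Runtime API: RegisterOnce and its Run method are hypothetical names, and treating registration as "once per SessionOptions instance" is an assumption drawn from the comment above.

```csharp
using System;
using System.Runtime.CompilerServices;

static class RegisterOnce
{
    // Weak keys, so tracked objects can still be garbage collected.
    private static readonly ConditionalWeakTable<object, object> Seen = new();
    private static readonly object Gate = new();

    // Runs `register` for `target` only the first time this target is seen.
    // Returns true if the action ran, false if it was skipped.
    public static bool Run<T>(T target, Action<T> register) where T : class
    {
        lock (Gate)
        {
            if (Seen.TryGetValue(target, out _)) return false;
            register(target);        // if this throws, the target is not marked as seen
            Seen.Add(target, Gate);  // the stored value is only a marker
            return true;
        }
    }
}
```

With ONNX Runtime this would be called as RegisterOnce.Run(options, o => o.RegisterOrtExtensions()), where options is the SessionOptions instance about to be passed to InferenceSession.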

saddam213 · 2024-11-13T18:00:13Z

> AFAIK "com.microsoft.extensions" is used by onnxruntime-extensions. The extensions have to be manually registered by calling SessionOptions.RegisterOrtExtensions. If you call that multiple times you'll get an error about the DomainToVersion map.
>
> However that seems completely unrelated to any DML issues.

OK, then that error is a new one and unrelated to the other two.

I was hoping this new exception was the cause, but it just looks like a brand new issue that bricks OnnxRuntime, sigh.

We are unable to roll back to 1.18.1, as the Flux and SD3-Large models do not run on the lower opset.

saddam213 · 2024-11-13T18:26:22Z

It seems to be system dependent: some systems do it, some don't. We have about 3000 concurrent active users and maybe 4% face this issue.

I only have one laptop that does it, and only sometimes, with no rhyme or reason: same OS, same everything.

There is no state stored by the app that would affect DirectML initialization. It just seems to be a race condition inside the DML EP during initialization.

fdwr · 2024-11-16T01:03:48Z

Some debugging questions:

Do you notice it on any particular GPU or driver version range? You mentioned, even for ORT 1.19.1, that "its somehow getting worse? windows update?". DirectML.dll in System32\ hasn't been updated for a while, as DirectML.dll is matched with the version of onnxruntime.dll, but driver updates could be a possibility.
If you keep the same version of ORT but use an older version of DirectML (https://www.nuget.org/packages/Microsoft.AI.DirectML), do the failures go away?
Is the model proprietary? If so, are there parts of the model that can be shared for repro purposes if the model weights were zeroed?
Do you get any more diagnostic information with the DML debug layer running? See the comment in "RUNTIME_EXCEPTION, 80070057 The parameter is incorrect in v1.17.3" (#20464).

saddam213 · 2024-11-16T22:46:50Z

This issue seems to occur across various GPUs and driver types—I haven't identified a clear pattern in the crash reports. Both AMD and NVIDIA GPUs appear to trigger this error.

I have tested almost every combination of DirectML.dll (1.15.x) and onnxruntime.dll + Microsoft.ML.OnnxRuntime.dll (1.19.x). However, I do know that 1.14.1 and 1.18.0 work reliably without any failures.

The issue does not appear to be model-dependent. It occurs with the first model that is attempted to load, and the exception is thrown instantly. It doesn’t seem to even reach IO/Disk, as even something as simple as a tokenizer, which has been used successfully thousands of times before, will throw the error.

I’ve been diagnosing this issue for several months and held off reporting it earlier to ensure it wasn’t an issue in our application. After replicating the issue with a simple few lines of code outside of OnnxStack and Amuse, I am confident that this is not a bug in our codebase or models.

This feels environmental to me due to the way the error presents itself in some cases. In about 90% of reports, the crash occurs the first time the app is started after a reboot.

At one point, I was able to replicate this issue consistently on a test laptop, where it occurred after every reboot. Below are the tests I conducted:

Test 1 - Amuse

Install App → Start App → Load Model → OK
Restart PC
Start App → Load Model → Fail
Start App → Load Model → OK
No further failures until the next PC reboot.

Test 2 - Amuse

Install App
Restart PC
Start App → Load Model → Fail
Start App → Load Model → OK
No further failures until the next PC reboot.

In this test, the app was never started before the reboot and had no state—just fresh files copied to disk.

Test 3 - Debug App

Start AppA → Load Model → OK
Start AppB → Load Model → OK
Restart PC
Start AppA → Load Model → Fail
Start AppB → Load Model → OK
Start AppA → Load Model → OK
No further failures until the next PC reboot.

The debug app was a simple .NET console application that opened a new model InferenceSession. I tested with multiple models, and the issue seemed to happen regardless of the model used. Additionally, the order of starting AppA or AppB did not matter: the first app would fail, and the second would succeed. This rules out issues with Amuse, OnnxStack, OnnxRuntime.Extensions, or the Self-Contained build.
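
A debug app along these lines might look like the following minimal sketch. It is not the original test code: it assumes the Microsoft.ML.OnnxRuntime.DirectML NuGet package, DirectML device 0, and a model path passed as the first argument (defaulting to the placeholder "Model.onnx" from the report).

```csharp
using System;
using Microsoft.ML.OnnxRuntime;

static class Program
{
    static int Main(string[] args)
    {
        // Placeholder path; pass a known working model as the first argument.
        string modelPath = args.Length > 0 ? args[0] : "Model.onnx";

        try
        {
            using var options = new SessionOptions();
            options.AppendExecutionProvider_DML(0); // device 0 is an assumption

            // Session creation is the step that fails in the reports above.
            using var session = new InferenceSession(modelPath, options);

            Console.WriteLine($"{DateTime.Now:O} OK: session created for {modelPath}");
            return 0;
        }
        catch (OnnxRuntimeException ex)
        {
            // Log the message so the HRESULT text (if any) ends up in the output.
            Console.Error.WriteLine($"{DateTime.Now:O} FAIL: {ex.Message}");
            return 1;
        }
    }
}
```

Copying the same build into two folders (AppA and AppB) and running each once after a reboot gives the sequence used in Test 3, with the exit code recording which launch failed.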

Interestingly, the test laptop eventually stopped exhibiting this behavior and has not done so again, regardless of how many times I reboot. This intermittent behavior suggests a strange race condition.

I’ve also occasionally encountered this issue during development in Visual Studio—about 2-3 times per week out of thousands of model loads.

OnnxRuntime 1.20.0 exhibits the same behavior, but it also throws the new exception reported above. That exception seems to be related to OnnxRuntime.Extensions, but it happens at the same time, so it seems noteworthy.

Currently, we are in the process of rolling back to OnnxRuntime 1.18.0, and this will hopefully roll out to users in the next few releases. We anticipate having more definitive information from that version soon.

I will set up the debugging environment as you suggested and hope to capture more details if I’m lucky.

github-actions bot added the api:CSharp (issues related to the C# API) and ep:DML (issues related to the DirectML execution provider) labels on Nov 12, 2024

saddam213 changed the title from "[ErrorCode:Fail] Trying to add a domain to DomainToVersion map, but the domain is already exist with version range (1, 1000). domain: "com.microsoft.extensions"" to "InferenceSession - Catastrophic Error or Unspecified Error is thrown" on Nov 13, 2024
