[DRAFT] Capacity aware partitioning #22766
base: main
Conversation
How about the intermediate memory usage (workspace) for each node? That is usually unknown during partitioning, and even unknown during inference, since an op currently has no interface to report its workspace size. For example, the MultiHeadAttention op might call different CUDA kernels (flash attention, cutlass FMHA, TensorRT FMHA, or an unfused kernel), each of which has different memory consumption.
This is true. The function currently accounts for initializers and inputs. It cannot account for temporary allocations because those are made at inference time, and partitioning takes place well before kernels are instantiated. The approach of computing memory patterns cannot be taken here, since it relies on having a runnable model, which we do not have today in a constrained environment. This PR is still at the experimental stage. I envision that most of the burden would be placed on the individual EPs.

The simplest way is to add an additional if/else to enumerate the kernels and attempt to infer the amount of temporary space. However, that creates an additional maintenance burden, since we already have plenty of such places in optimizers and elsewhere where we need to make sure that changes to individual kernels are reflected.

Still, the feature would work in its current form: one can try one setting and then lower it if the consumption is too high. Another idea would be to run the model beforehand and record the consumption, then use that trace to set the limit in the constrained environment.
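As a rough illustration of the trace-then-limit idea above, one could measure how much device memory a trial run leaves allocated and feed that number back as the partitioning limit in the constrained environment. This is only a sketch under assumptions: cudaMemGetInfo is the standard CUDA runtime call, while RunTrialInference is a hypothetical placeholder for whatever executes the model once in the unconstrained environment.

#include <cuda_runtime.h>
#include <cstddef>

void RunTrialInference();  // hypothetical: executes the model once on the full device

// Estimate the device memory a trial run holds on to; this is only a lower bound
// on peak usage (temporary workspace freed during the run is not captured).
size_t MeasureDeviceMemoryUsedByTrialRun() {
  size_t free_before = 0, total = 0;
  cudaMemGetInfo(&free_before, &total);

  RunTrialInference();

  size_t free_after = 0;
  cudaMemGetInfo(&free_after, &total);
  return free_before - free_after;
}

The resulting number of bytes could then be used, with some headroom, as the value passed to the capacity-aware partitioning setting.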
If so, I think the feature is not very helpful for vision or LLM models due to these limitations.
That's a good idea, and it would be great if we could support that use case. BTW, a general way to help with capacity constraints would be to let users manually configure the location of initializers and inputs. That could be extended to offloading initializers to CPU and only loading them onto the GPU when needed.
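A purely conceptual sketch of that copy-on-demand idea, using only plain CUDA runtime calls (this is not an existing ONNX Runtime API; StageInitializerOnDevice is a hypothetical helper): keep an initializer in host memory and stage it on the device only when the consuming node is about to run.

#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: copy one host-resident initializer to the device on demand.
// The caller frees the returned buffer with cudaFree once the consuming kernel is done.
float* StageInitializerOnDevice(const std::vector<float>& host_weight, cudaStream_t stream) {
  float* device_ptr = nullptr;
  const size_t bytes = host_weight.size() * sizeof(float);
  cudaMalloc(&device_ptr, bytes);
  cudaMemcpyAsync(device_ptr, host_weight.data(), bytes, cudaMemcpyHostToDevice, stream);
  return device_ptr;
}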
Force-pushed from d515976 to 6244735
Implement GetSizeFromTensorTypeProto
Wire in accounting
Make CUDA EP resource aware and account on assignment
Fix missing accountant for Ort format
Remove redundant functions
Remove unnecessary interface
Fix DML issue, minor fixes
Fix alert
DEMO changes
Force-pushed from 6244735 to b2bb641
You can commit the suggested changes from lintrunner.
}
}
}
/*
std::vector<std::unique_ptr<ComputeCapability>> result;
}
}
}
/*
std::vector<std::unique_ptr<ComputeCapability>> result;
// XXX: For demo only
// constexpr const size_t kNodeCountThreshold = 800;
// static std::atomic_size_t nodes_assigned = 0;
// if (nodes_assigned.fetch_add(1) > kNodeCountThreshold) {
// ORT_THROW("CUDA EP is running out of memory");
~SizeTAccountant() = default;

explicit SizeTAccountant(size_t threshold, const InlinedHashMap<std::string, NodeAllocationStats>&& node_stats)
    : IResourceAccountant(threshold), node_stats_(std::move(node_stats)) {}
Check warning (Code scanning / PREfast): Don't use std::move on constant variables (es.56).
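One hedged way to address this warning, mirroring the declaration in the diff (the surrounding class definition is assumed): take the map by value instead of by const rvalue reference, so the std::move in the member initializer actually moves instead of silently copying.

explicit SizeTAccountant(size_t threshold, InlinedHashMap<std::string, NodeAllocationStats> node_stats)
    : IResourceAccountant(threshold), node_stats_(std::move(node_stats)) {}

Callers that pass a temporary or an explicitly moved map then incur a single move rather than a copy.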
// Produces node stats for the model. This requires running the model.
// TEST(SessionStateTest, TestResourceAwareParitioningSaveNodeStats) {
//
// const auto& log_manager = DefaultLoggingManager();
// log_manager.SetDefaultLoggerSeverity(onnxruntime::logging::Severity::kVERBOSE);
// const auto& default_logger = log_manager.DefaultLogger();
// std::unordered_map<std::string, int> domain_to_version;
// domain_to_version[kOnnxDomain] = 16; // We can make it a parameter
// Model model("LargeModel", false, ModelMetaData(), PathString(), IOnnxRuntimeOpSchemaRegistryList(),
// domain_to_version, {}, default_logger);
//
// const std::vector<int64_t> input_shape = {1024, 1024};
// constexpr const size_t approx_init_a_size = 1024 * 1024; // 1Mb
// constexpr const size_t approx_init_b_size = 1024 * 1024; // 1Mb
//
// auto& graph = model.MainGraph();
// BuildTestModel(graph, input_shape, approx_init_a_size, approx_init_b_size);
// ASSERT_STATUS_OK(graph.Resolve());
//
// auto model_proto = model.ToProto();
// const auto model_string = model_proto.SerializeAsString();
// std::ofstream model_file("model.onnx", std::ios::binary);
//}
Check notice (Code scanning / CodeQL): Commented-out code (test).
// TEST(SessionStateTest, TestResourceAwarePartitioning_LargeLimit) {
// const auto& log_manager = DefaultLoggingManager();
// log_manager.SetDefaultLoggerSeverity(onnxruntime::logging::Severity::kVERBOSE);
// const auto& default_logger = log_manager.DefaultLogger();
// std::unordered_map<std::string, int> domain_to_version;
// domain_to_version[kOnnxDomain] = 16; // We can make it a parameter
// Model model("LargeModel", false, ModelMetaData(), PathString(), IOnnxRuntimeOpSchemaRegistryList(),
// domain_to_version, {}, default_logger);
//
// // Input Shape
// const std::vector<int64_t> input_shape = {1024, 1024};
// constexpr const size_t approx_init_a_size = 1024 * 1024; // 1Mb
// constexpr const size_t approx_init_b_size = 1024 * 1024; // 1Mb
//
// auto& graph = model.MainGraph();
// BuildTestModel(graph, input_shape, approx_init_a_size, approx_init_b_size);
// ASSERT_STATUS_OK(graph.Resolve());
//
// OrtThreadPoolParams to;
// to.thread_pool_size = 1;
// auto tp = concurrency::CreateThreadPool(&onnxruntime::Env::Default(), to, concurrency::ThreadPoolType::INTRA_OP);
//
// ExecutionProviders execution_providers;
// auto tmp_cpu_execution_provider = DefaultCudaExecutionProvider();
// tmp_cpu_execution_provider->SetLogger(&default_logger);
// ASSERT_STATUS_OK(execution_providers.Add(kCudaExecutionProvider, std::move(tmp_cpu_execution_provider)));
//
// KernelRegistryManager krm;
// ASSERT_STATUS_OK(krm.RegisterKernels(execution_providers));
//
// DataTransferManager dtm;
// ExternalDataLoaderManager edlm;
// profiling::Profiler profiler;
// // Try to load the model without restrictions
// // and verify nodes have been placed to CUDA
// SessionOptions sess_options;
// sess_options.enable_mem_pattern = false;
// sess_options.execution_mode = ExecutionMode::ORT_SEQUENTIAL;
// sess_options.use_deterministic_compute = false;
// sess_options.enable_mem_reuse = false;
// ASSERT_STATUS_OK(sess_options.config_options.AddConfigEntry(kOrtSessionOptionsResourceCudaPartitioningSettings,
// "4206592"));
//
// SessionState session_state(graph, execution_providers, tp.get(), nullptr, dtm, edlm,
// default_logger, profiler, sess_options);
//
// GraphPartitioner partitioner(krm, execution_providers);
// layout_transformation::TransformLayoutFunction transform_layout_fn;
// layout_transformation::DebugGraphFn debug_graph_fn;
// ASSERT_STATUS_OK(
// partitioner.Partition(graph, session_state.GetMutableFuncMgr(), transform_layout_fn,
// sess_options.config_options, default_logger,
// GraphPartitioner::Mode::kNormal, debug_graph_fn));
//
// // All nodes have been placed to CUDA
// const auto& graph_nodes = graph.Nodes();
// for (const auto& node : graph_nodes) {
// EXPECT_EQ(node.GetExecutionProviderType(), kCudaExecutionProvider);
// }
// }
Check notice (Code scanning / CodeQL): Commented-out code (test).
// TEST(SessionStateTest, TestResourceAwarePartitioning_SecondNodeCutOff) {
// const auto& log_manager = DefaultLoggingManager();
// log_manager.SetDefaultLoggerSeverity(onnxruntime::logging::Severity::kVERBOSE);
// const auto& default_logger = log_manager.DefaultLogger();
// std::unordered_map<std::string, int> domain_to_version;
// domain_to_version[kOnnxDomain] = 16; // We can make it a parameter
// Model model("LargeModel", false, ModelMetaData(), PathString(), IOnnxRuntimeOpSchemaRegistryList(),
// domain_to_version, {}, default_logger);
//
// // Input Shape
// const std::vector<int64_t> input_shape = {1024, 1024};
// constexpr const size_t approx_init_a_size = 1024 * 1024; // 1Mb
// constexpr const size_t approx_init_b_size = 1024 * 1024; // 1Mb
//
// auto& graph = model.MainGraph();
// BuildTestModel(graph, input_shape, approx_init_a_size, approx_init_b_size);
// ASSERT_STATUS_OK(graph.Resolve());
//
// OrtThreadPoolParams to;
// to.thread_pool_size = 1;
// auto tp = concurrency::CreateThreadPool(&onnxruntime::Env::Default(), to, concurrency::ThreadPoolType::INTRA_OP);
//
// ExecutionProviders execution_providers;
// auto tmp_cpu_execution_provider = DefaultCudaExecutionProvider();
// tmp_cpu_execution_provider->SetLogger(&default_logger);
// ASSERT_STATUS_OK(execution_providers.Add(kCudaExecutionProvider, std::move(tmp_cpu_execution_provider)));
//
// KernelRegistryManager krm;
// ASSERT_STATUS_OK(krm.RegisterKernels(execution_providers));
//
// DataTransferManager dtm;
// ExternalDataLoaderManager edlm;
// profiling::Profiler profiler;
// // Try to load the model without restrictions
// // and verify nodes have been placed to CUDA
// SessionOptions sess_options;
// sess_options.enable_mem_pattern = false;
// sess_options.execution_mode = ExecutionMode::ORT_SEQUENTIAL;
// sess_options.use_deterministic_compute = false;
// sess_options.enable_mem_reuse = false;
// ASSERT_STATUS_OK(sess_options.config_options.AddConfigEntry(kOrtSessionOptionsResourceCudaPartitioningSettings,
// "16383"));
//
// SessionState session_state(graph, execution_providers, tp.get(), nullptr, dtm, edlm,
// default_logger, profiler, sess_options);
//
// GraphPartitioner partitioner(krm, execution_providers);
// layout_transformation::TransformLayoutFunction transform_layout_fn;
// layout_transformation::DebugGraphFn debug_graph_fn;
// ASSERT_STATUS_OK(
// partitioner.Partition(graph, session_state.GetMutableFuncMgr(), transform_layout_fn,
// sess_options.config_options, default_logger,
// GraphPartitioner::Mode::kNormal, debug_graph_fn));
//
// // Second node did not make it to CUDA
// const auto& graph_nodes = graph.Nodes();
// size_t count = 0;
// for (const auto& node : graph_nodes) {
// if (count == 0) {
// EXPECT_EQ(node.GetExecutionProviderType(), kCudaExecutionProvider);
// } else {
// EXPECT_TRUE(node.GetExecutionProviderType().empty());
// }
// count++;
// }
// }
Check notice (Code scanning / CodeQL): Commented-out code (test).
Description
Allow users to specify per-EP resource constraints.
This PR demonstrates how this can be done for the CUDA EP with a memory constraint.
In this implementation, we stop assigning nodes to CUDA once we reach the specified memory limit.
However, there is a provision for the EP to do this automatically.
Motivation and Context
We want to allow models to run in constrained environments.
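For reference, a minimal sketch of how the constraint could be enabled, modeled on the commented-out tests in this PR (internal onnxruntime types; the config key constant and the byte value shown are taken from those tests and may change while the PR is in draft):

SessionOptions sess_options;
// Stop assigning nodes to the CUDA EP once roughly 4 MB of initializers and inputs
// have been accounted for; remaining nodes fall back to other providers.
ASSERT_STATUS_OK(sess_options.config_options.AddConfigEntry(
    kOrtSessionOptionsResourceCudaPartitioningSettings, "4206592"));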