
🐛 [Bug] Segmentation fault when calling torchtrt::ts::compile in Torch-TensorRT #2842

Open
demuxin opened this issue May 16, 2024 · 2 comments
Labels: bug (Something isn't working)


demuxin commented May 16, 2024

Bug Description

When I use the code below to compile a TorchScript model, a segmentation fault occurs.

I compiled Torch-TensorRT from source in debug mode and ran the program under GDB.

I found that the error appears on line 100:

torch::jit::IValue jit_results_ivalues = cur_mod.forward(jit_inputs_ivalues);

Continuing to debug, I found that seg_block.raw_inputs() on line 182 is a std::vector of length 0, which causes jit_inputs_ivalues on line 222 to also be a std::vector of length 0.

std::vector<torch::jit::IValue> jit_inputs_ivalues;
// set input ivalues; currently supports Tensor/Int to pass arguments between different segments
for (auto& input : seg_block.raw_inputs()) {
  TORCHTRT_CHECK(
      ivalues_maps.count(input),
      "Could not find torch::jit::Value* " << input->debugName() << " produced from "
                                           << util::node_info(input->node())
                                           << " in lowering graph for mini graph input.\n");
  if (input->node()->kind() == torch::jit::prim::Param) {
    jit_inputs_ivalues.push_back(ivalues_maps[input]);
  } else if (input->type()->isSubtypeOf(torch::jit::TensorType::get())) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toTensor());
  } else if (input->type()->isSubtypeOf(torch::jit::IntType::get())) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toInt());
  } else if (input->type()->isSubtypeOf(torch::jit::BoolType::get())) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toBool());
  } else if (input->type()->isSubtypeOf(torch::jit::FloatType::get())) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toDouble());
  } else if (input->type()->isSubtypeOf(torch::jit::StringType::get())) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toString());
  } else if (input->type()->kind() == torch::jit::TypeKind::ListType) {
    // create list
    jit_inputs_ivalues.push_back(ivalues_maps[input].toList());
  } else if (input->type()->kind() == torch::jit::TypeKind::TupleType) {
    // create tuple
    jit_inputs_ivalues.push_back(ivalues_maps[input].toTuple());
  } else if (input->type()->kind() == torch::jit::TypeKind::NumberType) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toScalar());
  } else if (input->type()->kind() == torch::jit::TypeKind::DictType) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toGenericDict());
  } else if (input->type()->kind() == torch::jit::TypeKind::DeviceObjType) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toDevice());
  } else {
    TORCHTRT_THROW_ERROR(
        "Expected to find type " << input->type()->str() << " for value " << input->debugName()
                                 << " but got nothing.");
  }
}
// run segments to get outputs for later segments' input shapes, and other arguments such as Int
std::vector<torch::jit::IValue> jit_results;
torch::jit::IValue jit_results_ivalues = cur_mod.forward(jit_inputs_ivalues);
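
Given that, a minimal defensive guard before building jit_inputs_ivalues would at least turn the crash into a readable error (just a sketch on my side, not a fix for the root cause of the empty raw_inputs()):

// Hypothetical guard, not in the Torch-TensorRT source: fail loudly
// instead of calling forward() with an empty argument list when the
// segment reports no inputs.
if (seg_block.raw_inputs().empty()) {
  TORCHTRT_THROW_ERROR(
      "Segmented block has no inputs; cannot run shape analysis by executing it.");
}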

Here is a simplified version of my code:

// Assumed includes / namespace alias for this snippet:
#include <torch/script.h>
#include "torch_tensorrt/torch_tensorrt.h"
namespace torchtrt = torch_tensorrt;

torch::Device* device_ = new torch::Device(torch::DeviceType::CUDA);
device_->set_index(0);

// model_path is the path to the TorchScript model file
torch::jit::script::Module model = torch::jit::load(model_path);
model.to("cuda");
model.eval();
model.to(torch::kHalf);

std::vector<int64_t> input_dim{1, 3, 832, 1440};
auto input = torchtrt::Input(input_dim, torchtrt::DataType::kHalf);

size_t _1_GB = 1 << 30;
torchtrt::ts::CompileSpec compile_settings({input});
compile_settings.enabled_precisions.insert(torchtrt::DataType::kHalf);
compile_settings.workspace_size = _1_GB;
compile_settings.truncate_long_and_double = true;
compile_settings.num_avg_timing_iters = 1;
torchtrt::ts::compile(model, compile_settings);
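
Before compiling, a quick way to rule out a broken model is to run the raw TorchScript module once under plain libtorch; a sketch of that sanity check (the random half-precision CUDA input is my assumption, matching the shape above):

// Sanity-check sketch: run the raw TorchScript module once before
// torchtrt::ts::compile, using a random half-precision CUDA input
// with the same shape as the compile spec above.
std::vector<torch::jit::IValue> sanity_inputs;
sanity_inputs.push_back(torch::randn(
    {1, 3, 832, 1440},
    torch::TensorOptions().dtype(torch::kHalf).device(torch::kCUDA, 0)));
auto sanity_out = model.forward(sanity_inputs);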

I can share the model with you to help debug this error.

This is the stack trace:

#0  0x00007fffe3752699 in torch::jit::InterpreterStateImpl::callstack() const () at /usr/local/libtorch/lib/libtorch_cpu.so
#1  0x00007fffe375537c in torch::jit::InterpreterStateImpl::handleError(std::exception const&, bool, c10::NotImplementedError*, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >) () at /usr/local/libtorch/lib/libtorch_cpu.so
#2  0x00007fffe3763fc4 in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) ()
    at /usr/local/libtorch/lib/libtorch_cpu.so
#3  0x00007fffe374d156 in torch::jit::InterpreterState::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) ()
    at /usr/local/libtorch/lib/libtorch_cpu.so
#4  0x00007fffe373e2c8 in torch::jit::GraphExecutorImplBase::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) ()
    at /usr/local/libtorch/lib/libtorch_cpu.so
#5  0x00007fffe338e1b9 in torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) const () at /usr/local/libtorch/lib/libtorch_cpu.so
#6  0x00007fff49a0b97e in torch::jit::Module::forward(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&)
    (this=0x7fff3d950d30, inputs=std::vector of length 0, capacity 0, kwargs=std::unordered_map with 0 elements)
    at /usr/local/libtorch/include/torch/csrc/jit/api/module.h:116
#7  0x00007fff49a06589 in torch_tensorrt::core::partitioning::getSegmentsOutputByRunning(torch_tensorrt::core::partitioning::SegmentedBlock&, std::unordered_map<torch::jit::Value const*, c10::IValue, std::hash<torch::jit::Value const*>, std::equal_to<torch::jit::Value const*>, std::allocator<std::pair<torch::jit::Value const* const, c10::IValue> > >&, torch_tensorrt::core::partitioning::PartitioningInfo const&, torch_tensorrt::core::ir::ShapeMode const&)
    (seg_block=..., ivalues_maps=std::unordered_map with 340 elements = {...}, partitioning_info=..., shape_mode=@0x7fff3d9512ac: torch_tensorrt::core::ir::ShapeMode::kOPT) at /workspace/Torch-TensorRT/core/partitioning/shape_analysis.cpp:222
#8  0x00007fff49a08627 in torch_tensorrt::core::partitioning::runShapeAnalysis(torch_tensorrt::core::partitioning::PartitioningCtx*, torch::jit::Block*, std::unordered_map<torch::jit::Value const*, c10::IValue, std::hash<torch::jit::Value const*>, std::equal_to<torch::jit::Value const*>, std::allocator<std::pair<torch::jit::Value const* const, c10::IValue> > >&, torch_tensorrt::core::ir::ShapeMode const&)
    (ctx=0x7fff3d951860, block=0x7ffe81ad67b0, example_tensor_map=std::unordered_map with 340 elements = {...}, shape_mode=@0x7fff3d9512ac: torch_tensorrt::core::ir::ShapeMode::kOPT) at /workspace/Torch-TensorRT/core/partitioning/shape_analysis.cpp:354
#9  0x00007fff499ed4b5 in torch_tensorrt::core::partitioning::partition(torch_tensorrt::core::partitioning::PartitioningCtx*, bool)
    (ctx=0x7fff3d951860, expect_full_compilation=false) at /workspace/Torch-TensorRT/core/partitioning/partitioning.cpp:607

Let me add something else. This is the value of seg_block; note that inputs_ is a std::vector of length 0 even though outputs_ has 12 elements, which matches the empty jit_inputs_ivalues above:

(torch_tensorrt::core::partitioning::SegmentedBlock &) @0x7ffe820abff0: {id_ = 182, 
  target_ = torch_tensorrt::core::partitioning::SegmentedBlock::kTensorRT, min_shapes_ = std::vector of length 0, capacity 0, 
  opt_shapes_ = std::vector of length 0, capacity 0, max_shapes_ = std::vector of length 0, capacity 0, 
  in_types_ = std::vector of length 0, capacity 0, inputs_ = std::vector of length 0, capacity 0, outputs_ = std::vector of length 12, capacity 16 = {
    0x7fff1ece1be0, 0x7ffe9226bd50, 0x7ffe807fdda0, 0x7ffe9353ff00, 0x7ffe93ee3ed0, 0x7ffe906eec30, 0x7ffe83e18b00, 0x7ffe93eba2b0, 0x7ffe92c758e0, 
    0x7ffe82c9c050, 0x7ffe819ea960, 0x7ffe93325580}, nodes_ = std::vector of length 44, capacity 44 = {0x7ffe83e7ae80, 0x7ffe83528ef0, 0x7ffe82c827f0, 
    0x7ffe82ce1010, 0x7ffe910037c0, 0x7ffe9071f7f0, 0x7ffe920cd4e0, 0x7ffe93518a00, 0x7ffe71722010, 0x7ffe82c66960, 0x7ffe932441f0, 0x7ffe9056f330, 
    0x7fff1f39f820, 0x7ffe921e7c00, 0x7ffe71964950, 0x7ffe9319aa60, 0x7ffe923da820, 0x7ffe71739210, 0x7ffe81198fc0, 0x7ffe923bc340, 0x7ffe9088eff0, 
    0x7ffe9172bb60, 0x7ffe92cde400, 0x7fff1ebfc690, 0x7fff1e2cd500, 0x7ffe923d89f0, 0x7ffe708e9e90, 0x7ffe82c25930, 0x7ffe90ef4430, 0x7fff1f710d40, 
    0x7fff1ee9e5f0, 0x7ffe93526720, 0x7ffe707150d0, 0x7ffe904c4720, 0x7ffe80e4b9d0, 0x7ffe706574f0, 0x7ffe92c3c0f0, 0x7fff1f014510, 0x7fff1ec493a0, 
    0x7ffe93e64310, 0x7ffe9293c660, 0x7ffe93e01ef0, 0x7ffe90f889f0, 0x7ffe835e4060}, 
  g_ = std::shared_ptr<torch::jit::Graph> (use count 2, weak count 1) = {get() = 0x7ffe91987fb0}, old_to_new_ = std::unordered_map with 67 elements = {
    [0x7ffe819ea960] = 0x7ffe80f0f130, [0x7ffe91aa6730] = 0x7ffe80f0eb40, [0x7ffe93e0f810] = 0x7ffe80f0e8a0, [0x7ffe717a7760] = 0x7ffe80f0e600, 
    [0x7fff1e0a6b50] = 0x7fff1e0b7490, [0x7ffe83e18cf0] = 0x7ffe929a7f40, [0x7ffe9226bd50] = 0x7ffe91989290, [0x7ffe92356dc0] = 0x7ffe833bcc90, 
    [0x7ffe920e29b0] = 0x7ffe929a7a70, [0x7ffe83ef4dc0] = 0x7ffe929a7870, [0x7ffe93347150] = 0x7ffe929a9220, [0x7ffe83e18b00] = 0x7ffe929a70b0, 
    [0x7ffe93eba2b0] = 0x7ffe929a7370, [0x7ffe906eec30] = 0x7ffe929a6e50, [0x7ffe910046c0] = 0x7ffe833bdd50, [0x7ffe82c66aa0] = 0x7ffe9224fe40, 
    [0x7ffe906ef280] = 0x7ffe92251570, [0x7ffe92c758e0] = 0x7ffe833bc4f0, [0x7ffe807fdda0] = 0x7ffe9224f130, [0x7fff1ecfcf80] = 0x7ffe833be110, 
    [0x7ffe906ef620] = 0x7ffe92251050, [0x7ffe93201590] = 0x7ffe922512b0, [0x7ffe923ebd60] = 0x7ffe80f0edf0, [0x7fff1edfc460] = 0x7ffe92250a00, 
    [0x7ffe8233ff10] = 0x7ffe9224fa80, [0x7ffe83529030] = 0x7ffe91988980, [0x7ffe82c9c050] = 0x7fff1e0b6310, [0x7ffe81a4e2f0] = 0x7ffe92250da0, 
    [0x7ffe83ef4e40] = 0x7ffe929a75f0, [0x7ffe82c82930] = 0x7ffe91988bd0, [0x7ffe92322910] = 0x7ffe833be7f0, [0x7ffe82ce1150] = 0x7ffe91988e10, 
    [0x7fff1ece1be0] = 0x7ffe92250080, [0x7ffe91003900] = 0x7ffe91989050, [0x7ffe9207db00] = 0x7ffe92250790, [0x7ffe93ee3ed0] = 0x7ffe9224f7c0, 
    [0x7ffe83ef4ba0] = 0x7ffe92250300, [0x7ffe81a4e580] = 0x7ffe833bd520, [0x7ffe806d50a0] = 0x7ffe919895a0, [0x7ffe9353ff00] = 0x7ffe9224f3d0, 
    [0x7ffe92228280] = 0x7ffe929a9440, [0x7ffe9353ef90] = 0x7ffe929a8040, [0x7ffe923acbe0] = 0x7ffe929a8320, [0x7ffe81a4ea10] = 0x7ffe929a8b90, 
    [0x7ffe93318a70] = 0x7ffe929a8970, [0x7ffe83e7afc0] = 0x7ffe91988660, [0x7fff1e2cd780] = 0x7ffe833bc250, [0x7ffe93325580] = 0x7ffe833bbfd0, 
    [0x7ffe719cf920] = 0x7ffe833bc750, [0x7ffe706f3570] = 0x7fff1e0b8510, [0x7ffe910467b0] = 0x7fff1e0b7a70, [0x7ffe93296510] = 0x7ffe833bd040, 
    [0x7ffe9102d3d0] = 0x7ffe929a7970, [0x7ffe835771b0] = 0x7ffe833bca10, [0x7fff1eb11620] = 0x7fff1e0b65b0, [0x7ffe906ee800] = 0x7fff1e0b67d0, 
    [0x7ffe7084f8a0] = 0x7ffe929a7cd0, [0x7ffe80613040] = 0x7ffe833bd300, [0x7ffe806d7b50] = 0x7ffe929a8fa0, [0x7ffe932a5e00] = 0x7ffe833bd8a0, 
    [0x7ffe91aa7310] = 0x7ffe833be660, [0x7ffe91f8b4a0] = 0x7ffe833bdb30, [0x7fff1dc708a0] = 0x7ffe833be390, [0x7ffe92319a20] = 0x7fff1e0b5f90, 
    [0x7ffe932d9910] = 0x7fff1e0b6be0, [0x7ffe906eedc0] = 0x7fff1e0b7080, [0x7fff1e3c1110] = 0x7fff1e0b6e60}, do_not_merge_ = false}

Environment

Build information about Torch-TensorRT can be found by turning on debug messages.

  • Torch-TensorRT Version (e.g. 1.0.0): latest source code, compiled
  • PyTorch Version (e.g. 1.0): 2.2.1
  • CPU Architecture: x86
  • OS (e.g., Linux): Ubuntu 22.04
  • How you installed PyTorch (conda, pip, libtorch, source):
  • Build command you used (if compiling from source):
  • Are you using local sources or building from archives:
  • Python version:
  • CUDA version: 12.2
  • GPU models and configuration:
  • Any other relevant information:
demuxin added the bug (Something isn't working) label on May 16, 2024

demuxin commented May 23, 2024

Hi @bowang007, any progress on this issue?


demuxin commented May 28, 2024

@narendasan, is this issue being resolved?
