
🐛 [Bug] Segmentation fault when calling torchtrt::ts::compile in Torch-TensorRT #2842

Open
demuxin opened this issue May 16, 2024 · 2 comments
Labels: bug (Something isn't working)


demuxin commented May 16, 2024

Bug Description

When I use the code below to compile a TorchScript model, a segmentation fault occurs.

I compiled Torch-TensorRT from source in debug mode and ran the program under GDB.

I found that the error appears on line 100:

torch::jit::IValue jit_results_ivalues = cur_mod.forward(jit_inputs_ivalues);

Continuing to debug, I found that seg_block.raw_inputs() on line 182 is a std::vector of length 0, which causes jit_inputs_ivalues on line 222 to also be a std::vector of length 0.

std::vector<torch::jit::IValue> jit_inputs_ivalues;
// set input ivalues; currently supports Tensor/Int to pass arguments between different segments
for (auto& input : seg_block.raw_inputs()) {
  TORCHTRT_CHECK(
      ivalues_maps.count(input),
      "Could not find torch::jit::Value* " << input->debugName() << " produced from "
                                           << util::node_info(input->node())
                                           << " in lowering graph for mini graph input.\n");
  if (input->node()->kind() == torch::jit::prim::Param) {
    jit_inputs_ivalues.push_back(ivalues_maps[input]);
  } else if (input->type()->isSubtypeOf(torch::jit::TensorType::get())) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toTensor());
  } else if (input->type()->isSubtypeOf(torch::jit::IntType::get())) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toInt());
  } else if (input->type()->isSubtypeOf(torch::jit::BoolType::get())) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toBool());
  } else if (input->type()->isSubtypeOf(torch::jit::FloatType::get())) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toDouble());
  } else if (input->type()->isSubtypeOf(torch::jit::StringType::get())) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toString());
  } else if (input->type()->kind() == torch::jit::TypeKind::ListType) {
    // create list
    jit_inputs_ivalues.push_back(ivalues_maps[input].toList());
  } else if (input->type()->kind() == torch::jit::TypeKind::TupleType) {
    // create tuple
    jit_inputs_ivalues.push_back(ivalues_maps[input].toTuple());
  } else if (input->type()->kind() == torch::jit::TypeKind::NumberType) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toScalar());
  } else if (input->type()->kind() == torch::jit::TypeKind::DictType) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toGenericDict());
  } else if (input->type()->kind() == torch::jit::TypeKind::DeviceObjType) {
    jit_inputs_ivalues.push_back(ivalues_maps[input].toDevice());
  } else {
    TORCHTRT_THROW_ERROR(
        "Expected to find type " << input->type()->str() << " for value " << input->debugName()
                                 << " but got nothing.");
  }
}
// run segments to get outputs for later segments' input shapes, and other arguments such as Int
std::vector<torch::jit::IValue> jit_results;
torch::jit::IValue jit_results_ivalues = cur_mod.forward(jit_inputs_ivalues);
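
Given that, a minimal defensive guard before building jit_inputs_ivalues would at least turn the crash into a readable error (just a sketch on my side, not a fix for the root cause of the empty raw_inputs()):

// Hypothetical guard, not in the Torch-TensorRT source: fail loudly
// instead of calling forward() with an empty argument list when the
// segment reports no inputs.
if (seg_block.raw_inputs().empty()) {
  TORCHTRT_THROW_ERROR(
      "Segmented block has no inputs; cannot run shape analysis by executing it.");
}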

Here is a simplified version of my code:

// Assumed includes / namespace alias for this snippet:
#include <torch/script.h>
#include "torch_tensorrt/torch_tensorrt.h"
namespace torchtrt = torch_tensorrt;

torch::Device* device_ = new torch::Device(torch::DeviceType::CUDA);
device_->set_index(0);

// model_path is the path to the TorchScript model file
torch::jit::script::Module model = torch::jit::load(model_path);
model.to("cuda");
model.eval();
model.to(torch::kHalf);

std::vector<int64_t> input_dim{1, 3, 832, 1440};
auto input = torchtrt::Input(input_dim, torchtrt::DataType::kHalf);

size_t _1_GB = 1 << 30;
torchtrt::ts::CompileSpec compile_settings({input});
compile_settings.enabled_precisions.insert(torchtrt::DataType::kHalf);
compile_settings.workspace_size = _1_GB;
compile_settings.truncate_long_and_double = true;
compile_settings.num_avg_timing_iters = 1;
torchtrt::ts::compile(model, compile_settings);
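
Before compiling, a quick way to rule out a broken model is to run the raw TorchScript module once under plain libtorch; a sketch of that sanity check (the random half-precision CUDA input is my assumption, matching the shape above):

// Sanity-check sketch: run the raw TorchScript module once before
// torchtrt::ts::compile, using a random half-precision CUDA input
// with the same shape as the compile spec above.
std::vector<torch::jit::IValue> sanity_inputs;
sanity_inputs.push_back(torch::randn(
    {1, 3, 832, 1440},
    torch::TensorOptions().dtype(torch::kHalf).device(torch::kCUDA, 0)));
auto sanity_out = model.forward(sanity_inputs);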

I can share the model with you to help debug this error.

This is the stack trace:

#0  0x00007fffe3752699 in torch::jit::InterpreterStateImpl::callstack() const () at /usr/local/libtorch/lib/libtorch_cpu.so
#1  0x00007fffe375537c in torch::jit::InterpreterStateImpl::handleError(std::exception const&, bool, c10::NotImplementedError*, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >) () at /usr/local/libtorch/lib/libtorch_cpu.so
#2  0x00007fffe3763fc4 in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) ()
    at /usr/local/libtorch/lib/libtorch_cpu.so
#3  0x00007fffe374d156 in torch::jit::InterpreterState::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) ()
    at /usr/local/libtorch/lib/libtorch_cpu.so
#4  0x00007fffe373e2c8 in torch::jit::GraphExecutorImplBase::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) ()
    at /usr/local/libtorch/lib/libtorch_cpu.so
#5  0x00007fffe338e1b9 in torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) const () at /usr/local/libtorch/lib/libtorch_cpu.so
#6  0x00007fff49a0b97e in torch::jit::Module::forward(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&)
    (this=0x7fff3d950d30, inputs=std::vector of length 0, capacity 0, kwargs=std::unordered_map with 0 elements)
    at /usr/local/libtorch/include/torch/csrc/jit/api/module.h:116
#7  0x00007fff49a06589 in torch_tensorrt::core::partitioning::getSegmentsOutputByRunning(torch_tensorrt::core::partitioning::SegmentedBlock&, std::unordered_map<torch::jit::Value const*, c10::IValue, std::hash<torch::jit::Value const*>, std::equal_to<torch::jit::Value const*>, std::allocator<std::pair<torch::jit::Value const* const, c10::IValue> > >&, torch_tensorrt::core::partitioning::PartitioningInfo const&, torch_tensorrt::core::ir::ShapeMode const&)
    (seg_block=..., ivalues_maps=std::unordered_map with 340 elements = {...}, partitioning_info=..., shape_mode=@0x7fff3d9512ac: torch_tensorrt::core::ir::ShapeMode::kOPT) at /workspace/Torch-TensorRT/core/partitioning/shape_analysis.cpp:222
#8  0x00007fff49a08627 in torch_tensorrt::core::partitioning::runShapeAnalysis(torch_tensorrt::core::partitioning::PartitioningCtx*, torch::jit::Block*, std::unordered_map<torch::jit::Value const*, c10::IValue, std::hash<torch::jit::Value const*>, std::equal_to<torch::jit::Value const*>, std::allocator<std::pair<torch::jit::Value const* const, c10::IValue> > >&, torch_tensorrt::core::ir::ShapeMode const&)
    (ctx=0x7fff3d951860, block=0x7ffe81ad67b0, example_tensor_map=std::unordered_map with 340 elements = {...}, shape_mode=@0x7fff3d9512ac: torch_tensorrt::core::ir::ShapeMode::kOPT) at /workspace/Torch-TensorRT/core/partitioning/shape_analysis.cpp:354
#9  0x00007fff499ed4b5 in torch_tensorrt::core::partitioning::partition(torch_tensorrt::core::partitioning::PartitioningCtx*, bool)
    (ctx=0x7fff3d951860, expect_full_compilation=false) at /workspace/Torch-TensorRT/core/partitioning/partitioning.cpp:607

Let me add something else. This is the value of seg_block; note that inputs_ is a std::vector of length 0 even though outputs_ has 12 elements, which matches the empty jit_inputs_ivalues above:

(torch_tensorrt::core::partitioning::SegmentedBlock &) @0x7ffe820abff0: {id_ = 182, 
  target_ = torch_tensorrt::core::partitioning::SegmentedBlock::kTensorRT, min_shapes_ = std::vector of length 0, capacity 0, 
  opt_shapes_ = std::vector of length 0, capacity 0, max_shapes_ = std::vector of length 0, capacity 0, 
  in_types_ = std::vector of length 0, capacity 0, inputs_ = std::vector of length 0, capacity 0, outputs_ = std::vector of length 12, capacity 16 = {
    0x7fff1ece1be0, 0x7ffe9226bd50, 0x7ffe807fdda0, 0x7ffe9353ff00, 0x7ffe93ee3ed0, 0x7ffe906eec30, 0x7ffe83e18b00, 0x7ffe93eba2b0, 0x7ffe92c758e0, 
    0x7ffe82c9c050, 0x7ffe819ea960, 0x7ffe93325580}, nodes_ = std::vector of length 44, capacity 44 = {0x7ffe83e7ae80, 0x7ffe83528ef0, 0x7ffe82c827f0, 
    0x7ffe82ce1010, 0x7ffe910037c0, 0x7ffe9071f7f0, 0x7ffe920cd4e0, 0x7ffe93518a00, 0x7ffe71722010, 0x7ffe82c66960, 0x7ffe932441f0, 0x7ffe9056f330, 
    0x7fff1f39f820, 0x7ffe921e7c00, 0x7ffe71964950, 0x7ffe9319aa60, 0x7ffe923da820, 0x7ffe71739210, 0x7ffe81198fc0, 0x7ffe923bc340, 0x7ffe9088eff0, 
    0x7ffe9172bb60, 0x7ffe92cde400, 0x7fff1ebfc690, 0x7fff1e2cd500, 0x7ffe923d89f0, 0x7ffe708e9e90, 0x7ffe82c25930, 0x7ffe90ef4430, 0x7fff1f710d40, 
    0x7fff1ee9e5f0, 0x7ffe93526720, 0x7ffe707150d0, 0x7ffe904c4720, 0x7ffe80e4b9d0, 0x7ffe706574f0, 0x7ffe92c3c0f0, 0x7fff1f014510, 0x7fff1ec493a0, 
    0x7ffe93e64310, 0x7ffe9293c660, 0x7ffe93e01ef0, 0x7ffe90f889f0, 0x7ffe835e4060}, 
  g_ = std::shared_ptr<torch::jit::Graph> (use count 2, weak count 1) = {get() = 0x7ffe91987fb0}, old_to_new_ = std::unordered_map with 67 elements = {
    [0x7ffe819ea960] = 0x7ffe80f0f130, [0x7ffe91aa6730] = 0x7ffe80f0eb40, [0x7ffe93e0f810] = 0x7ffe80f0e8a0, [0x7ffe717a7760] = 0x7ffe80f0e600, 
    [0x7fff1e0a6b50] = 0x7fff1e0b7490, [0x7ffe83e18cf0] = 0x7ffe929a7f40, [0x7ffe9226bd50] = 0x7ffe91989290, [0x7ffe92356dc0] = 0x7ffe833bcc90, 
    [0x7ffe920e29b0] = 0x7ffe929a7a70, [0x7ffe83ef4dc0] = 0x7ffe929a7870, [0x7ffe93347150] = 0x7ffe929a9220, [0x7ffe83e18b00] = 0x7ffe929a70b0, 
    [0x7ffe93eba2b0] = 0x7ffe929a7370, [0x7ffe906eec30] = 0x7ffe929a6e50, [0x7ffe910046c0] = 0x7ffe833bdd50, [0x7ffe82c66aa0] = 0x7ffe9224fe40, 
    [0x7ffe906ef280] = 0x7ffe92251570, [0x7ffe92c758e0] = 0x7ffe833bc4f0, [0x7ffe807fdda0] = 0x7ffe9224f130, [0x7fff1ecfcf80] = 0x7ffe833be110, 
    [0x7ffe906ef620] = 0x7ffe92251050, [0x7ffe93201590] = 0x7ffe922512b0, [0x7ffe923ebd60] = 0x7ffe80f0edf0, [0x7fff1edfc460] = 0x7ffe92250a00, 
    [0x7ffe8233ff10] = 0x7ffe9224fa80, [0x7ffe83529030] = 0x7ffe91988980, [0x7ffe82c9c050] = 0x7fff1e0b6310, [0x7ffe81a4e2f0] = 0x7ffe92250da0, 
    [0x7ffe83ef4e40] = 0x7ffe929a75f0, [0x7ffe82c82930] = 0x7ffe91988bd0, [0x7ffe92322910] = 0x7ffe833be7f0, [0x7ffe82ce1150] = 0x7ffe91988e10, 
    [0x7fff1ece1be0] = 0x7ffe92250080, [0x7ffe91003900] = 0x7ffe91989050, [0x7ffe9207db00] = 0x7ffe92250790, [0x7ffe93ee3ed0] = 0x7ffe9224f7c0, 
    [0x7ffe83ef4ba0] = 0x7ffe92250300, [0x7ffe81a4e580] = 0x7ffe833bd520, [0x7ffe806d50a0] = 0x7ffe919895a0, [0x7ffe9353ff00] = 0x7ffe9224f3d0, 
    [0x7ffe92228280] = 0x7ffe929a9440, [0x7ffe9353ef90] = 0x7ffe929a8040, [0x7ffe923acbe0] = 0x7ffe929a8320, [0x7ffe81a4ea10] = 0x7ffe929a8b90, 
    [0x7ffe93318a70] = 0x7ffe929a8970, [0x7ffe83e7afc0] = 0x7ffe91988660, [0x7fff1e2cd780] = 0x7ffe833bc250, [0x7ffe93325580] = 0x7ffe833bbfd0, 
    [0x7ffe719cf920] = 0x7ffe833bc750, [0x7ffe706f3570] = 0x7fff1e0b8510, [0x7ffe910467b0] = 0x7fff1e0b7a70, [0x7ffe93296510] = 0x7ffe833bd040, 
    [0x7ffe9102d3d0] = 0x7ffe929a7970, [0x7ffe835771b0] = 0x7ffe833bca10, [0x7fff1eb11620] = 0x7fff1e0b65b0, [0x7ffe906ee800] = 0x7fff1e0b67d0, 
    [0x7ffe7084f8a0] = 0x7ffe929a7cd0, [0x7ffe80613040] = 0x7ffe833bd300, [0x7ffe806d7b50] = 0x7ffe929a8fa0, [0x7ffe932a5e00] = 0x7ffe833bd8a0, 
    [0x7ffe91aa7310] = 0x7ffe833be660, [0x7ffe91f8b4a0] = 0x7ffe833bdb30, [0x7fff1dc708a0] = 0x7ffe833be390, [0x7ffe92319a20] = 0x7fff1e0b5f90, 
    [0x7ffe932d9910] = 0x7fff1e0b6be0, [0x7ffe906eedc0] = 0x7fff1e0b7080, [0x7fff1e3c1110] = 0x7fff1e0b6e60}, do_not_merge_ = false}

Environment

Build information about Torch-TensorRT can be found by turning on debug messages.

  • Torch-TensorRT Version (e.g. 1.0.0): latest source code, compiled
  • PyTorch Version (e.g. 1.0): 2.2.1
  • CPU Architecture: x86
  • OS (e.g., Linux): Ubuntu 22.04
  • How you installed PyTorch (conda, pip, libtorch, source):
  • Build command you used (if compiling from source):
  • Are you using local sources or building from archives:
  • Python version:
  • CUDA version: 12.2
  • GPU models and configuration:
  • Any other relevant information:
demuxin added the bug (Something isn't working) label on May 16, 2024

demuxin commented May 23, 2024

Hi @bowang007, any progress on this issue?


demuxin commented May 28, 2024

@narendasan, is this issue being resolved?
