Version

v1.2.660-nightly-55cef11019 (rust-1.81.0-nightly-2024-11-17T22:08:20.457627205Z)

What's Wrong?

Running `create table as select` or `merge into` queries with large result sets randomly causes segfaults in `k_way_merge_sort_partition`. This crashes Databend and leaves it unresponsive, and when I restart it, all data is erased. I can seemingly work around it with:

```sql
set global enable_loser_tree_merge_sort = 0;
set global enable_parallel_multi_merge_sort = 0;
```

I'm not sure which of the two settings resolves it, because the pipeline that triggers the segfault takes ~30 minutes to set up the database before the offending query even runs, and as mentioned above, each segfault erases all data.
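If it helps with triage, one way to bisect which toggle is responsible would be to inspect both settings and re-enable them one at a time between runs. This is only a sketch: I'm assuming Databend's standard `SHOW SETTINGS` syntax here, and I haven't verified which setting is the actual culprit.

```sql
-- List the two sort-related settings and their current values.
show settings like '%merge_sort%';

-- Re-enable one toggle at a time between runs to bisect which one
-- triggers the segfault (each run needs the full ~30 minute setup
-- described above, so this is slow in practice).
set global enable_loser_tree_merge_sort = 1;
set global enable_parallel_multi_merge_sort = 0;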
############################### Crash fault info ###############################
PID: 35
Version: v1.2.660-nightly-55cef11019(rust-1.81.0-nightly-2024-11-17T22:08:20.457627205Z)
Timestamp(UTC): 2024-11-19 19:59:07.596544683 UTC
Timestamp(Local): 2024-11-19 19:59:07.596563975 +00:00
QueryId: "76bbd6bb-c6df-481f-a168-09caad581d70"
Signal 11 (SIGSEGV), si_code 1 (Unknown), Address 0x33665f7c228877
Backtrace:
0: backtrace::backtrace::libunwind::trace[inlined]
at /opt/rust/cargo/git/checkouts/backtrace-rs-fb1f822361417489-shallow/72265be/src/backtrace/libunwind.rs:116:5
1: backtrace::backtrace::trace_unsynchronized[inlined]
at /opt/rust/cargo/git/checkouts/backtrace-rs-fb1f822361417489-shallow/72265be/src/backtrace/mod.rs:66:5
2: databend_common_tracing::crash_hook::CrashHandler::recv_signal[inlined]
at /workspace/src/common/tracing/src/crash_hook.rs:101:13
3: databend_common_tracing::crash_hook::signal_handler@7c4a824
at /workspace/src/common/tracing/src/crash_hook.rs:272:9
4: <unknown>
5: <unknown>@92b8c
6: <u8 as core::slice::cmp::SliceOrd>::compare[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/slice/cmp.rs:199:34
7: <A as core::slice::cmp::SlicePartialOrd>::partial_compare[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/slice/cmp.rs:138:14
8: core::slice::cmp::<impl core::cmp::PartialOrd for [T]>::partial_cmp[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/slice/cmp.rs:39:9
9: core::cmp::PartialOrd::ge[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/cmp.rs:1233:18
10: core::cmp::impls::<impl core::cmp::PartialOrd<&B> for &A>::ge[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/cmp.rs:1691:13
11: databend_common_pipeline_transforms::processors::transforms::sort::list_domain::Candidate<T>::find_target::{{closure}}[inlined]
at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/list_domain.rs:300:20
12: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut@6829a34
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/ops/function.rs:294:13
13: core::iter::traits::iterator::Iterator::find_map::check::{{closure}}[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/iter/traits/iterator.rs:2907:32
14: <alloc::vec::into_iter::IntoIter<T,A> as core::iter::traits::iterator::Iterator>::try_fold[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/alloc/src/vec/into_iter.rs:340:25
15: core::iter::traits::iterator::Iterator::find_map[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/iter/traits/iterator.rs:2913:9
16: <core::iter::adapters::filter_map::FilterMap<I,F> as core::iter::traits::iterator::Iterator>::next[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/iter/adapters/filter_map.rs:64:9
17: databend_common_pipeline_transforms::processors::transforms::sort::list_domain::Candidate<T>::find_target@6839bf8
at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/list_domain.rs:298:22
18: databend_common_pipeline_transforms::processors::transforms::sort::list_domain::Candidate<T>::calc_partition@6832604
at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/list_domain.rs:213:43
19: databend_common_pipeline_transforms::processors::transforms::sort::k_way_merge_sort_partition::KWaySortPartitioner<R,S>::calc_partition_point@68ff028
at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/k_way_merge_sort_partition.rs:163:9
20: databend_common_pipeline_transforms::processors::transforms::sort::k_way_merge_sort_partition::KWaySortPartitioner<R,S>::build_task@68ffa3c
at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/k_way_merge_sort_partition.rs:167:25
21: databend_common_pipeline_transforms::processors::transforms::sort::k_way_merge_sort_partition::KWaySortPartitioner<R,S>::next_task@68fe9dc
at /workspace/src/query/pipeline/transforms/src/processors/transforms/sort/k_way_merge_sort_partition.rs:149:12
22: <databend_common_pipeline_transforms::processors::transforms::transform_k_way_merge_sort::KWayMergePartitionerProcessor<R> as databend_common_pipeline_core::processors::processor::Processor>::process@67f9b08
at /workspace/src/query/pipeline/transforms/src/processors/transforms/transform_k_way_merge_sort.rs:345:20
23: databend_common_pipeline_core::processors::processor::ProcessorPtr::process@67534b8
at /workspace/src/query/pipeline/core/src/processors/processor.rs:169:9
24: databend_query::pipelines::executor::executor_worker_context::ExecutorWorkerContext::execute_sync_task[inlined]
at /workspace/src/query/service/src/pipelines/executor/executor_worker_context.rs:169:9
25: databend_query::pipelines::executor::executor_worker_context::ExecutorWorkerContext::execute_task@8a9a4b4
at /workspace/src/query/service/src/pipelines/executor/executor_worker_context.rs:132:52
26: databend_query::pipelines::executor::query_pipeline_executor::QueryPipelineExecutor::execute_single_thread@8a960d4
at /workspace/src/query/service/src/pipelines/executor/query_pipeline_executor.rs:406:35
27: databend_query::pipelines::executor::query_pipeline_executor::QueryPipelineExecutor::execute_threads::{{closure}}::{{closure}}[inlined]
at /workspace/src/query/service/src/pipelines/executor/query_pipeline_executor.rs:378:50
28: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/panic/unwind_safe.rs:272:9
29: std::panicking::try::do_call[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panicking.rs:553:40
30: std::panicking::try[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panicking.rs:517:19
31: std::panic::catch_unwind@8bf2b14
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panic.rs:350:14
32: databend_common_base::runtime::catch_unwind::catch_unwind@8714a6c
at /workspace/src/common/base/src/runtime/catch_unwind.rs:47:11
33: databend_query::pipelines::executor::query_pipeline_executor::QueryPipelineExecutor::execute_threads::{{closure}}[inlined]
at /workspace/src/query/service/src/pipelines/executor/query_pipeline_executor.rs:378:34
34: databend_common_base::runtime::runtime_tracker::ThreadTracker::tracking_function::{{closure}}::{{closure}}[inlined]
at /workspace/src/common/base/src/runtime/runtime_tracker.rs:208:17
35: databend_common_base::runtime::thread::Thread::named_spawn::{{closure}}[inlined]
at /workspace/src/common/base/src/runtime/thread.rs:78:21
36: std::sys::backtrace::__rust_begin_short_backtrace@80d2be0
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/sys/backtrace.rs:155:18
37: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/thread/mod.rs:542:17
38: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/panic/unwind_safe.rs:272:9
39: std::panicking::try::do_call[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panicking.rs:553:40
40: std::panicking::try[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panicking.rs:517:19
41: std::panic::catch_unwind[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/panic.rs:350:14
42: std::thread::Builder::spawn_unchecked_::{{closure}}[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/thread/mod.rs:541:30
43: core::ops::function::FnOnce::call_once{{vtable.shim}}@80d5bc8
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/core/src/ops/function.rs:250:5
44: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/alloc/src/boxed.rs:2064:9
45: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once[inlined]
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/alloc/src/boxed.rs:2064:9
46: std::sys::pal::unix::thread::Thread::new::thread_start@a358cc4
at /rustc/cf2df68d1f5e56803c97d91e2b1a9f1c9923c533/library/std/src/sys/pal/unix/thread.rs:108:17
47: <unknown>@7ee90
48: <unknown>@e7b1c
49: <unknown>
How to Reproduce?
A general scenario that seems to cause this: enable disk spilling, populate a wide table with ~50m rows, then run `create table x as select * from large_table` with enough joins to force spilling. I triggered the segfault locally in a non-clustered configuration on an M1 Max MacBook Pro with 64 GB of RAM. I also managed to crash a 3-node Databend cluster running the same query against the same dataset, so I suspect both crashes stem from the same issue.
I'm sorry I don't have more specific reproduction steps. It was very difficult to reproduce: most of the time Databend would simply crash with no stack trace, and I only managed to catch a backtrace twice. Happy to answer any additional questions or send more specific repro queries privately if that's helpful.
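For what it's worth, the failing pipeline has roughly the shape sketched below. The table names, column definitions, and the commented-out spill setting are illustrative stand-ins, not my exact queries; in particular, the spill setting name is an assumption, so check `SHOW SETTINGS` for the spill knobs in your version.

```sql
-- Illustrative stand-in for the real pipeline (exact queries omitted).

-- 1. Enable disk spilling (setting name is an assumption, see above).
-- set global join_spilling_memory_ratio = 60;

-- 2. Populate a wide table with ~50m rows.
create table large_table as
select
    number as id,
    number % 1000 as k,
    to_string(number) as payload
from numbers(50000000);

-- 3. CTAS with enough joins to force spilling to disk.
create table x as
select a.id, a.payload, b.payload as payload_b
from large_table a
join large_table b on a.id = b.id;
```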
Are you willing to submit PR?
Yes I am willing to submit a PR!