High lowres clock drift > 1s in debug mode #2238
Maybe something like:

```diff
diff --git a/include/seastar/core/preempt.hh b/include/seastar/core/preempt.hh
index 537be5b5..21882aa9 100644
--- a/include/seastar/core/preempt.hh
+++ b/include/seastar/core/preempt.hh
@@ -49,11 +49,9 @@ inline const preemption_monitor*& get_need_preempt_var() {
 void set_need_preempt_var(const preemption_monitor* pm);
-}
-
-SEASTAR_MODULE_EXPORT
-inline bool need_preempt() noexcept {
-#ifndef SEASTAR_DEBUG
+inline
+bool
+monitor_need_preempt() {
     // prevent compiler from eliminating loads in a loop
     std::atomic_signal_fence(std::memory_order_seq_cst);
     auto np = internal::get_need_preempt_var();
@@ -64,6 +62,14 @@ inline bool need_preempt() noexcept {
     // Possible optimization: read head and tail in a single 64-bit load,
     // and find a funky way to compare the two 32-bit halves.
     return __builtin_expect(head != tail, false);
+}
+
+}
+
+SEASTAR_MODULE_EXPORT
+inline bool need_preempt() noexcept {
+#ifndef SEASTAR_DEBUG
+    return internal::monitor_need_preempt();
 #else
     return true;
 #endif
@@ -71,8 +77,6 @@ inline bool need_preempt() noexcept {
 namespace internal {
-
-
 // Same as need_preempt(), but for the scheduler's use. Outside debug
 // mode they have the same meaning - the task quota expired and we need
 // to check for I/O.
@@ -88,7 +92,7 @@ scheduler_need_preempt() {
     // mode, run some random-but-bounded number of tasks instead. Latency
     // will be high if those tasks are slow, but this is debug mode anyway.
     static thread_local unsigned counter = 0;
-    return ++counter % 64 == 0;
+    return ++counter % 64 == 0 || monitor_need_preempt();
 #endif
 }
```
Adding a segment takes ~1ms. Adding multiple of them in a single batch can stall the reactor for a while. This, together with a recent seastar change[^1] caused some timeouts to fire very frequently[^2]. Fix the test by breaking the batch passed to add_segments into smaller segments so that we execute finer-grained tasks and yield more often to the scheduler. [^1]: scylladb/seastar#2238 [^2]: redpanda-data#13275
Reasonable. Does debug mode default to 10ms task quotas? Or am I misremembering? Also, your code should rely on need_preempt() rather than doing its own batching.
Adding a segment takes ~1ms. Adding multiple segments in a single batch can stall the reactor for a while. This, together with a recent seastar change[^1] caused some timeouts to fire very frequently[^2]. Fix the problematic test by splitting the batch passed to add_segments so that we execute finer-grained tasks and yield more often to the scheduler. [^1]: scylladb/seastar#2238 [^2]: redpanda-data#13275
On a quick glance I don't see any code in seastar that would change the task quota in debug mode. I see only these 2 references:
Shall I submit a PR with the patch from the comment above, or are there better suggestions?
Ah, we have this in ScyllaDB:
I don't have a better suggestion.
In redpanda we have a test which began failing very often in debug mode since the latest seastar upgrade (the CPU is busy with other processes too).
The test gets the current time using the lowres clock and sets a deadline of +1 second for establishing a TCP connection. Because of the high lowres clock drift, we hit the 1s deadline even though the tasks themselves only take milliseconds to run.
I believe this change in behavior is caused by this reactor change #2007.
We preempt only every 64 calls to scheduler_need_preempt (every 64 tasks?) under the argument that "latency is not important in debug mode", but that seems to be too much in some cases that use the lowres clock, since it is only updated when we break out of the reactor loop.
While we are admittedly guilty of having some ~100ms reactor stalls in this testing code, I don't believe that should be an excuse to let the lowres clock drift by 1s+. If we were in a seastar loop with each iteration taking 25ms (the stall warning threshold), 64 calls already add up to 1.6s.
I see a few possible options:
Appreciate any thoughts or ideas on the above 🙏