
RPC delay metrics #2293

Merged
merged 5 commits into scylladb:master on Jul 3, 2024

Conversation

@xemul (Contributor) commented Jun 12, 2024

The new metric is the total time it takes an RPC message to travel back and forth between the verb caller and the server. It implicitly consists of the time the request and response messages spend in seastar and kernel queues on both sides, plus the classical network RTT.

It is calculated only for verbs that require a response. First, the client negotiates with the server that the response frame will include the "handler duration" value, which is the handler callback's execution time. Then the client measures the duration between sending the request and receiving the response, and subtracts the handler execution time received from the server. This is called a "delay sample". The total number of samples and their summed duration are exported as counters in the metrics domain.

refs: #323
refs: #1753 (for what metrics domain is)
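
A minimal sketch of the delay-sample computation described above (self-contained and illustrative; the actual accumulation lives in src/rpc/rpc.cc, shown in a diff further down):

    #include <chrono>

    // Self-contained sketch; seastar's rpc_clock_type is aliased here to
    // steady_clock purely for illustration.
    using rpc_clock_type = std::chrono::steady_clock;

    // One "delay sample": the client-side round trip minus the handler
    // duration reported back by the server.
    std::chrono::microseconds delay_sample(rpc_clock_type::time_point sent,
                                           rpc_clock_type::time_point received,
                                           std::chrono::microseconds handler_duration) {
        auto round_trip = std::chrono::duration_cast<std::chrono::microseconds>(received - sent);
        return round_trip - handler_duration;
    }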

@xemul changed the title from "RPC rount-trip time metrics" to "RPC round-trip time metrics" on Jun 21, 2024
@xemul (Contributor, Author) commented Jun 24, 2024

@scylladb/seastar-maint , please review

@nyh (Contributor) commented Jun 24, 2024

Can you write a bit about the motivation for this specific metric? Is it something you have seen in some other API and found useful (if so, a reference would be nice)? What do you want to use this new feature for?

What bothers me is that the subtraction you're doing - latency measured by the client minus latency measured by the server - isn't exactly the RTT of anything: it isn't the RTT of the full request (because that would include the handling time, not subtract it), and it's not the "classic" network RTT because it includes queueing time on both client and server.

@xemul (Contributor, Author) commented Jun 24, 2024

When looking for sources of network latency in scylladb/scylladb#9505 it was good to know how much time it takes for an RPC frame to pass through seastar(send)-kernel(send)-network-kernel(recv)-seastar(recv)-and-back (for the record, one of the problematic places found was in seastar(send) -- #1067), so the idea came up to collect it as a metric.

What bothers me is that the subtraction you're doing - latency measured by client minus latency measured by server - isn't exactly the RTT of anything

Why not?

  1. (cln) send_entry()
  2. (cln) request is queued into client outgoing queue
  3. (cln) request is written into the socket
  4. network processing, whatever it means
  5. (srv) request is read from socket
  6. (srv) call handler
  7. (srv) response is queued into the server's outgoing queue
  8. (srv) response is written into the socket
  9. network processing again
  10. (cln) response is read from socket
  11. (cln) find and wake so called "outstanding" entry
  12. (cln) response is parsed and returned back

If I measure timestamps at points 1 and 12, their difference is the total time the request took to process. If I measure the time step 6 took, and subtract it from that difference, I get exactly the time it took the request to travel through seastars and kernels on both sides and the network in between.
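
To make the subtraction concrete (numbers are made up for illustration): if point 1 happens at t = 0 us, point 12 at t = 900 us, and the server reports 300 us of handler execution for step 6, then the delay sample is (900 - 0) - 300 = 600 us, i.e. the time spent in seastar and kernel queues on both sides plus the two network traversals.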

and it's not the "classic" network RTT because it includes queueing time on both client and server.

But it does include the queuing time on both -- client and server -- unless I mis-read the code and placed the timestamps somewhere other than points 1 and 12 as described above.

@@ -487,7 +490,7 @@ private:
 private:
     future<> negotiate_protocol(feature_map map);
     void negotiate(feature_map server_features);
-    future<std::tuple<int64_t, std::optional<rcv_buf>>>
+    future<std::tuple<int64_t, std::optional<uint32_t>, std::optional<rcv_buf>>>
Member:

Better to document what the optional means.

Note: microseconds can overflow 32 bits for an rpc call.

Contributor Author:

It's ~1 hour. Are there handlers that last longer?
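
For reference, 2^32 microseconds is about 71.6 minutes. A hedged sketch of one way to guard against the wraparound the reviewer mentions (the fix that actually landed in a later revision of this PR is not shown in this thread; this version simply saturates):

    #include <algorithm>
    #include <chrono>
    #include <cstdint>

    // Saturating encoder sketch: a handler running longer than ~71.6 minutes
    // would wrap a plain 32-bit cast, so clamp instead. The all-ones value
    // stays reserved for "no duration reported" (see the frame format in the
    // commit messages below). Illustration only, not the actual fix.
    uint32_t encode_handler_duration(std::chrono::microseconds d) {
        constexpr uint64_t no_duration = 0xFFFFFFFFull;
        uint64_t us = d.count() < 0 ? 0 : static_cast<uint64_t>(d.count());
        return static_cast<uint32_t>(std::min(us, no_duration - 1));
    }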

@@ -430,6 +430,7 @@ class client : public rpc::connection, public weakly_referencable<client> {
 struct reply_handler_base {
     timer<rpc_clock_type> t;
Member:

Now the client has two durations with which it can estimate the
round-trip time of a packet -- the duration between sending the request
and receiving the response, and the time it took the handler to execute
on the server side. The difference is the RTT value.

What about time waiting in the send queue (usually none) or the receive queue? Or waiting for the rpc semaphore?

Contributor Author:

This new metric includes all of these.
Or does your question really mean "let's add those metrics too"?

@avikivity (Member) commented Jun 24, 2024

I mean that the difference is not RTT, it's RTT + waits in queues we don't (and can't) measure.

Contributor Author:

That's the definition of RTT I'm used to:

Round-trip time (RTT) is the duration it takes for a network request to go from a starting point to a destination and back again to the starting point

and "starting point" here is whoever calls send_helper::operator(). It doesn't tell if there are or there are no queues/delays/throttlers/whatever on the way of the request.

I'm fine with renaming it to be clearer, but how do you define RTT then?

Member:

RTT is the network time without any queues (which ping measures if it doesn't have a bad day).

Do you subtract service time from the measured round trip?

Contributor Author:

RTT is the network time without any queues (which ping measures if it doesn't have a bad day).

The "without any queues" is pretty optimistic requirement, I would say. I admit that outgoing is likely to take queue-less paths even in the kernel, but incoming all consists of queues.

Do you subtract service time from the measured round trip?

Depends on what "service time" is. If it's the duration of the verb handler, then of course -- that's the point (9bbe38c)

@@ -953,8 +964,13 @@ namespace rpc {
                           _error = true;
                       } else if (it != _outstanding.end()) {
                           auto handler = std::move(it->second);
+                          auto ht = std::get<1>(msg_id_and_data); // ht stands for "handler time"
                           _outstanding.erase(it);
                           (*handler)(*this, msg_id, std::move(data.value()));
+                          if (ht) {
+                              _stats.rtt_samples++;
+                              _stats.rtt_total += (rpc_clock_type::now() - handler->start) - std::chrono::microseconds(*ht);
+                          }
                       } else if (msg_id < 0) {
                           try {
                               std::rethrow_exception(unmarshal_exception(data.value()));

src/rpc/rpc.cc (outdated diff):
res += m._c._stats.rtt_total;
}
return res.count();
}, sm::description("Total RTT time in seconds"), { domain_l }),
Member:

What does this mean? It will be an average of this-AZ RPC + other-AZ RPC + other-region RPC. Let's say the average is 10ms. What does it mean?

Contributor Author:

No, no, it won't be averaged across different DCs. The metric is per metrics domain, and Scylla configures metrics domains per-DC and per-verb-group-index.

Contributor Author:

That's how: scylladb/scylladb#15785

The domain as this patch defines it includes two strings:

First, the datacenter the server lives in. This is because grouping metrics for connections to different datacenters makes little sense for several reasons. For example, packet delays will differ for local-DC vs cross-DC traffic, and mixing those latencies together is pointless. Another example: the amount of traffic may also differ for local vs cross-DC connections, e.g. because of different usage of encryption and/or compression.

Second, each verb-idx gets its own domain. That's to be able to tell apart e.g. query-related traffic from gossiper traffic. For that, the existing isolation cookie is taken as-is.
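
A hypothetical sketch of that layout in code, assuming the metrics_domain client option introduced in #1753 (the helper and format are illustrative; the real wiring lives in scylladb/scylladb#15785 and may differ):

    #include <seastar/core/format.hh>
    #include <seastar/core/sstring.hh>
    #include <seastar/rpc/rpc.hh>

    // One metrics domain per (datacenter, verb-group index) pair, so that
    // local-DC and cross-DC delay samples never mix, and query traffic is
    // kept apart from e.g. gossiper traffic.
    seastar::rpc::client_options options_for(const seastar::sstring& server_dc, unsigned verb_idx) {
        seastar::rpc::client_options opts;
        opts.metrics_domain = seastar::format("{}:{}", server_dc, verb_idx);
        return opts;
    }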

Member:

Ah wow

Contributor:

I share Avi's wow - I didn't even know a concept of "metric domains" existed. As the number of Scylla and Seastar developers grows, we need to try to be more diligent about documenting things that other developers will need to know. When I wrote Scylla's docs/dev/metrics.md, I spent many hours trying to figure out all the details of how metrics were used in Scylla and explain them all (I didn't write any of the metrics code, so I had to dig and talk to Amnon to find the details). But clearly, in the 6 years that have passed (time flies...) things have changed. We should update this document - and document more of Scylla's basic utilities that every developer needs to know. Moreover, we should perhaps think of documenting the metrics feature in Seastar's documentation, not just in Scylla's...

Contributor Author:

When I wrote Scylla's docs/dev/metrics.md, I spent many hours trying to figure out all the details of how metrics were used in Scylla and explain them all (I didn't write any of the metrics code, so I had to dig and talk to Amnon to find the details).

It could be a nice blog post as well

We should update this document - and document more of Scylla's basic utilities that every developer needs to know. Moreover, we should perhaps think of documenting the metrics feature in Seastar's documentation, not just in Scylla's...

Is your proposal to update Scylla's metrics.md with the ... "details" of metrics that come from seastar?
In general I agree; it's worth describing seastar's metrics in more detail than just code (and code comments).

Contributor:

It would have been nice to have Scylla's metrics.md become a "book" that documents at length and in depth the meaning of every single metric (such documentation is sorely missing; I can rarely understand anything in the metrics because of the opaque one-sentence descriptions), but what I focused on in the current metrics.md is the overall structure of all metrics and how to use them. For example, I described the metric labels "shard" and "instance" that every metric has. If we recently added per-scheduling-group labels, and per-"domain" (whatever that is) labels that apply to many different metrics, it would be nice to explain them. If Seastar brings its own set of RPC metrics, it would be nice to explain it too.
In general, I wouldn't complain if more documentation gets written :-)

The code should stick to the response_frame_headroom constant, not
introduce magic numbers of its own.

Signed-off-by: Pavel Emelyanov <[email protected]>
Note the time it took the (server-side) handler callback to run and
propagate it via an optional argument down to the .respond() method.
Error handlers call respond() too; in that case the argument is passed
disengaged.

Signed-off-by: Pavel Emelyanov <[email protected]>
Introduce a new negotiation parameter that tells whether the client
wants the server to send back the handler execution duration. This is
for compatibility with older clients that expect the server reply to be
of some fixed (shorter) size.

The new response frame, if negotiated, includes the 32-bit handler
duration value in microseconds right next to the existing bits. If no
duration was provided (when sending an error response, for example),
the value is 0xF..F. After parsing, the duration is returned as an
optional value, with 0xF..F mapping to a disengaged one.

For now the client makes no use of it; the next patch will.

Signed-off-by: Pavel Emelyanov <[email protected]>
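
A minimal decoding sketch of the response-frame field described in the commit message above (illustrative, not the actual frame parser):

    #include <chrono>
    #include <cstdint>
    #include <optional>

    // The negotiated field is a 32-bit microseconds value; the all-ones
    // pattern means "no duration provided" (e.g. an error response or an
    // old server) and maps to a disengaged optional.
    std::optional<std::chrono::microseconds> decode_handler_duration(uint32_t raw) {
        if (raw == 0xFFFFFFFFu) {
            return std::nullopt;
        }
        return std::chrono::microseconds(raw);
    }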
Now the client has two durations with which it can estimate the
round-trip time of a packet -- the duration between sending the request
and receiving the response, and the time it took the handler to execute
on the server side. The difference is the delay value.

Those values from different calls are accumulated in a client counter,
as is the total number of delay samples received (errors and old
servers don't send it). Metrics export both as counters, so one can
rate them over time and monitor the average delay.

Signed-off-by: Pavel Emelyanov <[email protected]>
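
Since both values are exported as monotonically growing counters, a consumer can compute the average delay between two scrapes. A hedged sketch (the struct and names are illustrative, not the actual seastar stats layout):

    #include <chrono>
    #include <cstdint>

    struct delay_counters {
        uint64_t samples;                   // total number of delay samples
        std::chrono::microseconds total;    // summed delay across samples
    };

    // Average delay over one scrape interval, in microseconds.
    double average_delay_us(const delay_counters& prev, const delay_counters& cur) {
        uint64_t dsamples = cur.samples - prev.samples;
        auto dtotal = cur.total - prev.total;
        return dsamples ? double(dtotal.count()) / double(dsamples) : 0.0;
    }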
By default the RPC client uses the new frames that include the handler
execution time. Tune up one simple test to validate that legacy frames
still work too.

Signed-off-by: Pavel Emelyanov <[email protected]>
@xemul changed the title from "RPC round-trip time metrics" to "RPC delay metrics" on Jun 25, 2024
@xemul (Contributor, Author) commented Jun 25, 2024

upd:

  • renamed rtt to delay
  • updated doc/rpc.md with the new negotiation parameter and the extended response frame
  • handled potential 32-bit overflow of the microseconds duration

@xemul requested review from nyh and avikivity on Jul 3, 2024 06:28
@avikivity avikivity merged commit 826bc0b into scylladb:master Jul 3, 2024
14 checks passed