Binary event tracer #2417

Draft · xemul wants to merge 7 commits into master
Conversation

@xemul xemul commented Aug 30, 2024

This patch adds the ability to capture information about frequent events (such as queued/dispatched/completed IO requests) into an in-memory buffer and write it to a file. The goal is to collect a large number of events over a relatively short period of time (e.g. seconds to minutes), thus capturing what exactly is happening in the system instead of averaged statistical metrics. Events are collected in binary form, so a decoder is included.

In this PR the tracer is put on io-queue to collect IO events. Example of the decoded output:

00000000:000.000002 --- buffer (skipped 0) ---
00000009:- OPENING: shard=0 text=0:main,1:exit,2:lr  
00000021:000.000105 IO Q 25180 r class 2 512
0000002c:000.000106 IO Q 24d80 r class 2 512
00000037:000.000116 IO D 25180
0000003e:000.000137 IO D 24d80
00000045:000.000190 IO C 24d80
0000004c:000.000190 IO C 25180
00000053:000.000304 IO Q 25180 r class 2 512
0000005e:000.000305 IO Q 24d80 r class 2 512
00000069:000.000306 IO D 25180
00000070:000.000306 IO D 24d80
00000077:000.000361 IO C 25180
0000007e:000.000371 IO C 24d80
...

The OPENING message includes the shard id and the id:name mapping for registered sched classes.
The IO event format above is
<offset in file>:<timestamp in usec> <event name Queue/Dispatch/Complete> <request id>
and the Queued event additionally carries <direction r/w> <class ID> <length>.

For IO there's already a set of per-request io_log.trace() calls, but it's not very efficient. First, it takes time to format each message and forward it to syslogd (on production servers). Next, each request traces three messages:

TRACE 2024-08-30 14:02:59,640 [shard 2:lr  ] io - dev 64771 : req 0x60200011ec80 queue  len 512 capacity 80612
TRACE 2024-08-30 14:02:59,640 [shard 2:main] io - dev 64771 : req 0x60200011ec80 submit
TRACE 2024-08-30 14:02:59,640 [shard 2:main] io - dev 64771 : req 0x60200011ec80 complete

that's 290 bytes. By contrast, this tracer generates 25 bytes per request. Also note that the log timestamp precision is msec, while IO events need at least usec resolution.

refs: #2403

@xemul xemul requested a review from avikivity August 30, 2024 11:08
avikivity commented Sep 1, 2024

We should use statically defined tracepoints (<sys/sdt.h>) instead of an ad-hoc format/protocol/parser, like blktrace/blkparse. The challenge is getting the information out in a performant way. I tried systemtap dyninst and failed, several times. Please check if LTTng works.

/cc @michoecho
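For reference, an SDT tracepoint of the kind suggested here could look like the sketch below (a minimal sketch; the provider and probe names are made up, and <sys/sdt.h> is shipped by the systemtap-sdt package):

#include <sys/sdt.h>
#include <cstdint>

// A disabled USDT probe compiles down to a single nop; when perf,
// bpftrace or systemtap attaches to it, the arguments become visible
// to the tracer without any custom format/protocol/parser to maintain.
void trace_io_complete(uint32_t req_id) {
    DTRACE_PROBE1(seastar_io, io_complete, req_id);
}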

michoecho commented
> We should use statically defined tracepoints (<sys/sdt.h>) instead of an ad-hoc format/protocol/parser, like blktrace/blkparse. The challenge is getting the information out in a performant way. I tried systemtap dyninst and failed, several times. Please check if LTTng works.
>
> /cc @michoecho

@xemul In other words: Avi has banned custom tracing patches until it is proven that existing tracing frameworks aren't good enough. (I had a custom tracing patch denied too, once.)

If you want to experiment with lttng, you can take https://github.com/scylladb/scylla-enterprise-seastar/commit/f0b870368f733bde6fd54cfbec6052080b5f2db5#diff-ceb4cd71fc94af90f390646cde7f7cc8bc673cbd7fd77208609b42fc634d0fd1 as a starting point. (But to make it compile, you will probably have to prepend LTTNG_UST_/lttng_ust_ to all lttng identifiers -- there was an API change in lttng between v2.12 and v2.13.)
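For the record, a tracepoint definition spelled with the v2.13 prefixes would look roughly like the sketch below (hedged: the provider and event names are invented here, not taken from the linked commit):

#undef LTTNG_UST_TRACEPOINT_PROVIDER
#define LTTNG_UST_TRACEPOINT_PROVIDER seastar_io

#undef LTTNG_UST_TRACEPOINT_INCLUDE
#define LTTNG_UST_TRACEPOINT_INCLUDE "./io_tracepoints.h"

#if !defined(IO_TRACEPOINTS_H) || defined(LTTNG_UST_TRACEPOINT_HEADER_MULTI_READ)
#define IO_TRACEPOINTS_H

#include <lttng/tracepoint.h>

LTTNG_UST_TRACEPOINT_EVENT(
    seastar_io,                 // provider (hypothetical)
    request_queued,             // event name (hypothetical)
    LTTNG_UST_TP_ARGS(uint32_t, req_id, uint8_t, sched_class, uint32_t, len),
    LTTNG_UST_TP_FIELDS(
        lttng_ust_field_integer(uint32_t, req_id, req_id)
        lttng_ust_field_integer(uint8_t, sched_class, sched_class)
        lttng_ust_field_integer(uint32_t, len, len)
    )
)

#endif

#include <lttng/tracepoint-event.h>

The call site would then be lttng_ust_tracepoint(seastar_io, request_queued, id, cls, len).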

@xemul xemul marked this pull request as draft September 4, 2024 07:14
xemul commented Sep 4, 2024

> @xemul In other words: Avi has banned custom tracing patches until it is proven that existing tracing frameworks aren't good enough.

Yeah, Avi made it crystal clear :D

> (I had a custom tracing patch denied too, once.)

There must be a denied-binary-tracing club somewhere; failing that, we could found one

> If you want to experiment with lttng, you can take https://github.com/scylladb/scylla-enterprise-seastar/commit/f0b870368f733bde6fd54cfbec6052080b5f2db5#diff-ceb4cd71fc94af90f390646cde7f7cc8bc673cbd7fd77208609b42fc634d0fd1 as a starting point.

Thanks!

The goal of the tracer is to capture events at a frequency that
exceeds the regular logging and metrics resolution. Basically, it's a
buffer (well -- a bunch of buffers) into which pre-defined trace points
put their binary data.

Overhead: when OFF it's a single if; when ON it's putting a few
bytes of data, advancing a pointer and checking if the buffer is full.
The current implementation pre-allocates 8 buffers of 128k each.
Spoiler: IO events take 25 bytes per request, so at 10 kiops this size
is able to store trace data for ~4 seconds.

When a buffer is full it's flushed to disk in a single write, then
recycled. If an event happens when there are no buffers available, the
tracer counts the number of skipped events and puts that count into the
next available buffer, to at least let the user know it happened.
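A minimal sketch of that hot path, assuming illustrative names (this is
not the actual patch code):

#include <cstdint>
#include <cstring>
#include <deque>
#include <vector>

struct trace_buffer {
    std::vector<uint8_t> data = std::vector<uint8_t>(128 * 1024);
    size_t pos = 0;
    size_t room() const { return data.size() - pos; }
    void append(const void* p, size_t n) {
        std::memcpy(data.data() + pos, p, n);
        pos += n;
    }
};

class tracer {
    bool _enabled = false;
    trace_buffer* _cur = nullptr;     // buffer currently being filled
    std::deque<trace_buffer*> _free;  // the 8 pre-allocated 128k buffers
    uint32_t _skipped = 0;            // events dropped while no buffer was free

    uint16_t delta_timestamp_us();    // usecs since the previous event
    void flush_and_recycle();         // single write to disk, then re-use

public:
    void put(uint8_t event_id, const void* payload, size_t len) {
        if (!_enabled) {              // the only cost when tracing is OFF
            return;
        }
        if (_cur == nullptr || _cur->room() < 1 + 2 + len) {
            flush_and_recycle();
            if (_cur == nullptr) {    // no free buffer -- count the skip,
                _skipped++;           // reported via the next buffer-head
                return;
            }
        }
        uint16_t ts = delta_timestamp_us();
        _cur->append(&event_id, 1);   // 1-byte event id
        _cur->append(&ts, 2);         // 2-byte usec delta
        _cur->append(payload, len);   // id-specific payload, length implied
    }
};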

Tracing is started with limits on duration (in seconds) and on the total
size of flushed data (in bytes, though it's naturally rounded to 128k).
Once either of the limits is hit, tracing is stopped. This is to avoid
disturbing the system too much, in terms of both run-time overhead and
consumed disk space.

Message format is:

  event id (1 byte)
  timestamp (2 bytes) -- incremental, the duration since the previous
  event in usecs, thus holding deltas of up to ~64 msec

Then goes the payload of arbitrary length. The length is not saved; it's
implied that the event id defines the payload length.
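The decoder side is correspondingly simple. A hypothetical sketch, where
payload_length() is the per-event-id length table implied above (the
service events described below, such as the 1-byte terminator, would
need special-casing that is omitted here):

#include <cstdint>
#include <cstring>

size_t payload_length(uint8_t event_id);  // known per event id, not stored

void decode(const uint8_t* buf, size_t len) {
    uint64_t now_us = 0;  // timestamps are deltas, so keep a running sum
    size_t pos = 0;
    while (pos + 3 <= len) {
        uint8_t id = buf[pos];
        uint16_t delta;
        std::memcpy(&delta, buf + pos + 1, sizeof(delta));
        now_us += delta;
        // ... interpret the payload at buf + pos + 3 according to id ...
        pos += 3 + payload_length(id);
    }
}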

There are 4 service events:

- buffer-head: placed at the head of each buffer; includes the number
  of skipped events (4 bytes) and a 4-byte timestamp offset. The latter
  is twice as wide as the per-event timestamp, since recycling a buffer
  may take more than 64 msec

- opening: an event that's placed when tracing is enabled; includes the
  shard number and whatever the enabler code wants to put there
  (spoiler: the IO tracer puts the id:name mapping for registered sched
  classes)

- tick: a timer puts this 3-byte event every 32 msec. This prevents the
  2-byte timestamps from overflowing in case no events happen for that
  long, and provides sanity milestones for the decoder (see the sketch
  after this list)

- terminator: when an event doesn't fit into the buffer's tail room, a
  1-byte event (without a timestamp) is placed there to tell the decoder
  to stop decoding the current buffer
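The tick can be driven by a plain Seastar timer. A sketch under the
assumption that put_tick() appends the payload-less 3-byte event (names
are illustrative):

#include <chrono>
#include <seastar/core/timer.hh>

class tracer {
    seastar::timer<> _tick_timer;
    void put_tick();  // appends the 3-byte tick event

public:
    // A 2-byte usec delta overflows after ~65.5 msec of silence, so a
    // 32-msec period keeps consecutive events well under that limit.
    void arm_tick() {
        _tick_timer.set_callback([this] { put_tick(); });
        _tick_timer.arm_periodic(std::chrono::milliseconds(32));
    }
};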

Signed-off-by: Pavel Emelyanov <[email protected]>
As the message format is determined by the event id, the decoder should
know exactly which version of the code a trace was generated from, so it
is tagged with the same "release ID" in the repository.

Signed-off-by: Pavel Emelyanov <[email protected]>
The first trace generator is going to be the IO queue code, so put the
tracer object on board and introduce the enabler method. Events will
come in the next patch.

Signed-off-by: Pavel Emelyanov <[email protected]>
There's one interesting event that is generated by the fair-queue --
when the queue starts waiting for shared capacity. In fact, the output
throttler code should live at the IO-queue level (see scylladb#2332),
but currently it's in the fair-queue, so the fair-queue will need to
generate this event.

This patch adds a tracer reference to the fair-queue for future
patching.

Signed-off-by: Pavel Emelyanov <[email protected]>
Add a CLI option that enables tracing after all preparations are done.
The workload jobs will then be traced and trace files will be available
afterwards.

Signed-off-by: Pavel Emelyanov <[email protected]>
There are 3 interesting events -- when a request is queued, when it's
dispatched, and when it completes.

The first one comes with the request ID, class ID, direction and length.
To save some space in the trace buffer the length is trimmed down to
512-byte blocks, so 2 bytes are enough for requests of up to 32 MB. The
request "ID" is derived from the io_desc_read_write pointer, as it
uniquely identifies a request while it's in flight. For that, the
pointer is disguised -- shifted right by 3 bits, thus cutting off the
always-zero low bits, and then converted to a 32-bit number. This cuts
off the upper part of the pointer, but it seems to be "unique enough"
for one shard (see the sketch below). This event is
    1 (event id)
  + 2 (timestamp)
  + 1 (direction)
  + 1 (class)
  + 2 (length in blocks)
  + 4 (id)
  = 11 bytes

Dispatch and complete only include the request ID field. Their length is
thus 7 bytes (event id + timestamp + request id). In total, each request
generates 25 bytes in the buffer.
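A sketch of the ID derivation described above (the helper name is
hypothetical):

#include <cstdint>

// The io_desc_read_write pointer uniquely identifies an in-flight
// request; its low 3 bits are always zero due to alignment, so shift
// them off and keep the low 32 bits -- "unique enough" for one shard.
static uint32_t trace_request_id(const void* desc) {
    return static_cast<uint32_t>(reinterpret_cast<uintptr_t>(desc) >> 3);
}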

Signed-off-by: Pavel Emelyanov <[email protected]>
There's only one interesting event for it -- when the queue goes into
the "pending" state, waiting for shared capacity. There are no arguments
here, so it takes just 3 bytes in the buffer.

Signed-off-by: Pavel Emelyanov <[email protected]>