Binary event tracer #2417
base: master
Conversation
We should use statically defined tracepoints (<sys/sdt.h>) instead of an ad-hoc format/protocol/parser, like blktrace/blkparse. The challenge is getting the information out in a performant way. I tried systemtap dyninst and failed, several times. Please check if lttng works. /cc @michoecho
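For reference, a minimal sketch of what statically defined tracepoints from `<sys/sdt.h>` look like; the provider, probe and function names below are illustrative, not anything from this PR:

```cpp
// Minimal USDT sketch: each probe compiles to a single nop plus an ELF note,
// and records its arguments only when an external tracer attaches to it.
// The provider "seastar_io" and all names here are made up for illustration.
#include <sys/sdt.h>
#include <cstdint>

void trace_request_queued(uint32_t req_id, uint32_t len_blocks) {
    DTRACE_PROBE2(seastar_io, request_queued, req_id, len_blocks);
}

void trace_request_completed(uint32_t req_id) {
    DTRACE_PROBE1(seastar_io, request_completed, req_id);
}
```

Such probes can then be listed and attached to from outside the process, e.g. via bpftrace's `usdt:` provider or perf.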
@xemul In other words: Avi has banned custom tracing patches until it is proven that existing tracing frameworks aren't good enough. (I had a custom tracing patch denied too, once.) If you want to experiment with lttng, you can take https://github.com/scylladb/scylla-enterprise-seastar/commit/f0b870368f733bde6fd54cfbec6052080b5f2db5#diff-ceb4cd71fc94af90f390646cde7f7cc8bc673cbd7fd77208609b42fc634d0fd1 as a starting point. (But to make it compile, you will probably have to prepend
Yeah, Avi made it crystal clear :D
There must be a denied-binary-tracing club somewhere; optionally, we could found one.
Thanks!
The goal of the tracer is to capture events at a frequency that exceeds the regular logging and metrics resolution. Basically, it's a buffer (well -- a bunch of buffers) into which pre-defined trace points put their binary data. Overhead: when OFF it's a single if; when ON it's putting several bytes of data, advancing a pointer and checking if the buffer is full.

The current implementation pre-allocates 8 buffers of 128k each. Spoiler: IO events take 25 bytes per request, so at 10 kiops this size can store trace data for ~4 seconds. When a buffer is full it's flushed to disk in a single write, then recycled. If an event happens when there are no buffers available, the tracer counts the number of skipped events and puts it into the next available buffer to at least let the user know it happened.

Tracing is started for a given duration (in seconds) and total size of flushed data (in bytes, naturally rounded to 128k). Once either of the limits is hit, tracing is stopped. This is to avoid disturbing the system too much, both in run-time overhead and in consumed disk space.

Message format is:

- event id (1 byte)
- timestamp (2 bytes), incremental -- the duration since the previous event in usecs, thus holding up to 64 msec

Then comes the payload of arbitrary length. The length is not saved; it's implied that the event id defines the payload length.

There are 4 service events:

- buffer-head: placed at the head of each buffer; includes the number of skipped events (4 bytes) and a 4-byte timestamp offset. The latter is 2x larger than the per-event timestamp, since recycling a buffer may take more than 64 msec
- opening: an event placed when tracing is enabled; includes the shard number and whatever the enabler code wants to put there (spoiler: the IO tracer puts the id:name mapping for registered sched classes)
- tick: a timer puts this 3-byte event every 32 msec. This prevents the 2-byte timestamps from overflowing in case no events happen for that long, and serves as sanity milestones for the decoder
- terminator: when an event doesn't fit into the buffer's tail room, this 1-byte event (without timestamp) is placed there to tell the decoder to stop decoding the current buffer

Signed-off-by: Pavel Emelyanov <[email protected]>
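A rough sketch of the buffer/event encoding described above; the struct and method names are assumptions for illustration, not the actual code in this patch:

```cpp
// Illustrative sketch of the trace buffer encoding described in the commit
// message: 1-byte event id, 2-byte incremental usec timestamp, then a payload
// whose length is implied by the event id (and therefore not stored).
#include <cstdint>
#include <cstring>

struct trace_buffer {
    static constexpr size_t buf_size = 128 * 1024;  // 8 such buffers are pre-allocated
    char data[buf_size];
    size_t pos = 0;

    // Returns false when the event doesn't fit; the caller then writes the
    // 1-byte terminator event and rotates to the next free buffer.
    bool put(uint8_t event_id, uint16_t delta_us, const void* payload, size_t payload_len) {
        if (pos + 1 + 2 + payload_len > buf_size) {
            return false;
        }
        data[pos++] = static_cast<char>(event_id);
        std::memcpy(data + pos, &delta_us, sizeof(delta_us));
        pos += sizeof(delta_us);
        std::memcpy(data + pos, payload, payload_len);
        pos += payload_len;
        return true;
    }
};
```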
As the message format is determined by the event id, the decoder must know exactly which version of the code a trace was generated from, and should thus be tagged with the same "release ID" in the repository. Signed-off-by: Pavel Emelyanov <[email protected]>
The first trace generator is going to be the IO queue code, so put the tracer object on board and introduce the enabler method. Events will come in the next patch. Signed-off-by: Pavel Emelyanov <[email protected]>
There's one interesting event generated by the fair-queue -- when the queue starts waiting for shared capacity. In fact, the output throttler code should live at the IO-queue level (see scylladb#2332), but currently it's in the fair-queue, so the latter will need to generate this event. This patch adds a tracer reference to the fair-queue for future patching. Signed-off-by: Pavel Emelyanov <[email protected]>
Add a CLI option that enables tracing after all preparations are done. The workload jobs will then be traced and trace files will be available afterwards. Signed-off-by: Pavel Emelyanov <[email protected]>
There are 3 interesting events -- when a request is queued, when it's dispatched and when it completes. The first one comes with the request ID, class ID, direction and length. To save some space in the trace buffer, the length is trimmed down to 512-byte blocks, so 2 bytes are enough for requests of up to 32 MB. The request "ID" is derived from the io_desc_read_write pointer, as it uniquely identifies a request while it's in flight. For that, the pointer is disguised -- shifted right by 3 bits, thus cutting off the always-zero bits, and then converted to a 32-bit number. This cuts off the upper part of the pointer, but it seems to be "unique enough" for one shard. This event is 1 (event id) + 2 (timestamp) + 1 (direction) + 1 (class) + 2 (length in blocks) + 4 (id) = 11 bytes. Dispatch and complete only include the request ID field; their length is thus 7 bytes (event id + timestamp + request id). In total, each request generates 25 bytes in the buffer. Signed-off-by: Pavel Emelyanov <[email protected]>
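A sketch of the request-id derivation and the Queue event payload described above; field and function names are assumptions, not the actual patch:

```cpp
// Sketch of the Queue event payload: 8 bytes of payload, which together with
// the 1-byte event id and 2-byte timestamp gives the 11 bytes mentioned above.
#include <cstdint>

// Derive a 32-bit request id from the io_desc_read_write pointer: shift right
// by 3 to drop the always-zero low bits, then truncate to 32 bits. The upper
// part of the pointer is lost, but the result is "unique enough" per shard
// while the request is in flight.
inline uint32_t request_id(const void* desc) {
    return static_cast<uint32_t>(reinterpret_cast<uintptr_t>(desc) >> 3);
}

struct queue_event_payload {
    uint8_t  direction;     // read or write                           (1 byte)
    uint8_t  sched_class;   // scheduling class id                     (1 byte)
    uint16_t len_blocks;    // length in 512-byte blocks, up to 32 MB  (2 bytes)
    uint32_t req_id;        // derived from the descriptor pointer     (4 bytes)
} __attribute__((packed));
```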
There's only one interesting event for it -- when the queue goes into the "pending" state waiting for shared capacity. No arguments here, so it only takes 3 bytes in the buffer. Signed-off-by: Pavel Emelyanov <[email protected]>
This patch adds the ability to capture information about frequent events (such as queued/dispatched/completed IO requests) into a buffer in memory and write it into a file. The goal is to collect a lot of events over a relatively short period of time (e.g. seconds to minutes), thus seeing what exactly is happening in the system instead of averaged statistical metrics. Events are collected in binary form, so a decoder is included.
In this PR the tracer is put on the io-queue to collect IO events. An example of the decoded output is sketched after the format description below.
The OPENING message includes shard id and id:name mapping for registered sched classes.
The decoded IO event format is
<offset in file>:<timestamp in usec> <event name Queue/Dispatch/Complete> <request id>
and for the Queued event also
<direction r/w> <class ID> <length>
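For illustration, a hypothetical decoded fragment in this format could look like the following (all values are made up; offsets advance by 11 bytes for a Queue event and by 7 bytes for Dispatch/Complete):

```
128:001200 Queue 0x1a2b3c4d w 3 64
139:001480 Dispatch 0x1a2b3c4d
146:002650 Complete 0x1a2b3c4d
```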
For IO there's already a set of per-request io_log.trace() calls, but it's not very efficient. First, it takes time to format each message and forward it to syslogd (on production servers). Next, each request traces three messages:
that's 290 bytes per request. Contrary to this, the tracer generates 25 bytes per request. Also note that the log timestamp precision is msec, while IO events need at least usec resolution.
refs: #2403