Implement removal of files via blocks discarding according to configured bandwidth #2303
base: master
Conversation
(force-pushed from 5934da6 to 3559f90)
```cpp
}

_currently_discarded_files.reserve(std::max(4u * _config._requests_per_second, uint64_t{256u}));
_refill_available_requests_timer.arm_periodic(std::chrono::seconds{1});
```
Consider you have 1M requests per second. AFAIU it means that the discarder will emit 1M requests (there's a magic 8u constant in try_discard_blocks) almost instantly and then will idle for the remaining (almost) second.
That's true. Also, if the timer expires right after the requests were issued, then another 1M will be issued. Such a situation may happen when using a periodic timer. In that case the bandwidth within that second may be almost twice the configured one, which is wrong.
While creating file_discard_queue_global_test.cc I also had the impression that setting the periodic timer in the constructor is problematic. However, I wanted to push the draft ASAP to get some general feedback about the chosen approach. Also, I see some corner cases in other parts of the PR that need to be discussed.
Regarding the timer - currently I have two ideas:
- decrease the period of the timer and refill only a fraction of the available shares, e.g. given a 1/n s period refill RPS/n shares - it could alleviate the "long idling" problem, but still, if the refill happens right after the first portion of discards was issued, then the bandwidth for the first second may be larger than the configured one (by BW/n)
- do not use a periodic timer - instead arm the timer after issuing the discards that caused the number of shares to decrease
Regarding the magic constant - file_discard_queue processes files in FIFO order. If any file was opened, then the queue tries to discard that file completely before opening the next ones. My intention was to not open too many files at once. 8 is just an arbitrary value.
You're trying to implement a token-bucket algorithm; it can be done without a timer and with smoother refill. All you need is to refill tokens not by timer, but synchronously, every time you try to grab tokens from the bucket. For that, there should be a function tokens = fn(duration<double>)
that converts a duration into tokens. Every time you grab tokens you first put some of them back using that fn() and note the time of the replenish to calculate the duration for the next refill. There's util/shared_token_bucket.hh that implements it, but it's node-wide, not shard-local, and thus has some extra complexity due to that.
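A minimal shard-local sketch of the synchronous-refill idea (illustrative names, not the Seastar shared_token_bucket API):

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Minimal shard-local token bucket: tokens are replenished lazily on each
// grab() call, proportionally to the time elapsed since the last refill,
// instead of by a periodic timer. This avoids both the 1-second burst and
// the double-refill problem of arm_periodic().
class token_bucket {
    double _rate;      // tokens per second (configured bandwidth)
    double _capacity;  // burst limit
    double _tokens;
    std::chrono::steady_clock::time_point _last_refill;

public:
    token_bucket(double rate, double capacity)
        : _rate(rate), _capacity(capacity), _tokens(capacity)
        , _last_refill(std::chrono::steady_clock::now()) {}

    // Try to take `n` tokens; returns true on success.
    bool grab(double n) {
        auto now = std::chrono::steady_clock::now();
        std::chrono::duration<double> elapsed = now - _last_refill;
        _last_refill = now;
        // Put back tokens earned since the previous refill, capped at capacity.
        _tokens = std::min(_capacity, _tokens + elapsed.count() * _rate);
        if (_tokens < n) {
            return false;
        }
        _tokens -= n;
        return true;
    }
};
```

Because the refill is computed from elapsed wall-clock time at every grab, the emission rate converges to `_rate` without any idle-then-burst pattern.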
```cpp
if (node["discard_block_size"]) {
    mp.discard_block_size = parse_memory_size(node["discard_block_size"].as<std::string>());
}
if (node["discard_requests_per_second"]) {
```
How to calculate this number for a given node?
It is quite a tricky question. The experiments with diskplorer seemed to indicate that when block_size == 32MiB and discard_bw == write_bw, then on i3.2xlarge such huge latency spikes were not visible.
However, what I am afraid of is whether the configured bandwidth would be sufficient. The queue of files to remove cannot grow indefinitely. Also, I am not sure how to measure the desired value via iotune - creating files to test a certain bandwidth may take a long time.
```cpp
auto& params = mountpoint_it->second;
if (params.discard_block_size && params.discard_requests_per_second) {
    return file_discard_queue_config{*params.discard_block_size, *params.discard_requests_per_second};
```
So the discard numbers in io-properties file are not node-wide, but per-shard?
It works like that for the current state of the PR. It would be better to specify the value for the device instead of per-shard.
Then consider the number of discard requests specified is less than the number of shards on the node. Is that possible? If yes, how to handle it?
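One conceivable way to handle a budget smaller than the shard count (an illustration of the problem, not what the PR implements) is to divide a node-wide request budget across shards and spread the remainder, so some shards simply get zero requests per second:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical policy: split a node-wide discard request budget across
// shards so that the per-shard values sum to the node-wide value, even when
// node_rps < shards (in which case some shards receive 0).
constexpr uint64_t shard_share(uint64_t node_rps, unsigned shards, unsigned shard_id) {
    return node_rps / shards + (shard_id < node_rps % shards ? 1 : 0);
}
```

With 5 requests/s and 8 shards, shards 0-4 each get 1 and shards 5-7 get 0; whether a zero-budget shard should forward its files to another shard is exactly the open question raised here.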
Since we try to go with fine-grained, block-by-block (the "discard_block_size" parameter) removal, why don't we route those discards via the regular io-queue?
To avoid I/O latency spikes that are caused by too many TRIM requests issued to the drive over a short period of time when the filesystem is mounted with the online discard option and many files are removed, the file_discard_queue class is used. Instead of relying on the unlink operation, which trims all extents of a file at once, an approach that discards blocks at a given pace is used. When a user wants to remove a file, file_discard_queue::enqueue_file_removal() needs to be called. Its main purpose is to store the information about the work related to the file discard in the internal data of the class. The reactor will invoke the file_discard_queue::try_discard_blocks() function via a poller. The function checks whether the discard bandwidth allows another block to be discarded. If the bandwidth is available, then the discard is performed. The described approach ensures that the amount of removed data remains at the configured level. Signed-off-by: Patryk Wrobel <[email protected]>
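The enqueue/poll flow described in this commit message can be sketched as a simplified synchronous model (illustrative stand-in, not the actual Seastar class):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>

// Simplified model of the described flow: files are queued in FIFO order and
// a poller discards at most one block per call, as long as request shares
// remain for the current second.
class discard_queue_model {
    struct entry { std::string path; uint64_t blocks_left; };
    std::deque<entry> _files;
    uint64_t _available_requests;

public:
    explicit discard_queue_model(uint64_t requests_per_second)
        : _available_requests(requests_per_second) {}

    void enqueue_file_removal(std::string path, uint64_t blocks) {
        _files.push_back({std::move(path), blocks});
    }

    // Called by the reactor poller; returns true if a block was discarded.
    bool try_discard_blocks() {
        if (_available_requests == 0 || _files.empty()) {
            return false;  // out of bandwidth shares, or nothing to do
        }
        --_available_requests;
        auto& f = _files.front();
        if (--f.blocks_left == 0) {
            _files.pop_front();  // whole file discarded; unlink would go here
        }
        return true;
    }

    void refill(uint64_t requests_per_second) { _available_requests = requests_per_second; }
    std::size_t pending_files() const { return _files.size(); }
};
```

Once the per-second shares are exhausted, try_discard_blocks() becomes a no-op until the next refill, which is how the configured pace is enforced.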
This patch introduces unit tests for the file_discard_queue class. To avoid interference between test cases, each of them manually triggers file_discard_queue::try_discard_blocks(). In the next patches file_discard_queue will be plugged into the reactor for each shard. Then another part of the tests will be added as a separate binary to check the full flow. Signed-off-by: Patryk Wrobel <[email protected]>
To enable usage of file_discard_queue in seastar applications, it had to be plugged into the reactor. This patch introduces usage of file_discard_queue in the reactor. The following changes have been implemented:
- remove_file_via_blocks_discarding() function is available and allows users to discard files according to a given bandwidth (using file_discard_queue)
- file_discard_queue_submission_pollfn poller was added to the reactor's main loop to trigger the work related to discarding files
- I/O properties contain two new optional parameters: discard_block_size and discard_requests_per_second, which allow a user to configure the bandwidth of discards for a given mountpoint
Signed-off-by: Patryk Wrobel <[email protected]>
… removal
In order to ensure that the file discard queue is properly used by the reactor and can be configured via I/O properties parameters, a new set of global tests was introduced. To avoid any interference between test cases, they are run one after another in a seastar::thread. The test cases verify the full flow, including usage of global utility functions, the reactor, pollers and the configured file discard queues. The bandwidth of the discard operation is also verified. Signed-off-by: Patryk Wrobel <[email protected]>
(force-pushed from 3559f90 to e2ad5ba)
Could you elaborate a little bit? In the beginning I saw that there is no support for discards in io_queue. That's the reason why I created a separate class called file_discard_queue. When I try to think about routing via io_queue, I have a few ideas and I am not sure which one is correct (if any).
Correct.
Yes, this just means that io_queue::poll_io_queue() will need to submit discards some other way, not via io_sink. One of the options is to convert io_desc_read_write into an abstract class with its dispatch() being a virtual method and have two different classes inherit from it -- the existing io_desc_read_write and io_desc_discard.
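A rough shape of the suggested refactoring (illustrative stubs; the real io_desc_read_write lives in Seastar's reactor internals and submits actual I/O):

```cpp
#include <cassert>
#include <string>

// Sketch of the suggested hierarchy: dispatch() becomes virtual, so the
// fair-queue machinery can push both kinds of descriptors without knowing
// whether a request goes to io_sink or to a separate discard path.
struct io_desc {
    virtual ~io_desc() = default;
    virtual std::string dispatch() = 0;  // real code would submit the request
};

struct io_desc_read_write : io_desc {
    std::string dispatch() override {
        return "submitted via io_sink";  // existing read/write path
    }
};

struct io_desc_discard : io_desc {
    std::string dispatch() override {
        return "submitted via discard path";  // bypasses io_sink
    }
};
```

The fair_queue would then hold io_desc pointers and call dispatch() polymorphically, keeping discards inside the same scheduling machinery as reads and writes.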
Definitely not. Discards must compete with regular read-write requests in the same fair-queue, because they compete with each other in the kernel/disk. That's why I asked for a this-like measurement of read-vs-discard, to see how they correspond to the existing model. You're right that currently io_queue is read-write oriented, but that's ... a historical coincidence :) There's the queued_io_request thing; io_queue maintains a set of those, pushing them through the fair_queue, so that they obey both -- shares-based classes and the outgoing throttler (*). There's nothing read-write specific here yet.

(*) the outgoing throttler is currently in fair_queue, but that's not quite correct either. The goal of the fair-queue is to provide cross-class fair scheduling. The goal of the outgoing throttler is to limit the amount of requests sent to disk. It is effectively a property of io_sink, and it's in fair_queue due to me being lazy or shortsighted (likely both) back when I implemented it. And my above point is that discards should be pushed through it in the first place.
Hi, I tried to prepare io_queue for request types that do not need to use io_sink, according to your comment. Please find another PR: #2319
To avoid I/O latency spikes that are caused by too many TRIM requests
issued to the drive over a short period of time when the filesystem
is mounted with the online discard option and many files are removed,
the file_discard_queue class is used.
Instead of relying on the unlink operation, which trims all extents of
a file at once, an approach that discards blocks at a given pace is used.
When a user wants to remove a file, then remove_file_via_blocks_discarding()
needs to be called. Its main purpose is to enqueue the work into file_discard_queue
object associated with the device on which the file is stored.
Reactor will invoke file_discard_queue::try_discard_blocks() function via poller.
The function checks if the bandwidth of discards allows another block to be
discarded. If the bandwidth is available, then the discard is performed.
The described approach ensures that the amount of removed data over time
remains at the configured level.
Refs: scylladb/scylladb#14940