Add expired_sstables.py tool #12
usage: expired_sstables.py [-h] --table TABLE --gc-grace-seconds GC_GRACE_SECONDS --default-ttl DEFAULT_TTL
                           [--ignore-max-deletion-time]
                           [--safely-move-expired-sstables-to SAFELY_MOVE_EXPIRED_SSTABLES_TO]

optional arguments:
  -h, --help            show this help message and exit
  --table TABLE         Absolute path to table dir
  --gc-grace-seconds GC_GRACE_SECONDS
                        GC grace period in seconds
  --default-ttl DEFAULT_TTL
                        Default Time to Live in seconds
  --ignore-max-deletion-time
                        Ignore max deletion time of never-expiring SSTable, and assume its TTL == default TTL
  --safely-move-expired-sstables-to SAFELY_MOVE_EXPIRED_SSTABLES_TO
                        Specify path to backup dir, at which fully expired SSTables will be safely moved to.
                        WARNING: please guarantee Scylla is not running for safety reasons!

This tool allows the user to know:
1) The amount of disk space that can eventually be reclaimed from all expired SSTables.
2) The amount of disk space that can theoretically be reclaimed now from expired SSTables with no blockers.

An expired SSTable with no blocker is one that can be purged now without any chance of resurrecting data, because there is no non-expired SSTable whose data is shadowed by that expired SSTable. The tool itself performs ALL the data shadowing checks before deciding that an expired SSTable has no blocker, exactly like ScyllaDB itself does.

Additionally, the tool is able to calculate from scratch the max deletion time of SSTables. This feature is useful for SSTables that were affected by the bug which reset default TTL in schemas. So it's able to determine the SSTables that are essentially expired, even though the metadata doesn't tell us so.

Finally, the tool is able to reclaim the disk space used by all the expired SSTables with no blockers, meaning that getting rid of them doesn't lead to data resurrection. This feature must be used ONLY when the node is offline. The expired SSTables are moved to a backup directory specified by the user.
Tool output example:
https://gist.githubusercontent.com/raphaelsc/1616db26c033c1895df70cf468769aa2/raw/3bddce6113a8887759af6c55c374e9d842aa9033/expired%2520sstables%2520output

Signed-off-by: Raphael S. Carvalho <[email protected]>
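The "expired SSTable with no blockers" rule in the description can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `SSTableInfo` and every function name are hypothetical, and the blocker rule (an overlapping SSTable that holds data at least as old as the candidate's newest write) is an assumption modeled on how fully expired SSTables are usually detected. Blocked candidates are treated as blockers for the remaining candidates.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SSTableInfo:
    """Hypothetical summary of the per-SSTable metadata the checks need."""
    name: str
    first_token: int        # inclusive
    last_token: int         # inclusive
    min_timestamp: int      # oldest write timestamp
    max_timestamp: int      # newest write timestamp
    max_deletion_time: int  # seconds since epoch
    size_bytes: int


def is_expired(sst: SSTableInfo, gc_grace_seconds: int, now: int) -> bool:
    # Everything in the SSTable is past its deletion time plus the GC grace period.
    return sst.max_deletion_time + gc_grace_seconds < now


def overlaps(a: SSTableInfo, b: SSTableInfo) -> bool:
    # Inclusive token ranges overlap when neither one ends before the other starts.
    return a.first_token <= b.last_token and b.first_token <= a.last_token


def blocks(blocker: SSTableInfo, candidate: SSTableInfo) -> bool:
    # Assumption: the blocker may hold data that the candidate shadows if it
    # overlaps in token range and has writes no newer than the candidate's newest.
    return overlaps(blocker, candidate) and blocker.min_timestamp <= candidate.max_timestamp


def expired_without_blockers(sstables: List[SSTableInfo],
                             gc_grace_seconds: int, now: int) -> List[SSTableInfo]:
    candidates = [s for s in sstables if is_expired(s, gc_grace_seconds, now)]
    blockers = [s for s in sstables if not is_expired(s, gc_grace_seconds, now)]
    changed = True
    while changed:  # repeat until no candidate gets demoted to a blocker
        changed = False
        for cand in list(candidates):
            if any(blocks(b, cand) for b in blockers):
                candidates.remove(cand)
                blockers.append(cand)
                changed = True
    return candidates
```

Summing `size_bytes` over all expired SSTables gives the first figure the tool reports, and summing it over the result of `expired_without_blockers` gives the second.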
I have not reviewed the correctness of the algorithm.
My only comment is that now that the script is in `sstable-tools`, you can reuse `sstable_tools.sstablelib` as well as the existing component parsers from the same modules. `sstablelib` already has parsers for a lot of existing disk types from `sstables::`, maybe worth reusing those.
import psutil

# WARNING: don't tweak the value below whatsoever as it's a CONSTANT.
deletion_time_for_never_expiring_sstable = 2147483647 # timestamp in seconds
In python, constants usually have all caps names.
I will change, thanks.
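For illustration, the renamed constant together with the `--ignore-max-deletion-time` behavior from the PR description might look like the sketch below. The function name and the way the deletion time is recomputed (newest write timestamp, assumed to be in microseconds, plus the default TTL) are assumptions made for this example, not the PR's actual logic.

```python
# Max deletion time stored for an SSTable whose data never expires (2**31 - 1 seconds).
DELETION_TIME_FOR_NEVER_EXPIRING_SSTABLE = 2147483647


def effective_max_deletion_time(max_deletion_time: int,
                                max_timestamp_us: int,
                                default_ttl: int,
                                ignore_max_deletion_time: bool = False) -> int:
    """Return the deletion time (in seconds) to use for expiry checks.

    Hypothetical helper: when the flag is set and the metadata claims the
    SSTable never expires, assume every write carried the default TTL.
    """
    if (ignore_max_deletion_time
            and max_deletion_time == DELETION_TIME_FOR_NEVER_EXPIRING_SSTABLE):
        # Assumption: write timestamps are in microseconds since the epoch.
        return max_timestamp_us // 1_000_000 + default_ttl
    return max_deletion_time
```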
return murmur3_token.get_token(first), murmur3_token.get_token(last)

def init_stats_metadata(self, filename, sstable_format):
You can use `sstable_tools.statistics` to parse the statistics file.
@denesb I wish I could make this file self-contained, but it's better to reuse sstable_tools.statistics of course ;-)
# WARNING: don't tweak the value below whatsoever as it's a CONSTANT.
deletion_time_for_never_expiring_sstable = 2147483647 # timestamp in seconds

class inclusive_range:
Better to use Interval from the intervaltree module.
will change in next version. thanks
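One caveat when making that switch: `intervaltree` treats intervals as half-open (begin included, end excluded), while a token range built from an SSTable's first and last keys is inclusive on both ends. A minimal stdlib-only sketch of the conversion follows; `HalfOpen` and `from_inclusive` are hypothetical names used for illustration and are not part of `intervaltree` or of this PR.

```python
from typing import NamedTuple


class HalfOpen(NamedTuple):
    """Interval [begin, end): begin is included, end is excluded."""
    begin: int
    end: int


def from_inclusive(first: int, last: int) -> HalfOpen:
    # An inclusive range [first, last] covers the same integers as [first, last + 1).
    if first > last:
        raise ValueError("first must not be greater than last")
    return HalfOpen(first, last + 1)


def overlaps(a: HalfOpen, b: HalfOpen) -> bool:
    # Half-open intervals overlap only if each starts strictly before the other ends.
    return a.begin < b.end and b.begin < a.end
```

Without the `+ 1`, two SSTables that share only a boundary token, such as [0, 10] and [10, 20], would be reported as non-overlapping.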