
dbwrapper: Bump LevelDB max file size to 128 MiB to avoid system slowdown from high disk cache flush rate #30039

Open · wants to merge 1 commit into master

Conversation

@maciejsszmigiero (Author)

The default max file size for LevelDB is 2 MiB, which results in the LevelDB compaction code generating ~4 disk cache flushes per second when syncing with the Bitcoin network.
These disk cache flushes are triggered by the fdatasync() syscall issued by the LevelDB compaction code when reaching the max file size.

If the database is on an HDD, this flush rate brings the whole system to a crawl.
It also results in very slow throughput, since 2 MiB * 4 flushes per second works out to about 8 MiB/s of maximum throughput, while even an old HDD can sustain 100 - 200 MiB/s of streaming throughput.

Increase the max file size for LevelDB to 128 MiB instead so the flush rate drops to about 1 flush / 2 seconds and the system no longer gets so sluggish.

The max file size value chosen also matches the MAX_BLOCKFILE_SIZE file size setting already used by the block storage.
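
For reference, a minimal sketch of where this change lands in dbwrapper's LevelDB option setup; the surrounding settings are abbreviated here, and only the max_file_size assignment is the actual change (shown in the diff further down this thread):

    #include <leveldb/options.h>

    // Sketch only: the real GetOptions() in src/dbwrapper.cpp also configures
    // the block cache, filter policy, compression and paranoid checks.
    static leveldb::Options GetOptions(size_t nCacheSize)
    {
        leveldb::Options options;
        (void)nCacheSize; // used by the elided cache setup in the real function

        // Raise LevelDB's 2 MiB default so compaction issues far fewer
        // fdatasync() calls: ~4 flushes/s * 2 MiB caps throughput near 8 MiB/s,
        // while ~1 flush per 2 s * 128 MiB allows roughly 64 MiB/s.
        options.max_file_size = 128 << 20; // 128 MiB

        return options;
    }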

@DrahtBot (Contributor) commented May 4, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

@sipa (Member) commented May 4, 2024

@jamesob Feel like benchmarking a reindex or so with this?

@laanwj (Member) commented May 4, 2024

Are there any drawbacks to this?

@willcl-ark (Member)

Might partially address #29662

@maciejsszmigiero (Author)

Are there any drawbacks to this?

I didn't notice any.

It's worth mentioning that the total amount of data stored in this database is at least two orders of magnitude higher than even 128 MiB file size.

@laanwj (Member) commented May 4, 2024

It's worth mentioning that the total amount of data stored in this database is at least two orders of magnitude higher than even 128 MiB file size.

Oh yes, I asked because I like this even from a "LevelDB creates fewer files" point of view; e.g., anecdotally, on one of my nodes the counter exceeds 6 digits (bitcoin-core/bitcoin-maintainer-tools#161). Of course, this includes deleted files; the active number is "only" about 6000.

@tdb3 (Contributor) commented May 5, 2024

Are there any drawbacks to this?

I didn't notice any.

It's worth mentioning that the total amount of data stored in this database is at least two orders of magnitude higher than even 128 MiB file size.

It would be great if there are few or no drawbacks. Do you mind sharing the methods used so far to test this? It would be helpful to have some data for comparison.

Other questions that come to mind (thinking out loud before I dig deeper or perform testing):

  • Does the change from 2MB to 128MB have any impact on consistent or transient RAM usage (i.e. for resource-constrained nodes)?
  • Is the file size (or an option to use the legacy smaller file size) something we would want to expose in bitcoin.conf (e.g. as a debug option)?

@andrewtoth (Contributor)

Might partially address #29662

That issue is complaining about long compaction times. From https://github.com/bitcoin/bitcoin/blob/master/src/leveldb/include/leveldb/options.h#L111-L112:

The downside will be longer compactions and hence longer latency/performance hiccups.

It seems this change would make compaction times longer, so it would exacerbate that issue?

@maciejsszmigiero (Author)

Do you mind sharing the methods used so far to test this?

I am simply watching the disk cache flush rate in iostat(1).
In addition to that, the difference in the system interactivity is also pretty apparent.

Does the change from 2MB to 128MB have any impact on consistent or transient RAM usage (i.e. for resource-constrained nodes)?

I did not observe any such effect; the RAM usage of the Bitcoin process seems to vary within roughly the same bounds when syncing with the Bitcoin network with or without this change.

Is the file size (or an option to use the legacy smaller file size) something we would want to expose in bitcoin.conf (e.g. as a debug option)?

Maybe, but I don't know whether it makes sense to expose an additional tuning option, considering, for example, the maintenance impact.

The downside will be longer compactions and hence longer latency/performance hiccups.

It seems this change would make compaction times longer, so it would exacerbate that issue?

For me, the biggest performance impact of compaction comes from the disk cache flushes this operation generates.
This patch significantly reduces that flush rate and so should make compaction less painful.

@@ -147,6 +147,7 @@ static leveldb::Options GetOptions(size_t nCacheSize)
         // on corruption in later versions.
         options.paranoid_checks = true;
     }
+    options.max_file_size = 128 << 20;

Contributor

Should we make this a constant? Would it be appropriate to reuse MAX_BLOCKFILE_SIZE?

Member

+1 on a constant, but I don't think it's appropriate to reuse MAX_BLOCKFILE_SIZE; better to define a new one.

Author

Added a relevant constant.
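
For illustration, a minimal sketch of what such a constant might look like; the name and exact placement are assumptions, not necessarily the PR's actual naming:

    // Assumed name for illustration; same 128 MiB value as MAX_BLOCKFILE_SIZE,
    // but deliberately a separate constant, per the review discussion above.
    static constexpr size_t MAX_LEVELDB_FILE_SIZE{128 << 20}; // 128 MiB

    // ...
    options.max_file_size = MAX_LEVELDB_FILE_SIZE;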

@andrewtoth (Contributor) commented May 6, 2024

Benchmarked IBD with an SSD to block 800k, dbcache=450, prune=0 with a local node serving the blocks. This branch is 27% (!) faster than master 🚀

 commit 7f15e71f7e762645dbd1ea5eba9ecc6f9ad60236 (branch)
  Time (mean ± σ):     14711.490 s ± 225.376 s    [User: 19465.517 s, System: 1147.712 s]
  Range (min … max):   14552.125 s … 14870.854 s    2 runs
  
 commit eb0bdbdd753bca97120247b921fd29d606fea6e9 (master)
  Time (mean ± σ):     20274.276 s ± 106.042 s    [User: 21762.310 s, System: 4546.936 s]
  Range (min … max):   20199.293 s … 20349.259 s    2 runs

This patch significantly reduces that flush rate and so should make compaction less painful.

From what I understand, this patch reduces the frequency of flushes, but they will take longer when they do occur. This is great for IBD, but for #29662 the issue is an unavoidable compaction at startup. The compaction could potentially take longer with this patch.

@laanwj (Member) commented May 6, 2024

This branch is 27% (!) faster than master

That's impressive!

From what I understand, this patch reduces the frequency of flushes

Not only the frequency of flushes; another potential advantage here is that leveldb will spend less time open()ing and close()ing files to maintain its allowed number of open files (e.g. the fd_limiter stuff).

@luke-jr (Member) commented May 7, 2024

If there are no drawbacks, why not go even larger?

@willcl-ark (Member)

I ran some benchmarks of IBD to block 800,000 vs master for comparison, and got some similar, if slightly less impressive, results with default dbcache.

With -dbcache=16384:

With -dbcache=450:

I only did a single run of each, though. Sync was performed from a second local node, with the datadir on a separate SSD.

@andrewtoth (Contributor)

FWIW re: #29662 I did not notice any difference in compaction time at startup on an SSD. It takes about 5 seconds to finish with debug=leveldb both on master and this branch.

@maciejsszmigiero (Author)

If there are no drawbacks, why not go even larger?

I used 128 MiB as the new size for commonality with MAX_BLOCKFILE_SIZE already used by the block storage and because it gives me a nice low disk cache flush rate of about 1 flush / 2 seconds that no longer impacts the overall system performance.

But just to be sure, I changed the patch to use std::max() around this max_file_size option, so that if at some point LevelDB increases its default above 128 MiB we won't be lowering it accidentally.
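
A minimal sketch of the std::max() guard described above, reusing the constant name assumed in the earlier sketch (still an assumption, not the PR's exact code):

    #include <algorithm> // std::max

    // Never lower LevelDB's own default: if a future LevelDB release ships a
    // default above 128 MiB, keep the larger of the two values.
    options.max_file_size = std::max(options.max_file_size, MAX_LEVELDB_FILE_SIZE);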

@mzumsande (Contributor) left a comment

I've played around with this branch a bit, upgrading and downgrading between it and master with existing datadirs on signet and didn't run into any issues.
Also just noting that this will affect all LevelDB databases, including the indexes and the block/index DB.

@sipa (Member) commented May 17, 2024

It appears that RocksDB (a more actively developed derivative of LevelDB) uses a default of 64 MiB (https://github.com/facebook/rocksdb/blob/main/include/rocksdb/advanced_options.h#L468). See also my comment in #30059 (comment).
