
dbwrapper: Bump LevelDB max file size to 128 MiB to avoid system slowdown from high disk cache flush rate #30039

Open · wants to merge 1 commit into master

Conversation

@maciejsszmigiero (Author)

The default max file size for LevelDB is 2 MiB, which results in the LevelDB compaction code generating ~4 disk cache flushes per second when syncing with the Bitcoin network.
These disk cache flushes are triggered by the fdatasync() syscall issued by the LevelDB compaction code when reaching the max file size.

If the database is on an HDD, this flush rate brings the whole system to a crawl.
It also results in very slow throughput, since 2 MiB * 4 flushes per second works out to about 8 MiB/s of maximum throughput, while even an old HDD can sustain 100 - 200 MiB/s of streaming throughput.

Increase the max file size for LevelDB to 128 MiB instead so the flush rate drops to about 1 flush / 2 seconds and the system no longer gets so sluggish.

The max file size value chosen also matches the MAX_BLOCKFILE_SIZE file size setting already used by the block storage.
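
For reference, a minimal sketch of where this change lands in dbwrapper's LevelDB option setup; the surrounding settings are abbreviated here, and only the max_file_size assignment is the actual change (shown in the diff further down this thread):

    #include <leveldb/options.h>

    // Sketch only: the real GetOptions() in src/dbwrapper.cpp also configures
    // the block cache, filter policy, compression and paranoid checks.
    static leveldb::Options GetOptions(size_t nCacheSize)
    {
        leveldb::Options options;
        (void)nCacheSize; // used by the elided cache setup in the real function

        // Raise LevelDB's 2 MiB default so compaction issues far fewer
        // fdatasync() calls: ~4 flushes/s * 2 MiB caps throughput near 8 MiB/s,
        // while ~1 flush per 2 s * 128 MiB allows roughly 64 MiB/s.
        options.max_file_size = 128 << 20; // 128 MiB

        return options;
    }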

@DrahtBot (Contributor) commented May 4, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

@sipa (Member) commented May 4, 2024

@jamesob Feel like benchmarking a reindex or so with this?

@laanwj (Member) commented May 4, 2024

Are there any drawbacks to this?

@willcl-ark (Member)

Might partially address #29662

@maciejsszmigiero (Author)

Are there any drawbacks to this?

I didn't notice any.

It's worth mentioning that the total amount of data stored in this database is at least two orders of magnitude higher than even 128 MiB file size.

@laanwj (Member) commented May 4, 2024

It's worth mentioning that the total amount of data stored in this database is at least two orders of magnitude higher than even 128 MiB file size.

Oh yes, I asked because I like this even from a "LevelDB creates fewer files" point of view; e.g., anecdotally, on one of my nodes the counter exceeds 6 digits (bitcoin-core/bitcoin-maintainer-tools#161). Of course, this includes deleted files; the active number is "only" about 6000.

@tdb3 (Contributor) commented May 5, 2024

Are there any drawbacks to this?

I didn't notice any.

It's worth mentioning that the total amount of data stored in this database is at least two orders of magnitude higher than even 128 MiB file size.

It would be great if there are few or no drawbacks. Do you mind sharing the methods used so far to test this? It would be helpful to have some data for comparison.

Other questions that come to mind (thinking out loud before I dig deeper or perform testing):

  • Does the change from 2MB to 128MB have any impact on consistent or transient RAM usage (i.e. for resource-constrained nodes)?
  • Is the file size (or an option to use the legacy smaller file size) something we would want to expose in bitcoin.conf (e.g. as a debug option)?

@andrewtoth (Contributor)

Might partially address #29662

That issue is complaining about long compaction times. From https://github.com/bitcoin/bitcoin/blob/master/src/leveldb/include/leveldb/options.h#L111-L112:

The downside will be longer compactions and hence longer latency/performance hiccups.

It seems this change would make compaction times longer, so it would exacerbate that issue?

@maciejsszmigiero (Author)

Do you mind sharing the methods used so far to test this?

I am simply watching the disk cache flush rate in iostat(1).
In addition to that, the difference in the system interactivity is also pretty apparent.

Does the change from 2MB to 128MB have any impact on consistent or transient RAM usage (i.e. for resource-constrained nodes)?

I did not observe any such effect; the RAM usage of the Bitcoin process seems to vary within roughly the same bounds when syncing with the Bitcoin network with or without this change.

Is the file size (or an option to use the legacy smaller file size) something we would want to expose in bitcoin.conf (e.g. as a debug option)?

Maybe, but I don't know whether it makes sense to expose an additional tuning option, considering, for example, the maintenance impact.

The downside will be longer compactions and hence longer latency/performance hiccups.

It seems this change would make compaction times longer, so it would exacerbate that issue?

For me, the biggest performance impact of compaction comes from the disk cache flushes this operation generates.
This patch significantly reduces that flush rate and so should make compaction less painful.

@@ -147,6 +147,7 @@ static leveldb::Options GetOptions(size_t nCacheSize)
         // on corruption in later versions.
         options.paranoid_checks = true;
     }
+    options.max_file_size = 128 << 20;

Contributor

Should we make this a constant? Would it be appropriate to reuse MAX_BLOCKFILE_SIZE?

Member

+1 on a constant, but I don't think it's appropriate to reuse MAX_BLOCKFILE_SIZE; better to define a new one.

Author

Added a relevant constant.
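
For illustration, a minimal sketch of what such a constant might look like; the name and exact placement are assumptions, not necessarily the PR's actual naming:

    // Assumed name for illustration; same 128 MiB value as MAX_BLOCKFILE_SIZE,
    // but deliberately a separate constant, per the review discussion above.
    static constexpr size_t MAX_LEVELDB_FILE_SIZE{128 << 20}; // 128 MiB

    // ...
    options.max_file_size = MAX_LEVELDB_FILE_SIZE;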

@andrewtoth (Contributor) commented May 6, 2024

Benchmarked IBD with an SSD to block 800k, dbcache=450, prune=0 with a local node serving the blocks. This branch is 27% (!) faster than master 🚀

 commit 7f15e71f7e762645dbd1ea5eba9ecc6f9ad60236 (branch)
  Time (mean ± σ):     14711.490 s ± 225.376 s    [User: 19465.517 s, System: 1147.712 s]
  Range (min … max):   14552.125 s … 14870.854 s    2 runs
  
 commit eb0bdbdd753bca97120247b921fd29d606fea6e9 (master)
  Time (mean ± σ):     20274.276 s ± 106.042 s    [User: 21762.310 s, System: 4546.936 s]
  Range (min … max):   20199.293 s … 20349.259 s    2 runs

This patch significantly reduces that flush rate and so should make compaction less painful.

From what I understand, this patch reduces the frequency of flushes, but they will take longer when they do occur. This is great for IBD, but for #29662 the issue is an unavoidable compaction at startup. The compaction could potentially take longer with this patch.

@laanwj (Member) commented May 6, 2024

This branch is 27% (!) faster than master

That's impressive!

From what I understand, this patch reduces the frequency of flushes

Not only the frequency of flushes; another potential advantage here is that leveldb will spend less time open()ing and close()ing files to maintain its allowed number of open files (e.g. the fd_limiter stuff).

@luke-jr (Member) commented May 7, 2024

If there are no drawbacks, why not go even larger?

@willcl-ark (Member)

I ran some benchmarks of IBD to block 800,000 vs master for comparison, and got some similar, if slightly less impressive, results with default dbcache.

With -dbcache=16384:

With -dbcache=450:

I only did a single run of each, though. Sync was performed from a second local node, with the datadir on a separate SSD.

@andrewtoth (Contributor)

FWIW re: #29662 I did not notice any difference in compaction time at startup on an SSD. It takes about 5 seconds to finish with debug=leveldb both on master and this branch.

@maciejsszmigiero (Author)

If there are no drawbacks, why not go even larger?

I used 128 MiB as the new size for commonality with MAX_BLOCKFILE_SIZE already used by the block storage and because it gives me a nice low disk cache flush rate of about 1 flush / 2 seconds that no longer impacts the overall system performance.

But just to be sure, I changed the patch to use std::max() around this max_file_size option, so that if at some point LevelDB increases its default above 128 MiB we won't be lowering it accidentally.
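
A minimal sketch of the std::max() guard described above, reusing the constant name assumed in the earlier sketch (still an assumption, not the PR's exact code):

    #include <algorithm> // std::max

    // Never lower LevelDB's own default: if a future LevelDB release ships a
    // default above 128 MiB, keep the larger of the two values.
    options.max_file_size = std::max(options.max_file_size, MAX_LEVELDB_FILE_SIZE);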

@mzumsande (Contributor) left a comment

I've played around with this branch a bit, upgrading and downgrading between it and master with existing datadirs on signet and didn't run into any issues.
Also just noting that this will affect all LevelDB databases, including the indexes and the block/index DB.

@sipa (Member) commented May 17, 2024

It appears that RocksDB (a more actively developed derivative of LevelDB) uses a default of 64 MiB (https://github.com/facebook/rocksdb/blob/main/include/rocksdb/advanced_options.h#L468). See also my comment in #30059 (comment).
