trie: reduce allocations in stacktrie #30743

holiman · 2024-11-11T13:01:18Z

This PR uses various tweaks and tricks to make the stacktrie near alloc-free.

[user@work go-ethereum]$ benchstat stacktrie.1 stacktrie.7
goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
             │ stacktrie.1  │             stacktrie.7              │
             │    sec/op    │    sec/op     vs base                │
Insert100K-8   106.97m ± 8%   88.21m ± 34%  -17.54% (p=0.000 n=10)

             │   stacktrie.1    │             stacktrie.7              │
             │       B/op       │     B/op      vs base                │
Insert100K-8   13199.608Ki ± 0%   3.424Ki ± 3%  -99.97% (p=0.000 n=10)

             │  stacktrie.1   │             stacktrie.7             │
             │   allocs/op    │ allocs/op   vs base                 │
Insert100K-8   553428.50 ± 0%   22.00 ± 5%  -100.00% (p=0.000 n=10)

Also improves derivesha:

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/core/types
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
                          │ derivesha.1 │             derivesha.2              │
                          │   sec/op    │    sec/op     vs base                │
DeriveSha200/stack_trie-8   477.8µ ± 2%   430.0µ ± 12%  -10.00% (p=0.000 n=10)

                          │ derivesha.1  │             derivesha.2              │
                          │     B/op     │     B/op      vs base                │
DeriveSha200/stack_trie-8   45.17Ki ± 0%   25.65Ki ± 0%  -43.21% (p=0.000 n=10)

                          │ derivesha.1 │            derivesha.2             │
                          │  allocs/op  │ allocs/op   vs base                │
DeriveSha200/stack_trie-8   1259.0 ± 0%   232.0 ± 0%  -81.57% (p=0.000 n=10)

holiman · 2024-11-11T13:22:46Z

Update

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
             │ stacktrie.2  │              stacktrie.3              │
             │    sec/op    │    sec/op     vs base                 │
Insert100K-8   78.51m ± ∞ ¹   69.50m ± 12%  -11.47% (p=0.010 n=5+7)
¹ need >= 6 samples for confidence interval at level 0.95

             │  stacktrie.2  │              stacktrie.3              │
             │     B/op      │     B/op      vs base                 │
Insert100K-8   6.931Mi ± ∞ ¹   4.640Mi ± 0%  -33.06% (p=0.003 n=5+7)
¹ need >= 6 samples for confidence interval at level 0.95

             │ stacktrie.2  │         stacktrie.3          │
             │  allocs/op   │  allocs/op   vs base         │
Insert100K-8   326.7k ± ∞ ¹   226.7k ± 0%  -30.61% (n=5+7)
¹ need >= 6 samples for confidence interval at level 0.95

holiman · 2024-11-11T13:40:39Z

New progress

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
             │ stacktrie.3  │          stacktrie.4          │
             │    sec/op    │    sec/op     vs base         │
Insert100K-8   69.50m ± 12%   74.59m ± 14%  ~ (p=0.128 n=7)

             │ stacktrie.3  │             stacktrie.4             │
             │     B/op     │     B/op      vs base               │
Insert100K-8   4.640Mi ± 0%   3.112Mi ± 0%  -32.93% (p=0.001 n=7)

             │ stacktrie.3 │            stacktrie.4             │
             │  allocs/op  │  allocs/op   vs base               │
Insert100K-8   226.7k ± 0%   126.7k ± 0%  -44.11% (p=0.001 n=7)

holiman · 2024-11-11T21:19:57Z

Argh only this little thing left: returning a slice to a pool somehow leaks something. I guess somehow I expose an underlying array, or at least make the compiler think so, hence it reallocs the slice when I return it to the pool.

namiloh · 2024-11-11T21:25:44Z

Ah wait 24B that is the size of a slice: a pointer and two ints. It is the slice passed by value, escaped to heap (correctly so). But how to avoid it??

holiman · 2024-11-11T21:56:02Z

Solved!

[user@work go-ethereum]$ benchstat stacktrie.1 stacktrie.7
goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
             │ stacktrie.1  │             stacktrie.7              │
             │    sec/op    │    sec/op     vs base                │
Insert100K-8   106.97m ± 8%   88.21m ± 34%  -17.54% (p=0.000 n=10)

             │   stacktrie.1    │             stacktrie.7              │
             │       B/op       │     B/op      vs base                │
Insert100K-8   13199.608Ki ± 0%   3.424Ki ± 3%  -99.97% (p=0.000 n=10)

             │  stacktrie.1   │             stacktrie.7             │
             │   allocs/op    │ allocs/op   vs base                 │
Insert100K-8   553428.50 ± 0%   22.00 ± 5%  -100.00% (p=0.000 n=10)

trie/stacktrie.go

trie/bytepool.go

trie/stacktrie.go

holiman · 2024-11-20T13:09:21Z

Did a sync on bench06 (leaving ancients intact). It synced up in ~2h15m, so absolutely no problem there!

rjl493456442 · 2024-11-21T06:44:00Z

@holiman I think we should compare the relevant metrics (memory allocation, etc) during the snap sync and to see the overall impact brought by this change.

rjl493456442 · 2024-11-21T06:47:23Z

trie/node_enc.go

@@ -40,6 +40,20 @@ func (n *fullNode) encode(w rlp.EncoderBuffer) {
 	w.ListEnd(offset)
 }

+func (n *fullnodeEncoder) encode(w rlp.EncoderBuffer) {


Can you estimate the performance and allocation differences between using this encoder and the standard full node encoder?

The primary distinction seems to be that this encoder inlines the children encoding directly, rather than invoking the children encoder recursively. I’m curious about the performance implications of this approach.

Sure, so if I change it back (adding back rawNode, and then switching back the case branchNode: so that it looks like it did earlier, then the difference is:

[user@work go-ethereum]$ benchstat pr.bench pr.bench..2 goos: linux goarch: amd64 pkg: github.com/ethereum/go-ethereum/trie cpu: 12th Gen Intel(R) Core(TM) i7-1270P │ pr.bench │ pr.bench..2 │ │ sec/op │ sec/op vs base │ Insert100K-8 88.67m ± 33% 88.20m ± 11% ~ (p=0.579 n=10) │ pr.bench │ pr.bench..2 │ │ B/op │ B/op vs base │ Insert100K-8 3.451Ki ± 3% 2977.631Ki ± 0% +86178.83% (p=0.000 n=10) │ pr.bench │ pr.bench..2 │ │ allocs/op │ allocs/op vs base │ Insert100K-8 22.00 ± 5% 126706.50 ± 0% +575838.64% (p=0.000 n=10)

When we build the children struct, all the values are copied, as opposed if we just use the encoder-type which just uses the same child without triggering a copy.

rjl493456442 · 2024-11-21T06:59:58Z

trie/stacktrie.go

-		n := shortNode{Key: hexToCompactInPlace(st.key), Val: valueNode(st.val)}
-
-		n.encode(t.h.encbuf)
+		{


We can also use shortNodeEncoder here?
I would recommend not to do the inline encoding directly.

We could, but here we know we have a valueNode. And the valueNode just does:

func (n valueNode) encode(w rlp.EncoderBuffer) { w.WriteBytes(n) }

If we change that to a method which checks the size and theoretically does different things, then the semantics become slightly changed.

rjl493456442 · 2024-11-21T07:03:00Z

trie/node.go

+	//shortNodeEncoder is a type used exclusively for encoding. Briefly instantiating
+	// a shortNodeEncoder and initializing with existing slices is less memory
+	// intense than using the shortNode type.
+	shortNodeEncoder struct {


qq, does shortNodeEncoder need to implement the node interface?

It's just an encoder which coverts the node object to byte format?

just have a try, it's technically possible

Good thinking there!

rjl493456442 · 2024-11-21T07:21:33Z

trie/stacktrie.go

@@ -27,6 +27,7 @@ import (

 var (
 	stPool = sync.Pool{New: func() any { return new(stNode) }}
+	bPool  = newBytesPool(32, 100)


any particular reason to not use sync.Pool?

Yes! You'd be surprised, but the whole problem I had with extra alloc :

I solved that by not using a sync.Pool. I suspect it's due to the interface-conversion, but I don't know any deeper details about the why of it.

rjl493456442 · 2024-11-21T07:27:57Z

trie/stacktrie.go

@@ -129,6 +143,12 @@ const (
 )

 func (n *stNode) reset() *stNode {
+	if n.typ == hashedNode {


What's the difference between HashedNode and Embedded tiny node?

For embedded node, we also allocate the buffer from the bPool and the buffer is owned by the node itself right?

Well, the thing is, that there's only one place in the stacktrie where we convert a node into a hashedNode.

st.typ = hashedNode st.key = st.key[:0] st.val = nil // Release reference to potentially externally held slice.

This is the only place. So we know that once something has been turned into a hashedNode, the val is never externally held and thus it can be returned to the pool. The big problem we had earlier, is that we need to overwrite the val with a hash, but we are not allowed to mutate val. So this was a big cause of runaway allocs.

But now we reclaim those values-which-are-hashes since we know that they're "ours".

For embedded node, we also allocate the buffer from the bPool and the buffer is owned by the node itself right?

To answer your question: yes. So from the perspective of slice-reuse, there is zero difference.

Btw, in my follow-up PR to reduce allocs in derivesha:#30747 , I remove this trick again.

In that PR, I always copy the value as it enters the stacktrie. So we always own the val slice, and are free to reuse via pooling.

Doing so is less hacky, we get rid of this special case "val is externally held unless it's an hashedNode because then we own it".

That makes it possible for the derivesha-method to reuse the input-buffer

MariusVanDerWijden · 2024-11-25T01:26:36Z

lint is read after the latest commit

goos: linux goarch: amd64 pkg: github.com/ethereum/go-ethereum/trie cpu: 12th Gen Intel(R) Core(TM) i7-1270P │ stacktrie.3 │ stacktrie.4 │ │ sec/op │ sec/op vs base │ Insert100K-8 69.50m ± 12% 74.59m ± 14% ~ (p=0.128 n=7) │ stacktrie.3 │ stacktrie.4 │ │ B/op │ B/op vs base │ Insert100K-8 4.640Mi ± 0% 3.112Mi ± 0% -32.93% (p=0.001 n=7) │ stacktrie.3 │ stacktrie.4 │ │ allocs/op │ allocs/op vs base │ Insert100K-8 226.7k ± 0% 126.7k ± 0% -44.11% (p=0.001 n=7)

holiman · 2024-11-25T13:13:50Z

Running a snapsync benchmarkers now, using partialwipe (ancients intact),

master on bench06
This PR on bench05

holiman · 2024-11-26T11:24:09Z

So, here are some charts. The first third of the charts is the snapsync, the latter two thirds are snap-gen + block by block sync. The yellow (this PR) finished slightly faster (13 minutes or so).

This is despite that the 05 (this PR) struggled against larger iowait. Perhaps it was bottlenecked by the disk?

The 05 had some block downloading to do in the beginnig, but after that the ingress rates were fairly equal:

Memory charts doesn't show much, except that maybe the lead that 05 had after sync was increased further during generation, to about 25 minutes

holiman force-pushed the stacktrie_allocs_1 branch from f332426 to 4a2a72b Compare November 11, 2024 13:57

holiman mentioned this pull request Nov 12, 2024

Stacktrie allocs 2 #30746

Closed

holiman changed the title ~~trie: [wip] reduce allocs in stacktrie~~ trie: reduce allocactions in stacktrie Nov 12, 2024

holiman mentioned this pull request Nov 12, 2024

trie: [wip] reduce allocations in derivesha #30747

Draft

holiman force-pushed the stacktrie_allocs_1 branch from aacb8e6 to af2feef Compare November 12, 2024 10:27

holiman changed the title ~~trie: reduce allocactions in stacktrie~~ trie: reduce allocations in stacktrie Nov 12, 2024

MariusVanDerWijden reviewed Nov 18, 2024

View reviewed changes

trie/stacktrie.go Outdated Show resolved Hide resolved

MariusVanDerWijden reviewed Nov 18, 2024

View reviewed changes

trie/bytepool.go Show resolved Hide resolved

holiman commented Nov 18, 2024

View reviewed changes

trie/stacktrie.go Outdated Show resolved Hide resolved

holiman commented Nov 18, 2024

View reviewed changes

trie/stacktrie.go Outdated Show resolved Hide resolved

holiman force-pushed the stacktrie_allocs_1 branch from 258987d to b34643f Compare November 20, 2024 07:45

rjl493456442 reviewed Nov 21, 2024

View reviewed changes

holiman added 6 commits November 25, 2024 13:49

trie: add stacktrie test

566534f

trie: new node-encoding types

c7f6aec

trie: make stacktrie use less alloc:y encoders

5b6a6e4

trie: stacktrie allocation reduction via key scratchspace

4eb7032

trie: stacktrie pool hashing slices

22f86a7

holiman added 3 commits November 25, 2024 13:50

trie: implement bytepool

5880b2b

trie: nits and polishes

649a329

trie: simplify full/short-nodeEncoder

bfc3aae

holiman force-pushed the stacktrie_allocs_1 branch from fc0705f to bfc3aae Compare November 25, 2024 12:51

holiman added the status:triage label Nov 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

trie: reduce allocations in stacktrie #30743

trie: reduce allocations in stacktrie #30743

holiman commented Nov 11, 2024 •

edited

Loading

holiman commented Nov 11, 2024 •

edited

Loading

holiman commented Nov 11, 2024 •

edited

Loading

holiman commented Nov 11, 2024

namiloh commented Nov 11, 2024

holiman commented Nov 11, 2024

holiman commented Nov 20, 2024

rjl493456442 commented Nov 21, 2024 •

edited

Loading

rjl493456442 Nov 21, 2024

holiman Nov 21, 2024

rjl493456442 Nov 21, 2024 •

edited

Loading

holiman Nov 21, 2024

rjl493456442 Nov 21, 2024

rjl493456442 Nov 21, 2024

holiman Nov 21, 2024

rjl493456442 Nov 21, 2024

holiman Nov 21, 2024

rjl493456442 Nov 21, 2024

holiman Nov 21, 2024

holiman Nov 21, 2024

holiman Nov 21, 2024

MariusVanDerWijden commented Nov 25, 2024

holiman commented Nov 25, 2024

holiman commented Nov 26, 2024

trie: reduce allocations in stacktrie #30743

Are you sure you want to change the base?

trie: reduce allocations in stacktrie #30743

Conversation

holiman commented Nov 11, 2024 • edited Loading

holiman commented Nov 11, 2024 • edited Loading

holiman commented Nov 11, 2024 • edited Loading

holiman commented Nov 11, 2024

namiloh commented Nov 11, 2024

holiman commented Nov 11, 2024

holiman commented Nov 20, 2024

rjl493456442 commented Nov 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rjl493456442 Nov 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MariusVanDerWijden commented Nov 25, 2024

holiman commented Nov 25, 2024

holiman commented Nov 26, 2024

holiman commented Nov 11, 2024 •

edited

Loading

holiman commented Nov 11, 2024 •

edited

Loading

holiman commented Nov 11, 2024 •

edited

Loading

rjl493456442 commented Nov 21, 2024 •

edited

Loading

rjl493456442 Nov 21, 2024 •

edited

Loading