Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trie: reduce allocations in stacktrie #30743

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

holiman
Copy link
Contributor

@holiman holiman commented Nov 11, 2024

This PR uses various tweaks and tricks to make the stacktrie near alloc-free.

[user@work go-ethereum]$ benchstat stacktrie.1 stacktrie.7
goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
             │ stacktrie.1  │             stacktrie.7              │
             │    sec/op    │    sec/op     vs base                │
Insert100K-8   106.97m ± 8%   88.21m ± 34%  -17.54% (p=0.000 n=10)

             │   stacktrie.1    │             stacktrie.7              │
             │       B/op       │     B/op      vs base                │
Insert100K-8   13199.608Ki ± 0%   3.424Ki ± 3%  -99.97% (p=0.000 n=10)

             │  stacktrie.1   │             stacktrie.7             │
             │   allocs/op    │ allocs/op   vs base                 │
Insert100K-8   553428.50 ± 0%   22.00 ± 5%  -100.00% (p=0.000 n=10)

Also improves derivesha:

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/core/types
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
                          │ derivesha.1 │             derivesha.2              │
                          │   sec/op    │    sec/op     vs base                │
DeriveSha200/stack_trie-8   477.8µ ± 2%   430.0µ ± 12%  -10.00% (p=0.000 n=10)

                          │ derivesha.1  │             derivesha.2              │
                          │     B/op     │     B/op      vs base                │
DeriveSha200/stack_trie-8   45.17Ki ± 0%   25.65Ki ± 0%  -43.21% (p=0.000 n=10)

                          │ derivesha.1 │            derivesha.2             │
                          │  allocs/op  │ allocs/op   vs base                │
DeriveSha200/stack_trie-8   1259.0 ± 0%   232.0 ± 0%  -81.57% (p=0.000 n=10)

@holiman
Copy link
Contributor Author

holiman commented Nov 11, 2024

Update

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
             │ stacktrie.2  │              stacktrie.3              │
             │    sec/op    │    sec/op     vs base                 │
Insert100K-8   78.51m ± ∞ ¹   69.50m ± 12%  -11.47% (p=0.010 n=5+7)
¹ need >= 6 samples for confidence interval at level 0.95

             │  stacktrie.2  │              stacktrie.3              │
             │     B/op      │     B/op      vs base                 │
Insert100K-8   6.931Mi ± ∞ ¹   4.640Mi ± 0%  -33.06% (p=0.003 n=5+7)
¹ need >= 6 samples for confidence interval at level 0.95

             │ stacktrie.2  │         stacktrie.3          │
             │  allocs/op   │  allocs/op   vs base         │
Insert100K-8   326.7k ± ∞ ¹   226.7k ± 0%  -30.61% (n=5+7)
¹ need >= 6 samples for confidence interval at level 0.95

@holiman
Copy link
Contributor Author

holiman commented Nov 11, 2024

New progress

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
             │ stacktrie.3  │          stacktrie.4          │
             │    sec/op    │    sec/op     vs base         │
Insert100K-8   69.50m ± 12%   74.59m ± 14%  ~ (p=0.128 n=7)

             │ stacktrie.3  │             stacktrie.4             │
             │     B/op     │     B/op      vs base               │
Insert100K-8   4.640Mi ± 0%   3.112Mi ± 0%  -32.93% (p=0.001 n=7)

             │ stacktrie.3 │            stacktrie.4             │
             │  allocs/op  │  allocs/op   vs base               │
Insert100K-8   226.7k ± 0%   126.7k ± 0%  -44.11% (p=0.001 n=7)

@holiman
Copy link
Contributor Author

holiman commented Nov 11, 2024

Argh only this little thing left: returning a slice to a pool somehow leaks something. I guess somehow I expose an underlying array, or at least make the compiler think so, hence it reallocs the slice when I return it to the pool.

Screenshot 2024-11-11 at 22-17-15 trie test alloc_space
Screenshot 2024-11-11 at 22-18-40 trie test alloc_space

@namiloh
Copy link

namiloh commented Nov 11, 2024

Ah wait 24B that is the size of a slice: a pointer and two ints. It is the slice passed by value, escaped to heap (correctly so). But how to avoid it??

@holiman
Copy link
Contributor Author

holiman commented Nov 11, 2024

Solved!

[user@work go-ethereum]$ benchstat stacktrie.1 stacktrie.7
goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
             │ stacktrie.1  │             stacktrie.7              │
             │    sec/op    │    sec/op     vs base                │
Insert100K-8   106.97m ± 8%   88.21m ± 34%  -17.54% (p=0.000 n=10)

             │   stacktrie.1    │             stacktrie.7              │
             │       B/op       │     B/op      vs base                │
Insert100K-8   13199.608Ki ± 0%   3.424Ki ± 3%  -99.97% (p=0.000 n=10)

             │  stacktrie.1   │             stacktrie.7             │
             │   allocs/op    │ allocs/op   vs base                 │
Insert100K-8   553428.50 ± 0%   22.00 ± 5%  -100.00% (p=0.000 n=10)

@holiman holiman mentioned this pull request Nov 12, 2024
@holiman holiman changed the title trie: [wip] reduce allocs in stacktrie trie: reduce allocactions in stacktrie Nov 12, 2024
@holiman holiman changed the title trie: reduce allocactions in stacktrie trie: reduce allocations in stacktrie Nov 12, 2024
trie/stacktrie.go Outdated Show resolved Hide resolved
trie/stacktrie.go Outdated Show resolved Hide resolved
@holiman
Copy link
Contributor Author

holiman commented Nov 20, 2024

Did a sync on bench06 (leaving ancients intact). It synced up in ~2h15m, so absolutely no problem there!

@rjl493456442
Copy link
Member

rjl493456442 commented Nov 21, 2024

@holiman I think we should compare the relevant metrics (memory allocation, etc) during the snap sync and to see the overall impact brought by this change.

@@ -40,6 +40,20 @@ func (n *fullNode) encode(w rlp.EncoderBuffer) {
w.ListEnd(offset)
}

func (n *fullnodeEncoder) encode(w rlp.EncoderBuffer) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you estimate the performance and allocation differences between using this encoder and the standard full node encoder?

The primary distinction seems to be that this encoder inlines the children encoding directly, rather than invoking the children encoder recursively. I’m curious about the performance implications of this approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, so if I change it back (adding back rawNode, and then switching back the case branchNode: so that it looks like it did earlier, then the difference is:

[user@work go-ethereum]$ benchstat pr.bench pr.bench..2
goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
             │   pr.bench   │          pr.bench..2           │
             │    sec/op    │    sec/op     vs base          │
Insert100K-8   88.67m ± 33%   88.20m ± 11%  ~ (p=0.579 n=10)

             │   pr.bench   │                pr.bench..2                 │
             │     B/op     │      B/op        vs base                   │
Insert100K-8   3.451Ki ± 3%   2977.631Ki ± 0%  +86178.83% (p=0.000 n=10)

             │  pr.bench  │                pr.bench..2                 │
             │ allocs/op  │   allocs/op     vs base                    │
Insert100K-8   22.00 ± 5%   126706.50 ± 0%  +575838.64% (p=0.000 n=10)

When we build the children struct, all the values are copied, as opposed if we just use the encoder-type which just uses the same child without triggering a copy.

n := shortNode{Key: hexToCompactInPlace(st.key), Val: valueNode(st.val)}

n.encode(t.h.encbuf)
{
Copy link
Member

@rjl493456442 rjl493456442 Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also use shortNodeEncoder here?
I would recommend not to do the inline encoding directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but here we know we have a valueNode. And the valueNode just does:

func (n valueNode) encode(w rlp.EncoderBuffer) {
	w.WriteBytes(n)
}

If we change that to a method which checks the size and theoretically does different things, then the semantics become slightly changed.

//shortNodeEncoder is a type used exclusively for encoding. Briefly instantiating
// a shortNodeEncoder and initializing with existing slices is less memory
// intense than using the shortNode type.
shortNodeEncoder struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq, does shortNodeEncoder need to implement the node interface?

It's just an encoder which coverts the node object to byte format?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just have a try, it's technically possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thinking there!

Good thinking

@@ -27,6 +27,7 @@ import (

var (
stPool = sync.Pool{New: func() any { return new(stNode) }}
bPool = newBytesPool(32, 100)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason to not use sync.Pool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! You'd be surprised, but the whole problem I had with extra alloc :

Screenshot 2024-11-11 at 22-18-40 trie test alloc_space

I solved that by not using a sync.Pool. I suspect it's due to the interface-conversion, but I don't know any deeper details about the why of it.

@@ -129,6 +143,12 @@ const (
)

func (n *stNode) reset() *stNode {
if n.typ == hashedNode {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between HashedNode and Embedded tiny node?

For embedded node, we also allocate the buffer from the bPool and the buffer is owned by the node itself right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the thing is, that there's only one place in the stacktrie where we convert a node into a hashedNode.

	st.typ = hashedNode
	st.key = st.key[:0]

	st.val = nil // Release reference to potentially externally held slice.

This is the only place. So we know that once something has been turned into a hashedNode, the val is never externally held and thus it can be returned to the pool. The big problem we had earlier, is that we need to overwrite the val with a hash, but we are not allowed to mutate val. So this was a big cause of runaway allocs.

But now we reclaim those values-which-are-hashes since we know that they're "ours".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For embedded node, we also allocate the buffer from the bPool and the buffer is owned by the node itself right?

To answer your question: yes. So from the perspective of slice-reuse, there is zero difference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, in my follow-up PR to reduce allocs in derivesha:#30747 , I remove this trick again.

In that PR, I always copy the value as it enters the stacktrie. So we always own the val slice, and are free to reuse via pooling.

Doing so is less hacky, we get rid of this special case "val is externally held unless it's an hashedNode because then we own it".


That makes it possible for the derivesha-method to reuse the input-buffer

@MariusVanDerWijden
Copy link
Member

lint is read after the latest commit

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
             │ stacktrie.3  │          stacktrie.4          │
             │    sec/op    │    sec/op     vs base         │
Insert100K-8   69.50m ± 12%   74.59m ± 14%  ~ (p=0.128 n=7)

             │ stacktrie.3  │             stacktrie.4             │
             │     B/op     │     B/op      vs base               │
Insert100K-8   4.640Mi ± 0%   3.112Mi ± 0%  -32.93% (p=0.001 n=7)

             │ stacktrie.3 │            stacktrie.4             │
             │  allocs/op  │  allocs/op   vs base               │
Insert100K-8   226.7k ± 0%   126.7k ± 0%  -44.11% (p=0.001 n=7)
@holiman
Copy link
Contributor Author

holiman commented Nov 25, 2024

Running a snapsync benchmarkers now, using partialwipe (ancients intact),

  • master on bench06
  • This PR on bench05

@holiman
Copy link
Contributor Author

holiman commented Nov 26, 2024

So, here are some charts. The first third of the charts is the snapsync, the latter two thirds are snap-gen + block by block sync. The yellow (this PR) finished slightly faster (13 minutes or so).

Screenshot 2024-11-26 at 12-18-25 Dual Geth - Grafana

This is despite that the 05 (this PR) struggled against larger iowait. Perhaps it was bottlenecked by the disk?

Screenshot 2024-11-26 at 12-19-55 Dual Geth - Grafana

The 05 had some block downloading to do in the beginnig, but after that the ingress rates were fairly equal:

Screenshot 2024-11-26 at 12-21-30 Dual Geth - Grafana

Memory charts doesn't show much, except that maybe the lead that 05 had after sync was increased further during generation, to about 25 minutes

Screenshot 2024-11-26 at 12-22-45 Dual Geth - Grafana

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants