
[WIP] Implement arena allocation for VersionVec #12

Open
wants to merge 8 commits into master
Conversation

ibabushkin

I'm submitting this PR alongside my GSoC proposal (see
tokio-rs/gsoc#2), providing two core changes:

  • Building the fringe scheduler on nightly is possible again.
  • VersionVec objects can now be allocated in an arena cleared after each iteration of
    the model checker. The implementation is based on the stub already present.

However, some caveats remain: it is debatable whether integrating the arena into the
Execution struct is the right way forward, since passing a mutable reference to the
arena around to every module that needs to allocate in it is not exactly ergonomic.
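
To illustrate the ergonomics concern, here is a minimal sketch (the `Arena` type and
`alloc_slice` method below are placeholders for this example, not the actual API in this
branch): every call site that wants arena storage has to receive, and forward, a mutable
reference to the arena.

// Minimal sketch for illustration only; not the actual API in this branch.
struct Arena {
    storage: Vec<usize>,
}

impl Arena {
    fn alloc_slice(&mut self, len: usize) -> &mut [usize] {
        let start = self.storage.len();
        self.storage.resize(start + len, 0);
        &mut self.storage[start..]
    }
}

// Every constructor that wants arena storage needs `&mut Arena`, and so does
// everything above it in the call chain.
fn new_version_vec(arena: &mut Arena, threads: usize) -> &mut [usize] {
    arena.alloc_slice(threads)
}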

So far, performance on the workloads I've tested has been worse than expected, and plenty
of other questions need to be discussed:

  • Should other data structures be allocated in the arena?
  • Should the allocation interface be changed?
  • Does a drop-in solution for pointer bump arena allocation provide a performance benefit
    or some other advantage?
  • Why did this implementation of arena allocation for some objects not provide the
    expected speedup?

Additionally, I was able to discover a resource leak in one of the tokio-sync tests
using this patch set: leaking objects mocked by loom can prevent the arena from being
cleared, because Slices allocated on behalf of the tested program are still live. So if
the tested program contains such a leak, the test crashes when loom tries to clear the
arena after the leaky iteration.

Here's the relevant function from the
test:

// setup omitted

#[path = "../src/mpsc/list.rs"]
mod list;

#[test]
fn smoke() {
    loom::fuzz(|| {
        let (tx, mut rx) = list::channel();

        // lots of code omitted here...

        // This is necessary to avoid the leak, as no `Drop` impl for `list::Rx` is
        // provided, so the allocated blocks are never returned, causing a leak of objects
        // allocated by loom. On the other hand, providing a `Drop` impl breaks other
        // code.
        unsafe { rx.free_blocks(); }
    });
}

I'm unsure whether this is worth opening a tokio issue for, as the bug is largely
irrelevant in practice and only amounts to leaking a tiny amount of memory in a test.

Unsure how this worked before, but as far as I can tell the macro used to declare the
scheduler's state never existed in scoped-tls. The fix involves doing
disgusting things with RefCell (so it is rather hacky).
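
For context, this is roughly the kind of workaround referred to here (a sketch of the
general pattern, not the actual patch; the names are invented for the example):
scoped-tls only hands out shared references, so mutable scheduler state ends up behind a
`RefCell`.

// Sketch of the general pattern only; `SchedulerState`, `STATE`, and the
// functions below are invented for this example. `scoped-tls` exposes data
// as `&T`, so mutation has to go through a RefCell.
use std::cell::RefCell;
use scoped_tls::scoped_thread_local;

struct SchedulerState {
    active: usize,
}

scoped_thread_local!(static STATE: RefCell<SchedulerState>);

fn run_with_state<R>(state: SchedulerState, f: impl FnOnce() -> R) -> R {
    let cell = RefCell::new(state);
    STATE.set(&cell, f)
}

fn switch_to(next: usize) {
    STATE.with(|cell| {
        // Every access pays a runtime borrow check, which is the hacky part.
        cell.borrow_mut().active = next;
    });
}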
To allow for cloning objects allocated in the arena, some rearrangements
had to be made, namely moving allocation functions to a place where they
can be used with a shared reference to the arena.

This sounds dangerous, as no synchronization is done. However, it should
be safe: with every scheduler, no true concurrency is present in the
loom runtime, and the only scheduler using multiple threads (`std`)
places the global execution state behind a mutex.

Still, this is somewhat of an opinionated approach to integrating the
arena allocator into loom.
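
As an illustration of what allocation through a shared reference can look like (a sketch
of the general idea, not the actual loom code; the `Arena` type below is invented for the
example), the bump pointer can live in a `Cell` so that `alloc` only needs `&self`:

// Illustrative only -- not the actual loom implementation. With the bump
// pointer in a `Cell`, allocation needs only `&self`, so arena handles can be
// shared, and cloned objects can allocate their own storage.
use std::cell::Cell;

pub struct Arena {
    storage: Box<[u8]>,
    head: Cell<usize>,
}

impl Arena {
    pub fn with_capacity(bytes: usize) -> Arena {
        Arena {
            storage: vec![0u8; bytes].into_boxed_slice(),
            head: Cell::new(0),
        }
    }

    /// Reserve `bytes` bytes, returning the offset of the reservation,
    /// or `None` if the arena is exhausted.
    pub fn alloc(&self, bytes: usize) -> Option<usize> {
        let start = self.head.get();
        let end = start.checked_add(bytes)?;
        if end > self.storage.len() {
            return None;
        }
        self.head.set(end);
        Some(start)
    }

    /// Reset the arena at the end of an iteration.
    pub fn clear(&self) {
        self.head.set(0);
    }
}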
As a first step, allocating clock vectors in the arena seems reasonable
and lets us gauge how much complexity such a change introduces.

It turns out that, at the moment, handling the arena requires passing it
around to any code that should be able to allocate in it. However, a
globally accessible arena could prove more elegant. Also, not all
`VersionVec` instances are deallocated at the end of each iteration, so
those are handled by still allowing `VersionVec`s to be heap-allocated
as before.
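
One hypothetical way to express that split (the names and layout here are invented for
the example; the real `VersionVec` does not necessarily look like this) is to let each
instance record whether its storage lives in the arena or on the heap:

// Hypothetical sketch of the dual-backing idea described above.
enum Versions {
    /// Lives in the arena and is reclaimed wholesale when the arena is cleared.
    Arena { offset: usize, len: usize },
    /// Outlives an iteration, so it stays on the regular heap.
    Heap(Vec<usize>),
}

struct VersionVec {
    versions: Versions,
}

impl VersionVec {
    fn len(&self) -> usize {
        match &self.versions {
            Versions::Arena { len, .. } => *len,
            Versions::Heap(v) => v.len(),
        }
    }
}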
@carllerche
Member

Thanks for submitting the PR.

Could you post before / after benchmark results?

Also, what version of tokio-sync are you using? I know there was a leak bug fixed recently. Being able to detect leaks is an added bonus.

@ibabushkin
Author

ibabushkin commented Apr 8, 2019

Argh, sorry -- I forgot to reference all the relevant information as it's already included in the proposal. Luckily, the relevant bits are available here.

To summarize: I'm using commit 824b7b675990baab87fa3df347fa4a5491f6809b from the tokio repo, and below is my benchmark (the fuzz_semaphore test from tokio-sync). Both versions use the fringe scheduler; here is a branch that fixes the build for that.

LOOM_MAX_DURATION=10 hyperfine \
  --export-markdown bench.md ./fuzz_semaphore_arena ./fuzz_semaphore_system
Command                  Mean [s]         Min … Max [s]
fuzz_semaphore_arena     47.372 ± 0.517   46.685 … 48.222
fuzz_semaphore_system    42.656 ± 2.242   40.781 … 48.764

I realize I also forgot to test the serialization feature and accidentally committed some nightly-only stuff; I'll fix that shortly.

@carllerche
Member

It looks like the measurement is the total time of execution, but you are also capping it with LOOM_MAX_DURATION (though, it is set to 10 seconds...).

The fact that an arena is slower is unexpected. I would dig in and see where the time is spent with a profiler.

@ibabushkin
Author

Yeah, that is odd. However, since multiple different tests are run in the binary, the rough ballpark of the runtimes makes sense. I don't yet have an explanation for the slowdown, but I'm working on that.

For now, we only deserialize to heap-allocated `VersionVec`s.
Also, we now have an iterator for arena slices.
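
For illustration, a possible shape for such an iterator (a sketch only; the actual
`Slice` type in this branch may differ): a slice handle that stores an offset and length
into the arena's backing storage can iterate by indexing into that storage.

// Illustrative sketch; the real types likely differ.
struct Slice<'a> {
    arena: &'a [usize],
    offset: usize,
    len: usize,
}

struct SliceIter<'a> {
    slice: &'a Slice<'a>,
    pos: usize,
}

impl<'a> Iterator for SliceIter<'a> {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        if self.pos == self.slice.len {
            return None;
        }
        let item = self.slice.arena[self.slice.offset + self.pos];
        self.pos += 1;
        Some(item)
    }
}

impl<'a> Slice<'a> {
    fn iter(&self) -> SliceIter<'_> {
        SliceIter { slice: self, pos: 0 }
    }
}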
@ibabushkin
Author

Aaand I forgot Windows has no mmap!

@ibabushkin
Author

Excuse the force-push -- but Windows builds should work now!

* Less inefficient cloning of `Slice`s
* Much more efficient dropping of `Slice`s
@ibabushkin
Author

Alright, most of the performance regression is now gone after optimizing the Drop and Clone impls on Slices. There is some overhead left, and at the moment I am unsure we can eliminate it -- the remaining slowdown is spread across various places and is thus hard to pin down. However, it might be possible to fix this by moving more objects into the arena: that way, the overhead from operations that are slower on arena-allocated objects could be offset elsewhere (there are already local speedups to be seen).

@ibabushkin
Author

Just a quick update: I'm currently experimenting with ways to provide arena allocation for other uses of Vec in the code which perform resizing operations. I'll post results/code when I have a clear picture of the performance properties of the possible implementations.

@carllerche
Member

Quick update from me. I’ve been on vacation. I’ll be digging in next week. Expect to hear more then.

@ibabushkin
Author

I'm exploring the possibility of storing the arena in a scoped thread-local variable to buy us some flexibility regarding where exactly the arena can be used. I'll benchmark a simple arena-backed vector implementation based on that.
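
A minimal sketch of that idea (assuming an arena whose `alloc` takes `&self`, as in the
earlier sketch; the names here are illustrative, not the final API): installing the arena
in a scoped thread local means allocating code no longer needs the arena passed in
explicitly.

// Illustrative sketch only. `Arena` here is a stand-in for the real type.
use scoped_tls::scoped_thread_local;

pub struct Arena;

impl Arena {
    pub fn alloc(&self, _bytes: usize) {
        // bump-allocate out of the arena's storage here
    }
}

scoped_thread_local!(static ARENA: Arena);

/// Run one model-checker iteration with `arena` installed for its duration.
pub fn with_arena<R>(arena: &Arena, f: impl FnOnce() -> R) -> R {
    ARENA.set(arena, f)
}

/// Allocate from whatever arena is currently installed.
pub fn alloc(bytes: usize) {
    ARENA.with(|arena| arena.alloc(bytes))
}

An arena-backed vector could then call `alloc` from its growth path without its owner
ever having to see the arena.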

@ibabushkin
Author

Another update: Pushing more and more data structures into the arena improves performance. As of now, I am still debugging the implementation, though. Also, I haven't found a nicer way to scale arena usage to more data structures than storing the arena "control block" in a thread-local variable. This has the obvious drawback that the std scheduler is no longer usable.

@carllerche
Member

@ibabushkin Putting it in a thread-local is fine. If we don't have std, we can always use alternate strategies for accessing the arena 👍.
