Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ActorGraphInterpreter Box Caching #7116

Draft
wants to merge 5 commits into
base: dev
Choose a base branch
from

Conversation

to11mtm
Copy link
Member

@to11mtm to11mtm commented Mar 11, 2024

Fixes #

Changes

Adds Box-Pools for various struct messages used by ActorGraphInterpreter

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

Latest dev Benchmarks

TBD after feedback

This PR's Benchmarks

TBD after feedback

@to11mtm
Copy link
Member Author

to11mtm commented Mar 11, 2024

Waiting for feedback on approach before I provide benchmarks (Also have to make sure we have some good ones)

But local testing on ValueTask branch indicates a 5-15% reduction in memory allocation on even existing async calls, with equal or improved performance throughout.

@Aaronontheweb Aaronontheweb self-assigned this Mar 11, 2024
@Aaronontheweb
Copy link
Member

@to11mtm I'll get to this this week with an initial review, but looks promising

Copy link
Member

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you could help explain the approach for why we're doing it this way, I'd have an easier time understanding this code some. Do you mind?

@@ -156,8 +245,8 @@ public GraphInterpreterShell(GraphAssembly assembly, Connection[] connections, G
_settings = settings;
Materializer = materializer;

_inputs = new ActorGraphInterpreter.BatchingActorInputBoundary[shape.Inlets.Count()];
_outputs = new ActorGraphInterpreter.IActorOutputBoundary[shape.Outlets.Count()];
_inputs = new ActorGraphInterpreter.BatchingActorInputBoundary[shape.Inlets.Length];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


public GraphInterpreterShell Shell => Boxed.Shell;
}
internal sealed class BoxedOnNext : ActorGraphInterpreter.IBoundaryEvent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@to11mtm can you describe the approach / "why do it this way?" for me just so I understand the thinking behind this.

@to11mtm
Copy link
Member Author

to11mtm commented Mar 12, 2024

If you could help explain the approach for why we're doing it this way, I'd have an easier time understanding this code some. Do you mind?

Of course and I apologize for terseness in this POC, the lift vs benefit got me a bit excited. :)

Right now, ActorGraphInterpreter is using a bunch of structs in it's internal processing flow. This causes boxing on .Tell() and some vagaries of struct unboxing in large switch type checks (I'll try to find some data, it's in my notes somewhere)

What this PR does, is change some things up to minimize allocations.

  • ActorGraphInterpreter.Resume is made into a class, as when the instance is created, we are setting it as a readonly anyway, so this gets rid of some boxing where it is involved (i.e. any async ops, need to check if others)
  • ActorGraphInterpreter.AsyncInput, ActorGraphInterpreter.OnNext, and ActorGraphInterpreter.RequestMore get their own special boxes, and we pool them, via thin wrapper around ConcurrentQueue<T>

The main purpose of the structure of said boxes, was to minimize overall code changes; since IBoundaryEvent is required to get into that function, It seemed like a safe 'hack' that was overall cleaner.

Where the 'maybes' come in:

  1. Is there a better pool to use than ConcurrentQueue<T> in this fashion.
    1. Will note, the choice of that pool was with intent of avoiding any locks and minimizing interlocked operations, maybe there's a better way to do that with a little more allowance for waste?
  2. Do we want to expose settings for pool sizes?
    1. Probably.
  3. Should I look into what perf hits do or do not exist if we genericize the boxing to minimize code bloat

@Aaronontheweb
Copy link
Member

@to11mtm got it - benchmark it and let's see if it makes any difference. I'd probably prefer to just move everything to classes than having clever object pooling stuff that can bite us in the butt. Alternatively, maybe we can use generics or something else to stick with structs and eliminate boxing OR allocations OR pooling altogether.

@to11mtm
Copy link
Member Author

to11mtm commented Mar 17, 2024

@to11mtm got it - benchmark it and let's see if it makes any difference.

Generic Boxes looks... IDK They do seem to somehow add some extra jitter to benchmarks 🤷‍♂️ but the reduced code bloat is worth it, unless the benchmarks

Actual benchmark numbers are in next reply (trying to get better about bench spam), I had to re-run everything with a more 'fair' pool abstraction (original one was too ugly even for my taste, new one I'm re-benching before commit is fair, but still probably can be improved.)

That said in general stuff looks... curious. I have bad jitter luck on my box, which complicates comparing branches at times.

I think there's something small I'm missing, as I see 'equal or slightly better' perf, or 'maybe slightly worse or test jitter' perf... memory allocation is definitely improved though

I'd probably prefer to just move everything to classes than having clever object pooling stuff that can bite us in the butt.

Well, the pooling still helps; If we used classes but didn't pool, we may save on unbox vagaries but still will be allocating.
I just am not good at writing a pool that is safe and fast enough for everyone to be happy with. 😅.

And, Ironically, I have to say that the 'box pool' made this easy to do safely. We pull the struct out of the 'box', pass it to the next thing, the box use is all 'internal' to ActorGraphInterpreter.cs, blast radius stays minimized. When we try to pool a class explicitly, yes that can be faster but it does also add a lot of complication to doing it and being able to sleep at night.

I should also add, after some looking, Box-pools -may- be a solution for some other challenges we have around GetAsyncCallback<T>, namely, avoiding box on the value type result case... Seeing a lot of potential there...

@to11mtm
Copy link
Member Author

to11mtm commented Mar 17, 2024

After:

Method Mean Error StdDev Gen0 Allocated
RunSelectAsync 31.62 us 0.334 us 0.313 us 3.8452 17.83 KB
RunSelectAsyncSync 29.08 us 0.322 us 0.301 us 3.8757 17.83 KB
Method Mean Error StdDev Gen0 Allocated
UnfoldAsyncYieldInConsume 173.7 us 1.46 us 1.37 us 8.7891 40.49 KB
UnfoldAsyncYieldInPush 154.5 us 2.89 us 2.97 us 7.5684 34.25 KB
Method Mean Error StdDev Gen0 Allocated
UnfoldResourceAsyncNoYield 252.06 us 4.953 us 6.264 us 14.1602 67320 B
UnfoldResourceAsyncWithYield 223.95 us 3.978 us 3.721 us 12.9395 60929 B
StraightChannelReadNoYield 67.54 us 1.333 us 1.182 us - 256 B
StraightChannelReadWithYield 80.44 us 0.727 us 0.645 us - 269 B

before

Method Mean Error StdDev Gen0 Allocated
RunSelectAsync 31.83 us 0.374 us 0.350 us 3.9063 17.94 KB
RunSelectAsyncSync 29.25 us 0.360 us 0.337 us 3.9063 17.94 KB
Method Mean Error StdDev Gen0 Allocated
UnfoldAsyncYieldInConsume 169.8 us 1.64 us 1.37 us 9.7656 45.23 KB
UnfoldAsyncYieldInPush 154.1 us 1.17 us 1.04 us 7.8125 38.99 KB
Method Mean Error StdDev Gen0 Allocated
UnfoldResourceAsyncNoYield 242.33 us 2.044 us 1.812 us 16.3574 76713 B
UnfoldResourceAsyncWithYield 211.84 us 1.311 us 1.162 us 15.1367 70322 B
StraightChannelReadNoYield 65.31 us 0.293 us 0.274 us - 256 B
StraightChannelReadWithYield 82.70 us 0.678 us 0.634 us - 271 B

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants