Added a timeout value for the take_group_member #72

Open · wants to merge 1 commit into base: master

Conversation

cjimison

If the pools in a "group" have room to grow but no available members, take_group_member will still return that there are no members available. This fix allows the user to pass in a timeout; after it checks whether any pool is not yet at capacity, it will then try to grow one.
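A hypothetical usage sketch of the new call (illustrative only: with_group_member/2 and the group name are made up; the 1000 ms timeout uses the integer-milliseconds form, and pooler:return_group_member/3 is the existing check-in API):

%% Take a member from any pool in the group, waiting up to 1000 ms for a pool
%% to grow, run Fun on it, and always return the member to its pool.
with_group_member(GroupName, Fun) ->
    case pooler:take_group_member(GroupName, 1000) of
        error_no_members ->
            {error, overloaded};
        {error_no_group, BadGroup} ->
            {error, {no_such_group, BadGroup}};
        Pid when is_pid(Pid) ->
            try
                {ok, Fun(Pid)}
            after
                pooler:return_group_member(GroupName, Pid, ok)
            end
    end.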

Collaborator
@seth seth left a comment

This looks reasonable, but I think needs a few cleanups. Does the existing test suite cover these additions? If not, let's add a test so it gets covered.

@@ -220,18 +229,25 @@ take_group_member(GroupName) ->
{_, _, X} = os:timestamp(),
Idx = (X rem length(Pools)) + 1,
{PoolPid, Rest} = extract_nth(Idx, Pools),
take_first_pool([PoolPid | Rest])
case {take_first_pool([PoolPid | Rest], 0), Timeout} of
Collaborator

Why a first case with hard-coded zero timeout?

Author

I did not have a lot of time to truly dig into the internals of how pooler works, so if any of my assumptions are wrong please let me know:

From the looks of the code, when you call "take_first_pool" it basically calls "take_member" with a zero timeout. If you pass in a zero timeout, it will not grow the pool if it is at capacity (or if it does grow the pool, it will not wait around for a response). The reason I do the 0 timeout first is to make sure no one pool gets too big.

Example:

Let's assume I have two pools, each connected to a different Riak server, with an initial capacity of 1 and a max capacity of 100. If I passed in the non-zero timeout first and the pseudo-random number being used (via the os:timestamp call) always landed on 1, I would end up with one pool with 100 connections and one pool with 1 connection. In that case this would unbalance the load placed on the two Riak servers.

That said, the big issue I have with this change is that if you have a lot of pools (like 1000), you would be making 2000 gen_server calls just to find out there are no pools left with any space.
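For reference, a minimal sketch of the two-pass idea described above (this mirrors the intent of the patch rather than its exact code; take_first_pool/2 is the internal helper shown in the diff):

%% First sweep every pool with a zero timeout so an already-idle member anywhere
%% in the group is preferred; only retry with the caller's timeout (which may
%% grow a pool) if every pool came back empty.
take_group_member_sketch(Pools, Timeout) ->
    case take_first_pool(Pools, 0) of
        error_no_members when Timeout =/= 0 ->
            take_first_pool(Pools, Timeout);
        Result ->
            Result
    end.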

Author

Also, I completely acknowledge that my argument may be purely academic and that in reality this is just not the case (or so rare that it is not worth handling) :P

Collaborator

So I see what you're concerned about. I think that random selection of the pool makes a single-large-pool issue a non-problem. Are you concerned that the selection isn't random enough?

How many pools are you using? 1000 strikes me as a large number -- at least in the sense that I didn't consider that a use case to solve for in the initial design.

It would be possible to build a small experiment to model out how the pools in the group would grow.

On balance, I'd prefer the simpler code path and think that the random selection makes that OK.
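A rough sketch of such an experiment (not pooler code; it only counts how often the os:timestamp-based index used by take_group_member lands on each pool, as a proxy for how the pools in the group would grow):

%% Count how many times each of NPools indices is selected over Takes draws,
%% using the same (X rem N) + 1 scheme as take_group_member.
selection_histogram(NPools, Takes) ->
    Counts0 = maps:from_list([{I, 0} || I <- lists:seq(1, NPools)]),
    lists:foldl(
      fun(_, Counts) ->
              {_, _, X} = os:timestamp(),
              Idx = (X rem NPools) + 1,
              maps:update_with(Idx, fun(C) -> C + 1 end, Counts)
      end,
      Counts0,
      lists:seq(1, Takes)).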

Author

The 1000 pools was just a crazy max-case kind of thing; in practice we run one pool per Riak DB (so typically 5) and configure them with an initial capacity of 10 and a max of 100. When I first ran the change through our load simulator without the first pass of 0, I was seeing pool sizes of roughly 25, 20, 15, 10, 10, and after the change it evened out (around 14 per pool, give or take one). In the end nothing was so critical that the system became unstable, but I did find that typically a few servers received more load than others.

Author

I also wonder if a solution like what dispcount (https://github.com/ferd/dispcount) does may be better. It uses an ets table with counters to determine which pool is the best to pull from. However, that has the disadvantage of doing more ets calls, which would be much more expensive than just getting the timestamp ...
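For illustration, a minimal sketch of that counter idea (this is neither pooler nor dispcount code; the table layout and function names are invented):

%% Keep one {PoolPid, InUseCount} row per pool in a public ets table and pick
%% the pool with the lowest counter instead of a timestamp-derived index.
init_counters(PoolPids) ->
    Tab = ets:new(pool_counters, [public, set]),
    [ets:insert(Tab, {Pid, 0}) || Pid <- PoolPids],
    Tab.

pick_least_loaded(Tab) ->
    %% Full table scan for clarity; a real implementation would avoid this.
    [{Pid, _Count} | _] = lists:keysort(2, ets:tab2list(Tab)),
    ets:update_counter(Tab, Pid, 1),
    Pid.

release(Tab, Pid) ->
    ets:update_counter(Tab, Pid, -1),
    ok.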

Author

Ferd also keeps all the processes in an ets table so there are no gen_server calls to pull a process; however, I'm not sure that would work well for Riak given the vector clock issue you cite in the Readme.md.

%% in the group are tried in order. If no member
%% is available within the specified timeout, error_no_members is returned.
%% `Timeout' can be either milliseconds as integer or `{duration, time_unit}'
-spec take_group_member(atom(), non_neg_integer()) -> pid() | error_no_members | {error_no_group, atom()}.
Collaborator

The spec here for Timeout doesn't include {duration(), time_unit()} -- if we're going to support that, the spec should allow it as well.

Author

Ah yes. I will add that when I have some free time this weekend.
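One possible shape for the revised spec, as a sketch only (it assumes the {duration, time_unit} form in the doc comment corresponds to a {non_neg_integer(), time_unit()} tuple, with time_unit() defined elsewhere in pooler):

-spec take_group_member(atom(), non_neg_integer() | {non_neg_integer(), time_unit()}) ->
          pid() | error_no_members | {error_no_group, atom()}.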
