Added a timeout value for the take_group_member #72

Open · wants to merge 1 commit into base: master

Conversation

cjimison

If the pools in a "group" have room to grow but no available members, take_group_member will still return that there are no members available. This fix allows the user to pass in a timeout; after it checks whether any pool is not yet at capacity, it will then try to grow one.
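A hypothetical usage sketch of the new call (illustrative only: with_group_member/2 and the group name are made up; the 1000 ms timeout uses the integer-milliseconds form, and pooler:return_group_member/3 is the existing check-in API):

%% Take a member from any pool in the group, waiting up to 1000 ms for a pool
%% to grow, run Fun on it, and always return the member to its pool.
with_group_member(GroupName, Fun) ->
    case pooler:take_group_member(GroupName, 1000) of
        error_no_members ->
            {error, overloaded};
        {error_no_group, BadGroup} ->
            {error, {no_such_group, BadGroup}};
        Pid when is_pid(Pid) ->
            try
                {ok, Fun(Pid)}
            after
                pooler:return_group_member(GroupName, Pid, ok)
            end
    end.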

Collaborator
@seth seth left a comment

This looks reasonable, but I think needs a few cleanups. Does the existing test suite cover these additions? If not, let's add a test so it gets covered.

@@ -220,18 +229,25 @@ take_group_member(GroupName) ->
{_, _, X} = os:timestamp(),
Idx = (X rem length(Pools)) + 1,
{PoolPid, Rest} = extract_nth(Idx, Pools),
take_first_pool([PoolPid | Rest])
case {take_first_pool([PoolPid | Rest], 0), Timeout} of
Collaborator

Why a first case with hard-coded zero timeout?

Author

I did not have a lot of time to truly dig into the internals of how pooler works, so if any of my assumptions are wrong please let me know:

From the looks of the code, when you call "take_first_pool" it basically calls "take_member" with a zero timeout. If you pass in a zero timeout, it will not grow the pool if it is at capacity (or if it does grow the pool, it will not wait around for a response). The reason I do the 0 timeout first is to make sure no one pool gets too big.

Example:

Let's assume I have two pools, each connected to a different Riak server, with an initial capacity of 1 and a max capacity of 100. If I passed in the non-zero timeout first and the pseudo-random number being used (via the os:timestamp call) always landed on 1, I would end up with one pool with 100 connections and one pool with 1 connection. In that case this would unbalance the load placed on the two Riak servers.

That said, the big issue I have with this change is that if you have a lot of pools (like 1000), you would be making 2000 gen_server calls just to find out there are no pools left with any space.
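For reference, a minimal sketch of the two-pass idea described above (this mirrors the intent of the patch rather than its exact code; take_first_pool/2 is the internal helper shown in the diff):

%% First sweep every pool with a zero timeout so an already-idle member anywhere
%% in the group is preferred; only retry with the caller's timeout (which may
%% grow a pool) if every pool came back empty.
take_group_member_sketch(Pools, Timeout) ->
    case take_first_pool(Pools, 0) of
        error_no_members when Timeout =/= 0 ->
            take_first_pool(Pools, Timeout);
        Result ->
            Result
    end.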

Author

Also, I completely acknowledge that my argument may be purely academic and that in reality this is just not the case (or so rare that it is not worth handling) :P

Collaborator

So I see what you're concerned about. I think that random selection of the pool makes a single-large-pool issue a non-problem. Are you concerned that the selection isn't random enough?

How many pools are you using? 1000 strikes me as a large number -- at least in the sense that I didn't consider that a use case to solve for in the initial design.

It would be possible to build a small experiment to model out how the pools in the group would grow.

On balance, I'd prefer the simpler code path and think that the random selection makes that OK.
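A rough sketch of such an experiment (not pooler code; it only counts how often the os:timestamp-based index used by take_group_member lands on each pool, as a proxy for how the pools in the group would grow):

%% Count how many times each of NPools indices is selected over Takes draws,
%% using the same (X rem N) + 1 scheme as take_group_member.
selection_histogram(NPools, Takes) ->
    Counts0 = maps:from_list([{I, 0} || I <- lists:seq(1, NPools)]),
    lists:foldl(
      fun(_, Counts) ->
              {_, _, X} = os:timestamp(),
              Idx = (X rem NPools) + 1,
              maps:update_with(Idx, fun(C) -> C + 1 end, Counts)
      end,
      Counts0,
      lists:seq(1, Takes)).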

Author

The 1000 pools was just a crazy max-case kind of thing; in practice we run one pool per Riak DB (so typically 5) and configure them with an initial capacity of 10 and a max of 100. When I first ran the change through our load simulator without the first pass of 0, I was seeing pool sizes of roughly 25, 20, 15, 10, 10, and after the change it evened out (around 14 per pool, give or take one). In the end nothing was so critical that the system became unstable, but I did find that typically a few servers received more load than others.

Author

I also wonder if a solution like what dispcount (https://github.com/ferd/dispcount) does may be better. It uses an ets table with counters to determine which pool is the best to pull from. However, that has the disadvantage of doing more ets calls, which would be much more expensive than just getting the timestamp ...
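For illustration, a minimal sketch of that counter idea (this is neither pooler nor dispcount code; the table layout and function names are invented):

%% Keep one {PoolPid, InUseCount} row per pool in a public ets table and pick
%% the pool with the lowest counter instead of a timestamp-derived index.
init_counters(PoolPids) ->
    Tab = ets:new(pool_counters, [public, set]),
    [ets:insert(Tab, {Pid, 0}) || Pid <- PoolPids],
    Tab.

pick_least_loaded(Tab) ->
    %% Full table scan for clarity; a real implementation would avoid this.
    [{Pid, _Count} | _] = lists:keysort(2, ets:tab2list(Tab)),
    ets:update_counter(Tab, Pid, 1),
    Pid.

release(Tab, Pid) ->
    ets:update_counter(Tab, Pid, -1),
    ok.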

Author

Ferd also keeps all the processes in an ets table so there are no gen_server calls to pull a process; however, I'm not sure that would work well for Riak given the vector clock issue you cite in the Readme.md.

%% in the group are tried in order. If no member
%% is available within the specified timeout, error_no_members is returned.
%% `Timeout' can be either milliseconds as integer or `{duration, time_unit}'
-spec take_group_member(atom(), non_neg_integer()) -> pid() | error_no_members | {error_no_group, atom()}.
Collaborator

The spec here for Timeout doesn't include {duration(), time_unit()} -- if we're going to support that, the spec should allow it as well.

Author

Ah yes. I will add that when I have some free time this weekend.
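One possible shape for the revised spec, as a sketch only (it assumes the {duration, time_unit} form in the doc comment corresponds to a {non_neg_integer(), time_unit()} tuple, with time_unit() defined elsewhere in pooler):

-spec take_group_member(atom(), non_neg_integer() | {non_neg_integer(), time_unit()}) ->
          pid() | error_no_members | {error_no_group, atom()}.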
