Add optional time-to-live to subscription lock #213

Open
slashdotdash opened this issue Aug 10, 2020 · 4 comments


@slashdotdash (Member) commented Aug 10, 2020

Postgres advisory locks are used to ensure only a single subscription process can subscribe to each uniquely named EventStore subscription. This ensures events are only processed by a single subscription, regardless of how many nodes are running in a multi-node deployment.

One consequence of this design is that when multiple nodes are started, it will likely be the first node to start that acquires the locks for all of the started subscriptions. This means that load won't be evenly distributed amongst the available nodes in the cluster. You could use distributed Erlang to evenly distribute subscriber processes amongst the available nodes in the cluster.

For scenarios where distributed Erlang is not used, the subscription lock could be released after a configurable interval (e.g. hourly, with random jitter) to help balance load more evenly. This would allow a subscription process running on another node to connect and resume processing. It may be necessary to broadcast a "lock released" message to connected nodes, triggering lock acquisition on another node, to reduce latency. Eventually subscription processes should be randomly distributed amongst all running nodes in the cluster.
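As a rough sketch of the timed release, assuming a GenServer-based subscription process (the module name and the release/broadcast helpers below are illustrative placeholders, not existing EventStore functions):

defmodule SubscriptionLockTTL do
  use GenServer

  # Release the lock roughly hourly, with random jitter so nodes don't all
  # release at the same moment.
  @ttl :timer.hours(1)
  @max_jitter :timer.minutes(10)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(state) do
    schedule_release()
    {:ok, state}
  end

  @impl true
  def handle_info(:release_lock, state) do
    # Release the advisory lock so a subscription on another node can acquire
    # it, then notify connected nodes so the takeover isn't delayed until the
    # next retry.
    :ok = release_advisory_lock(state)
    :ok = broadcast_lock_released(state)
    schedule_release()
    {:noreply, state}
  end

  defp schedule_release do
    Process.send_after(self(), :release_lock, @ttl + :rand.uniform(@max_jitter))
  end

  # Placeholders: the real implementations would live in EventStore's
  # subscription and advisory lock handling.
  defp release_advisory_lock(_state), do: :ok
  defp broadcast_lock_released(_state), do: :ok
end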

The pg_locks table can be used to identify locks acquired on the EventStore subscriptions table from any connected node:

SELECT * FROM pg_locks 
WHERE classid = 'public.subscriptions'::regclass
  AND locktype = 'advisory';

This could be used to determine whether locks are fairly distributed by grouping and counting by PID:

SELECT pid, COUNT(*) 
FROM pg_locks 
WHERE classid = 'subscriptions'::regclass 
  AND locktype = 'advisory'
GROUP BY pid;
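
For convenience, the same check could also be run from an attached IEx session via Postgrex; the connection options below are purely illustrative:

{:ok, conn} =
  Postgrex.start_link(
    hostname: "localhost",
    username: "postgres",
    password: "postgres",
    database: "eventstore"
  )

# Count advisory locks per backend PID to see how they are spread.
Postgrex.query!(
  conn,
  """
  SELECT pid, COUNT(*)
  FROM pg_locks
  WHERE classid = 'subscriptions'::regclass
    AND locktype = 'advisory'
  GROUP BY pid;
  """,
  []
)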
slashdotdash changed the title from "Add subscription lock time-to-live" to "Add optional time-to-live to subscription lock" on Aug 12, 2020
@HarenBroog

Hiho!

I am also facing the problem that only one node in the cluster processes events, which makes things harder to manage and scale horizontally. The solution you proposed would allow spreading load by subscription name, but it still isn't ideal. We won't be able to spin up another subscriber on an existing persistent subscription (for instance, when the consumer has constantly growing lag).

I am just wondering whether we could use another scaling pattern here. From what I inferred from the codebase, the problem is that EventStore.Subscriptions.Supervisor is unaware of the clustered environment. It spins up one process for each subscription on each node, but only one can actually process events (just as you described).

What if we made that supervisor cluster-aware using a tool like Swarm or Horde? Subscription processes would be registered "globally" (well, almost :)) and could spread event handling across nodes via the given partition_by callback.
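
Roughly, the shape I have in mind (module names here are only illustrative, not the actual PoC):

# Start a cluster-wide dynamic supervisor; Horde places each child on exactly
# one node and redistributes children when nodes join or leave.
children = [
  {Horde.DynamicSupervisor,
   name: MyApp.SubscriptionsSupervisor, strategy: :one_for_one, members: :auto}
]

Supervisor.start_link(children, strategy: :one_for_one)

# Each subscription is then started once per cluster rather than once per node.
Horde.DynamicSupervisor.start_child(
  MyApp.SubscriptionsSupervisor,
  {MySubscriber, subscription_name: "example"}
)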

Last week I hacked up a small PoC of that idea; if you think it could be a viable concept I can share a draft :)

@slashdotdash (Member, Author) commented Oct 19, 2020

@HarenBroog This particular issue relates to scenarios where distributed Erlang is not being used. However, allowing subscribers to a single subscription to be distributed amongst multiple nodes in a distributed Erlang cluster would be a useful feature to add. I can create a separate issue for horizontal subscription scaling.

@HarenBroog

@slashdotdash makes sense. Then we can continue the discussion there.

@ericlyoung

This lock on the subscription actually causes a problem when deploying new versions of the app. I use Docker Swarm to launch a new container with the new version of my Elixir app, which is an extremely common deployment strategy. The app in the new container is stuck waiting 60 seconds before it is able to subscribe.
