The current implementation of m3em was targeted at short-lived, transient clusters. This issue lists the things we should address if we want to make it more robust for long-running clusters (in no particular order):
(1) Currently, the remote agents heartbeat to a coordinator periodically. If they are unable to heartbeat for a defined duration, they time out and reset themselves to an un-initialised state, i.e. they kill any running process and delete local data (see the heartbeat sketch after this list).
(2) Declarative state modelling: currently the control flow is modelled procedurally, i.e. a coordinator is given a list of remote agents and sends them instructions like "transfer file", "execute file", and so on. We'll need to change this to a declarative model to allow for more robust control (sketched after this list).
(3) Fault-tolerant master: currently the coordinator initialises the connections with the remote agents and must stay alive for the duration of a test run. If the coordinator goes down, it is not able to recover the state of the running agents from their heartbeats; we should allow it to do so (see the etcd sketch after this list).
(4) Durable state: we do not persist the expected state of the nodes anywhere. Persisting it in a durable store (etcd would be fine) is necessary for fault tolerance (covered by the same sketch).
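A minimal sketch of the agent-side watchdog from (1), assuming a periodic ticker and a reset after sustained failure; the names (`Agent`, `sendHeartbeat`, the intervals) are illustrative, not the actual m3em API:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// Agent stands in for the remote agent; lastAck tracks the last
// successful heartbeat to the coordinator.
type Agent struct {
	lastAck time.Time
}

// sendHeartbeat would be a gRPC call to the coordinator in practice;
// here it always fails so the sketch exercises the timeout path.
func (a *Agent) sendHeartbeat() error {
	return errors.New("coordinator unreachable")
}

// reset returns the agent to the un-initialised state: kill any
// running process and delete local data.
func (a *Agent) reset() {
	log.Println("heartbeats failing, resetting to un-initialised state")
}

func (a *Agent) heartbeatLoop(interval, timeout time.Duration) {
	a.lastAck = time.Now()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		if err := a.sendHeartbeat(); err == nil {
			a.lastAck = time.Now()
			continue
		}
		if time.Since(a.lastAck) > timeout {
			a.reset()
			return
		}
	}
}

func main() {
	(&Agent{}).heartbeatLoop(100*time.Millisecond, 500*time.Millisecond)
}
```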
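And a sketch of the declarative shape suggested in (2): the coordinator publishes a desired state per agent, and a reconcile loop converges the actual state towards it instead of issuing imperative commands. The `State` type and its steps are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// State captures what an agent should be doing (desired) or is doing (actual).
type State struct {
	BuildDeployed  bool
	ProcessRunning bool
}

// reconcile takes one step towards the desired state; the underlying
// operations are the same "transfer file" / "execute file" actions the
// imperative flow uses today.
func reconcile(desired, actual State) State {
	switch {
	case desired.BuildDeployed && !actual.BuildDeployed:
		fmt.Println("transferring build to agent")
		actual.BuildDeployed = true
	case desired.ProcessRunning && !actual.ProcessRunning:
		fmt.Println("starting process on agent")
		actual.ProcessRunning = true
	}
	return actual
}

func main() {
	desired := State{BuildDeployed: true, ProcessRunning: true}
	var actual State
	for actual != desired {
		actual = reconcile(desired, actual)
		time.Sleep(10 * time.Millisecond)
	}
	fmt.Println("converged to desired state")
}
```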
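For (3) and (4), persisting the expected state in etcd would let a restarted coordinator recover instead of losing track of running agents. A rough sketch using the official `clientv3` package; the key layout and the `NodeState` record are assumptions:

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// NodeState is a hypothetical record of what we expect a node to be doing.
type NodeState struct {
	Build   string `json:"build"`
	Running bool   `json:"running"`
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Persist the expected state before acting on it...
	val, err := json.Marshal(NodeState{Build: "m3dbnode-abc123", Running: true})
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "/m3em/nodes/node0/expected", string(val)); err != nil {
		log.Fatal(err)
	}

	// ...so a restarted coordinator can read it all back and resume.
	resp, err := cli.Get(ctx, "/m3em/nodes/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		log.Printf("recovered %s = %s", kv.Key, kv.Value)
	}
}
```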
Another couple:
(5) The remote agents should be aware of the "coordinator(s)" and be the ones initiating the control flow (see the registration sketch below).
(6) The m3em agents handle the fork/exec and process lifecycle monitoring themselves; this works well, but it might be worth moving to supervisor/systemd or something more battle-tested (see the supervision sketch below).
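A sketch of (5): each agent is configured with the coordinator endpoints and initiates the control flow by registering itself, retrying until one accepts. The `/register` endpoint and payload are purely illustrative:

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"time"
)

// register announces the agent to its known coordinator(s), inverting the
// current direction where the coordinator initialises the connections.
func register(coordinators []string, agentID string) {
	for {
		for _, c := range coordinators {
			body := bytes.NewBufferString(`{"agent":"` + agentID + `"}`)
			resp, err := http.Post(c+"/register", "application/json", body)
			if err == nil {
				resp.Body.Close()
				if resp.StatusCode == http.StatusOK {
					log.Printf("registered with %s", c)
					return
				}
			}
		}
		time.Sleep(2 * time.Second) // back off and retry the whole list
	}
}

func main() {
	register([]string{"http://coord0:9000", "http://coord1:9000"}, "agent-0")
}
```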
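And for context on (6), the lifecycle handling the agents do today boils down to a supervise-and-restart loop of roughly this shape (a hypothetical sketch, not the actual m3em code; the binary path is made up). This is exactly the part a systemd unit with `Restart=on-failure`, or a supervisord program entry, already provides:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// supervise fork/execs the target binary and restarts it whenever it
// exits -- the piece (6) suggests delegating to supervisor/systemd.
func supervise(path string, args ...string) {
	for {
		cmd := exec.Command(path, args...)
		if err := cmd.Start(); err != nil {
			log.Printf("start failed: %v", err)
		} else if err := cmd.Wait(); err != nil {
			log.Printf("process exited: %v", err)
		}
		time.Sleep(time.Second) // crude restart backoff
	}
}

func main() {
	// Hypothetical binary and config path, for illustration only.
	supervise("/usr/local/bin/m3dbnode", "-f", "/etc/m3dbnode.yml")
}
```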