Team : Grégoire Clément, Maxime Delisle, Sylvain Beaud
Nowadays, robust and reliable systems are a core component of a respectable setup, for this reason, we focused more particularly on this side of the problem.
One of the main highlights of our implementation is the possibility to add and remove workers at will and at any time. Indeed, in the synchronous implementation, the coordinator monitors the number of workers and if a worker crashes, the computation can continue without him. On the contrary, if the user want to add more workers to the system, the new workers will connect to the coordinator or other workers and the computation will continue with these additional workers. In the asynchronous version, a new worker arrives, it will retrieve the list of workers from another worker and broadcast its updates to them and receive their computations; this is the only phase where a locking mechanism is used. When a worker encounters an error, it broadcasts an error message to the other workers and they will stop to communicate with the faulty node.
Another interesting feature of our implementation is the fact that once the computations are finished, the logs and statistics are uploaded and stored on transfer.sh and can be downloaded for a later use. We have also put options to adjust the level of verbosity of the logs.
For more infos about this project refer to report.pdf
or contact us.
kubectl (https://kubernetes.io/docs/tasks/tools/install-kubectl/)
$ sh run.sh $1 $2 $3
$1
argument is either sync
or async
$2
argument is the number of replicas 1
to 100
(or more)
$3
argument is the log level (or verbosity) from 0
(minimal) to 3
(maximal)
Results are uploaded on transfer.sh (linked displayed in the console). In case of failure (if server transfer.sh is down) we also print them in the console (just to be sure!).
[1] Recht, Benjamin, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." Advances in neural information processing systems. 2011.