A Distributed version of Hogwild! [1]

Team: Grégoire Clément, Maxime Delisle, Sylvain Beaud

Description

Robust and reliable systems are a core component of any serious deployment, so we focused on that side of the problem.

One of the main highlights of our implementation is that workers can be added and removed at any time. In the synchronous implementation, the coordinator monitors the number of workers, and if a worker crashes, the computation continues without it. Conversely, if the user wants to add more workers, the new workers connect to the coordinator or to other workers, and the computation continues with them. In the asynchronous version, when a new worker arrives, it retrieves the list of workers from another worker, broadcasts its updates to them, and receives theirs; this is the only phase where a locking mechanism is used. When a worker encounters an error, it broadcasts an error message to the other workers, which then stop communicating with the faulty node.
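The asynchronous membership scheme described above can be sketched roughly as follows. This is a minimal in-process illustration, not the project's code: the `Worker` class and its `join`, `broadcast`, and `report_error` methods are hypothetical names, and direct method calls stand in for network messages. The lock guards only the peer list while a worker joins; broadcasting updates takes no lock.

```python
import threading


class Worker:
    """Hypothetical in-process stand-in for one asynchronous worker."""

    def __init__(self, name):
        self.name = name
        self.peers = {}      # name -> Worker: the nodes this worker broadcasts to
        self.received = []   # (sender, update) pairs received from other workers
        self._join_lock = threading.Lock()  # held only while membership changes

    def join(self, contact):
        """Join the group by copying the peer list of an existing worker."""
        with contact._join_lock:
            known = dict(contact.peers)
            known[contact.name] = contact
        # Announce ourselves to every known worker, one lock at a time.
        for peer in known.values():
            with peer._join_lock:
                peer.peers[self.name] = self
        self.peers = {n: p for n, p in known.items() if n != self.name}

    def broadcast(self, update):
        """Send an update to all current peers without taking any lock."""
        for peer in list(self.peers.values()):
            peer.received.append((self.name, update))

    def report_error(self):
        """Tell every peer to stop communicating with this (faulty) worker."""
        for peer in list(self.peers.values()):
            peer.peers.pop(self.name, None)
        self.peers.clear()
```

For example, after `b.join(a)` and `c.join(b)`, worker `c` knows both `a` and `b`; once `b.report_error()` runs, `a` and `c` keep exchanging updates and no longer list `b` as a peer.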

Another feature of our implementation is that once the computations are finished, the logs and statistics are uploaded to transfer.sh, where they can be downloaded for later use. We also provide options to adjust the verbosity of the logs.

Report

For more information about this project, refer to report.pdf or contact us.

Requirements

kubectl (https://kubernetes.io/docs/tasks/tools/install-kubectl/)

How to run the project

$ sh run.sh $1 $2 $3

$1 is the mode, either sync or async

$2 is the number of replicas, from 1 to 100 (or more)

$3 is the log level (verbosity), from 0 (minimal) to 3 (maximal)

Results

Results are uploaded to transfer.sh (the link is displayed in the console). If the upload fails (for example, if the transfer.sh server is down), we also print them in the console, just to be sure.
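The upload-with-fallback behaviour can be sketched as below. This is an illustration under assumptions, not the project's code: `publish_results` and `upload_to_transfer_sh` are hypothetical names, the upload is an HTTP PUT mirroring the `curl --upload-file` usage that transfer.sh documents, and the response body is assumed to contain the download link.

```python
import urllib.request


def upload_to_transfer_sh(name, data, timeout=10):
    """PUT `data` (bytes) to transfer.sh; the response body is assumed to be the link."""
    request = urllib.request.Request(
        f"https://transfer.sh/{name}", data=data, method="PUT"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode().strip()


def publish_results(name, text, upload=upload_to_transfer_sh):
    """Try to upload the results; print them to the console if the upload fails."""
    try:
        link = upload(name, text.encode())
    except OSError as error:  # urllib.error.URLError is a subclass of OSError
        print(f"Upload failed ({error}); results follow:")
        print(text)
        return None
    print(f"Results uploaded: {link}")
    return link
```

Passing the uploader as a parameter keeps the fallback path easy to exercise without network access, for instance by supplying a function that raises `OSError`.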

Reference

[1] Recht, Benjamin, et al. "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent." Advances in Neural Information Processing Systems. 2011.