NOTE: Currently there are a couple of assumptions:
- Homogeneous cluster setup
- The model gradient transfer size equals the model size saved in checkpoints (model_factory)
- Parameter Server / Worker frameworks (all-reduce is not yet implemented)
- Synchronous SGD
Execution
- Before the execution, what's needed?
  - Infrastructure details

    Define the hierarchy and resource capacity of the infrastructure in `cluster_spec.csv`. For example, suppose we have a cluster with 4 racks (switches). Under each rack (switch), there are 32 nodes, and each node has 128 CPU cores, 256 GB of memory, and 8 GPUs. Then `cluster_spec.csv` will look like this:

    ```
    num_switch,num_node_p_switch,num_gpu_p_node,num_cpu_p_node,mem_p_node
    4,32,8,128,256
    ```
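    For illustration, here is a minimal sketch of reading such a spec file and deriving cluster totals. This standalone snippet is not part of the simulator, and the simulator's own parsing may differ:

    ```python
    import csv

    # Illustrative only: read cluster_spec.csv and compute cluster totals.
    # Column names follow the spec format above.
    with open('cluster_spec.csv') as f:
        spec = next(csv.DictReader(f))

    num_nodes = int(spec['num_switch']) * int(spec['num_node_p_switch'])
    total_gpus = num_nodes * int(spec['num_gpu_p_node'])
    print(f'{num_nodes} nodes, {total_gpus} GPUs in total')
    ```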
  - Job trace

    The job trace to simulate. For each job, the simulator needs the following information (an example trace follows the list):
    - `job_id`: for tracking
    - `num_gpu`: GPU requirement
    - `submit_time`: when the job is submitted. The simulator is event-based and discrete-time; therefore, the time value starts from `0` and is in seconds.
    - `iterations`: the number of training iterations. Used in the network cost calculation for data-parallel jobs.
    - `model_name`: the model used by the job. This is used to estimate GPU memory usage and network costs.
    - `duration`: how long this job will run. This information is used by the simulator to generate the job completion event.
    - `interval`: the job submission interval from this job to the next job
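    Assuming the trace is a CSV with one row per job and the fields above as columns (the exact header order and the model names here are made up for illustration; check your own trace files), a two-job trace might look like:

    ```
    job_id,num_gpu,submit_time,iterations,model_name,duration,interval
    1,4,0,5000,resnet50,3600,60
    2,8,60,20000,vgg16,7200,0
    ```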
- How to run the simulator?

  A simple example of the execution command is:

  ```
  python execute.py
  ```

  Inside the execute file, the following options are necessary (a full example command is shown after the option lists):
  - `--cluster_spec`: infrastructure spec file
  - `--trace_file`: job trace
  - `--scheme`: placement scheme
  - `--schedule`: scheduler
  Optional inputs:
  - `--print`: print debug information
  - `--log_path`: the output path of the logs (cluster, job). The default is a `time-stamp` folder under the current path.
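  Putting these together, a full invocation might look like the following (the spec and trace file names are placeholders, and the exact `--flag=value` syntax may vary with the argument parser used):

  ```
  python execute.py --cluster_spec=cluster_spec.csv --trace_file=trace.csv \
      --scheme=yarn --schedule=dlas-gpu --log_path=results
  ```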
- What are the placement and scheduling algorithms provided?

  Placement:
  - `yarn`: get GPUs from the same server nodes under the same switch
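  As a rough illustration of this consolidated placement idea (not the simulator's actual implementation), GPUs could be picked greedily under a single switch:

  ```python
  # Illustrative sketch of "yarn"-style consolidated placement:
  # pack a job's GPUs onto as few nodes as possible, all under one switch.
  def place_job(num_gpu, switches):
      """switches: free GPU count per node, grouped by switch."""
      for s, nodes in enumerate(switches):
          # Try nodes with the most free GPUs first to limit fragmentation.
          order = sorted(range(len(nodes)), key=lambda n: -nodes[n])
          need, picked = num_gpu, []
          for n in order:
              take = min(nodes[n], need)
              if take:
                  picked.append((n, take))
                  need -= take
              if need == 0:
                  return s, picked  # all GPUs found under one switch
      return None  # no single switch can host the job

  # Example: 2 switches, 3 nodes each, with free GPU counts per node.
  print(place_job(12, [[2, 8, 4], [8, 8, 8]]))  # -> (0, [(1, 8), (2, 4)])
  ```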
  Scheduling:
  - `fifo`
  - `sjf`: smallest-job-first, in terms of GPU requirement

  TODO below:
  - `lpjf`: longest-pending-job-first
  - `shorest`: shortest-remaining-time job first
  - `shorest-gpu`: shortest-remaining-gpu-time job first
  - `dlas`: discretized LAS (just time-based). In `jobs.py`, you need to specify `num_queue` and `queue_limit` for MLFQ (also for `dlas-gpu` and `gittins`); a sketch of the queue demotion this implies follows the list.

    ```python
    # Example 1: there are two queues, and the threshold for Q1 is 3600 seconds
    self.queue_limit = [3600]
    # Example 2: there are four queues, and the thresholds for the queues are 3600, 7200, and 18000 seconds
    self.queue_limit = [3600, 7200, 18000]
    ```
  - `dlas-gpu`: discretized LAS (gpu-time-based)
  - `gittins`: discretized Gittins Index (gpu-time-based)
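  To make the `queue_limit` semantics concrete, here is a hedged sketch of how discretized-LAS-style demotion across MLFQ queues could work (illustrative only; the actual logic lives in the simulator, and `jobs.py` may differ):

  ```python
  import bisect

  # Illustrative only: map a job's accumulated service (gpu-)time to an
  # MLFQ queue index using the queue_limit thresholds from jobs.py.
  queue_limit = [3600, 7200, 18000]  # 4 queues: Q0..Q3

  def queue_of(executed_time):
      # Jobs start in Q0 and are demoted one queue each time their
      # accumulated service crosses the next threshold.
      return bisect.bisect_right(queue_limit, executed_time)

  assert queue_of(0) == 0        # new job: highest-priority queue
  assert queue_of(3600) == 1     # crossed Q0's threshold: demoted to Q1
  assert queue_of(100000) == 3   # beyond all thresholds: lowest queue
  ```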
- What's the output?

  Based on `--log_path`, all the output files are in that folder (e.g., `result-20190210-12-20-37`), including:
  - `cluster.csv`: cluster-level resource utilization info at each event point
  - `jobs.csv`: the job execution information
  - `cpu.csv`, `gpu.csv`, `memory.csv`, `network.csv`: the utilization details of each resource unit at event points. However, those logs are not accurate under some combinations of placement and scheduler. When `count` is chosen, those files are not generated.
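  As a quick post-processing example, average job completion time could be computed from `jobs.csv`. The column names used here (`submit_time`, `end_time`) are assumptions for illustration; check the actual header (or `log.py`) first:

  ```python
  import csv

  # Hypothetical post-processing of jobs.csv; column names are assumed,
  # so verify them against log.py or the file's header before use.
  with open('result-20190210-12-20-37/jobs.csv') as f:
      rows = list(csv.DictReader(f))

  jcts = [float(r['end_time']) - float(r['submit_time']) for r in rows]
  print('average JCT: %.1f s over %d jobs' % (sum(jcts) / len(jcts), len(jcts)))
  ```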
  The output logs are defined in `log.py`; you can modify that file to adjust the output information.