chuxiaoyu/2023-hotcloudperf-ml-failures

2023-hotcloudperf-ml-failures

Publication

Xiaoyu Chu, Sacheendra Talluri, Laurens Versluis, and Alexandru Iosup. 2023. How Do ML Jobs Fail in Datacenters? Analysis of a Long-Term Dataset from an HPC Cluster. In Companion of the 2023 ACM/SPEC International Conference on Performance Engineering (ICPE '23 Companion). Association for Computing Machinery, New York, NY, USA, 263–268. https://doi.org/10.1145/3578245.3584726

Dataset

SLURM job data: https://zenodo.org/records/12750561
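
As a first look at job outcomes in a Slurm accounting export, the sketch below tallies jobs by final state. It is an illustration under stated assumptions, not the analysis code from the paper: the column names `JobID` and `State` and the `|` delimiter follow the conventions of Slurm's `sacct --parsable2` output and may not match the files published on Zenodo, the helper `count_states` is hypothetical, and the sample rows are made up so the example runs without downloading anything.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; NOT taken from the Zenodo dataset.
# Column names and the "|" delimiter mirror `sacct --parsable2` output and
# are an assumption about the layout of the published files.
SAMPLE = """JobID|State
1|COMPLETED
2|FAILED
3|CANCELLED by 1000
4|TIMEOUT
5|COMPLETED
"""


def count_states(handle, delimiter="|"):
    """Count jobs per final state in a delimited job log (hypothetical helper)."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter=delimiter):
        # sacct reports cancellations as "CANCELLED by <uid>", so keep only
        # the first word; empty fields are grouped under "UNKNOWN".
        parts = (row.get("State") or "").split()
        counts[parts[0] if parts else "UNKNOWN"] += 1
    return counts


counts = count_states(io.StringIO(SAMPLE))
total = sum(counts.values())
for state, n in counts.most_common():
    print(f"{state}: {n} ({n / total:.0%})")
```

To run this on a real export, replace `io.StringIO(SAMPLE)` with an opened file (for example `open(path, newline="")`, where `path` is wherever the download was saved) and adjust the column name and delimiter to the actual file layout.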

About

Code and data for 2023-hotcloudperf-ml-failures.
