Skip to content

This project aims to create a toolset for defining HPC ontology, that builds on top of an existing HPC ontology (https://hpc-fair.github.io/ontology/).

Notifications You must be signed in to change notification settings

am-i-helpful/hpc-ontology-modeller

Repository files navigation

Ontology Modeller

This project aims to model an ontology for HPC computing environment based on some of the sample datasets produced by real-world HPC cluster tools.

  • In this project, we'll focus on the datasets collected from SURF's LISA HPC cluster and CINECA's Marconi100 HPC cluster (consisting of metrics clooected at various levels from the HPC environment).
  • The ontology modelling aims to extend or improve upon an HPC ontology already derived by the HPC-FAIR research team at https://hpc-fair.github.io/ontology/ [1].
  • The aim of this project is to reuse the HPC-FAIR ontology while mapping metrics from these datasets (which have been made available for research) as much as possible and adding the missing ones.

Metrics mapping

1.CINECA's Marconi100 cluster dataset consists of metrics associated with different plugins as described below:

  • IPMI plugin (https://gitlab.com/ecs-lab/exadata/-/blob/main/documentation/plugins/ipmi.md)
    • metric (description; unit; value type; sampling period)
    • ambient (Temperature at the node inlet; measured in °C; values are of type float; sampled at 20s from each node)
    • dimmX_temp (Temperature of DIMM module X. X=0..15; measured in °C; values are of type int; sampled at 20s from each node)
    • fanX_Y (Speed of the Fan Y in module X. X=0..3, Y=0,1; measured in revolutions (RPM); values are of type int; sampled at 20s from each node)
    • fan_disk_power (Temperature at the node inlet; measured in Watts; values are of type int; sampled at 20s from each node)
    • gpuX_core_temp (Temperature of the core for the GPU id X. X=0,1,3,4; measured in °C; values are of type int; sampled at 20s from each node)
    • gpuX_mem_temp (Temperature of the memory for the GPU id X. X=0,1,3,4; measured in °C; values are of type int; sampled at 20s from each node)
    • gv100cardX (Nvidia Quadro GV100 Card for the GPU id X. X=0,1,3,4; measurement is unspecified; values are of type int; sampled at 20s from each node)
    • pX_coreY_temp (Temperature of core Y in the CPU socket X. X=0..1, Y=0..23; measured in °C; values are of type int; sampled at 20s from each node)
    • pX_io_power (Power consumption for the I/O subsystem for the CPU socket X. X=0..1; measured in Watts; values are of type int; sampled at 20s from each node)
    • pX_mem_power (Power consumption for the memory subsystem for the CPU socket X. X=0..1; measured in Watts; values are of type int; sampled at 20s from each node)
    • pX_power (Power consumption for the CPU socket X. X=0..1; measured in Watts; values are of type int; sampled at 20s from each node)
    • pX_vdd_temp (Temperature of the voltage regulator for the CPU socket X. X=0..1; measured in °C; values are of type int; sampled at 20s from each node)
    • pcie (Temperature at the PCIExpress slots; measured in °C; values are of type float; sampled at 20s from each node)
    • psX_input_power (Power consumption at the input of power supply X. X=0..1; measured in Watts; values are of type int; sampled at 20s from each node)
    • psX_input_voltag (Voltage at the input of power supply X. X=0..1; measured in Volts; values are of type int; sampled at 20s from each node)
    • psX_output_curre (Current at the output of power supply X. X=0..1; measured in Amps; values are of type int; sampled at 20s from each node)
    • psX_output_volta (Voltage at the output of power supply X. X=0..1; measured in Volts; values are of type float; sampled at 20s from each node)
    • total_power (Total node power consumption; measured in Watts; values are of type int; sampled at 20s from each node)
metric (description) unit value type sampling period
ambient (Temperature at the node inlet) measured in °C values are of type float sampled at 20s from each node
dimmX_temp (Temperature of DIMM module X. X=0..15) measured in °C values are of type int sampled at 20s from each node
fanX_Y (Speed of the Fan Y in module X. X=0..3, Y=0,1) measured in revolutions (RPM) values are of type int sampled at 20s from each node
fan_disk_power (Temperature at the node inlet) measured in Watts values are of type int sampled at 20s from each node
gpuX_core_temp (Temperature of the core for the GPU id X. X=0,1,3,4) measured in °C values are of type int sampled at 20s from each node
gpuX_mem_temp (Temperature of the memory for the GPU id X. X=0,1,3,4) measured in °C values are of type int sampled at 20s from each node
gv100cardX (Nvidia Quadro GV100 Card for the GPU id X. X=0,1,3,4) measurement is unspecified values are of type int sampled at 20s from each node
pX_coreY_temp (Temperature of core Y in the CPU socket X. X=0..1, Y=0..23) measured in °C values are of type int sampled at 20s from each node
pX_io_power (Power consumption for the I/O subsystem for the CPU socket X. X=0..1) measured in Watts values are of type int sampled at 20s from each node
pX_mem_power (Power consumption for the memory subsystem for the CPU socket X. X=0..1) measured in Watts values are of type int sampled at 20s from each node
pX_power (Power consumption for the CPU socket X. X=0..1) measured in Watts values are of type int sampled at 20s from each node
pX_vdd_temp (Temperature of the voltage regulator for the CPU socket X. X=0..1) measured in °C values are of type int sampled at 20s from each node
pcie (Temperature at the PCIExpress slots) measured in °C values are of type float sampled at 20s from each node
psX_input_power (Power consumption at the input of power supply X. X=0..1) measured in Watts values are of type int sampled at 20s from each node
psX_input_voltag (Voltage at the input of power supply X. X=0..1) measured in Volts values are of type int sampled at 20s from each node
psX_output_curre (Current at the output of power supply X. X=0..1) measured in Amps values are of type int sampled at 20s from each node
psX_output_volta (Voltage at the output of power supply X. X=0..1) measured in Volts values are of type float sampled at 20s from each node
total_power (Total node power consumption) measured in Watts values are of type int sampled at 20s from each node
  1. SURF's LISA Cluster dataset consists of various metrics from both Prometheus and SLURM as described below:
ID Prometheus Metrics Description Unit Datatype
1 id Measurement ID - int
2 timestamp Time of the measurment, every 30 seconds - datetime
3 up the node is up or not - int
4 node The number of rack and node - string
5 node_arp_entries ARP entries by device Aggregates
6 node_boot_time_seconds Node boot time, in unixtime Aggregates timestamp
7 node_context_switches_total Total number of context switches Aggregates int
8 node_disk_io_now The number of I/Os currently in progress
9 node_disk_read_bytes_total The total number of bytes read successfully
10 node_disk_writes_completed_total The total number of writes completed successfully
11 node_disk_written_bytes_total The total number of bytes written successfully
12 node_filesystem_avail_bytes Filesystem space available to non-root users in bytes json
13 node_filesystem_device_error Whether an error occurred while getting statistics for the given device
14 node_filesystem_files Filesystem total file nodes json
15 node_filesystem_files_free Filesystem total free file nodes
16 node_filesystem_free_bytes Filesystem free space in bytes
17 node_filesystem_size_bytes Filesystem size in bytes
18 node_forks_total Total number of forks json
19 node_hwmon_temp_celsius Temperature of the node Celsius json
20 node_intr_total Total number of interrupts serviced
21 node_load1 1m CPU utilization load average
22 node_load15 15m CPU utilization load average
23 node_load5 5m CPU utilization load average
24 node_memory_Active_bytes Memory
25 node_memory_Dirty_bytes Memory
26 node_memory_MemFree_bytes Memory
27 node_memory_Percpu_bytes Memory
28 node_netstat_Icmp_InErrors Statistic
29 node_netstat_Icmp_InMsgs Statistic
30 node_netstat_Icmp_OutMsgs Statistic
31 node_netstat_Tcp_InErrs Statistic
32 node_netstat_Tcp_InSegs Statistic
33 node_netstat_Tcp_OutSegs Statistic
34 node_netstat_Tcp_RetransSegs Statistic
35 node_netstat_Udp_InDatagrams Statistic
36 node_netstat_Udp_InErrors Statistic
37 node_netstat_Udp_OutDatagrams Statistic
38 node_network_receive_bytes_total Network
39 node_network_receive_drop_total Network
40 node_network_receive_multicast_total Network
41 node_network_receive_packets_total Network
42 node_network_transmit_bytes_total Network
43 node_network_transmit_packets_total Network
44 node_procs_blocked Number of processes blocked waiting for I/O to complete
45 node_procs_running Number of processes in runnable state
46 node_rapl_package_joules_total
47 node_thermal_zone_temp The thermal zone temperature of the node Celsius
48 node_time_seconds System time in seconds since epoch (1970)
49 node_udp_queues Number of allocated memory in the kernel for UDP datagrams in bytes. Aggregates
50 nvidia_gpu_duty_cycle Percent of time over the past sample period during which one or more kernels were executing on the GPU device
51 nvidia_gpu_fanspeed_percent Fanspeed of the GPU device as a percent of its maximum
52 nvidia_gpu_memory_used_bytes Memory used by the GPU device in bytes Bytes
53 nvidia_gpu_power_usage_milliwatts Power usage of the GPU device in milliwatts Milliwattes
54 nvidia_gpu_temperature_celsius Temperature of the GPU device in celsius Celsius
55 surfsara_ambient_temp Temprature of the datacenter
56 surfsara_power_usage Power usage for the node
ID SLURM Metrics Description Datatype
1 id Job ID int
2 start_date Start date of job datetime
3 end_date End date of job datetime
4 node Rack and node of the job, maybe multiple string
5 nodetypes Type of node string
6 numnodes Number of nodes used for the job int
7 numcores Number of cores used for the job int
8 sharednode
9 submit Submit time of job timestamp
10 start Start time of job timestamp
11 end End time of job timestamp
12 state End state of job string
13 exitcode Exit code of job
14 researvation
15 partprepaid

Sample visualisations of modelled ontology (visualised in Protégé Desktop)

  • The OWLViz visualisation of CINECA's Marconi 100 HPC cluster modelled ontology is shown below:

CINECA's Marconi 100 cluster ontology OWLViz

  • The OntoGraf visualisation of SURF's LISA HPC cluster modelled ontology is shown below:

SURF's LISA cluster ontology OntoGraf

References

[1] Chunhua Liao, Pei-Hung Lin, Gaurav Verma, Tristan Vanderbruggen, Murali Emani, Zifan Nan, Xipeng Shen, HPC Ontology: Towards a Unified Ontology for Managing Training Datasets and AI Models for High-Performance Computing, 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC) (LLNL-CONF-826494)

About

This project aims to create a toolset for defining HPC ontology, that builds on top of an existing HPC ontology (https://hpc-fair.github.io/ontology/).

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published