-
Notifications
You must be signed in to change notification settings - Fork 230
TutorialMPI
MPI (Message Passing Interface) is probably the most prominent communication solution for distributed memory parallelization. It follows the SPMD paradigm (Single Program, Multiple Data) which means, each MPI process executes the same program but works on different chunks of data. For the use with LIKWID it is kind of tricky because the configuration of each process running on the same host has to be different. So, for example you want to have one MPI process per CPU socket of a machine, LIKWID needs to know for each process which CPUs it should measure. If you further use the MarkerAPI inside your MPI processes, some more configuration is needed. But let's start simple.
likwid-mpirun is a Lua script that wraps default MPI calls to insert the proper configuration for each MPI process. It performs no measurements itself but inserts calls to likwid-perfctr and likwid-pin into the execution path. I'm more interested in node-local stuff, so likwid-mpirun is not as well tested as the other LIKWID tools. That's why I write this tutorial.
We first want to execute LIKWID without the MarkerAPI, so end-to-end measurements of our whole application. Moreover, we assume that we want to pin the MPI processes to the CPUs. Assuming you have something like
mpiexec -np 2 <mpi-program>
starting 2 MPI processes, each executes the <mpi-program>
.
The most naive way would be something like this:
likwid-perfctr -C E:N:40 -g ENERGY mpiexec -np 2 <mpi-program>
But when we think about it, this works properly only in special cases. LIKWID does not know where the 2 MPI processes are going to run. It could be even a completely different machine which results in measuring mpiexec on the local host and not the MPI programs. When the processes stay on the local machine this might work. A better way is to integrate LIKWID into the 2 started processes like:
mpiexec -np 2 likwid-perfctr -C E:N:40 -g ENERGY <mpi-program>
With this calling order, LIKWID is executed on the hosts that are given in the MPI hostfile and measures the energy consumption on 40 CPUs. If you use distinct hosts, this works but what about the results. Each process prints its results when it is done. So the output could be interleaved and you first have to separate the outputs to get the measurement results. This could be avoided by setting an output file for each process:
mpiexec -np 2 likwid-perfctr -C E:N:40 -g ENERGY -o output_%h_%r.txt <mpi-program>
We added -o output_%h_%r.txt
to the commandline. The %h
is a variable that is filled by LIKWID with the hostname it is running on. %r
is substituted by the MPI rank, an process identifier within the MPI execution. After execution you will have the two files output_<host1>_0.txt
and output_<host2>_1.txt
. But there is still some problem. If both processes should run on the same host, both measure the same range of CPUs. If you want distinct CPUs per MPI processes, you have to separate the calls:
mpiexec -np 1 likwid-perfctr -C S0:20 -g ENERGY -o output_%h_%r.txt <mpi-program> : -np 1 likwid-perfctr -C S1:20 -g ENERGY -o output_%h_%r.txt <mpi-program>
With this commandline, you start two processes, one is measuring 20 CPUs on CPU socket 0 (S0
) and the other one 20 CPUs on CPU socket 1 (S1
). This works fine but it is not very handy. likwid-mpirun builds commandlines like this but provides a much simpler interface and prints out the results combined for all processes. An example call would be:
likwid-mpirun -pin S0:20_S1:20 -g ENERGY <mpi-program>
Due to the _
in the pin statement, likwid-mpirun knows that we need 2 MPI processes, each may run on 20 CPUs (caused by threading). The page likwid-mpirun lists some more options.
A little more complex thing is to use the MarkerAPI because the measuring is not performed by likwid-perfctr anymore but by the MarkerAPI calls in the application. likwid-perfctr is only used to export some configuration like CPU list, event set and intermediate result file path. So, let's look again at the naive call:
likwid-perfctr -C E:N:40 -g ENERGY -m mpiexec -np 2 <mpi-program>
This may work in some cases but it will often fail. The explaination is simple: Both MPI applications write their measurement results to the same intermediate result file. Moving the likwid-perfctr call inside the mpiexec call avoids that because likwid-perfctr is called for each process defining one intermediate output file for each MPI program.
mpiexec -np 2 likwid-perfctr -C E:N:40 -g ENERGY -m <mpi-program>
Moreover we add a distinct output file for each MPI process. When spreading the computation ofter multiple hosts, it is good practise to add the host name to the output file:
mpiexec -np 2 likwid-perfctr -C E:N:40 -g ENERGY -m -o output_%h_%r.txt <mpi-program>
The %h
is substituted by the hostname executing likwid-perctr, so we can differentiate the output files afterwards. But there still may be problems when running on the same machine, because each MPI process is pinned to the same list of CPUs. In order to pin the MPI programs to distinct parts of a host, we have to separate the calls:
mpiexec -np 1 likwid-perfctr -C S0:20 -g ENERGY -m -o output_%h_%r.txt <mpi-program> : -np 1 likwid-perfctr -C S1:20 -g ENERGY -m -o output_%h_%r.txt <mpi-program>
So this is a long commandline but it does the job. Using likwid-mpirun the same call looks like this:
likwid-mpirun -pin S0:20_S1:20 -g ENERGY -m <mpi-program>
There are some further pitfalls, especially when you want to pin your program and its threads yourself. So, we assume we want to run two processes on the same CPU socket and read the energy consumption. Our application pins the first MPI process and its threads to the first half of the CPU socket and the second process to the second half. Moreover, we use the naive way to start the measurements:
likwid-perfctr -c S0:20 -g ENERGY -m mpiexec -np 2 <mpi-program>
During initialization of the MarkerAPI inside our MPI program, each of the given CPUs is intialized and the socket locks are acquired. LIKWID sets up the socket locks because the energy counters as well as all Uncore counters are socket-specific, not core-specific. The first initialized CPU normally gets the lock, so in this case both MPI processes set the lock to the first CPU on socket 0 (commonly CPU 0). Now we pin our threads and execute the MarkerAPI calls. The start and stop calls executed by a thread measure only for the currently executing CPU, so in our second process this is the second half of the socket and never the first CPU on the socket. Since only the first CPU is able to read the energy counters, the second process will never read them. This is only done by the first process running on the first half. Finally, the Marker API writes it intermediate result file and if the second process comes last, it will only write 0 for the energy because it has never read the counter. A proper call for this purpose is:
mpiexec -np 1 likwid-perfctr -c S0:0-9 -g ENERGY -m <mpi-program> : -np 1 likwid-perfctr -c S0:10-19 -g ENERGY -m <mpi-program>
Or use likwid-mpirun:
likwid-mpirun -pin S0:0-9_S0:10-19 -g ENERGY -m <mpi-program>
Although the option is named pin, you can pin your application yourself. The processes and threads are only measuring on the selected CPUs, if you pin outside of this range, there won't be any results for the process/thread. If you need the CPUs for the thread, you can use the environment variable LIKWID_THREADS
.
-
Applications
-
Config files
-
Daemons
-
Architectures
- Available counter options
- AMD
- Intel
- Intel Atom
- Intel Pentium M
- Intel Core2
- Intel Nehalem
- Intel NehalemEX
- Intel Westmere
- Intel WestmereEX
- Intel Xeon Phi (KNC)
- Intel Silvermont & Airmont
- Intel Goldmont
- Intel SandyBridge
- Intel SandyBridge EP/EN
- Intel IvyBridge
- Intel IvyBridge EP/EN/EX
- Intel Haswell
- Intel Haswell EP/EN/EX
- Intel Broadwell
- Intel Broadwell D
- Intel Broadwell EP
- Intel Skylake
- Intel Coffeelake
- Intel Kabylake
- Intel Xeon Phi (KNL)
- Intel Skylake X
- Intel Cascadelake SP/AP
- Intel Tigerlake
- Intel Icelake
- Intel Icelake X
- Intel SappireRapids
- Intel GraniteRapids
- Intel SierraForrest
- ARM
- POWER
-
Tutorials
-
Miscellaneous
-
Contributing