MPI testing benchmarks
Intel MPI Benchmarks (IMB)
To test MPI performance on our CEs we use the
Intel® MPI Benchmarks (IMB) suite, specifically version 3.2. This suite provides a set of benchmarks intended to measure the performance of the most important MPI functions.
The benchmarks are the following:
- PingPong: the classical pattern used for measuring startup and throughput of a single message sent between two processes (a minimal sketch of this pattern is given after the list).
- PingPing: measures startup and throughput of single messages, with the crucial difference that messages are obstructed by oncoming messages.
- Sendrecv: the processes form a periodic communication chain. Each process sends to the right and receives from the left neighbor in the chain. The turnover count is 2 messages per sample (1 in, 1 out) for each process.
- Exchange: a communication pattern that often occurs in grid-splitting algorithms (boundary exchanges). The group of processes is seen as a periodic chain, and each process exchanges data with both its left and right neighbors in the chain.
- Reduce: Benchmark for the MPI_Reduce function. It reduces a vector of length L = X/sizeof(float) float items.
- Reduce_scatter: Benchmark for the MPI_Reduce_scatter function. It reduces a vector of length L = X/sizeof(float) float items.
- Allreduce: Benchmark for the MPI_Allreduce function. It reduces a vector of length L = X/sizeof(float) float items.
- Allgather: Benchmark for the MPI_Allgather function. Every process inputs X bytes and receives the gathered X*(#processes) bytes.
- Allgatherv: Functionally the same as Allgather; however, since it uses the MPI_Allgatherv function, it shows whether MPI produces overhead due to the more complicated situation compared to MPI_Allgather.
- Scatter: Benchmark for the MPI_Scatter function. The root process inputs X*(#processes) bytes (X for each process); all processes receive X bytes.
- Scatterv: Benchmark for the MPI_Scatterv function. The root process inputs X*(#processes) bytes (X for each process); all processes receive X bytes.
- Gather: Benchmark for the MPI_Gather function. All processes input X bytes, and the root process receives X*(#processes) bytes (X from each process).
- Gatherv: Benchmark for the MPI_Gatherv function. All processes input X bytes, and the root process receives X*(#processes) bytes (X from each process).
- Alltoall: Benchmark for the MPI_Alltoall function. Every process inputs X*(#processes) bytes (X for each process) and receives X*(#processes) bytes (X from each process).
- Alltoallv: Benchmark for the MPI_Alltoallv function. Every process inputs X*(#processes) bytes (X for each process) and receives X*(#processes) bytes (X from each process).
- Bcast: Benchmark for MPI_Bcast. A root process broadcasts X bytes to all.
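To make the measured pattern concrete, here is a minimal PingPong sketch in C, as referenced in the PingPong item above. It is an illustration only, not IMB code: it times a single round trip of an X-byte message between two processes (X is an arbitrary example value), whereas IMB repeats the exchange many times over a whole range of message sizes.

  /* Minimal PingPong sketch (illustration only, not IMB code).
   * Process 0 sends X bytes to process 1 and waits for them back;
   * half of the round-trip time approximates the one-way latency. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      const int X = 1024;              /* message size in bytes (example value) */
      char *buf = malloc(X);

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (size >= 2) {
          double t0 = MPI_Wtime();
          if (rank == 0) {
              MPI_Send(buf, X, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, X, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(buf, X, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(buf, X, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
          double t1 = MPI_Wtime();
          if (rank == 0)
              printf("%d bytes: %.2f usec one-way\n", X, (t1 - t0) / 2 * 1e6);
      }

      free(buf);
      MPI_Finalize();
      return 0;
  }

Compiled with mpicc and run with at least two processes (e.g. mpirun -np 2), this prints an estimate of the one-way latency for that message size.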
Running the IMB benchmark on a CE
To run the benchmarks on the CE, we submit a job with the following .jdl:
JobType = "normal";VirtualOrganisation = "ific";NodeNumber = 64;Executable = "mpi-start-wrapper.sh";Arguments = "IMB-MPI1 OPENMPI";StdOutput = "IMB-MPI1.out";StdError = "IMB-MPI1.err";InputSandbox = {"IMB-MPI1", "mpi-start-wrapper.sh" };OutputSandbox = {"IMB-MPI1.out", "IMB-MPI1.err"};Requirements = Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
Here, NodeNumber is varied to change the number of nodes (up to a maximum of 64). The IMB-MPI1 executable is built with the suite. Once the job has finished, we retrieve its output; the benchmark results can be viewed in IMB-MPI1.out.
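For reference, the numbers that end up in IMB-MPI1.out are produced by repeating each operation many times and averaging the timings. The sketch below (a simplified illustration, not IMB's actual code; the message size and repetition count are arbitrary example values) shows the general idea for a collective such as MPI_Allgather:

  /* Simplified sketch of how a collective benchmark times MPI_Allgather.
   * Not IMB source code; repetition count and message size are example values. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      const int X = 1024;      /* bytes sent by each process (example value) */
      const int reps = 1000;   /* number of repetitions to average over */
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      char *sendbuf = malloc(X);
      char *recvbuf = malloc((size_t)X * size);   /* gathers X bytes from every process */

      MPI_Barrier(MPI_COMM_WORLD);                /* synchronise before timing */
      double t0 = MPI_Wtime();
      for (int i = 0; i < reps; i++)
          MPI_Allgather(sendbuf, X, MPI_CHAR, recvbuf, X, MPI_CHAR, MPI_COMM_WORLD);
      double t_local = (MPI_Wtime() - t0) / reps;

      /* Report the average time per call across all processes */
      double t_avg;
      MPI_Reduce(&t_local, &t_avg, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
          printf("Allgather, %d bytes: %.2f usec (avg over %d processes)\n",
                 X, t_avg / size * 1e6, size);

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }

IMB additionally reports minimum, maximum and average times across processes and sweeps over a range of message sizes.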
MPI performance analysis 2010-2013
Starting from the results obtained in 2010, and using a script that generates comparative plots, we ran the benchmark again in January 2013 on the machine
ce02.ific.uv.es and obtained these results. Performance is now considerably lower than in 2010, and it degrades further as the number of nodes increases. Below are some plots for the
Allgather benchmark that show this problem for different numbers of nodes (2013 in red, 2010 in green):
--
AlvaroFernandez - 16 Jan 2013