4. OSU Microbenchmarks

4.1. Purpose

The OSU Microbenchmarks (OMB) are widely used to measure and evaluate the performance of MPI operations for point-to-point, multi-pair, collective, and one-sided communication.

4.2. Characteristics

4.2.1. Problem

The OSU benchmarks are a suite of microbenchmarks designed to measure network characteristics on HPC systems.

4.2.2. Run Rules

N/A

4.3. Building

Before configuring, make sure your CC and CXX environment variables are set to an MPI compiler or compiler wrapper. On most systems this will look like:

export CC=mpicc CXX=mpicxx

On systems with vendor-provided wrappers it may look different. For example, on HPE Cray systems:

export CC=cc CXX=CC

Then build and install the benchmarks:

./configure --prefix=$INSTALL_DIR
make -j
make -j install

On GPU-enabled systems, also add these flags to the configure line:

--enable-cuda
--with-cuda-include=/path/to/cuda/include
--with-cuda-libpath=/path/to/cuda/lib
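Putting the pieces together, a CUDA-enabled build might look like the following. This is a sketch, not a definitive recipe: the CUDA paths are the placeholders from the text above, and the `DRY=echo` guard makes the script print the commands instead of running them (set `DRY=` to execute).

```shell
# Dry-run sketch of a CUDA-enabled OMB build. The CUDA include/lib paths are
# placeholders; set DRY= (empty) to execute the commands instead of printing.
DRY=echo
export CC=mpicc CXX=mpicxx

$DRY ./configure --prefix="$INSTALL_DIR" \
                 --enable-cuda \
                 --with-cuda-include=/path/to/cuda/include \
                 --with-cuda-libpath=/path/to/cuda/lib
$DRY make -j
$DRY make -j install
```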

4.4. Running

For any GPU-enabled system, please also run the GPU (device-buffer) variants of the following benchmarks.

Table 4.6 OSU Microbenchmark Tests

Program              Description                       Msg Size    Num Nodes    Rank Config
-------------------  --------------------------------  ----------  -----------  -------------------------------------------
osu_latency          P2P latency                       8 B         2            2 tests per node: longest path (worst case);
                                                                                shortest path (best case)
osu_bibw             P2P bi-directional bandwidth      16 KB       2            1 per node
osu_mbw_mr           P2P multi-BW and message rate     16 KB       2            Host-to-host (two tests): 1 per NIC;
                                                                                1 per core
                                                                                Device-to-device (two tests): 1 per NIC;
                                                                                1 per accelerator
osu_get_acc_latency  P2P one-sided accumulate latency  8 B         2            1 per node
osu_get              Get latency                       8 B         2            1 per node
osu_put              Put latency                       8 B         2            1 per node
osu_barrier          Barrier time                      N/A         full-system  Two tests: 1 per physical core;
                                                                                1 per GPU/accelerator
osu_ibarrier         Async-barrier time                N/A         full-system  Two tests: 1 per physical core;
                                                                                1 per GPU/accelerator
osu_allreduce        All-reduce latency                8 B, 16 MB  full-system  Two tests: 1 per physical core;
                                                                                1 per GPU/accelerator
osu_alltoall         All-to-all latency                8 B         full-system  Two tests: 1 per physical core;
                                                                                1 per GPU/accelerator
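As a concrete illustration, the point-to-point tests above might be launched as follows. This is a dry-run sketch: the `mpirun` flags and the `libexec/` install layout are assumptions to adapt to your scheduler and installation (set `DRY=` to actually execute). With a CUDA-enabled build, OMB's point-to-point benchmarks accept trailing H/D arguments selecting host or device buffers, which is how the GPU variants are run.

```shell
# Dry-run sketch of launching the point-to-point tests (set DRY= to execute).
# The libexec install layout and the mpirun flags are assumptions.
DRY=echo
OMB="$INSTALL_DIR/libexec/osu-micro-benchmarks/mpi"

# Host-to-host latency, one rank on each of two nodes:
$DRY mpirun -np 2 --map-by node "$OMB/pt2pt/osu_latency"

# Trailing H/D arguments choose host or device buffers; D D runs the
# device-to-device (GPU) variant:
$DRY mpirun -np 2 --map-by node "$OMB/pt2pt/osu_latency" D D
$DRY mpirun -np 2 --map-by node "$OMB/pt2pt/osu_bibw" D D
```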

4.5. Example Results

Results for the OSU Microbenchmarks are provided for the following systems:

4.5.1. Crossroads

Table 4.7 OSU Microbenchmark Results on Crossroads

Test                 Ranks                Msg Size    Num Nodes    Result
-------------------  -------------------  ----------  -----------  ------------------------
osu_latency          1 per node           8 B         2            1.61 us
osu_bibw             1 per node           1 MB        2            45307.17 MB/s
osu_mbw_mr           1 per NIC            16 KB       2            49656.45 MB/s
osu_mbw_mr           1 per core           16 KB       2            45198.46 MB/s
osu_get_acc_latency  1 per node           8 B         2            10.85 us
osu_get              1 per node           8 B         2            3.59 us
osu_put              1 per node           8 B         2            4.87 us
osu_barrier          1 per physical core  N/A         full-system  550.66 us
osu_ibarrier         1 per physical core  N/A         full-system  4802.82 us
osu_allreduce        1 per physical core  8 B, 16 MB  full-system  345.55 us, 2477365.95 us
osu_alltoall         1 per node           8 B         full-system  1954.35 us
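The figures above are read directly from each benchmark's output table. As a small sketch of that step, the 8-byte latency can be pulled out with awk; the two-column sample output used here assumes the typical OMB layout (a commented header followed by size/value rows).

```shell
# Sketch: extract the 8 B latency from (sample) osu_latency output.
# The output layout is an assumption based on a typical OMB run.
sample_output='# OSU MPI Latency Test
# Size          Latency (us)
8                       1.61
16                      1.70'

# Print the second column of the row whose message size is 8:
latency_8b=$(printf '%s\n' "$sample_output" | awk '$1 == "8" {print $2}')
echo "$latency_8b"
```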