6. DGEMM

6.1. Purpose

The DGEMM benchmark measures the sustained floating-point rate of a single node.

Source: https://github.com/lanl/benchmarks/tree/main/microbenchmarks/dgemm

6.2. Characteristics

DGEMM is available in the LANL Benchmarks repository under microbenchmarks/dgemm.

6.2.1. Problem

\[\mathbf{C} = \alpha \mathbf{A} \mathbf{B} + \beta \mathbf{C}\]

where \(\mathbf{A}\), \(\mathbf{B}\), and \(\mathbf{C}\) are square \(N \times N\) matrices and \(\alpha\) and \(\beta\) are scalars. This operation is repeated \(R\) times.
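As a minimal reference sketch of the operation above (not the benchmark's implementation, which is expected to call an optimized BLAS routine), the update can be written as a triple loop in pure Python. The function name `dgemm_naive` and the list-of-rows matrix layout are assumptions made for illustration only.

```python
def dgemm_naive(alpha, a, b, beta, c, repeats=1):
    """Apply C = alpha*A*B + beta*C in place, `repeats` times.

    a, b, and c are square N x N matrices stored as lists of rows.
    Hypothetical helper for illustration; far too slow for benchmarking.
    """
    n = len(c)
    for _ in range(repeats):
        for i in range(n):
            for j in range(n):
                # Inner product of row i of A with column j of B.
                acc = 0.0
                for k in range(n):
                    acc += a[i][k] * b[k][j]
                # Scale the product and accumulate into the existing C entry.
                c[i][j] = alpha * acc + beta * c[i][j]
    return c
```

Because each repeat feeds the previous \(\mathbf{C}\) back into the update, the result after \(R\) repeats depends on \(\beta\) as well as on \(\alpha\).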

6.2.2. Figure of Merit

The figure of merit is the GFLOP/s rate reported at the end of the run:

GFLOP/s rate:         <FOM> GF/s
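The reported rate relates the floating-point work performed to the elapsed time. The sketch below assumes the conventional count of \(2N^3\) operations per matrix multiplication; the benchmark source may count the scaling and addition terms slightly differently, and the function name `dgemm_gflops` is hypothetical.

```python
def dgemm_gflops(n, repeats, seconds):
    """Estimate the GFLOP/s rate for `repeats` N x N DGEMM calls.

    Assumes 2*N^3 floating-point operations per call (one multiply and
    one add per inner-product term); lower-order terms are ignored.
    """
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    flops = 2.0 * n**3 * repeats
    return flops / seconds / 1.0e9
```

For example, `dgemm_gflops(1000, 1, 2.0)` corresponds to \(2 \times 10^9\) operations in two seconds, that is, 1 GFLOP/s.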

6.2.3. Run Rules

Vendors are permitted to change the source code in the region marked in the source.
Optimized BLAS/DGEMM routines are permitted (and encouraged) to demonstrate the highest performance.
Vendors may modify the Makefile(s) as required.

6.3. Building

Load the compiler, then create and enter a build directory.

cmake -DBLAS_NAME=<blas library name> -DBLAS_ROOT=<root path to blas library> ..
make

Current BLAS_NAME options are mkl, cblas (openblas), essl, libsci, libsci_acc, cublas, cublasxt, or the raw-coded (OpenMP-threaded) dgemm. The BLAS_NAME argument is required. If the headers or libraries are not found, provide BLAS_LIB_DIR or BLAS_INCLUDE_DIR to cmake. If using a different BLAS library, modify the C source file to use the correct header and dgemm call.

6.4. Running

DGEMM uses OpenMP but does not use MPI.

Set the number of OpenMP threads and the other OpenMP settings with export. The following were used for the Crossroads system (see ATS-3/Crossroads).

export OPENBLAS_NUM_THREADS=<nthreads> # MKL inherits from OMP_NUM_THREADS
export OMP_NUM_THREADS=<nthreads>
export OMP_PLACES=cores
export OMP_PROC_BIND=close

./mt-dgemm <N> <R> <alpha> <beta>

These values default to: \(N=256, R=8, \alpha=1.0, \beta=1.0\)

These inputs are subject to the conditions \(N>128, R>4\).

These are positional arguments, so, for instance, R cannot be set without setting N.
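The positional-argument rules above can be sketched as follows. This is a hypothetical helper (`resolve_args` is not part of the benchmark) that mirrors the documented defaults and conditions rather than the benchmark's own parser.

```python
def resolve_args(argv):
    """Resolve positional <N> <R> <alpha> <beta> with the documented defaults.

    `argv` holds only the arguments after the program name. Hypothetical
    helper; it follows the documented rules, not the benchmark's parser.
    """
    defaults = [256, 8, 1.0, 1.0]  # N, R, alpha, beta
    casts = [int, int, float, float]
    if len(argv) > len(defaults):
        raise ValueError("expected at most four arguments")
    values = [cast(arg) for cast, arg in zip(casts, argv)]
    # Arguments that were not supplied fall back to their defaults.
    values += defaults[len(values):]
    n, r, alpha, beta = values
    if n <= 128 or r <= 4:
        raise ValueError("inputs must satisfy N > 128 and R > 4")
    return n, r, alpha, beta
```

Under these assumptions, `resolve_args(["2500", "500"])` sets \(N\) and \(R\) and leaves \(\alpha\) and \(\beta\) at 1.0, while there is no way to supply \(R\) without also supplying \(N\).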

6.5. Example Results

Results from DGEMM are provided on the following systems:

Crossroads (see ATS-3/Crossroads)

6.5.1. Crossroads

This test was built with the Intel 2023.1.0 compiler using the crayOS compiler wrapper, with \(N=2500, R=500, \alpha=1.0, \beta=1.0\). The 110-core run (cores are used as OpenMP threads) avoids the cores dedicated to the OS and takes roughly an hour. All four runs on Rocinante HBM take 5-6 hours.

Table 6.5 DGEMM microbenchmark FLOPs measurement
No. Cores	GFLOP/s
32	2176.5
56	3460.0
88	4949.1
110	4850.5
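One way to read these results is to divide each rate by its core count. The short sketch below does that arithmetic on the tabulated values; the names `results` and `per_core_rate` are illustrative and are not part of the benchmark.

```python
# Cores -> GFLOP/s, copied from the Crossroads table.
results = {32: 2176.5, 56: 3460.0, 88: 4949.1, 110: 4850.5}


def per_core_rate(table):
    """Return {cores: GFLOP/s per core} for each row of the table."""
    return {cores: rate / cores for cores, rate in table.items()}


for cores, rate in per_core_rate(results).items():
    print(f"{cores:4d} cores: {rate:6.2f} GFLOP/s per core")
```

The per-core rate falls as cores are added, and the aggregate rate at 110 cores is slightly below the rate at 88 cores.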