BlockCRS Benchmark

This benchmark measures the performance of Tpetra::BlockCrsMatrix in a realistic application context. The code uses a 7-point stencil operator to mimic a finite-volume CFD code. The problem domain is a 3D cube distributed over MPI processes. Internally, the code exploits node-level parallelism using Kokkos. The benchmark measures the following performance features:

local/global graph construction
local/global block CRS matrix and multivector fill
block CRS matrix-vector multiplication
equivalent flat scalar matrix-vector multiplication

This benchmark provides a baseline performance of the current Tpetra::BlockCrsMatrix implementation.
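To give a feel for the sizes involved when comparing the block and flat scalar matrix-vector products, here is a minimal sketch that estimates the matrix dimensions for a given grid and block size. The helper name `blockcrs_sizes` is hypothetical and not part of the benchmark. It assumes non-periodic boundaries and a stencil that couples each cell only to itself and its face neighbors; the actual benchmark may differ.

```shell
#!/bin/bash
# Hypothetical helper (not part of the benchmark): estimate matrix sizes
# for a 7-point stencil on an ni x nj x nk grid with bs x bs blocks.
# Assumption: non-periodic boundaries, face-neighbor coupling only.
blockcrs_sizes() {
  local ni=$1 nj=$2 nk=$3 bs=$4
  local cells=$(( ni * nj * nk ))
  # Each cell has 7 stencil entries; every boundary face removes one.
  local block_nnz=$(( 7 * cells - 2 * (ni * nj + nj * nk + ni * nk) ))
  echo "block_rows=${cells} block_nnz=${block_nnz} scalar_rows=$(( cells * bs )) scalar_nnz=$(( block_nnz * bs * bs ))"
}

# Benchmark defaults: 2 x 2 x 2 cells, block size 5.
blockcrs_sizes 2 2 2 5
# prints: block_rows=8 block_nnz=32 scalar_rows=40 scalar_nnz=800
```

Under the same assumptions, the 32 x 32 x 32 grid with block size 5 used in the runs further down gives 32768 block rows, 223232 nonzero blocks, and 5580800 scalar nonzeros.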

CMake setup

In this section, we show how to configure Trilinos for Intel and NVIDIA GPU architectures. First we show the base configuration that is common to our target architectures; then we explain the CMake variables and setup that are customized for each target architecture.

CMake base configure

#!/bin/bash  

USE_CUDA=OFF  # ON if GPU
USE_OPENMP=ON 

EXAMPLE=ON
TEST=ON

BUILD_TYPE=RELEASE  # or DEBUG
TRILINOS_DIR=/your/trilinos/source/directory
INSTALL_DIR=/your/trilinos/install/directory

# Clear any previous CMake configuration in this build directory.
rm -rf CMakeCache.txt CMakeFiles
cmake \
    -D BUILD_SHARED_LIBS:BOOL=OFF \
    -D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
    -D Trilinos_ENABLE_INSTALL_CMAKE_CONFIG_FILES:BOOL=ON \
    -D Trilinos_ENABLE_EXAMPLES:BOOL=${EXAMPLE} \
    -D Trilinos_ENABLE_TESTS:BOOL=${TEST} \
    -D Trilinos_ENABLE_Fortran:BOOL=OFF \
    -D Trilinos_ENABLE_KokkosCore:BOOL=ON \
    -D Trilinos_ENABLE_KokkosAlgorithms:BOOL=ON \
    -D Trilinos_ENABLE_ALL_PACKAGES:BOOL=OFF \
    -D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
    -D Trilinos_ENABLE_Tpetra:BOOL=ON \
    -D Teuchos_ENABLE_LONG_LONG_INT:BOOL=OFF \
    -D CMAKE_BUILD_TYPE:STRING=${BUILD_TYPE} \
    -D CMAKE_CXX_COMPILER:FILEPATH="mpicxx" \
    -D CMAKE_VERBOSE_MAKEFILE:BOOL=OFF \
    -D CMAKE_SKIP_RULE_DEPENDENCY=ON \
    -D CMAKE_INSTALL_PREFIX:PATH=${INSTALL_DIR} \
    -D TPL_ENABLE_GLM=OFF \
    -D TPL_ENABLE_MPI:BOOL=ON \
    -D TPL_ENABLE_LAPACK:BOOL=ON \
    -D TPL_ENABLE_BLAS:BOOL=ON \
    -D Trilinos_ENABLE_OpenMP=${USE_OPENMP} \
    -D Kokkos_ENABLE_OpenMP:BOOL=${USE_OPENMP} \
    -D Kokkos_ENABLE_TESTS:BOOL=ON \
    -D TPL_ENABLE_CUDA:BOOL=${USE_CUDA} \
    -D TPL_ENABLE_CUSPARSE:BOOL=${USE_CUDA} \
    -D Kokkos_ENABLE_Cuda:BOOL=${USE_CUDA} \
    -D Kokkos_ENABLE_Cuda_UVM:BOOL=${USE_CUDA} \
    "${TRILINOS_DIR}"

Architecture-specific CMake setup

Specify KOKKOS_ARCH

  -D KOKKOS_ARCH="[OPT]", available options are 
               [AMD]
                 AMDAVX         = AMD CPU
               [ARM]
                 ARMv80         = ARMv8.0 Compatible CPU
                 ARMv81         = ARMv8.1 Compatible CPU
                 ARMv8-ThunderX = ARMv8 Cavium ThunderX CPU
               [IBM]
                 Power7         = IBM POWER7 and POWER7+ CPUs
                 Power8         = IBM POWER8 CPUs
                 Power9         = IBM POWER9 CPUs
               [Intel]
                 WSM            = Intel Westmere CPUs
                 SNB            = Intel Sandy/Ivy Bridge CPUs
                 HSW            = Intel Haswell CPUs
                 BDW            = Intel Broadwell Xeon E-class CPUs
                 SKX            = Intel Skylake Xeon E-class HPC CPUs (AVX512)
               [Intel Xeon Phi]
                 KNC            = Intel Knights Corner Xeon Phi
                 KNL            = Intel Knights Landing Xeon Phi
               [NVIDIA]
                 Kepler30       = NVIDIA Kepler generation CC 3.0
                 Kepler32       = NVIDIA Kepler generation CC 3.2
                 Kepler35       = NVIDIA Kepler generation CC 3.5
                 Kepler37       = NVIDIA Kepler generation CC 3.7
                 Maxwell50      = NVIDIA Maxwell generation CC 5.0
                 Maxwell52      = NVIDIA Maxwell generation CC 5.2
                 Maxwell53      = NVIDIA Maxwell generation CC 5.3
                 Pascal60       = NVIDIA Pascal generation CC 6.0
                 Pascal61       = NVIDIA Pascal generation CC 6.1
                 Volta70        = NVIDIA Volta generation CC 7.0
                 Volta72        = NVIDIA Volta generation CC 7.2
   For heterogeneous architectures, list the architectures separated by commas,
   e.g., "Power8,Pascal60".
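As an illustration of the comma-separated form, the following sketch builds the KOKKOS_ARCH flag from a host architecture and an optional device architecture. The function name `kokkos_arch_flag` is hypothetical; only the architecture keywords come from the list above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Trilinos): build the KOKKOS_ARCH flag.
# $1 = host architecture keyword, $2 = optional device architecture keyword.
kokkos_arch_flag() {
  if [ -n "$2" ]; then
    echo "-D KOKKOS_ARCH=$1,$2"
  else
    echo "-D KOKKOS_ARCH=$1"
  fi
}

kokkos_arch_flag SKX              # prints: -D KOKKOS_ARCH=SKX
kokkos_arch_flag Power8 Pascal60  # prints: -D KOKKOS_ARCH=Power8,Pascal60
```

The resulting flag would be added to the cmake invocation in the base configuration script.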

Specify LAPACK and BLAS libraries

  -D TPL_LAPACK_LIBRARIES:FILEPATH="-llapack" or "-mkl" (Intel compiler)
  -D TPL_BLAS_LIBRARIES:FILEPATH="-lblas" or "-mkl" (Intel compiler)

   If your BLAS and LAPACK libraries are located in a non-standard path,
   append that path to LD_LIBRARY_PATH.

For CUDA, set the CUDA-specific environment variables as follows.

export OMPI_CXX=${TRILINOS_DIR}/packages/kokkos/bin/nvcc_wrapper                                              
export CUDA_LAUNCH_BLOCKING=1                                                                                 
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1

Path to benchmark source:

Trilinos/packages/tpetra/core/example/BlockCrs

Path to benchmark executable:

$BUILD/packages/tpetra/core/example/BlockCrs/TpetraCore_BlockCrsPerfTest.exe

Command line options and default values

$ ./TpetraCore_BlockCrsPerfTest.exe --help
Usage: ./TpetraCore_BlockCrsPerfTest.exe [options]
  options:
  --help                               Prints this help message
  --pause-for-debugging                Pauses for user input to allow attaching a debugger
  --echo-command-line                  Echo the command-line but continue as normal
  --num-elements-i       int           Number of cells in the I dimension.
                                       (default: --num-elements-i=2)
  --num-elements-j       int           Number of cells in the J dimension.
                                       (default: --num-elements-j=2)
  --num-elements-k       int           Number of cells in the K dimension.
                                       (default: --num-elements-k=2)
  --num-procs-i          int           Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
                                       (default: --num-procs-i=1)
  --num-procs-j          int           Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
                                       (default: --num-procs-j=1)
  --num-procs-k          int           Processor grid of (npi,npj,npk); npi*npj*npk should be equal to the number of MPI ranks.
                                       (default: --num-procs-k=1)
  --blocksize            int           Block size. The # of DOFs coupled in a multiphysics flow problem.
                                       (default: --blocksize=5)
  --nrhs                 int           Number of right hand sides to solve for.
                                       (default: --nrhs=1)
  --repeat               int           Number of iterations of matvec operations to measure performance.
                                       (default: --repeat=100)
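Since the help text requires npi*npj*npk to equal the number of MPI ranks, a small pre-flight check can catch a mismatched processor grid before launching a job. This is a minimal sketch; `check_proc_grid` is a hypothetical helper, not something the benchmark provides.

```shell
#!/bin/bash
# Hypothetical helper (not part of the benchmark): verify that the
# processor grid (npi, npj, npk) matches the number of MPI ranks.
check_proc_grid() {
  local nranks=$1 npi=$2 npj=$3 npk=$4
  local needed=$(( npi * npj * npk ))
  if [ "${needed}" -ne "${nranks}" ]; then
    echo "error: ${npi}x${npj}x${npk} grid needs ${needed} ranks, got ${nranks}" >&2
    return 1
  fi
  return 0
}

# A 4 x 8 x 1 grid is consistent with 32 ranks.
check_proc_grid 32 4 8 1 && echo "grid OK"
```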

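Suggested scaling study: a strong-scaling study keeps the global problem size fixed while increasing the number of threads or MPI ranks; a weak-scaling study keeps the problem size per MPI rank fixed while increasing the number of ranks.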

Single Node OpenMP Strong Scale

OMP_NUM_THREADS=4 OMP_PROC_BIND=spread OMP_PLACES=threads \
  ./TpetraCore_BlockCrsPerfTest.exe \
  --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 \
  --blocksize=5 --nrhs=1 \
  --repeat=20
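A strong-scaling sweep repeats the run above with the problem size held fixed and only the thread count changed. The sketch below prints one command line per thread count rather than running anything; the function name `strong_scale_cmds` and the thread counts 1, 2, 4, 8 are illustrative assumptions and should be adapted to the node.

```shell
#!/bin/bash
# Hypothetical helper: print one benchmark command per thread count,
# keeping the 32 x 32 x 32 problem size fixed (strong scaling).
strong_scale_cmds() {
  local t
  for t in "$@"; do
    echo "OMP_NUM_THREADS=${t} OMP_PROC_BIND=spread OMP_PLACES=threads" \
         "./TpetraCore_BlockCrsPerfTest.exe" \
         "--num-elements-i=32 --num-elements-j=32 --num-elements-k=32" \
         "--blocksize=5 --nrhs=1 --repeat=20"
  done
}

# Dry run: review the printed commands before executing them.
strong_scale_cmds 1 2 4 8
```

Once the printed commands look right, they can be piped to `bash` to execute the sweep.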

Single Node CUDA

OMP_NUM_THREADS=1 \
  ./TpetraCore_BlockCrsPerfTest.exe \
  --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 \
  --blocksize=5 --nrhs=1 \
  --repeat=20

Multi Node Weak Scale

OMP_NUM_THREADS=2 OMP_PROC_BIND=spread OMP_PLACES=threads mpirun -np 32 \
  ./TpetraCore_BlockCrsPerfTest.exe \
  --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 \
  --num-procs-i=4 --num-procs-j=8 --num-procs-k=1 \
  --blocksize=5 --nrhs=1 \
  --repeat=20
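For a weak-scaling series, the global grid has to grow with the processor grid so that each rank keeps the same number of cells. The sketch below prints such a command line. It assumes that the --num-elements-* options give global cell counts that are divided among the ranks; this is not stated in the help text and should be checked against the benchmark source. The helper name `weak_scale_cmd` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print an mpirun command for a weak-scaling run.
# $1 = cells per rank in each dimension, $2..$4 = processor grid.
# Assumption: --num-elements-* are GLOBAL cell counts.
weak_scale_cmd() {
  local per_rank=$1 npi=$2 npj=$3 npk=$4
  echo "mpirun -np $(( npi * npj * npk )) ./TpetraCore_BlockCrsPerfTest.exe" \
       "--num-elements-i=$(( per_rank * npi ))" \
       "--num-elements-j=$(( per_rank * npj ))" \
       "--num-elements-k=$(( per_rank * npk ))" \
       "--num-procs-i=${npi} --num-procs-j=${npj} --num-procs-k=${npk}" \
       "--blocksize=5 --nrhs=1 --repeat=20"
}

# 32^3 cells per rank on 1 rank and on a 2 x 2 x 2 grid (dry run).
weak_scale_cmd 32 1 1 1
weak_scale_cmd 32 2 2 2
```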

Preliminary results:

Platform used:
Summary or screenshot:

Copyright © Trilinos a Series of LF Projects, LLC
For web site terms of use, trademark policy and other project policies please see https://lfprojects.org.
