// Overview / Tutorial / Studies / API / FAQ / Resources

perf: C++23 Performance library


Performance is not a number!

Overview

Performance-oriented library that combines the power of:
c++23, linux/perf, llvm/mca, intel/pt, gnuplot, terminal/sixel, ...

Use cases

CPU Benchmarking, Profiling, Tracing, Analysis

Requirements

  • Minimal
  • Optimal (recommended)
  • Optional

Features

Supports timing, profiling, tracing, analyzing, plotting, testing
  • time (tsc, steady_clock, cpu, thread, real, monotonic)
  • stat (perf_event_open)
  • record (perf_event_open)
  • trace (perf_event_open/intel_pt)
  • mc/mca (llvm: disassemble, llvm-mca: resource_pressure, timeline, bottleneck)
  • prof (linux-perf, llvm-xray, gperftools, intel-vtune, callgrind)
  • plot (gnuplot: hist, bar, box, ecdf, line)
Strives for simplicity, flexibility, accuracy
  • Single Header/Module - https://github.com/qlibs/perf/blob/main/perf / https://github.com/qlibs/perf/blob/main/perf.cppm

  • API / bench

    import perf;
    
    int main() {
      perf::runner bench{perf::bench::latency{}};     // what and how
    
      auto fizz_buzz = [](int n) {
        if (n % 15 == 0) {
          return "FizzBuzz";
        } else if (n % 3 == 0) {
          return "Fizz";
        } else if (n % 5 == 0) {
          return "Buzz";
        } else {
          return "Unknown";
        }
      };
    
      bench(fizz_buzz, 1);
      bench(fizz_buzz, 3);
      bench(fizz_buzz, perf::data::repeat<int>{{3,5,15}});
      bench(fizz_buzz, perf::data::uniform<int>{.min = 0, .max = 15});
    
      perf::report(bench[perf::time::steady_clock]);  // per benchmark
      perf::annotate(bench[perf::mc::assembly]);      // per instruction
      perf::plot::bar(bench[perf::stat::cycles]);     // hist, bar, box, ecdf, line
    }

    See How does it work? for more details

Enables dev and research workflows
  • dev: edit -> compile -> run -> analyze (c++ only)

  • research: edit -> compile -> run -> save(json) -> notebook/perfetto -> analyze (c++/python/browser)
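
  • A minimal, self-contained sketch of the dev loop, reusing only the API from the example above (the squaring workload is illustrative):

    import perf;

    int main() {
      perf::runner bench{perf::bench::latency{}};           // what and how

      bench([](int n) { return n * n; },                    // edit -> compile -> run
            perf::data::uniform<int>{.min = 0, .max = 100});

      perf::report(bench[perf::time::steady_clock]);        // analyze in the terminal
    }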

Performs self-verification upon compilation
  • compile-time tests are executed upon include/import (enabled by default)

  • run-time/sanity check tests can be executed by

    int main() {
      perf::test::run({.verbose = true});
    }

compile-time/run-time tests can be disabled with -DNTEST (not recommended)

Setup

Try it!
docker build -t perf .
docker run --rm --privileged -v ${PWD}:${PWD} -w ${PWD} perf \
  clang++-20 -std=c++23 -O3 --precompile perf.cppm
gh gist create --public --web hello_world.cpp
Build & Test
  • local env

    build dependencies (optional)

    apt-get install linux-tools-common
    apt-get install llvm-dev
    apt-get install libipt-dev
    apt-get install gnuplot
    pip3 install pyperf

    enable linux tracing (optional)

    .github/scripts/setup.sh --perf

    tune machine for benchmarking (recommended)

    .github/scripts/tune.sh

    import perf

    cat <<EOF > perf.cpp
      import perf;
      int main() {
        perf::test::run({.verbose = true});
      }
    EOF

    build # perf tests run at compile-time upon inclusion/import and at run-time on start-up

    clang++-20 \
      -std=c++23 \
      -O3 \
      -I. \
      -I/usr/lib/llvm-20/include \
      -lLLVM-20 \
      -lipt \
      -o perf \
      perf.cpp && \
    ./perf # runs run-time tests and sanity checks
  • docker

    build perf image

    wget https://raw.githubusercontent.com/qlibs/perf/refs/heads/main/.github/workflows/Dockerfile
    docker build -t perf .

    import perf

     cat <<EOF > perf.cpp
       import perf;
       int main() {
         perf::test::run({.verbose = true});
       }
     EOF

    build # perf tests run at compile-time upon inclusion/import and at run-time on start-up

    docker run \
      --rm \
      --privileged \
      -v ${PWD}:${PWD} \
      -w ${PWD} \
      perf \
      clang++-20 \
        -std=c++23 \
        -O3 \
        -I. \
        -I/usr/lib/llvm-20/include \
        -lLLVM-20 \
        -lipt \
        -o perf \
        perf.cpp && \
      ./perf

Tutorial

Setup
Benchmarking
Profiling / Tracing / Analysis
Miscellaneous

Studies

Retiring
Bad speculation
Frontend bound
Backend bound

API

Config
  • Macros

    /**
     * PERF version # https://semver.org
     */
    #define PERF (MAJOR, MINOR, PATCH) // ex. (1, 0, 0)
    
    /**
     * GNU # default: deduced based on `__GNUC__`
     * - 0 not compatible
     * - 1 compatible
     */
    #define PERF_GNU 0/1
    
    /**
     * Linux # default: deduced based on `__linux__` and `perf_event_open.h`
     * - 0 not supported
     * - 1 supported
     */
    #define PERF_LINUX 0/1
    
    /**
     * UEFI # default: 0
     * - 0 not supported
     * - 1 supported
     */
    #define PERF_UEFI 0/1
    
    /**
     * LLVM # default: deduced based on `llvm-dev` headers
     * - 0 not supported
     * - 1 supported
     */
    #define PERF_LLVM 0/1
    
    /**
     * Intel Processor Trace # default: deduced based on `<intel_pt.h>` header
     * - 0 not supported
     * - 1 supported
     */
    #define PERF_INTEL 0/1
    
    /**
     * tests # default: not-defined
     * - defined:     disables all compile-time, run-time tests
     * - not-defined: compile-time tests executed, run-time tests via `test::run`
     */
    #define NTEST
  • Environment variables

    /**
     * gnuplot terminal # see `gnuplot -> set terminal` # default: 'sixel'
     * - 'sixel'                  # console image # https://www.arewesixelyet.com
     * - 'wxt'                    # popup window
     * - 'dumb size 150,25 ansi'  # console with colors
     * - 'dumb size 80,25'        # console
     */
    ENV:PERF_PLOT_TERM
    
    /**
     * style # default: dark
     * - light
     * - dark
     */
    ENV:PERF_PLOT_STYLE
Synopsis

FAQ

Usage
  • How does it work?

    • unroll - show run

    • views - gnuplot via sixel

    • function - invoke

    • sampler, counter - perf_event_open

    • mca - llvm-dev.mca

    • Backward/type-safe deduction on what to bench based on usage

      bench(fn); // `time::cpu, stat::instructions (ipc), stat::cycles` will be benchmarked
      report(bench[stat::cycles, time::cpu]);
      annotate(bench[record::cycles]);
      plot(bench[stat::ipc]);
  • How to share results?

    • export.sh

    • with https://gist.github.com

      apt-get install gh
      
      gh gist create perf.html --public | awk '{print "https://htmlpreview.github.io/?" $0 "/raw"}'
  • Sharing options

    • markdown (.md) - easiest to edit, but github doesn't render embedded images
    • notebook (.ipynb) - still markdown based, and github renders embedded images
    • html (.html) - best looking but the hardest to edit; github doesn't render html, but https://htmlpreview.github.io does
  • How to use ascii based charts?

    // - 'dumb size 150,25 ansi'  # console with colors
    // - 'dumb size 80,25'        # console
    PERF_PLOT_TERM='dumb size 150,25 ansi' ./perf
  • How to use popup based charts?

    // - 'wxt'  # popup window
    PERF_PLOT_TERM='wxt' ./perf
  • How to change plots style?

    PERF_PLOT_STYLE='light' ./perf
    PERF_PLOT_STYLE='dark' ./perf # default
  • What is prevent_elision and when is it needed?

    • An optimizing compiler may elide your code completely if it's not required (has no observable side effects). prevent_elision prevents that.

      verify(perf::compiler::is_elided([] { }));
      verify(perf::compiler::is_elided([] { auto value = 4 + 2; }));
      verify(perf::compiler::is_elided([] { int i{}; i++; }));
      int i{}; // needed by the reference capture below
      verify(not perf::compiler::is_elided([&] { i++; }));
      verify(not perf::compiler::is_elided([] { static int i; i++; }));
      verify(not perf::compiler::is_elided([=] {
        int i{};
        perf::compiler::prevent_elision(i++);
      }));
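
    • A sketch of prevent_elision inside a benchmarked lambda, reusing the runner from the overview (the squaring workload is illustrative):

      import perf;

      int main() {
        perf::runner bench{perf::bench::latency{}};

        bench([](int n) {
          auto r = n * n;                      // dead code without side effects...
          perf::compiler::prevent_elision(r);  // ...so keep the result observable
        }, perf::data::uniform<int>{.min = 0, .max = 100});

        perf::report(bench[perf::time::steady_clock]);
      }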
  • How to change the assembly syntax from Intel to AT&T?

    perf::llvm llvm{.syntax = perf::arch::syntax::att };
    perf::bench::runner bench{
      {.syntax = perf::arch::syntax::att},
      perf::bench::latency,
    };
  • How to disassemble for a different platform?

    perf::llvm llvm{.triple = "x86_64-pc-linux-gnu" }; // see `llvm-llc` for details
  • How to integrate with a testing framework?

    • perf can be integrated with any unit testing framework - https://github.com/qlibs/ut
      import perf;
      import ut;
      
      int main() {
        "benchmark"_test = [] {
          // ...
        };
      }
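
    • A sketch of one way to fill in the body, reusing the runner and report from the overview (the modulo workload is illustrative):

      import perf;
      import ut;

      int main() {
        "benchmark"_test = [] {
          perf::runner bench{perf::bench::latency{}};
          bench([](int n) { return n % 15 == 0; },
                perf::data::uniform<int>{.min = 0, .max = 15});
          perf::report(bench[perf::time::steady_clock]);
        };
      }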
  • How to save images?

    gnuplot save{{.term = "svg"}};
    save.send("set output 'output.svg'");
    bar(save, results);
  • How to use perf as a C++20 module?

    clang++-20 -std=c++23 -O3 -I. --precompile perf.cppm
    clang++-20 -std=c++23 -O3 -fprebuilt-module-path=. perf.pcm perf.cpp -lLLVM-20 -lipt
  • What is required to display images on the terminal?

    • terminal with sixel support - https://www.arewesixelyet.com (note: Visual Studio Code can display images in the terminal; enable Terminal -> Enable images)
  • How to plot on a server without sixel? PERF_PLOT_TERM='dumb size 80,25' ./a.out # see gnuplot - set terminal

  • How to use gnuplot charts? sudo apt install gnuplot

  • How to write a custom profiler?

    struct my_profiler {
      constexpr auto start() { }
      constexpr auto stop() { }
      [[nodiscard]] constexpr auto operator*() const { }
    };
    static_assert(perf::prof_like<my_profiler>);
  • How to pollute cache, heap when benchmarking?

    • perf::memory::pollute_heap
    • perf::memory::flush_cacheline
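
    • A hedged sketch of how they might be used; the call shapes below are assumptions for illustration, not confirmed signatures:

      import perf;

      int main() {
        perf::runner bench{perf::bench::latency{}};
        int value{42};

        perf::memory::pollute_heap();           // assumed shape: churn the heap before measuring
        perf::memory::flush_cacheline(&value);  // assumed shape: evict `value` from the cache

        bench([](int n) { return n + 1; }, value);
        perf::report(bench[perf::time::steady_clock]);
      }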
  • Instrumentation with llvm-xray

    [[clang::xray_always_instrument]]
    void always_profile();
    
    [[clang::xray_always_instrument, clang::xray_log_args(1)]]
    void always_profile_and_log_i(int i);
    
    [[clang::xray_never_instrument]]
    void never_profile();
    # profiling threshold
    -fxray-instruction-threshold=1 # default 200 instructions
    # instrumentation info
    llvm-xray extract ./a.out --symbolize

    https://godbolt.org/z/WhsEYf9cc

  • Conditional profiling with callgrind

    prof::callgrind profiler{"example"};
    
    while (true) {
      profiler.start(); // resets profile
    
      if (should_trigger()) {
        trigger();
        profiler.stop();
        profiler.flush(); // dumps `example` profile
      }
    }
    kcachegrind callgrind.* # opens all profiles combined
  • How do perf tests work?

    #ifndef NTEST
      "demo"_suite = [] {
        "run-time and compile-time"_test = [] constexpr {
          expect(3 == sum({1, 2}));
        };
    
        "run-time only"_test = [] mutable {
          expect(std::rand() >= 0);
        };
    
        "compile-time only"_test = [] consteval {
          expect(sizeof(int) == sizeof(std::int32_t));
        };
      };
    #endif
  • How does perf compare to google.benchmark, nanobench, celero?

    • Firstly, google.benchmark, nanobench and celero are great, established libraries.

    • perf's philosophy is that performance is not a number, which leads to the following emphasis:

      • data-driven inputs (to avoid branch-prediction overfitting)
      • statistical analysis of the results
      • instruction-level and function-level views
      • latency and throughput
      • analysis, profiling and plotting
Performance
  • Common pitfalls?

    • Understand
      • what to measure (latency, throughput)
      • the state (cold, warm, data distribution)
      • how to measure (timing, counters, profiling)
      • the hardware (environment, os, setup, ...)
    • Ensure proper analysis (avoid premature conclusions)
    • Assert your expectations
    • Verify in a production-like use case/environment
    • Ensure the machine is tuned for micro-benchmarking (see #tuning)
    • Use structured and diligent methods (see #top_down_analysis)
    • Use realistic scenarios (see #setup)
    • Use statistical methods for analysis (see #stat)
    • Analyze the generated assembly (see #annotate/#mca)
    • Visualize the results (see #gnuplot)
    • Document / share the analysis (see #html)
    • Automate expectations (see #testing)
    • Enhance your understanding (see #performance)
    • Verify changes in a production-like system (see #prof) - active benchmarking
    • Measure / verify consistently (see #json)
  • Latency and Throughput?

    • latency is the time it takes for a single operation to complete (ns)
    • throughput is the total number of operations or tasks completed in a given amount of time (op/s)

    note: In single-threaded execution, if the algorithm can't be parallelized (where throughput would matter), you most likely care about latency.
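
    A sketch measuring the same function both ways; perf::bench::latency is from the overview, while perf::bench::throughput is an assumed mirror-image policy shown for illustration:

      import perf;

      int main() {
        auto work = [](int n) { return n * 2 + 1; };
        auto data = perf::data::uniform<int>{.min = 0, .max = 100};

        perf::runner latency{perf::bench::latency{}};        // ns per operation
        latency(work, data);
        perf::report(latency[perf::time::steady_clock]);

        perf::runner throughput{perf::bench::throughput{}};  // assumed policy: op/s
        throughput(work, data);
        perf::report(throughput[perf::time::steady_clock]);
      }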

  • Performance compilation flags?

    -O1                     # optimizations (level1) [0]
    -O2                     # optimizations (level1 + level2) [0]
    -O3                     # [unsafe] optimizations (level1 + level2 + level3) [0]
    -DNDEBUG                # disable asserts, etc.
    -march=native           # target the native architecture [1]
    -ffast-math             # [unsafe] faster but non-conforming math [2]
    -g                      # debug symbols
    -fno-omit-frame-pointer # keep the frame pointer in a register for functions that don't need one
    -fcf-protection=none    # [unsafe] stops emitting `endbr64` # control-flow enforcement technology (cet)
    
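    One way to verify what a flag actually buys is to annotate the generated assembly before and after a recompile (a sketch using the annotate call from the overview; the reduction loop is illustrative):

      import perf;

      int main() {
        perf::runner bench{perf::bench::latency{}};

        float v[1024]{};  // illustrative reduction input

        bench([&] {
          float r{};
          for (auto x : v) r += x;             // did this vectorize under the current flags?
          perf::compiler::prevent_elision(r);
        });

        perf::annotate(bench[perf::mc::assembly]);  // per-instruction view
      }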
  • Performance attributes?

    [[gnu::target("avx2")]]
    [[gnu::optimize("O3")]]
    [[gnu::optimize("ffast-math")]]

    note: gcc's attribute support is much more comprehensive than clang's
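
    Applied to a function, the attributes above look like this (a sketch; the dot product is illustrative):

      // per-function overrides, without changing the global flags (gcc)
      [[gnu::target("avx2")]]
      [[gnu::optimize("O3")]]
      float dot(const float* a, const float* b, int n) {
        float r{};
        for (int i = 0; i < n; ++i) r += a[i] * b[i];
        return r;
      }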

  • When to finish micro-benchmarking?

    • when the IPC you have reached won't improve any further (IPC = instructions retired / cycles; e.g. an IPC near the issue width of the core means it is saturated).
  • How to handle cases when there are no obvious bottlenecks?

    • The most likely culprit is cache utilization.
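
    A hedged first step is to count cache events; perf::stat::cache_misses below is an assumed counter name, modeled on perf::stat::cycles from the overview:

      import perf;

      int main() {
        perf::runner bench{perf::bench::latency{}};
        bench([](int n) { return n * 3; }, perf::data::uniform<int>{.min = 0, .max = 100});

        perf::report(bench[perf::stat::cache_misses]);     // assumed counter name
        perf::plot::bar(bench[perf::stat::cache_misses]);  // distribution across runs
      }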
  • How to profile in production?

    • see prof API
  • What are the different micro-benchmarking workflows?

    • [cpp] # measure and analyze from C++
      • pros: self-contained, easy iterations, supports unit testing, supports running on a server (PERF_PLOT_TERM="dumb")
      • cons: requires multiple runs for different compilers/flags
      • note: can be shared with export.sh and gh via https://gist.github.com

    • [cpp->json->notebook] # measure with C++ and export json, analyze the results in a notebook
      • pros: can be unified across different compilers/flags and production runs, power of notebooks, easy to share, can be done offline
      • cons: multiple places, longer runs, harder to change the code, harder to run on remote servers, harder to unit test expectations, requires a GUI

    • [cpp->json->CI] # measure with C++ and use the json in CI
      • used for continuous benchmarking
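
    A hypothetical sketch of the export step; the perf::json::save call is a placeholder for whatever the #json export provides, not a confirmed API:

      import perf;

      int main() {
        perf::runner bench{perf::bench::latency{}};
        bench([](int n) { return n * n; }, 42);

        // placeholder export step (see #json); CI can diff this file run-to-run
        // perf::json::save(bench, "results.json");
        perf::report(bench[perf::time::steady_clock]);
      }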
  • What is the top-down microarchitecture analysis (TMA) method?

    • TMA attributes pipeline slots to four top-level categories - retiring, bad speculation, frontend bound, backend bound (see Studies) - and drills down into whichever category dominates.

  • What is jupyter notebook and how to use it?

    • jupyter-notebook is great for analysis and for leaving a trace of it. However, C++ doesn't integrate well with jupyter-notebook (cling is often not an option), so it doesn't work well with servers (requires copying data) and it adds an extra file. perf's approach is to leave a trace of the benchmarking, with automated testing, directly from C++, so that it can be revisited in the future and explains what was measured and why.

      # apt install jupyter
      jupyter notebook --ip 0.0.0.0 --no-browser notebook.ipynb
    • Workflows

      • dev

        • pros

          • easy editing and integration with existing tooling
          • output to the console (charts with sixel)
          • easy to share via gh gist create # github can render markdown
          • can assert and verify expectations
          • can be executed in headless mode on a server without a UI
        • cons

          • harder analysis than in python
          • single compiler workflow

      • research (run from a jupyter notebook (producing json), analyze using python)

        • pros

          • infinite ways of analyzing the data, with great matplotlib support
          • can run different compilers/options
          • easy to share via gh gist create # github can render ipynb
          • can be used on kaggle (avx512) or google colab
        • cons

          • editing C++ is not that well supported by notebooks, but running it is fine
          • not that great for analyzing assembly output
          • usually requires running through a browser (there are extensions for vim/emacs/vscode)
          • might be slow with a lot of data
Troubleshooting
  • How to setup docker?

    Dockerfile

    docker build -t perf .
    docker run \
      -it \
      --privileged \
      --network=host \
      -e DISPLAY=${DISPLAY} \
      -v ${PWD}:${PWD} \
      -w ${PWD} \
      perf
  • How to setup linux performance counters?

    setup.sh

    sudo mount -o remount,mode=755 /sys/kernel/debug
    sudo mount -o remount,mode=755 /sys/kernel/debug/tracing
    sudo chown `whoami` /sys/kernel/debug/tracing/uprobe_events
    sudo chmod a+rw /sys/kernel/debug/tracing/uprobe_events
    echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
    echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
    echo 1000 | sudo tee /proc/sys/kernel/perf_event_max_sample_rate
  • How to setup rdpmc?

    rdpmc

    echo 2 | sudo tee /sys/devices/cpu_core/rdpmc
  • How to find out which performance events are supported by the cpu?

    perf list # https://perfmon-events.intel.com

  • How to reduce noise when benchmarking?

    Linux

    • pyperf (https://pyperf.readthedocs.io/en/latest/system.html)

      sudo pyperf system tune
      sudo pyperf system show
      sudo pyperf system reset
    • tune.sh

      # Set Process CPU Affinity (apt install util-linux)
      # note: `perf::sys::thread::affinity` can be used instead (see the sketch after this section)
      taskset -c 0 ./a.out
      
      # Set Process Scheduling Priority (apt install coreutils)
      # note: `perf::sys::thread::priority` can be used instead
      nice -n -20 taskset -c 0 ./a.out # -20..19 (most..least favorable to the process)
      
      # Disable CPU Frequency Scaling (apt install cpufrequtils)
      sudo cpupower frequency-set --governor performance
      # cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
      # Disable Address Space Randomization
      echo 0 > /proc/sys/kernel/randomize_va_space
      
      # Disable Processor Boosting
      echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
      
      # Disable Turbo Mode
      echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
      
      # Disable Hyperthreading/SMT
      echo off | sudo tee /sys/devices/system/cpu/smt/control
      
      # Restrict memory to a single socket
      numactl -m 0 -N 0 ./a.out
      
      # Enable Huge Pages
      sudo numactl --cpunodebind=1 --membind=1 hugeadm \
        --obey-mempolicy --pool-pages-min=1G:64
      sudo hugeadm --create-mounts
    • boot / grub

      # Enable Kernel Mode Task-Isolation (https://lwn.net/Articles/816298)
      isolcpus=<cpu number>,...,<cpu number>
      # cat /sys/devices/system/cpu/isolated
      
      # Disable P-states and C-states
      idle=poll intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=1
      # cat /sys/devices/system/cpu/intel_pstate/status
      
      # Disable NMI watchdog (cat /proc/sys/kernel/nmi_watchdog)
      nmi_watchdog=0
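
    The two helpers referenced in the tune.sh notes can do the pinning from inside the process; the call shapes below are assumptions for illustration, not confirmed signatures:

      import perf;

      int main() {
        perf::sys::thread::affinity(0);    // assumed shape: pin to core 0, like `taskset -c 0`
        perf::sys::thread::priority(-20);  // assumed shape: like `nice -n -20`

        // ... run benchmarks ...
      }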

    UEFI

    int efi_main(void*, uefi::efi_system_table* system_table) {
      ctrl::cpu::perf::init();
    
      const auto b1 = ctrl::cpu::perf::branch_instructions();
      const auto c1 = ctrl::cpu::perf::cache_misses();
      u64 checksum{};
      for (auto i = 0u; i < (1u << 16u); ++i) {
        checksum ^= i;
      }
      const auto b2 = ctrl::cpu::perf::branch_instructions();
      const auto c2 = ctrl::cpu::perf::cache_misses();
    
      uefi::println(system_table->out, L"results");
      uefi::println(system_table->out, b2 - b1);
      uefi::println(system_table->out, c2 - c1);
    
      return uefi::run();
    }
    clang++-18 -std=c++20 -I. -target x86_64-pc-win32-coff -fno-stack-protector -fshort-wchar -mno-red-zone -c uefi.cpp -o uefi.o
    lld-link-18 -filealign:16 -subsystem:efi_application -nodefaultlib -dll -entry:efi_main uefi.o -out:BOOTX64.EFI
    mkdir -p efi/boot
    cp BOOTX64.EFI /usr/share/ovmf/OVMF.fd efi/boot
    qemu-system-x86_64 \
      -drive if=pflash,format=raw,file=efi/boot/OVMF.fd \
      -drive format=raw,file=fat:rw:. \
      -net none

    Requires a reboot. Gives more control, such as disabling caches, ...

Resources

Specs
Feeds
Videos
Tools

License

MIT / Apache2:LLVM*