CUTracer

A comprehensive dynamic binary instrumentation tool for tracing and analyzing CUDA kernel instructions. Built on top of NVIDIA's NVBit framework.

Overview

CUTracer is a powerful debugging and analysis tool that monitors CUDA kernel execution at the instruction level. It provides detailed tracing of GPU operations, enabling developers to:

Capture and analyze instruction execution patterns
Monitor register values and memory operations
Detect execution anomalies including deadlocks and hangs
Gain deeper insights into kernel behavior for performance optimization

Features

Dynamic binary instrumentation with no need to modify or recompile the target application
Comprehensive instruction-level tracing for all CUDA kernels
Detailed execution information for debugging and analysis:
- Register values and changes
- Instruction program counters
- Warp and thread block execution patterns
- Memory access patterns
Deadlock detection with configurable timeouts
Ability to filter instrumentation by function name patterns
Flexible logging options for customizing output
Minimal impact on application behavior

Requirements

All requirements are aligned with NVBit.

NVIDIA GPU with compute capability >= 3.5 and <= 9.2
Linux operating system
CUDA version >= 12.0
CUDA driver version <= 555.xx
GCC version >= 5.3.0 for x86_64 or >= 7.4.0 for aarch64

Installation

Clone the repository
Download third-party dependencies:

./install_third_party.sh

Build the tool:
```
make
```

Usage

To use CUTracer, simply preload it before launching your CUDA application:

CUDA_INJECTION64_PATH=/path/to/lib/cutracer.so ./your_cuda_application

Configuration

CUTracer can be configured using environment variables:

Variable	Default	Description
DEADLOCK_TIMEOUT	10	Timeout in seconds to detect potential deadlocks
LOG_TO_STDOUT	1	Log to stdout instead of files (1=enabled, 0=disabled)
STORE_LAST_TRACES_ONLY	0	Only store the last trace for each warp in memory (1=enabled, 0=disabled)
DUMP_INTERMEDIA_TRACE	0	Dump intermediate trace data to stdout (1=enabled, 0=disabled)
DUMP_INTERMEDIA_TRACE_TIMEOUT	0	Timeout in seconds for intermediate trace dumping (0=unlimited)
FUNC_NAME_FILTER	-	Comma-separated list of function name patterns to instrument
INSTRS	-	Instruction ranges to instrument, in format: "1-10,13,23-100". Replaces INSTR_BEGIN/INSTR_END
INSTR_BEGIN	0	Beginning of the instruction interval to instrument (used if INSTRS is not set)
INSTR_END	UINT32_MAX	End of the instruction interval to instrument (used if INSTRS is not set)
TOOL_VERBOSE	0	Enable verbosity inside the tool
ALLOW_REINSTRUMENT	0	Allow instrumenting the same kernel multiple times (1=enabled, 0=disabled)
KERNEL_ITER_BEGIN	0	Start instrumenting from this kernel iteration (0=first iteration)
SINGLE_KERNEL_TRACE	0	When ALLOW_REINSTRUMENT is enabled, dump trace files using only kernel name without iteration number, overwriting previous traces (1=enabled, 0=disabled)

Examples

Basic Example

DEADLOCK_TIMEOUT=30 FUNC_NAME_FILTER=kernel_a,kernel_b CUDA_INJECTION64_PATH=./lib/cutracer.so ./my_cuda_app

This will:

Set the deadlock detection timeout to 30 seconds
Only instrument functions with names containing "kernel_a" or "kernel_b"
Run the application with instruction tracing enabled

Instruction Filtering Example

INSTRS="1-10,13,23-100" CUDA_INJECTION64_PATH=./lib/cutracer.so ./my_cuda_app

This will:

Instrument only instructions 1 through 10, instruction 13, and instructions 23 through 100
Skip all other instructions in the kernels

Advanced Example: PyTorch with Flash Attention

For debugging complex applications like PyTorch with Flash Attention operators in TritonBench:

TOOL_VERBOSE=1 \
DUMP_INTERMEDIA_TRACE=1 \
DUMP_INTERMEDIA_TRACE_TIMEOUT=10 \
FUNC_NAME_FILTER=attn \
CUDA_INJECTION64_PATH=~/path/to/deadlock_tracker/lib/cutracer.so \
python run.py --op flash_attention --batch 8 --seq-len 8192 --n-heads 16 --d-head 128 > deadlock_trace.log 2>&1

This configuration:

Enables verbose output from the tool
Dumps intermediate trace data during execution
Sets a 10-second timeout for intermediate trace dumping
Only instruments functions containing "attn" in their name (targeting attention-related kernels)
Redirects both stdout and stderr to a log file for later analysis

Example Output:

------------- NVBit (NVidia Binary Instrumentation Tool v1.7.4) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
ACK_CTX_INIT_LIMITATION = 0 - if set, no warning will be printed for nvbit_at_ctx_init()
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
       NO_EAGER_LOAD = 0 - eager module loading is turned on by NVBit to prevent potential NVBit tool deadlock, turn it off if you want to use the lazy module loading feature
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
        TOOL_VERBOSE = 1 - Enable verbosity inside the tool
    DEADLOCK_TIMEOUT = 10 - Timeout in seconds to detect potential deadlocks
       LOG_TO_STDOUT = 1 - Log to stdout instead of files (1=enabled, 0=disabled)
STORE_LAST_TRACES_ONLY = 0 - Only store the last trace for each warp in memory (1=enabled, 0=disabled)
DUMP_INTERMEDIA_TRACE = 1 - Dump intermediate trace data to stdout (1=enabled, 0=disabled)
DUMP_INTERMEDIA_TRACE_TIMEOUT = 10 - Timeout in seconds for intermediate trace dumping (0=unlimited)
WARNING: Do not call CUDA memory allocation in nvbit_at_ctx_init(). It will cause deadlocks. Do them in nvbit_tool_init(). If you encounter deadlocks, remove CUDA API calls to debug.
No function name filters specified. Instrumenting all functions.
----------------------------------------------------------------------------------------------------
Instrumenting function: vecAdd(double*, double*, double*, int) (mangled: _Z6vecAddPdS_S_i)
Inspecting function vecAdd(double*, double*, double*, int) at address 0x7f34177aef00
Instr 0 @ 0x0 (0) - LDC R1, c[0x0][0x28] ;
  has_guard_pred = 0
  opcode = LDC/LDC
  memop = CONSTANT
  format = INT32
  load/store = 1/0
  size = 4
  is_extended = 0
--op[0].type = REG
  is_neg/is_not/abs = 0/0/0
  size = 4
  num = 1
  prop = 
--op[1].type = CBANK
  is_neg/is_not/abs = 0/0/0
  size = 4
  id = 0
  has_imm_offset = 1
  imm_offset = 40
  has_reg_offset = 0
  reg_offset = 0

Kernel vecAdd(double*, double*, double*, int) - PC 0x7f34177aef00 - grid size 1,1,1 - block size 1024,1,1 - nregs 14 - shmem 0 - cuda stream id 0
INTERMEDIATE TRACE - CTA 0,0,0 - warp 24:
  LDC R1, c[0x0][0x28] ; - PC Offset 1 (0x0)
  * Reg0_T0: 0x00000000 Reg0_T1: 0x00000000 Reg0_T2: 0x00000000 Reg0_T3: 0x00000000 Reg0_T4: 0x00000000 Reg0_T5: 0x00000000 Reg0_T6: 0x00000000 Reg0_T7: 0x00000000 Reg0_T8: 0x00000000 Reg0_T9: 0x00000000 Reg0_T10: 0x00000000 Reg0_T11: 0x00000000 Reg0_T12: 0x00000000 Reg0_T13: 0x00000000 Reg0_T14: 0x00000000 Reg0_T15: 0x00000000 Reg0_T16: 0x00000000 Reg0_T17: 0x00000000 Reg0_T18: 0x00000000 Reg0_T19: 0x00000000 Reg0_T20: 0x00000000 Reg0_T21: 0x00000000 Reg0_T22: 0x00000000 Reg0_T23: 0x00000000 Reg0_T24: 0x00000000 Reg0_T25: 0x00000000 Reg0_T26: 0x00000000 Reg0_T27: 0x00000000 Reg0_T28: 0x00000000 Reg0_T29: 0x00000000 Reg0_T30: 0x00000000 Reg0_T31: 0x00000000

In the above trace snippet, we can see that the kernel vecAdd(double*, double*, double*, int) is instrumented and the instructions executed by each warp are captured. LDC R1, c[0x0][0x28] ; - PC Offset 1 (0x0) is the first instruction executed by the warp and RegX_TY means the X-th register value in the instruction of the Y-th thread in the warp.

How It Works

CUTracer uses NVIDIA's NVBit (NVIDIA Binary Instrumentation Tool) to dynamically instrument CUDA applications. It:

Intercepts CUDA kernel launches
Instruments kernel code with data collection functions
Captures instruction execution data in real-time
Analyzes execution patterns and reports detailed information

For each instrumented instruction, the tool collects register values, program counters, and execution context information, which are transmitted from the GPU to the CPU for analysis. This approach provides comprehensive insights into kernel behavior without requiring source code modifications.

Use Cases

Performance optimization: Identify bottlenecks and inefficient execution patterns
Debugging complex issues: Trace through problematic kernels to pinpoint errors
Deadlock detection: Identify when kernels stop making progress
Algorithm analysis: Understand the execution flow of complex GPU algorithms
Memory access pattern analysis: Track how threads access memory for optimization

Limitations

Adding instrumentation adds overhead to kernel execution
May produce false positives for long-running but non-deadlocked kernels
Limited to NVIDIA GPUs and CUDA applications

License

This project is licensed under the MIT License - see the LICENSE file for details.

This project uses NVBit (protected by NVIDIA CUDA Toolkit EULA) as a dependency.

Acknowledgements

Based on NVIDIA's NVBit framework for dynamic binary instrumentation
Inspired by various debugging techniques for GPU applications

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
include		include
src		src
tests		tests
.clang-format		.clang-format
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
install_third_party.sh		install_third_party.sh
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CUTracer

Overview

Features

Requirements

Installation

Usage

Configuration

Examples

Basic Example

Instruction Filtering Example

Advanced Example: PyTorch with Flash Attention

Example Output:

How It Works

Use Cases

Limitations

License

Acknowledgements

About

Releases

Packages

Languages

License

FindHao/CUTracer

Folders and files

Latest commit

History

Repository files navigation

CUTracer

Overview

Features

Requirements

Installation

Usage

Configuration

Examples

Basic Example

Instruction Filtering Example

Advanced Example: PyTorch with Flash Attention

Example Output:

How It Works

Use Cases

Limitations

License

Acknowledgements

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages