GPU Extensions

KernelAbstractions.jl

@code_diff automatically detects calls to KernelAbstractions.jl kernels and retrieves the code of the underlying kernel function, whatever the backend. For this to work, the kernel call must be complete: both workgroupsize and ndrange must have a value, set either when instantiating the kernel for a backend (gpu_kernel = my_kernel(CUDABackend(), 1024)) or when calling the kernel (gpu_kernel(a, b, c; ndrange=1000)).

There is no support for AST comparison with KA.jl kernels.
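The "complete call" requirement described above can be sketched as follows. This is a minimal illustration, not a reference: add_kernel! and its arguments are hypothetical, the CPU backend is used only so that no GPU is needed, and the type=:llvm option form for @code_diff is an assumption based on the package's general usage.

```julia
using KernelAbstractions
using CodeDiffs

# Hypothetical element-wise addition kernel, for illustration only.
@kernel function add_kernel!(c, @Const(a), @Const(b))
    i = @index(Global)
    @inbounds c[i] = a[i] + b[i]
end

a = rand(Float32, 1000)
b = rand(Float32, 1000)
c = similar(a)

# Any KernelAbstractions backend should do, e.g. CUDABackend() once CUDA.jl is loaded.
backend = CPU()

# Variant A: workgroupsize is set at instantiation, ndrange at the call.
kernel_a = add_kernel!(backend, 64)

# Variant B: both workgroupsize and ndrange are set at instantiation.
kernel_b = add_kernel!(backend, 256, length(c))

# Both calls are complete, so the underlying kernel functions can be resolved.
# Assumption: `type=:llvm` selects the LLVM IR of the kernels.
@code_diff type=:llvm kernel_a(c, a, b; ndrange=length(c)) kernel_b(c, a, b)
```

A call such as kernel_a(c, a, b) would be incomplete, since kernel_a has no ndrange from either its instantiation or the call.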

GPU kernels

@code_diff supports functions compiled in a GPU context with any of the GPU packages:

  • CUDA.jl
  • AMDGPU.jl
  • oneAPI.jl
  • Metal.jl

Each compilation step has its own code type:

  • :cuda_typed/:rocm_typed/:one_typed/:mtl_typed: typed Julia IR for the GPU (output of @device_code_typed)
  • :cuda_llvm/:rocm_llvm/:one_llvm/:mtl_llvm: GPU LLVM IR (output of @device_code_llvm)
  • :cuda_native/:rocm_native/:one_native/:mtl_native: native GPU assembly (output of @device_code_native). Each has an alias named after its assembly language: :ptx/:gcn/:spirv/:agx.

CUDA has one additional layer of assembly code, SASS, available with :sass.
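As a rough sketch of how these code types might be used with CUDA.jl: the saxpy kernels below are hypothetical, a working CUDA GPU is assumed, and it is assumed that host-side CuArrays can be passed as arguments, as in the example usage of the statistics section. The type=... option form for @code_diff is also an assumption.

```julia
using CUDA
using CodeDiffs

# Two hypothetical variants of the same kernel, for illustration only.
function saxpy_v1!(y, x, α)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = α * x[i] + y[i]
    end
    return nothing
end

function saxpy_v2!(y, x, α)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = muladd(α, x[i], y[i])
    end
    return nothing
end

x = CUDA.rand(Float32, 1024)
y = CUDA.rand(Float32, 1024)

# Inspect a single kernel at different compilation steps.
@code_for :cuda_typed saxpy_v1!(y, x, 2f0)  # typed Julia IR for the GPU
@code_for :ptx saxpy_v1!(y, x, 2f0)         # alias of :cuda_native
@code_for :sass saxpy_v1!(y, x, 2f0)        # CUDA-only SASS layer

# Compare the GPU LLVM IR of the two variants (assumed option form).
@code_diff type=:cuda_llvm saxpy_v1!(y, x, 2f0) saxpy_v2!(y, x, 2f0)
```

The same pattern should carry over to the other backends by swapping the prefix (for instance :rocm_llvm or :gcn with AMDGPU.jl arrays).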

Info

Unlike with the @device_code_* macros, no kernel code is executed by @code_diff. The @device_code_* macros work by capturing kernel launches, while @code_diff or @code_for work with the kernel function directly: this means that kernels launched indirectly by the function call will be ignored.
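The difference between the two approaches can be illustrated with a host function that launches a kernel. This is a sketch under assumptions: scale_kernel! and scale! are hypothetical names, a working CUDA GPU is assumed, and passing a CuArray directly to the kernel call in @code_for is assumed to be accepted.

```julia
using CUDA
using CodeDiffs

# Hypothetical kernel, for illustration only.
function scale_kernel!(x, α)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] *= α
    end
    return nothing
end

# Hypothetical host wrapper: the kernel is launched indirectly, from inside scale!.
function scale!(x, α)
    threads = 256
    blocks = cld(length(x), threads)
    @cuda threads=threads blocks=blocks scale_kernel!(x, α)
    return x
end

x = CUDA.ones(Float32, 1024)

# Runs scale! and prints the LLVM IR of every kernel launched while it executes.
@device_code_llvm scale!(x, 2f0)

# Executes nothing: the kernel function itself must be given, not the host wrapper.
@code_for :cuda_llvm scale_kernel!(x, 2f0)
```

Passing scale! to @code_for or @code_diff instead would not show scale_kernel!, since the launch inside the wrapper is not captured.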

Info

Behind the scenes, GPUCompiler.jl only considers the most recent method definitions. The world keyword is therefore unsupported for all GPU backends, as we cannot compile back in time.

GPU kernel statistics

With the :cuda_stats code type, you can get an overview of your CUDA kernel through statistics inferred from its PTX and SASS code.

The supported statistics types are :cuda_stats, :ptx_stats and :sass_stats for CUDA kernels, and :gcn_stats for AMDGPU kernels. See CodeDiffs.Stats.extract_stats for more about them.

Example usage:

@code_for :cuda_stats some_kernel(a, b, c)

Output:

Kernel memory stats (static allocations):
 - Global  0 bytes
 - Const   0 bytes
 - Param   88 bytes
 - Shared  0 bytes
 - Local   0 bytes

PTX variable declarations:
 - Param:
   - b8       80
   - b64       1
 - Registers:
   - pred      4
   - b32       9
   - b64       6

PTX memory instructions (loads, stores):
 - Global:
   - u64       0     1
 - Param:
   - b32       0     2
   - b64       1     2
   - u32       1     0
   - u64       3     0

SASS source stats:
 - SM version        86
 - Registers usage   38
 - Instructions      600
 - Workgroup sync    7
 - Warp sync         0
 - Function calls    12