GPU Extensions
KernelAbstractions.jl
@code_diff will automatically detect calls to KernelAbstractions.jl kernels and retrieve the code of the actual underlying kernel function, whatever the backend is. To do this, the kernel call must be complete: both workgroupsize and ndrange must have a value, set either when instantiating the kernel for a backend (gpu_kernel = my_kernel(CUDABackend(), 1024)) or when calling the kernel (gpu_kernel(a, b, c; ndrange=1000)).
There is no support for AST comparison with KA.jl kernels.
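As a rough illustration of a complete kernel call, the sketch below fixes workgroupsize when instantiating the kernels and passes ndrange at the call site. The kernels vadd! and vmul! are hypothetical, the CPU() backend is used only so that no GPU is required, and the snippet has not been checked against a specific CodeDiffs.jl version.

```julia
using KernelAbstractions
using CodeDiffs

# Two hypothetical element-wise kernels to compare.
@kernel function vadd!(c, a, b)
    i = @index(Global)
    @inbounds c[i] = a[i] + b[i]
end

@kernel function vmul!(c, a, b)
    i = @index(Global)
    @inbounds c[i] = a[i] * b[i]
end

a = rand(Float32, 1000)
b = rand(Float32, 1000)
c = similar(a)

# workgroupsize (64) is set when instantiating each kernel for a backend.
backend = CPU()
add_kernel = vadd!(backend, 64)
mul_kernel = vmul!(backend, 64)

# ndrange is set at the call site, so both kernel calls are complete.
# No code type is given here, so the default of @code_diff applies.
@code_diff add_kernel(c, a, b; ndrange=length(c)) mul_kernel(c, a, b; ndrange=length(c))
```

Leaving out ndrange in either call would make it incomplete, since neither instantiation above provides it.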
GPU kernels
@code_diff supports functions compiled in a GPU context with any of the GPU packages: CUDA.jl, AMDGPU.jl, oneAPI.jl and Metal.jl.
Each compilation step has its own code type:
- :cuda_typed / :rocm_typed / :one_typed / :mtl_typed: typed Julia IR for the GPU (output of @device_code_typed)
- :cuda_llvm / :rocm_llvm / :one_llvm / :mtl_llvm: GPU LLVM IR (output of @device_code_llvm)
- :cuda_native / :rocm_native / :one_native / :mtl_native: native GPU assembly (output of @device_code_native). Each has an alias using the assembly name: :ptx / :gcn / :spirv / :agx.
CUDA has one additional layer of assembly code, SASS, available with :sass.
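To make the code types more concrete, here is a minimal sketch for CUDA.jl. The two kernels are hypothetical, and passing CuArray arguments directly to the macros is an assumption modelled on the some_kernel(a, b, c) usage shown in this page. The type= keyword form for @code_diff is also assumed here; the positional form for @code_for is the one this page uses.

```julia
using CUDA
using CodeDiffs

# Hypothetical kernels: same computation, written with and without muladd.
function saxpy_v1!(y, a, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end

function saxpy_v2!(y, a, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = muladd(a, x[i], y[i])
    end
    return nothing
end

x = CUDA.rand(Float32, 1024)
y = CUDA.rand(Float32, 1024)

# Compare the GPU LLVM IR of both kernels (assumed keyword form).
@code_diff type=:cuda_llvm saxpy_v1!(y, 2f0, x) saxpy_v2!(y, 2f0, x)

# Display the PTX of one kernel; :ptx is the alias of :cuda_native.
@code_for :ptx saxpy_v1!(y, 2f0, x)

# Display the SASS, the additional CUDA-only assembly layer.
@code_for :sass saxpy_v1!(y, 2f0, x)
```

The same pattern should carry over to the other packages by swapping the prefix (for instance :rocm_llvm or :gcn with AMDGPU.jl), although that is not verified here.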
Unlike with the @device_code_* macros, no kernel code is executed by @code_diff. The @device_code_* macros work by capturing kernel launches, while @code_diff and @code_for work with the kernel function directly: this means that kernels launched indirectly by the function call will be ignored.
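The difference can be sketched with CUDA.jl as follows. The names scale_kernel! and run! are hypothetical; run! is a host function that launches the kernel indirectly. Passing CuArray arguments directly to @code_for is an assumption modelled on the usage shown elsewhere in this page.

```julia
using CUDA
using CodeDiffs

# Hypothetical kernel function.
function scale_kernel!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = 2 * x[i]
    end
    return nothing
end

# Hypothetical host function: the kernel launch happens inside it.
function run!(y, x)
    @cuda threads=256 blocks=cld(length(y), 256) scale_kernel!(y, x)
    return y
end

x = CUDA.rand(Float32, 1024)
y = similar(x)

# Executes run!, captures the launch inside it and prints the kernel's LLVM IR.
CUDA.@device_code_llvm run!(y, x)

# Executes nothing: the kernel function has to be named directly.
@code_for :cuda_llvm scale_kernel!(y, x)
```

Pointing @code_for at run! instead would not pick up scale_kernel!, since the launch inside run! is never captured.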
Note that behind the scenes, GPUCompiler.jl only cares about the most recent methods. Hence the world keyword is unsupported for all GPU backends, as we cannot compile back in time.
GPU kernel statistics
With the :cuda_stats code type, you can get an overview of your CUDA kernel through statistics inferred from its PTX and SASS code.
The supported statistics types are :cuda_stats, :ptx_stats and :sass_stats for CUDA kernels, and :gcn_stats for AMDGPU kernels. See CodeDiffs.Stats.extract_stats for more about them.
Example usage:
@code_for :cuda_stats some_kernel(a, b, c)
Output:
Kernel memory stats (static allocations):
- Global 0 bytes
- Const 0 bytes
- Param 88 bytes
- Shared 0 bytes
- Local 0 bytes
PTX variable declarations:
- Param:
  - b8 80
  - b64 1
- Registers:
  - pred 4
  - b32 9
  - b64 6
PTX memory instructions (loads, stores):
- Global:
  - u64 0 1
- Param:
  - b32 0 2
  - b64 1 2
  - u32 1 0
  - u64 3 0
SASS source stats:
- SM version 86
- Registers usage 38
- Instructions 600
- Workgroup sync 7
- Warp sync 0
- Function calls 12