GPU Extensions
KernelAbstractions.jl
@code_diff automatically detects calls to KernelAbstractions.jl kernels and retrieves the code of the actual underlying kernel function, whatever the backend is. For this to work, the kernel call must be complete: both workgroupsize and ndrange must have a value, set either when instantiating the kernel for a backend (gpu_kernel = my_kernel(CUDABackend(), 1024)) or when calling the kernel (gpu_kernel(a, b, c; ndrange=1000)).
There is no support for AST comparison with KA.jl kernels.
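As a minimal sketch of what a complete kernel call looks like, the snippet below defines a hypothetical add_kernel! and shows the two ways of fixing workgroupsize and ndrange. The kernel name, the array sizes and the choice of the :ptx code type are illustrative assumptions; the GPU part only runs when CUDA.jl reports a functional device.

```julia
using CUDA, KernelAbstractions, CodeDiffs

# Hypothetical kernel, for illustration only
@kernel function add_kernel!(c, a, b)
    i = @index(Global)
    @inbounds c[i] = a[i] + b[i]
end

if CUDA.functional()
    a = CUDA.ones(Float32, 1000)
    b = CUDA.ones(Float32, 1000)
    c = CUDA.zeros(Float32, 1000)

    # Workgroup size fixed at instantiation, ndrange given at the call
    gpu_kernel = add_kernel!(CUDABackend(), 256)
    @code_for :ptx gpu_kernel(c, a, b; ndrange=length(c))

    # Both fixed at instantiation: the call needs no keyword
    # (assumption: a GPU code type such as :ptx applies to a kernel
    # instantiated for the matching backend)
    gpu_kernel_static = add_kernel!(CUDABackend(), 256, 1000)
    @code_for :ptx gpu_kernel_static(c, a, b)
end
```

A call such as gpu_kernel(c, a, b) with the first instance would be incomplete, since ndrange would have no value.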
GPU kernels
@code_diff supports functions compiled in a GPU context with any of the GPU packages: CUDA.jl, AMDGPU.jl, oneAPI.jl and Metal.jl.
Each compilation step has its own code type:
- :cuda_typed/:rocm_typed/:one_typed/:mtl_typed: typed Julia IR for the GPU (output of @device_code_typed)
- :cuda_llvm/:rocm_llvm/:one_llvm/:mtl_llvm: GPU LLVM IR (output of @device_code_llvm)
- :cuda_native/:rocm_native/:one_native/:mtl_native: native GPU assembly (output of @device_code_native). Each has an alias using the assembly name: :ptx/:gcn/:spirv/:agx.
CUDA has one additional layer of assembly code, SASS, available with :sass.
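The code types above could be used along these lines. This is a sketch only: scale_v1! and scale_v2! are hypothetical kernels, and it assumes that arguments are passed as they would be to @cuda and that the type=... keyword form of @code_diff is available.

```julia
using CUDA, CodeDiffs

# Two hypothetical variants of the same kernel
function scale_v1!(y, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        y[i] *= a             # bounds-checked access
    end
    return nothing
end

function scale_v2!(y, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] *= a   # bounds checks removed
    end
    return nothing
end

if CUDA.functional()
    y = CUDA.ones(Float32, 1024)

    # Compare both variants at two compilation steps
    @code_diff type=:cuda_llvm scale_v1!(y, 2f0) scale_v2!(y, 2f0)
    @code_diff type=:ptx scale_v1!(y, 2f0) scale_v2!(y, 2f0)

    # CUDA only: inspect the SASS of a single kernel
    @code_for :sass scale_v2!(y, 2f0)
end
```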
Unlike with the @device_code_* macros, no kernel code is executed by @code_diff. The @device_code_* macros work by capturing kernel launches, while @code_diff or @code_for work with the kernel function directly: this means that kernels launched indirectly by the function call will be ignored.
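The difference can be sketched as follows, with a hypothetical fill_kernel! and a host-side wrapper fill_on_gpu! that launches it. The names are illustrative, and the behavior described in the comments follows the paragraph above rather than a verified run.

```julia
using CUDA, CodeDiffs

# Hypothetical kernel
function fill_kernel!(y, v)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = v
    end
    return nothing
end

# Host-side wrapper: the kernel is launched indirectly
function fill_on_gpu!(y, v)
    @cuda threads=256 blocks=cld(length(y), 256) fill_kernel!(y, v)
    return y
end

if CUDA.functional()
    y = CUDA.zeros(Float32, 1024)

    # Captures the kernel compiled by the launch inside the wrapper;
    # the wrapper, and therefore the kernel, is executed
    CUDA.@device_code_ptx fill_on_gpu!(y, 1f0)

    # Works on the kernel function directly; nothing is launched
    @code_for :ptx fill_kernel!(y, 1f0)
end
```

Passing fill_on_gpu!(y, 1f0) to @code_for would target the wrapper itself, not the kernel it launches, so the kernel function has to be named explicitly.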
Note that behind the scenes, GPUCompiler.jl only cares about the most recent methods. Hence the world keyword is unsupported for all GPU backends, as we cannot compile back in time.
GPU kernel statistics
With the :cuda_stats code type, you can get an overview of your CUDA kernel through statistics inferred from its PTX and SASS code.
The supported statistics types are :cuda_stats, :ptx_stats and :sass_stats for CUDA kernels, and :gcn_stats for AMDGPU kernels. See CodeDiffs.Stats.extract_stats for more about them.
Example usage:
@code_for :cuda_stats some_kernel(a, b, c)
Output:
Kernel memory stats (static allocations):
- Global 0 bytes
- Const 0 bytes
- Param 88 bytes
- Shared 0 bytes
- Local 0 bytes
PTX variable declarations:
- Param:
  - b8 80
  - b64 1
- Registers:
  - pred 4
  - b32 9
  - b64 6
PTX memory instructions (loads, stores):
- Global:
  - u64 0 1
- Param:
  - b32 0 2
  - b64 1 2
  - u32 1 0
  - u64 3 0
SASS source stats:
- SM version 86
- Registers usage 38
- Instructions 600
- Workgroup sync 7
- Warp sync 0
- Function calls 12