Using Kokkos.jl with MPI.jl

Loading and compilation

Since calling Kokkos.initialize may trigger the compilation of the internal wrapper library, some care is needed to ensure that only a single process performs the compilation.

A basic initialization workflow with MPI may look like this:

using MPI
using Kokkos

MPI.Init()

rank = MPI.Comm_rank(MPI.COMM_WORLD)
if rank == 0
    Kokkos.load_wrapper_lib()  # All compilation (if any) of the C++ wrapper happens here
end

MPI.Barrier(MPI.COMM_WORLD)

rank != 0 && Kokkos.load_wrapper_lib(; no_compilation=true, no_git=true)
Kokkos.initialize()

Note that passing no_compilation=true and no_git=true to load_wrapper_lib on the non-root processes is required: it prevents them from attempting to compile the wrapper or update its sources concurrently with the root process.

The same workflow can be used to compile your library on the root process:

my_project = CMakeKokkosProject(project_src, "libproj")
rank == 0 && compile(my_project)
MPI.Barrier(MPI.COMM_WORLD)
lib = load_lib(my_project)

If configuration options need to be changed before initializing Kokkos, they must be changed by the root process only, since changing options modifies the LocalPreferences.toml file. Configuration options only affect how the wrapper library is compiled, so there is no need to synchronize them across processes, with one exception: build_dir, which MUST be the same on all processes. Passing local_only=true to Kokkos.set_build_dir on the non-root processes sets the build directory without touching the LocalPreferences.toml file. The workflow then becomes:

rank = MPI.Comm_rank(MPI.COMM_WORLD)
if rank == 0
    Kokkos.set_view_types(my_view_types)
    # set other config options...
    Kokkos.set_build_dir(my_build_dir)
    Kokkos.load_wrapper_lib()
else
    Kokkos.set_build_dir(my_build_dir; local_only=true)
end

MPI.Barrier(MPI.COMM_WORLD)

rank != 0 && Kokkos.load_wrapper_lib(; no_compilation=true, no_git=true)

Dynamic Compilation and MPI

See DynamicCompilation.compilation_lock

Passing views to MPI

Passing a Kokkos.View to an MPI function is possible:

nb_procs = MPI.Comm_size(MPI.COMM_WORLD)
rank = MPI.Comm_rank(MPI.COMM_WORLD)
next_rank = (rank + 1) % nb_procs              # ring topology: send to the next rank...
prev_rank = (rank + nb_procs - 1) % nb_procs   # ...and receive from the previous one
n = 10                                         # number of elements per view

v = View{Float64}(n)
v .= rank

r = View{Float64}(n)

MPI.Sendrecv!(v, next_rank, 0, r, prev_rank, 0, MPI.COMM_WORLD)

@assert all(r .== prev_rank)

Internally, the pointer to the view's data is passed to MPI: the data is never copied, regardless of the memory space in which the view is stored.

If Kokkos.span_is_contiguous(view) == true, then the whole memory span of the view is passed to MPI as a single block of data.
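This property can be queried directly. The following sketch reuses the v view from the example above; the printed messages are illustrative only, since Kokkos.jl handles both cases transparently:

if Kokkos.span_is_contiguous(v)
    # Fast path: the pointer and the whole span are handed to MPI as one buffer.
    println("v will be sent as a single contiguous block")
else
    # Otherwise a derived MPI datatype describing the layout is required.
    println("v will be described by a custom MPI.Datatype")
end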

For non-contiguous views (such as LayoutStride), a custom MPI.Datatype is built that exactly represents the view's memory layout.
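As an illustration of the underlying mechanism (this is not Kokkos.jl's internal code, and the layout numbers are hypothetical), MPI.jl exposes the derived datatypes such a representation relies on. A 2D strided layout of 10 rows of 4 contiguous Float64 elements, with a stride of 16 elements between rows, could be described as follows:

# Assumes MPI.Init() has already been called, as in the examples above.
row_count  = 10
block_len  = 4
row_stride = 16

strided_type = MPI.Types.create_vector(row_count, block_len, row_stride,
                                       MPI.Datatype(Float64))
MPI.Types.commit!(strided_type)
# strided_type can then be used as the datatype of an MPI buffer.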

Support for GPU-awareness should be seamless, as long as your MPI implementation is itself GPU-aware (e.g. CUDA-aware or ROCm-aware).
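When device views are used, a quick sanity check on the MPI build can help catch configuration problems early. A minimal sketch for a CUDA backend, using MPI.jl's MPI.has_cuda() query (the error message is illustrative):

using MPI
MPI.Init()

# MPI.has_cuda() reports whether the MPI library was built with CUDA support;
# without it, device views must be copied to host memory before communication.
if !MPI.has_cuda()
    error("this MPI build is not CUDA-aware: device views cannot be passed directly")
end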