Armon

Documentation for Armon.

Armon is a 2D CFD solver for compressible, inviscid fluids, using the finite volume method.

It was made to explore Julia's capabilities in HPC and its potential for performance portability: it should perform very well on any CPU and GPU. Domain decomposition using MPI is supported.

The twin project Armon-Kokkos is a mirror of the core of this solver (with far fewer options), written in C++ using the Kokkos library. Kernels from that solver can be reused in this one through the Kokkos.jl package.

Parameters and entry point

Armon.ArmonParametersType
ArmonParameters(; options...)

The parameters and current state of the solver.

The state is reset at each call to armon.

There are many options. Each backend can add its own.

Options

Backend and MPI

device = :CUDA

Device to use. Supported values:

  • :CPU_HP: Polyester.jl CPU multithreading (default if use_gpu=false)
  • :CUDA: CUDA.jl GPU (default if use_gpu=true)
  • :ROCM: AMDGPU.jl GPU
  • :CPU: KernelAbstractions.jl CPU multithreading (using the standard Threads.jl)
use_MPI = true, P = (1, 1), reorder_grid = true, global_comm = nothing

MPI config. The MPI domain will be a process grid of size P. global_comm is the global communicator to use, defaults to MPI.COMM_WORLD. reorder_grid is passed to MPI.Cart_create.
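For example, a CUDA run on a 2×2 process grid might be configured as follows (a minimal sketch: the options are those documented here, the values are illustrative):

using Armon, MPI
MPI.Init()
params = ArmonParameters(; use_gpu=true, device=:CUDA, use_MPI=true, P=(2, 2))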

gpu_aware = true

Store MPI buffers on the device. This requires a GPU-aware MPI implementation. Does nothing when using the CPU only.

numa_aware = true

Allocate memory on the NUMA node associated with the thread meant to work on that chunk of memory. This effectively enforces the first-touch policy, instead of blindly relying on it.

lock_memory = false

Lock all memory pages to RAM using mlock.

Kernels

use_threading = true, use_simd = true

Switches for CPU_HP kernels. use_threading enables @threaded for outer loops. use_simd enables @simd_loop for inner loops.

use_gpu = false

Enables the use of KernelAbstractions.jl kernels.

use_kokkos = false

Use kernels for Kokkos.jl.

use_cache_blocking = true

Separate the domain into semi-independent blocks, improving the cache locality of memory accesses and therefore the memory throughput.

async_cycle = false

Apply all steps of the solver to all blocks asynchronously, fully taking advantage of cache blocking.

block_size = 1024

Size of blocks for cache blocking. Can be a tuple. If use_cache_blocking == false, this option only controls the size of GPU blocks.

use_two_step_reduction = false

Reduction kernels (dtCFL_kernel and conservation_vars) use some optimizations to perform the reduction in a single step, which might cause issues on some GPU backends; use_two_step_reduction=true uses a gentler two-step approach which avoids them.

workload_distribution = :simple

Dictates how blocks are distributed among threads when async_cycle == true:

  • :simple trivially spreads all blocks to all threads evenly
  • :scotch uses the Scotch solver to partition the block grid
  • :sorted_scotch is the same as :scotch, but additionally sorts the blocks to work on those at the perimeter of the square first, reducing the likelihood of waiting for neighbouring threads
  • :weighted_sorted_scotch takes into account the number of cells in each block instead of assuming an even workload for all blocks
busy_wait_limit = 100

Number of unproductive calls to block_state_machine in a cycle (calls which did not advance any of a thread's blocks) before a call to MPI_Wait or a system sleep is performed. This is only relevant when async_cycle == true.
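For instance, an asynchronous cache-blocked CPU run could combine the options of this section as follows (a sketch; the values are illustrative):

params = ArmonParameters(;
    use_cache_blocking=true, async_cycle=true,
    block_size=(64, 64), workload_distribution=:sorted_scotch
)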

Profiling

profiling = Symbol[]

List of profiling callbacks to use:

  • :TimerOutputs: TimerOutputs.jl sections (added if measure_time=true)
  • :NVTX_sections: NVTX.jl sections
  • :NVTX_kernels: NVTX.jl sections for kernels
  • :CUDA_kernels: equivalent to CUDA.@profile in front of all kernels
measure_time = true

measure_time=false can remove any overhead caused by profiling.

time_async = true

time_async=false will add a barrier at the end of every section. Useful for GPU kernels.

Scheme and CFD solver

scheme = :GAD, riemann_limiter = :minmod

scheme is the Riemann solver scheme to use:

  • :Godunov (1st order)
  • :GAD (2nd order, with limiter).

riemann_limiter is the limiter to use for the Riemann solver: :no_limiter, :minmod or :superbee.

projection = :euler_2nd

Scheme for the Eulerian remap step:

  • :euler (1st order)
  • :euler_2nd (2nd order, +minmod limiter)
axis_splitting = :Sequential

Axis splitting to use:

  • :Sequential: X then Y
  • :SequentialSym (or :Godunov): X and Y then Y and X, alternating
  • :Strang: ½X, Y, ½X then ½Y, X, ½Y, alternating (½ is for halved time step)
  • :X_only
  • :Y_only
N = (10, 10)

Number of cells of the global domain along each axis.

nghost = 4

Number of ghost cells. Must be greater than or equal to the minimum number of ghost cells (at least 1; scheme=:GAD adds one, projection=:euler_2nd adds one, and scheme=:GAD + projection=:euler_2nd adds one more)

Dt = 0., cst_dt = false, dt_on_even_cycles = false

Dt is the initial time step; by default it is computed after initialization. If cst_dt=true then the time step is always Dt and no reduction over the entire domain occurs. If dt_on_even_cycles=true then the time step is only updated on even cycles (the first cycle is even).

data_type = Float64

Data type for all variables. Should be an AbstractFloat.

Test case and domain

test = :Sod, domain_size = nothing, origin = nothing

test is the test case name to use:

  • :Sod: Sod shock tube test
  • :Sod_y: Sod shock tube test along the Y axis
  • :Sod_circ: Circular Sod shock tube test (centered in the domain)
  • :Bizarrium: Bizarrium test, similar to the Sod shock tube but with a special equation of state
  • :Sedov: Sedov blast-wave test (centered in the domain, reaches the border at t=1 by default)
  • :DebugIndexes: Set all variables to their index in the global domain. Debug only.
cfl = 0., maxtime = 0., maxcycle = 500_000

cfl defaults to the test's default value, same for maxtime. The solver stops when t reaches maxtime or when maxcycle iterations have been done (maxcycle=0 stops right after initialization).

Output

silent = 0

silent=0 for maximum verbosity. silent=3 doesn't print info at each cycle. silent=5 doesn't print anything.

output_dir = ".", output_file = "output"

joinpath(output_dir, output_file) will be the path to the output file.

write_output = false, write_ghosts = false

write_output=true will write all saved_vars() to the output file. If write_ghosts=true, ghost cells will also be included.

write_slices = false

Will write all saved_vars() to 3 output files, one for the middle X row, another for the middle Y column, and another for the diagonal. If write_ghosts=true, ghost cells will also be included.

output_precision = nothing

Numbers are saved with output_precision digits of precision. Defaults to enough digits for an exact decimal representation.

animation_step = 0

If animation_step ≥ 1, then every animation_step cycles, variables will be saved as with write_output=true.

compare = false, is_ref = false, comparison_tolerance = 1e-10

If compare=true, then at every sub step of each iteration of the solver all variables will:

  • (is_ref=false) be compared with a reference file found in output_dir
  • (is_ref=true) be saved to a reference file in output_dir

When comparing, a relative comparison_tolerance (the rtol kwarg of isapprox) is accepted between values.

check_result = false

Check if conservation of mass and energy is verified between initialization and the last iteration. An error is thrown otherwise. Accepts a relative comparison_tolerance.

return_data = false

If return_data=true, then in the SolverStats returned by armon, the data field will contain the BlockGrid used by the solver.
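Putting it all together, a typical solver run might look like this (a sketch; armon is the entry point and the option values are illustrative):

params = ArmonParameters(; test=:Sod_circ, N=(100, 100), maxcycle=1000, return_data=true)
stats = armon(params)
grid = stats.data  # the BlockGrid used by the solver, since return_data=true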

source
Armon.SolverStatsType
SolverStats

Solver output.

data is nothing if parameters.return_data is false.

timer is nothing if parameters.measure_time is false.

grid_log is nothing if parameters.log_blocks is false.

source
Armon.StepsRangesType
StepsRanges

Holds indexing information for all steps of the solver.

Domains are stored as block corner offsets: blocks can have different sizes, but always the same amount of ghost cells, therefore the iteration domain is determined from the dimensions of the block. The first field is the offset to the first cell, the second is the offset to the last cell.

source
Armon.data_typeFunction
data_type(::ArmonParameters{T})

Get T, the type used for numbers by the solver.

source

Grid and blocks

Armon.BlockGridType
BlockGrid{T, DeviceA, HostA, BufferA, Ghost, BlockSize, Device, SolverState}

Stores TaskBlocks on the Device and host memory, in a grid.

LocalTaskBlocks are stored separately depending on whether they have a StaticBSize of BlockSize (in blocks) or a DynamicBSize (in edge_blocks).

Blocks have Ghost cells padding their real cells. This is included in their block_size. A block cannot have a number of real cells along an axis smaller than the number of ghost cells, unless there is only a single block in the grid along that axis.

"Edge blocks" are blocks located on the right and/or top edge of the grid. They exist in order to handle domains with dimensions which are not multiples of the block size.

DeviceA and HostA are AbstractArray types for the device and host respectively.

BufferA is the type of storage used for MPI buffers. MPI buffers are homogeneous: they are either all on the host or all on the device.

source
Armon.grid_dimensionsFunction
grid_dimensions(params::ArmonParameters)
grid_dimensions(block_size::NTuple{D, Int}, domain_size::NTuple{D, Int}, ghost::Int) where {D}

Returns the dimensions of the grid in the form (grid_size, static_sized_grid, remainder_block_size) from the block_size (the size of blocks in the static_sized_grid), the domain_size (number of real cells) and the number of ghost cells, common to all blocks.

grid_size is the static_sized_grid including the edge blocks. Edge blocks along the axis d have a size of remainder_block_size[d] along d only if they are at the edge of the grid along d, or block_size otherwise.

block_size includes the ghost cells in its dimensions, which must all be greater than 2*ghost.

If prod(block_size) == 0, then block_size is ignored and the grid is made of only a single block of size domain_size.

Note

Blocks must not be smaller than ghost, therefore edge blocks might be made bigger than block_size.

Note

In case domain_size is smaller than block_size .- 2*ghost along any axis, the grid will contain only edge blocks.
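As an illustration, the relation between block size, ghost cells and grid size can be sketched as follows (the numbers in the comment are assumptions used for illustration, not verified output):

# With block_size=(32, 32) and ghost=4, a static block holds 32 - 2*4 = 24
# real cells per axis, so a (100, 100) domain needs cld(100, 24) = 5 blocks
# per axis, the last one being an edge block.
grid_size, static_sized_grid, remainder_block_size =
    Armon.grid_dimensions((32, 32), (100, 100), 4)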

source
Armon.LocalTaskBlockType
LocalTaskBlock{D, H, Size, SolverState} <: TaskBlock{V}

Container of size Size, holding variables of type D on the device and H on the host. Part of a BlockGrid.

The block stores its own solver state, allowing it to run all solver steps independently of all other blocks, apart from steps requiring synchronization.

source
Armon.device_to_host!Function
device_to_host!(blk::LocalTaskBlock)

Copies the device data of blk to the host data. A no-op if the device is the host.

source
device_to_host!(grid::BlockGrid)

Copies device data of all blocks to the host data. A no-op if the device is the host.

source
Armon.host_to_device!Function
host_to_device!(blk::LocalTaskBlock)

Copies the host data of blk to its device data. A no-op if the device is the host.

source
host_to_device!(grid::BlockGrid)

Copies host data of all blocks to the device data. A no-op if the device is the host.

source
Armon.buffers_on_deviceFunction
buffers_on_device(::BlockGrid)
buffers_on_device(::Type{<:BlockGrid})

true if the communication buffers are stored on the device, allowing direct transfers without passing through the host (GPU-aware communication).

source
Armon.device_is_hostFunction
device_is_host(::BlockGrid{T, D, H})
device_is_host(::Type{<:BlockGrid{T, D, H}})

true if the device is the host, i.e. device blocks and host blocks are the same (and D == H).

source
Armon.block_idxFunction
block_idx(grid::BlockGrid, idx::CartesianIndex)

Linear index in grid.blocks of the block at idx in the statically-sized grid.

source
Armon.edge_block_idxFunction
edge_block_idx(grid::BlockGrid, idx::CartesianIndex)

Linear index in grid.edge_blocks of the block at idx, along the (dynamically-sized) edges of the grid.

source
Armon.remote_block_idxFunction
remote_block_idx(grid::BlockGrid, idx::CartesianIndex)

Linear index in grid.remote_blocks of the remote block at idx in the grid.

source
Armon.EdgeBlockRegionsType
EdgeBlockRegions(grid::BlockGrid; real_sizes=false)
EdgeBlockRegions(
    grid_size::NTuple{D, Int}, static_sized_grid::NTuple{D, Int},
    block_size::NTuple{D, Int}, remainder_block_size::NTuple{D, Int}; ghosts=0
)

Iterator over edge blocks, their positions and their size in each edge region of the grid.

ghosts only affects the size of edge blocks: B .- 2*ghosts. If real_sizes == true, then ghosts is ghosts(grid), therefore the real block size of edge blocks is returned.

There is a maximum of 2^D-1 edge regions in a grid. When grid_size[i] == static_sized_grid[i], there cannot be an edge region along axis i, hence there are exactly 2^sum(grid_size .!= static_sized_grid)-1 regions.

julia> collect(Armon.EdgeBlockRegions((5, 5), (4, 4), (32, 32), (16, 16)))
3-element Vector{Tuple{Int64, CartesianIndices{2, Tuple{UnitRange{Int64}, UnitRange{Int64}}}, Tuple{Int64, Int64}}}:
 (4, CartesianIndices((5:5, 1:4)), (16, 32))
 (4, CartesianIndices((1:4, 5:5)), (32, 16))
 (1, CartesianIndices((5:5, 5:5)), (16, 16))

julia> collect(Armon.EdgeBlockRegions((5, 5, 5), (4, 5, 4), (32, 32, 32), (16, 16, 16)))
3-element Vector{Tuple{Int64, CartesianIndices{3, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}}, Tuple{Int64, Int64, Int64}}}:
 (20, CartesianIndices((5:5, 1:5, 1:4)), (16, 32, 32))
 (20, CartesianIndices((1:4, 1:5, 5:5)), (32, 32, 16))
 (5, CartesianIndices((5:5, 1:5, 5:5)), (16, 32, 16))
source
Armon.RemoteBlockRegionsType
RemoteBlockRegions(grid::BlockGrid)
RemoteBlockRegions(grid_size::NTuple{D, Int})

Iterator of all remote block positions in each region (the faces of the grid). There are always 2*D regions.

julia> collect(Armon.RemoteBlockRegions((5, 3)))
4-element Vector{Tuple{Armon.Side.T, CartesianIndices{2, Tuple{UnitRange{Int64}, UnitRange{Int64}}}}}:
 (Armon.Side.Left, CartesianIndices((0:0, 1:3)))
 (Armon.Side.Right, CartesianIndices((6:6, 1:3)))
 (Armon.Side.Bottom, CartesianIndices((1:5, 0:0)))
 (Armon.Side.Top, CartesianIndices((1:5, 4:4)))
source
Armon.@iter_blocksMacro
@iter_blocks for blk in grid
    # body...
end

Applies the body of the for-loop to all blocks of the grid. Threads iterate over the blocks they are assigned to via grid.threads_workload.
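For example, copying every block of the grid back to the host (a sketch reusing device_to_host!, documented above):

@iter_blocks for blk in grid
    device_to_host!(blk)  # each thread handles the blocks it was assigned
end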

source
Armon.block_pos_containing_cellFunction
block_pos_containing_cell(grid::BlockGrid, pos::Union{CartesianIndex, NTuple})

Returns two CartesianIndexes: the first is the position of the block containing the cell at pos, the second is the position of the cell in that block.

source
Armon.block_originFunction
block_origin(grid::BlockGrid, pos, include_ghosts=false)

A Tuple of the position of the cell at the bottom left corner of the LocalTaskBlock at pos in the grid.

If include_ghosts == true, then the cell position includes all ghost cells of the grid.

pos can be any of Integer, NTuple{N, Integer} or CartesianIndex.

source
Armon.block_size_atFunction
block_size_at(grid::BlockGrid, idx)
block_size_at(idx, grid_size, static_sized_grid, block_size, remainder_block_size, ghosts)

Theoretical size of the block at idx in grid. If ghosts == 0, ghost cells will be included in the size.

source
Armon.move_pagesMethod
move_pages(grid::BlockGrid)

Move the pages of all blocks of the grid, including remote blocks, to the NUMA node of the thread which is in charge of working on that block.

source
Armon.lock_pagesMethod
lock_pages(grid::BlockGrid)

Locks the pages of all blocks of the grid, including remote blocks.

source

Block size and iteration

Armon.StaticBSizeType
StaticBSize{S, Ghost} <: BlockSize

A BlockSize of size S and Ghost cells. S is embedded in the type; this reduces the amount of memory used by the parameters of every kernel, as well as allowing the compiler to make some minor optimisations in indexing expressions.

Ghost cells are included in S: there are S .- 2*Ghost real cells.

source
Armon.DynamicBSizeType
DynamicBSize{Ghost} <: BlockSize

Similar to StaticBSize, but for blocks with a less-than-ideal size: block size is therefore not stored in the type. This results in less compilation when testing for different domain sizes with a constant StaticBSize.

The number of Ghost cells is still embedded in the type, as it can simplify some indexing expressions and for coherency.

source
Armon.border_domainFunction
border_domain(bsize::BlockSize, side::Side.T; single_strip=true)

DomainRange of the real cells along side.

If single_strip == true, it includes only one "strip" of cells, that is length(border_domain(bsize, side)) == size_along(bsize, side). Otherwise, there are ghosts(bsize) strips of cells: all real cells which would be exchanged with another block along side.

source
Armon.ghost_domainFunction
ghost_domain(bsize::BlockSize, side::Side.T; single_strip=true)

DomainRange of all ghost cells of side, excluding the corners of the block.

If single_strip == true, then the domain is only 1 cell thick, positioned at the furthest ghost cell from the real cells. Otherwise, there are ghosts(bsize) strips of cells: all ghost cells which would be exchanged with another block along side.
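Both domains are the two sides of the same halo exchange: the real cells of one block's border_domain are copied into the matching ghost_domain of its neighbour. A sketch, given some block size bsize:

# cells to send to the left neighbour, and where to receive cells from it
# (on the neighbouring block, the matching side would be Side.Right)
send_cells = Armon.border_domain(bsize, Armon.Side.Left; single_strip=false)
recv_cells = Armon.ghost_domain(bsize, Armon.Side.Left; single_strip=false)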

source
Armon.block_domain_rangeFunction
block_domain_range(bsize::BlockSize, corners)
block_domain_range(bsize::BlockSize, bottom_left::Tuple, top_right::Tuple)

A DomainRange built from offsets from the corners of bsize.

block_domain_range(bsize, (0, 0), (0, 0)) is the domain of all real cells in the block. block_domain_range(bsize, (-g, -g), (g, g)) would be the domain of all cells (real cells + g ghost cells) in the block.

source
Armon.positionFunction
position(bsize::BlockSize, i)

N-dim position of the i-th cell in the block.

If 1 ≤ position(bsize, i)[d] ≤ block_size(bsize)[d] then the cell is not a ghost cell along the d dimension. See is_ghost.

source
Armon.lin_positionFunction
lin_position(bsize::BlockSize, I)

From the NTuple (e.g. returned from position), return the linear index in the block. lin_position(bsize, position(bsize, i)) == i.
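The stated roundtrip can be checked directly (a sketch, given any block's bsize and a valid linear index i):

pos = Armon.position(bsize, i)   # N-dim position of the i-th cell
@assert Armon.lin_position(bsize, pos) == i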

source
Armon.in_gridFunction
in_grid(idx, size)
in_grid(start, idx, size)

true if each axis of idx is between start and size. start defaults to 1.

Argument types can be any mix of Integer, Tuple or CartesianIndex.

julia> in_grid(1, (1, 2), 2)  # same as `in_grid((1, 1), (1, 2), (2, 2))`
true

julia> in_grid((3, 1), (3, 2))
true

julia> in_grid((1, 3), (3, 2))
false
source
in_grid(idx, grid, axis::Axis.T)
in_grid(start, idx, grid, axis::Axis.T)

Same as in_grid(start, idx, grid), but only checks along axis.

source
Armon.is_ghostFunction
is_ghost(bsize::BlockSize, i, o=0)

true if the i-th cell of the block is a ghost cell, false otherwise.

o would be a "ring" index: o == 1 excludes the first ring of ghost cells, etc.

source
Armon.BlockRowIteratorType
BlockRowIterator(grid::BlockGrid; kwargs...)
BlockRowIterator(grid::BlockGrid, blk::LocalTaskBlock; kwargs...)
BlockRowIterator(grid::BlockGrid, blk₁_pos, blk₂_pos; kwargs...)
BlockRowIterator(grid::BlockGrid, sub_domain::NTuple{2, CartesianIndex}; kwargs...)
BlockRowIterator(grid::BlockGrid, row_iter::CartesianIndices; global_ghosts=false, all_ghosts=false)

Iterate the rows of all blocks of the grid, row by row (and not block by block). This allows iterating over the cells of the grid as if it were a single block.

Elements are tuples of (block, global_row_idx, row_range). row_range is the range of cells in block for the current row.

Giving blk will return an iterator on the rows of the block.

Giving blk₁_pos and blk₂_pos will return an iterator over all rows between those blocks.

Giving sub_domain will return an iterator including only the cells contained in sub_domain. sub_domain is a cuboid defined by the position of the cells in the whole domain of grid, using its lower and upper corners.

row_iter is an iterator over global row indices.

If global_ghosts == true, then the ghost cells at the border of the global domain are also returned. If all_ghosts == true, then the ghost cells at the border of all blocks are also returned.

julia> params = ArmonParameters(; N=(24, 8), nghost=4, block_size=(20, 12), use_MPI=false);

julia> grid = BlockGrid(params);

julia> for (blk, row_idx, row_range) in Armon.BlockRowIterator(grid)
           println(Tuple(blk.pos), " - ", row_idx, " - ", row_range)
       end
(1, 1) - (1, 1) - 85:96
(2, 1) - (2, 1) - 85:96
(1, 1) - (1, 2) - 105:116
(2, 1) - (2, 2) - 105:116
(1, 1) - (1, 3) - 125:136
(2, 1) - (2, 3) - 125:136
(1, 1) - (1, 4) - 145:156
(2, 1) - (2, 4) - 145:156
(1, 2) - (1, 5) - 85:96
(2, 2) - (2, 5) - 85:96
(1, 2) - (1, 6) - 105:116
(2, 2) - (2, 6) - 105:116
(1, 2) - (1, 7) - 125:136
(2, 2) - (2, 7) - 125:136
(1, 2) - (1, 8) - 145:156
(2, 2) - (2, 8) - 145:156
source
Armon.DomainRangeType
DomainRange

Two dimensional range to index a 2D array stored with contiguous rows. Not equivalent to CartesianIndices as it handles StepRanges properly.

source

Block states

Armon.SolverStateType
SolverState

Object containing all non-constant parameters needed to run the solver, as well as type-parameters needed to avoid runtime dispatch.

This object is local to a block (or set of blocks): multiple blocks could be at different steps of the solver at once.

source
Armon.first_stateFunction
first_state(grid::BlockGrid)

A SolverState which can be used as a global state when outside of a solver cycle. It belongs to the first device block.

source
Armon.block_state_machineFunction
block_state_machine(params::ArmonParameters, blk::LocalTaskBlock)

Advances the SolverStep state of blk, applying each step of the solver to it. This continues until the current cycle is done, or until the block must wait for another block to perform the ghost cell exchange (block_ghost_exchange) or to compute its time step (next_time_step).

Returns the new step of the block. If SolverStep.NewCycle is returned, blk has reached the end of the current cycle and will not progress any further until all other blocks have reached the same point.

source

Time step reduction

Armon.next_time_stepFunction
next_time_step(params::ArmonParameters, state::SolverState, blk::LocalTaskBlock; already_contributed=false)
next_time_step(params::ArmonParameters, state::SolverState, grid::BlockGrid)

Compute the time step of the next cycle. This is done at the start of the current cycle.

Since the current cycle does not rely on an up-to-date time step, the time step reduction is done fully asynchronously, including the global MPI reduction. The accuracy cost of this optimisation is minimal, as the CFL condition prevents the time step from being too large. Additionally, the time step is prevented from increasing by more than 5% of the previous one.

For the first cycle, if no initial time step is given, the time step computed for the next cycle is reused for the initial cycle.

If blk is given, its contribution is only added to the state.global_dt (the GlobalTimeStep). Passing the whole block grid will block until the new time step is computed.
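The growth cap mentioned above can be pictured as follows (a conceptual sketch, not the actual implementation):

# new_dt is the CFL-limited time step reduced over all blocks (and MPI ranks);
# it may not exceed the previous time step by more than 5%.
next_dt = min(new_dt, 1.05 * previous_dt)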

source
Armon.GlobalTimeStepType
GlobalTimeStep

Holds all information about the current time and time step for the current solver cycle. This struct is global and shared among all blocks.

When reaching next_time_step, blocks will contribute to the calculation of the next time step. The last block doing so will start the MPI reduction. The first block reaching the start of the next cycle will wait until this reduction is completed, then update the GlobalTimeStep.

source

Block exchanges

Armon.block_ghost_exchangeFunction
block_ghost_exchange(params::ArmonParameters, state::SolverState, blk::LocalTaskBlock)

Handles communications between blk and its neighbours, along the current state.axis. If blk is on one of the edges of the grid, a remote exchange is performed with the neighbouring RemoteTaskBlock, or the global boundary conditions are applied.

Returns true if exchanges were not completed, and the block is waiting on another to be ready for the exchange.

source
Armon.BlockInterfaceType
BlockInterface

Represents the interface between two neighbouring TaskBlocks. It synchronizes the state of the blocks to make sure the halo exchange happens when both blocks are ready, and that the operation is done by only one of the blocks.

source
Armon.exchange_done!Function
exchange_done!(blk, side)

Mark the exchange of the interface of blk along side as done. The other block needs to acknowledge this before the exchange state can be reset.

source
Armon.side_is_done!Function
side_is_done!(blk::LocalTaskBlock, side::Side.T, done::Bool=true)

Set the interface of blk along side as done. This is a non-atomic operation, meant only to avoid repeating the exchange logic multiple times.

source
Armon.BlockExchangeStateModule
BlockExchangeState

State of an interface between two blocks, controlling if a cell exchange can happen or did happen.

source

Block distribution

Armon.thread_workload_distributionFunction
thread_workload_distribution(params::ArmonParameters; threads=nothing)
thread_workload_distribution(
    threads::Int, grid_size::Tuple;
    scotch=true, simple=false, perimeter_first=false, kwargs...
)

Distribute each block in grid_size among the threads, as evenly as possible.

With simple == true, blocks are distributed with simple_workload_distribution.

With scotch == true, the Scotch graph partitioning solver is used to better split the grid. kwargs are passed to scotch_grid_partition.

If perimeter_first == true, the resulting distribution will have blocks sorted in a way that places neighbours of other threads' blocks first in the list. By doing so, communications between threads may be overlapped more frequently.

source
Armon.simple_workload_distributionFunction
simple_workload_distribution(threads, grid_size)

Basic distribution of B = prod(grid_size) blocks into groups of B ÷ threads, with the remaining blocks distributed evenly.

source
Armon.scotch_grid_partitionFunction
scotch_grid_partition(
    threads, grid_size;
    strategy=:default, workload_tolerance=0, repart=false, retries=10, weighted=false,
    static_sized_grid=nothing, block_size=nothing, remainder_block_size=nothing, ghosts=0
)

Split grid_size to the threads.

strategy is passed to Scotch.strat_flags.

workload_tolerance is the acceptable workload unevenness among the partitions. Giving a few blocks of margin (e.g. at least 1/prod(grid_size)) is preferable.

weighted == true will distribute blocks while taking into account the number of real cells they have. In this case all parameters of the grid must be given: static_sized_grid, block_size, remainder_block_size and ghosts (obtained with e.g. grid_dimensions).

The partitioning is random, hence results may vary. To counterbalance this, giving retries > 0 will repeat the partitioning retries times and keep the best one.

source
Armon.block_grid_from_workloadFunction
block_grid_from_workload(grid_size, threads_workload)

Convenience function to convert a threads_workload (result of thread_workload_distribution) into an Array of grid_size, with each element assigned to the tid given by the distribution.

This makes it easy to visualize the efficiency of the distribution.
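For example, to inspect how 8 threads would share a 16×16 block grid (a sketch combining the two functions above; the sizes are illustrative):

workload = Armon.thread_workload_distribution(8, (16, 16); scotch=true)
grid_view = Armon.block_grid_from_workload((16, 16), workload)  # Array of thread IDs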

source

Device and backends

Armon.CPU_HPType
CPU_HP

Device tag for the high-performance CPU backend using multithreading (with Polyester.jl) and vectorisation.

source
Armon.create_deviceFunction
create_device(::Val{:device_name})

Create a device object from its name.

Default devices:

  • :CPU: the CPU backend of KernelAbstractions.jl
  • :CPU_HP: Polyester.jl multithreading

Extensions:

  • :Kokkos: the default Kokkos.jl device
  • :CUDA: the CUDA.jl backend of KernelAbstractions.jl
  • :ROCM: the AMDGPU.jl backend of KernelAbstractions.jl
source
Armon.init_backendFunction
init_backend(params::ArmonParameters, ::Dev; options...)

Initialize the backend corresponding to the Dev device returned by create_device using options. Set the params.backend_options field.

It must return options, with the backend-specific options removed.

source
Armon.memory_infoFunction
memory_info(params)

The total and free memory the current process can store on the params.device.

source
Armon.memory_requiredFunction
memory_required(params::ArmonParameters)

(device_memory, host_memory) required for params.

MPI buffer sizes are included in the appropriate field depending on params.gpu_aware. params.use_MPI and params.neighbours are taken into account.

If device_is_host, then device_memory only includes memory required by data arrays and MPI buffers.

source
memory_required(N::NTuple{2, Int}, block_size::NTuple{2, Int}, ghost::Int, data_type)
memory_required(N::NTuple{2, Int}, block_size::NTuple{2, Int}, ghost::Int,
    device_array_type, host_array_type, buffer_array_type[, solver_state_type])

Compute the number of bytes needed to allocate all blocks. If only data_type is given, then device_array_type, host_array_type and buffer_array_type default to Vector{T}. solver_state_type defaults to SolverState{T, #= default schemes and test =#}.

In order of returned values:

  1. Amount of bytes needed for all arrays on the device. This amount is also required on the host when the host and device are not the same.
  2. Amount of bytes needed for all MPI buffers, if the sub-domain has neighbours on all of its sides. If params.gpu_aware, then this memory is allocated on the device.
  3. Amount of bytes needed on the host memory for all block objects, excluding array data and buffers. This memory is always allocated on the host.
res = memory_required((1000, 1000), (64, 64), 4, CuArray{Float64}, Vector{Float64}, Vector{Float64})
device_memory = res[1]
host_memory = res[3] + (device_is_host ? device_memory : res[1])  # arrays are needed on the host in both cases
if params.gpu_aware
    device_memory += res[2]  # MPI buffers are allocated on the device
else
    host_memory += res[2]  # MPI buffers are allocated on the host
end
source

Kernels

Armon.@generic_kernelMacro
@generic_kernel(function definition)

Transforms a single kernel function into six different functions:

  • four which run on the CPU, with or without Polyester.jl multithreading, each with or without SIMD
  • one which uses KernelAbstractions.jl to make a GPU-compatible kernel
  • a main function, which takes care of calling the right variant depending on whether the GPU is used or not.

To do this, two things are done:

  • All calls to @index_1D_lin(), @index_2D_lin() and @iter_idx() are replaced by their equivalent on each platform: a simple loop index on CPU, and a call to KA.jl's @index on GPU.
  • Arguments to each function are edited

A kernel function must call one of @index_1D_lin() or @index_2D_lin() at least once, since this will determine which type of indexing to use as well as which parameters to add.

The indexing macro @iter_idx gives the linear index to the current iteration (on CPU) or global thread (on GPU).

This means that in order to call the new main function, one needs to take into account which indexing macro was used:

  • In all cases, params::ArmonParameters is the first argument
  • Then, depending on the indexing macro used:
    • @index_1D_lin() : loop_range::OrdinalRange{Int}
    • @index_2D_lin() : main_range::OrdinalRange{Int}, inner_range::OrdinalRange{Int}
  • An optional keyword argument, no_threading, allows overriding the use_threading parameter, which can be useful in asynchronous contexts. It defaults to false.

Using KA.jl's @Const to annotate arguments is supported, but they will be present only in the GPU kernel definition.

Further customisation of the kernel and main function can be obtained using @kernel_options and @kernel_init.

@generic_kernel function f_kernel(A)
    i = @index_1D_lin()
    A[i] += 1
end

params.use_gpu = false
f(params, 1:10)  # CPU call

params.use_gpu = true
f(params, 1:10)  # GPU call
source
Armon.@kernel_initMacro
@kernel_init(expr)

Allows initializing some internal variables of the kernel before the loop. The given expression must NOT depend on any index: indexing macros (@index_1D_lin(), etc.) must not be used in it.

All parameters of the kernel are available during the execution of the init expression. On GPU, this expression will be executed by each thread.

This is a workaround for a limitation of Polyester.jl which prevents declaring typed variables inside the loop body.

@generic_kernel function simple_kernel(a, b)
    @kernel_init begin
        c::Float64 = sin(0.123)
    end
    i = @index_1D_lin()
    a[i] += b[i] * c
end
source
Armon.@kernel_optionsMacro
@kernel_options(options...)

Must be used (once, and explicitly) in the definition of a @generic_kernel function.

Gives options for @generic_kernel to adjust the resulting functions.

The possible options are:

  • debug: Prints the generated functions to stdout at compile time.
@generic_kernel function add_kernel(a, b)
    @kernel_options(debug)
    i = @index_1D_lin()
    a[i] += b[i]
end
source
Armon.@index_1D_linMacro
@index_1D_lin()

Indexing macro to use in a @generic_kernel function. Returns a linear index to access the 1D arrays used by the kernel.

Cannot be used alongside @index_2D_lin().

source
Armon.@index_2D_linMacro
@index_2D_lin()

Indexing macro to use in a @generic_kernel function. Returns a linear index to access the 2D arrays used by the kernel.

Cannot be used alongside @index_1D_lin().

source
Armon.@iter_idxMacro
@iter_idx()

Indexing macro to use in a @generic_kernel function. Returns the linear index of the current iteration (on CPU) or of the global thread (on GPU).

Equivalent to KernelAbstractions.jl's @index(Global, Linear).

source
Armon.@simd_loopMacro
@simd_loop(expr)

Allows enabling/disabling SIMD optimisations for a loop. When SIMD is enabled, it is assumed that there are no dependencies between iterations of the loop.

    @simd_loop for i = 1:n
        y[i] = x[i] * (x[i-1])
    end
source
Armon.@simd_threaded_iterMacro
@simd_threaded_iter(range, expr)

Same as @simd_threaded_loop(expr), but instead of slicing the range of the for loop in expr, we slice the range given as the first parameter and distribute the slices evenly to the threads.

The inner @simd loop assumes there are no dependencies between iterations.

    @simd_threaded_iter 4:2:100 for i in 1:100
        y[i] = log10(x[i]) + x[i]
    end
    # is equivalent to (without threading and SIMD)
    for j in 4:2:100
        for i in (1:100) .+ (j - 1)
            y[i] = log10(x[i]) + x[i]
        end
    end
source
Armon.@simd_threaded_loopMacro
@simd_threaded_loop(expr)

Allows enabling/disabling multithreading and/or SIMD for the loop depending on the parameters. When using SIMD, @fastmath and @inbounds are used.

In order to use SIMD and multithreading at the same time, the range of the loop is split in even batches. Each batch has a size of params.simd_batch iterations, meaning that the inner @simd loop has a fixed number of iterations, while the outer threaded loop will have N ÷ params.simd_batch iterations.

The loop range is assumed to be increasing: 1:2:100 is correct, 100:-2:1 is not. The inner @simd loop assumes there are no dependencies between iterations.

    @simd_threaded_loop for i = 1:n
        y[i] = log10(x[i]) + x[i]
    end
source
Armon.@threadedMacro
@threaded(expr)

Allows enabling/disabling multithreading of the loop depending on the parameters.

The default condition is params.use_threading && !params.use_cache_blocking. By passing :outside_kernel before expr, the condition becomes simply params.use_threading.

    @threaded for i = 1:n
        y[i] = log10(x[i]) + x[i]
    end
source

Logging

Armon.analyse_log_statsFunction
analyse_log_stats(f, grid_log::BlockGridLog)

Call f for each block of the grid_log, passing the block's position (a CartesianIndex), a Vector of BlockLogEvents, and the size of the block.

source
analyse_log_stats(grid_log::BlockGridLog)

Crunch all data of grid_log into tangible metrics contained in a BlockGridLogStats.

source
Armon.BLOCK_LOG_THREAD_LOCAL_STORAGEConstant
BLOCK_LOG_THREAD_LOCAL_STORAGE::Dict{UInt16, Int32}

Incremented by 1 every time a BlockLogEvent is created in a thread, i.e. each time a block has solver kernels applied to it through block_state_machine.

Since only differences between values are of interest, there is no need to reset it.

source

Utility

Armon.SolverExceptionType
SolverException

Thrown when the solver encounters an invalid state.

The category field can be used to distinguish between error types without inspecting the error message:

  • :config: a problem in the solver configuration, usually thrown when constructing ArmonParameters
  • :cpp: a C++ exception thrown by the C++ Kokkos backend
  • :time: an invalid time step

ErrorExceptions thrown by the solver represent internal errors.

source
Armon.@sectionMacro
@section(name, expr)
@section(name, options, expr)

Introduce a profiling section around expr. Sections can be nested. Sections do not introduce a new scope.

Placed before a for-loop, a new section will be started for each iteration. name can interpolate using loop variables (like Test.@testset).

It is assumed that a params variable of ArmonParameters is present in the scope of the @section.

options is of the form key=value:

  • async (default: false): if async=false (and !params.time_async), a barrier (wait(params)) is added at the end of the section.
params = ArmonParameters(#= ... =#)

@section "Iteration $i" for i in 1:10
    j = @section "Foo" begin
        foo(i)
    end

    @section "Some calculation" begin
        k = bar(i, j)
    end

    @sync begin
        @async begin
            @section "Task 1" async=true my_task_1(i, j, k)
        end

        @async begin
            @section "Task 2" async=true my_task_2(i, j, k)
        end
    end
end
source

NUMA utilities

Armon.array_pagesFunction
array_pages(A::Ptr, A_length)
array_pages(A::DenseArray)

Iterator over the memory pages (aligned to PAGE_SIZE) used by the array A.

source
Armon.touch_pagesFunction
touch_pages(pages)

Iterates over each of pages and "touches" them by loading a value then storing it back.

The (only) use case is to manually trigger the first-touch policy and/or force the kernel to physically allocate all pages.
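A sketch combining it with array_pages from above to physically back every page of an array:

A = Vector{Float64}(undef, 2^20)
Armon.touch_pages(Armon.array_pages(A))  # load + store one value per page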

source
Armon.move_pagesMethod
move_pages(pages::Vector{Ptr}, target_node; retries=4)
move_pages(pages::OrdinalRange{Ptr, Int}, target_node)
move_pages(A::DenseArray, target_node)

Move all pages (of either the array A or from the pages iterable) to the target_node (1-indexed).

The move status of all pages is checked afterwards: an error is thrown unless all pages are on target_node at the end of the move.

In case some pages cannot be moved because of an EBUSY status, the move is retried up to retries times until it succeeds, with a short sleep between attempts.

source
Armon.lock_pagesMethod
lock_pages(ptr, len)
lock_pages(pages::OrdinalRange{Ptr, Int})
lock_pages(A::DenseArray)

Lock the pages ranging from ptr to ptr+len in RAM, preventing the kernel from moving them around or putting them in swap memory.

source
Armon.unlock_pagesFunction
unlock_pages(ptr, len)
unlock_pages(pages::OrdinalRange{Ptr, Int})
unlock_pages(A::DenseArray)

Unlock the pages, opposite of lock_pages.

source