Armon
Documentation for Armon.
Armon is a 2D CFD solver for compressible, inviscid fluids, using the finite volume method.
It was made to explore Julia's capabilities in HPC and performance portability: it aims to perform well on any CPU and GPU. Domain decomposition using MPI is supported.
The twin project Armon-Kokkos is a mirror of the core of this solver (with far fewer options) written in C++ using the Kokkos library. Kernels from that solver can be reused in this one through the Kokkos.jl package.
Parameters and entry point
Armon.ArmonParameters — Type
ArmonParameters(; options...)
The parameters and current state of the solver. The state is reset at each call to armon.
There are many options. Each backend can add its own.
Options
Backend and MPI
device = :CUDA
Device to use. Supported values:
- :CPU_HP: Polyester.jl CPU multithreading (default if use_gpu=false)
- :CUDA: CUDA.jl GPU (default if use_gpu=true)
- :ROCM: AMDGPU.jl GPU
- :CPU: KernelAbstractions.jl CPU multithreading (using the standard Threads.jl)
use_MPI = true, P = (1, 1), reorder_grid = true, global_comm = nothing
MPI config. The MPI domain will be a process grid of size P. global_comm is the global communicator to use, defaulting to MPI.COMM_WORLD. reorder_grid is passed to MPI.Cart_create.
gpu_aware = true
Store MPI buffers on the device. This requires a GPU-aware MPI implementation. Does nothing when using the CPU only.
numa_aware = true
Allocate memory on the NUMA node associated with the thread that will work on that chunk of memory. This effectively enforces the first-touch policy instead of blindly relying on it.
lock_memory = false
Lock all memory pages in RAM using mlock, preventing them from being swapped out.
Kernels
use_threading = true, use_simd = true
Switches for CPU_HP kernels. use_threading enables @threaded for outer loops. use_simd enables @simd_loop for inner loops.
use_gpu = false
Enables the use of KernelAbstractions.jl kernels.
use_kokkos = false
Use kernels from Kokkos.jl.
use_cache_blocking = true
Separate the domain into semi-independent blocks, improving the cache locality of memory accesses and therefore memory throughput.
async_cycle = false
Apply all steps of the solver to all blocks asynchronously, fully taking advantage of cache blocking.
block_size = 1024
Size of blocks for cache blocking. Can be a tuple. If use_cache_blocking == false, this option only controls the size of GPU blocks.
use_two_step_reduction = false
Reduction kernels (dtCFL_kernel and conservation_vars) use some optimisations to perform the reduction in a single step. This might cause issues on some GPU backends; a gentler two-step approach avoids those.
workload_distribution = :simple
Dictates how blocks are distributed among threads when async_cycle == true:
- :simple trivially spreads all blocks to all threads evenly
- :scotch uses the Scotch solver to partition the block grid
- :sorted_scotch is the same as :scotch, but additionally sorts the blocks to work on those at the perimeter of each partition first, reducing the likelihood of waiting for neighbouring threads
- :weighted_sorted_scotch additionally takes into account the number of cells in each block instead of assuming an even workload for all blocks
busy_wait_limit = 100
Number of unproductive calls to block_state_machine in a cycle (calls which did not advance any of a thread's blocks) before a call to MPI_Wait or a system sleep is done. Only relevant when async_cycle == true.
Profiling
profiling = Symbol[]
List of profiling callbacks to use:
- :TimerOutputs: TimerOutputs.jl sections (added if measure_time=true)
- :NVTX_sections: NVTX.jl sections
- :NVTX_kernels: NVTX.jl sections for kernels
- :CUDA_kernels: equivalent to CUDA.@profile in front of all kernels
measure_time = true
measure_time=false removes any overhead caused by profiling.
time_async = true
time_async=false adds a barrier at the end of every section. Useful for GPU kernels.
Scheme and CFD solver
scheme = :GAD, riemann_limiter = :minmod
scheme is the Riemann solver scheme to use: :Godunov (1st order) or :GAD (2nd order, with limiter).
riemann_limiter is the limiter to use for the Riemann solver: :no_limiter, :minmod or :superbee.
projection = :euler_2nd
Scheme for the Eulerian remap step: :euler (1st order) or :euler_2nd (2nd order, + minmod limiter).
axis_splitting = :Sequential
Axis splitting to use:
- :Sequential: X then Y
- :SequentialSym (or :Godunov): X and Y then Y and X, alternating
- :Strang: ½X, Y, ½X then ½Y, X, ½Y, alternating (½ is for a halved time step)
- :X_only
- :Y_only
N = (10, 10)
Number of cells of the global domain along each axis.
nghost = 4
Number of ghost cells. Must be greater than or equal to the minimum number of ghost cells (min 1; scheme=:GAD adds one, projection=:euler_2nd adds one, and scheme=:GAD + projection=:euler_2nd adds another one).
Dt = 0., cst_dt = false, dt_on_even_cycles = false
Dt is the initial time step; it is computed after initialization by default. If cst_dt=true, the time step is always Dt and no reduction over the entire domain occurs. If dt_on_even_cycles=true, the time step is only updated on even cycles (the first cycle is even).
data_type = Float64
Data type for all variables. Should be an AbstractFloat.
Test case and domain
test = :Sod, domain_size = nothing, origin = nothing
test is the test case to use:
- :Sod: Sod shock tube test
- :Sod_y: Sod shock tube test along the Y axis
- :Sod_circ: circular Sod shock tube test (centered in the domain)
- :Bizarrium: Bizarrium test, similar to the Sod shock tube but with a special equation of state
- :Sedov: Sedov blast-wave test (centered in the domain, reaches the border at t=1 by default)
- :DebugIndexes: set all variables to their index in the global domain. Debug only.
cfl = 0., maxtime = 0., maxcycle = 500_000
cfl defaults to the test's default value, as does maxtime. The solver stops when t reaches maxtime or when maxcycle iterations have been done (maxcycle=0 stops right after initialization).
Output
silent = 0
silent=0 for maximum verbosity. silent=3 doesn't print info at each cycle. silent=5 doesn't print anything.
output_dir = ".", output_file = "output"
joinpath(output_dir, output_file) will be the path to the output file.
write_output = false, write_ghosts = false
write_output=true will write all saved_vars() to the output file. If write_ghosts=true, ghost cells will also be included.
write_slices = false
Will write all saved_vars() to 3 output files: one for the middle X row, one for the middle Y column, and one for the diagonal. If write_ghosts=true, ghost cells will also be included.
output_precision = nothing
Numbers are saved with output_precision digits of precision. Defaults to enough digits for an exact decimal representation.
animation_step = 0
If animation_step ≥ 1, then every animation_step cycles, variables will be saved as with write_output=true.
compare = false, is_ref = false, comparison_tolerance = 1e-10
If compare=true, then at every sub-step of each iteration of the solver, all variables will:
- (is_ref=false) be compared with a reference file found in output_dir
- (is_ref=true) be saved to a reference file in output_dir
When comparing, a relative comparison_tolerance (the rtol kwarg of isapprox) is accepted between values.
check_result = false
Check if conservation of mass and energy holds between initialization and the last iteration. An error is thrown otherwise. Accepts a relative comparison_tolerance.
return_data = false
If return_data=true, then in the SolverStats returned by armon, the data field will contain the BlockGrid used by the solver.
Armon.armon — Function
armon(::ArmonParameters)
Main entry point of the solver. Returns a SolverStats.
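A minimal usage sketch (hypothetical option values, using only options documented above):

using Armon

params = ArmonParameters(;
    test = :Sod, N = (100, 100),  # Sod shock tube on a 100×100 global domain
    maxcycle = 100,               # stop after at most 100 iterations
    use_gpu = false,              # stay on the CPU backend
    return_data = true,           # make stats.data contain the BlockGrid
)
stats = armon(params)
stats.data  # the BlockGrid used by the solver, since return_data=true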
Armon.SolverStats — Type
SolverStats
Solver output.
data is nothing if parameters.return_data is false.
timer is nothing if parameters.measure_time is false.
grid_log is nothing if parameters.log_blocks is false.
Armon.StepsRanges — Type
StepsRanges
Holds indexing information for all steps of the solver.
Domains are stored as block corner offsets: blocks can have different sizes, but always the same number of ghost cells, so the iteration domain is determined from the dimensions of the block. The first field is the offset to the first cell, the second is the offset to the last cell.
Armon.data_type — Function
data_type(::ArmonParameters{T})
Get T, the type used for numbers by the solver.
Grid and blocks
Armon.BlockGrid — Type
BlockGrid{T, DeviceA, HostA, BufferA, Ghost, BlockSize, Device, SolverState}
Stores TaskBlocks on the Device and host memory, in a grid.
LocalTaskBlocks are stored separately depending on whether they have a StaticBSize of BlockSize (in blocks) or a DynamicBSize (in edge_blocks).
Blocks have Ghost cells padding their real cells. This is included in their block_size. A block cannot have fewer real cells along an axis than the number of ghost cells, unless there is only a single block in the grid along that axis.
"Edge blocks" are blocks located on the right and/or top edge of the grid. They exist in order to handle domains with dimensions which are not multiples of the block size.
DeviceA and HostA are AbstractArray types for the device and host respectively.
BufferA is the type of storage used for MPI buffers. MPI buffers are homogeneous: they are either all on the host or all on the device.
Armon.grid_dimensions — Function
grid_dimensions(params::ArmonParameters)
grid_dimensions(block_size::NTuple{D, Int}, domain_size::NTuple{D, Int}, ghost::Int) where {D}
Returns the dimensions of the grid in the form (grid_size, static_sized_grid, remainder_block_size), from the block_size (the size of blocks in the static_sized_grid), the domain_size (number of real cells) and the number of ghost cells, common to all blocks.
grid_size is the static_sized_grid including the edge blocks. Edge blocks along the axis d have a size of remainder_block_size[d] along d only if they are at the edge of the grid along d, or block_size otherwise.
block_size includes the ghost cells in its dimensions, which must all be greater than 2*ghost.
If prod(block_size) == 0, then block_size is ignored and the grid is made of only a single block of size domain_size.
Blocks must not be smaller than ghost, therefore edge blocks might be made bigger than block_size.
In case domain_size is smaller than block_size .- 2*ghost along any axis, the grid will contain only edge blocks.
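For instance, a hypothetical call following the second signature above:

# 100×100 real cells, 32×32 blocks (ghost cells included), 4 ghost cells:
# each statically-sized block then holds 24×24 real cells.
grid_size, static_sized_grid, remainder_block_size =
    Armon.grid_dimensions((32, 32), (100, 100), 4)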
Armon.TaskBlock — Type
TaskBlock{V}
Abstract block used for cache blocking.
Armon.LocalTaskBlock — Type
LocalTaskBlock{D, H, Size, SolverState} <: TaskBlock{V}
Container of Size and variables of type D on the device and H on the host. Part of a BlockGrid.
The block stores its own solver state, allowing it to run all solver steps independently of all other blocks, apart from steps requiring synchronization.
Armon.BlockData — Type
BlockData{V}
Holds the variables of all cells of a LocalTaskBlock.
Armon.RemoteTaskBlock — Type
RemoteTaskBlock{B} <: TaskBlock{B}
Block located at the border of a BlockGrid, containing an MPI buffer of type B for communication with other BlockGrids.
Armon.device_to_host! — Function
device_to_host!(blk::LocalTaskBlock)
Copies the device data of blk to the host data. A no-op if the device is the host.
device_to_host!(grid::BlockGrid)
Copies the device data of all blocks to the host data. A no-op if the device is the host.
Armon.host_to_device! — Function
host_to_device!(blk::LocalTaskBlock)
Copies the host data of blk to its device data. A no-op if the device is the host.
host_to_device!(grid::BlockGrid)
Copies the host data of all blocks to the device data. A no-op if the device is the host.
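A hypothetical round-trip, e.g. to inspect intermediate results on the host:

grid = BlockGrid(params)
Armon.device_to_host!(grid)  # no-op when the device is the host
# ... read or modify the host-side data ...
Armon.host_to_device!(grid)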
Armon.buffers_on_device — Function
buffers_on_device(::BlockGrid)
buffers_on_device(::Type{<:BlockGrid})
true if the communication buffers are stored on the device, allowing direct transfers without passing through the host (GPU-aware communication).
Armon.device_is_host — Function
device_is_host(::BlockGrid{T, D, H})
device_is_host(::Type{<:BlockGrid{T, D, H}})
true if the device is the host, i.e. device blocks and host blocks are the same (and D == H).
Armon.all_blocks — Function
all_blocks(grid::BlockGrid)
Simple iterator over all blocks of the grid, excluding RemoteTaskBlocks.
Armon.block_idx — Function
block_idx(grid::BlockGrid, idx::CartesianIndex)
Linear index in grid.blocks of the block at idx in the statically-sized grid.
Armon.edge_block_idx — Function
edge_block_idx(grid::BlockGrid, idx::CartesianIndex)
Linear index in grid.edge_blocks of the block at idx, along the (dynamically-sized) edges of the grid.
Armon.remote_block_idx — Function
remote_block_idx(grid::BlockGrid, idx::CartesianIndex)
Linear index in grid.remote_blocks of the remote block at idx in the grid.
Armon.EdgeBlockRegions — Type
EdgeBlockRegions(grid::BlockGrid; real_sizes=false)
EdgeBlockRegions(
    grid_size::NTuple{D, Int}, static_sized_grid::NTuple{D, Int},
    block_size::NTuple{D, Int}, remainder_block_size::NTuple{D, Int}; ghosts=0
)
Iterator over edge blocks, their positions and their size in each edge region of the grid.
ghosts only affects the size of edge blocks: B .- 2*ghosts. If real_sizes == true, then ghosts is ghosts(grid), therefore the real block size of edge blocks is returned.
There is a maximum of 2^D-1 edge regions in a grid. When grid_size[i] == static_sized_grid[i], there cannot be an edge region along axis i, hence there are exactly 2^sum(grid_size .!= static_sized_grid)-1 regions.
julia> collect(Armon.EdgeBlockRegions((5, 5), (4, 4), (32, 32), (16, 16)))
3-element Vector{Tuple{Int64, CartesianIndices{2, Tuple{UnitRange{Int64}, UnitRange{Int64}}}, Tuple{Int64, Int64}}}:
(4, CartesianIndices((5:5, 1:4)), (16, 32))
(4, CartesianIndices((1:4, 5:5)), (32, 16))
(1, CartesianIndices((5:5, 5:5)), (16, 16))
julia> collect(Armon.EdgeBlockRegions((5, 5, 5), (4, 5, 4), (32, 32, 32), (16, 16, 16)))
3-element Vector{Tuple{Int64, CartesianIndices{3, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}}, Tuple{Int64, Int64, Int64}}}:
(20, CartesianIndices((5:5, 1:5, 1:4)), (16, 32, 32))
(20, CartesianIndices((1:4, 1:5, 5:5)), (32, 32, 16))
(5, CartesianIndices((5:5, 1:5, 5:5)), (16, 32, 16))
Armon.RemoteBlockRegions — Type
RemoteBlockRegions(grid::BlockGrid)
RemoteBlockRegions(grid_size::NTuple{D, Int})
Iterator over all remote block positions in each region (the faces of the grid). There are always 2*D regions.
julia> collect(Armon.RemoteBlockRegions((5, 3)))
4-element Vector{Tuple{Armon.Side.T, CartesianIndices{2, Tuple{UnitRange{Int64}, UnitRange{Int64}}}}}:
(Armon.Side.Left, CartesianIndices((0:0, 1:3)))
(Armon.Side.Right, CartesianIndices((6:6, 1:3)))
(Armon.Side.Bottom, CartesianIndices((1:5, 0:0)))
(Armon.Side.Top, CartesianIndices((1:5, 4:4)))
Armon.@iter_blocks — Macro
@iter_blocks for blk in grid
    # body...
end
Applies the body of the for-loop to all blocks of the grid. Threads iterate over the blocks they are assigned to via grid.threads_workload.
Armon.block_pos_containing_cell — Function
block_pos_containing_cell(grid::BlockGrid, pos::Union{CartesianIndex, NTuple})
Returns two CartesianIndexes: the first is the position of the block containing the cell at pos, the second is the position of the cell in that block.
Armon.block_origin — Function
block_origin(grid::BlockGrid, pos, include_ghosts=false)
A Tuple of the position of the cell at the bottom left corner of the LocalTaskBlock at pos in the grid.
If include_ghosts == true, then the cell position includes all ghost cells of the grid.
pos can be any of Integer, NTuple{N, Integer} or CartesianIndex.
Armon.block_at — Function
block_at(grid::BlockGrid, idx::CartesianIndex)
The TaskBlock at position idx in the grid.
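As an illustration (hypothetical cell position, assuming a grid::BlockGrid built as above):

# Find which block holds the global cell (50, 50), then fetch that block.
blk_pos, cell_pos = Armon.block_pos_containing_cell(grid, (50, 50))
blk = Armon.block_at(grid, blk_pos)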
Armon.block_size_at — Function
block_size_at(grid::BlockGrid, idx)
block_size_at(idx, grid_size, static_sized_grid, block_size, remainder_block_size, ghosts)
Theoretical size of the block at idx in the grid. If ghosts == 0, ghost cells will be included in the size.
Armon.move_pages — Method
move_pages(grid::BlockGrid)
Move the pages of all blocks of the grid, including remote blocks, to the NUMA node of the thread in charge of working on that block.
Armon.lock_pages — Method
lock_pages(grid::BlockGrid)
Locks the pages of all blocks of the grid, including remote blocks.
Block size and iteration
Armon.BlockSize — Type
BlockSize
Dimensions of a LocalTaskBlock.
Armon.StaticBSize — Type
StaticBSize{S, Ghost} <: BlockSize
A BlockSize of size S with Ghost cells. S is embedded in the type; this reduces the amount of memory in the parameters of every kernel, as well as allowing the compiler to make some minor optimisations in indexing expressions.
Ghost cells are included in S: there are S .- 2*Ghost real cells.
Armon.DynamicBSize — Type
DynamicBSize{Ghost} <: BlockSize
Similar to StaticBSize, but for blocks with a less-than-ideal size: the block size is therefore not stored in the type. This results in less compilation when testing different domain sizes with a constant StaticBSize.
The number of Ghost cells is still embedded in the type, as it can simplify some indexing expressions, and for coherency.
Armon.block_size — Function
NTuple of the dimensions of a BlockSize.
Armon.real_block_size — Function
NTuple of the dimensions of a BlockSize, excluding ghost cells.
Armon.ghosts — Function
Number of ghost cells of a BlockSize.
Armon.border_domain — Function
border_domain(bsize::BlockSize, side::Side.T; single_strip=true)
DomainRange of the real cells along side.
If single_strip == true, it includes only one "strip" of cells, that is, length(border_domain(bsize, side)) == size_along(bsize, side). Otherwise, there are ghosts(bsize) strips of cells: all real cells which would be exchanged with another block along side.
Armon.ghost_domain — Function
ghost_domain(bsize::BlockSize, side::Side.T; single_strip=true)
DomainRange of all ghost cells of side, excluding the corners of the block.
If single_strip == true, then the domain is only 1 cell thick, positioned at the furthest ghost cell from the real cells. Otherwise, there are ghosts(bsize) strips of cells: all ghost cells which would be exchanged with another block along side.
Armon.block_domain_range — Function
block_domain_range(bsize::BlockSize, corners)
block_domain_range(bsize::BlockSize, bottom_left::Tuple, top_right::Tuple)
A DomainRange built from offsets from the corners of bsize.
block_domain_range(bsize, (0, 0), (0, 0)) is the domain of all real cells in the block. block_domain_range(bsize, (-g, -g), (g, g)) would be the domain of all cells (real cells + g ghost cells) in the block.
Armon.position — Function
position(bsize::BlockSize, i)
N-dim position of the i-th cell in the block.
If 1 ≤ position(bsize, i)[d] ≤ block_size(bsize)[d], then the cell is not a ghost cell along the d dimension. See is_ghost.
Armon.lin_position — Function
lin_position(bsize::BlockSize, I)
From the NTuple I (e.g. returned from position), returns the linear index in the block. lin_position(bsize, position(bsize, i)) == i.
Armon.in_grid — Function
in_grid(idx, size)
in_grid(start, idx, size)
true if each axis of idx is between start and size. start defaults to 1.
Argument types can be any mix of Integer, Tuple or CartesianIndex.
julia> in_grid(1, (1, 2), 2) # same as `in_grid((1, 1), (1, 2), (2, 2))`
true
julia> in_grid((3, 1), (3, 2))
true
julia> in_grid((1, 3), (3, 2))
false
in_grid(idx, grid, axis::Axis.T)
in_grid(start, idx, grid, axis::Axis.T)
Same as in_grid(start, idx, grid), but only checks along axis.
Armon.is_ghost — Function
is_ghost(bsize::BlockSize, i, o=0)
true if the i-th cell of the block is a ghost cell, false otherwise.
o is a "ring" index: o == 1 excludes the first ring of ghost cells, etc.
Armon.BlockRowIterator — Type
BlockRowIterator(grid::BlockGrid; kwargs...)
BlockRowIterator(grid::BlockGrid, blk::LocalTaskBlock; kwargs...)
BlockRowIterator(grid::BlockGrid, blk₁_pos, blk₂_pos; kwargs...)
BlockRowIterator(grid::BlockGrid, sub_domain::NTuple{2, CartesianIndex}; kwargs...)
BlockRowIterator(grid::BlockGrid, row_iter::CartesianIndices; global_ghosts=false, all_ghosts=false)
Iterate the rows of all blocks of the grid, row by row (and not block by block). This allows iterating the cells of the grid as if it were a single block.
Elements are tuples of (block, global_row_idx, row_range). row_range is the range of cells in block for the current row.
Giving blk will return an iterator over the rows of the block.
Giving blk₁_pos and blk₂_pos will return an iterator over all rows between those blocks.
Giving sub_domain will return an iterator including only the cells contained in sub_domain. sub_domain is a cuboid defined by the position of the cells in the whole domain of the grid, using its lower and upper corners.
row_iter is an iterator over global row indices.
If global_ghosts == true, then the ghost cells at the border of the global domain are also returned. If all_ghosts == true, then the ghost cells at the border of all blocks are also returned.
julia> params = ArmonParameters(; N=(24, 8), nghost=4, block_size=(20, 12), use_MPI=false);
julia> grid = BlockGrid(params);
julia> for (blk, row_idx, row_range) in Armon.BlockRowIterator(grid)
println(Tuple(blk.pos), " - ", row_idx, " - ", row_range)
end
(1, 1) - (1, 1) - 85:96
(2, 1) - (2, 1) - 85:96
(1, 1) - (1, 2) - 105:116
(2, 1) - (2, 2) - 105:116
(1, 1) - (1, 3) - 125:136
(2, 1) - (2, 3) - 125:136
(1, 1) - (1, 4) - 145:156
(2, 1) - (2, 4) - 145:156
(1, 2) - (1, 5) - 85:96
(2, 2) - (2, 5) - 85:96
(1, 2) - (1, 6) - 105:116
(2, 2) - (2, 6) - 105:116
(1, 2) - (1, 7) - 125:136
(2, 2) - (2, 7) - 125:136
(1, 2) - (1, 8) - 145:156
(2, 2) - (2, 8) - 145:156
Armon.DomainRange — Type
DomainRange
Two-dimensional range used to index a 2D array stored with contiguous rows. Not equivalent to CartesianIndices, as it handles StepRanges properly.
Block states
Armon.SolverState — Type
SolverState
Object containing all non-constant parameters needed to run the solver, as well as type-parameters needed to avoid runtime dispatch.
This object is local to a block (or set of blocks): multiple blocks could be at different steps of the solver at once.
Armon.first_state — Function
first_state(grid::BlockGrid)
A SolverState which can be used as a global state when outside of a solver cycle. It belongs to the first device block.
Armon.SolverStep — Module
SolverStep
Enumeration of each state a LocalTaskBlock can be in. block_state_machine advances this state.
Armon.block_state_machine — Function
block_state_machine(params::ArmonParameters, blk::LocalTaskBlock)
Advances the SolverStep state of blk, applying each step of the solver to it. This continues until the current cycle is done, or the block needs to wait for another block to do the ghost cells exchange (block_ghost_exchange) or to compute its time step (next_time_step).
Returns the new step of the block. If SolverStep.NewCycle is returned, blk reached the end of the current cycle and will not progress any further until all other blocks have reached the same point.
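A hypothetical driver loop built on these semantics (given some params and blk::LocalTaskBlock; a real driver interleaves this with the other blocks assigned to the thread):

# Keep advancing a single block until it completes the current cycle.
while Armon.block_state_machine(params, blk) != Armon.SolverStep.NewCycle
    # blk is waiting on a neighbour exchange or on the time step reduction
end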
Time step reduction
Armon.next_time_step — Function
next_time_step(params::ArmonParameters, state::SolverState, blk::LocalTaskBlock; already_contributed=false)
next_time_step(params::ArmonParameters, state::SolverState, grid::BlockGrid)
Compute the time step of the next cycle. This is done at the start of the current cycle.
Since the current cycle does not rely on an up-to-date time step, the time step reduction is done fully asynchronously, including the global MPI reduction. The accuracy cost of this optimisation is minimal, as the CFL condition prevents the time step from being too large. Additionally, we prevent the time step from increasing by more than 5% over the previous one.
For the first cycle, if no initial time step is given, the time step of the next cycle is reused for the initial cycle.
If blk is given, its contribution is only added to state.global_dt (the GlobalTimeStep). Passing the whole block grid will block until the new time step is computed.
Armon.GlobalTimeStep — Type
GlobalTimeStep
Holds all information about the current time and time step for the current solver cycle. This struct is global and shared among all blocks.
When reaching next_time_step, blocks will contribute to the calculation of the next time step. The last block to do so will start the MPI reduction. The first block to reach the start of the next cycle will wait until this reduction is completed, then update the GlobalTimeStep.
Armon.TimeStepState.WaitingForMPI — Constant
One thread is waiting for the MPI reduction to complete.
Armon.TimeStepState.Done — Constant
MPI is done: current_dt is up-to-date.
Armon.TimeStepState.Ready — Constant
current_dt is up-to-date and blocks can contribute to next_dt.
Armon.TimeStepState.DoingMPI — Constant
The MPI reduction has started.
Armon.TimeStepState.AllContributed — Constant
All blocks contributed to next_dt; one thread will start the MPI reduction.
Block exchanges
Armon.block_ghost_exchange — Function
block_ghost_exchange(params::ArmonParameters, state::SolverState, blk::LocalTaskBlock)
Handles communications between the blk neighbours, along the current state.axis. If blk is on one of the edges of the grid, a remote exchange is performed with the neighbouring RemoteTaskBlock, or the global boundary conditions are applied.
Returns true if the exchange was not completed and the block is waiting on another block to be ready for the exchange.
Armon.start_exchange — Function
start_exchange(
    params::ArmonParameters,
    blk::LocalTaskBlock{D, H}, other_blk::RemoteTaskBlock{B}, side::Side.T
) where {D, H, B}
Start the exchange between one local block and a remote block from another sub-domain. Returns true if the exchange is BlockExchangeState.Done, false if BlockExchangeState.InProgress.
Armon.finish_exchange — Function
finish_exchange(
    params::ArmonParameters,
    blk::LocalTaskBlock{D, H}, other_blk::RemoteTaskBlock{B}, side::Side.T
) where {D, H, B}
Finish the exchange between one local block and a remote block from another sub-domain, if MPI communications are done. Returns true if the exchange is BlockExchangeState.Done, false if BlockExchangeState.InProgress.
Armon.BlockInterface — Type
BlockInterface
Represents the interface between two neighbouring TaskBlocks. It synchronizes the state of the blocks to make sure the halo exchange happens when both blocks are ready, and that the operation is done by only one of the blocks.
Armon.mark_ready_for_exchange! — Function
mark_ready_for_exchange!(blk, side)
Mark blk in the interface along its side as ready for an exchange.
Returns true if blk should do the exchange, and the new BlockExchangeState of the interface.
Armon.exchange_done! — Function
exchange_done!(blk, side)
Mark the exchange of the interface of blk along side as done. The other block needs to acknowledge this before the exchange state can be reset.
Armon.side_is_done! — Function
side_is_done!(blk::LocalTaskBlock, side::Side.T, done::Bool=true)
Set the interface of blk along side as done. This is a non-atomic operation, meant only to avoid repeating the exchange logic multiple times.
Armon.is_side_done — Function
is_side_done(blk::LocalTaskBlock, side::Side.T)
The value set by side_is_done!.
Armon.BlockExchangeState — Module
BlockExchangeState
State of an interface between two blocks, controlling whether a cell exchange can happen or did happen.
Armon.BlockExchangeState.NotReady — Constant
One of the blocks is not ready yet.
Armon.BlockExchangeState.InProgress — Constant
One of the blocks is performing the exchange.
Armon.BlockExchangeState.Done — Constant
The exchange is done.
Block distribution
Armon.thread_workload_distribution — Function
thread_workload_distribution(params::ArmonParameters; threads=nothing)
thread_workload_distribution(
    threads::Int, grid_size::Tuple;
    scotch=true, simple=false, perimeter_first=false, kwargs...
)
Distribute the blocks of a grid of grid_size among the threads, as evenly as possible.
With simple == true, blocks are distributed with simple_workload_distribution.
With scotch == true, the Scotch graph partitioning solver is used to better split the grid. kwargs are passed to scotch_grid_partition.
If perimeter_first == true, the resulting distribution will have blocks sorted in a way that places neighbours of other threads' blocks first in the list. By doing so, communications between threads may be overlapped more frequently.
Armon.simple_workload_distribution — Function
simple_workload_distribution(threads, grid_size)
Basic distribution of the B = prod(grid_size) blocks into groups of B ÷ threads, with the remaining blocks distributed evenly.
Armon.scotch_grid_partition — Function
scotch_grid_partition(
    threads, grid_size;
    strategy=:default, workload_tolerance=0, repart=false, retries=10, weighted=false,
    static_sized_grid=nothing, block_size=nothing, remainder_block_size=nothing, ghosts=0
)
Split the grid of grid_size among the threads.
strategy is passed to Scotch.strat_flags.
workload_tolerance is the acceptable workload unevenness among the partitions. Giving a few blocks of margin (e.g. at least 1/prod(grid_size)) is preferable.
weighted == true will distribute blocks while taking into account the number of real cells they have. In this case all parameters of the grid must be present: static_sized_grid, block_size, remainder_block_size and ghosts must be given (obtained with e.g. grid_dimensions).
The partitioning is random, hence results may vary. To counterbalance this, giving retries > 0 will repeat the partitioning retries times and keep the best one.
Armon.block_grid_from_workload — Function
block_grid_from_workload(grid_size, threads_workload)
Convenience function to convert a threads_workload (the result of thread_workload_distribution) into an Array of size grid_size, with each element assigned the tid given by the distribution.
This makes it easy to visualize the efficiency of the distribution.
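A hypothetical way to visualize a distribution, combining the two functions above:

# Distribute an 8×8 block grid among 4 threads, then view the result as a
# matrix where each entry is the thread id assigned to that block.
workload = Armon.thread_workload_distribution(4, (8, 8); scotch=true)
tid_grid = Armon.block_grid_from_workload((8, 8), workload)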
Device and backends
Armon.CPU_HP — Type
CPU_HP
Device tag for the high-performance CPU backend using multithreading (with Polyester.jl) and vectorisation.
Armon.create_device — Function
create_device(::Val{:device_name})
Create a device object from its name.
Default devices:
- :CPU: the CPU backend of KernelAbstractions.jl
- :CPU_HP: Polyester.jl multithreading
Extensions:
- :Kokkos: the default Kokkos.jl device
- :CUDA: the CUDA.jl backend of KernelAbstractions.jl
- :ROCM: the AMDGPU.jl backend of KernelAbstractions.jl
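For example (a sketch; the device name mirrors the device option of ArmonParameters):

dev = Armon.create_device(Val(:CPU_HP))  # the Polyester.jl CPU backend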
Armon.init_backend — Function
init_backend(params::ArmonParameters, ::Dev; options...)
Initialize the backend corresponding to the Dev device returned by create_device, using options. Sets the params.backend_options field.
It must return options, with the backend-specific options removed.
Armon.device_memory_info — Function
device_memory_info(device)
The total and free memory on the device, in bytes.
Armon.memory_info — Function
memory_info(params)
The total and free memory the current process can store on params.device.
Armon.memory_required — Function
memory_required(params::ArmonParameters)
(device_memory, host_memory) required for params.
MPI buffer sizes are included in the appropriate field depending on params.gpu_aware. params.use_MPI and params.neighbours are taken into account.
If device_is_host, then device_memory only includes memory required by data arrays and MPI buffers.
memory_required(N::NTuple{2, Int}, block_size::NTuple{2, Int}, ghost::Int, data_type)
memory_required(N::NTuple{2, Int}, block_size::NTuple{2, Int}, ghost::Int,
    device_array_type, host_array_type, buffer_array_type[, solver_state_type])
Compute the number of bytes needed to allocate all blocks. If only data_type is given, then device_array_type, host_array_type and buffer_array_type default to Vector{T}. solver_state_type defaults to SolverState{T, #= default schemes and test =#}.
In order of returned values:
- Amount of bytes needed for all arrays on the device. This amount is also required on the host when the host and device are not the same.
- Amount of bytes needed for all MPI buffers, if the sub-domain has neighbours on all of its sides. If params.gpu_aware, then this memory is allocated on the device.
- Amount of bytes needed on the host for all block objects, excluding array data and buffers. This memory is always allocated on the host.
res = memory_required((1000, 1000), (64, 64), 4, CuArray{Float64}, Vector{Float64}, Vector{Float64})
device_memory = res[1]
host_memory = res[3] + (device_is_host ? device_memory : res[1])
if params.gpu_aware
    device_memory += res[2]
else
    host_memory += res[2]
end
Kernels
Armon.@generic_kernel — Macro
@generic_kernel(function definition)
Transforms a single kernel function into six different functions:
- 4 which run on the CPU, using Polyester.jl's multi-threading or not, as well as SIMD or not
- one which uses KernelAbstractions.jl to make a GPU-compatible kernel
- a main function, which takes care of calling the right one depending on whether we want to use the GPU or not
To do this, two things are done:
- All calls to @index_1D_lin(), @index_2D_lin() and @iter_idx() are replaced by their equivalent on their respective platforms: a simple loop index for CPUs, and a call to KA.jl's @index for GPUs.
- Arguments to each function are edited.
A kernel function must call one of @index_1D_lin() or @index_2D_lin() at least once, since this determines which type of indexing to use as well as which parameters to add.
The indexing macro @iter_idx gives the linear index of the current iteration (on CPU) or global thread (on GPU).
This means that in order to call the new main function, one needs to take into account which indexing macro was used:
- In all cases, params::ArmonParameters is the first argument
- Then, depending on the indexing macro used:
  - @index_1D_lin(): loop_range::OrdinalRange{Int}
  - @index_2D_lin(): main_range::OrdinalRange{Int}, inner_range::OrdinalRange{Int}
- An optional keyword argument, no_threading, allows overriding the use_threading parameter, which can be useful in asynchronous contexts. It defaults to false.
Using KA.jl's @Const to annotate arguments is supported, but they will be present only in the GPU kernel definition.
Further customisation of the kernel and main function can be obtained using @kernel_options and @kernel_init.
@generic_kernel function f_kernel(A)
    i = @index_1D_lin()
    A[i] += 1
end

params.use_gpu = false
f(params, 1:10)  # CPU call
params.use_gpu = true
f(params, 1:10)  # GPU call
Armon.@kernel_init — Macro
@kernel_init(expr)
Allows initializing some internal variables of the kernel before the loop. The given expression must NOT depend on any index: you must not use any indexing macro (@index_1D_lin(), etc.) in the expression.
All parameters of the kernel are available during the execution of the init expression. On GPU, this expression is executed by each thread.
This is a workaround for a limitation of Polyester.jl which prevents you from typing variables.
@generic_kernel function simple_kernel(a, b)
    @kernel_init begin
        c::Float64 = sin(0.123)
    end
    i = @index_1D_lin()
    a[i] += b[i] * c
end
Armon.@kernel_options — Macro
@kernel_options(options...)
Must be used (once, and explicitly) in the definition of a @generic_kernel function.
Gives options to @generic_kernel to adjust the resulting functions.
The possible options are:
- debug: prints the generated functions to stdout at compile time.
@generic_kernel function add_kernel(a, b)
    @kernel_options(debug)
    i = @index_1D_lin()
    a[i] += b[i]
end
Armon.@index_1D_lin — Macro
@index_1D_lin()
Indexing macro to use in a @generic_kernel function. Returns a linear index to access the 1D arrays used by the kernel.
Cannot be used alongside @index_2D_lin().
Armon.@index_2D_lin — Macro
@index_2D_lin()
Indexing macro to use in a @generic_kernel function. Returns a linear index to access the 2D arrays used by the kernel.
Cannot be used alongside @index_1D_lin().
Armon.@iter_idx — Macro
@iter_idx()
Indexing macro to use in a @generic_kernel function. Returns the linear index of the current iteration (on CPU) or of the global thread (on GPU).
Equivalent to KernelAbstractions.jl's @index(Global, Linear).
Armon.@simd_loop — Macro
@simd_loop(expr)
Allows enabling/disabling SIMD optimisations for a loop. When SIMD is enabled, it is assumed that there are no dependencies between the iterations of the loop.

@simd_loop for i = 1:n
    y[i] = x[i] * (x[i-1])
end
Armon.@simd_threaded_iter — Macro
@simd_threaded_iter(range, expr)
Same as @simd_threaded_loop(expr), but instead of slicing the range of the for loop in expr, we slice the range given as the first parameter and distribute the slices evenly to the threads.
The inner @simd loop assumes there are no dependencies between iterations.

@simd_threaded_iter 4:2:100 for i in 1:100
    y[i] = log10(x[i]) + x[i]
end
# is equivalent to (without threading and SIMD)
for j in 4:2:100
    for i in (1:100) .+ (j - 1)
        y[i] = log10(x[i]) + x[i]
    end
end
Armon.@simd_threaded_loop — Macro
@simd_threaded_loop(expr)
Allows enabling/disabling multithreading and/or SIMD of the loop depending on the parameters. When using SIMD, @fastmath and @inbounds are used.
In order to use SIMD and multithreading at the same time, the range of the loop is split into even batches. Each batch has a size of params.simd_batch iterations, meaning that the inner @simd loop has a fixed number of iterations, while the outer threaded loop has N ÷ params.simd_batch iterations.
The loop range is assumed to be increasing: 1:2:100 is correct, 100:-2:1 is not. The inner @simd loop assumes there are no dependencies between iterations.

@simd_threaded_loop for i = 1:n
    y[i] = log10(x[i]) + x[i]
end
Armon.@threaded — Macro
@threaded(expr)
Allows enabling/disabling multithreading of the loop depending on the parameters.
The default condition is params.use_threading && !params.use_cache_blocking. By passing :outside_kernel before expr, the condition becomes simply params.use_threading.

@threaded for i = 1:n
    y[i] = log10(x[i]) + x[i]
end
Armon.@threads — Macro
Controls which multi-threading library to use.
Logging
Armon.BlockLogEvent — Type
BlockLogEvent
Info about a block, emitted after a call to block_state_machine which successfully advanced the internal state of the block, only if params.log_blocks == true.
Armon.ThreadLogEvent — Type
ThreadLogEvent
Info about a thread, emitted after a call to solver_cycle_async, which itself calls block_state_machine.
Armon.collect_logs — Function
collect_logs(grid::BlockGrid)
Collect all BlockLogEvents of the grid into a single object (a BlockGridLog). See analyse_log_stats to print metrics about all blocks.
Armon.analyse_log_stats — Function
analyse_log_stats(f, grid_log::BlockGridLog)
Call f for each block of the grid_log, passing the block's position (a CartesianIndex), a Vector of BlockLogEvents, and the size of the block.
analyse_log_stats(grid_log::BlockGridLog)
Crunch all data of grid_log into tangible metrics contained in a BlockGridLogStats.
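A hypothetical post-run workflow (assuming the solver ran with log_blocks=true and grid is its BlockGrid):

grid_log = Armon.collect_logs(grid)        # gather all BlockLogEvents
stats = Armon.analyse_log_stats(grid_log)  # crunch them into a BlockGridLogStats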
Armon.BlockGridLogStats — Type
BlockGridLogStats
Metrics about the blocks and threads of a full solver execution.
Armon.BLOCK_LOG_THREAD_LOCAL_STORAGE — Constant
BLOCK_LOG_THREAD_LOCAL_STORAGE::Dict{UInt16, Int32}
Incremented by 1 every time a BlockLogEvent is created in a thread, i.e. each time a block has solver kernels applied to it through block_state_machine.
Since only differences between values are of interest, there is no need to reset it.
Utility
Armon.Axis — Module
Axis
Enumeration of the axes of the domain.
Armon.Side — Module
Side
Enumeration of the sides of the domain.
Armon.SolverException — Type
SolverException
Thrown when the solver encounters an invalid state.
The category field can be used to distinguish between error types without inspecting the error message:
- :config: a problem in the solver configuration, usually thrown when constructing ArmonParameters
- :cpp: a C++ exception thrown by the C++ Kokkos backend
- :time: an invalid time step
ErrorExceptions thrown by the solver represent internal errors.
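For instance, a hypothetical way to filter solver errors by category:

try
    armon(params)
catch e
    if e isa Armon.SolverException && e.category === :time
        @warn "Invalid time step" exception=e
    else
        rethrow()
    end
end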
Armon.@section — Macro
@section(name, expr)
@section(name, options, expr)
Introduces a profiling section around expr. Sections can be nested. Sections do not introduce a new scope.
Placed before a for-loop, a new section will be started for each iteration. name can interpolate using loop variables (like Test.@testset).
It is assumed that a params variable of ArmonParameters is present in the scope of the @section.
options is of the form key=value:
- async (default: false): if async=false (and !params.time_async), a barrier (wait(params)) is added at the end of the section.
params = ArmonParameters(#= ... =#)
@section "Iteration $i" for i in 1:10
    j = @section "Foo" begin
        foo(i)
    end
    @section "Some calculation" begin
        k = bar(i, j)
    end
    @sync begin
        @async begin
            @section "Task 1" async=true my_task_1(i, j, k)
        end
        @async begin
            @section "Task 2" async=true my_task_2(i, j, k)
        end
    end
end
NUMA utilities
Armon.array_pages — Function
array_pages(A::Ptr, A_length)
array_pages(A::DenseArray)
Iterator over the memory pages (aligned to PAGE_SIZE) used by the array A.
Armon.touch_pages — Function
touch_pages(pages)
Iterates over each of the pages and "touches" them by loading a value then storing it again.
The (only) use case is to manually trigger the first-touch policy and/or force the kernel to physically allocate all pages.
Armon.move_pages — Method
move_pages(pages::Vector{Ptr}, target_node; retries=4)
move_pages(pages::OrdinalRange{Ptr, Int}, target_node)
move_pages(A::DenseArray, target_node)
Move all pages (of either the array A or the pages iterable) to the target_node (1-indexed).
The move status of all pages is checked afterwards: an error is thrown unless all pages are on target_node at the end of the move.
In case some pages cannot be moved because of an EBUSY status, the move is retried up to retries times until it succeeds, with a short sleep in between attempts.
Armon.lock_pages — Method
lock_pages(ptr, len)
lock_pages(pages::OrdinalRange{Ptr, Int})
lock_pages(A::DenseArray)
Lock the pages ranging from ptr to ptr+len in RAM, preventing the kernel from moving them around or putting them in swap memory.
Armon.unlock_pages — Function
unlock_pages(ptr, len)
unlock_pages(pages::OrdinalRange{Ptr, Int})
unlock_pages(A::DenseArray)
Unlock the pages, the opposite of lock_pages.
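A hypothetical combination of these utilities on a freshly allocated array:

A = Vector{Float64}(undef, 1 << 20)      # ~8 MiB of Float64s
Armon.touch_pages(Armon.array_pages(A))  # physically allocate all pages
Armon.move_pages(A, 1)                   # move them to NUMA node 1
Armon.lock_pages(A)                      # pin them in RAM
# ... work on A ...
Armon.unlock_pages(A)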