Armon

Documentation for Armon.

Armon is a 2D CFD solver for compressible, inviscid fluids, using the finite volume method.

It was made to explore Julia's capabilities in HPC and its potential for performance portability: it should perform very well on any CPU and GPU. Domain decomposition using MPI is supported.

The twin project Armon-Kokkos is a mirror of the core of this solver (with far fewer options), written in C++ using the Kokkos library. Kernels from that solver can be reused in this one through the Kokkos.jl package.

Parameters and entry point

Armon.ArmonParametersType
ArmonParameters(; options...)

The parameters and current state of the solver.

The state is reset at each call to armon.

There are many options. Each backend can add its own.

Options

Backend and MPI

device = :CUDA

Device to use. Supported values:

  • :CPU_HP: Polyester.jl CPU multithreading (default if use_gpu=false)
  • :CUDA: CUDA.jl GPU (default if use_gpu=true)
  • :ROCM: AMDGPU.jl GPU
  • :CPU: KernelAbstractions.jl CPU multithreading (using the standard Threads.jl)
use_MPI = true, P = (1, 1), reorder_grid = true, global_comm = nothing

MPI config. The MPI domain will be a process grid of size P. global_comm is the global communicator to use, defaults to MPI.COMM_WORLD. reorder_grid is passed to MPI.Cart_create.
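For example, a CUDA run on a 2×2 process grid might be configured as follows (a minimal sketch: the options are those documented here, the values are illustrative):

using Armon, MPI
MPI.Init()
params = ArmonParameters(; use_gpu=true, device=:CUDA, use_MPI=true, P=(2, 2))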

gpu_aware = true

Store MPI buffers on the device. This requires a GPU-aware MPI implementation. Does nothing when using the CPU only.

numa_aware = true

Allocate memory on the NUMA node associated with the thread meant to work on that chunk of memory. This effectively enforces the first-touch policy, instead of blindly relying on it.

lock_memory = false

Lock all memory pages to RAM using mlock.

Kernels

use_threading = true, use_simd = true

Switches for CPU_HP kernels. use_threading enables @threaded for outer loops. use_simd enables @simd_loop for inner loops.

use_gpu = false

Enables the use of KernelAbstractions.jl kernels.

use_kokkos = false

Use kernels for Kokkos.jl.

use_cache_blocking = true

Separate the domain into semi-independent blocks, improving the cache locality of memory accesses and therefore the memory throughput.

async_cycle = false

Apply all steps of the solver to all blocks asynchronously, fully taking advantage of cache blocking.

block_size = 1024

Size of blocks for cache blocking. Can be a tuple. If use_cache_blocking == false, this option only controls the size of GPU blocks.

use_two_step_reduction = false

Reduction kernels (dtCFL_kernel and conservation_vars) use some optimizations to perform the reduction in a single step, which might cause issues on some GPU backends; use_two_step_reduction=true uses a gentler two-step approach which avoids them.

workload_distribution = :simple

Dictates how blocks are distributed among threads when async_cycle == true:

  • :simple trivially spreads all blocks to all threads evenly
  • :scotch uses the Scotch solver to partition the block grid
  • :sorted_scotch is the same as :scotch, but additionally sorts the blocks to work on those at the perimeter of the square first, reducing the likelihood of waiting for neighbouring threads
  • :weighted_sorted_scotch takes into account the number of cells in each block instead of assuming an even workload for all blocks
busy_wait_limit = 100

Number of unproductive calls to block_state_machine in a cycle (calls which did not advance any of a thread's blocks) before a call to MPI_Wait or a system sleep is performed. This is only relevant when async_cycle == true.
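For instance, an asynchronous cache-blocked CPU run could combine the options of this section as follows (a sketch; the values are illustrative):

params = ArmonParameters(;
    use_cache_blocking=true, async_cycle=true,
    block_size=(64, 64), workload_distribution=:sorted_scotch
)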

Profiling

profiling = Symbol[]

List of profiling callbacks to use:

  • :TimerOutputs: TimerOutputs.jl sections (added if measure_time=true)
  • :NVTX_sections: NVTX.jl sections
  • :NVTX_kernels: NVTX.jl sections for kernels
  • :CUDA_kernels: equivalent to CUDA.@profile in front of all kernels
measure_time = true

measure_time=false can remove any overhead caused by profiling.

time_async = true

time_async=false will add a barrier at the end of every section. Useful for GPU kernels.

Scheme and CFD solver

scheme = :GAD, riemann_limiter = :minmod

scheme is the Riemann solver scheme to use:

  • :Godunov (1st order)
  • :GAD (2nd order, with limiter).

riemann_limiter is the limiter to use for the Riemann solver: :no_limiter, :minmod or :superbee.

projection = :euler_2nd

Scheme for the Eulerian remap step:

  • :euler (1st order)
  • :euler_2nd (2nd order, +minmod limiter)
axis_splitting = :Sequential

Axis splitting to use:

  • :Sequential: X then Y
  • :SequentialSym (or :Godunov): X and Y then Y and X, alternating
  • :Strang: ½X, Y, ½X then ½Y, X, ½Y, alternating (½ is for halved time step)
  • :X_only
  • :Y_only
N = (10, 10)

Number of cells of the global domain along each axis.

nghost = 4

Number of ghost cells. Must be greater than or equal to the minimum number of ghost cells (at least 1; scheme=:GAD adds one, projection=:euler_2nd adds one, and scheme=:GAD + projection=:euler_2nd adds one more)

Dt = 0., cst_dt = false, dt_on_even_cycles = false

Dt is the initial time step; by default it is computed after initialization. If cst_dt=true then the time step is always Dt and no reduction over the entire domain occurs. If dt_on_even_cycles=true then the time step is only updated on even cycles (the first cycle is even).

data_type = Float64

Data type for all variables. Should be an AbstractFloat.

Test case and domain

test = :Sod, domain_size = nothing, origin = nothing

test is the test case name to use:

  • :Sod: Sod shock tube test
  • :Sod_y: Sod shock tube test along the Y axis
  • :Sod_circ: Circular Sod shock tube test (centered in the domain)
  • :Bizarrium: Bizarrium test, similar to the Sod shock tube but with a special equation of state
  • :Sedov: Sedov blast-wave test (centered in the domain, reaches the border at t=1 by default)
  • :DebugIndexes: Set all variables to their index in the global domain. Debug only.
cfl = 0., maxtime = 0., maxcycle = 500_000

cfl defaults to the test's default value, same for maxtime. The solver stops when t reaches maxtime or when maxcycle iterations have been done (maxcycle=0 stops right after initialization).

Output

silent = 0

silent=0 for maximum verbosity. silent=3 doesn't print info at each cycle. silent=5 doesn't print anything.

output_dir = ".", output_file = "output"

joinpath(output_dir, output_file) will be the path to the output file.

write_output = false, write_ghosts = false

write_output=true will write all saved_vars() to the output file. If write_ghosts=true, ghost cells will also be included.

write_slices = false

Will write all saved_vars() to 3 output files, one for the middle X row, another for the middle Y column, and another for the diagonal. If write_ghosts=true, ghost cells will also be included.

output_precision = nothing

Numbers are saved with output_precision digits of precision. Defaults to enough digits for an exact decimal representation.

animation_step = 0

If animation_step ≥ 1, then every animation_step cycles, variables will be saved as with write_output=true.

compare = false, is_ref = false, comparison_tolerance = 1e-10

If compare=true, then at every sub step of each iteration of the solver all variables will:

  • (is_ref=false) be compared with a reference file found in output_dir
  • (is_ref=true) be saved to a reference file in output_dir

When comparing, a relative comparison_tolerance (the rtol kwarg of isapprox) is accepted between values.

check_result = false

Check if conservation of mass and energy is verified between initialization and the last iteration. An error is thrown otherwise. Accepts a relative comparison_tolerance.

return_data = false

If return_data=true, then in the SolverStats returned by armon, the data field will contain the BlockGrid used by the solver.
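Putting it all together, a typical solver run might look like this (a sketch; armon is the entry point and the option values are illustrative):

params = ArmonParameters(; test=:Sod_circ, N=(100, 100), maxcycle=1000, return_data=true)
stats = armon(params)
grid = stats.data  # the BlockGrid used by the solver, since return_data=true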

source
Armon.SolverStatsType
SolverStats

Solver output.

data is nothing if parameters.return_data is false.

timer is nothing if parameters.measure_time is false.

grid_log is nothing if parameters.log_blocks is false.

source
Armon.StepsRangesType
StepsRanges

Holds indexing information for all steps of the solver.

Domains are stored as block corner offsets: blocks can have different sizes, but always the same amount of ghost cells, therefore the iteration domain is determined from the dimensions of the block. The first field is the offset to the first cell, the second is the offset to the last cell.

source
Armon.data_typeFunction
data_type(::ArmonParameters{T})

Get T, the type used for numbers by the solver.

source

Grid and blocks

Armon.BlockGridType
BlockGrid{T, DeviceA, HostA, BufferA, Ghost, BlockSize, Device, SolverState}

Stores TaskBlocks on the Device and host memory, in a grid.

LocalTaskBlocks are stored separately depending on whether they have a StaticBSize of BlockSize (in blocks) or a DynamicBSize (in edge_blocks).

Blocks have Ghost cells padding their real cells. This is included in their block_size. A block cannot have a number of real cells along an axis smaller than the number of ghost cells, unless there is only a single block in the grid along that axis.

"Edge blocks" are blocks located on the right and/or top edge of the grid. They exist in order to handle domains with dimensions which are not multiples of the block size.

DeviceA and HostA are AbstractArray types for the device and host respectively.

BufferA is the type of storage used for MPI buffers. MPI buffers are homogeneous: they are either all on the host or all on the device.

source
Armon.grid_dimensionsFunction
grid_dimensions(params::ArmonParameters)
grid_dimensions(block_size::NTuple{D, Int}, domain_size::NTuple{D, Int}, ghost::Int) where {D}

Returns the dimensions of the grid in the form (grid_size, static_sized_grid, remainder_block_size) from the block_size (the size of blocks in the static_sized_grid), the domain_size (number of real cells) and the number of ghost cells, common to all blocks.

grid_size is the static_sized_grid including the edge blocks. Edge blocks along the axis d have a size of remainder_block_size[d] along d only if they are at the edge of the grid along d, or block_size otherwise.

block_size includes the ghost cells in its dimensions, which must all be greater than 2*ghost.

If prod(block_size) == 0, then block_size is ignored and the grid is made of only a single block of size domain_size.

Note

Blocks must not be smaller than ghost, therefore edge blocks might be made bigger than block_size.

Note

In case domain_size is smaller than block_size .- 2*ghost along any axis, the grid will contain only edge blocks.
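As an illustration, the relation between block size, ghost cells and grid size can be sketched as follows (the numbers in the comment are assumptions used for illustration, not verified output):

# With block_size=(32, 32) and ghost=4, a static block holds 32 - 2*4 = 24
# real cells per axis, so a (100, 100) domain needs cld(100, 24) = 5 blocks
# per axis, the last one being an edge block.
grid_size, static_sized_grid, remainder_block_size =
    Armon.grid_dimensions((32, 32), (100, 100), 4)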

source
Armon.LocalTaskBlockType
LocalTaskBlock{D, H, Size, SolverState} <: TaskBlock{V}

Container of size Size, holding variables of type D on the device and H on the host. Part of a BlockGrid.

The block stores its own solver state, allowing it to run all solver steps independently of all other blocks, apart from steps requiring synchronization.

source
Armon.device_to_host!Function
device_to_host!(blk::LocalTaskBlock)

Copies the device data of blk to the host data. A no-op if the device is the host.

source
device_to_host!(grid::BlockGrid)

Copies device data of all blocks to the host data. A no-op if the device is the host.

source
Armon.host_to_device!Function
host_to_device!(blk::LocalTaskBlock)

Copies the host data of blk to its device data. A no-op if the device is the host.

source
host_to_device!(grid::BlockGrid)

Copies host data of all blocks to the device data. A no-op if the device is the host.

source
Armon.buffers_on_deviceFunction
buffers_on_device(::BlockGrid)
buffers_on_device(::Type{<:BlockGrid})

true if the communication buffers are stored on the device, allowing direct transfers without passing through the host (GPU-aware communication).

source
Armon.device_is_hostFunction
device_is_host(::BlockGrid{T, D, H})
device_is_host(::Type{<:BlockGrid{T, D, H}})

true if the device is the host, i.e. device blocks and host blocks are the same (and D == H).

source
Armon.block_idxFunction
block_idx(grid::BlockGrid, idx::CartesianIndex)

Linear index in grid.blocks of the block at idx in the statically-sized grid.

source
Armon.edge_block_idxFunction
edge_block_idx(grid::BlockGrid, idx::CartesianIndex)

Linear index in grid.edge_blocks of the block at idx, along the (dynamically-sized) edges of the grid.

source
Armon.remote_block_idxFunction
remote_block_idx(grid::BlockGrid, idx::CartesianIndex)

Linear index in grid.remote_blocks of the remote block at idx in the grid.

source
Armon.EdgeBlockRegionsType
EdgeBlockRegions(grid::BlockGrid; real_sizes=false)
EdgeBlockRegions(
    grid_size::NTuple{D, Int}, static_sized_grid::NTuple{D, Int},
    block_size::NTuple{D, Int}, remainder_block_size::NTuple{D, Int}; ghosts=0
)

Iterator over edge blocks, their positions and their size in each edge region of the grid.

ghosts only affects the size of edge blocks: B .- 2*ghosts. If real_sizes == true, then ghosts is ghosts(grid), therefore the real block size of edge blocks is returned.

There is a maximum of 2^D-1 edge regions in a grid. When grid_size[i] == static_sized_grid[i], there cannot be an edge region along axis i, hence there are exactly 2^sum(grid_size .!= static_sized_grid)-1 regions.

julia> collect(Armon.EdgeBlockRegions((5, 5), (4, 4), (32, 32), (16, 16)))
3-element Vector{Tuple{Int64, CartesianIndices{2, Tuple{UnitRange{Int64}, UnitRange{Int64}}}, Tuple{Int64, Int64}}}:
 (4, CartesianIndices((5:5, 1:4)), (16, 32))
 (4, CartesianIndices((1:4, 5:5)), (32, 16))
 (1, CartesianIndices((5:5, 5:5)), (16, 16))

julia> collect(Armon.EdgeBlockRegions((5, 5, 5), (4, 5, 4), (32, 32, 32), (16, 16, 16)))
3-element Vector{Tuple{Int64, CartesianIndices{3, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}}, Tuple{Int64, Int64, Int64}}}:
 (20, CartesianIndices((5:5, 1:5, 1:4)), (16, 32, 32))
 (20, CartesianIndices((1:4, 1:5, 5:5)), (32, 32, 16))
 (5, CartesianIndices((5:5, 1:5, 5:5)), (16, 32, 16))
source
Armon.RemoteBlockRegionsType
RemoteBlockRegions(grid::BlockGrid)
RemoteBlockRegions(grid_size::NTuple{D, Int})

Iterator of all remote block positions in each region (the faces of the grid). There are always 2*D regions.

julia> collect(Armon.RemoteBlockRegions((5, 3)))
4-element Vector{Tuple{Armon.Side.T, CartesianIndices{2, Tuple{UnitRange{Int64}, UnitRange{Int64}}}}}:
 (Armon.Side.Left, CartesianIndices((0:0, 1:3)))
 (Armon.Side.Right, CartesianIndices((6:6, 1:3)))
 (Armon.Side.Bottom, CartesianIndices((1:5, 0:0)))
 (Armon.Side.Top, CartesianIndices((1:5, 4:4)))
source
Armon.@iter_blocksMacro
@iter_blocks for blk in grid
    # body...
end

Applies the body of the for-loop to all blocks of the grid. Threads iterate over the blocks they are assigned to via grid.threads_workload.
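For example, copying every block of the grid back to the host (a sketch reusing device_to_host!, documented above):

@iter_blocks for blk in grid
    device_to_host!(blk)  # each thread handles the blocks it was assigned
end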

source
Armon.block_pos_containing_cellFunction
block_pos_containing_cell(grid::BlockGrid, pos::Union{CartesianIndex, NTuple})

Returns two CartesianIndexes: the first is the position of the block containing the cell at pos, the second is the position of the cell in that block.

source
Armon.block_originFunction
block_origin(grid::BlockGrid, pos, include_ghosts=false)

A Tuple of the position of the cell at the bottom left corner of the LocalTaskBlock at pos in the grid.

If include_ghosts == true, then the cell position includes all ghost cells of the grid.

pos can be any of Integer, NTuple{N, Integer} or CartesianIndex.

source
Armon.block_size_atFunction
block_size_at(grid::BlockGrid, idx)
block_size_at(idx, grid_size, static_sized_grid, block_size, remainder_block_size, ghosts)

Theoretical size of the block at idx in grid. If ghosts == 0, ghost cells will be included in the size.

source
Armon.move_pagesMethod
move_pages(grid::BlockGrid)

Move the pages of all blocks of the grid, including remote blocks, to the NUMA node of the thread which is in charge of working on that block.

source
Armon.lock_pagesMethod
lock_pages(grid::BlockGrid)

Locks the pages of all blocks of the grid, including remote blocks.

source

Block size and iteration

Armon.StaticBSizeType
StaticBSize{S, Ghost} <: BlockSize

A BlockSize of size S and Ghost cells. S is embedded in the type; this reduces the amount of memory used by the parameters of every kernel, as well as allowing the compiler to make some minor optimisations in indexing expressions.

Ghost cells are included in S: there are S .- 2*Ghost real cells.

source
Armon.DynamicBSizeType
DynamicBSize{Ghost} <: BlockSize

Similar to StaticBSize, but for blocks with a less-than-ideal size: block size is therefore not stored in the type. This results in less compilation when testing for different domain sizes with a constant StaticBSize.

The number of Ghost cells is still embedded in the type, as it can simplify some indexing expressions and for coherency.

source
Armon.border_domainFunction
border_domain(bsize::BlockSize, side::Side.T; single_strip=true)

DomainRange of the real cells along side.

If single_strip == true, it includes only one "strip" of cells, that is length(border_domain(bsize, side)) == size_along(bsize, side). Otherwise, there are ghosts(bsize) strips of cells: all real cells which would be exchanged with another block along side.

source
Armon.ghost_domainFunction
ghost_domain(bsize::BlockSize, side::Side.T; single_strip=true)

DomainRange of all ghost cells of side, excluding the corners of the block.

If single_strip == true, then the domain is only 1 cell thick, positioned at the furthest ghost cell from the real cells. Otherwise, there are ghosts(bsize) strips of cells: all ghost cells which would be exchanged with another block along side.
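Both domains are the two sides of the same halo exchange: the real cells of one block's border_domain are copied into the matching ghost_domain of its neighbour. A sketch, given some block size bsize:

# cells to send to the left neighbour, and where to receive cells from it
# (on the neighbouring block, the matching side would be Side.Right)
send_cells = Armon.border_domain(bsize, Armon.Side.Left; single_strip=false)
recv_cells = Armon.ghost_domain(bsize, Armon.Side.Left; single_strip=false)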

source
Armon.block_domain_rangeFunction
block_domain_range(bsize::BlockSize, corners)
block_domain_range(bsize::BlockSize, bottom_left::Tuple, top_right::Tuple)

A DomainRange built from offsets from the corners of bsize.

block_domain_range(bsize, (0, 0), (0, 0)) is the domain of all real cells in the block. block_domain_range(bsize, (-g, -g), (g, g)) would be the domain of all cells (real cells + g ghost cells) in the block.

source
Armon.positionFunction
position(bsize::BlockSize, i)

N-dim position of the i-th cell in the block.

If 1 ≤ position(bsize, i)[d] ≤ block_size(bsize)[d] then the cell is not a ghost cell along the d dimension. See is_ghost.

source
Armon.lin_positionFunction
lin_position(bsize::BlockSize, I)

From the NTuple (e.g. returned from position), return the linear index in the block. lin_position(bsize, position(bsize, i)) == i.
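The stated roundtrip can be checked directly (a sketch, given any block's bsize and a valid linear index i):

pos = Armon.position(bsize, i)   # N-dim position of the i-th cell
@assert Armon.lin_position(bsize, pos) == i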

source
Armon.in_gridFunction
in_grid(idx, size)
in_grid(start, idx, size)

true if each axis of idx is between start and size. start defaults to 1.

Argument types can be any mix of Integer, Tuple or CartesianIndex.

julia> in_grid(1, (1, 2), 2)  # same as `in_grid((1, 1), (1, 2), (2, 2))`
true

julia> in_grid((3, 1), (3, 2))
true

julia> in_grid((1, 3), (3, 2))
false
source
in_grid(idx, grid, axis::Axis.T)
in_grid(start, idx, grid, axis::Axis.T)

Same as in_grid(start, idx, grid), but only checks along axis.

source
Armon.is_ghostFunction
is_ghost(bsize::BlockSize, i, o=0)

true if the i-th cell of the block is a ghost cell, false otherwise.

o would be a "ring" index: o == 1 excludes the first ring of ghost cells, etc.

source
Armon.BlockRowIteratorType
BlockRowIterator(grid::BlockGrid; kwargs...)
BlockRowIterator(grid::BlockGrid, blk::LocalTaskBlock; kwargs...)
BlockRowIterator(grid::BlockGrid, blk₁_pos, blk₂_pos; kwargs...)
BlockRowIterator(grid::BlockGrid, sub_domain::NTuple{2, CartesianIndex}; kwargs...)
BlockRowIterator(grid::BlockGrid, row_iter::CartesianIndices; global_ghosts=false, all_ghosts=false)

Iterate the rows of all blocks of the grid, row by row (and not block by block). This allows iterating over the cells of the grid as if it were a single block.

Elements are tuples of (block, global_row_idx, row_range). row_range is the range of cells in block for the current row.

Giving blk will return an iterator on the rows of the block.

Giving blk₁_pos and blk₂_pos will return an iterator over all rows between those blocks.

Giving sub_domain will return an iterator including only the cells contained in sub_domain. sub_domain is a cuboid defined by the position of the cells in the whole domain of grid, using its lower and upper corners.

row_iter is an iterator over global row indices.

If global_ghosts == true, then the ghost cells at the border of the global domain are also returned. If all_ghosts == true, then the ghost cells at the border of all blocks are also returned.

julia> params = ArmonParameters(; N=(24, 8), nghost=4, block_size=(20, 12), use_MPI=false);

julia> grid = BlockGrid(params);

julia> for (blk, row_idx, row_range) in Armon.BlockRowIterator(grid)
           println(Tuple(blk.pos), " - ", row_idx, " - ", row_range)
       end
(1, 1) - (1, 1) - 85:96
(2, 1) - (2, 1) - 85:96
(1, 1) - (1, 2) - 105:116
(2, 1) - (2, 2) - 105:116
(1, 1) - (1, 3) - 125:136
(2, 1) - (2, 3) - 125:136
(1, 1) - (1, 4) - 145:156
(2, 1) - (2, 4) - 145:156
(1, 2) - (1, 5) - 85:96
(2, 2) - (2, 5) - 85:96
(1, 2) - (1, 6) - 105:116
(2, 2) - (2, 6) - 105:116
(1, 2) - (1, 7) - 125:136
(2, 2) - (2, 7) - 125:136
(1, 2) - (1, 8) - 145:156
(2, 2) - (2, 8) - 145:156
source
Armon.DomainRangeType
DomainRange

Two dimensional range to index a 2D array stored with contiguous rows. Not equivalent to CartesianIndices as it handles StepRanges properly.

source

Block states

Armon.SolverStateType
SolverState

Object containing all non-constant parameters needed to run the solver, as well as type-parameters needed to avoid runtime dispatch.

This object is local to a block (or set of blocks): multiple blocks could be at different steps of the solver at once.

source
Armon.first_stateFunction
first_state(grid::BlockGrid)

A SolverState which can be used as a global state when outside of a solver cycle. It belongs to the first device block.

source
Armon.block_state_machineFunction
block_state_machine(params::ArmonParameters, blk::LocalTaskBlock)

Advances the SolverStep state of blk, applying each step of the solver to it. This continues until the current cycle is done, or until the block must wait for another block to perform the ghost cell exchange (block_ghost_exchange) or to compute its time step (next_time_step).

Returns the new step of the block. If SolverStep.NewCycle is returned, blk has reached the end of the current cycle and will not progress any further until all other blocks have reached the same point.

source

Time step reduction

Armon.next_time_stepFunction
next_time_step(params::ArmonParameters, state::SolverState, blk::LocalTaskBlock; already_contributed=false)
next_time_step(params::ArmonParameters, state::SolverState, grid::BlockGrid)

Compute the time step of the next cycle. This is done at the start of the current cycle.

Since the current cycle does not rely on an up-to-date time step, the time step reduction is done fully asynchronously, including the global MPI reduction. The accuracy cost of this optimisation is minimal, as the CFL condition prevents the time step from being too large. Additionally, the time step is prevented from increasing by more than 5% of the previous one.

For the first cycle, if no initial time step is given, the time step computed for the next cycle is reused for the initial cycle.

If blk is given, its contribution is only added to the state.global_dt (the GlobalTimeStep). Passing the whole block grid will block until the new time step is computed.
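The growth cap mentioned above can be pictured as follows (a conceptual sketch, not the actual implementation):

# new_dt is the CFL-limited time step reduced over all blocks (and MPI ranks);
# it may not exceed the previous time step by more than 5%.
next_dt = min(new_dt, 1.05 * previous_dt)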

source
Armon.GlobalTimeStepType
GlobalTimeStep

Holds all information about the current time and time step for the current solver cycle. This struct is global and shared among all blocks.

When reaching next_time_step, blocks will contribute to the calculation of the next time step. The last block doing so will start the MPI reduction. The first block reaching the start of the next cycle will wait until this reduction is completed, then update the GlobalTimeStep.

source

Block exchanges

Armon.block_ghost_exchangeFunction
block_ghost_exchange(params::ArmonParameters, state::SolverState, blk::LocalTaskBlock)

Handles communications between blk and its neighbours, along the current state.axis. If blk is on one of the edges of the grid, a remote exchange is performed with the neighbouring RemoteTaskBlock, or the global boundary conditions are applied.

Returns true if exchanges were not completed, and the block is waiting on another to be ready for the exchange.

source
Armon.BlockInterfaceType
BlockInterface

Represents the interface between two neighbouring TaskBlocks. It synchronizes the state of the blocks to make sure the halo exchange happens when both blocks are ready, and that the operation is done by only one of the blocks.

source
Armon.exchange_done!Function
exchange_done!(blk, side)

Mark the exchange of the interface of blk along side as done. The other block needs to acknowledge this before the exchange state can be reset.

source
Armon.side_is_done!Function
side_is_done!(blk::LocalTaskBlock, side::Side.T, done::Bool=true)

Set the interface of blk along side as done. This is a non-atomic operation, meant only to avoid repeating the exchange logic multiple times.

source
Armon.BlockExchangeStateModule
BlockExchangeState

State of an interface between two blocks, controlling if a cell exchange can happen or did happen.

source

Block distribution

Armon.thread_workload_distributionFunction
thread_workload_distribution(params::ArmonParameters; threads=nothing)
thread_workload_distribution(
    threads::Int, grid_size::Tuple;
    scotch=true, simple=false, perimeter_first=false, kwargs...
)

Distribute each block in grid_size among the threads, as evenly as possible.

With simple == true, blocks are distributed with simple_workload_distribution.

With scotch == true, the Scotch graph partitioning solver is used to better split the grid. kwargs are passed to scotch_grid_partition.

If perimeter_first == true, the resulting distribution will have blocks sorted in a way that places neighbours of other threads' blocks first in the list. By doing so, communications between threads may be overlapped more frequently.

source
Armon.simple_workload_distributionFunction
simple_workload_distribution(threads, grid_size)

Basic distribution of B = prod(grid_size) blocks into groups of B ÷ threads, with the remaining blocks distributed evenly.

source
Armon.scotch_grid_partitionFunction
scotch_grid_partition(
    threads, grid_size;
    strategy=:default, workload_tolerance=0, repart=false, retries=10, weighted=false,
    static_sized_grid=nothing, block_size=nothing, remainder_block_size=nothing, ghosts=0
)

Split grid_size to the threads.

strategy is passed to Scotch.strat_flags.

workload_tolerance is the acceptable workload unevenness among the partitions. Giving a few blocks of margin (e.g. at least 1/prod(grid_size)) is preferable.

weighted == true will distribute blocks while taking into account the number of real cells they have. In this case all parameters of the grid must be given: static_sized_grid, block_size, remainder_block_size and ghosts (obtained with e.g. grid_dimensions).

The partitioning is random, hence results may vary. To counterbalance this, giving retries > 0 will repeat the partitioning retries times and keep the best one.

source
Armon.block_grid_from_workloadFunction
block_grid_from_workload(grid_size, threads_workload)

Convenience function to convert a threads_workload (result of thread_workload_distribution) into an Array of grid_size, with each element assigned to the tid given by the distribution.

This makes it easy to visualize the efficiency of the distribution.
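For example, to inspect how 8 threads would share a 16×16 block grid (a sketch combining the two functions above; the sizes are illustrative):

workload = Armon.thread_workload_distribution(8, (16, 16); scotch=true)
grid_view = Armon.block_grid_from_workload((16, 16), workload)  # Array of thread IDs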

source

Device and backends

Armon.CPU_HPType
CPU_HP

Device tag for the high-performance CPU backend using multithreading (with Polyester.jl) and vectorisation.

source
Armon.create_deviceFunction
create_device(::Val{:device_name})

Create a device object from its name.

Default devices:

  • :CPU: the CPU backend of KernelAbstractions.jl
  • :CPU_HP: Polyester.jl multithreading

Extensions:

  • :Kokkos: the default Kokkos.jl device
  • :CUDA: the CUDA.jl backend of KernelAbstractions.jl
  • :ROCM: the AMDGPU.jl backend of KernelAbstractions.jl
source
Armon.init_backendFunction
init_backend(params::ArmonParameters, ::Dev; options...)

Initialize the backend corresponding to the Dev device returned by create_device using options. Set the params.backend_options field.

It must return options, with the backend-specific options removed.

source
Armon.memory_infoFunction
memory_info(params)

The total and free memory the current process can store on the params.device.

source
Armon.memory_requiredFunction
memory_required(params::ArmonParameters)

(device_memory, host_memory) required for params.

MPI buffer sizes are included in the appropriate field depending on params.gpu_aware. params.use_MPI and params.neighbours are taken into account.

If device_is_host, then device_memory only includes memory required by data arrays and MPI buffers.

source
memory_required(N::NTuple{2, Int}, block_size::NTuple{2, Int}, ghost::Int, data_type)
memory_required(N::NTuple{2, Int}, block_size::NTuple{2, Int}, ghost::Int,
    device_array_type, host_array_type, buffer_array_type[, solver_state_type])

Compute the number of bytes needed to allocate all blocks. If only data_type is given, then device_array_type, host_array_type and buffer_array_type default to Vector{T}. solver_state_type defaults to SolverState{T, #= default schemes and test =#}.

In order of returned values:

  1. Amount of bytes needed for all arrays on the device. This amount is also required on the host when the host and device are not the same.
  2. Amount of bytes needed for all MPI buffers, if the sub-domain has neighbours on all of its sides. If params.gpu_aware, then this memory is allocated on the device.
  3. Amount of bytes needed on the host memory for all block objects, excluding array data and buffers. This memory is always allocated on the host.
res = memory_required((1000, 1000), (64, 64), 4, CuArray{Float64}, Vector{Float64}, Vector{Float64})
device_memory = res[1]
host_memory = res[3] + (device_is_host ? device_memory : res[1])  # arrays are needed on the host in both cases
if params.gpu_aware
    device_memory += res[2]  # MPI buffers are allocated on the device
else
    host_memory += res[2]  # MPI buffers are allocated on the host
end
source

Kernels

Armon.@generic_kernelMacro
@generic_kernel(function definition)

Transforms a single kernel function into six different functions:

  • four which run on the CPU, with or without Polyester.jl multithreading, each with or without SIMD
  • one which uses KernelAbstractions.jl to make a GPU-compatible kernel
  • a main function, which takes care of calling the right variant depending on whether the GPU is used or not.

To do this, two things are done:

  • All calls to @index_1D_lin(), @index_2D_lin() and @iter_idx() are replaced by their equivalent on each platform: a simple loop index on CPU, and a call to KA.jl's @index on GPU.
  • Arguments to each function are edited

A kernel function must call one of @index_1D_lin() or @index_2D_lin() at least once, since this will determine which type of indexing to use as well as which parameters to add.

The indexing macro @iter_idx gives the linear index to the current iteration (on CPU) or global thread (on GPU).

This means that in order to call the new main function, one needs to take into account which indexing macro was used:

  • In all cases, params::ArmonParameters is the first argument
  • Then, depending on the indexing macro used:
    • @index_1D_lin() : loop_range::OrdinalRange{Int}
    • @index_2D_lin() : main_range::OrdinalRange{Int}, inner_range::OrdinalRange{Int}
  • An optional keyword argument, no_threading, allows overriding the use_threading parameter, which can be useful in asynchronous contexts. It defaults to false.

Using KA.jl's @Const to annotate arguments is supported, but they will be present only in the GPU kernel definition.

Further customisation of the kernel and main function can be obtained using @kernel_options and @kernel_init.

@generic_kernel function f_kernel(A)
    i = @index_1D_lin()
    A[i] += 1
end

params.use_gpu = false
f(params, 1:10)  # CPU call

params.use_gpu = true
f(params, 1:10)  # GPU call
source
Armon.@kernel_initMacro
@kernel_init(expr)

Allows initializing some internal variables of the kernel before the loop. The given expression must NOT depend on any index: indexing macros (@index_1D_lin(), etc.) must not be used in it.

All parameters of the kernel are available during the execution of the init expression. On GPU, this expression will be executed by each thread.

This is a workaround for a limitation of Polyester.jl which prevents declaring typed variables inside the loop body.

@generic_kernel function simple_kernel(a, b)
    @kernel_init begin
        c::Float64 = sin(0.123)
    end
    i = @index_1D_lin()
    a[i] += b[i] * c
end
source
Armon.@kernel_optionsMacro
@kernel_options(options...)

Must be used (once, and explicitly) in the definition of a @generic_kernel function.

Gives options for @generic_kernel to adjust the resulting functions.

The possible options are:

  • debug: Prints the generated functions to stdout at compile time.
@generic_kernel function add_kernel(a, b)
    @kernel_options(debug)
    i = @index_1D_lin()
    a[i] += b[i]
end
source
Armon.@index_1D_linMacro
@index_1D_lin()

Indexing macro to use in a @generic_kernel function. Returns a linear index to access the 1D arrays used by the kernel.

Cannot be used alongside @index_2D_lin().

source
Armon.@index_2D_linMacro
@index_2D_lin()

Indexing macro to use in a @generic_kernel function. Returns a linear index to access the 2D arrays used by the kernel.

Cannot be used alongside @index_1D_lin().

source
Armon.@iter_idxMacro
@iter_idx()

Indexing macro to use in a @generic_kernel function. Returns the linear index of the current iteration (on CPU) or of the global thread (on GPU).

Equivalent to KernelAbstractions.jl's @index(Global, Linear).

source
Armon.@simd_loopMacro
@simd_loop(expr)

Allows enabling/disabling SIMD optimisations for a loop. When SIMD is enabled, it is assumed that there are no dependencies between iterations of the loop.

    @simd_loop for i = 1:n
        y[i] = x[i] * (x[i-1])
    end
source
Armon.@simd_threaded_iterMacro
@simd_threaded_iter(range, expr)

Same as @simd_threaded_loop(expr), but instead of slicing the range of the for loop in expr, we slice the range given as the first parameter and distribute the slices evenly to the threads.

The inner @simd loop assumes there are no dependencies between iterations.

    @simd_threaded_iter 4:2:100 for i in 1:100
        y[i] = log10(x[i]) + x[i]
    end
    # is equivalent to (without threading and SIMD)
    for j in 4:2:100
        for i in (1:100) .+ (j - 1)
            y[i] = log10(x[i]) + x[i]
        end
    end
source
Armon.@simd_threaded_loopMacro
@simd_threaded_loop(expr)

Allows enabling/disabling multithreading and/or SIMD for the loop depending on the parameters. When using SIMD, @fastmath and @inbounds are used.

In order to use SIMD and multithreading at the same time, the range of the loop is split in even batches. Each batch has a size of params.simd_batch iterations, meaning that the inner @simd loop has a fixed number of iterations, while the outer threaded loop will have N ÷ params.simd_batch iterations.

The loop range is assumed to be increasing: 1:2:100 is correct, 100:-2:1 is not. The inner @simd loop assumes there are no dependencies between iterations.

    @simd_threaded_loop for i = 1:n
        y[i] = log10(x[i]) + x[i]
    end
source
Armon.@threadedMacro
@threaded(expr)

Allows enabling/disabling multithreading of the loop depending on the parameters.

The default condition is params.use_threading && !params.use_cache_blocking. By passing :outside_kernel before expr, the condition becomes simply params.use_threading.

    @threaded for i = 1:n
        y[i] = log10(x[i]) + x[i]
    end
source

Logging

Armon.analyse_log_statsFunction
analyse_log_stats(f, grid_log::BlockGridLog)

Call f for each block of the grid_log, passing the block's position (a CartesianIndex), a Vector of BlockLogEvents, and the size of the block.

source
analyse_log_stats(grid_log::BlockGridLog)

Crunch all data of grid_log into tangible metrics contained in a BlockGridLogStats.

source
Armon.BLOCK_LOG_THREAD_LOCAL_STORAGEConstant
BLOCK_LOG_THREAD_LOCAL_STORAGE::Dict{UInt16, Int32}

Incremented by 1 every time a BlockLogEvent is created in a thread, i.e. each time a block has solver kernels applied to it through block_state_machine.

Since only differences between values are of interest, there is no need to reset it.

source

Utility

Armon.SolverExceptionType
SolverException

Thrown when the solver encounters an invalid state.

The category field can be used to distinguish between error types without inspecting the error message:

  • :config: a problem in the solver configuration, usually thrown when constructing ArmonParameters
  • :cpp: a C++ exception thrown by the C++ Kokkos backend
  • :time: an invalid time step

ErrorExceptions thrown by the solver represent internal errors.

source
Armon.@sectionMacro
@section(name, expr)
@section(name, options, expr)

Introduce a profiling section around expr. Sections can be nested. Sections do not introduce a new scope.

Placed before a for-loop, a new section will be started for each iteration. name can interpolate using loop variables (like Test.@testset).

It is assumed that a params variable of ArmonParameters is present in the scope of the @section.

options is of the form key=value:

  • async (default: false): if async=false (and !params.time_async), a barrier (wait(params)) is added at the end of the section.
params = ArmonParameters(#= ... =#)

@section "Iteration $i" for i in 1:10
    j = @section "Foo" begin
        foo(i)
    end

    @section "Some calculation" begin
        k = bar(i, j)
    end

    @sync begin
        @async begin
            @section "Task 1" async=true my_task_1(i, j, k)
        end

        @async begin
            @section "Task 2" async=true my_task_2(i, j, k)
        end
    end
end
source

NUMA utilities

Armon.array_pagesFunction
array_pages(A::Ptr, A_length)
array_pages(A::DenseArray)

Iterator over the memory pages (aligned to PAGE_SIZE) used by the array A.

source
Armon.touch_pagesFunction
touch_pages(pages)

Iterates over each of pages and "touches" them by loading a value then storing it back.

The (only) use case is to manually trigger the first-touch policy and/or force the kernel to physically allocate all pages.
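A sketch combining it with array_pages from above to physically back every page of an array:

A = Vector{Float64}(undef, 2^20)
Armon.touch_pages(Armon.array_pages(A))  # load + store one value per page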

source
Armon.move_pagesMethod
move_pages(pages::Vector{Ptr}, target_node; retries=4)
move_pages(pages::OrdinalRange{Ptr, Int}, target_node)
move_pages(A::DenseArray, target_node)

Move all pages (of either the array A or from the pages iterable) to the target_node (1-indexed).

The move status of all pages is checked afterwards: an error is thrown unless all pages are on target_node at the end of the move.

In case some pages cannot be moved because of an EBUSY status, the move is retried up to retries times until it succeeds, with a short sleep between attempts.

source
Armon.lock_pagesMethod
lock_pages(ptr, len)
lock_pages(pages::OrdinalRange{Ptr, Int})
lock_pages(A::DenseArray)

Lock the pages ranging from ptr to ptr+len in RAM, preventing the kernel from moving them around or putting them in swap memory.

source
Armon.unlock_pagesFunction
unlock_pages(ptr, len)
unlock_pages(pages::OrdinalRange{Ptr, Int})
unlock_pages(A::DenseArray)

Unlock the pages, opposite of lock_pages.

source