2.12. Parallelism¶

Some comments on parallelism and reporting of time taken for parallel overheads.

A hierarchy of parallelism is employed in the code. At the most coarse grained level, domain decomposition is implemented via the message pessing interface (MPI). Within each subdomain, computational kernels are executed in a threaded model which may involve either OpenMP for CPU architectures, or an appropriate GPU model. This means the user may have a number of choices in how to deploy resources for problems of any given size.

2.12.1. Distributed memory parallelism¶

2.12.1.1. Halo swap mechanisms¶

Under usual circumstances, it should not be necessary to select the type of halo swap used for the various lattice-based fields in the calculation. The code will select the relevant mechanism. In particular, relevant CPU/GPU halo swaps will be applied.

2.12.1.1.1. Lattice distribution halo swaps¶

The halo swap for the lattice Boltzmann distributions can be either a “full” halo wsap, in which all elements of the distribution are communicated to neighbouring sub-domains, or a “reduced” version in which only elements propagating in a ertain direction are communicated in that direction (with consequent saving in message size).

lb_halo_scheme          lb_halo_full              # full
lb_halo_scheme          lb_halo_reduced           # reduced

The default option is “full”. For fluid only problems it may give a performance inprovement to use “reduced”. However, the reduced halo swap should not be used if solid objects (boundaries or colloids) are present.

2.12.1.1.2. Halo swaps for order parameter fields¶

Order parameter fields also have halo swaps (the extent of which is worked out automatically). Information on the halo swap can be obtained to standard output via:

field_halo_verbose      yes                     # [no] report information

There is no concept of a reduced halo swap for these data; the full data are always exchanged.

2.12.2. Threaded parallelism via OpenMP¶

The recommended way to run the OpenMP version is to use one MPI task per NUMA region. Most modern machines have two sockets per node, and each socket may be one or more NUMA regions. The number of threads should be set to the number of physical cores per NUMA region. Many modern processors report the number of “CPUs” available, which is often two times the number of physical cores owing to the ability to run two threads in hardware per physical core.

Typically, for 16 cores per NUMA regions, one might set

export OMP_NUM_THREADS=16
export OMP_PLACES=cores

Local details may vary on how exactly to run in this hybrid MPI/OpenMP mode.

It is possible to specify “first touch” policy for a number of data structures in the input file:

lb_data_use_first_touch      yes    # lattice Boltzmann data [no]
field_data_use_first_touch   yes    # field data: default is [no]
hydro_data_use_first_touch   yes    # hydrodynamic quantities [no]

The field option only refers to tensor order parameter at the moment. All other fields have the default setting.

2.12.3. Threaded parallelism for GPU¶

The usual mode of opertation envisaged for GPU-based architectures is one MPI task per GPU. The code will determine internally how to aportion individual GPUs to each MPI task.

At the moment the code does no checking that the assumption of one MPI task per GPU device is valid, so some care may be required if running multi-GPU jobs.

2.12.4. Timing¶

2.12.4.1. Halo swap imbalance time¶

A facility is provided to report the imbalance time observed at the point of the lattice distribition halo swap. This introduces an explicit synchronisation immediately before the halo swap to allow the separation of load imbalance and actually communication overhead.

lb_halo_scheme_report_imbalance    yes      # Default no

This appears as part of the breakdown of the lattice halo swap time, e.g.,

Lattice halos:      0.001      0.013     84.246   0.000842 (100000 calls)
 -> imbalance:      0.000      0.013     14.277   0.000143 (100000 calls)

The introduction of the synchronisation itself introduces an overhead, so this option should not be used for production runs.