Some comments on parallelism and reporting of time taken for parallel overheads.
A hierarchy of parallelism is employed in the code. At the coarsest-grained level, domain decomposition is implemented via the Message Passing Interface (MPI). Within each subdomain, computational kernels are executed in a threaded model which may involve either OpenMP for CPU architectures, or an appropriate GPU model. This means the user may have a number of choices in how to deploy resources for problems of any given size.
2.12.1. Distributed memory parallelism
2.12.1.1. Halo swap mechanisms
Under usual circumstances, it should not be necessary to select the type of halo swap used for the various lattice-based fields in the calculation. The code will select the relevant mechanism. In particular, relevant CPU/GPU halo swaps will be applied.
If explicit control is required, one can use some of the following flags.
2.12.1.1.1. Lattice distribution halo swaps
There are a number of different halo swap mechanisms for the lattice Boltzmann distributions. One or other may be selected explicitly via
lb_halo_scheme lb_halo_openmp_full # host (full)
lb_halo_scheme lb_halo_openmp_reduced # host (reduced)
lb_halo_scheme lb_halo_target # target (full) [default]
The two host options (which, despite their names, will work with or without OpenMP) are full or reduced. The reduced halo swap may be quicker as it sends only distributions propagating in the relevant direction. However, the reduced halo swap should not be used if solid objects (boundaries or colloids) are present. It is intended for fluid-only computations. The target halo swap handles the extra complexity in the GPU case: this is always a ‘full’ halo swap at the moment.
There is an additional restriction that if two distributions are needed (two-distribution binary fluid case), the target halo swap must be used.
2.12.1.1.2. Halo swaps for order parameter fields
Order parameter fields also have halo swaps (the extent of which is worked out automatically). Information on the halo swap can be obtained to standard output via:
field_halo_verbose yes # [no] report information
There is no concept of a reduced halo swap for these data; the full data are always exchanged.
2.12.2. Threaded parallelism via OpenMP
The recommended way to run the OpenMP version is to use one MPI task per NUMA region. Most modern machines have two sockets per node, and each socket may comprise one or more NUMA regions. The number of threads should be set to the number of physical cores per NUMA region. Many modern processors report the number of “CPUs” available, which is often twice the number of physical cores owing to the ability to run two threads in hardware per physical core.
Typically, for 16 cores per NUMA region, one might set the number of OpenMP threads to 16, e.g., via the standard OMP_NUM_THREADS environment variable.
Local details may vary on how exactly to run in this hybrid MPI/OpenMP mode.
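As a minimal sketch of the recommendation above, assuming a POSIX shell and a hypothetical node with 64 reported “CPUs”, two hardware threads per core, and two NUMA regions, the thread count could be derived and exported as follows. The node figures are illustrative assumptions only; OMP_NUM_THREADS, OMP_PLACES and OMP_PROC_BIND are standard OpenMP environment variables, but the placement settings shown are a suggestion rather than a requirement of the code.

```shell
#!/bin/sh
# Hypothetical node layout (assumed values; check the real machine,
# e.g. with `lscpu`, before relying on them).
logical_cpus=64        # "CPUs" reported by the operating system
threads_per_core=2     # hardware threads per physical core
numa_regions=2         # NUMA regions per node

# Physical cores per node, then per NUMA region.
physical_cores=$((logical_cpus / threads_per_core))
cores_per_numa=$((physical_cores / numa_regions))

# One MPI task per NUMA region, one OpenMP thread per physical core.
export OMP_NUM_THREADS="$cores_per_numa"
export OMP_PLACES=cores        # assumed choice: pin threads to physical cores
export OMP_PROC_BIND=close     # assumed choice: keep threads near their parent task

echo "MPI tasks per node: $numa_regions"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

The MPI launch command itself (and any options needed to bind each task to a NUMA region) is system-specific and is deliberately left out.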
It is possible to specify a “first touch” policy for a number of data structures in the input file:
lb_data_use_first_touch yes # lattice Boltzmann data [no]
field_data_use_first_touch yes # field data: default is [no]
hydro_data_use_first_touch yes # hydrodynamic quantities [no]
The field option refers only to the tensor order parameter at the moment. All other fields have the default setting.
2.12.3. Threaded parallelism for GPU
The usual mode of operation envisaged for GPU-based architectures is one MPI task per GPU. The code will determine internally how to apportion individual GPUs to each MPI task.
At the moment the code does no checking that the assumption of one MPI task per GPU device is valid, so some care may be required if running multi-GPU jobs.
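Because the check is not made by the code, it can be made in the job script instead. The following is a minimal sketch, assuming a POSIX shell; the function name check_one_task_per_gpu is hypothetical and is not part of the code, and the task and GPU counts are illustrative values.

```shell
#!/bin/sh
# Hypothetical helper (not part of the code): succeed only if the number
# of MPI tasks on a node equals the number of GPUs visible on that node.
check_one_task_per_gpu() {
    tasks_per_node=$1
    gpus_per_node=$2
    if [ "$tasks_per_node" -eq "$gpus_per_node" ]; then
        echo "ok: $tasks_per_node MPI tasks for $gpus_per_node GPUs"
    else
        echo "mismatch: $tasks_per_node MPI tasks for $gpus_per_node GPUs" >&2
        return 1
    fi
}

# Illustrative values (assumed). On a real system they might come from the
# scheduler and from a device query, e.g. SLURM_NTASKS_PER_NODE under Slurm
# and `nvidia-smi -L | wc -l` on NVIDIA hardware.
check_one_task_per_gpu 4 4
```

A job script could abort before launching the run if the function returns a non-zero status.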
2.12.3.1. Halo swap imbalance time
A facility is provided to report the imbalance time observed at the point of the lattice distribution halo swap. This introduces an explicit synchronisation immediately before the halo swap to allow the separation of load imbalance and actual communication overhead.
lb_halo_scheme_report_imbalance yes # Default no
This appears as part of the breakdown of the lattice halo swap time, e.g.,
Lattice halos: 0.001 0.013 84.246 0.000842 (100000 calls)
-> imbalance: 0.000 0.013 14.277 0.000143 (100000 calls)
The synchronisation itself introduces an overhead, so this option should not be used for production runs.
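To put the two timer lines in proportion, the share of the halo swap time attributable to imbalance can be computed from the report. The sketch below assumes a POSIX shell with awk, and assumes that the third numeric column is the total time (which is consistent with the final column being that total divided by the number of calls) and that the imbalance time is included in the “Lattice halos” total.

```shell
#!/bin/sh
# Timer lines copied from the example output above.
report='Lattice halos: 0.001 0.013 84.246 0.000842 (100000 calls)
-> imbalance: 0.000 0.013 14.277 0.000143 (100000 calls)'

# Assumption: the third numeric column (awk field 5) is the total time.
halo_total=$(printf '%s\n' "$report" | awk '/^Lattice halos:/ { print $5 }')
imbalance_total=$(printf '%s\n' "$report" | awk '/imbalance:/ { print $5 }')

# Imbalance as a percentage of the total halo swap time.
imbalance_percent=$(awk -v i="$imbalance_total" -v h="$halo_total" \
    'BEGIN { printf "%.1f", 100 * i / h }')

echo "Imbalance share of halo time: ${imbalance_percent}%"
```

For the figures shown, roughly one sixth of the halo swap time is load imbalance rather than communication.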