The Gray Scott School

Day 9 — Kokkos on GPU

The portable kernel reaches the accelerator. Portability is not free: layout, host↔device transfers and synchronization all need attention. Then the school closes.

July 2, 2026, last day · Morning: Paul Zehner, Juan-José Silva Cuevas & Thomas Padioleau (Maison de la Simulation, CEA), Kokkos on GPU · 5 pm: Vincent Lafage, "A story about cubic root optimisation in C++ and Fortran" · then the closing presentation. The hands-on reuses Day 4's repo (day-4/exercises/), the gpu, gpu_async and gpu_async_more side.

Morning session — Kokkos reaches the accelerator

1. One source, all backends

The Day 4 Kokkos kernel now lands on the GPU: the backend is chosen at compile time — OpenMP for CPU cores, CUDA (NVIDIA), HIP (AMD), SYCL (Intel). The same Gray-Scott runs on a workstation, on a Jean-Zay node (Volta GPU), or tomorrow on an AMD accelerator.

2. Portability is not free

The honest lesson of the day. The Day 4 "CPU" Kokkos carried an explicit warning in the code: it only runs on CPU, it is not yet portable. Reaching the GPU required two adjustments:

  • Kokkos::LayoutRight on the Views and Iterate::Right in the policy, so that neighboring threads read neighboring addresses: this is Day 6's memory coalescing. A layout that ignores it can waste a large share of the GPU's memory bandwidth.
  • Explicit host↔device management.
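
To make the layout point concrete, here is a minimal sketch in plain standard C++ (no Kokkos, so it is an analogue rather than the school's code). It assumes a 2D array of size n0 x n1 and consecutive threads taking consecutive values of the rightmost index j, which is the situation Iterate::Right aims for. The helper names offset_right, offset_left, gap_right and gap_left are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Linear offset of element (i, j) in an n0 x n1 array, row-major:
// the rightmost index j is contiguous (the convention of Kokkos::LayoutRight).
constexpr std::size_t offset_right(std::size_t i, std::size_t j,
                                   std::size_t /*n0*/, std::size_t n1) {
    return i * n1 + j;
}

// Same element, column-major: the leftmost index i is contiguous
// (the convention of Kokkos::LayoutLeft).
constexpr std::size_t offset_left(std::size_t i, std::size_t j,
                                  std::size_t n0, std::size_t /*n1*/) {
    return j * n0 + i;
}

// Gap, in elements, between the addresses read by thread t and thread t+1
// when consecutive threads take consecutive j (rightmost index fastest).
constexpr std::size_t gap_right(std::size_t n0, std::size_t n1) {
    return offset_right(0, 1, n0, n1) - offset_right(0, 0, n0, n1);
}
constexpr std::size_t gap_left(std::size_t n0, std::size_t n1) {
    return offset_left(0, 1, n0, n1) - offset_left(0, 0, n0, n1);
}

// Row-major storage with j fastest: neighbors are adjacent in memory.
static_assert(gap_right(512, 512) == 1, "contiguous, coalesced access");
// Column-major storage with j fastest: neighbors are a whole column apart.
static_assert(gap_left(512, 512) == 512, "strided, uncoalesced access");

// Runtime version of the same check for arbitrary sizes.
inline void check_gaps(std::size_t n0, std::size_t n1) {
    assert(gap_right(n0, n1) == 1);
    assert(gap_left(n0, n1) == n0);
}
```

With a gap of 1, a group of neighboring threads reads one contiguous run of memory; with a gap of n0, each thread reads from a different, distant location. The same reasoning applies in reverse if threads run fastest along i, which is why the layout and the iteration order have to be chosen together.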

Kokkos guarantees one code compiles and runs everywhere, not that it is fast everywhere.

3. Moving the data

auto u_h = Kokkos::create_mirror_view(u);  // host buffer paired with the device View
Kokkos::deep_copy(u_h, u);                  // device → host, only when needed
Kokkos::fence("wait for compute");          // kernels are asynchronous

Minimizing these transfers is Day 2's "cut the round trips", transposed to the PCIe bus.

4. Overlapping: gpu, gpu_async, gpu_async_more

Variant          Idea
gpu              baseline: compute on device, synchronous copies
gpu_async        asynchronous writing: I/O overlaps the next compute
gpu_async_more   asynchronous sync and writing: transfers hidden as much as possible
[Figure: timelines of compute (device) and D→H copy + HDF5 write against time. gpu, synchronous: every copy blocks the compute. gpu_async / gpu_async_more: the write overlaps the next compute, and the difference is the time saved.]
Overlap: Kokkos kernels are asynchronous by nature, so writing image n while computing n+1 removes the transfers from the critical path.
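
The overlap pattern can be sketched on the host with the C++ standard library alone. This is an analogue under stated assumptions, not the code of the gpu_async exercises: compute_step stands in for a device kernel, write_frame stands in for the D→H copy plus HDF5 write, and all names (compute_step, write_frame, run_overlapped) are hypothetical. The snapshot copy plays the role of the host mirror, so the next compute can overwrite the working buffer while the previous frame is still being written.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for one compute step: fills the field with the step index.
inline void compute_step(std::vector<double>& u, int step) {
    for (auto& x : u) x = static_cast<double>(step);
}

// Hypothetical stand-in for "D->H copy + HDF5 write":
// records the first value of the frame in a log.
inline void write_frame(const std::vector<double>& frame,
                        std::vector<double>& log) {
    log.push_back(frame.empty() ? 0.0 : frame.front());
}

// Runs n_steps steps; frame n is written in the background
// while step n+1 is being computed.
inline std::vector<double> run_overlapped(std::size_t n_cells, int n_steps) {
    std::vector<double> u(n_cells);   // working buffer, overwritten every step
    std::vector<double> snapshot;     // frame handed to the writer
    std::vector<double> log;          // what the writer produced
    std::future<void> pending;        // the write currently in flight, if any

    for (int step = 0; step < n_steps; ++step) {
        // Overlaps with the write launched at the end of the previous iteration.
        compute_step(u, step);

        // The previous write must finish before its snapshot is reused.
        if (pending.valid()) pending.get();

        snapshot = u;  // private copy, so the next compute may overwrite u
        pending = std::async(std::launch::async, [&snapshot, &log] {
            write_frame(snapshot, log);
        });
    }
    if (pending.valid()) pending.get();  // drain the last write
    return log;
}
```

Calling run_overlapped(16, 4) returns one log entry per step, in order. The design choice worth noting is the single wait placed just before the snapshot is reused: it is the only synchronization point, so the writer never sees a half-updated frame and the compute never waits longer than necessary.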

Late afternoon (5 pm) & closing

5. The cube root: the last word

The final talk optimises a cube root in C++ and Fortran — looping straight back to Day 3, and tying the week together: the compiler, the language, measurement, and never trusting intuition before timing.

6. The arc of optimization

From CPU foundations (Day 1) to a portable GPU kernel (today), the school traced one arc: the same Gray-Scott, made faster step by step — vectorization, multicore, Fortran, do concurrent, Python, and finally Kokkos — at a constant numerical result. The lesson is neither a language nor an API: it is a method. Measure, find the limiting factor, exploit the hardware.

On video — the official replay

Replay — Kokkos on GPU (Gray Scott Thursdays)
