The Gray Scott School

Day 9 — Kokkos on GPU

The portable kernel reaches the accelerator. Portability is not free: it takes care over memory layout, host↔device transfers and synchronization. Then the school closes.

July 2, 2026 — last day · Morning: Paul Zehner, Juan-José Silva Cuevas & Thomas Padioleau (Maison de la Simulation, CEA) — Kokkos on GPU · 5 pm: Vincent Lafage — A story about cubic root optimisation in C++ and Fortran · then the closing presentation. The hands-on reuses Day 4's repo (day-4/exercises/), the gpu → gpu_async → gpu_async_more side.

Morning session — Kokkos reaches the accelerator

1. One source, all backends

The Day 4 Kokkos kernel now lands on the GPU: the backend is chosen at compile time — OpenMP for CPU cores, CUDA (NVIDIA), HIP (AMD), SYCL (Intel). The same Gray-Scott runs on a workstation, on a Jean-Zay node (Volta GPU), or tomorrow on an AMD accelerator.
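The "one source, backend chosen at compile time" idea can be sketched in plain C++ without Kokkos. This is only an analogy: `SerialBackend`, `ChunkedBackend`, `DefaultBackend` and the `SKETCH_USE_CHUNKED` macro are hypothetical names invented for this sketch. In Kokkos the comparable role is played by execution spaces and the `Kokkos::DefaultExecutionSpace` alias, which is fixed when Kokkos is configured and built.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical backend: runs the kernel over [0, n) in a plain loop.
struct SerialBackend {
    template <class Kernel>
    static void parallel_for(std::size_t n, Kernel kernel) {
        for (std::size_t i = 0; i < n; ++i) kernel(i);
    }
};

// Hypothetical second backend: same contract, different traversal (fixed-size chunks).
struct ChunkedBackend {
    template <class Kernel>
    static void parallel_for(std::size_t n, Kernel kernel) {
        constexpr std::size_t chunk = 4;
        for (std::size_t begin = 0; begin < n; begin += chunk) {
            const std::size_t end = (begin + chunk < n) ? begin + chunk : n;
            for (std::size_t i = begin; i < end; ++i) kernel(i);
        }
    }
};

// The backend is picked once, at compile time, by a build flag (assumed macro name).
#if defined(SKETCH_USE_CHUNKED)
using DefaultBackend = ChunkedBackend;
#else
using DefaultBackend = SerialBackend;
#endif

// The kernel source is written once and does not mention any particular backend.
template <class Backend = DefaultBackend>
void scale(const std::vector<double>& in, std::vector<double>& out, double factor) {
    assert(in.size() == out.size());
    Backend::parallel_for(in.size(), [&](std::size_t i) { out[i] = factor * in[i]; });
}
```

Calling `scale(in, out, 2.0)` uses whichever backend the build selected; `scale<ChunkedBackend>(in, out, 2.0)` forces one explicitly. Both must give the same numerical result, which is the property the school checks for Gray-Scott.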

2. Portability is not free

The honest lesson of the day. The Day 4 "CPU" Kokkos version carried an explicit warning in the code: it only runs on CPU and is not yet portable. Reaching the GPU required two adjustments:

Kokkos::LayoutRight on the Views and Iterate::Right in the policy, so that neighboring threads read neighboring addresses: this is Day 6's memory coalescing. A layout that ignores it wastes most of the GPU's memory bandwidth.
Explicit host↔device management: mirror views, deep copies and fences, because the device has its own memory.
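Why the layout adjustment matters can be seen from the index arithmetic alone. The sketch below is plain C++, not Kokkos: `offset_right` and `offset_left` are illustrative helpers that assume an unpadded 2D array, with the rightmost index contiguous (the LayoutRight convention) or the leftmost index contiguous (the LayoutLeft convention). It measures how far apart in memory two neighboring threads land when the rightmost index varies fastest across threads.

```cpp
#include <cassert>
#include <cstddef>

// Unpadded row-major offset (LayoutRight convention): j is the contiguous index.
inline std::size_t offset_right(std::size_t i, std::size_t j, std::size_t ny) {
    assert(j < ny);
    return i * ny + j;
}

// Unpadded column-major offset (LayoutLeft convention): i is the contiguous index.
inline std::size_t offset_left(std::size_t i, std::size_t j, std::size_t nx) {
    assert(i < nx);
    return i + j * nx;
}

// Distance, in elements, between the addresses touched by thread j and
// thread j+1 of the same row when the rightmost index varies fastest.
inline std::size_t thread_stride_right(std::size_t ny) {
    return offset_right(0, 1, ny) - offset_right(0, 0, ny);
}

inline std::size_t thread_stride_left(std::size_t nx) {
    return offset_left(0, 1, nx) - offset_left(0, 0, nx);
}
```

For a 512 × 512 grid, `thread_stride_right(512)` is 1 (neighboring threads touch neighboring elements, so the accesses can coalesce), while `thread_stride_left(512)` is 512 (each thread lands in a different part of memory).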

Kokkos guarantees one code compiles and runs everywhere, not that it is fast everywhere.

3. Moving the data

auto u_h = Kokkos::create_mirror_view(u);  // host buffer paired with the device View
Kokkos::deep_copy(u_h, u);                  // device → host, only when needed
Kokkos::fence("wait for compute");          // kernels are asynchronous

Minimizing these transfers is Day 2's "cut the round trips", transposed to the PCIe bus.
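The effect of copying "only when needed" can be counted with a small host-only model. Nothing here calls Kokkos: `std::vector` stands in for the device View and its host mirror, `copy_to_host` stands in for `Kokkos::deep_copy`, and `TransferLog`, `run` and the halving "kernel" are placeholders invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bookkeeping for the simulated device→host traffic.
struct TransferLog {
    std::size_t copies = 0;
    std::size_t bytes = 0;
};

// Stand-in for a device→host deep copy: copies the data and records the traffic.
inline void copy_to_host(std::vector<double>& host, const std::vector<double>& device,
                         TransferLog& log) {
    host = device;
    ++log.copies;
    log.bytes += device.size() * sizeof(double);
}

// Advance `steps` iterations on an n-element field, copying back to the host
// only every `output_every` steps (when a frame is actually written).
inline TransferLog run(std::size_t steps, std::size_t output_every, std::size_t n) {
    assert(output_every > 0);
    std::vector<double> device(n, 1.0);
    std::vector<double> host(n);
    TransferLog log;
    for (std::size_t s = 1; s <= steps; ++s) {
        for (double& x : device) x *= 0.5;  // placeholder update, not Gray-Scott
        if (s % output_every == 0) copy_to_host(host, device, log);
    }
    return log;
}
```

With 100 steps, `run(100, 1, n)` performs 100 copies while `run(100, 10, n)` performs 10: the bus traffic drops by the output interval, with the same frames written.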

4. Overlapping: `gpu` → `gpu_async` → `gpu_async_more`

| Variant | Idea |
| --- | --- |
| `gpu` | baseline: compute on device, synchronous copies |
| `gpu_async` | asynchronous writing: I/O overlaps the next compute |
| `gpu_async_more` | asynchronous sync and writing: transfers hidden as much as possible |

Overlap: Kokkos kernels launch asynchronously, so writing image n while computing image n+1 takes the transfers and the I/O off the critical path.
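How much the overlap can gain follows from a simple two-stage pipeline model. The sketch below is an idealized estimate, not a measurement of the exercises: it assumes every frame costs the same `compute` and `write` time, that frame n sits in its own host buffer while frame n+1 is computed, and that the two stages do not slow each other down. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Total time when each frame is computed, then written, with no overlap.
inline double serial_time(int frames, double compute, double write) {
    assert(frames >= 0 && compute >= 0.0 && write >= 0.0);
    return frames * (compute + write);
}

// Total time when writing frame n overlaps computing frame n+1.
// The first compute and the last write cannot be hidden; in between,
// each frame costs only the slower of the two stages.
inline double overlapped_time(int frames, double compute, double write) {
    assert(frames >= 0 && compute >= 0.0 && write >= 0.0);
    if (frames == 0) return 0.0;
    return compute + (frames - 1) * std::max(compute, write) + write;
}
```

With 10 frames, 4 time units of compute and 1 of writing per frame, the model gives 50 units without overlap and 41 with it. Writing disappears from the critical path only while it stays shorter than the compute; once writing is the slower stage, it becomes the limiting factor instead.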

Late afternoon (5 pm) & closing

5. The cube root: the last word

The final talk optimises a cube root in C++ and Fortran — looping straight back to Day 3, and tying the week together: the compiler, the language, measurement, and never trusting intuition before timing.

6. The arc of optimization

From CPU foundations (Day 1) to a portable GPU kernel (today), the school traced one arc: the same Gray-Scott, made faster step by step — vectorization, multicore, Fortran, do concurrent, Python, and finally Kokkos — at a constant numerical result. The lesson is neither a language nor an API: it is a method. Measure, find the limiting factor, exploit the hardware.

On video — the official replay

Replay — Kokkos on GPU (Gray Scott Thursdays)

Sources & official material

The day's slides (PDF, school GitLab wiki): kokkos_gpu.pdf · the cube root — Reprises IJC/PSA vectorisation
The course repository (gpu/gpu_async/gpu_async_more exercises + LaTeX lecture): github.com/Maison-de-la-Simulation/gray-scott-kokkos
Kokkos: kokkos.org · github.com/kokkos/kokkos
Video replays (YouTube): Gray Scott Thursdays
School website: GrayScott2026

