Day 9 — Kokkos on GPU
July 2, 2026 — last day · Morning: Paul Zehner, Juan-José Silva Cuevas & Thomas Padioleau (Maison de la Simulation, CEA) — Kokkos on GPU · 5 pm: Vincent Lafage — A story about cubic root optimisation in C++ and Fortran · then the closing presentation. The hands-on reuses Day 4's repo (
day-4/exercises/), the gpu → gpu_async → gpu_async_more side.
Morning session — Kokkos reaches the accelerator
1. One source, all backends
The Day 4 Kokkos kernel now lands on the GPU: the backend is chosen at compile time — OpenMP for CPU cores, CUDA (NVIDIA), HIP (AMD), SYCL (Intel). The same Gray-Scott runs on a workstation, on a Jean-Zay node (Volta GPU), or tomorrow on an AMD accelerator.
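To illustrate what "chosen at compile time" means, here is a toy sketch in plain C++ with no Kokkos dependency. The tags `SerialBackend` and `StridedBackend`, the macro `SKETCH_USE_STRIDED` and the function `scale` are hypothetical names invented for this example. In Kokkos itself the backend comes from the build configuration and is exposed as `Kokkos::DefaultExecutionSpace`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical backend tags. Kokkos' real equivalents are execution
// spaces such as Kokkos::OpenMP or Kokkos::Cuda.
struct SerialBackend {};
struct StridedBackend {};

// Backend 1: visit the indices in order.
template <class Kernel>
void parallel_for(SerialBackend, std::size_t n, Kernel kernel) {
    for (std::size_t i = 0; i < n; ++i) kernel(i);
}

// Backend 2: visit the indices in a strided order, a crude stand-in for
// a device that schedules work differently from a CPU loop.
template <class Kernel>
void parallel_for(StridedBackend, std::size_t n, Kernel kernel) {
    const std::size_t lanes = 4;  // assumed lane count, illustration only
    for (std::size_t lane = 0; lane < lanes; ++lane)
        for (std::size_t i = lane; i < n; i += lanes) kernel(i);
}

// The default backend is fixed when the file is compiled.
#if defined(SKETCH_USE_STRIDED)
using DefaultBackend = StridedBackend;
#else
using DefaultBackend = SerialBackend;
#endif

// The kernel is written once and never mentions a concrete backend.
template <class Backend = DefaultBackend>
std::vector<double> scale(const std::vector<double>& in, double factor) {
    std::vector<double> out(in.size());
    const double* src = in.data();
    double* dst = out.data();
    parallel_for(Backend{}, in.size(),
                 [=](std::size_t i) { dst[i] = factor * src[i]; });
    return out;
}
```

Because each index writes only its own output element, the result does not depend on the order in which a backend visits the indices. That independence is what allows one source to be retargeted.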
2. Portability is not free
The honest lesson of the day. The Day 4 "CPU" Kokkos carried an explicit warning in the code: it only runs on CPU, it is not yet portable. Reaching the GPU required two adjustments:
- Kokkos::LayoutRight on the Views and Iterate::Right in the policy, so neighboring threads read neighboring addresses (Day 6's memory coalescing). A layout that ignores it collapses GPU bandwidth.
- Explicit host↔device management.
Kokkos guarantees one code compiles and runs everywhere, not that it is fast everywhere.
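The layout point can be made concrete with a little index arithmetic. The sketch below is plain C++, not Kokkos. It assumes that neighboring threads take consecutive values of the rightmost index `j`, which is a simplification of how a device maps threads to indices. The function names are hypothetical.

```cpp
#include <cstddef>

// Row-major offset: the rightmost index j is contiguous in memory
// (the convention Kokkos calls LayoutRight).
std::size_t offset_right(std::size_t i, std::size_t j,
                         std::size_t /*nx*/, std::size_t ny) {
    return i * ny + j;
}

// Column-major offset: the leftmost index i is contiguous in memory
// (the convention Kokkos calls LayoutLeft).
std::size_t offset_left(std::size_t i, std::size_t j,
                        std::size_t nx, std::size_t /*ny*/) {
    return i + j * nx;
}

// Distance, in elements, between the addresses read by two neighboring
// threads, under the assumption that they differ by one in j.
template <class Offset>
std::size_t neighbor_stride(Offset offset, std::size_t nx, std::size_t ny) {
    return offset(0, 1, nx, ny) - offset(0, 0, nx, ny);
}
```

For a 1024 × 512 grid, `neighbor_stride(offset_right, 1024, 512)` is 1, so neighboring threads touch adjacent elements. `neighbor_stride(offset_left, 1024, 512)` is 1024, so each thread lands in a different stretch of memory and accesses cannot be coalesced.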
3. Moving the data
```cpp
auto u_h = Kokkos::create_mirror_view(u);  // host buffer paired with the device View
Kokkos::deep_copy(u_h, u);                 // device → host, only when needed
Kokkos::fence("wait for compute");         // kernels are asynchronous
```
Minimizing these transfers is Day 2's "cut the round trips", transposed to the PCIe bus.
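A back-of-the-envelope count shows why the copy frequency matters. The helpers below are a hypothetical sketch in plain C++. They assume double-precision fields on an nx × ny grid and one device-to-host copy of every field per output, and they are not taken from the course repository.

```cpp
#include <cstddef>

// Number of device-to-host copies when a snapshot is taken every
// `output_interval` steps.
std::size_t copy_count(std::size_t steps, std::size_t output_interval) {
    return steps / output_interval;
}

// Total bytes moved across the bus for `fields` double-precision grids
// of size nx * ny over the whole run.
std::size_t bytes_transferred(std::size_t nx, std::size_t ny,
                              std::size_t fields, std::size_t steps,
                              std::size_t output_interval) {
    return copy_count(steps, output_interval) * fields * nx * ny *
           sizeof(double);
}
```

With 1000 steps, copying back after every step means 1000 transfers, while copying only every 100 steps means 10. The traffic over the bus drops by the same factor of 100.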
4. Overlapping: gpu → gpu_async → gpu_async_more
| Variant | Idea |
|---|---|
| gpu | baseline: compute on device, synchronous copies |
| gpu_async | asynchronous writing: I/O overlaps the next compute |
| gpu_async_more | asynchronous sync and writing: transfers hidden as much as possible |
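The overlap idea can be sketched on the host alone with `std::async`. This is not the code of the exercises: `compute_step` and `write_snapshot` are hypothetical stand-ins for the Gray-Scott kernel and the file output. The sketch only shows the pattern of handing a snapshot to a writer while the time loop continues.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Stand-in for one solver step (the real kernel is Gray-Scott).
void compute_step(std::vector<double>& field) {
    for (double& x : field) x = 0.5 * x + 1.0;
}

// Stand-in for slow output: reduce the snapshot to one number.
double write_snapshot(std::vector<double> snapshot) {
    return std::accumulate(snapshot.begin(), snapshot.end(), 0.0);
}

// Baseline: every output blocks the time loop.
std::vector<double> run_sync(std::size_t n, int steps, int output_interval) {
    std::vector<double> field(n, 0.0), written;
    for (int s = 1; s <= steps; ++s) {
        compute_step(field);
        if (s % output_interval == 0)
            written.push_back(write_snapshot(field));
    }
    return written;
}

// Overlapped: the snapshot is copied out, then written by another
// thread while the loop keeps computing.
std::vector<double> run_async(std::size_t n, int steps, int output_interval) {
    std::vector<double> field(n, 0.0), written;
    std::future<double> pending;  // at most one write in flight
    for (int s = 1; s <= steps; ++s) {
        compute_step(field);
        if (s % output_interval == 0) {
            // Wait for the previous write before starting a new one.
            if (pending.valid()) written.push_back(pending.get());
            std::vector<double> snapshot = field;  // copy owned by the writer
            pending = std::async(std::launch::async, write_snapshot,
                                 std::move(snapshot));
        }
    }
    if (pending.valid()) written.push_back(pending.get());  // drain last write
    return written;
}
```

Both variants produce the same outputs. The difference is that `run_async` waits for a write only when the next snapshot is ready, so writing and computing can proceed at the same time. The private snapshot copy is what makes this safe: the writer never reads a buffer the solver is still updating.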
Late afternoon (5 pm) & closing
5. The cube root: the last word
The final talk optimises a cube root in C++ and Fortran — looping straight back to Day 3, and tying the week together: the compiler, the language, measurement, and never trusting intuition before timing.
6. The arc of optimization
From CPU foundations (Day 1) to a portable GPU kernel (today), the school traced one arc: the same Gray-Scott, made faster step by step (vectorization, multicore, Fortran, do concurrent, Python, and finally Kokkos) while producing the same numerical result. The lesson is neither a language nor an API: it is a method. Measure, find the limiting factor, exploit the hardware.
On video — the official replay
Sources & official material
- The day's slides (PDF, school GitLab wiki): kokkos_gpu.pdf · the cube root — Reprises IJC/PSA vectorisation
- The course repository (gpu/gpu_async/gpu_async_more exercises + LaTeX lecture): github.com/Maison-de-la-Simulation/gray-scott-kokkos
- Kokkos: kokkos.org · github.com/kokkos/kokkos
- Video replays (YouTube): Gray Scott Thursdays
- School website: GrayScott2026