The Gray Scott School

Day 4 — Kokkos on CPU

June 25, with Paul Zehner, Juan-José Silva Cuevas and Thomas Padioleau: Kokkos on CPU — one C++ source, backends chosen at compile time, Views, parallel_for and SIMD.

June 25, 2026 · Speakers: Paul Zehner, Juan-José Silva Cuevas & Thomas Padioleau (Maison de la Simulation, CEA) · Marcel Vivargent Auditorium + satellite sites (including CINERI). The hands-on lives in GrayScott2026/day-4/exercises/ and is CPU only: the gpu* exercises wait for Day 9. Rare bonus: the lecture itself is in the repo (courses/kokkos_cpu.tex, Gray-Scott beamer theme).

Morning session — from sequential to the first kernel

1. Why Kokkos

The lecture opens with Kokkos's promises, and they are precise. Kokkos targets the 3Ps: Performance (getting the best out of a given hardware), Portability (the same code on different hardware) and Productivity (write, maintain and extend quickly). It adds maturity (production-ready, not a research product), community, longevity and interoperability (I/O, linear algebra, ML).

Approach            | Runs on                          | You write
Raw CUDA / HIP      | one vendor's GPU                 | low-level kernels
OpenMP / OpenACC    | CPU (+ some GPU)                 | directives on loops
std::execution::par | compiler-dependent               | standard algorithms
Kokkos              | CPU + NVIDIA / AMD / Intel GPU   | C++ patterns + Views

2. The thesis: one code, many backends

[Figure: one single Kokkos source (parallel_for + lambda + View), backend chosen at compile time with -DKokkos_ENABLE_…=ON. Today, CPU: Serial, OpenMP, Threads. Day 9, GPU: CUDA, HIP, SYCL. Performance · Portability · Productivity.]
The Kokkos thesis: the same code, recompiled with another flag, targets another hardware. CPU backends today, GPUs on Day 9.

3. The hands-on starts: hello_world, then sequential

The repo progression is built to isolate each idea. hello_world checks the installation (initialization and finalization via Kokkos::ScopeGuard); sequential provides the reference sequential Gray-Scott implementation, the baseline every variant is compared against. Shared infrastructure lives in common/ (CLI11 parameters, output writer, helpers).
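To make the idea of a sequential baseline concrete, here is a minimal sketch of one time step in plain C++. It is not the repo's code: it assumes the textbook Gray-Scott reaction terms, a 5-point Laplacian with unit grid spacing, and a flat row-major std::vector as storage. The names GrayScottParams and gray_scott_step are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle; names are illustrative, not the repo's.
struct GrayScottParams {
  double Du;  // diffusion rate of u
  double Dv;  // diffusion rate of v
  double F;   // feed rate
  double k;   // kill rate
  double dt;  // time step
};

// One explicit Euler step on the interior of a rows x cols grid stored
// row-major in flat vectors. Boundary cells of u_new and v_new are not
// written; handling them is left to the caller.
void gray_scott_step(const std::vector<double>& u, const std::vector<double>& v,
                     std::vector<double>& u_new, std::vector<double>& v_new,
                     std::size_t rows, std::size_t cols,
                     const GrayScottParams& p) {
  assert(rows >= 3 && cols >= 3);
  assert(u.size() == rows * cols && v.size() == rows * cols);
  assert(u_new.size() == rows * cols && v_new.size() == rows * cols);

  for (std::size_t i = 1; i + 1 < rows; ++i) {
    for (std::size_t j = 1; j + 1 < cols; ++j) {
      const std::size_t c = i * cols + j;
      // 5-point Laplacian, assuming unit grid spacing.
      const double lap_u =
          u[c - cols] + u[c + cols] + u[c - 1] + u[c + 1] - 4.0 * u[c];
      const double lap_v =
          v[c - cols] + v[c + cols] + v[c - 1] + v[c + 1] - 4.0 * v[c];
      const double uvv = u[c] * v[c] * v[c];  // reaction term u * v^2
      u_new[c] = u[c] + p.dt * (p.Du * lap_u - uvv + p.F * (1.0 - u[c]));
      v_new[c] = v[c] + p.dt * (p.Dv * lap_v + uvv - (p.F + p.k) * v[c]);
    }
  }
}
```

A quick sanity check for such a baseline: the uniform state u = 1, v = 0 should be left unchanged by a step, since both the Laplacian and the reaction terms vanish there.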

Afternoon session — Views, parallelism, SIMD

4. View — the container that abstracts memory

Why an abstracted container? The lecture answers: no more manual allocation, unified CPU/GPU memory semantics, vendor-specific allocation hidden, advanced capabilities (abstracted layout, subarrays, multidimensionality) and safety (compile-time and runtime checks).

[Figure: Kokkos::View<double**>, logical 3 × 4 array; each cell shows its position in memory.
LayoutRight, rows first (C's row-major):
 1  2  3  4
 5  6  7  8
 9 10 11 12
LayoutLeft, columns first (Fortran's column-major):
 1  4  7 10
 2  5  8 11
 3  6  9 12
Default: Right on CPU, Left on GPU. Kokkos picks the layout that fits the hardware; v(i, j) in the kernel does not change.]
The View separates logical indexing v(i, j) from the actual memory order: switching layout does not change one line of the kernel.

This closes the loop with Days 2-3, where layout was the biggest lever: Kokkos turns it into a type parameter that defaults to what fits the hardware.
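The idea of layout as a type parameter can be sketched without Kokkos. The toy container below is an assumption-laden stand-in, not Kokkos::View: RowMajor, ColMajor, Grid2D and number_cells are hypothetical names, and the two policies only mimic the ordering of LayoutRight and LayoutLeft for a rank-2 array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout policies for this sketch only.
struct RowMajor {  // same ordering idea as LayoutRight: j is contiguous
  static std::size_t offset(std::size_t i, std::size_t j,
                            std::size_t /*rows*/, std::size_t cols) {
    return i * cols + j;
  }
};

struct ColMajor {  // same ordering idea as LayoutLeft: i is contiguous
  static std::size_t offset(std::size_t i, std::size_t j,
                            std::size_t rows, std::size_t /*cols*/) {
    return i + j * rows;
  }
};

// Toy 2D container: logical indexing g(i, j), layout chosen by a type.
template <class Layout>
class Grid2D {
 public:
  Grid2D(std::size_t rows, std::size_t cols)
      : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

  double& operator()(std::size_t i, std::size_t j) {
    assert(i < rows_ && j < cols_);  // runtime bounds check
    return data_[Layout::offset(i, j, rows_, cols_)];
  }

  std::size_t rows() const { return rows_; }
  std::size_t cols() const { return cols_; }
  const std::vector<double>& storage() const { return data_; }

 private:
  std::size_t rows_;
  std::size_t cols_;
  std::vector<double> data_;
};

// The "kernel": written once against g(i, j), valid for either layout.
// It numbers the cells 1, 2, 3, ... in logical row-by-row order.
template <class Grid>
void number_cells(Grid& g) {
  for (std::size_t i = 0; i < g.rows(); ++i) {
    for (std::size_t j = 0; j < g.cols(); ++j) {
      g(i, j) = static_cast<double>(i * g.cols() + j + 1);
    }
  }
}
```

With a 3 × 4 grid, both Grid2D<RowMajor> and Grid2D<ColMajor> return the same value for g(1, 2) after number_cells, while the second slot of the underlying storage holds logical cell (0, 1) in one case and (1, 0) in the other. Only the type argument changed, not the kernel.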

5. parallel_for and parallel_reduce

The kernel becomes a lambda handed to a parallel pattern, with an iteration policy:

Kokkos::parallel_for("compute",
  Kokkos::MDRangePolicy<Kokkos::Rank<2>>({1, 1}, {rows - 1, cols - 1}),
  KOKKOS_LAMBDA (int i, int j) {
    u_temp(i, j) = u(i, j) + dt * (/* … the stencil … */);
  });

The reduction (field checksum) follows the same pattern with parallel_reduce — the parallel, safe version of accumulation.
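One detail worth spelling out is that the bounds of a Kokkos range policy are half-open: the lower bound is included and the upper bound is excluded, so {1, 1} to {rows - 1, cols - 1} visits the interior and skips the boundary ring. The sketch below is a sequential stand-in in plain C++, not Kokkos code; for_each_2d, reduce_2d and the two helper functions are hypothetical names.

```cpp
#include <cassert>

// Sequential stand-in for a rank-2 range with half-open bounds
// [i_begin, i_end) x [j_begin, j_end).
template <class Body>
void for_each_2d(int i_begin, int j_begin, int i_end, int j_end, Body body) {
  assert(i_begin <= i_end && j_begin <= j_end);
  for (int i = i_begin; i < i_end; ++i) {
    for (int j = j_begin; j < j_end; ++j) {
      body(i, j);
    }
  }
}

// Reduction over the same range. The body receives the accumulator as its
// last argument, mirroring the shape of a reduction lambda. A parallel
// runtime would give each worker a private partial sum and combine them;
// a single accumulator is enough here because execution is sequential.
template <class Body>
double reduce_2d(int i_begin, int j_begin, int i_end, int j_end, Body body) {
  double total = 0.0;
  for_each_2d(i_begin, j_begin, i_end, j_end,
              [&](int i, int j) { body(i, j, total); });
  return total;
}

// Number of cells visited by the interior range of a rows x cols grid.
int interior_count(int rows, int cols) {
  int visited = 0;
  for_each_2d(1, 1, rows - 1, cols - 1, [&](int, int) { ++visited; });
  return visited;
}

// Sum of the flat row-major indices of the interior cells.
double interior_index_sum(int rows, int cols) {
  return reduce_2d(1, 1, rows - 1, cols - 1,
                   [cols](int i, int j, double& acc) { acc += i * cols + j; });
}
```

For a 4 × 5 grid the interior range covers i in {1, 2} and j in {1, 2, 3}, that is 6 cells out of 20.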

6. Switching to the CPU backend

The cpu exercise enables OpenMP (-DKokkos_ENABLE_OPENMP=ON): same Views, same lambdas, all the cores. Repo honesty: the file carries an explicit warning — this version "only runs on CPU, it is not yet portable Kokkos". What is missing for the GPU (layouts, transfers, fence) is exactly Day 9's program.

Backend | Flag
Serial  | -DKokkos_ENABLE_SERIAL=ON
OpenMP  | -DKokkos_ENABLE_OPENMP=ON
Threads | -DKokkos_ENABLE_THREADS=ON
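Because the backend is fixed at compile time, the choice is visible to the preprocessor: Kokkos defines macros such as KOKKOS_ENABLE_SERIAL, KOKKOS_ENABLE_OPENMP and KOKKOS_ENABLE_THREADS for the backends it was built with. The sketch below reports them; it is written so that it still compiles when the Kokkos headers are absent, in which case it returns "none". The function name enabled_cpu_backends is hypothetical.

```cpp
#include <cassert>
#include <string>

// Pull in Kokkos only if its headers are on the include path, so the
// sketch also builds on a machine without Kokkos.
#if defined(__has_include)
#if __has_include(<Kokkos_Core.hpp>)
#include <Kokkos_Core.hpp>
#endif
#endif

// Returns a space-separated list of the CPU backends enabled in the Kokkos
// build this file was compiled against, or "none" if no macro is defined.
std::string enabled_cpu_backends() {
  std::string names;
#if defined(KOKKOS_ENABLE_SERIAL)
  names += "Serial ";
#endif
#if defined(KOKKOS_ENABLE_OPENMP)
  names += "OpenMP ";
#endif
#if defined(KOKKOS_ENABLE_THREADS)
  names += "Threads ";
#endif
  if (names.empty()) {
    names = "none";
  }
  assert(!names.empty());  // postcondition: always report something
  return names;
}
```

Printing this string at program start is a cheap way to confirm that a build directory was configured with the backend you think it was. Note that several backends can be enabled in the same build.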

7. SIMD — explicit vectorization, Kokkos style

The cpu_simd exercise goes one level down: the Kokkos::Experimental::simd types pack simd_width = SimdType::size() values per operation, loaded from the Views (simd_flag_default). It is the same in-core parallelism as Day 1 — and the direct preview of EVE on Day 6.
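The structure of an explicitly vectorized loop can be sketched in plain C++: process the data in packs of a fixed width, then finish the leftover elements one by one. This sketch does not use the Kokkos simd types; Pack, kWidth = 4 and axpy_packed are hypothetical, and whether the compiler turns the lane loop into real vector instructions is not guaranteed here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kWidth = 4;  // hypothetical pack width for this sketch
using Pack = std::array<double, kWidth>;

// Load kWidth consecutive values starting at p into a pack.
Pack load(const double* p) {
  Pack x{};
  for (std::size_t l = 0; l < kWidth; ++l) x[l] = p[l];
  return x;
}

// Store the kWidth lanes of a pack to consecutive memory starting at p.
void store(double* p, const Pack& x) {
  for (std::size_t l = 0; l < kWidth; ++l) p[l] = x[l];
}

// y = a * x + y, computed pack by pack with a scalar remainder loop.
void axpy_packed(double a, const std::vector<double>& x,
                 std::vector<double>& y) {
  assert(x.size() == y.size());
  const std::size_t n = x.size();
  const std::size_t n_packed = n - n % kWidth;  // largest multiple of kWidth

  for (std::size_t i = 0; i < n_packed; i += kWidth) {
    const Pack px = load(&x[i]);
    Pack py = load(&y[i]);
    for (std::size_t l = 0; l < kWidth; ++l) {
      py[l] = a * px[l] + py[l];  // same operation on every lane
    }
    store(&y[i], py);
  }
  for (std::size_t i = n_packed; i < n; ++i) {
    y[i] = a * x[i] + y[i];  // remainder that does not fill a pack
  }
}
```

With six elements and a width of four, the first four go through the packed path and the last two through the remainder loop; the result is identical to a plain scalar loop.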

8. Verify, always

scripts/check_outcome.sh replays every implementation on the 10 × 10 case and compares checksums; scripts/run_all.sh runs all the variants of a build. Day 1's rule (sequential/parallel equivalence) now has tooling behind it.
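The comparison itself can be sketched as follows. This is not the logic of check_outcome.sh, whose exact criterion is not described here. The sketch assumes that a checksum is the plain sum of a field and that two checksums are accepted when they agree within a relative tolerance, which is a common choice because parallel sums may add terms in a different order. checksum and checksums_match are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Assumed checksum: the plain sum of all field values.
double checksum(const std::vector<double>& field) {
  return std::accumulate(field.begin(), field.end(), 0.0);
}

// True if the two checksums agree within rel_tol, measured relative to the
// larger magnitude (with a floor of 1 so values near zero compare sanely).
bool checksums_match(double reference, double candidate, double rel_tol) {
  assert(rel_tol >= 0.0);
  const double scale =
      std::max({1.0, std::abs(reference), std::abs(candidate)});
  return std::abs(reference - candidate) <= rel_tol * scale;
}
```

A typical use would compute the checksum of the sequential result once and compare each variant against it with a small tolerance such as 1e-12; the right tolerance depends on field size and step count and is left as a judgment call.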

The hands-on — GrayScott2026/day-4/

Dependencies: CMake ≥ 3.28, Kokkos ≥ 5.1.1, HDF5 (C++), CLI11. Two paths:

# 1) dependencies handled by CMake (except HDF5)
cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_DOWNLOAD_FALLBACK=ON -DHDF5_ROOT=…
cmake --build build --parallel $(nproc)

# 2) check an implementation
bash exercises/scripts/check_outcome.sh build/cpu/gray_scott_cpu

Official Docker images (CPU: interactive, jupyter, vscode, code_server) are listed in the repo README.

On video — the official replay

Replay — Kokkos On CPU and GPU (Gray Scott Thursdays)
