The Gray Scott School

Day 6 — SIMD with EVE + GPU architecture

June 29, two sessions: Joël Falcou opens the week with EVE and Kiwaku (explicit, portable C++20 SIMD), then Pierre Aubert follows with the GPU architecture that underpins the last three days.

June 29, 2026, week 2 opens · Morning, Joël Falcou (LISN, CodeReckons): C++20 Computing with EVE + Kiwaku · Afternoon (2 pm), Pierre Aubert (LAPP): GPU Architecture, massively parallel computing · Marcel Vivargent Auditorium + satellites (including CINERI). The hands-on material lives in GrayScott2026/day-6/: Falcou's exercise files plus the cloned EVE library.

Morning session: EVE + Kiwaku, SIMD made explicit

1. The problem: a fragile parallelism

A single core already processes several floats per instruction (8 with AVX2, 16 with AVX-512), independently of multithreading. But this parallelism is fragile: compiler auto-vectorization is not guaranteed (Day 3 showed loops the compiler refused to vectorize), and hand-written intrinsics do not survive a change of instruction set. Falcou's answer: make the SIMD register a type.
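
The lane counts quoted above follow from register width divided by element size. A minimal sketch of that arithmetic, assuming 32-bit floats and the usual register widths (256 bits for AVX2, 512 for AVX-512, 128 for NEON); the helper name `lanes` is hypothetical:

```cpp
#include <cstddef>

// Number of elements of type T that fit in a SIMD register of the given width.
template <class T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

static_assert(sizeof(float) == 4, "this sketch assumes 32-bit floats");
static_assert(lanes<float>(256) == 8, "AVX2: 8 floats per instruction");
static_assert(lanes<float>(512) == 16, "AVX-512: 16 floats per instruction");
```

The same source line therefore processes a different number of elements per instruction depending on the target, which is what makes hand-written intrinsics hard to port.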

2. `eve::wide` — the register becomes a C++ type

First hands-on exercise, basic.cpp — seven lines that hold the whole thesis:

eve::wide<float, eve::fixed<8>> x( [](auto i, auto) { return 1.f + i; } );

std::cout << "EVE is optimizing for: " << eve::current_api << "\n";
std::cout << eve::sqrt(eve::abs(1 - x)) << "\n";   // all 8 lanes at once

eve::current_api prints the ISA detected at compile time — the same source emits AVX2 here, AVX-512 or NEON elsewhere.

The wide type IS the register: write the computation once, EVE emits the right instructions for the target machine
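
To make the "register as a type" idea concrete without installing EVE, here is a toy emulation in plain C++. Everything in the `toy` namespace is hypothetical and only mimics the shape of the excerpt above; it stores lanes in an array and loops over them, whereas a real SIMD type maps each operation to vector instructions.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

namespace toy {

// Hypothetical stand-in for a SIMD register type: N lanes of T in an array.
// Illustration only; this is not how EVE implements eve::wide.
template <class T, std::size_t N>
struct wide {
    std::array<T, N> lane{};

    // Fill lane i with g(i), loosely mirroring the generator in basic.cpp.
    template <class Generator>
    static wide generate(Generator g) {
        wide w;
        for (std::size_t i = 0; i < N; ++i) w.lane[i] = g(i);
        return w;
    }

    T operator[](std::size_t i) const {
        assert(i < N);
        return lane[i];
    }
};

// Apply f to every lane; a real SIMD type would not need this scalar loop.
template <class T, std::size_t N, class F>
wide<T, N> map(const wide<T, N>& x, F f) {
    wide<T, N> r;
    for (std::size_t i = 0; i < N; ++i) r.lane[i] = f(x.lane[i]);
    return r;
}

template <class T, std::size_t N>
wide<T, N> operator-(T s, const wide<T, N>& x) {
    return map(x, [s](T v) { return s - v; });
}

template <class T, std::size_t N>
wide<T, N> abs(const wide<T, N>& x) {
    return map(x, [](T v) { return std::abs(v); });
}

template <class T, std::size_t N>
wide<T, N> sqrt(const wide<T, N>& x) {
    return map(x, [](T v) { return std::sqrt(v); });
}

}  // namespace toy
```

With `auto x = toy::wide<float, 8>::generate([](std::size_t i) { return 1.f + static_cast<float>(i); });`, the expression `toy::sqrt(toy::abs(1.f - x))` reads like the EVE line and produces the square roots of 0 through 7, one per lane.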

3. The hands-on progression

| File | What it teaches |
| --- | --- |
| `basic.cpp` | `wide`, vector math functions, `current_api` |
| `math.cpp` | EVE's function families over a real array |
| `hypot.cpp` | accuracy and performance: naive vs robust hypotenuse, vectorized |
| `bilateral.cpp` | a bilateral image filter: SIMD on a real algorithm |
| `gray_scott.cpp` | the course stencil, hand-vectorized |
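
The naive-versus-robust contrast in the `hypot.cpp` row can be illustrated in scalar C++. This is a sketch of one common robust formulation (scaling by the larger magnitude), not the contents of the exercise file or EVE's own algorithm; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Naive form: x * x can overflow to infinity even when the true
// hypotenuse is representable as a float.
inline float hypot_naive(float x, float y) {
    return std::sqrt(x * x + y * y);
}

// Scaled form: divide by the larger magnitude so the squared ratio stays <= 1.
inline float hypot_scaled(float x, float y) {
    assert(!std::isnan(x) && !std::isnan(y));
    const float hi = std::max(std::fabs(x), std::fabs(y));
    const float lo = std::min(std::fabs(x), std::fabs(y));
    if (hi == 0.f || std::isinf(hi)) return hi;  // avoids 0/0 and inf/inf
    const float r = lo / hi;
    return hi * std::sqrt(1.f + r * r);
}
```

For inputs around 1e20 the naive form squares to about 1e40, beyond the float range, while the scaled form stays finite. The extra division and branch are the performance cost the exercise weighs against accuracy.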

4. The EVE Gray-Scott — guaranteed vectorization

The final kernel loads the nine cells of the 3×3 neighborhood as `wide` values and chains explicit FMAs (fused multiply-adds):

auto u = eve::load(&su[i]);
auto full_u1 = w00 * (eve::load(&su[i - W - 1]) - u);   // 8 cells at a time
full_u1 = eve::fma(w12, (eve::load(&su[i + 1]) - u), full_u1);
// … the stencil's 9 terms, fused into FMA chains

What Day 3 obtained by negotiating with -fopt-info-vec, EVE obtains by construction — vectorization is no longer a hope, the type enforces it.
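
For reference, the same accumulation can be written for a single cell in scalar C++. This is a sketch under assumptions: a row-major grid of width `W`, a 3×3 weight table passed in by the caller, and a hypothetical function name `stencil_at`. The EVE kernel applies the same FMA chain to several consecutive cells per instruction.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted 3x3 stencil at interior cell i of a row-major grid of width W:
// the sum over the neighborhood of weight * (neighbor - center).
inline float stencil_at(const std::vector<float>& su, std::size_t W, std::size_t i,
                        const std::array<std::array<float, 3>, 3>& w) {
    assert(i >= W + 1 && i + W + 1 < su.size());  // all 9 reads stay in bounds
    const float u = su[i];
    float acc = 0.f;
    for (std::size_t r = 0; r < 3; ++r) {
        for (std::size_t c = 0; c < 3; ++c) {
            const std::size_t j = i + r * W + c - W - 1;  // row r-1, column c-1
            acc = std::fma(w[r][c], su[j] - u, acc);      // acc += w * (neighbor - u)
        }
    }
    return acc;
}
```

On a uniform field every difference is zero, so the result is zero whatever the weights, which makes a convenient sanity check for a vectorized version.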

5. Kiwaku — the matching containers

Same author, one level up: Kiwaku provides multidimensional containers and views (shape, strides, hardware adaptation) designed to work with EVE, the second half of the EVE + Kiwaku pairing in the session title. Kiwaku is still young, but the direction is clear: one algorithm source, many targets.
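
The shape-and-strides idea can be sketched in a few lines of plain C++. The `view2d` type and its helpers below are hypothetical and are not Kiwaku's API; they only show how a view maps a 2D index to a memory offset.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D view: a pointer plus a shape and strides (in elements).
template <class T>
struct view2d {
    T* data;
    std::array<std::size_t, 2> shape;    // {rows, cols}
    std::array<std::size_t, 2> strides;  // elements skipped per step in each dimension

    T& operator()(std::size_t i, std::size_t j) const {
        assert(i < shape[0] && j < shape[1]);
        return data[i * strides[0] + j * strides[1]];
    }
};

// Row-major view over contiguous storage: stepping j moves by one element.
template <class T>
view2d<T> row_major(T* data, std::size_t rows, std::size_t cols) {
    return {data, {rows, cols}, {cols, std::size_t{1}}};
}

// Transposed view of the same memory: swap shape and strides, copy nothing.
template <class T>
view2d<T> transposed(const view2d<T>& v) {
    return {v.data, {v.shape[1], v.shape[0]}, {v.strides[1], v.strides[0]}};
}
```

Because the layout lives in the view rather than in the algorithm, the same indexing code can run over a different memory arrangement by changing only the strides.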

Afternoon session — GPU architecture (2 pm)

6. Switching philosophy

At 2 pm Pierre Aubert flips the perspective: the CPU hides latency (big caches, a few powerful cores); the GPU drowns it in numbers, with dozens of Streaming Multiprocessors (SMs), each holding many simple cores, and threads advancing in warps of 32.

Two philosophies: the CPU hides latency with cache, the GPU drowns it under thousands of threads — provided global memory is fed with coalesced accesses

7. The GPU memory hierarchy

The session details the three levels that govern the rest of the week: registers (per thread), shared memory (per SM, the "hand-managed cache"), and global memory (HBM), whose accesses must be coalesced: the 32 threads of a warp should read neighboring addresses, or throughput collapses. It is the data-layout lesson of Days 2-4, transposed to 32 lanes.
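
A rough host-side model shows why the access pattern matters. The sketch below counts how many memory segments one warp touches for a given stride; the 128-byte segment size is an assumption of this simplified model, and the function name is hypothetical. Real hardware behavior depends on the GPU generation.

```cpp
#include <cstddef>

constexpr std::size_t warp_size = 32;       // threads per warp
constexpr std::size_t segment_bytes = 128;  // assumed memory transaction size

static_assert(sizeof(float) == 4, "this sketch assumes 32-bit floats");

// Number of distinct segments touched when lane k of a warp reads the
// float at index base + k * stride.
constexpr std::size_t segments_touched(std::size_t base, std::size_t stride) {
    std::size_t count = 0;
    std::size_t previous = 0;
    for (std::size_t lane = 0; lane < warp_size; ++lane) {
        const std::size_t segment = (base + lane * stride) * sizeof(float) / segment_bytes;
        if (lane == 0 || segment != previous) ++count;  // addresses never decrease
        previous = segment;
    }
    return count;
}

static_assert(segments_touched(0, 1) == 1, "unit stride: one segment for the warp");
static_assert(segments_touched(0, 32) == 32, "stride 32: one segment per thread");
```

In this model a unit-stride warp is served by a single segment, while a stride of 32 floats needs 32 of them for the same amount of useful data.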

8. nvc++ as a preview, then Days 7-9

The day closes (3 pm, Pierre Aubert again) with a preview: since 2020, nvc++ has compiled standard C++17 directly for the GPU, using std::transform and parallel execution policies with no external library. It is the stdpar spirit we will meet again in Fortran on Day 8 (do concurrent).
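
The algorithm-based style looks like this in standard C++. This is a minimal sketch, not the session's code: `saxpy` is an illustrative function, and `USE_PAR_POLICY` and `SKETCH_POLICY` are hypothetical macros that keep the block buildable on toolchains without a parallel algorithms backend. With nvc++, the policy form would be compiled with the `-stdpar` option.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Define USE_PAR_POLICY (hypothetical switch) when the toolchain provides
// parallel algorithms; otherwise the call runs as a plain sequential transform.
#ifdef USE_PAR_POLICY
#include <execution>
#define SKETCH_POLICY std::execution::par_unseq,
#else
#define SKETCH_POLICY
#endif

// out[i] = a * x[i] + y[i], expressed as a standard algorithm instead of a loop.
inline std::vector<float> saxpy(float a, const std::vector<float>& x,
                                const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    std::transform(SKETCH_POLICY x.begin(), x.end(), y.begin(), out.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
    return out;
}
```

The loop body becomes a lambda with no dependence between elements, which is what lets a compiler map each element to its own GPU thread.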

Everything is in place: Python on GPU tomorrow (Day 7), Fortran on GPU (Day 8), Kokkos on GPU (Day 9) — three languages, one single target architecture: this afternoon's.

The hands-on — `GrayScott2026/day-6/`

EVE is header-only: clone and compile.

cd GrayScott2026/day-6
g++ -std=c++20 -O3 -march=native -I eve/include basic.cpp -o basic
./basic          # → "EVE is optimizing for: X86 AVX2" (per your machine)

Then work through math.cpp, hypot.cpp, bilateral.cpp and gray_scott.cpp the same way. On the local hands-on machine: AVX2 detected, 8 floats per instruction.

On video — the official replays

Replay — EVE, a C++20 computing library on CPU (Gray Scott Thursdays)

Replay — GPU Architecture (Gray Scott Thursdays)

Replay — Modern C++ GPU computing with std::algorithm and CUDA (Gray Scott Thursdays)

Sources & official material

The EVE slides (Joël Falcou's online lecture): events.codereckons.com/eve
The day's PDF (school GitLab wiki): gray_scott.pdf
The libraries: github.com/jfalcou/eve · github.com/jfalcou/kiwaku
Video replays (YouTube): Gray Scott Thursdays
School website: GrayScott2026

