The Gray Scott School

Day 6 — SIMD with EVE + GPU architecture

June 29, two sessions: Joël Falcou opens the week with EVE and Kiwaku (explicit, portable C++20 SIMD), Pierre Aubert follows with the GPU architecture that carries the last three days.

June 29, 2026 — week 2 opens · Morning: Joël Falcou (LISN, CodeReckons) — C++ 20 Computing with EVE + Kiwaku · Afternoon (2 pm): Pierre Aubert (LAPP) — GPU Architecture, massively parallel computing · Marcel Vivargent Auditorium + satellites (including CINERI). The hands-on lives in GrayScott2026/day-6/ — Falcou's exercise files + the cloned EVE library.

Morning session — EVE + Kiwaku, SIMD owned

1. The problem: a fragile parallelism

A single core already processes several floats per instruction (8 with AVX2, 16 with AVX-512), and this is orthogonal to multithreading. But this parallelism is fragile: compiler auto-vectorization is not guaranteed (Day 3 showed loops the compiler refused to vectorize), and hand-written intrinsics do not survive a change of instruction set. Falcou's answer: make the register a type.
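The lane counts quoted above follow from simple arithmetic: register width divided by element width. A minimal sketch, assuming 8-bit bytes and a 4-byte float (the helper name `lanes` is hypothetical, introduced only for this illustration):

```cpp
#include <cstddef>

// Lanes per register = register width in bits / element width in bits.
// Assumes 8-bit bytes; `lanes` is an illustrative helper, not a library call.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

static_assert(lanes<float>(256) == 8,  "AVX2: 8 floats per 256-bit register");
static_assert(lanes<float>(512) == 16, "AVX-512: 16 floats per 512-bit register");
static_assert(lanes<double>(256) == 4, "doubles halve the lane count");
```

The same source line therefore processes a different number of elements on each instruction set, which is exactly what hand-written intrinsics hard-code and what a register type can abstract.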

2. eve::wide — the register becomes a C++ type

First hands-on exercise, basic.cpp — seven lines that hold the whole thesis:

eve::wide<float, eve::fixed<8>> x( [](auto i, auto) { return 1.f + i; } );

std::cout << "EVE is optimizing for: " << eve::current_api << "\n";
std::cout << eve::sqrt(eve::abs(1 - x)) << "\n";   // all 8 lanes at once

eve::current_api prints the ISA detected at compile time — the same source emits AVX2 here, AVX-512 or NEON elsewhere.

[Figure: an eve::wide<float, fixed<8>> named x holds the lanes 1.0 to 8.0; the single expression eve::sqrt(eve::abs(1 - x)) computes all 8 lanes together; eve::current_api decides the instruction set at compile time: SSE2, AVX2, AVX-512, NEON or SVE.]
Caption: The wide type is the register: write the computation once, and EVE emits the right instructions for the target machine.
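To make the lane semantics concrete without depending on EVE, here is a scalar emulation in standard C++. It is only a sketch: `Wide8` and `sqrt_abs_one_minus` are hypothetical names, the struct is not `eve::wide`, and the loop stands in for what a SIMD library would compile to vector instructions.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for an 8-lane float register. This is NOT eve::wide.
struct Wide8 {
    std::array<float, 8> lane{};

    // Generator constructor: lane i receives gen(i, 8), mirroring the
    // (index, lane count) lambda passed to the constructor in basic.cpp.
    template <typename Gen>
    explicit Wide8(Gen gen) {
        for (std::size_t i = 0; i < lane.size(); ++i)
            lane[i] = gen(i, lane.size());
    }

    float operator[](std::size_t i) const {
        assert(i < lane.size());
        return lane[i];
    }
};

// Lane-wise counterpart of eve::sqrt(eve::abs(1 - x)). A SIMD library would
// emit a few vector instructions here; this sketch loops over the lanes.
inline Wide8 sqrt_abs_one_minus(const Wide8& x) {
    return Wide8([&x](std::size_t i, std::size_t) {
        return std::sqrt(std::fabs(1.0f - x[i]));
    });
}
```

Filling a `Wide8` with the lanes 1.0 to 8.0 and calling `sqrt_abs_one_minus` gives 0 in lane 0, 1 in lane 1 and 2 in lane 4, which is the value-level behavior the exercise prints in one expression.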

3. The hands-on progression

File             What it teaches
basic.cpp        wide, vector math functions, current_api
math.cpp         EVE's function families over a real array
hypot.cpp        accuracy and performance: naive vs robust hypotenuse, vectorized
bilateral.cpp    a bilateral image filter: SIMD on a real algorithm
gray_scott.cpp   the course stencil, hand-vectorized
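The accuracy question behind the hypot.cpp row can be illustrated in scalar C++. The sketch below contrasts the naive formula with one textbook scaling technique; it is an assumption that this resembles the exercise, and it is not presented as what hypot.cpp or EVE actually implements. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Naive form: x * x overflows to infinity once |x| exceeds roughly 1.8e19f,
// even when the true hypotenuse is representable.
inline float hypot_naive(float x, float y) {
    return std::sqrt(x * x + y * y);
}

// Scaled form: factor out the larger magnitude so the squared ratio is <= 1.
inline float hypot_scaled(float x, float y) {
    assert(!std::isnan(x) && !std::isnan(y));
    const float a = std::fabs(x);
    const float b = std::fabs(y);
    const float hi = std::max(a, b);
    const float lo = std::min(a, b);
    if (hi == 0.0f) return 0.0f;   // avoid 0 / 0
    const float r = lo / hi;       // in [0, 1]
    return hi * std::sqrt(1.0f + r * r);
}
```

With both inputs at 1e30f the naive version returns infinity while the scaled version stays finite. The price is a division and a few extra operations per element, which is the accuracy versus performance trade-off the exercise explores in vectorized form.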

4. The EVE Gray-Scott — guaranteed vectorization

The final kernel loads the nine neighbors as wide and chains explicit FMAs:

auto u = eve::load(&su[i]);
auto full_u1 = w00 * (eve::load(&su[i - W - 1]) - u);   // 8 cells at a time
full_u1 = eve::fma(w12, (eve::load(&su[i + 1]) - u), full_u1);
// … the stencil's 9 terms, fused into FMA chains

What Day 3 obtained by negotiating with -fopt-info-vec, EVE obtains by construction — vectorization is no longer a hope, the type enforces it.
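For readers who want the arithmetic of the FMA chain spelled out, here is a scalar sketch of the same pattern for a single cell. The 3x3 weight values are an assumption chosen for illustration (they are not taken from gray_scott.cpp), and `stencil_at` is a hypothetical helper; the EVE kernel does the equivalent work on 8 consecutive cells per step by replacing each scalar read with a load.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative 3x3 weights, row by row (assumed values, not from gray_scott.cpp).
constexpr std::array<float, 9> kWeights = {0.25f, 0.5f, 0.25f,
                                           0.5f,  0.0f, 0.5f,
                                           0.25f, 0.5f, 0.25f};

// Weighted sum of (neighbor - center) over the 3x3 neighborhood of cell i in
// a row-major grid of width W. The assert only guards array bounds; the
// caller is responsible for passing a cell that is not on a grid edge.
inline float stencil_at(const std::vector<float>& su, std::size_t W, std::size_t i) {
    assert(W >= 1 && i >= W + 1 && i + W + 1 < su.size());
    const float u = su[i];
    float acc = 0.0f;
    std::size_t k = 0;
    for (std::ptrdiff_t dy = -1; dy <= 1; ++dy) {
        for (std::ptrdiff_t dx = -1; dx <= 1; ++dx, ++k) {
            const auto j = static_cast<std::size_t>(
                static_cast<std::ptrdiff_t>(i) + dy * static_cast<std::ptrdiff_t>(W) + dx);
            acc = std::fma(kWeights[k], su[j] - u, acc);  // one fused multiply-add per term
        }
    }
    return acc;
}
```

On a uniform field every difference is zero, so the result is 0; on a 3x3 grid of zeros with a 1 in the center the result is minus the sum of the weights, here -3.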

5. Kiwaku — the matching containers

Same author, one level up: Kiwaku provides the multidimensional containers and views (shape, strides, hardware adaptation) designed to plug into EVE. It is the "containers" half of the duo announced in the session title. The library is still young, but the direction is clear: one algorithm source, many targets.
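The shape and strides vocabulary can be illustrated with a small non-owning view in standard C++. This is a sketch of the general idea only: `View2D`, `transposed` and `sum_all` are hypothetical names and do not reflect Kiwaku's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning 2D view. Names are illustrative, not Kiwaku's API.
struct View2D {
    const float* data;
    std::size_t rows, cols;              // shape
    std::size_t row_stride, col_stride;  // element distance between neighbors

    float operator()(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[r * row_stride + c * col_stride];
    }
};

// The same storage seen transposed: swap shape and strides, copy nothing.
inline View2D transposed(const View2D& v) {
    return {v.data, v.cols, v.rows, v.col_stride, v.row_stride};
}

// One algorithm written once against the view interface.
inline float sum_all(const View2D& v) {
    float total = 0.0f;
    for (std::size_t r = 0; r < v.rows; ++r)
        for (std::size_t c = 0; c < v.cols; ++c)
            total += v(r, c);
    return total;
}
```

Because the algorithm only sees shape and strides, the same `sum_all` works on the original layout and on its transposed view, which is the kind of decoupling between algorithm and layout that the paragraph above describes.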

Afternoon session — GPU architecture (2 pm)

6. Switching philosophy

At 2 pm Pierre Aubert flips the perspective: the CPU hides latency (big caches, few powerful cores); the GPU drowns it in numbers, with dozens of Streaming Multiprocessors (SMs), each holding on the order of a hundred simple cores, and threads advancing in warps of 32.

[Figure: two panels. Left, "CPU, latency-optimized": a few powerful cores (core 1 to core 4) sharing a big L3 cache. Right, "GPU, throughput-optimized": many SMs (SM 1 to SM 8 drawn), each with its own shared memory and many simple cores, above a global memory (HBM) that requires coalesced accesses; threads advance in warps of 32 and latency is hidden by sheer numbers.]
Caption: Two philosophies: the CPU hides latency with cache, the GPU drowns it under thousands of threads, provided global memory is fed with coalesced accesses.

7. The GPU memory hierarchy

The session details the three levels that govern the rest of the week: registers (per thread), shared memory (per SM, the "hand-managed cache"), and global memory (HBM), whose accesses must be coalesced: the 32 threads of a warp must read neighboring addresses so that the hardware can serve them in few memory transactions, or throughput collapses. It is the layout lesson of Days 2-4, transposed to 32 lanes.
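The cost of a non-coalesced access pattern can be estimated on the CPU with a toy model: count how many memory segments one warp touches. This is a sketch under stated assumptions: the 128-byte segment size is an assumed, typical figure that varies by hardware, the array is assumed aligned to a segment boundary, and `segments_touched` is a hypothetical helper rather than any GPU API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

constexpr std::size_t kWarpSize = 32;       // threads per warp, as in the session
constexpr std::size_t kSegmentBytes = 128;  // assumed transaction size; hardware-dependent

// Count the distinct memory segments touched when thread t of one warp reads
// the float at index t * stride_elems of an array aligned to a segment.
inline std::size_t segments_touched(std::size_t stride_elems) {
    assert(stride_elems > 0);
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t byte_offset = t * stride_elems * sizeof(float);
        segments.insert(byte_offset / kSegmentBytes);
    }
    return segments.size();
}
```

In this model, with 4-byte floats, a stride of 1 touches a single segment, a stride of 2 touches two, and a stride of 32 touches 32: the same amount of useful data costs 32 times more memory traffic.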

8. nvc++ as a preview, then Days 7-9

The day closes (3 pm, Pierre Aubert again) with a preview of what comes next: since 2020, nvc++ has compiled standard C++17 directly for the GPU, using std::transform and parallel execution policies with no external library. It is the stdpar spirit we will meet again in Fortran on Day 8 (do concurrent).
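A minimal sketch of what such code looks like, using only standard C++17. The function name `saxpy` is hypothetical and the example is not taken from the session material. With nvc++, it is the -stdpar option that enables offloading of these algorithms; with other toolchains the same source runs on the CPU (GCC's implementation of the parallel policies may additionally require linking against TBB).

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <vector>

// y[i] = a * x[i] + y[i], expressed as a standard algorithm with a parallel
// execution policy. Nothing here is GPU-specific: where it runs is decided by
// the compiler and its flags, not by the source.
inline void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(),   // first input range
                   y.begin(),            // second input range
                   y.begin(),            // output, written in place
                   [a](float xi, float yi) { return a * xi + yi; });
}
```

The lambda must not rely on ordering between elements, since par_unseq allows the implementation to run iterations in any order and interleave them.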

Everything is in place: Python on GPU tomorrow (Day 7), Fortran on GPU (Day 8), Kokkos on GPU (Day 9) — three languages, one single target architecture: this afternoon's.

The hands-on — GrayScott2026/day-6/

EVE is header-only: clone and compile.

cd GrayScott2026/day-6
g++ -std=c++20 -O3 -march=native -I eve/include basic.cpp -o basic
./basic          # → "EVE is optimizing for: X86 AVX2" (per your machine)

Then work through math.cpp, hypot.cpp, bilateral.cpp and gray_scott.cpp the same way. On the local hands-on machine: AVX2 detected, 8 floats per instruction.

On video — the official replays

Replay — EVE, a C++20 computing library on CPU (Gray Scott Thursdays)
Replay — GPU Architecture (Gray Scott Thursdays)
Replay — Modern C++ GPU computing with std::algorithm and CUDA (Gray Scott Thursdays)

