Day 6 — SIMD with EVE + GPU architecture
June 29, 2026 (week 2 opens) · Morning: Joël Falcou (LISN, CodeReckons), C++20 Computing with EVE + Kiwaku · Afternoon (2 pm): Pierre Aubert (LAPP), GPU Architecture, massively parallel computing · Marcel Vivargent Auditorium + satellites (including CINERI).
The hands-on lives in GrayScott2026/day-6/: Falcou's exercise files plus the cloned EVE library.
Morning session — EVE + Kiwaku, SIMD owned
1. The problem: a fragile parallelism
A single core already processes several floats per instruction (8 with AVX2, 16 with AVX-512) — orthogonal to multithreading. But this parallelism is fragile: compiler auto-vectorization is not guaranteed (Day 3 showed refused loops), and hand-written intrinsics do not survive a change of instruction set. Falcou's answer: make it a type.
2. eve::wide — the register becomes a C++ type
First hands-on exercise, basic.cpp — seven lines that hold the whole thesis:
eve::wide<float, eve::fixed<8>> x( [](auto i, auto) { return 1.f + i; } );
std::cout << "EVE is optimizing for: " << eve::current_api << "\n";
std::cout << eve::sqrt(eve::abs(1 - x)) << "\n"; // all 8 lanes at once
eve::current_api reports the instruction set selected at compile time (driven by flags such as -march=native), so the same source emits AVX2 here and AVX-512 or NEON elsewhere.
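To make the "register as a type" idea concrete without depending on EVE, here is a minimal toy model in plain C++. The names `toy_wide`, `toy_abs`, `toy_sqrt` and `basic_demo` are hypothetical and invented for this sketch. The real eve::wide maps onto hardware registers, and nothing below reflects how it is implemented. The sketch only mirrors the shape of basic.cpp: build the lanes from a generator, then apply math to all lanes in a single expression.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Toy stand-in for a SIMD register: N lanes of T with lane-wise operations.
// Hypothetical illustration only; this is NOT how eve::wide is implemented.
template <typename T, std::size_t N>
struct toy_wide {
    std::array<T, N> lane{};

    // Build each lane from its index, like the generator lambda in basic.cpp.
    template <typename Gen>
    static toy_wide generate(Gen g) {
        toy_wide w;
        for (std::size_t i = 0; i < N; ++i) w.lane[i] = g(i);
        return w;
    }

    // Apply a scalar function to every lane.
    template <typename F>
    toy_wide map(F f) const {
        toy_wide r;
        for (std::size_t i = 0; i < N; ++i) r.lane[i] = f(lane[i]);
        return r;
    }

    // Checked lane access.
    T operator[](std::size_t i) const {
        assert(i < N);
        return lane[i];
    }

    // scalar - wide, lane by lane (hidden friend, so `1 - x` converts 1 to T).
    friend toy_wide operator-(T s, const toy_wide& w) {
        return w.map([s](T v) { return s - v; });
    }
};

template <typename T, std::size_t N>
toy_wide<T, N> toy_abs(const toy_wide<T, N>& w) {
    return w.map([](T v) { return std::abs(v); });
}

template <typename T, std::size_t N>
toy_wide<T, N> toy_sqrt(const toy_wide<T, N>& w) {
    return w.map([](T v) { return std::sqrt(v); });
}

// Same expression as basic.cpp: x = {1, 2, ..., 8}, result = sqrt(|1 - x|).
inline toy_wide<float, 8> basic_demo() {
    auto x = toy_wide<float, 8>::generate(
        [](std::size_t i) { return 1.f + static_cast<float>(i); });
    return toy_sqrt(toy_abs(1 - x));
}
```

The loops here are ordinary scalar loops that a compiler may or may not vectorize. The point of the real library is that the type itself selects the vector instructions, so nothing is left to the optimizer's discretion.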
3. The hands-on progression
| File | What it teaches |
|---|---|
| basic.cpp | wide, vector math functions, current_api |
| math.cpp | EVE's function families over a real array |
| hypot.cpp | accuracy and performance: naive vs robust hypotenuse, vectorized |
| bilateral.cpp | a bilateral image filter: SIMD on a real algorithm |
| gray_scott.cpp | the course stencil, hand-vectorized |
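The "naive vs robust hypotenuse" row is easier to appreciate with a scalar sketch of the accuracy problem. The function names `naive_hypot` and `scaled_hypot` are hypothetical, and the scaling trick is a textbook technique chosen for illustration. It is not necessarily what hypot.cpp or EVE does internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Naive form: the squares can overflow long before the true result does.
inline float naive_hypot(float x, float y) {
    return std::sqrt(x * x + y * y);
}

// Scaled form: factor out the larger magnitude so the squared ratio is <= 1.
// This sketch assumes finite inputs; NaN and infinity handling is left out.
inline float scaled_hypot(float x, float y) {
    assert(std::isfinite(x) && std::isfinite(y));
    const float a = std::fabs(x);
    const float b = std::fabs(y);
    const float hi = std::max(a, b);
    const float lo = std::min(a, b);
    if (hi == 0.0f) return 0.0f;   // both inputs are zero, avoid 0/0
    const float r = lo / hi;       // 0 <= r <= 1, so r * r cannot overflow
    return hi * std::sqrt(1.0f + r * r);
}
```

With single-precision inputs such as 3e20f and 4e20f, the squares exceed the largest finite float, so the naive form cannot return the true value of about 5e20, while the scaled form stays in range. The robust version costs a division and a few extra operations per element, which is the accuracy versus performance trade-off the exercise measures once vectorized.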
4. The EVE Gray-Scott — guaranteed vectorization
The final kernel loads the nine neighbors as wide and chains explicit FMAs:
auto u = eve::load(&su[i]);
auto full_u1 = w00 * (eve::load(&su[i - W - 1]) - u); // 8 cells at a time
full_u1 = eve::fma(w12, (eve::load(&su[i + 1]) - u), full_u1);
// … the stencil's 9 terms, fused into FMA chains
What Day 3 obtained by negotiating with -fopt-info-vec, EVE obtains by construction: vectorization is no longer a hope, the type enforces it.
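As a reading aid, here is a scalar reference for what a single lane of such a kernel computes, assuming a row-major grid of width W and a caller-supplied 3×3 weight table. The function name `stencil_at` and the `w[dy][dx]` layout are hypothetical. The excerpt shows only the w00 and w12 terms and does not give the weight values, so none are assumed here.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One cell of a 9-point weighted stencil:
//   sum over the 3x3 neighborhood of w[dy][dx] * (neighbor - center),
// accumulated as a chain of fused multiply-adds.
// Assumptions: row-major grid of width W, and (x, y) is an interior cell
// (borders are not handled in this sketch).
inline float stencil_at(const std::vector<float>& su, std::size_t W,
                        std::size_t x, std::size_t y,
                        const std::array<std::array<float, 3>, 3>& w) {
    assert(W >= 3 && x >= 1 && x + 1 < W);
    assert(y >= 1 && (y + 2) * W <= su.size());
    const std::size_t i = y * W + x;   // index of the center cell
    const float u = su[i];
    float acc = 0.0f;
    for (std::size_t dy = 0; dy < 3; ++dy) {
        for (std::size_t dx = 0; dx < 3; ++dx) {
            // dy = 0, dx = 0 gives i - W - 1; dy = 1, dx = 2 gives i + 1.
            const std::size_t j = i + dy * W + dx - W - 1;
            acc = std::fma(w[dy][dx], su[j] - u, acc);
        }
    }
    return acc;
}
```

The vectorized kernel performs the same arithmetic, except that each load fetches several consecutive cells and each FMA updates all of them at once.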
5. Kiwaku — the matching containers
Same author, one level up: Kiwaku provides multidimensional containers and views (shape, strides, hardware adaptation) designed to work hand in hand with EVE, the container half of the duo announced in the session title. Still young, but the direction is clear: one algorithm, many targets.
Afternoon session — GPU architecture (2 pm)
6. Switching philosophy
At 2 pm Pierre Aubert flips the perspective: the CPU hides latency (big caches, few powerful cores); the GPU drowns it in sheer numbers, with dozens of Streaming Multiprocessors (over a hundred on the largest chips), each holding on the order of a hundred simple cores, and threads advancing in warps of 32.
7. The GPU memory hierarchy
The session details the three levels that command the rest of the week: registers (per thread), shared memory (per SM — the "hand-managed cache"), and global memory (HBM) whose accesses must be coalesced: the 32 threads of a warp read neighboring addresses, or throughput collapses. It is the layout lesson of Days 2-4, transposed to 32 lanes.
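The coalescing rule can be illustrated with a small host-side model in plain C++ (no GPU required). It counts how many aligned memory segments one warp touches for a given access stride. The function name `segments_touched` is hypothetical, and the 128-byte segment size is a simplifying assumption of this model, not a figure from the session.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Count how many aligned memory segments one warp touches when thread t
// reads the float at element index base + t * stride.
// Assumptions of this toy model: 32 threads per warp (from the session),
// 4-byte floats, and 128-byte segments (a simplification of real hardware).
inline std::size_t segments_touched(std::size_t base, std::size_t stride,
                                    std::size_t segment_bytes = 128) {
    assert(segment_bytes > 0);
    constexpr std::size_t warp_size = 32;
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < warp_size; ++t) {
        const std::size_t byte_addr = (base + t * stride) * sizeof(float);
        segments.insert(byte_addr / segment_bytes);
    }
    return segments.size();
}
```

In this model, `segments_touched(0, 1)` (neighboring addresses) needs a single segment, while `segments_touched(0, 32)` (each thread one row apart in a 32-wide row-major array) needs 32, one per thread. That factor is the throughput collapse the session warns about.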
8. nvc++ as a scout, then Days 7-9
The day closes (3 pm, Pierre Aubert again) with a scout: since 2020, nvc++ has compiled standard C++17 directly for the GPU, using std::transform with parallel execution policies and no external library. It is the stdpar spirit we will meet again in Fortran on Day 8 (do concurrent).
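A minimal sketch of that style, assuming a SAXPY-like update as the workload (the function name `saxpy` and the choice of operation are illustrative, not taken from the session). The code is plain ISO C++17. Whether it runs on CPU threads or on a GPU depends on the compiler and its flags, for example nvc++ with its -stdpar option. With GCC, the parallel policies typically rely on an installed TBB.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <execution>
#include <vector>

// y[i] = a * x[i] + y[i], written with a standard parallel algorithm.
// The par_unseq policy permits parallel and vectorized execution; where the
// work actually runs is decided by the compiler and its flags.
inline void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(),   // first input range
                   y.begin(),            // second input range
                   y.begin(),            // output, written in place
                   [a](float xi, float yi) { return a * xi + yi; });
}
```

The lambda captures `a` by value and touches no other shared state, which keeps each element independent. That independence is what allows an implementation to spread the iterations across many threads.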
Everything is in place: Python on GPU tomorrow (Day 7), Fortran on GPU (Day 8), Kokkos on GPU (Day 9) — three languages, one single target architecture: this afternoon's.
The hands-on — GrayScott2026/day-6/
EVE is header-only: clone and compile.
cd GrayScott2026/day-6
g++ -std=c++20 -O3 -march=native -I eve/include basic.cpp -o basic
./basic # → "EVE is optimizing for: X86 AVX2" (per your machine)
Then work through math.cpp, hypot.cpp, bilateral.cpp and gray_scott.cpp the same way.
On the local hands-on machine: AVX2 detected, 8 floats per instruction.
Sources & official material
- The EVE slides (Joël Falcou's online lecture): events.codereckons.com/eve
- The day's PDF (school GitLab wiki): gray_scott.pdf
- The libraries: github.com/jfalcou/eve · github.com/jfalcou/kiwaku
- Video replays (YouTube): Gray Scott Thursdays
- School website: GrayScott2026