Day 7 — Python on GPU
June 30, 2026 · Speakers: Alice Faure, Jean-Marc Colley, Sébastien Valat & Nabil Garroum · four Python-GPU sessions in a row (CuPy → cuPyNumeric → JAX at 2 pm → wrap-up) · Marcel Vivargent Auditorium + satellites (including CINERI). The hands-on lives in
GrayScott2026/day-5/GPU/: three tutorials with solutions, plus the official A100 benchmarks.
1. What really changes: the PCIe toll booth
Day 5's array-first code carries over almost as-is; what changes is the memory geography. The GPU reads its own memory at ~2 TB/s, but the host feeds it through a ~32 GB/s PCIe link, so every host-device copy pays a toll.
The whole day applies the same rule: cp.asarray / jax.device_put once at the start,
the whole time loop on the device, asnumpy once at the end.
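To put rough numbers on the toll, here is a back-of-the-envelope sketch using only figures quoted on this page (the ~2 TB/s and ~32 GB/s bandwidths, the 1920×1080 float32 benchmark grid, 32×1000 read as 32,000 steps). It assumes the state is two fields (u and v) and ignores latency and compute time, so treat the results as orders of magnitude rather than measurements.

```python
# Back-of-the-envelope cost of moving the simulation state around.
# Assumptions: two float32 fields (u and v), decimal GB/TB, no latency.
GRID = (1920, 1080)          # benchmark grid quoted on this page
BYTES_PER_VALUE = 4          # float32
N_FIELDS = 2                 # assumption: u and v
N_STEPS = 32 * 1000          # benchmark iteration count
PCIE_BW = 32e9               # ~32 GB/s host <-> device
DEVICE_BW = 2e12             # ~2 TB/s device memory


def seconds_to_move(n_bytes, bandwidth):
    """Time to stream n_bytes at a given bandwidth (bytes per second)."""
    return n_bytes / bandwidth


state_bytes = GRID[0] * GRID[1] * BYTES_PER_VALUE * N_FIELDS
round_trip_bytes = 2 * state_bytes  # host -> device, then device -> host

# Anti-pattern: copy the state over PCIe and back at every step.
pcie_every_step = N_STEPS * seconds_to_move(round_trip_bytes, PCIE_BW)

# Rule of the day: one round trip for the whole run.
pcie_once = seconds_to_move(round_trip_bytes, PCIE_BW)

# For comparison: streaming the same bytes through device memory each step.
device_every_step = N_STEPS * seconds_to_move(round_trip_bytes, DEVICE_BW)

print(f"state size              : {state_bytes / 1e6:.1f} MB")
print(f"PCIe round trip per step: {pcie_every_step:.1f} s over the run")
print(f"PCIe round trip once    : {pcie_once * 1e3:.2f} ms")
print(f"device memory per step  : {device_every_step:.2f} s over the run")
```

Under these assumptions, a round trip at every step would cost tens of seconds in transfers alone, which is more than the whole CuPy run in the benchmark table further down.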
2. Three routes to the same GPU
CuPy session — NumPy on CUDA, no rewrite
Tutorial 3_Python_GPU_Cupy.md: CuPy mirrors the NumPy API on CUDA — swapping numpy for
cupy runs the same stencil on the GPU.
import cupy as cp
u = cp.asarray(u_host) # host → device, ONCE
# … same stencil expressions as NumPy …
u_host = cp.asnumpy(u) # device → host, only when needed
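As an illustration of the "same stencil expressions" claim, the sketch below writes one Gray-Scott loop against a module alias `xp` that is CuPy when a CUDA device is usable and NumPy otherwise. It is not the tutorial's solution: the function names (`laplacian`, `simulate`, `to_host`), the parameter values, and the choice to leave the boundary cells fixed are all assumptions made for the example.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    xp = cp
except Exception:
    cp = None
    xp = np  # CPU fallback: the code below is unchanged

# Illustrative parameters, not the course's values.
DU, DV, F, K, DT = 0.1, 0.05, 0.054, 0.014, 1.0


def to_host(a):
    """Return a NumPy array, copying from the device if needed."""
    return cp.asnumpy(a) if xp is cp else np.asarray(a)


def laplacian(a):
    """5-point Laplacian on the interior cells, written with slices."""
    return (a[:-2, 1:-1] + a[2:, 1:-1] + a[1:-1, :-2] + a[1:-1, 2:]
            - 4.0 * a[1:-1, 1:-1])


def simulate(u_host, v_host, n_steps):
    # Host -> device once (.copy() keeps the NumPy fallback from
    # modifying the caller's arrays in place).
    u = xp.asarray(u_host).copy()
    v = xp.asarray(v_host).copy()
    for _ in range(n_steps):  # the whole time loop stays on the device
        ui, vi = u[1:-1, 1:-1], v[1:-1, 1:-1]
        uvv = ui * vi * vi
        du = DU * laplacian(u) - uvv + F * (1.0 - ui)
        dv = DV * laplacian(v) + uvv - (F + K) * vi
        u[1:-1, 1:-1] += DT * du
        v[1:-1, 1:-1] += DT * dv
    # Device -> host once, at the end.
    return to_host(u), to_host(v)


u0 = np.ones((64, 64), dtype=np.float32)
v0 = np.zeros((64, 64), dtype=np.float32)
u0[28:36, 28:36] = 0.5
v0[28:36, 28:36] = 0.25
u_end, v_end = simulate(u0, v0, n_steps=20)
print(type(u_end).__name__, u_end.shape, float(v_end.max()))
```

The only CuPy-specific lines are the import and the two transfers; everything inside the loop is ordinary array syntax.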
Hands-on bonus: CuPy is the only version with parallel HDF5 I/O implemented — writing the 1000 images while the GPU computes saves ~4 s (12 s total on A100).
cuPyNumeric session — distributed NumPy
Tutorial 2_Python_GPU_cuPyNumeric.md: cuPyNumeric (NVIDIA, Legate engine) runs NumPy
code on several GPUs and several nodes without MPI and without rewriting — the same
script, a bigger machine. The price of generality shows in the benchmark: the generic
convolve version is the slowest of the field (128 s).
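One way to picture the gap between a generic and a specialized stencil, in plain NumPy: the generic version loops over every entry of an arbitrary kernel, while the specialized version hard-codes the five non-zero terms of a 3×3 Laplacian. This is only a sketch of the idea; the benchmark's own code is not reproduced here, and the function names are made up for the example. cuPyNumeric is documented as a drop-in replacement (`import cupynumeric as np`), so the same file is the kind of script it is meant to run.

```python
import numpy as np

# 3x3 Laplacian kernel (symmetric, so correlation and convolution agree).
KERNEL = np.array([[0, 1, 0],
                   [1, -4, 1],
                   [0, 1, 0]], dtype=np.float32)


def stencil_generic(a, kernel):
    """Apply any small kernel: one shifted multiply-add per kernel entry."""
    kh, kw = kernel.shape
    h, w = a.shape[0] - kh + 1, a.shape[1] - kw + 1
    out = np.zeros((h, w), dtype=a.dtype)
    for i in range(kh):
        for j in range(kw):
            # Zero weights are still multiplied and added here.
            out += kernel[i, j] * a[i:i + h, j:j + w]
    return out


def stencil_3x3(a):
    """Same Laplacian with only its five non-zero terms written out."""
    return (a[:-2, 1:-1] + a[2:, 1:-1] + a[1:-1, :-2] + a[1:-1, 2:]
            - 4.0 * a[1:-1, 1:-1])


rng = np.random.default_rng(0)
field = rng.random((128, 128), dtype=np.float32)
generic = stencil_generic(field, KERNEL)
special = stencil_3x3(field)
print(generic.shape, float(np.abs(generic - special).max()))
```

Both functions return the same interior values; the specialized one simply does less work per cell, which is the kind of difference the "generic" and "3×3" benchmark columns point at.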
JAX session (2 pm) — Day 5's code, re-jitted
Tutorial 1_Python_GPU_JAX.md: Day 5's JAX Gray-Scott replays unchanged — XLA compiles
the traced stencil into a CUDA kernel, and JAX places arrays on the device by default. The
hands-on solutions show the three tools that make the difference:
| Hands-on solution | What it teaches |
|---|---|
| jax_vmap_solutions.py | vectorize a function over a whole axis (batch) |
| jax_fori_loop_solutions.py | fuse the time loop inside the compiled graph |
| jax_scan_solutions.py | accumulate states without returning to Python between steps |
This is exactly the toolbox of SenLand's JAX port (lax.fori_loop to
fuse steps, device-resident batches).
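The three tools can be sketched on a toy Gray-Scott step as follows. This is not the content of the solution files: the parameter values, the periodic boundaries via `jnp.roll`, the step count, and every function name (`step`, `run_fori`, `run_scan`, `run_batch`, `make_fields`) are assumptions chosen to keep the example short.

```python
import jax
import jax.numpy as jnp
from jax import lax

# Illustrative parameters, not the course's values.
DU, DV, F, K, DT = 0.1, 0.05, 0.054, 0.014, 1.0
N_STEPS = 50


def laplacian(a):
    """5-point Laplacian with periodic boundaries (an assumption)."""
    return (jnp.roll(a, 1, 0) + jnp.roll(a, -1, 0)
            + jnp.roll(a, 1, 1) + jnp.roll(a, -1, 1) - 4.0 * a)


def step(state):
    """One explicit Gray-Scott update on the (u, v) pair."""
    u, v = state
    uvv = u * v * v
    u_new = u + DT * (DU * laplacian(u) - uvv + F * (1.0 - u))
    v_new = v + DT * (DV * laplacian(v) + uvv - (F + K) * v)
    return u_new, v_new


@jax.jit
def run_fori(u, v):
    # fori_loop: the whole time loop lives inside the compiled graph.
    return lax.fori_loop(0, N_STEPS, lambda i, s: step(s), (u, v))


@jax.jit
def run_scan(u, v):
    # scan: same loop, but also stacks one value per step (mean of v).
    def body(carry, _):
        new = step(carry)
        return new, new[1].mean()
    return lax.scan(body, (u, v), None, length=N_STEPS)


# vmap: run independent simulations along a leading batch axis.
run_batch = jax.jit(jax.vmap(run_fori))


def make_fields(n):
    """Uniform state with a small square seed in the middle."""
    u = jnp.ones((n, n), dtype=jnp.float32)
    v = jnp.zeros((n, n), dtype=jnp.float32)
    lo, hi = n // 2 - 2, n // 2 + 2
    return u.at[lo:hi, lo:hi].set(0.5), v.at[lo:hi, lo:hi].set(0.25)


u0, v0 = make_fields(32)
u_f, v_f = run_fori(u0, v0)
(u_s, v_s), v_mean_history = run_scan(u0, v0)
u_b, v_b = run_batch(jnp.stack([u0, u0]), jnp.stack([v0, 0.5 * v0]))
print(u_f.shape, v_mean_history.shape, u_b.shape)
```

`run_fori` and `run_scan` compute the same final state; `scan` is the one to reach for when a per-step quantity must be kept, and `vmap` turns either into a batched run without touching `step`.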
The verdict — official A100 numbers
The repo's GPU/Benchmarks.md: 32×1000 iterations, 1920×1080 grid in float32:
| cuPyNumeric (convolve) | JAX (generic) | JAX (3×3) | CuPy | PyTorch |
|---|---|---|---|---|
| 128 s | 47 s | 18 s | 18 s | 22 s |
Official numbers from the repo (GPU/Benchmarks.md). CuPy + parallel HDF5 I/O: 12 s.
The loop closes: Day 5's best CPU time of 377 s drops to 18 s on the A100, a ×21 speedup, still in Python. And the ranking echoes the week's lessons: stencil specialization (Day 5) and data residency (today) weigh more than the library choice.
The hands-on — GrayScott2026/day-5/GPU/
# Locally (NVIDIA, CUDA ≥ 12, Python 3.10-3.12)
git clone https://gitlab.in2p3.fr/alice.faure/gray-scott-python.git
python -m venv gpu-env && source gpu-env/bin/activate
pip install h5py opencv-python numpy matplotlib scipy \
"jax[cuda12]" cupy-cuda12x nvidia-cupynumeric
Official alternatives: the course Docker image, or Apptainer on the MUST cluster
(see Install_satellite_sites.md for satellite sites like CINERI). On AMD, CuPy and JAX have
experimental ROCm routes; cuPyNumeric does not. On a small local card (GTX 1650, 4 GB),
shrink the grid: the data-residency lesson stays identical.
Sources & official material
- The course repository (GPU/tutorial/ tutorials, solutions, A100 benchmarks): gitlab.in2p3.fr/alice.faure/gray-scott-python
- The libraries: docs.cupy.dev · docs.nvidia.com/cupynumeric · docs.jax.dev
- The MUST platform: jupyter.must-dc.cloud
- Video replays (YouTube): Gray Scott Thursdays
- School website: GrayScott2026
Day 6 — SIMD with EVE + GPU architecture
June 29, two sessions: Joël Falcou opens the week with EVE and Kiwaku (explicit, portable C++20 SIMD), Pierre Aubert follows with the GPU architecture that carries the last three days.
Day 8 — Fortran on GPU
Standard Fortran on the GPU via do concurrent, compared with OpenACC and OpenMP target from a single source — plus the closing polyglot session.