The Gray Scott School

Day 7 — Python on GPU

June 30, four sessions with Alice Faure, Jean-Marc Colley, Sébastien Valat and Nabil Garroum: CuPy, cuPyNumeric and JAX port Day 5's Gray-Scott to the accelerator — official A100 numbers included.

June 30, 2026 · Speakers: Alice Faure, Jean-Marc Colley, Sébastien Valat & Nabil Garroum · four Python-GPU sessions in a row (CuPy → cuPyNumeric → JAX at 2 pm → wrap-up) · Marcel Vivargent Auditorium + satellites (including CINERI). The hands-on lives in GrayScott2026/day-5/GPU/ — three tutorials + solutions, and the official A100 benchmarks.

1. What really changes: the PCIe toll booth

Day 5's array-first code carries over almost as-is; what changes is the memory geography. The GPU reads its own memory at ~2 TB/s but is fed from the host over a ~32 GB/s PCIe link:

The GPU computes fast but is fed through a narrow pipe: data must stay resident on the device for the whole time loop

The whole day applies the same rule: cp.asarray / jax.device_put once at the start, the whole time loop on the device, asnumpy once at the end.

2. Three routes to the same GPU

Three saddles for the same horse: whichever Python style you pick, everything lands on the SMs and global memory seen the day before

CuPy session — NumPy on CUDA, no rewrite

Tutorial 3_Python_GPU_Cupy.md: CuPy mirrors the NumPy API on CUDA — swapping numpy for cupy runs the same stencil on the GPU.

```python
import cupy as cp
u = cp.asarray(u_host)      # host → device, ONCE
# … same stencil expressions as NumPy …
u_host = cp.asnumpy(u)      # device → host, only when needed
```
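To make the "same stencil, different module" idea concrete, here is a minimal sketch of a Gray-Scott step written against a module argument `xp`. It is not the tutorial's code: the function names, the periodic boundary, and the parameter values (`du`, `dv`, `f`, `k`, `dt`) are illustrative assumptions. It runs with NumPy as written; passing `xp=cupy` should run the same expressions on the GPU, since `roll`, `asarray` and `asnumpy` exist in CuPy.

```python
import numpy as np


def laplacian(a, xp):
    """5-point Laplacian with periodic wrap-around (boundary choice is an assumption)."""
    return (xp.roll(a, 1, axis=0) + xp.roll(a, -1, axis=0)
            + xp.roll(a, 1, axis=1) + xp.roll(a, -1, axis=1) - 4.0 * a)


def gray_scott_step(u, v, xp, du=0.16, dv=0.08, f=0.035, k=0.060, dt=1.0):
    """One explicit Euler step; parameter values are illustrative only."""
    uvv = u * v * v
    u_new = u + dt * (du * laplacian(u, xp) - uvv + f * (1.0 - u))
    v_new = v + dt * (dv * laplacian(v, xp) + uvv - (f + k) * v)
    return u_new, v_new


def simulate(u_host, v_host, n_steps, xp=np):
    u, v = xp.asarray(u_host), xp.asarray(v_host)   # host -> device, once
    for _ in range(n_steps):                        # whole loop stays on the device
        u, v = gray_scott_step(u, v, xp)
    to_host = getattr(xp, "asnumpy", np.asarray)    # cupy.asnumpy, or a no-op for NumPy
    return to_host(u), to_host(v)                   # device -> host, once


u0 = np.ones((64, 64), dtype=np.float32)
v0 = np.zeros((64, 64), dtype=np.float32)
u0[28:36, 28:36] = 0.5     # small seeded square in the middle
v0[28:36, 28:36] = 0.25
u, v = simulate(u0, v0, n_steps=100)   # pass xp=cupy to target the GPU
```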

Hands-on bonus: CuPy is the only version with parallel HDF5 I/O implemented — writing the 1000 images while the GPU computes saves ~4 s (12 s total on A100).
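The principle of writing images while the computation continues can be sketched with a background writer thread and a bounded queue. This is not the tutorial's implementation: `np.save` stands in for the HDF5 writes to keep the example dependency-free, the `u + 1.0` update stands in for a block of time steps on the device, and the names `writer` and `run` are hypothetical.

```python
import queue
import tempfile
import threading
from pathlib import Path

import numpy as np


def writer(frames: queue.Queue, out_dir: Path) -> None:
    """Drain the queue and write each frame; None is the stop signal."""
    while True:
        item = frames.get()
        if item is None:
            break
        index, frame = item
        np.save(out_dir / f"frame_{index:04d}.npy", frame)


def run(n_frames: int, out_dir: Path) -> None:
    frames: queue.Queue = queue.Queue(maxsize=4)   # bounded: compute cannot run far ahead
    thread = threading.Thread(target=writer, args=(frames, out_dir))
    thread.start()
    u = np.zeros((108, 192), dtype=np.float32)
    for i in range(n_frames):
        u = u + 1.0                   # stand-in for a block of time steps
        frames.put((i, u.copy()))     # with CuPy this copy would be cp.asnumpy(u)
    frames.put(None)                  # tell the writer to stop
    thread.join()                     # wait until every frame is on disk


with tempfile.TemporaryDirectory() as tmp:
    run(10, Path(tmp))
    n_written = len(list(Path(tmp).glob("frame_*.npy")))
```

The main loop only blocks when the queue is full, so disk writes overlap with the next block of computation instead of adding to it.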

cuPyNumeric session — distributed NumPy

Tutorial 2_Python_GPU_cuPyNumeric.md: cuPyNumeric (NVIDIA, Legate engine) runs NumPy code on several GPUs and several nodes without MPI and without rewriting — the same script, a bigger machine. The price of generality shows in the benchmark: the generic convolve version is the slowest of the field (128 s).

JAX session (2 pm) — Day 5's code, re-jitted

Tutorial 1_Python_GPU_JAX.md: Day 5's JAX Gray-Scott code runs unchanged. XLA compiles the traced stencil into a CUDA kernel, and JAX places arrays on the device by default. The hands-on solutions show the three tools that make the difference:

Hands-on solution	What it teaches
`jax_vmap_solutions.py`	vectorize a function over a whole axis (batch)
`jax_fori_loop_solutions.py`	fuse the time loop inside the compiled graph
`jax_scan_solutions.py`	accumulate states without returning to Python between steps

This is exactly the toolbox of SenLand's JAX port (lax.fori_loop to fuse steps, device-resident batches).
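The three tools in the table can be sketched on a small Gray-Scott step. This is not the content of the solution files: the function names, the periodic boundary, the grid size and the parameter values are assumptions chosen only to keep the example short and runnable (it also runs on CPU if no GPU is present).

```python
from functools import partial

import jax
import jax.numpy as jnp
from jax import lax


def laplacian(a):
    # 5-point stencil, periodic wrap-around (illustrative boundary choice)
    return (jnp.roll(a, 1, 0) + jnp.roll(a, -1, 0)
            + jnp.roll(a, 1, 1) + jnp.roll(a, -1, 1) - 4.0 * a)


def step(state, f=0.035, k=0.060, du=0.16, dv=0.08, dt=1.0):
    u, v = state
    uvv = u * v * v
    return (u + dt * (du * laplacian(u) - uvv + f * (1.0 - u)),
            v + dt * (dv * laplacian(v) + uvv - (f + k) * v))


@partial(jax.jit, static_argnames="n_steps")
def run_fori(u, v, n_steps):
    # fori_loop: the whole time loop lives inside one compiled program
    return lax.fori_loop(0, n_steps, lambda i, s: step(s), (u, v))


@partial(jax.jit, static_argnames="n_steps")
def run_scan(u, v, n_steps):
    # scan: same loop, but it also stacks one diagnostic value per step
    def body(state, _):
        new = step(state)
        return new, new[1].mean()
    return lax.scan(body, (u, v), xs=None, length=n_steps)


u0 = jnp.ones((64, 64), jnp.float32).at[28:36, 28:36].set(0.5)
v0 = jnp.zeros((64, 64), jnp.float32).at[28:36, 28:36].set(0.25)

u_f, v_f = run_fori(u0, v0, n_steps=50)
(u_s, v_s), v_mean = run_scan(u0, v0, n_steps=50)

# vmap: advance a batch of independent initial conditions in one call
u_batch = jnp.stack([u0, u0 * 0.9])
v_batch = jnp.stack([v0, v0 * 2.0])
u_b, v_b = jax.vmap(lambda u, v: run_fori(u, v, n_steps=50))(u_batch, v_batch)
```

`run_fori` and `run_scan` advance the same state; `scan` is the one to reach for when a per-step history (here the mean of v) is needed without returning to Python between steps.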

The verdict — official A100 numbers

The repo's GPU/Benchmarks.md: 32×1000 iterations, 1920×1080 grid in float32:

cuPyNumeric (convolve)	JAX (generic)	JAX (3×3)	CuPy	PyTorch
128 s	47 s	18 s	18 s	22 s

Figure (bar chart of the table above): Gray-Scott Python on A100 (32×1000 iterations, 1920×1080 float32). Official numbers from the repo (GPU/Benchmarks.md). CuPy + parallel HDF5 I/O: 12 s.
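The timings can be turned into a throughput figure, which makes runs on other grids easier to compare. The short calculation below uses only the numbers reported in the table; the rates are end-to-end, so they may include output and start-up time and not only the stencil.

```python
NX, NY = 1920, 1080
N_STEPS = 32 * 1000

# Wall-clock seconds as reported in the benchmark table
timings_s = {
    "cuPyNumeric (convolve)": 128,
    "JAX (generic)": 47,
    "JAX (3x3)": 18,
    "CuPy": 18,
    "PyTorch": 22,
}

cell_updates = NX * NY * N_STEPS          # grid cells advanced over the whole run
slowest = max(timings_s.values())

for name, seconds in timings_s.items():
    rate = cell_updates / seconds / 1e9   # billions of cell updates per second
    print(f"{name:24s} {rate:5.2f} G updates/s   x{slowest / seconds:.1f} vs slowest")
```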

The loop closes: Day 5's best CPU time of 377 s drops to 18 s on the A100, a ×21 speedup, still in Python. And the ranking echoes the week's lessons: stencil specialization (Day 5) and data residency (today) weigh more than the library choice.
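What "stencil specialization" means can be illustrated in plain NumPy. The sketch below is not the benchmarked code, and whether the repo's "generic" and "3×3" variants differ in exactly this way is an assumption: it simply contrasts a routine that accepts any 3×3 kernel with one that hard-codes the 5-point Laplacian. The names and the periodic boundary are illustrative.

```python
import numpy as np

KERNEL = np.array([[0, 1, 0],
                   [1, -4, 1],
                   [0, 1, 0]], dtype=np.float32)


def laplacian_generic(a, kernel):
    """Works for any 3x3 kernel: visits all nine weights, zeros included."""
    padded = np.pad(a, 1, mode="wrap")
    out = np.zeros_like(a)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + a.shape[0], j:j + a.shape[1]]
    return out


def laplacian_specialized(a):
    """Hard-codes the 5-point stencil: four neighbours and the centre."""
    p = np.pad(a, 1, mode="wrap")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * a


rng = np.random.default_rng(0)
a = rng.random((32, 48), dtype=np.float32)
same = np.allclose(laplacian_generic(a, KERNEL), laplacian_specialized(a), atol=1e-5)
```

Both give the same result; the specialized form touches five arrays instead of nine and leaves the compiler or library far less to figure out.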

The hands-on — `GrayScott2026/day-5/GPU/`

```bash
# Locally (NVIDIA, CUDA ≥ 12, Python 3.10-3.12)
git clone https://gitlab.in2p3.fr/alice.faure/gray-scott-python.git
python -m venv gpu-env && source gpu-env/bin/activate
pip install h5py opencv-python numpy matplotlib scipy \
            "jax[cuda12]" cupy-cuda12x nvidia-cupynumeric
```
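After installing, a quick check of which packages are visible to the interpreter can save time before opening the tutorials. The helper below is a hypothetical convenience, not part of the course material; the import names are the ones assumed to correspond to the pip packages listed above. It only looks the modules up and does not import them, so it says nothing about whether a GPU is actually usable.

```python
import importlib.util

# Import names assumed for the pip packages in the install command
MODULES = ("numpy", "scipy", "h5py", "cv2", "matplotlib", "jax", "cupy", "cupynumeric")


def available(modules=MODULES):
    """Map each module name to True if it can be found, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = available()
for name, ok in status.items():
    print(f"{name:12s} {'ok' if ok else 'missing'}")
```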

Official alternatives: the course Docker image, or apptainer on the MUST cluster (Install_satellite_sites.md for satellite sites like CINERI). AMD: CuPy and JAX have experimental ROCm routes — cuPyNumeric does not. On a small local card (GTX 1650, 4 GB), shrink the grid: the data-residency lesson stays identical.

On video — the official replay

Replay — Python On GPU (Gray Scott Thursdays)

Sources & official material

The course repository (GPU/tutorial/ tutorials, solutions, A100 benchmarks): gitlab.in2p3.fr/alice.faure/gray-scott-python
The libraries: docs.cupy.dev · docs.nvidia.com/cupynumeric · docs.jax.dev
The MUST platform: jupyter.must-dc.cloud
Video replays (YouTube): Gray Scott Thursdays
School website: GrayScott2026

