The Gray Scott School

Day 7 — Python on GPU

June 30, four sessions with Alice Faure, Jean-Marc Colley, Sébastien Valat and Nabil Garroum: CuPy, cuPyNumeric and JAX port Day 5's Gray-Scott to the accelerator — official A100 numbers included.

June 30, 2026 · Speakers: Alice Faure, Jean-Marc Colley, Sébastien Valat & Nabil Garroum · four Python-GPU sessions in a row (CuPy → cuPyNumeric → JAX at 2 pm → wrap-up) · Marcel Vivargent Auditorium + satellites (including CINERI). The hands-on lives in GrayScott2026/day-5/GPU/ — three tutorials + solutions, and the official A100 benchmarks.

1. What really changes: the PCIe toll booth

Day 5's array-first code carries over almost as-is; what changes is the memory geography. The GPU reads its own memory (HBM) at ~2 TB/s, but it is fed from the host through a ~32 GB/s PCIe link:

CPU · host RAM: ~100 GB/s
PCIe, the toll booth: ~32 GB/s
GPU · HBM: ~2,000 GB/s
The rule: upload arrays ONCE, compute in place, download the result ONCE.
✗ A round trip every step = all the gain swallowed.
The GPU computes fast but is fed through a narrow pipe: data must stay resident on the device for the whole time loop

The whole day applies the same rule: cp.asarray / jax.device_put once at the start, the whole time loop on the device, asnumpy once at the end.
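To put rough numbers on the toll booth, here is a back-of-envelope sketch. It uses the bandwidths quoted above and the benchmark grid (1920×1080, float32, 32×1000 iterations). Two things are assumptions made for illustration: that only the two Gray-Scott fields u and v travel, and that a "round trip" means one upload plus one download of both fields per step. Peak bandwidths are idealized, so real transfers would likely be slower.

```python
# Back-of-envelope cost of crossing PCIe every step vs. staying resident.
NX, NY = 1920, 1080          # benchmark grid
BYTES_PER_VALUE = 4          # float32
N_FIELDS = 2                 # assumption: only u and v are transferred
N_STEPS = 32 * 1000          # benchmark iteration count

PCIE_BW = 32e9               # bytes/s, figure quoted in the text
HBM_BW = 2000e9              # bytes/s, figure quoted in the text

field_bytes = NX * NY * BYTES_PER_VALUE * N_FIELDS
round_trip_bytes = 2 * field_bytes            # upload + download

per_step_pcie_s = round_trip_bytes / PCIE_BW  # one round trip over PCIe
naive_transfer_s = per_step_pcie_s * N_STEPS  # round trip at every step
resident_transfer_s = per_step_pcie_s         # one upload + one download in total
per_step_hbm_s = round_trip_bytes / HBM_BW    # same bytes read + written in HBM

print(f"u and v together: {field_bytes / 1e6:.1f} MB")
print(f"PCIe round trip per step: {per_step_pcie_s * 1e3:.2f} ms")
print(f"round trip every step: {naive_transfer_s:.1f} s of pure transfer")
print(f"resident data: {resident_transfer_s * 1e3:.2f} ms of transfer in total")
print(f"same traffic inside HBM per step: {per_step_hbm_s * 1e6:.1f} µs")
```

Under these assumptions, shuttling the fields at every step would cost about 33 s in transfers alone, while keeping them resident costs about 1 ms for the whole run.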

2. Three routes to the same GPU

CuPy: cp.asarray, same slices
cuPyNumeric: multi-GPU without MPI (Legate)
JAX: Day 5 code, re-jitted
Common base: CUDA (Day 6's architecture)
Three saddles for the same horse: whichever Python style you pick, everything lands on the SMs and global memory seen the day before

CuPy session — NumPy on CUDA, no rewrite

Tutorial 3_Python_GPU_Cupy.md: CuPy mirrors the NumPy API on CUDA — swapping numpy for cupy runs the same stencil on the GPU.

import cupy as cp
u = cp.asarray(u_host)      # host → device, ONCE
# … same stencil expressions as NumPy …
u_host = cp.asnumpy(u)      # device → host, only when needed
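
The snippet above elides the stencil itself. As a minimal sketch of what "same stencil expressions" can look like, the block below writes a Gray-Scott step with slices only, so the same functions accept NumPy or CuPy arrays. The 5-point Laplacian, the fixed border, the parameter values and the names `laplacian`, `gray_scott_step` and `run_resident` are all illustrative assumptions, not the tutorial's code. It runs on NumPy by default so it can be tried without a GPU.

```python
import numpy as np

def laplacian(a):
    """5-point Laplacian of the interior cells, written with shifted slices."""
    return (a[:-2, 1:-1] + a[2:, 1:-1] + a[1:-1, :-2] + a[1:-1, 2:]
            - 4.0 * a[1:-1, 1:-1])

def gray_scott_step(u, v, Du=0.16, Dv=0.08, F=0.035, k=0.060, dt=1.0):
    """One explicit step, updating the interior of u and v in place.

    Parameter values are illustrative assumptions. Border cells stay fixed.
    """
    ui, vi = u[1:-1, 1:-1], v[1:-1, 1:-1]      # views on the interior
    uvv = ui * vi * vi
    du = Du * laplacian(u) - uvv + F * (1.0 - ui)
    dv = Dv * laplacian(v) + uvv - (F + k) * vi
    ui += dt * du                               # writes through the views
    vi += dt * dv

def run_resident(u_host, v_host, n_steps, xp=np):
    """Upload once, loop on the device, download once. Pass xp=cupy for the GPU."""
    u = xp.array(u_host)                        # copy: host → device, ONCE
    v = xp.array(v_host)
    for _ in range(n_steps):
        gray_scott_step(u, v)                   # no host transfer in the loop
    to_host = getattr(xp, "asnumpy", np.asarray)  # cupy.asnumpy, or a no-op for NumPy
    return to_host(u), to_host(v)               # device → host, ONCE

# Small illustrative run: uniform background with a perturbed square.
u0 = np.ones((64, 64), dtype=np.float32)
v0 = np.zeros((64, 64), dtype=np.float32)
u0[28:36, 28:36] = 0.5
v0[28:36, 28:36] = 0.25
u1, v1 = run_resident(u0, v0, n_steps=50)
```

With CuPy installed and a CUDA device available, `import cupy as cp` followed by `run_resident(u0, v0, 50, xp=cp)` should execute the same expressions on the GPU, since the slicing and arithmetic dispatch on the array type.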

Hands-on bonus: CuPy is the only version with parallel HDF5 I/O implemented — writing the 1000 images while the GPU computes saves ~4 s (12 s total on A100).
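
How the repo overlaps output with computation is not described here. One common pattern, sketched below under that caveat, is a producer/consumer queue: the time loop hands finished frames to a writer thread and carries on. The names `writer_loop` and `simulate_with_async_output` are hypothetical, and a plain list stands in for the HDF5 file.

```python
import queue
import threading

def writer_loop(frames, sink):
    """Consumer: drain frames until the None sentinel arrives."""
    while True:
        frame = frames.get()
        if frame is None:
            break
        sink.append(frame)          # stand-in for an HDF5 dataset write

def simulate_with_async_output(n_frames, compute_frame, sink, max_pending=4):
    """Producer: compute frames while the writer thread stores earlier ones."""
    frames = queue.Queue(maxsize=max_pending)   # bounded, so memory stays capped
    writer = threading.Thread(target=writer_loop, args=(frames, sink))
    writer.start()
    for i in range(n_frames):
        frames.put(compute_frame(i))            # blocks only if the writer lags
    frames.put(None)                            # tell the writer to stop
    writer.join()

written = []
simulate_with_async_output(5, lambda i: [i] * 3, written)
```

The bounded queue is a deliberate choice: if the writer falls behind, the compute loop waits instead of piling up frames in memory.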

cuPyNumeric session — distributed NumPy

Tutorial 2_Python_GPU_cuPyNumeric.md: cuPyNumeric (NVIDIA, Legate engine) runs NumPy code on several GPUs and several nodes without MPI and without rewriting — the same script, a bigger machine. The price of generality shows in the benchmark: the generic convolve version is the slowest of the field (128 s).

JAX session (2 pm) — Day 5's code, re-jitted

Tutorial 1_Python_GPU_JAX.md: Day 5's JAX Gray-Scott replays unchanged — XLA compiles the traced stencil into a CUDA kernel, and JAX places arrays on the device by default. The hands-on solutions show the three tools that make the difference:

| Hands-on solution | What it teaches |
|---|---|
| jax_vmap_solutions.py | vectorize a function over a whole axis (batch) |
| jax_fori_loop_solutions.py | fuse the time loop inside the compiled graph |
| jax_scan_solutions.py | accumulate states without returning to Python between steps |
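
The three tools can be sketched together on a toy Gray-Scott step. This is not the content of the solution files: the periodic boundaries via `jnp.roll`, the parameter values and the names `step`, `run_fori`, `run_scan` and `run_batch` are assumptions made for illustration.

```python
from functools import partial

import jax
import jax.numpy as jnp
from jax import lax

def laplacian(a):
    """5-point Laplacian with periodic boundaries (assumed for brevity)."""
    return (jnp.roll(a, 1, axis=0) + jnp.roll(a, -1, axis=0)
            + jnp.roll(a, 1, axis=1) + jnp.roll(a, -1, axis=1) - 4.0 * a)

def step(state, Du=0.16, Dv=0.08, F=0.035, k=0.060, dt=1.0):
    """One explicit Gray-Scott step; parameter values are illustrative."""
    u, v = state
    uvv = u * v * v
    return (u + dt * (Du * laplacian(u) - uvv + F * (1.0 - u)),
            v + dt * (Dv * laplacian(v) + uvv - (F + k) * v))

@partial(jax.jit, static_argnames="n_steps")
def run_fori(u, v, n_steps):
    """fori_loop: the whole time loop lives inside one compiled program."""
    return lax.fori_loop(0, n_steps, lambda i, s: step(s), (u, v))

@partial(jax.jit, static_argnames="n_frames")
def run_scan(u, v, n_frames):
    """scan: same loop, but also stacks v after every step on the device."""
    def body(carry, _):
        carry = step(carry)
        return carry, carry[1]
    (u, v), v_frames = lax.scan(body, (u, v), xs=None, length=n_frames)
    return u, v, v_frames

def run_batch(u_batch, v_batch, n_steps):
    """vmap: run independent simulations stacked along a leading batch axis."""
    return jax.vmap(partial(run_fori, n_steps=n_steps))(u_batch, v_batch)

u = jnp.ones((16, 16), dtype=jnp.float32)
v = jnp.zeros((16, 16), dtype=jnp.float32).at[6:10, 6:10].set(0.25)
u_end, v_end, v_frames = run_scan(u, v, n_frames=5)   # v_frames: (5, 16, 16)
```

Marking the step count as static keeps it a Python integer at trace time, at the cost of one recompilation per distinct value.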

This is exactly the toolbox of SenLand's JAX port (lax.fori_loop to fuse steps, device-resident batches).

The verdict — official A100 numbers

The repo's GPU/Benchmarks.md: 32×1000 iterations, 1920×1080 grid in float32:

| Version | Time on A100 |
|---|---|
| cuPyNumeric (convolve) | 128 s |
| JAX (generic) | 47 s |
| JAX (3×3) | 18 s |
| CuPy | 18 s |
| PyTorch | 22 s |

Gray-Scott Python on A100 (32×1000 iterations, 1920×1080 float32)

Official numbers from the repo (GPU/Benchmarks.md). CuPy + parallel HDF5 I/O: 12 s.

The loop closes: Day 5's best CPU time of 377 s drops to 18 s on the A100, a ×21 speedup, still in Python. The ranking echoes the week's lessons: stencil specialization (Day 5) and data residency (today) weigh more than the choice of library.
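
The arithmetic behind these figures can be checked in a few lines. The timings, grid size and iteration count are the ones quoted above; the derived per-step time and throughput are wall-clock averages, so they include whatever output writing was part of each measured run.

```python
# Derived figures from the published A100 timings (seconds).
timings_s = {
    "cuPyNumeric (convolve)": 128,
    "JAX (generic)": 47,
    "JAX (3×3)": 18,
    "CuPy": 18,
    "PyTorch": 22,
}
CPU_BEST_S = 377                 # Day 5's best CPU time
N_STEPS = 32 * 1000
CELLS = 1920 * 1080

summary = {}
for name, t in timings_s.items():
    summary[name] = {
        "speedup_vs_cpu": CPU_BEST_S / t,
        "ms_per_step": 1e3 * t / N_STEPS,
        "gcells_per_s": CELLS * N_STEPS / t / 1e9,
    }
    row = summary[name]
    print(f"{name:24s} ×{row['speedup_vs_cpu']:5.1f}  "
          f"{row['ms_per_step']:.3f} ms/step  {row['gcells_per_s']:.2f} Gcell/s")
```

For the 18 s versions this works out to roughly 0.56 ms per step and about 3.7 billion cell updates per second.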

The hands-on — GrayScott2026/day-5/GPU/

# Locally (NVIDIA, CUDA ≥ 12, Python 3.10-3.12)
git clone https://gitlab.in2p3.fr/alice.faure/gray-scott-python.git
python -m venv gpu-env && source gpu-env/bin/activate
pip install h5py opencv-python numpy matplotlib scipy \
            "jax[cuda12]" cupy-cuda12x nvidia-cupynumeric

Official alternatives: the course Docker image, or apptainer on the MUST cluster (Install_satellite_sites.md for satellite sites like CINERI). AMD: CuPy and JAX have experimental ROCm routes — cuPyNumeric does not. On a small local card (GTX 1650, 4 GB), shrink the grid: the data-residency lesson stays identical.

On video — the official replay

Replay — Python On GPU (Gray Scott Thursdays)
