Day 8 — Fortran on GPU
July 1, 2026 · Morning: Vincent Lafage (IJCLab), Fortran 2018 on GPU · 5 pm: Pierre Aubert (LAPP), Julia … in Rust … and C++ … with pixi ? · Marcel Vivargent Auditorium + satellites (including CINERI).
The hands-on reuses Day 3's repo (day-3-a/), this time its GPU/ side; every measurement below was reproduced locally (nvfortran + GTX 1650).
Morning session — standard Fortran on the GPU
1. do concurrent: from CPU to GPU without changing a line
The Day 3 bridge is crossed. The nested do loops of the Laplacian become a do concurrent —
an ISO standard construct that asserts the iterations are independent, so the compiler may run
them in any order.
```fortran
do concurrent (i = 2:nx-1, j = 2:ny-1)
  lap_u = sum(stencil * U0(i-1:i+1, j-1:j+1))
  U1(i,j) = U0(i,j) + dt * (Diffusivity_u*lap_u - U0(i,j)*V0(i,j)**2 &
                            + Feed_Rate*(1.0_pr - U0(i,j)))
end do
```
No directive, no API — plain Fortran. With -stdpar=gpu, nvfortran generates a kernel,
allocates the arrays in unified memory and moves the data itself.
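To make the "without changing a line" claim concrete, here is a minimal, self-contained sketch that runs the same update twice, once with nested do loops and once with do concurrent, and checks that both agree. The grid size, the parameter values, the stencil weights and the initial condition are illustrative assumptions, not the course's settings; only the loop body comes from the snippet above.

```fortran
program do_concurrent_sketch
  ! Sketch only: one explicit step of the U equation, written as nested
  ! do loops and as do concurrent, to check that both forms agree.
  implicit none
  integer, parameter :: pr = kind(1.0d0)         ! assumed working precision
  integer, parameter :: nx = 64, ny = 64         ! illustrative grid size
  real(pr), parameter :: Diffusivity_u = 0.1_pr  ! illustrative value
  real(pr), parameter :: Feed_Rate = 0.054_pr    ! illustrative value
  real(pr), parameter :: dt = 1.0_pr             ! illustrative value
  real(pr) :: stencil(3,3), lap_u
  real(pr), allocatable :: U0(:,:), V0(:,:), U1(:,:), U1_ref(:,:)
  integer :: i, j

  ! Assumed 3x3 Laplacian weights (they sum to zero); the course may use others.
  stencil = reshape([0.25_pr, 0.5_pr, 0.25_pr, &
                     0.5_pr, -3.0_pr, 0.5_pr, &
                     0.25_pr, 0.5_pr, 0.25_pr], [3, 3])

  ! Illustrative initial condition: uniform field with a perturbed square.
  allocate(U0(nx,ny), V0(nx,ny), U1(nx,ny), U1_ref(nx,ny))
  U0 = 1.0_pr
  V0 = 0.0_pr
  U0(nx/2-4:nx/2+4, ny/2-4:ny/2+4) = 0.5_pr
  V0(nx/2-4:nx/2+4, ny/2-4:ny/2+4) = 0.25_pr
  U1 = U0
  U1_ref = U0

  ! Reference: classic nested loops over the interior points.
  do j = 2, ny-1
    do i = 2, nx-1
      lap_u = sum(stencil * U0(i-1:i+1, j-1:j+1))
      U1_ref(i,j) = U0(i,j) + dt * (Diffusivity_u*lap_u - U0(i,j)*V0(i,j)**2 &
                                    + Feed_Rate*(1.0_pr - U0(i,j)))
    end do
  end do

  ! Same body as do concurrent. lap_u is defined before use in every
  ! iteration; Fortran 2018 also allows stating this with local(lap_u).
  do concurrent (i = 2:nx-1, j = 2:ny-1)
    lap_u = sum(stencil * U0(i-1:i+1, j-1:j+1))
    U1(i,j) = U0(i,j) + dt * (Diffusivity_u*lap_u - U0(i,j)*V0(i,j)**2 &
                              + Feed_Rate*(1.0_pr - U0(i,j)))
  end do

  ! Tolerance leaves room for small rounding differences on an accelerator.
  if (maxval(abs(U1 - U1_ref)) > 1.0e-12_pr) error stop "do concurrent differs"
  print *, "max difference:", maxval(abs(U1 - U1_ref))
end program do_concurrent_sketch
```

The same file should build unchanged for the host with any Fortran 2008 compiler, or with nvfortran and -stdpar=gpu for the GPU; the source does not change, only the flag.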
2. Three offload back-ends, one source
| Variant | Mechanism | Key flag | Target |
|---|---|---|---|
| stdpar | do concurrent | -stdpar=gpu | GPU |
| stdpar | do concurrent | -stdpar=multicore | CPU cores |
| openacc | !$acc directives | -acc | GPU |
| openmp_offload | !$omp target | -mp=gpu | GPU |
-gpu=ccnative targets the compute capability of the GPU present; -Minfo=accel makes the
compiler report which loops it offloaded — the first diagnostic reflex.
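For comparison, the two directive-based variants of the same loop might look like the sketch below. The clauses (collapse, private, copyin/copy, map) are one reasonable choice rather than the course's actual directives, and the sizes, parameter values and stencil weights are illustrative assumptions. Without -acc or -mp=gpu the directives are plain comments, so the program also runs on the host.

```fortran
program directive_variants_sketch
  ! Sketch only: the U update written with OpenACC, then with OpenMP target.
  ! Possible builds: nvfortran -acc -gpu=ccnative -Minfo=accel file.f90
  !                  nvfortran -mp=gpu -gpu=ccnative -Minfo=accel file.f90
  implicit none
  integer, parameter :: pr = kind(1.0d0)         ! assumed working precision
  integer, parameter :: nx = 64, ny = 64         ! illustrative grid size
  real(pr), parameter :: Diffusivity_u = 0.1_pr  ! illustrative value
  real(pr), parameter :: Feed_Rate = 0.054_pr    ! illustrative value
  real(pr), parameter :: dt = 1.0_pr             ! illustrative value
  real(pr) :: stencil(3,3), lap_u
  real(pr), allocatable :: U0(:,:), V0(:,:), U1_acc(:,:), U1_omp(:,:)
  integer :: i, j

  ! Assumed 3x3 Laplacian weights (they sum to zero).
  stencil = reshape([0.25_pr, 0.5_pr, 0.25_pr, &
                     0.5_pr, -3.0_pr, 0.5_pr, &
                     0.25_pr, 0.5_pr, 0.25_pr], [3, 3])

  allocate(U0(nx,ny), V0(nx,ny), U1_acc(nx,ny), U1_omp(nx,ny))
  U0 = 1.0_pr
  V0 = 0.0_pr
  U0(nx/2-4:nx/2+4, ny/2-4:ny/2+4) = 0.5_pr
  V0(nx/2-4:nx/2+4, ny/2-4:ny/2+4) = 0.25_pr
  U1_acc = U0
  U1_omp = U0

  ! OpenACC: inputs copied in, the result copied in and back out so the
  ! untouched boundary values survive.
  !$acc parallel loop collapse(2) private(lap_u) copyin(U0, V0, stencil) copy(U1_acc)
  do j = 2, ny-1
    do i = 2, nx-1
      lap_u = sum(stencil * U0(i-1:i+1, j-1:j+1))
      U1_acc(i,j) = U0(i,j) + dt * (Diffusivity_u*lap_u - U0(i,j)*V0(i,j)**2 &
                                    + Feed_Rate*(1.0_pr - U0(i,j)))
    end do
  end do

  ! OpenMP target: same loop nest, data movement expressed with map clauses.
  !$omp target teams distribute parallel do collapse(2) private(lap_u) &
  !$omp&   map(to: U0, V0, stencil) map(tofrom: U1_omp)
  do j = 2, ny-1
    do i = 2, nx-1
      lap_u = sum(stencil * U0(i-1:i+1, j-1:j+1))
      U1_omp(i,j) = U0(i,j) + dt * (Diffusivity_u*lap_u - U0(i,j)*V0(i,j)**2 &
                                    + Feed_Rate*(1.0_pr - U0(i,j)))
    end do
  end do

  if (maxval(abs(U1_acc - U1_omp)) > 1.0e-12_pr) error stop "variants differ"
  print *, "max difference between variants:", maxval(abs(U1_acc - U1_omp))
end program directive_variants_sketch
```

Unlike do concurrent, both directive variants need the scalar lap_u declared private and the data movement spelled out, which is the price of the finer control they offer.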
3. The HDF5 pitfall
Fortran module files (.mod) are not portable across compilers: an HDF5 built with gfortran
is unreadable by nvfortran. Either rebuild HDF5 with nvfortran, or disable output
(do_write = .false.) to time the pure GPU kernel.
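A sketch of how such a switch can isolate the kernel time, assuming a do_write flag like the one named above. The write_frame routine is a hypothetical placeholder for the HDF5 output, the averaging kernel is a stand-in for the Gray-Scott update, and the sizes are far smaller than the course run.

```fortran
program timing_sketch
  ! Sketch only: time a stepping loop with output switched off.
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: pr = kind(1.0d0)   ! assumed working precision
  integer, parameter :: nx = 128, ny = 128 ! illustrative grid size
  integer, parameter :: nsteps = 100       ! illustrative step count
  logical, parameter :: do_write = .false. ! .true. would re-enable output
  real(pr), allocatable :: U0(:,:), U1(:,:)
  integer(int64) :: t_start, t_end, rate
  integer :: i, j, step

  allocate(U0(nx,ny), U1(nx,ny))
  U0 = 0.0_pr
  U0(nx/2, ny/2) = 1.0_pr
  U1 = U0

  call system_clock(count=t_start, count_rate=rate)
  do step = 1, nsteps
    ! Stand-in kernel: four-point average over the interior points.
    do concurrent (i = 2:nx-1, j = 2:ny-1)
      U1(i,j) = 0.25_pr * (U0(i-1,j) + U0(i+1,j) + U0(i,j-1) + U0(i,j+1))
    end do
    U0(2:nx-1, 2:ny-1) = U1(2:nx-1, 2:ny-1)
    ! Output stays outside the kernel and is skipped when do_write is false.
    if (do_write) call write_frame(step, U0)
  end do
  call system_clock(count=t_end)

  print '(a, f10.4, a)', "kernel time: ", &
        real(t_end - t_start, pr) / real(rate, pr), " s"

contains

  subroutine write_frame(step, field)
    ! Hypothetical placeholder: the real code would write an HDF5 dataset.
    integer, intent(in) :: step
    real(pr), intent(in) :: field(:,:)
    print *, "step", step, " sum =", sum(field)
  end subroutine write_frame

end program timing_sketch
```

Keeping the output behind a single flag means the timed build needs no HDF5 module at all if the write routine lives in a separate, optional source file.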
4. Measure, then compare
Measured on a GeForce GTX 1650 (1024×1024 grid, 4000 steps, HDF5 off):
| Variant | Flag | Time | Speedup |
|---|---|---|---|
| do concurrent → GPU | -stdpar=gpu | 1.12 s | 48× |
| OpenACC → GPU | -acc | 1.46 s | 37× |
| OpenMP target → GPU | -mp=gpu | 1.55 s | 35× |
| do concurrent → CPU (≈ 7 cores) | -stdpar=multicore | 53.85 s | baseline |
The GPU crushes the multicore CPU (~48× for the fastest variant), and the three offload back-ends sit within about 0.4 s of each other (1.12 s to 1.55 s). Standard do concurrent is slightly ahead of both directive variants while staying plain Fortran.
Figure: bar chart of the four run times. The CPU bar dwarfs the three GPU bars; offload cuts the time by ~48×.
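The speedup column is simply the multicore time divided by each GPU time. The short sketch below redoes that arithmetic from the measured values in the table; the program and variable names are illustrative.

```fortran
program speedup_sketch
  ! Sketch only: recompute the speedups from the measured times above.
  implicit none
  integer, parameter :: pr = kind(1.0d0)
  character(len=*), parameter :: flag(3) = &
      [character(len=12) :: "-stdpar=gpu", "-acc", "-mp=gpu"]
  real(pr), parameter :: t_gpu(3) = [1.12_pr, 1.46_pr, 1.55_pr] ! seconds
  real(pr), parameter :: t_cpu = 53.85_pr  ! -stdpar=multicore baseline, seconds
  integer :: k

  do k = 1, 3
    print '(a12, f6.2, " s   speedup ", f5.1, "x")', &
          flag(k), t_gpu(k), t_cpu / t_gpu(k)
  end do
  ! Gap between the fastest and the slowest GPU variant.
  print '("GPU spread: ", f4.2, " s")', maxval(t_gpu) - minval(t_gpu)
end program speedup_sketch
```

The ratios come out at roughly 48.1, 36.9 and 34.7, which round to the 48×, 37× and 35× of the table, and the spread between the GPU variants is 0.43 s.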
Closing session (5 pm) — Julia · Rust · C++, united by pixi
The day ends with an open-ended session: can Julia, Rust and C++ coexist for the same computation, in a single reproducible toolchain? No single language owns HPC — Julia for high-level prototyping, Rust for safe systems performance, C++ the incumbent. pixi is the answer: it pins compilers, CUDA and the Julia/Rust runtimes in one lockfile, so the three worlds run side by side. The lesson outlives Gray-Scott: it is the method that transfers, not the syntax.
Sources & official material
- The day's slides (PDF, school GitLab wiki): FortranFuriousGPU — IJC dual GS 2026 · Julia in Rust and C++ with pixi
- The course repository: gitlab.in2p3.fr/lafage/GrayScottFortranTuto
- The compiler: NVIDIA HPC SDK (nvfortran)
- Video replays (YouTube): Gray Scott Thursdays
- School website: GrayScott2026