The Gray Scott School

Day 8 — Fortran on GPU

Standard Fortran on the GPU via do concurrent, compared with OpenACC and OpenMP target from a single source — plus the closing polyglot session.

July 1, 2026 · Morning: Vincent Lafage (IJCLab) — Fortran 2018 on GPU · 5 pm: Pierre Aubert (LAPP) — Julia … in Rust … and C++ … with pixi ? · Marcel Vivargent Auditorium + satellites (including CINERI). The hands-on reuses Day 3's repo (day-3-a/), the GPU/ side this time — every measurement below is reproduced locally (nvfortran + GTX 1650).

Morning session — standard Fortran on the GPU

1. `do concurrent`: from CPU to GPU without changing a line

The Day 3 bridge is crossed. The nested do loops of the Laplacian become a do concurrent — an ISO standard construct that asserts the iterations are independent, so the compiler may run them in any order.

do concurrent (i = 2:nx-1, j = 2:ny-1)
   lap_u = sum(stencil * U0(i-1:i+1, j-1:j+1))
   U1(i,j) = U0(i,j) + dt * (Diffusivity_u*lap_u - U0(i,j)*V0(i,j)**2 &
                             + Feed_Rate*(1.0_pr - U0(i,j)))
end do

No directive, no API — plain Fortran. With -stdpar=gpu, nvfortran generates a kernel, allocates the arrays in unified memory and moves the data itself.

2. Three offload back-ends, one source

Variant	Mechanism	Key flag	Target
`stdpar`	`do concurrent`	`-stdpar=gpu`	GPU
`stdpar`	`do concurrent`	`-stdpar=multicore`	CPU cores
`openacc`	`!$acc` directives	`-acc`	GPU
`openmp_offload`	`!$omp target`	`-mp=gpu`	GPU

-gpu=ccnative targets the compute capability of the GPU present; -Minfo=accel makes the compiler report which loops it offloaded — the first diagnostic reflex.

The same do concurrent, four flags: the ISO standard matches the directives on GPU, and the GPU crushes the multicore CPU (×48)

3. The HDF5 pitfall

Fortran module files (.mod) are not portable across compilers: an HDF5 built with gfortran is unreadable by nvfortran. Either rebuild HDF5 with nvfortran, or disable output (do_write = .false.) to time the pure GPU kernel.

4. Measure, then compare

Measured on a GeForce GTX 1650 (1024×1024 grid, 4000 steps, HDF5 off):

Variant	Flag	Time	Speedup
`do concurrent` → GPU	`-stdpar=gpu`	1.12 s	48×
OpenACC → GPU	`-acc`	1.46 s	37×
OpenMP target → GPU	`-mp=gpu`	1.55 s	35×
`do concurrent` → CPU (≈ 7 cores)	`-stdpar=multicore`	53.85 s	baseline

The GPU crushes the multicore CPU (~48×), and the three offload back-ends sit within two tenths of a second — standard do concurrent matches the directives while staying plain Fortran.

Gray-Scott: GPU vs CPU (GTX 1650, 1024×1024, 4000 steps)

do concurrent → GPU

1,12 s

OpenACC → GPU

1,46 s

OpenMP target → GPU

1,55 s

do concurrent → CPU (~7c)

53,85 s

The CPU bar dwarfs the three GPU bars — offload cuts the time by ~48×.

Closing session (5 pm) — Julia · Rust · C++, united by pixi

The day ends with an open-ended session: can Julia, Rust and C++ coexist for the same computation, in a single reproducible toolchain? No single language owns HPC — Julia for high-level prototyping, Rust for safe systems performance, C++ the incumbent. pixi is the answer: it pins compilers, CUDA and the Julia/Rust runtimes in one lockfile, so the three worlds run side by side. The lesson outlives Gray-Scott: it is the method that transfers, not the syntax.

On video — the official replays

Replay — Fortran On GPU (Gray Scott Thursdays)

Replay — Introduction to Rust (Gray Scott Thursdays), echoing the polyglot session

Sources & official material

The day's slides (PDF, school GitLab wiki): FortranFuriousGPU — IJC dual GS 2026 · Julia in Rust and C++ with pixi
The course repository: gitlab.in2p3.fr/lafage/GrayScottFortranTuto
The compiler: NVIDIA HPC SDK (nvfortran)
Video replays (YouTube): Gray Scott Thursdays
School website: GrayScott2026

Edit this pageorReport an issue

Day 7 — Python on GPU

June 30, four sessions with Alice Faure, Jean-Marc Colley, Sébastien Valat and Nabil Garroum: CuPy, cuPyNumeric and JAX port Day 5's Gray-Scott to the accelerator — official A100 numbers included.

Day 9 — Kokkos on GPU

The portable kernel reaches the accelerator. Portability is not free — layout, host↔device transfers and synchronization — then the school closes.