The Gray Scott School

Day 8 — Fortran on GPU

Standard Fortran on the GPU via do concurrent, compared with OpenACC and OpenMP target from a single source — plus the closing polyglot session.

July 1, 2026 · Morning: Vincent Lafage (IJCLab) — Fortran 2018 on GPU · 5 pm: Pierre Aubert (LAPP) — Julia … in Rust … and C++ … with pixi ? · Marcel Vivargent Auditorium + satellites (including CINERI). The hands-on reuses Day 3's repo (day-3-a/), the GPU/ side this time — every measurement below is reproduced locally (nvfortran + GTX 1650).

Morning session — standard Fortran on the GPU

1. do concurrent: from CPU to GPU without changing a line

The Day 3 bridge is crossed. The nested do loops of the Laplacian become a single do concurrent: an ISO-standard construct in which the programmer asserts that the iterations are independent, so the compiler may run them in any order, including in parallel.

do concurrent (i = 2:nx-1, j = 2:ny-1)
   lap_u = sum(stencil * U0(i-1:i+1, j-1:j+1))
   U1(i,j) = U0(i,j) + dt * (Diffusivity_u*lap_u - U0(i,j)*V0(i,j)**2 &
                             + Feed_Rate*(1.0_pr - U0(i,j)))
end do

No directive, no API: plain Fortran. With -stdpar=gpu, nvfortran generates a GPU kernel from the loop and allocates the arrays in unified memory, so data moves between host and device without any explicit copy in the source.
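To check that the do concurrent form computes the same thing as the nested loops it replaces, here is a minimal, self-contained sketch. The grid size, parameter values, initial condition and 5-point stencil weights are assumptions made for illustration, not values from the course repository; only the u update from the excerpt is reproduced.

```fortran
program do_concurrent_sketch
   implicit none
   integer, parameter :: pr = kind(1.0d0)        ! assumed working precision
   integer, parameter :: nx = 64, ny = 64        ! small illustrative grid
   ! Assumed parameter values, not taken from the course repository
   real(pr), parameter :: Diffusivity_u = 0.1_pr, Feed_Rate = 0.04_pr, dt = 1.0_pr
   ! Assumed 5-point Laplacian weights
   real(pr), parameter :: stencil(3,3) = reshape( &
      [0.0_pr, 1.0_pr, 0.0_pr,  1.0_pr, -4.0_pr, 1.0_pr,  0.0_pr, 1.0_pr, 0.0_pr], [3,3])
   real(pr), allocatable :: U0(:,:), V0(:,:), U1(:,:), U1_ref(:,:)
   real(pr) :: lap_u, max_diff
   integer :: i, j

   allocate(U0(nx,ny), V0(nx,ny), U1(nx,ny), U1_ref(nx,ny))
   U0 = 1.0_pr;  V0 = 0.0_pr
   U0(28:36, 28:36) = 0.5_pr                     ! perturbed square in the middle
   V0(28:36, 28:36) = 0.25_pr
   U1 = U0;  U1_ref = U0                         ! boundary cells stay untouched

   ! Reference: classic nested do loops
   do j = 2, ny-1
      do i = 2, nx-1
         lap_u = sum(stencil * U0(i-1:i+1, j-1:j+1))
         U1_ref(i,j) = U0(i,j) + dt * (Diffusivity_u*lap_u - U0(i,j)*V0(i,j)**2 &
                                       + Feed_Rate*(1.0_pr - U0(i,j)))
      end do
   end do

   ! Same body as a do concurrent. lap_u is defined before it is used in every
   ! iteration; Fortran 2018 can state this explicitly with local(lap_u),
   ! omitted here because compiler support for locality specifiers varies.
   do concurrent (i = 2:nx-1, j = 2:ny-1)
      lap_u = sum(stencil * U0(i-1:i+1, j-1:j+1))
      U1(i,j) = U0(i,j) + dt * (Diffusivity_u*lap_u - U0(i,j)*V0(i,j)**2 &
                                + Feed_Rate*(1.0_pr - U0(i,j)))
   end do

   max_diff = maxval(abs(U1 - U1_ref))
   print '(a, es10.3)', "max |do concurrent - nested loops| = ", max_diff
   if (max_diff > 1.0e-12_pr) error stop "variants disagree"
end program do_concurrent_sketch
```

The tolerance is there because a GPU build may fuse multiply-adds differently from the host; on a plain CPU build the two results should agree to the last bit.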

2. Three offload back-ends, one source

Variant         Mechanism          Key flag            Target
stdpar          do concurrent      -stdpar=gpu         GPU
stdpar          do concurrent      -stdpar=multicore   CPU cores
openacc         !$acc directives   -acc                GPU
openmp_offload  !$omp target       -mp=gpu             GPU

-gpu=ccnative targets the compute capability of the GPU present; -Minfo=accel makes the compiler report which loops it offloaded — the first diagnostic reflex.
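The three mechanisms of the table can be put side by side on the same update. The sketch below is an assumption about what such variants typically look like: the exact directives and clauses used in the course repository are not shown in this page, and the grid, parameters and the hand-expanded 5-point Laplacian are illustrative. Without -acc or -mp=gpu the directives are plain comments, so the program also builds and runs on a CPU-only compiler.

```fortran
program three_backends_sketch
   implicit none
   integer, parameter :: pr = kind(1.0d0)     ! assumed working precision
   integer, parameter :: nx = 64, ny = 64     ! small illustrative grid
   ! Assumed parameter values, not taken from the course repository
   real(pr), parameter :: Diffusivity_u = 0.1_pr, Feed_Rate = 0.04_pr, dt = 1.0_pr
   real(pr), allocatable :: U0(:,:), V0(:,:), U_std(:,:), U_acc(:,:), U_omp(:,:)
   integer :: i, j

   allocate(U0(nx,ny), V0(nx,ny), U_std(nx,ny), U_acc(nx,ny), U_omp(nx,ny))
   U0 = 1.0_pr;  V0 = 0.0_pr
   U0(28:36, 28:36) = 0.5_pr
   V0(28:36, 28:36) = 0.25_pr
   U_std = U0;  U_acc = U0;  U_omp = U0       ! boundary cells stay untouched

   ! Variant 1: ISO do concurrent (offloaded by -stdpar=gpu)
   do concurrent (i = 2:nx-1, j = 2:ny-1)
      U_std(i,j) = U0(i,j) + dt * (Diffusivity_u * (U0(i-1,j) + U0(i+1,j) &
                   + U0(i,j-1) + U0(i,j+1) - 4.0_pr*U0(i,j))              &
                   - U0(i,j)*V0(i,j)**2 + Feed_Rate*(1.0_pr - U0(i,j)))
   end do

   ! Variant 2: OpenACC directive (active with -acc, a comment otherwise)
   !$acc parallel loop collapse(2) copyin(U0, V0) copy(U_acc)
   do j = 2, ny-1
      do i = 2, nx-1
         U_acc(i,j) = U0(i,j) + dt * (Diffusivity_u * (U0(i-1,j) + U0(i+1,j) &
                      + U0(i,j-1) + U0(i,j+1) - 4.0_pr*U0(i,j))              &
                      - U0(i,j)*V0(i,j)**2 + Feed_Rate*(1.0_pr - U0(i,j)))
      end do
   end do

   ! Variant 3: OpenMP target directive (active with -mp=gpu, a comment otherwise)
   !$omp target teams distribute parallel do collapse(2) map(to: U0, V0) map(tofrom: U_omp)
   do j = 2, ny-1
      do i = 2, nx-1
         U_omp(i,j) = U0(i,j) + dt * (Diffusivity_u * (U0(i-1,j) + U0(i+1,j) &
                      + U0(i,j-1) + U0(i,j+1) - 4.0_pr*U0(i,j))              &
                      - U0(i,j)*V0(i,j)**2 + Feed_Rate*(1.0_pr - U0(i,j)))
      end do
   end do

   print '(a, es10.3)', "max |stdpar - openacc| = ", maxval(abs(U_std - U_acc))
   print '(a, es10.3)', "max |stdpar - openmp|  = ", maxval(abs(U_std - U_omp))
   if (maxval(abs(U_std - U_acc)) > 1.0e-12_pr) error stop "OpenACC variant disagrees"
   if (maxval(abs(U_std - U_omp)) > 1.0e-12_pr) error stop "OpenMP variant disagrees"
end program three_backends_sketch
```

The contrast worth noticing is in the data clauses: the directive variants spell out what goes to the device and what comes back, while the do concurrent variant says nothing about memory at all.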

[Figure: one single Fortran source (do concurrent, ISO, zero directives) compiled by nvfortran; the flag picks the target. -stdpar=gpu: 1.12 s; -acc: 1.46 s; -mp=gpu: 1.55 s; -stdpar=multicore: 53.85 s (×48). Measured on GTX 1650, 1024², 4000 steps.]
The same do concurrent, four flags: the ISO standard matches the directives on GPU, and the GPU crushes the multicore CPU (×48)

3. The HDF5 pitfall

Fortran module files (.mod) are not portable across compilers: the module files of an HDF5 built with gfortran cannot be read by nvfortran. Either rebuild HDF5 with nvfortran, or disable output (do_write = .false.) to time the pure GPU kernel.
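Timing the kernel alone can be sketched as follows. This is an illustration under assumptions: write_frame is a hypothetical stand-in for the HDF5 output (it only prints a range), the grid, step count and parameters are arbitrary, and only the u update is advanced, with v held fixed. How the real code drops its HDF5 dependency at build time is not described in this page.

```fortran
program timing_sketch
   use, intrinsic :: iso_fortran_env, only: int64, real64
   implicit none
   integer, parameter :: pr = real64              ! assumed working precision
   integer, parameter :: nx = 256, ny = 256, nsteps = 100   ! illustrative sizes
   logical, parameter :: do_write = .false.       ! output off: time compute only
   ! Assumed parameter values, not taken from the course repository
   real(pr), parameter :: Diffusivity_u = 0.1_pr, Feed_Rate = 0.04_pr, dt = 1.0_pr
   real(pr), allocatable :: U0(:,:), V0(:,:), U1(:,:)
   integer(int64) :: t_start, t_end, rate
   real(pr) :: elapsed
   integer :: i, j, step

   allocate(U0(nx,ny), V0(nx,ny), U1(nx,ny))
   U0 = 1.0_pr;  V0 = 0.0_pr
   U0(120:136, 120:136) = 0.5_pr
   V0(120:136, 120:136) = 0.25_pr
   U1 = U0

   call system_clock(count=t_start, count_rate=rate)
   do step = 1, nsteps
      do concurrent (i = 2:nx-1, j = 2:ny-1)
         U1(i,j) = U0(i,j) + dt * (Diffusivity_u * (U0(i-1,j) + U0(i+1,j) &
                   + U0(i,j-1) + U0(i,j+1) - 4.0_pr*U0(i,j))              &
                   - U0(i,j)*V0(i,j)**2 + Feed_Rate*(1.0_pr - U0(i,j)))
      end do
      U0(2:nx-1, 2:ny-1) = U1(2:nx-1, 2:ny-1)     ! new state becomes the old one
      if (do_write) call write_frame(step, U0)    ! skipped when do_write is false
   end do
   call system_clock(count=t_end)

   elapsed = real(t_end - t_start, pr) / real(rate, pr)
   print '(a, i0, a, f10.6, a)', "compute loop, ", nsteps, " steps: ", elapsed, " s"

   ! Sanity checks: the clock went forward and u stayed in its physical range
   if (elapsed < 0.0_pr) error stop "negative elapsed time"
   if (any(U0 < -1.0e-12_pr) .or. any(U0 > 1.0_pr + 1.0e-12_pr)) error stop "u left [0,1]"

contains

   ! Hypothetical placeholder for the real HDF5 writer
   subroutine write_frame(istep, field)
      integer, intent(in) :: istep
      real(pr), intent(in) :: field(:,:)
      print '(a, i0, a, 2f8.4)', "step ", istep, " min/max u: ", minval(field), maxval(field)
   end subroutine write_frame

end program timing_sketch
```

Wall-clock time from system_clock around the whole step loop is the simplest measure; it includes any host-device transfers triggered inside the loop, which is usually what one wants when comparing back-ends.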

4. Measure, then compare

Measured on a GeForce GTX 1650 (1024×1024 grid, 4000 steps, HDF5 off):

Variant                          Flag               Time     Speedup
do concurrent → GPU              -stdpar=gpu        1.12 s   48×
OpenACC → GPU                    -acc               1.46 s   37×
OpenMP target → GPU              -mp=gpu            1.55 s   35×
do concurrent → CPU (≈ 7 cores)  -stdpar=multicore  53.85 s  baseline

The GPU crushes the multicore CPU (35× to 48× depending on the back-end), and the three offload back-ends sit within about 0.4 s of each other. Standard do concurrent is the fastest of the three here, so it matches or beats the directives while staying plain Fortran.
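The speedup column is simply the multicore time divided by each GPU time. A short sketch that redoes the arithmetic from the measured values in the table (the program and variable names are illustrative):

```fortran
program speedup_sketch
   implicit none
   integer, parameter :: pr = kind(1.0d0)
   ! Flags and times copied from the table above (GTX 1650, 1024x1024, 4000 steps)
   character(len=*), parameter :: flag(3) = &
      [character(len=11) :: "-stdpar=gpu", "-acc", "-mp=gpu"]
   real(pr), parameter :: t_gpu(3) = [1.12_pr, 1.46_pr, 1.55_pr]
   real(pr), parameter :: t_cpu = 53.85_pr          ! -stdpar=multicore baseline
   real(pr) :: speedup(3)
   integer :: k

   speedup = t_cpu / t_gpu                          ! elementwise ratio

   do k = 1, 3
      print '(a11, 2x, f5.2, " s", 2x, f5.1, "x")', flag(k), t_gpu(k), speedup(k)
   end do
   print '(a, f4.2, " s")', "GPU spread (slowest - fastest): ", &
      maxval(t_gpu) - minval(t_gpu)

   ! The rounded ratios should reproduce the table's speedup column
   if (any(nint(speedup) /= [48, 37, 35])) error stop "speedups do not match the table"
end program speedup_sketch
```

The ratios come out at about 48.1, 36.9 and 34.7, which round to the 48×, 37× and 35× of the table; the gap between the fastest and slowest GPU variant is 0.43 s.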

[Bar chart: Gray-Scott, GPU vs CPU (GTX 1650, 1024×1024, 4000 steps). do concurrent → GPU: 1.12 s; OpenACC → GPU: 1.46 s; OpenMP target → GPU: 1.55 s; do concurrent → CPU (~7 cores): 53.85 s.]

The CPU bar dwarfs the three GPU bars: offload cuts the run time by a factor of 35 to 48.

Closing session (5 pm) — Julia · Rust · C++, united by pixi

The day ends with an open-ended session: can Julia, Rust and C++ coexist for the same computation, in a single reproducible toolchain? No single language owns HPC — Julia for high-level prototyping, Rust for safe systems performance, C++ the incumbent. pixi is the answer: it pins compilers, CUDA and the Julia/Rust runtimes in one lockfile, so the three worlds run side by side. The lesson outlives Gray-Scott: it is the method that transfers, not the syntax.

On video — the official replays

Replay — Fortran On GPU (Gray Scott Thursdays)
Replay — Introduction to Rust (Gray Scott Thursdays), echoing the polyglot session

Copyright © 2026