SenLand

Mapping Senegal's land cover with deep learning — the same code on CPU and GPU, compared honestly. A learner project built with the Gray Scott School techniques.

SenLand — Senegal's land cover with deep learning

A real AI project, actually run — entirely on a single laptop, with the techniques taught at the Gray Scott School (CINERI). The same neural network maps land cover — water, cropland, forest, built-up, mangroves — from open satellite imagery. The engineering question that shapes the whole project: run this code first on the CPU, then on the GPU of the same machine, and compare the two compute engines honestly — their architecture, how you exploit them, their limits, and the measured results.

SenLand in motion — from satellite imagery to a land-cover map.

The problem

Knowing where cropland, water, forests and cities are — and how they change year to year — is a national question: agriculture, water resources, urbanization, climate. Manual mapping does not scale to a country. A neural network learns to read satellite imagery and produces the map automatically, everywhere, at the same resolution.

The pipeline, end to end

Four stages: read the open data (imagery + labels), tile it into small patches, a U-Net segments each patch, and the land-cover map is recomposed.
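The tile and recompose stages can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names, the non-overlapping 4 x 4 patches and the threshold-based `segment_stub` (a stand-in for the U-Net) are all assumptions made for the example.

```python
from typing import Iterator, List, Tuple

Grid = List[List[int]]  # a single-band raster stored as rows of pixel values


def tile_origins(height: int, width: int, size: int) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (row, col) of each non-overlapping size x size patch."""
    for r in range(0, height, size):
        for c in range(0, width, size):
            yield r, c


def cut(grid: Grid, r: int, c: int, size: int) -> Grid:
    """Extract one patch; patches on the right/bottom edge are simply smaller."""
    return [row[c:c + size] for row in grid[r:r + size]]


def segment_stub(patch: Grid) -> Grid:
    """Hypothetical stand-in for the U-Net: class 1 if the pixel is >= 5, else 0."""
    return [[1 if v >= 5 else 0 for v in row] for row in patch]


def recompose(patches, height: int, width: int) -> Grid:
    """Paste each predicted patch back at its origin to rebuild the full map."""
    out = [[0] * width for _ in range(height)]
    for (r, c), patch in patches:
        for dr, row in enumerate(patch):
            out[r + dr][c:c + len(row)] = row
    return out


image = [[(r + c) % 10 for c in range(7)] for r in range(5)]  # toy 5 x 7 raster
patches = [((r, c), segment_stub(cut(image, r, c, 4)))
           for r, c in tile_origins(5, 7, 4)]
label_map = recompose(patches, 5, 7)
```

Because the stub is purely per-pixel, the recomposed map equals the result of segmenting the whole image at once. A real convolutional model sees context around each pixel, so pipelines often overlap patches or pad edges; that refinement is left out here.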

The data (100% open)

Sentinel-2 imagery (cloudless, 10 m) pixel-aligned with ESA WorldCover labels (10 m). Four deliberately different landscapes, read live from public servers — no private data, no bulk download.

Lac de Guiers: lake and irrigated river valley (water, cropland, wetlands)

Dakar: peninsula (dense built-up, ocean, bare soil)

Results — the maps

For each area: the Sentinel-2 image, the ground truth (WorldCover) and the model's prediction, side by side. The lake, the ocean and the large agricultural structures are faithfully reconstructed.

Lac de Guiers — Sentinel-2, ground truth, prediction

Sine-Saloum — Sentinel-2, ground truth, prediction

Casamance — Sentinel-2, ground truth, prediction

Dakar — Sentinel-2, ground truth, prediction

Metrics

Segmentation is evaluated on a spatial holdout (a validation area kept apart, for an honest measure of generalization).
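To make the idea of a spatial holdout concrete, here is a small sketch comparing it with a random split on a toy 8 x 8 grid of tiles. The grid, the held-out column band and the "touching pairs" measure are assumptions chosen for illustration; the page does not say how the project defines its validation area.

```python
import random
from typing import Dict, List, Tuple

Tile = Tuple[int, int]  # (row, col) index of a tile within one scene


def spatial_split(tiles: List[Tile], val_cols: range) -> Dict[str, List[Tile]]:
    """Hold out every tile whose column falls in one contiguous band."""
    split: Dict[str, List[Tile]] = {"train": [], "val": []}
    for tile in tiles:
        split["val" if tile[1] in val_cols else "train"].append(tile)
    return split


def touching_pairs(train: List[Tile], val: List[Tile]) -> int:
    """Count (train, val) pairs that are direct neighbours on the grid."""
    return sum(1 for a in train for b in val
               if abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1)


tiles = [(r, c) for r in range(8) for c in range(8)]

# Spatial holdout: the two right-most columns are never seen in training.
spatial = spatial_split(tiles, val_cols=range(6, 8))

# Random holdout of the same size, for comparison.
rng = random.Random(0)
shuffled = tiles[:]
rng.shuffle(shuffled)
random_val, random_train = shuffled[:16], shuffled[16:]

spatial_contact = touching_pairs(spatial["train"], spatial["val"])
random_contact = touching_pairs(random_train, random_val)
print("spatial holdout, touching pairs:", spatial_contact)
print("random split,    touching pairs:", random_contact)
```

With the spatial split, training and validation tiles meet only along one border. A random split scatters validation tiles among their training neighbours, so near-identical pixels end up on both sides and the score flatters the model.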

Task	Metric	Value
Classification (EuroSAT, 10 classes, from scratch)	validation accuracy	91.8%
Segmentation (4 areas, spatial holdout)	mean IoU	0.62

Adding the Saloum delta to training lifts mangroves from 3% to 83% IoU — the demonstration that the right data beats a bigger model.
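For readers new to the metric, per-class IoU and mean IoU can be computed from a confusion matrix as sketched below. The toy labels and the three-class setup are made up for the example; only the definition IoU = TP / (TP + FP + FN) is standard.

```python
from typing import List, Sequence


def confusion(truth: Sequence[int], pred: Sequence[int], n_classes: int) -> List[List[int]]:
    """m[t][p] counts pixels whose true class is t and predicted class is p."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(truth, pred):
        m[t][p] += 1
    return m


def per_class_iou(m: List[List[int]]) -> List[float]:
    """IoU_k = TP / (TP + FP + FN); NaN when class k is absent from both maps."""
    ious = []
    for k in range(len(m)):
        tp = m[k][k]
        fn = sum(m[k]) - tp                      # true k, predicted otherwise
        fp = sum(row[k] for row in m) - tp       # predicted k, true otherwise
        union = tp + fp + fn
        ious.append(tp / union if union else float("nan"))
    return ious


def mean_iou(ious: List[float]) -> float:
    """Average over classes that actually occur (NaN entries are skipped)."""
    valid = [v for v in ious if v == v]
    return sum(valid) / len(valid)


# Toy flattened label maps with 3 classes (illustrative values only).
truth = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

ious = per_class_iou(confusion(truth, pred, 3))
print(ious)            # [0.75, 0.5, 0.75]
print(mean_iou(ious))  # about 0.667
```

Because the mean gives every class equal weight, a rare class that is badly segmented pulls the score down as much as a common one would. That is why one class moving from 3% to 83% is visible in the mean.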

Per-class IoU: water and built-up lead, mangroves follow

Segmentation learning: mean IoU and loss

The architecture — one code, two engines

The model is a U-Net (ResNet-34 encoder) for segmentation, preceded by a ResNet-18 for the classification warm-up, all in PyTorch. The core of the project reuses the Kokkos idea from Day 4: one source, two backends. It is exactly the same code that runs on the CPU or the GPU — a hardware introspection layer (hw.py) picks the device and declares it honestly in every figure, never a CPU run labeled "GPU".
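The page does not show the contents of hw.py, so the following is only a guess at what such an introspection layer might look like. `Engine` and `detect_engine` are hypothetical names. The sketch falls back to the CPU when PyTorch or CUDA is missing, so it also runs on a machine without either.

```python
import importlib.util
import platform
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    device: str  # "cpu" or "cuda", the string handed to model.to(...)
    label: str   # human-readable name to stamp on every figure


def detect_engine(prefer_gpu: bool = True) -> Engine:
    """Pick the compute device and describe it truthfully (hypothetical helper)."""
    cpu_label = f"CPU ({platform.processor() or platform.machine()})"
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed: report that instead of pretending.
        return Engine("cpu", cpu_label + ", PyTorch not installed")
    import torch
    if prefer_gpu and torch.cuda.is_available():
        return Engine("cuda", f"GPU ({torch.cuda.get_device_name(0)})")
    return Engine("cpu", f"{cpu_label}, {torch.get_num_threads()} threads")


engine = detect_engine()
print(f"running on: {engine.label}")
```

The label is derived from the same check that picks the device, so a figure title built from `engine.label` cannot claim "GPU" for a run that silently fell back to the CPU.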

One code, two engines: a few big cores (CPU) vs a swarm of cores (GPU)

On CPU, parallelism goes through OpenMP intra-op threads (Day 1 / TBB); on GPU, through the massive SIMT parallelism of CUDA cores (Day 2 / 3). No line of science changes: only the iteration throughput does.

On CPU

No GPU required: the same code uses all cores via OpenMP threads. Strong scaling is measured on the laptop (Intel i5-10300H, 1 → 8 threads). Efficiency drops beyond 4 threads: the machine has only 4 physical cores, and HyperThreading does not help dense compute.

Multicore CPU scaling — throughput & efficiency, 1 → 8 threads

Pros — available everywhere (no specialized hardware); plenty of RAM, ideal for geo I/O (/vsicurl reads of Sentinel-2 + WorldCover); simple, reproducible debugging; real strong scaling (×2.2 from 1 to 4 cores, 54 → 117 img/s).
Cons — caps at 4 physical cores (HyperThreading adds nothing: 117 → 109 img/s); ≈ 4× slower than the same machine's modest GPU; an epoch in ~25 s vs ~5 s on GPU.

Same weights, same model quality — only slower. The CPU produces exactly the same science, at a lower throughput.
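The speedup and efficiency behind these figures follow directly from the measured throughputs. The sketch below uses the three values quoted above and assumes that the 109 img/s figure corresponds to the 8-thread run.

```python
from typing import Dict, List, Tuple

# threads -> img/s, taken from the figures quoted above
# (assumption: 109 img/s is the 8-thread, HyperThreading run)
measured: Dict[int, float] = {1: 54.0, 4: 117.0, 8: 109.0}


def strong_scaling(throughput: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Return (threads, speedup vs 1 thread, parallel efficiency) per run."""
    base = throughput[1]
    rows = []
    for n in sorted(throughput):
        speedup = throughput[n] / base
        rows.append((n, speedup, speedup / n))  # efficiency = speedup / threads
    return rows


scaling = strong_scaling(measured)
for n, speedup, efficiency in scaling:
    print(f"{n} threads: speedup x{speedup:.2f}, efficiency {efficiency:.0%}")
# 1 threads: speedup x1.00, efficiency 100%
# 4 threads: speedup x2.17, efficiency 54%
# 8 threads: speedup x2.02, efficiency 25%
```

Strong scaling keeps the workload fixed while the thread count grows, so efficiency is simply speedup divided by threads. Going from 4 to 8 threads halves the efficiency without adding throughput, which matches the 4 physical cores.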

On GPU

The GPU applies massive parallelism (SIMT): thousands of CUDA cores process the batch at once. On the laptop's GTX 1650, throughput reaches 466 img/s and iteration becomes fluid, which enabled the long training runs (120 segmentation epochs).

Pros — ≈ ×4 faster than the best CPU, ×8.6 vs a single core; fluid iteration (5.5 s/epoch); architecture built for dense convolutions — SIMT fits the model.
Cons — limited VRAM (4 GB), constraining batch and tile size; mixed precision (AMP fp16) diverges to NaN on this consumer card — disabled locally (an honest limit); host→device transfer overhead.

Strictly the same model and final accuracy as on CPU — the GPU doesn't change the result, it delivers it ~4× faster. The gain is entirely engineer time recovered.
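The AMP limitation listed above comes down to the narrow range of fp16. The sketch below shows, with the standard library only, how a half-precision value can overflow to infinity, underflow to zero, and then yield NaN. It illustrates one common mechanism and is not a diagnosis of what happened on this particular card. `to_fp16` is a helper written for the example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values beyond the fp16 range; hardware would
        # produce an infinity of the same sign instead.
        return float("inf") if x > 0 else float("-inf")


largest = to_fp16(65504.0)     # largest finite fp16 value, kept exactly
overflowed = to_fp16(70000.0)  # too large for fp16: becomes +inf
vanished = to_fp16(1e-8)       # below the smallest subnormal: becomes 0.0
poisoned = overflowed - overflowed  # inf - inf is NaN, and NaN propagates

print(largest, overflowed, vanished, poisoned)
```

Mixed-precision training usually counters the underflow side with loss scaling. If activations or scaled gradients still exceed 65504, a single infinity is enough to turn the loss into NaN.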

Head to head — measured throughput

Same problem, same code, three engines of the same machine. Everything is measured here, on the laptop — nothing is projected.

CPU 1 core · multicore CPU · GPU — measured throughput (log scale)
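A throughput figure of this kind is usually obtained with a fixed workload, a few warm-up iterations, and a wall-clock timer. The harness below is a generic sketch under those assumptions, not the project's benchmark script. `fake_step` is a pure-Python stand-in for one training step.

```python
import time
from typing import Callable


def throughput(step: Callable[[], None], batch_size: int,
               warmup: int = 3, iters: int = 20) -> float:
    """Items per second for a fixed workload, excluding warm-up iterations."""
    for _ in range(warmup):
        step()  # let caches, JIT compilation and allocators settle first
    start = time.perf_counter()
    for _ in range(iters):
        step()
    # With a real GPU step, the device must be synchronized before reading the
    # clock (e.g. torch.cuda.synchronize()), because kernel launches are
    # asynchronous and would otherwise make the step look faster than it is.
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


def fake_step() -> None:
    """Stand-in for one training step on a batch (illustrative only)."""
    sum(i * i for i in range(20_000))


rate = throughput(fake_step, batch_size=8)
print(f"{rate:.0f} items/s")
```

Keeping `batch_size`, `iters` and the step itself identical across engines is what makes the resulting img/s figures comparable.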

The JAX port — comparing frameworks, not just engines

A direct extension of the school's Days 5 and 7 (Python on CPU, then on GPU): the same segmentation task is reimplemented in JAX/Flax (src/senland_jax/) to compare frameworks on top of engines — same Senegal areas, same IoU yardstick. The model is a from-scratch U-Net (GroupNorm, 7.8 M parameters, no ImageNet pretraining); the training step is a pure function compiled with jax.jit and differentiated with jax.value_and_grad.

Framework	Engine	mIoU	Throughput
PyTorch — U-Net ResNet-34, ImageNet	GPU GTX 1650	0.62	470 patch/s
JAX — U-Net GroupNorm, from scratch	GPU GTX 1650	0.57	211 patch/s
JAX — same model	CPU i5-10300H	—	13 patch/s

Reading these numbers honestly:

mIoU 0.57 vs 0.62 — the gap is the ImageNet-pretrained encoder on the PyTorch side; the JAX model trains from scratch. Per-class IoU still tracks closely (permanent water 0.90 vs 0.91, mangroves 0.77 vs 0.78).
The "one code, two engines" story holds in JAX too: the same jitted code runs ≈ ×16 faster on GPU than on CPU — here via device placement (JAX_PLATFORMS) rather than hw.py.

Figure: JAX, the same jitted code on two engines (patch/s). GPU GTX 1650: 211 patch/s; CPU i5-10300H: 13 patch/s. Same JAX code, device placement via JAX_PLATFORMS, ≈ ×16 on GPU.

The fair benchmark — identical model in both frameworks

Comparing a pretrained ResNet-34 to a small hand-written U-Net is not a fair race. scripts/bench_unet.py therefore builds the strictly identical U-Net (GroupNorm, 11 classes) in both frameworks and times the full training step (forward + CE+Dice + backward + AdamW) on a device-resident batch, isolating the framework/compiler from the architecture. GTX 1650, fp32:

Mode	batch 8	batch 16
JAX — naive (per-step `jit`)	246	—
PyTorch — eager	334	—
PyTorch — `torch.compile`	356	460
JAX — fused multi-step (`lax.fori_loop`)	371	463

(patches/s; higher is better)

Figure: identical model, batch 8, the framework alone (patch/s). JAX naive (per-step jit): 246; PyTorch eager: 334; torch.compile: 356; JAX fused (fori_loop): 371. Identical GroupNorm U-Net in both frameworks, full training step, GTX 1650 fp32.

Takeaways:

Naive JAX (one dispatch per step) is the slowest. Once its real levers are engaged — jit, donate_argnums (buffer reuse), device-resident inputs, and above all step fusion (lax.fori_loop: 20 steps for the cost of one dispatch) — JAX edges past torch.compile at batch 8 (+4%) and ties it at batch 16.
At an identical model, fully-tuned JAX ≈ fully-tuned PyTorch on this card. The earlier ~2× gap was the architecture (smp ResNet-34 vs a hand-written U-Net), not the framework. XLA's advantage would widen on TPUs, larger batches/models, or more fusable graphs — none of which a 4 GB consumer GPU exercises.
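The difference between per-step dispatch and step fusion can be sketched on a toy problem. The example below replaces the U-Net with a linear regression and AdamW with plain SGD, so only the structure carries over: a pure step function, `jax.value_and_grad`, `jax.jit`, and `jax.lax.fori_loop` wrapping 20 steps in one compiled call. All names, shapes and the learning rate are assumptions, and `donate_argnums` is left out for brevity.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    """Mean squared error of a linear model (toy stand-in for CE+Dice)."""
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)


def sgd_step(params, x, y, lr=0.1):
    """One pure training step: loss, gradients, parameter update."""
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params, loss


# Naive variant: one compiled call, hence one dispatch, per step.
naive_step = jax.jit(sgd_step)


@jax.jit
def fused_steps(params, x, y):
    """Fused variant: 20 steps inside a single compiled call."""
    def body(_, carry):
        p, _previous_loss = carry
        return sgd_step(p, x, y)
    return jax.lax.fori_loop(0, 20, body, (params, jnp.zeros(())))


x = jax.random.normal(jax.random.PRNGKey(0), (64, 3))
y = x @ jnp.array([1.0, -2.0, 0.5]) + 0.3
init = {"w": jnp.zeros(3), "b": jnp.zeros(())}

params_naive = init
for _ in range(20):
    params_naive, loss_naive = naive_step(params_naive, x, y)

params_fused, loss_fused = fused_steps(init, x, y)
print(float(loss_naive), float(loss_fused))
```

Both variants perform the same arithmetic. The fused one pays the Python-to-device dispatch cost once instead of twenty times, which matters most when each step is short. Setting the environment variable `JAX_PLATFORMS=cpu` before launching the script should run the same code on the CPU.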

Honest reproduction pitfall: jax[cuda12] and PyTorch pin different nvidia-cudnn-cu12 versions, so in one shared venv only one of the two has a working GPU at a time. Use two separate environments.

Built with the Gray Scott School techniques

Every engineering brick of SenLand reuses a technique from the Gray Scott School (CINERI) and applies it to deep learning. The through-line — one code, CPU then GPU — is the very idea of Kokkos.

Brick	School day
One code, two backends	Day 4 · Kokkos → here `hw.py` (PyTorch) and `JAX_PLATFORMS` (JAX)
Frameworks & compilers	Days 5 & 7 · Python/JAX — `jit`, XLA, step fusion (`fori_loop`)
Multicore CPU	Day 1 · parallelism + Day 2 · TBB — shared memory, strong scaling
GPU SIMT / CUDA	Day 2 · GPU + Day 3 · CUDA — massive parallelism
Benchmark & timing	Day 2 · fixed workload, img/s throughput
Vectorization / SIMD	Day 1 + Day 6 · EVE — mixed-precision (AMP) analogue
Floating-point precision	Day 3 · fp32/fp16 — explains the AMP NaN on GTX 1650
I/O & data	Day 2 · HDF5 → here geo I/O Sentinel-2 / WorldCover
Containers & repro	pixi / Apptainer — fixed seeds, versioned experiments

SenLand is an open, reproducible project: every number on this page is backed by a figure committed to the repository. The code runs without a GPU; with one, everything simply goes ~4× faster.
