Mapping Senegal's land cover with deep learning — the same code on CPU and GPU, compared honestly. A learner project built with the Gray Scott School techniques.

SenLand — Senegal's land cover with deep learning

A real AI project, actually run — entirely on a single laptop, with the techniques taught at the Gray Scott School (CINERI). The same neural network maps land cover — water, cropland, forest, built-up, mangroves — from open satellite imagery. The engineering question that shapes the whole project: run this code first on the CPU, then on the GPU of the same machine, and compare the two compute engines honestly — their architecture, how you exploit them, their limits, and the measured results.

SenLand in motion — from satellite imagery to a land-cover map.

The problem

Knowing where cropland, water, forests and cities are — and how they change year to year — is a national question: agriculture, water resources, urbanization, climate. Manual mapping does not scale to a country. A neural network learns to read satellite imagery and produces the map automatically, everywhere, at the same resolution.

The pipeline, end to end

Figure: The SenLand pipeline, end to end. Open data (Sentinel-2 imagery + WorldCover labels) → Tiling (64 × 64 px) → Segmentation (U-Net · ResNet-34) → Land-cover map.

Four stages: read the open data (imagery + labels), tile it into small 64 × 64 px patches, segment each patch with a U-Net, and recompose the land-cover map.
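The tiling and recomposition stages can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names `tile` and `untile` are hypothetical, tiles are assumed non-overlapping, and the image sides are assumed to be multiples of the tile size. A real pipeline would more likely work on NumPy arrays or tensors than on nested lists.

```python
TILE = 64  # patch side in pixels, as in the pipeline described above


def tile(image, size=TILE):
    """Split a 2-D grid (list of rows) into non-overlapping size x size patches.

    Returns a dict mapping each patch's (top, left) offset to the patch itself.
    """
    height, width = len(image), len(image[0])
    if height % size or width % size:
        raise ValueError("image sides must be multiples of the tile size")
    patches = {}
    for top in range(0, height, size):
        for left in range(0, width, size):
            patches[(top, left)] = [row[left:left + size]
                                    for row in image[top:top + size]]
    return patches


def untile(patches, height, width):
    """Paste patches back at their (top, left) offsets to rebuild the full grid."""
    out = [[None] * width for _ in range(height)]
    for (top, left), patch in patches.items():
        for dy, row in enumerate(patch):
            out[top + dy][left:left + len(row)] = row
    return out


# Toy 128 x 128 "image" whose pixel value encodes its own position.
image = [[y * 128 + x for x in range(128)] for y in range(128)]
patches = tile(image)
# A real pipeline would run the U-Net on each patch here; the sketch keeps them unchanged.
rebuilt = untile(patches, 128, 128)
print(len(patches), rebuilt == image)
```

Keeping the offset as the dictionary key means the recomposition step needs no knowledge of the order in which patches were processed.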

The data (100% open)

Sentinel-2 imagery (cloudless, 10 m) pixel-aligned with ESA WorldCover labels (10 m). Four deliberately different landscapes, read live from public servers — no private data, no bulk download.

Lac de Guiers — lake + irrigated river valley: water, cropland, wetlands
Dakar — peninsula: dense built-up, ocean, bare soil
Casamance — forest, mangroves, rivers, cropland
Sine-Saloum — the delta and its great mangrove belt

Results — the maps

For each area: the Sentinel-2 image, the ground truth (WorldCover) and the model's prediction, side by side. The lake, the ocean and the large agricultural structures are faithfully reconstructed.

Metrics

Segmentation is evaluated on a spatial holdout (a validation area kept apart, for an honest measure of generalization).

Task | Metric | Value
Classification (EuroSAT, 10 classes, from scratch) | validation accuracy | 91.8%
Segmentation (4 areas, spatial holdout) | mean IoU | 0.62
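For readers unfamiliar with the segmentation metric, here is a small sketch of how per-class IoU and mean IoU are commonly derived from a confusion matrix. The matrix below is a made-up two-class toy, not project data, and the choice to skip classes that appear in neither truth nor prediction is an assumption; the project may average differently.

```python
import math


def per_class_iou(confusion):
    """IoU per class, where confusion[t][p] counts pixels of true class t predicted as p."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # true c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp     # predicted c, truly otherwise
        union = tp + fp + fn
        ious.append(tp / union if union else math.nan)       # class absent everywhere
    return ious


def mean_iou(confusion):
    """Average IoU over the classes that actually occur (NaN entries are skipped)."""
    valid = [v for v in per_class_iou(confusion) if not math.isnan(v)]
    return sum(valid) / len(valid)


# Toy example: 20 pixels, 2 classes (illustrative numbers only).
toy = [[8, 2],
       [1, 9]]
print(per_class_iou(toy), mean_iou(toy))
```

Because IoU penalizes both missed pixels and false alarms, a class covering only a small area can score near zero even when overall pixel accuracy looks high.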

Adding the Saloum delta to training lifts mangroves from 3% to 83% IoU — the demonstration that the right data beats a bigger model.

Segmentation learning — mean IoU & loss
Per-class IoU — water and built-up lead, mangroves follow
Classification warm-up (EuroSAT) — accuracy & loss
Confusion matrix (classification)

The architecture — one code, two engines

The model is a U-Net (ResNet-34 encoder) for segmentation, preceded by a ResNet-18 for the classification warm-up, all in PyTorch. The core of the project reuses the Kokkos idea from Day 4: one source, two backends. It is exactly the same code that runs on the CPU or the GPU — a hardware introspection layer (hw.py) picks the device and declares it honestly in every figure, never a CPU run labeled "GPU".

Figure: One code, two engines: a few big cores (CPU) vs a swarm of cores (GPU). The same PyTorch code (U-Net model · training loop) goes through hw.py, which picks the engine for a batch of patches. CPU: 4 physical cores · OpenMP threads, shared memory (RAM), 54 → 117 img/s (strong scaling 1 → 4 cores). GPU: CUDA cores · SIMT, the whole batch at once, 466 img/s (≈ ×4 vs CPU).

On CPU, parallelism goes through OpenMP intra-op threads (Day 1 / TBB); on GPU, through the massive SIMT parallelism of CUDA cores (Day 2 / 3). No line of science changes: only the iteration throughput does.
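A hardware introspection layer of this kind can be very small. The sketch below is a guess at the idea, not the project's hw.py: the function names `pick_device` and `figure_label` and the returned dictionary layout are hypothetical. It uses only PyTorch calls that exist (`torch.cuda.is_available`, `torch.cuda.get_device_name`, `torch.get_num_threads`) and falls back to a plain CPU report when PyTorch is not installed, so it runs anywhere.

```python
import os


def pick_device():
    """Describe the compute engine that is actually available on this machine."""
    try:
        import torch
    except ImportError:
        # PyTorch not installed: report a plain CPU so the sketch still runs.
        return {"device": "cpu", "name": "CPU", "threads": os.cpu_count() or 1}
    if torch.cuda.is_available():
        return {"device": "cuda", "name": torch.cuda.get_device_name(0), "threads": None}
    return {"device": "cpu", "name": "CPU", "threads": torch.get_num_threads()}


def figure_label(info):
    """Label for plots, derived from the detected engine rather than typed by hand."""
    if info["device"] == "cuda":
        return f"GPU · {info['name']}"
    return f"CPU · {info['threads']} threads"


info = pick_device()
print(figure_label(info))
# With PyTorch installed, model and batches would then be moved with .to(info["device"]).
```

Deriving the figure label from the detected device, instead of a command-line flag, is what makes it hard to publish a CPU run under a "GPU" title by accident.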

On CPU

No GPU required: the same code uses all cores via OpenMP threads. Strong scaling was measured on the laptop (Intel i5-10300H, 1 → 8 threads). Efficiency drops beyond 4 threads because the machine has only 4 physical cores, and HyperThreading does not help dense compute.

  • Pros — available everywhere (no specialized hardware); plenty of RAM, ideal for geo I/O (/vsicurl reads of Sentinel-2 + WorldCover); simple, reproducible debugging; real strong scaling (×2.2 from 1 to 4 cores, 54 → 117 img/s).
  • Cons — caps at 4 physical cores (HyperThreading adds nothing: 117 → 109 img/s); ≈ 4× slower than the same machine's modest GPU; an epoch in ~25 s vs ~5 s on GPU.

Same weights, same model quality — only slower. The CPU produces exactly the same science, at a lower throughput.
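The strong-scaling figures quoted above reduce to two ratios: speedup (throughput relative to one thread) and parallel efficiency (speedup divided by the number of threads). The sketch below recomputes them from the three throughputs given in the text; the helper name `scaling_table` is illustrative.

```python
def scaling_table(throughput_by_threads):
    """Return (threads, speedup, efficiency) rows relative to the smallest thread count."""
    base_threads = min(throughput_by_threads)
    base = throughput_by_threads[base_threads]
    rows = []
    for threads in sorted(throughput_by_threads):
        speedup = throughput_by_threads[threads] / base
        efficiency = speedup / (threads / base_threads)
        rows.append((threads, speedup, efficiency))
    return rows


# img/s measured on the i5-10300H, as quoted in the text.
measured = {1: 54, 4: 117, 8: 109}
rows = scaling_table(measured)
for threads, speedup, efficiency in rows:
    print(f"{threads} threads: x{speedup:.2f} speedup, {efficiency:.0%} efficiency")
```

At 4 threads the speedup is 117 / 54 ≈ 2.17, which is the ×2.2 quoted above, for an efficiency of about 54%. At 8 threads the speedup falls to about 2.02 and efficiency to about 25%, which is the HyperThreading plateau in numbers.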

On GPU

The GPU applies massive parallelism (SIMT): thousands of CUDA cores process the batch at once. On the laptop's GTX 1650, throughput reaches 466 img/s and iteration becomes fluid, which enabled the long training runs (120 segmentation epochs).

  • Pros — ≈ ×4 faster than the best CPU, ×8.6 vs a single core; fluid iteration (5.5 s/epoch); architecture built for dense convolutions — SIMT fits the model.
  • Cons — limited VRAM (4 GB), constraining batch and tile size; mixed precision (AMP fp16) diverges to NaN on this consumer card — disabled locally (an honest limit); host→device transfer overhead.

Strictly the same model and final accuracy as on CPU — the GPU doesn't change the result, it delivers it ~4× faster. The gain is entirely engineer time recovered.
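The fp16 limit mentioned in the cons can be made concrete without a GPU. The sketch below shows the two generic failure modes of half precision, overflow to infinity and underflow to zero, using only the standard library. It illustrates a common mechanism behind NaN losses and is not a diagnosis of what actually happened on the GTX 1650; the helper `to_fp16` is hypothetical.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the fp16 range; hardware would return +/-inf.
        return math.copysign(math.inf, x)


largest = to_fp16(65504.0)     # largest finite fp16 value, representable exactly
overflow = to_fp16(70000.0)    # beyond the fp16 range: becomes inf
underflow = to_fp16(1e-8)      # below the smallest fp16 subnormal: becomes 0.0
poisoned = overflow - overflow  # inf - inf is NaN, and NaN then spreads through the loss

print(largest, overflow, underflow, poisoned)
```

A single activation or gradient that leaves the representable range is enough to turn a whole loss into NaN, which is why mixed-precision training normally relies on loss scaling and on keeping sensitive reductions in fp32.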

Head to head — measured throughput

Same problem, same code, three engines of the same machine. Everything is measured here, on the laptop — nothing is projected.
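Measuring throughput fairly mostly comes down to three habits: a fixed workload, a few warm-up iterations, and waiting for the device to finish before stopping the clock. The harness below is a generic sketch under those assumptions, not the project's benchmark script; `measure_throughput` and its parameters are hypothetical names, and the dummy step stands in for a real training step.

```python
import time


def measure_throughput(step, batch_size, n_steps=20, n_warmup=3, sync=None):
    """Return items per second for a fixed workload of n_steps calls to step().

    sync, if given, is called to wait for asynchronous device work to finish.
    """
    for _ in range(n_warmup):       # warm-up: caches, lazy initialization, JIT compilation
        step()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(n_steps):
        step()
    if sync is not None:
        sync()                      # drain queued device work before stopping the clock
    elapsed = time.perf_counter() - start
    return n_steps * batch_size / elapsed


def dummy_step():
    """Stand-in for one training step on a batch."""
    return sum(i * i for i in range(10_000))


rate = measure_throughput(dummy_step, batch_size=8)
print(f"{rate:.0f} items/s")
```

On a GPU the `sync` argument matters: kernel launches are asynchronous, so one would pass something like `torch.cuda.synchronize` in PyTorch, or call `block_until_ready()` on the step's output in JAX. Without it the timer measures how fast work is queued, not how fast it is done.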

The JAX port — comparing frameworks, not just engines

A direct extension of the school's Days 5 and 7 (Python on CPU, then on GPU): the same segmentation task is reimplemented in JAX/Flax (src/senland_jax/) to compare frameworks on top of engines — same Senegal areas, same IoU yardstick. The model is a from-scratch U-Net (GroupNorm, 7.8 M parameters, no ImageNet pretraining); the training step is a pure function compiled with jax.jit and differentiated with jax.value_and_grad.
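The "pure function" style of a JAX training step looks roughly like the sketch below. It is a toy under stated assumptions, not the code in src/senland_jax/: a linear model and plain SGD stand in for the U-Net and its optimizer, and the names `loss_fn` and `train_step` are hypothetical. The sketch pins JAX to the CPU through JAX_PLATFORMS so it runs on any machine.

```python
import os

os.environ.setdefault("JAX_PLATFORMS", "cpu")  # assumption: pin to CPU so the sketch runs anywhere

import jax
import jax.numpy as jnp

LEARNING_RATE = 0.1


def loss_fn(params, x, y):
    """Mean squared error of a linear model; a stand-in for the segmentation loss."""
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)


@jax.jit
def train_step(params, x, y):
    """Pure function: (params, batch) -> (new params, loss). No hidden state is mutated."""
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    new_params = jax.tree_util.tree_map(lambda p, g: p - LEARNING_RATE * g, params, grads)
    return new_params, loss


# Synthetic regression data with a known solution.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (64, 3))
y = x @ jnp.array([1.0, -2.0, 0.5]) + 0.3

params = {"w": jnp.zeros(3), "b": jnp.zeros(())}
losses = []
for _ in range(100):
    params, loss = train_step(params, x, y)
    losses.append(float(loss))
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.6f}")
```

Because the step returns new parameters instead of updating them in place, `jax.jit` can trace it once and hand the whole computation to XLA; the Python loop only dispatches the compiled function.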

Framework | Engine | mIoU | Throughput
PyTorch: U-Net ResNet-34, ImageNet | GPU GTX 1650 | 0.62 | 470 patch/s
JAX: U-Net GroupNorm, from scratch | GPU GTX 1650 | 0.57 | 211 patch/s
JAX: same model | CPU i5-10300H | | 13 patch/s

Reading these numbers honestly:

  • mIoU 0.57 vs 0.62 — the gap is the ImageNet-pretrained encoder on the PyTorch side; the JAX model trains from scratch. Per-class IoU still tracks closely (permanent water 0.90 vs 0.91, mangroves 0.77 vs 0.78).
  • The "one code, two engines" story holds in JAX too: the same jitted code runs ≈ ×16 faster on GPU than on CPU — here via device placement (JAX_PLATFORMS) rather than hw.py.

Figure: JAX, the same jitted code on two engines (patch/s). GPU · GTX 1650: 211 patch/s; CPU · i5-10300H: 13 patch/s. Same JAX code, device placement via JAX_PLATFORMS: ≈ ×16 on GPU.

The fair benchmark — identical model in both frameworks

Comparing a pretrained ResNet-34 to a small hand-written U-Net is not a fair race. scripts/bench_unet.py therefore builds the strictly identical U-Net (GroupNorm, 11 classes) in both frameworks and times the full training step (forward + CE+Dice + backward + AdamW) on a device-resident batch, isolating the framework/compiler from the architecture. GTX 1650, fp32:

Mode | batch 8 | batch 16
JAX, naive (per-step jit) | 246 |
PyTorch, eager | 334 |
PyTorch, torch.compile | 356 | 460
JAX, fused multi-step (lax.fori_loop) | 371 | 463

(patches/s; higher is better)
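The CE+Dice objective named in the benchmark description combines a per-pixel cross-entropy with a region-overlap term. The sketch below shows one common formulation in plain Python. The equal weighting, the smoothing constant and the averaging of Dice over all classes are assumptions, since the text does not specify them, and the function names are hypothetical.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class over all pixels."""
    return -sum(math.log(p[t]) for p, t in zip(probs, labels)) / len(labels)


def soft_dice_loss(probs, labels, n_classes, eps=1e-6):
    """One minus the soft Dice coefficient, averaged over classes."""
    dice = []
    for c in range(n_classes):
        intersection = sum(p[c] for p, t in zip(probs, labels) if t == c)
        predicted = sum(p[c] for p in probs)
        actual = sum(1 for t in labels if t == c)
        dice.append((2 * intersection + eps) / (predicted + actual + eps))
    return 1 - sum(dice) / n_classes


def ce_dice(probs, labels, n_classes, dice_weight=1.0):
    """Combined loss; dice_weight = 1.0 is an assumed, not a documented, setting."""
    return cross_entropy(probs, labels) + dice_weight * soft_dice_loss(probs, labels, n_classes)


# Four pixels, two classes: per-pixel class probabilities and true labels (toy values).
labels = [0, 0, 1, 1]
confident = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
hesitant = [[0.6, 0.4], [0.5, 0.5], [0.5, 0.5], [0.4, 0.6]]
print(ce_dice(confident, labels, 2), ce_dice(hesitant, labels, 2))
```

Cross-entropy scores each pixel on its own, while Dice scores each class as a region, so the combination is often chosen when some classes cover few pixels.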

Figure: Identical model, batch 8: the framework alone (patch/s). JAX naive (per-step jit): 246; PyTorch eager: 334; torch.compile: 356; JAX fused (fori_loop): 371. Identical GroupNorm U-Net in both frameworks, full training step, GTX 1650 fp32.

Takeaways:

  • Naive JAX (one dispatch per step) is the slowest. Once its real levers are engaged — jit, donate_argnums (buffer reuse), device-resident inputs, and above all step fusion (lax.fori_loop: 20 steps for the cost of one dispatch) — JAX edges past torch.compile at batch 8 (+4%) and ties it at batch 16.
  • At an identical model, fully-tuned JAX ≈ fully-tuned PyTorch on this card. The earlier ~2× gap was the architecture (smp ResNet-34 vs a hand-written U-Net), not the framework. XLA's advantage would widen on TPUs, larger batches/models, or more fusable graphs — none of which a 4 GB consumer GPU exercises.
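The difference between per-step dispatch and step fusion can be sketched as follows. This is a toy under stated assumptions, not scripts/bench_unet.py: a linear model with plain SGD replaces the U-Net and AdamW, the names `sgd_step`, `step_jit` and `fused_steps` are hypothetical, and JAX is pinned to the CPU so the sketch runs anywhere. On backends that do not support buffer donation, JAX may emit a warning about unused donated buffers; the result is unaffected.

```python
import os

os.environ.setdefault("JAX_PLATFORMS", "cpu")  # assumption: pin to CPU so the sketch runs anywhere

from functools import partial

import jax
import jax.numpy as jnp

LEARNING_RATE = 0.1


def loss_fn(w, x, y):
    """Mean squared error of a linear model; a stand-in for the real loss."""
    return jnp.mean((x @ w - y) ** 2)


def sgd_step(w, x, y):
    """One plain SGD update."""
    return w - LEARNING_RATE * jax.grad(loss_fn)(w, x, y)


# Naive: one compiled call, and therefore one dispatch from Python, per step.
step_jit = jax.jit(sgd_step)


# Fused: n_steps updates inside a single compiled call; the input buffer is donated for reuse.
@partial(jax.jit, static_argnames="n_steps", donate_argnums=0)
def fused_steps(w, x, y, n_steps):
    return jax.lax.fori_loop(0, n_steps, lambda _, w_i: sgd_step(w_i, x, y), w)


x = jax.random.normal(jax.random.PRNGKey(0), (64, 3))
y = x @ jnp.array([1.0, -2.0, 0.5])

w_naive = jnp.zeros(3)
for _ in range(20):                      # 20 dispatches
    w_naive = step_jit(w_naive, x, y)

w_fused = fused_steps(jnp.zeros(3), x, y, n_steps=20)   # 1 dispatch, fresh donated buffer

print(w_naive, w_fused)
```

Both paths compute the same 20 updates. The fused version simply keeps the loop on the device, which matters most when each step is short enough for dispatch overhead to show, as with small batches on a modest GPU.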

Honest reproduction pitfall: jax[cuda12] and PyTorch pin different nvidia-cudnn-cu12 versions, so in one shared venv only one of the two has a working GPU at a time. Use two separate environments.

Built with the Gray Scott School techniques

Every engineering brick of SenLand reuses a technique from the Gray Scott School (CINERI) and applies it to deep learning. The through-line — one code, CPU then GPU — is the very idea of Kokkos.

Brick | School day
One code, two backends | Day 4 · Kokkos → here hw.py (PyTorch) and JAX_PLATFORMS (JAX)
Frameworks & compilers | Days 5 & 7 · Python/JAX: jit, XLA, step fusion (fori_loop)
Multicore CPU | Day 1 · parallelism + Day 2 · TBB: shared memory, strong scaling
GPU SIMT / CUDA | Day 2 · GPU + Day 3 · CUDA: massive parallelism
Benchmark & timing | Day 2 · fixed workload, img/s throughput
Vectorization / SIMD | Day 1 + Day 6 · EVE: mixed-precision (AMP) analogue
Floating-point precision | Day 3 · fp32/fp16: explains the AMP NaN on GTX 1650
I/O & data | Day 2 · HDF5 → here geo I/O Sentinel-2 / WorldCover
Containers & repro | pixi / Apptainer: fixed seeds, versioned experiments

SenLand is an open, reproducible project: every number on this page is backed by a figure committed to the repository. The code runs without a GPU; with one, everything simply goes ~4× faster.

Copyright © 2026