The Hyperspectral Representation That Keeps Improving
Back in 2022, during my postdoc, I presented a small idea at the 18th International Conference on Mineral Processing and Geometallurgy: treat a hyperspectral mineral image like a document. Each pixel or region is a document, the recurring spectral patterns are its vocabulary, and an unsupervised topic model — LDA and its descendants — discovers mineral “topics” with no labelled training data at all. That label-free property is the whole point: it’s what lets the method travel to a new ore body or a new sensor without someone first hand-drawing ground-truth masks. Three years later that idea is a live platform, and it finally has a clean result worth writing down.
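To make the pixel-as-document idea concrete, here is a minimal sketch, not the platform's actual pipeline. It assumes the simplest possible encoding: each band is a "word" and the quantized reflectance in that band is its count. The function names, the 16-level quantization, and the random stand-in cube are all illustrative assumptions; the topic model is scikit-learn's `LatentDirichletAllocation`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation


def spectra_to_counts(cube, n_levels=16):
    """Turn an (H, W, B) cube into an (H*W, B) document-term matrix.

    Each pixel is a document, each band is a word, and the quantized
    reflectance is the word count. This encoding is an assumption made
    for illustration, not the one used in the original work.
    """
    h, w, b = cube.shape
    flat = cube.reshape(h * w, b).astype(float)
    lo, hi = flat.min(), flat.max()
    scaled = (flat - lo) / (hi - lo + 1e-12)  # map to [0, 1]
    return np.rint(scaled * (n_levels - 1)).astype(int)


def fit_topics(counts, n_topics=8, seed=0):
    """Fit LDA and return the model plus per-pixel topic mixtures."""
    lda = LatentDirichletAllocation(
        n_components=n_topics, random_state=seed, max_iter=20
    )
    theta = lda.fit_transform(counts)  # shape (n_pixels, n_topics)
    return lda, theta


# Random stand-in for a real scene: 10 x 10 pixels, 30 bands.
rng = np.random.default_rng(0)
cube = rng.random((10, 10, 30))

counts = spectra_to_counts(cube)
lda, theta = fit_topics(counts, n_topics=4)
topic_map = theta.argmax(axis=1).reshape(10, 10)  # dominant topic per pixel
```

No labels appear anywhere in that sketch, which is the property that matters: the only inputs are the cube and the number of topics.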
The question the original paper only gestured at is the one I think actually matters. Everyone tunes the model — LDA versus ProdLDA versus ETM, how many topics, which priors. Almost nobody asks the upstream question: which representation of the spectrum should the topic model even see? Raw bands? A wavelet transform? A learned dictionary? A UMAP embedding? Each one reshapes what “co-occurrence” means before the model runs a single iteration.
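A small sketch of why the upstream choice matters: the same pixels pushed through three toy representations give three different document-term matrices before any topic model runs. These three transforms (raw, first derivative, unit norm) are stand-ins I picked for brevity. They are not the V-numbered recipes from the sweep, and the quantization helper is hypothetical.

```python
import numpy as np


def raw_bands(spectra):
    # Identity: the topic model sees reflectance as-is.
    return spectra


def first_derivative(spectra):
    # Band-to-band differences emphasize absorption edges over brightness.
    return np.diff(spectra, axis=1)


def unit_norm(spectra):
    # Dividing by each pixel's L2 norm removes overall illumination scaling.
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    return spectra / np.maximum(norms, 1e-12)


# Illustrative registry; names are assumptions, not the post's recipe IDs.
REPRESENTATIONS = {
    "raw": raw_bands,
    "derivative": first_derivative,
    "unit_norm": unit_norm,
}


def to_counts(features, n_levels=16):
    """Quantize each feature column into integer 'word counts'."""
    lo = features.min(axis=0, keepdims=True)
    hi = features.max(axis=0, keepdims=True)
    scaled = (features - lo) / np.maximum(hi - lo, 1e-12)
    return np.rint(scaled * (n_levels - 1)).astype(int)


rng = np.random.default_rng(0)
spectra = rng.random((200, 30))  # 200 pixels, 30 bands (random stand-in)

doc_term = {name: to_counts(fn(spectra)) for name, fn in REPRESENTATIONS.items()}
for name, matrix in doc_term.items():
    print(name, matrix.shape)
```

Even the vocabulary size changes (the derivative drops one column), so "which words co-occur" is decided here, not inside the model.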
So I built the sweep — nineteen spectral representations (V1 through V20), each scored on a multi-axis evaluation battery across four topic-model backbones (LDA, HDP, ProdLDA, ETM) and extended across topic counts Q = 8, 16, 32, on six labelled scenes with Indian Pines as the headline.
The standout is V20, the MI-weighted recipe. On the topic–label coupling axis (F-7 NMI) under LDA, it rises 0.520 → 0.534 → 0.563 across Q = 8/16/32.
At Q = 8 it actually trails V12; then the ranking inverts, and by Q = 32 V20 leads V12 by +0.030, a lead that holds on 5 of 6 scenes. MI weighting front-loads the bands that carry mineral-discriminative signal, so the extra topics have somewhere useful to go instead of fragmenting noise.
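For readers who want to see what a topic–label coupling score looks like in code, here is a hedged sketch of normalized mutual information between each pixel's dominant topic and its ground-truth class. Two assumptions are mine: that pixels are hard-assigned to their highest-weight topic, and that the normalizer is the arithmetic mean of the two entropies (the same default scikit-learn uses). The exact definition of F-7 may differ.

```python
import numpy as np


def nmi(labels_a, labels_b):
    """Normalized mutual information with arithmetic-mean normalization."""
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]

    # Joint distribution over (cluster in a, cluster in b).
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()

    pa = joint.sum(axis=1)
    pb = joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))

    ha = -np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
    hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
    denom = 0.5 * (ha + hb)
    # Both partitions trivial (one cluster each): treat as perfect agreement.
    return float(mi / denom) if denom > 0 else 1.0


# Toy example: six pixels, three classes, three topics (made-up mixtures).
labels = np.array([0, 0, 1, 1, 2, 2])
theta = np.array([
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.2, 0.6],
])
dominant = theta.argmax(axis=1)  # hard assignment per pixel
score = nmi(dominant, labels)
```

Labels enter only at scoring time. Topic indices are arbitrary, so a metric like NMI that ignores the naming of clusters is the natural fit: in the toy case the topics are a relabelling of the classes and the score is 1.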
Most recipes plateau, which is the boring expected behaviour: past some point, extra topics just split hairs at finer granularity with no real gain. V20 doesn’t — it gets more separable under LDA precisely where others saturate. That’s the genuinely interesting finding: not “topic models work on spectra” (we knew that) but “the representation you feed them decides whether more capacity helps or just fragments noise.” And it isn’t the whole story — V8 (NFINDR endmembers) is the more portable recipe, leading across all four backbones and staying reliable across reseeds, which is what you’d actually reach for when the backbone is uncertain. V20 is the pick when LDA is fixed and you can afford to scale Q.
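The "plateau versus keeps improving" distinction is easy to check from the per-Q scores. The sketch below computes the gain for each doubling of Q using the three V20 NMI values quoted above; the helper name is hypothetical.

```python
def marginal_gains(scores_by_q):
    """Score improvement between consecutive topic counts."""
    qs = sorted(scores_by_q)
    return {
        (lo, hi): round(scores_by_q[hi] - scores_by_q[lo], 3)
        for lo, hi in zip(qs, qs[1:])
    }


# V20 F-7 NMI under LDA, as reported in this post.
v20 = {8: 0.520, 16: 0.534, 32: 0.563}

gains = marginal_gains(v20)
print(gains)  # {(8, 16): 0.014, (16, 32): 0.029}
```

A plateauing recipe would show gains shrinking toward zero. For V20 the second doubling adds roughly twice what the first did.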
One honesty note, because it’s the kind of thing that’s easy to bury. An earlier version of this — on the live web app and in my own notes — claimed a cleaner “triple-axis win”, including the F-1 classification axis and a clean F-2 coherence win. An internal audit knocked both down: F-1 is a tie (every recipe lands ~0.86–0.92, and on Indian Pines V2 edges V20), and on F-2 coherence V20 ties V12 rather than beating it. So the claim is now the narrow, true one — V20 is the LDA Q-scaling peak on topic–label coupling — not “wins everything.” A research surface that overclaims is worse than one that’s a little behind, and the correction belongs in the open, not in a footnote.
The part I’m quietly proudest of is the most mundane: the sweep answers in production. A Q-extension API now serves topic counts at Q = 8 / 16 / 32 behind a React/Vite frontend on a FastAPI backend, with a companion manuscript documenting the methodology. The 2022 conference slide is an endpoint you can click. Live at lda-hsi.fasl-work.com.
