Mathematical Foundations¶
This page explains the model that spectral-library implements.
It answers four questions:
- how sensor bands are simulated from hyperspectral library rows
- how nearest neighbors are retrieved segment by segment
- how target-sensor reflectance is produced from the retrieved spectra
- how the regression benchmark is defined
Use this page together with:
- Getting Started for the workflow
- Python API Reference for the public objects
- Prepared Runtime Contract for the runtime standard
Model Summary¶
- Forward model
Every sensor band is simulated by integrating hyperspectral reflectance against a normalized resampled SRF.
- Retrieval model
Each segment uses nearest-neighbor search in source-sensor feature space, then averages the matched hyperspectral rows.
- Output model
Retrieved hyperspectral segments are either convolved to a target sensor or
blended into a full 400-2500 nm spectrum.
Problem Setup¶
The package starts from a row-aligned hyperspectral library sampled on the canonical integer grid:
Let:
- \(N\) be the number of library rows
- \(L = 2101\) be the number of wavelengths on the canonical grid
- \(H \in \mathbb{R}^{N \times L}\) be the hyperspectral reflectance matrix
- \(H_i \in \mathbb{R}^{L}\) be the \(i\)th library spectrum
- \(q\) be one observed source-sensor reflectance vector
The current runtime splits the spectrum into two overlapping segments:
VNIR = 400-1000 nmSWIR = 800-2500 nm
The overlap is intentional. It preserves continuity near the detector
transition and lets the system reconstruct a stable full 400-2500 nm
spectrum.
For retrieval, the SWIR query can also reuse the semantic nir source band
when that band exists in the sensor schema. This gives the SWIR neighbor search
one bridge feature near the VNIR-SWIR transition instead of relying only on
swir1 and swir2.
Sensor Forward Model¶
For each sensor band, the package uses the resampled relative spectral response (SRF) as a discrete weighting function on the canonical wavelength grid.
For sensor \(s\), band \(b\), and library spectrum \(H_i\), the simulated band reflectance is:
where:
- \(r_{s,b,\ell}\) is the resampled SRF weight for wavelength index \(\ell\)
- the denominator normalizes the band response so the output stays in reflectance units
This is the same forward model used in:
build_mapping_library(...)when precomputing source-sensor matrices- target-sensor simulation during mapping
- the regression baseline inside
benchmark_mapping(...)
Prepared Runtime Matrices¶
The prepared runtime stores three aligned data products:
- the segment hyperspectral arrays
- the source-sensor simulation matrices
- the row metadata table that keeps every array on the same row index
For each source sensor \(s\) and segment \(p \in \{\text{vnir}, \text{swir}\}\), the runtime contains a precomputed matrix:
where \(B_{s,p}\) is the number of source bands assigned to segment \(p\).
This is the reason mapping stays fast at runtime: the package does not re-convolve the entire hyperspectral library for every query.
Segment-Wise Retrieval¶
Given one source query \(q\), the package separates the input into segment-level queries:
- \(q^{(\text{vnir})}\)
- \(q^{(\text{swir})}\)
Each segment is retrieved independently. Missing VNIR support does not distort the SWIR neighborhood, and vice versa.
This remains true even when the source query provides a complete visible-to-SWIR band vector. The runtime still performs two independent nearest-neighbor searches, one for the VNIR query and one for the NIR-SWIR query, and only combines the reconstructed segments after retrieval.
The SWIR retrieval query is slightly augmented: if the source sensor defines a
semantic nir band, that band is appended to the SWIR query feature set so the
retrieval sees one overlap feature across the detector boundary.
If \(B_p^{\text{valid}}\) is the valid source-band set for segment \(p\), the implementation uses a root-mean-square Euclidean distance:
Important implementation details:
- only valid source bands are scored
- the SWIR feature set includes
nirwhen available, then the SWIR bands - all valid bands currently use uniform weight
- top-\(k\) neighbors are ordered deterministically by distance and row index
The package also exposes multiple search backends for this same metric:
numpy: direct dense distance evaluationscipy_ckdtree: SciPycKDTreesearch, withknn_eps > 0allowing an approximate shortlistfaiss: FAISS HNSW searchpynndescent: PyNNDescent approximate searchscann: ScaNN approximate search
The backend changes how the shortlist is found, not the metric itself. The returned neighbors are still re-ordered by exact RMS distance before the estimator runs.
For faiss, pynndescent, and scann, the prepare step can also persist
full-feature ANN indexes for each prepared source sensor and segment. Those
saved indexes are only used when a query uses the full segment feature set and
the full candidate pool. Queries with missing bands or per-query exclusions
fall back to on-the-fly search in the reduced feature space.
Pre-sorting the full library rows by global spectral similarity usually does not improve this retrieval. The runtime searches two different segment spaces, and each query may activate a different valid-band subset, so there is no single fixed row order that makes the real KNN computation cheap. Indexing or tree-based search is the useful acceleration path instead.
If a segment has fewer than min_valid_bands valid inputs, that segment is
reported as unavailable instead of forcing a low-information retrieval.
Neighbor Estimator¶
Let \(\mathcal{N}_k^{(p)}\) be the top-\(k\) nearest neighbors for segment \(p\). The default retrieved hyperspectral segment is the unweighted neighbor mean:
This is intentionally simple in v1:
- it is stable
- it has no extra tuning parameter
- it exposes the retrieval behavior directly in diagnostics
- it is query-centric, so neighbors are not required to be mutually similar or from the same landcover class
An optional estimator mode is also available:
distance_weighted_mean
For non-zero distances, the implementation uses inverse-distance weights
\(w_i = 1 / d_i\). If one or more neighbors have distance 0, only those exact
matches are averaged.
simplex_mixture
This estimator keeps the same top-\(k\) shortlist, but then solves a constrained local mixture fit in source-band space:
subject to:
So the reconstructed segment remains a convex combination of the retrieved library spectra. In practice, this makes the shortlist more interpretable: the diagnostics now expose both the KNN ranking and the fitted estimator weights.
The returned diagnostics carry the chosen neighbor_estimator, plus:
- segment query band ids
- query band values and validity masks
- neighbor identities and neighbor distances
- neighbor weights and source-fit RMSE
- heuristic confidence scores and their components
- the selected neighbors' simulated source-band values for the same segment
The current confidence score is a bounded heuristic, not a calibrated probability. It combines:
- valid-band coverage
- mean neighbor distance
- source-space fit RMSE after estimator weighting
- estimator weight concentration
Current production interpretation policy:
high/acceptforconfidence_score >= 0.85medium/manual_reviewfor0.60 <= confidence_score < 0.85low/rejectforconfidence_score < 0.60
These thresholds are intentionally conservative. They are useful as QA routing rules, not as a substitute for benchmark validation on the target sensor/library regime.
Target-Sensor Mapping¶
Once a segment-level hyperspectral reconstruction is available, target bands in the same segment are simulated with the same normalized SRF integral:
This produces the target_sensor output mode.
If only one segment is retrievable, target bands assigned to that segment are still emitted. The public batch CLI leaves the unmapped target-band cells blank for the missing segment.
Full-Spectrum Assembly¶
When both segment reconstructions are available, the package assembles the full
400-2500 nm spectrum piecewise:
with a linear overlap weight:
So the blend is fully VNIR at 800 nm, fully SWIR at 1000 nm, and linear in
between.
Regression Baseline¶
The benchmark report compares retrieval against a traditional regression baseline fitted on the prepared-library rows.
For the source design matrix \(X^{(src)}\) and target matrix \(Y^{(tgt)}\), the baseline model is:
where:
- \(\alpha\) is the intercept vector
- \(B\) is the regression coefficient matrix
- \(E\) is the residual matrix
benchmark_mapping(...) fits this linear model on the train split and compares
it against retrieval on a held-out test split from the same prepared runtime.
Error Metrics¶
The benchmark reports three families of metrics.
For predictions \(\hat{y}_{j,b}\) and truth \(y_{j,b}\) over test rows \(j = 1, \ldots, M\):
Root-mean-square error:
Mean absolute error:
Signed bias:
The report also returns the mean of each metric across all target bands.
Why The Method Is Physically Constrained¶
The package is not doing an unconstrained band-to-band interpolation. It uses the hyperspectral library as a prior at every stage:
- source observations retrieve spectrally similar library rows
- the mapped target output is derived from retrieved hyperspectral structure
- target reflectance is generated through the target sensor's own SRFs
That gives the method two useful properties:
- it stays tied to spectra that already exist in the library manifold
- it can reconstruct hyperspectral outputs and target-sensor outputs from the same retrieval step
Practical Interpretation¶
- Larger
ksmooths the reconstruction and reduces neighbor sensitivity. - Smaller
kkeeps sharper library detail but can increase local variance. - Missing source bands only reduce information within their own segment.
- Full-spectrum output is available only when both VNIR and SWIR retrievals succeed.
For worked examples using official MODIS, Sentinel-2A, Landsat 8, and Landsat 9 response functions, see Official Sensor Examples.