--- title: "Selecting representative calibration samples" author: - name: Antoine Stevens and Leonardo Ramirez-Lopez email: ramirez.lopez.leo@gmail.com date: today bibliography: prospectr.bib csl: elsevier-harvard.csl format: html: toc: true toc-depth: 3 number-sections: true toc-location: left code-overflow: wrap smooth-scroll: true html-math-method: mathjax vignette: > %\VignetteIndexEntry{Selecting representative calibration samples} %\VignetteEncoding{UTF-8} %\VignetteEngine{quarto::html} --- :::: {.columns} ::: {.column width="70%"} > *What is the meaning of the adjective representative? Answer: The word has no meaning.* -- [@deming1986out] ::: ::: {.column width="30%"} ::: :::: # Introduction Calibration models are usually developed on a *representative* portion of the data (training set) and validated on the remaining samples (test/validation set). There are several approaches for selecting calibration samples: - random selection (e.g. `base::sample`) - stratified random sampling on percentiles of the response $y$ - selection based on the spectral data The `prospectr` package provides functions for the third approach. The available functions are listed in @tbl-sampling. ```{r} #| label: tbl-sampling #| tbl-cap: "Calibration sampling functions available in the package." #| echo: false #| message: false library(prospectr) sampling_fns <- data.frame( Function = c("`naes`", "`kenStone`", "`duplex`", "`puchwein`", "`shenkWest`", "`honigs`"), Algorithm = c( "k-means sampling [@naes2002]", "Kennard-Stone (CADEX) sampling [@kennard1969]", "DUPLEX sampling [@snee1977]", "Puchwein sampling [@puchwein1988]", "SELECT algorithm [@shenk1991]", "Honigs sampling [@honigs1985]" ) ) knitr::kable(sampling_fns) ``` Here we use the Near-Infrared (NIR) dataset (`NIRsoil`) included in the package [@fernandez2008] to demonstrate how the calibration sampling functions work: ```{r} #| label: load-NIRsoil #| message: false library(prospectr) data(NIRsoil) # NIRsoil is a data.frame with 825 observations and 5 variables: # Nt (Total Nitrogen), Ciso (Carbon), CEC (Cation Exchange Capacity), # train (0/1 indicator for training (1) and validation (0) samples), # spc (spectral matrix) str(NIRsoil) ``` # $k$-means sampling (`naes`) The $k$-means sampling algorithm partitions the data into $n$ clusters and selects one representative sample per cluster. Formally, to select a subset of $n$ samples $X_{tr} = \left\{ {x_{tr}}_j \right\}_{j=1}^{n}$ from a set of $N$ samples $X = \left\{ x_i \right\}_{i=1}^{N}$ (where $N > n$), the algorithm proceeds as follows (@fig-naes): 1. Perform $k$-means clustering of $X$ using $n$ clusters. 2. Extract the $n$ centroids $c$ (prototypes). Alternatively, the selected sample can be the one farthest from the data centre, or a random draw within the cluster — see the `method` argument in `naes`. 3. Compute the distance from each sample to each centroid $c$. 4. For each $c$, allocate to $X_{tr}$ the closest sample in $X$. ```{r} #| label: fig-naes #| fig-cap: "Selection of representative samples by k-means sampling on a #| synthetic grid dataset (left, $n = 10$) and on the NIRsoil dataset in #| PC space (right, $n = 5$)." #| fig-height: 4.5 #| fig-width: 8 #| fig-align: center #| echo: true # Synthetic grid dataset with jitter set.seed(1002) grid_xy <- expand.grid(x1 = 1:30, x2 = 1:30) X_synthetic <- data.frame( x1 = grid_xy$x1 + rnorm(nrow(grid_xy), 0, 0.3), x2 = grid_xy$x2 + rnorm(nrow(grid_xy), 0, 0.3) ) kms_synthetic <- naes(X = X_synthetic, k = 40, iter.max = 100) # NIRsoil dataset in PC space kms <- naes(X = NIRsoil$spc, k = 5, pc = 2, iter.max = 100) par(mfrow = c(1, 2)) plot( X_synthetic, col = rgb(0, 0, 0, 0.3), pch = 19, main = "k-means (synthetic)" ) grid() points(X_synthetic[kms_synthetic$model, ], col = "red", pch = 19) legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("All samples", "Selected"), pch = 19, col = c(rgb(0, 0, 0, 0.3), "red"), bg = rgb(1, 1, 1, 0.8) ) plot( kms$pc, col = rgb(0, 0, 0, 0.3), pch = 19, main = "k-means (NIRsoil)" ) grid() points(kms$pc[kms$model, ], col = "red", pch = 19) legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("All samples", "Selected"), pch = 19, col = c(rgb(0, 0, 0, 0.3), "red"), bg = rgb(1, 1, 1, 0.8) ) par(mfrow = c(1, 1)) ``` # Kennard-Stone sampling (`kenStone`) To select a subset of $n$ samples $X_{tr} = \left\{ {x_{tr}}_j \right\}_{j=1}^{n}$ from a set of $N$ samples $X = \left\{ x_i \right\}_{i=1}^{N}$ (where $N > n$), the Kennard-Stone algorithm [@kennard1969] proceeds as follows (@fig-kenstone): 1. Find the pair of samples ${x_{tr}}_1$ and ${x_{tr}}_2$ in $X$ that are farthest apart, allocate them to $X_{tr}$, and remove them from $X$. 2. Find the sample ${x_{tr}}_3$ in $X$ with the maximum dissimilarity to $X_{tr}$, where dissimilarity is defined as the minimum distance from any sample already in $X_{tr}$. Allocate ${x_{tr}}_3$ to $X_{tr}$ and remove it from $X$. 3. Repeat step 2 a further $n - 3$ times to select ${x_{tr}}_4, \ldots, {x_{tr}}_n$. The algorithm produces a calibration set with uniform coverage of the spectral space. The distance metric can be either Euclidean or Mahalanobis. A known limitation is that the algorithm is prone to selecting outliers [@ramirez2014]; outlier screening before sample selection is therefore recommended. ```{r} #| label: fig-kenstone #| fig-cap: "Kennard-Stone sampling on a synthetic two-variable dataset (left, #| $n = 40$) and on the NIRsoil dataset in Mahalanobis PC space (right, #| $n = 20$)." #| fig-height: 4.5 #| fig-width: 8 #| echo: true # Synthetic dataset: grid with jitter grid_xy <- expand.grid(x1 = 1:30, x2 = 1:30) set.seed(1014) X_synthetic <- data.frame( x1 = grid_xy$x1 + rnorm(nrow(grid_xy), 0, 0.3), x2 = grid_xy$x2 + rnorm(nrow(grid_xy), 0, 0.3) ) ken <- kenStone(X_synthetic, k = 40) # NIRsoil dataset — Mahalanobis distance in PC space ken_mahal <- kenStone(X = NIRsoil$spc, k = 20, metric = "mahal", pc = 2) par(mfrow = c(1, 2)) plot( X_synthetic, col = rgb(0, 0, 0, 0.3), pch = 19, main = "Kennard-Stone (synthetic)" ) grid() points(X_synthetic[ken$model, ], col = "red", pch = 19, cex = 1.2) legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("All samples", "Selected"), pch = 19, col = c(rgb(0, 0, 0, 0.3), "red"), bg = rgb(1, 1, 1, 0.8) ) plot( ken_mahal$pc[, 1], ken_mahal$pc[, 2], col = rgb(0, 0, 0, 0.3), pch = 19, xlab = "PC1", ylab = "PC2", main = "Kennard-Stone (NIRsoil)" ) grid() points( ken_mahal$pc[ken_mahal$model, 1], ken_mahal$pc[ken_mahal$model, 2], pch = 19, col = "red" ) legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("All samples", "Selected"), pch = 19, col = c(rgb(0, 0, 0, 0.3), "red"), bg = rgb(1, 1, 1, 0.8) ) par(mfrow = c(1, 1)) ``` When a subset of samples must be included in the calibration set a priori, the `init` argument allows the algorithm to be initialised with those indices, incorporating them before the iterative selection begins (@fig-kenstone-init). ```{r} #| label: kenstone-init #| echo: true initialization_ind <- c(486, 702, 722, 728) ken_mahal_init <- kenStone( X = NIRsoil$spc, k = 20, metric = "mahal", pc = 2, init = initialization_ind ) ken_mahal_init$model ``` ```{r} #| label: fig-kenstone-init #| fig-cap: "Kennard-Stone sampling with 4 forced initialisation samples (blue) #| and the remaining selected samples (red) in PC space." #| fig-height: 4.5 #| fig-width: 4 #| fig-align: center #| echo: true plot( ken_mahal_init$pc[, 1], ken_mahal_init$pc[, 2], col = rgb(0, 0, 0, 0.3), pch = 19, xlab = "PC1", ylab = "PC2", main = "Kennard-Stone with initialisation" ) grid() points( ken_mahal_init$pc[ken_mahal_init$model, 1], ken_mahal_init$pc[ken_mahal_init$model, 2], pch = 19, col = "red" ) points( ken_mahal_init$pc[initialization_ind, 1], ken_mahal_init$pc[initialization_ind, 2], pch = 19, cex = 1.5, col = "dodgerblue" ) legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("All samples", "Selected", "Initialisation"), pch = 19, col = c(rgb(0, 0, 0, 0.3), "red", "dodgerblue"), bg = rgb(1, 1, 1, 0.8) ) ``` # DUPLEX (`duplex`) The Kennard-Stone algorithm selects *calibration* samples only. When a *validation* set is also required, the DUPLEX algorithm [@snee1977] provides a suitable alternative. It is a modification of Kennard-Stone that assigns points alternately to the calibration and validation sets, ensuring both have similar coverage of the spectral space (@fig-duplex). ```{r} #| label: fig-duplex #| fig-cap: "Selection of 15 calibration and 15 validation samples with the #| DUPLEX algorithm on a synthetic grid dataset (left) and on the NIRsoil #| dataset in Mahalanobis PC space (right)." #| fig-height: 4.5 #| fig-width: 8 #| fig-align: center #| echo: true dup <- duplex(X = X_synthetic, k = 15) dup_nir <- duplex(X = NIRsoil$spc, k = 20, metric = "mahal", pc = 2) par(mfrow = c(1, 2)) plot( X_synthetic, col = rgb(0, 0, 0, 0.3), pch = 19, main = "DUPLEX (synthetic)" ) grid() points(X_synthetic[dup$model, ], col = "red", pch = 19) points(X_synthetic[dup$test, ], col = "dodgerblue", pch = 19) legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("All samples", "Calibration", "Validation"), pch = 19, col = c(rgb(0, 0, 0, 0.3), "red", "dodgerblue") ) plot( dup_nir$pc[, 1], dup_nir$pc[, 2], col = rgb(0, 0, 0, 0.3), pch = 19, xlab = "PC1", ylab = "PC2", main = "DUPLEX (NIRsoil)" ) grid() points(dup_nir$pc[dup_nir$model, 1], dup_nir$pc[dup_nir$model, 2], col = "red", pch = 19) points(dup_nir$pc[dup_nir$test, 1], dup_nir$pc[dup_nir$test, 2], col = "dodgerblue", pch = 19) legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("All samples", "Calibration", "Validation"), pch = 19, col = c(rgb(0, 0, 0, 0.3), "red", "dodgerblue") ) par(mfrow = c(1, 1)) ``` # SELECT algorithm (`shenkWest`) The SELECT algorithm [@shenk1991] is an iterative procedure that selects the sample with the maximum number of neighbours within a given distance threshold (`d.min`, default = 0.6), then removes those neighbours from the candidate pool. The number of selected samples is not fixed in advance but depends on `d.min` and the density of the data. The distance metric is the Mahalanobis distance divided by the number of PC dimensions used (@fig-shenk). ```{r} #| label: fig-shenk #| fig-cap: "Samples selected by the SELECT algorithm on a synthetic grid #| dataset (left, `d.min` = 0.1) and on the NIRsoil dataset in PC space (right, #| `d.min` = 0.6)." #| fig-height: 4.5 #| fig-width: 8 #| fig-align: center #| echo: true shenk_synthetic <- shenkWest(X = X_synthetic, d.min = 0.1) shenk <- shenkWest(X = NIRsoil$spc, d.min = 0.6, pc = 2) par(mfrow = c(1, 2), mar = c(5, 4, 6, 2)) plot( X_synthetic, col = rgb(0, 0, 0, 0.3), pch = 19, main = "SELECT (synthetic)" ) grid() points(X_synthetic[shenk_synthetic$model, ], col = "red", pch = 19) legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("All samples", "Selected"), pch = 19, col = c(rgb(0, 0, 0, 0.3), "red") ) plot( shenk$pc, col = rgb(0, 0, 0, 0.3), pch = 19, xlab = "PC1", ylab = "PC2", main = "SELECT (NIRsoil)" ) grid() points(shenk$pc[shenk$model, ], col = "red", pch = 19) legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("All samples", "Selected"), pch = 19, col = c(rgb(0, 0, 0, 0.3), "red") ) par(mfrow = c(1, 1)) ``` ## Puchwein algorithm (`puchwein`) The Puchwein algorithm [@puchwein1988] produces a calibration set with uniform coverage of the spectral space. A useful property is that it allows an objective determination of the required number of calibration samples through diagnostic plots. The data are first reduced by PCA and the Mahalanobis distance ($H$) from each sample to the centre of the matrix is computed. Samples are then sorted in decreasing order of $H$ and inter-sample distances in PC space are computed. The algorithm proceeds as follows (@fig-puchwein): 1. Define a limiting distance. 2. Find the sample with $\max(H)$. 3. Remove all samples within the limiting distance of the selected sample. 4. Return to step 2 and find the sample with $\max(H)$ among the remaining samples. 5. When no samples remain, return to step 1 and increase the limiting distance. ```{r} #| label: fig-puchwein #| fig-cap: "Samples selected by the Puchwein algorithm on a synthetic grid #| dataset (left) and on the NIRsoil dataset in PC space (right)." #| fig-height: 4.5 #| fig-width: 8 #| fig-align: center #| echo: true pu_synthetic <- puchwein(X = X_synthetic, k = 0.2) pu <- puchwein(X = NIRsoil$spc, k = 0.2, pc = 2) par(mfrow = c(1, 2), mar = c(5, 4, 6, 2)) plot( X_synthetic, col = rgb(0, 0, 0, 0.3), pch = 19, main = "Puchwein (synthetic)" ) grid() points(X_synthetic[pu_synthetic$model, ], col = "red", pch = 19) legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("All samples", "Selected"), pch = 19, col = c(rgb(0, 0, 0, 0.3), "red") ) plot( pu$pc, col = rgb(0, 0, 0, 0.3), pch = 19, xlab = "PC1", ylab = "PC2", main = "Puchwein (NIRsoil)" ) grid() points(pu$pc[pu$model, ], col = "red", pch = 19) legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("All samples", "Selected"), pch = 19, col = c(rgb(0, 0, 0, 0.3), "red") ) par(mfrow = c(1, 1)) ``` A distinctive feature of the Puchwein algorithm is the leverage diagnostic, which helps identify the optimal number of loops (and hence calibration samples). @fig-puchwein-leverage shows the difference between the theoretical and observed sum of leverages as a function of samples removed, and the number of samples retained per loop. ```{r} #| label: fig-puchwein-leverage #| fig-cap: "Puchwein leverage diagnostics: difference between theoretical and #| observed sum of leverages (top) and number of samples retained per loop #| (bottom). The first loop is typically optimal." #| fig-height: 5 #| fig-width: 6 #| fig-align: center #| echo: true par(mfrow = c(2, 1)) plot( pu$leverage$removed, pu$leverage$diff, type = "l", xlab = "Samples removed", ylab = "Difference (theoretical vs observed leverages)" ) plot( pu$leverage$loop, nrow(NIRsoil) - pu$leverage$removed, type = "l", xlab = "Loop", ylab = "Samples retained" ) par(mfrow = c(1, 1)) ``` ## Honigs sampling (`honigs`) The Honigs algorithm [@honigs1985] selects samples based on the magnitude of their absorption features. It operates on either absorbance or continuum-removed spectra. The algorithm proceeds as follows (@fig-honigs): 1. Select the sample with the largest absorption feature (in absolute value). 2. Subtract that absorption from all remaining spectra. 3. Repeat steps 1–2 until the desired number of samples $k$ is reached. ```{r} #| label: fig-honigs #| fig-cap: "All NIRsoil spectra (left) and the 10 samples selected by the #| Honigs algorithm with the wavelength bands used during selection marked #| (right)." #| fig-height: 4.5 #| fig-width: 8 #| fig-align: center #| echo: true ho <- honigs(X = NIRsoil$spc, k = 10, type = "A") wav <- as.numeric(colnames(NIRsoil$spc)) par(mfrow = c(1, 2), mar = c(5, 4, 6, 2)) matplot( wav, t(NIRsoil$spc), type = "l", lty = 1, lwd = 0.5, col = rgb(0, 0, 0, 0.1), xlab = "Wavelength (nm)", ylab = "Absorbance", main = "All spectra" ) matlines(wav, t(NIRsoil$spc[ho$model, ]), lty = 1, lwd = 1.5, col = "red") legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("All spectra", "Selected"), lty = 1, col = c(rgb(0, 0, 0, 0.3), "red") ) matplot( wav, t(NIRsoil$spc[ho$model, ]), type = "l", lty = 1, lwd = 1.5, xlab = "Wavelength (nm)", ylab = "Absorbance", main = "Selected spectra" ) abline(v = wav[ho$bands], lty = 2, col = "grey40") grid() legend( x = "top", inset = c(0, -0.15), xpd = TRUE, horiz = TRUE, bty = "n", legend = c("Selected spectra", "Bands used"), lty = c(1, 2), col = c("black", "grey40") ) par(mfrow = c(1, 1)) ``` # References {-}