Comparing PAI Methods and Validation Strategies
This vignette is for users who are comfortable with the basic workflow of mapAI and want to explore more advanced topics. Here, we will address two key questions that arise in any real-world project:
- Which PAI model (lm, rf, or gam) is best for my data?
- How do I reliably estimate the accuracy of my chosen model?
To answer these, we will use a real-world dataset from the 1925 Kastoria Cadastral Map, which is included with the package. We will compare the performance of the three main correction methods, alongside a simple Helmert transformation, and demonstrate the crucial difference between standard random cross-validation and spatial cross-validation.
# Load the necessary libraries
library(mapAI)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(patchwork)
Step 1: Load and Explore the Package Data
The mapAI package comes with two datasets from the Kastoria study:
- parcels: An sf object of cadastral polygons from the 1925 map.
- gcps: An sf object of the homologous points (GCPs) for this map, including the calculated dx and dy displacements.
Since this data is already included in the package, we can load it directly with the data() function.
# Load the built-in datasets
data(parcels)
data(gcps)
# Let's calculate the baseline error of the uncorrected map
# This "Identity RMSE" is our starting point.
identity_rmse <- sqrt(mean(gcps$dx^2 + gcps$dy^2))
print(paste("Initial average error (RMSE) of the Kastoria map:", round(identity_rmse, 2), "meters"))
#> [1] "Initial average error (RMSE) of the Kastoria map: 0.69 meters"
Step 2: Comparing Validation Strategies
A key challenge in spatial data science is spatial autocorrelation, which means that points closer together are more likely to be similar than points farther apart. Standard random cross-validation ignores this. It might randomly put a test point right next to a training point, leading to an overly optimistic (and misleadingly low) error estimate.
Spatial Cross-Validation (SCV) solves this by creating validation folds that are geographically separate blocks. This provides a more realistic estimate of how the model will perform when predicting for entirely new areas.
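Spatial folds can be built in several ways. Below is a minimal sketch of one common approach, clustering the GCP coordinates with k-means so that each fold is a geographically contiguous block; the exact fold construction used internally by assess_pai_model() may differ.
# Sketch: build spatial folds by k-means clustering of the GCP
# coordinates, so each fold is a contiguous block of points. This only
# illustrates the idea; assess_pai_model() builds its folds internally.
coords <- sf::st_coordinates(gcps)
set.seed(1)
block_id <- kmeans(coords, centers = 10)$cluster
table(block_id)  # number of GCPs per spatial block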
Let’s compare the results from both validation strategies for all four methods (lm, rf, gam, and helmert).
# Helper function to run validation for all models
validate_all_methods <- function(gcp_data, validation_type) {
  message(paste("\nRunning", validation_type, "cross-validation..."))

  # Validate each model type
  lm_results  <- assess_pai_model(gcp_data, method = "lm",      k_folds = 10, seed = 1, validation_type = validation_type)
  rf_results  <- assess_pai_model(gcp_data, method = "rf",      k_folds = 10, seed = 1, validation_type = validation_type)
  gam_results <- assess_pai_model(gcp_data, method = "gam",     k_folds = 10, seed = 1, validation_type = validation_type)
  hlm_results <- assess_pai_model(gcp_data, method = "helmert", k_folds = 10, seed = 1, validation_type = validation_type)

  # Combine and return the results
  rbind(lm_results, rf_results, gam_results, hlm_results)
}
# Run both random and spatial cross-validation
random_cv_results <- validate_all_methods(gcps, "random")
#>
#> Running random cross-validation...
#> Starting 10 -fold random cross-validation for method: lm on 300 complete points.
#> Creating random folds...
#> Processing Fold 1 of 10 ...
#> Training 'lm' model...
#> ...
#> Assessment complete.
#> Starting 10 -fold random cross-validation for method: rf on 300 complete points.
#> Creating random folds...
#> Processing Fold 1 of 10 ...
#> Training 'rf' model...
#> ...
#> Assessment complete.
#> Starting 10 -fold random cross-validation for method: gam on 300 complete points.
#> Creating random folds...
#> Processing Fold 1 of 10 ...
#> Training 'gam' model...
#> ...
#> Assessment complete.
#> Starting 10 -fold random cross-validation for method: helmert on 300 complete points.
#> Creating random folds...
#> Processing Fold 1 of 10 ...
#> Fitting Helmert transformation...
#> ...
#> Assessment complete.
spatial_cv_results <- validate_all_methods(gcps, "spatial")
#>
#> Running spatial cross-validation...
#> Starting 10 -fold spatial cross-validation for method: lm on 300 complete points.
#> Creating spatial folds...
#> Processing Fold 1 of 10 ...
#> Training 'lm' model...
#> ...
#> Assessment complete.
#> Starting 10 -fold spatial cross-validation for method: rf on 300 complete points.
#> Creating spatial folds...
#> Processing Fold 1 of 10 ...
#> Training 'rf' model...
#> ...
#> Assessment complete.
#> Starting 10 -fold spatial cross-validation for method: gam on 300 complete points.
#> Creating spatial folds...
#> Processing Fold 1 of 10 ...
#> Training 'gam' model...
#> ...
#> Assessment complete.
#> Starting 10 -fold spatial cross-validation for method: helmert on 300 complete points.
#> Creating spatial folds...
#> Processing Fold 1 of 10 ...
#> Fitting Helmert transformation...
#> ...
#> Assessment complete.
# Combine into one table for comparison
all_results <- rbind(random_cv_results, spatial_cv_results)
# Print the results table
knitr::kable(all_results, caption = "Comparison of Validation Results")
| Method  | ValidationType | Mean_RMSE_2D | SD_RMSE_2D |
|---------|----------------|--------------|------------|
| lm      | random         | 0.5247283    | 0.0494486  |
| rf      | random         | 0.5129937    | 0.0544782  |
| gam     | random         | 0.5012551    | 0.0449526  |
| helmert | random         | 0.5420174    | 0.0478073  |
| lm      | spatial        | 0.5600557    | 0.1610211  |
| rf      | spatial        | 0.6385157    | 0.1572618  |
| gam     | spatial        | 0.6390275    | 0.1043313  |
| helmert | spatial        | 0.5824065    | 0.1590879  |
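The pattern is consistent: spatial CV reports a higher error for every method. A quick way to quantify how optimistic random CV is per method is to take the difference of the two estimates (a short sketch using dplyr and the column names shown above):
# Per-method gap between spatial and random CV error: a positive value
# means random CV was optimistic for that method.
all_results %>%
  group_by(Method) %>%
  summarise(
    random_rmse  = Mean_RMSE_2D[ValidationType == "random"],
    spatial_rmse = Mean_RMSE_2D[ValidationType == "spatial"],
    optimism_gap = spatial_rmse - random_rmse
  )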
Now, let’s visualize these results to make the comparison clear.
ggplot(all_results, aes(x = Method, y = Mean_RMSE_2D, fill = ValidationType)) +
geom_col(position = "dodge") +
geom_errorbar(
aes(ymin = Mean_RMSE_2D - SD_RMSE_2D, ymax = Mean_RMSE_2D + SD_RMSE_2D),
width = 0.2, position = position_dodge(0.9)
) +
labs(
title = "Random vs. Spatial Cross-Validation Results",
subtitle = "Note how spatial CV provides a higher error estimate",
x = "Correction Model",
y = "Mean 2D RMSE (meters)",
fill = "Validation Type"
) +
theme_minimal()
Step 3: Train the Final Model and Analyze It
Based on our validation, we choose gam as our final model and train it on the entire set of GCPs.
# Train the final GAM model on all available data
pai_model_gam <- train_pai_model(gcps, method = "gam")
#> Training 'gam' model...
# Let's look at the statistical summary of our trained model
model_summary <- summary(pai_model_gam$model)
print(model_summary)
#>
#> Family: Multivariate normal
#> Link function:
#>
#> Formula:
#> dx ~ s(source_x, source_y)
#> <environment: 0x55f1462ae570>
#> dy ~ s(source_x, source_y)
#> <environment: 0x55f1462ae570>
#>
#> Parametric coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.05257 0.01722 3.053 0.00227 **
#> (Intercept).1 -0.31945 0.02119 -15.074 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df Chi.sq p-value
#> s(source_x,source_y) 23.845 27.45 234.5 <2e-16 ***
#> s.1(source_x,source_y) 7.624 10.63 156.8 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Deviance explained = 39.9%
#> -REML = -305.73 Scale est. = 1 n = 300
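The summary indicates one 2D smooth per displacement component. To see the fitted displacement surfaces themselves, mgcv's plot method draws one panel per smooth term (a sketch, assuming pai_model_gam$model is a standard mgcv gam object, as the summary above suggests):
# Plot the fitted smooth surfaces for dx and dy on one page; scheme = 2
# renders each 2D smooth as a heatmap with contour lines.
plot(pai_model_gam$model, pages = 1, scheme = 2)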
Step 4: Apply and Visualize the Final Correction
Finally, we apply the trained gam model to the Kastoria parcels data.
# Apply the model to the parcel polygons
corrected_parcels <- apply_pai_model(pai_model = pai_model_gam, map = parcels)
#> Applying PAI model to map features...
#> Calculating area of corrected polygons...
#> Correction complete.
# Inspect the output - note the new 'area_new' column
print("Original vs. Corrected Dataframe Head:")
#> [1] "Original vs. Corrected Dataframe Head:"
head(corrected_parcels)
#> Simple feature collection with 6 features and 3 fields
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: 268542.1 ymin: 4488641 xmax: 268607.7 ymax: 4488718
#> Projected CRS: GGRS87 / Greek Grid
#> KAK geometry area_old area_new
#> 1 151415 POLYGON ((268546.6 4488664,... 455.2082 [m^2] 453.3685 [m^2]
#> 2 151416 POLYGON ((268542.1 4488652,... 259.0621 [m^2] 257.9246 [m^2]
#> 3 151419 POLYGON ((268595.9 4488651,... 191.7725 [m^2] 191.1334 [m^2]
#> 4 151418 POLYGON ((268583.4 4488669,... 209.8621 [m^2] 209.2530 [m^2]
#> 5 151417 POLYGON ((268578.9 4488670,... 188.5859 [m^2] 187.9030 [m^2]
#> 6 151412 POLYGON ((268602.4 4488682,... 1352.2642 [m^2] 1348.8842 [m^2]
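The area_old and area_new columns make it easy to check how much the correction distorts parcel geometry. The sketch below summarises the relative area change across all parcels, dropping the m^2 units for a simple percentage:
# Relative change in parcel area after correction, in percent.
# as.numeric() drops the units class so we can work with plain numbers.
area_old <- as.numeric(corrected_parcels$area_old)
area_new <- as.numeric(corrected_parcels$area_new)
summary(100 * (area_new - area_old) / area_old)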
# Create the final comparison plot
ggplot() +
geom_sf(data = parcels, aes(color = "Original"), fill = "grey75", linetype = "dashed") +
geom_sf(data = corrected_parcels, aes(color = "Corrected"), fill = NA) +
scale_color_manual(
name = "Parcel Status",
values = c("Original" = "grey25", "Corrected" = "#e41a1c")
) +
labs(
title = "Positional Correction of 1925 Kastoria Parcels",
subtitle = "Overlay of original (dashed) and corrected (solid) polygons"
) +
coord_sf(datum = sf::st_crs(2100)) +
theme_minimal()
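As a last step, you will typically want the corrected layer outside R, for example in a desktop GIS. Writing it with sf works as usual; the filename below is just an example:
# Save the corrected parcels for use in a GIS. Any format supported by
# sf/GDAL will do; the GeoPackage filename here is illustrative.
sf::st_write(corrected_parcels, "kastoria_parcels_corrected.gpkg", delete_dsn = TRUE)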
This final plot shows the tangible result of our work: the corrected parcels are shifted and slightly reshaped, representing a more accurate historical reality. This vignette has demonstrated how to move from a basic workflow to a robust comparative analysis, giving you confidence in both your choice of model and its final accuracy.