Catalina Jerez
catalina.jerez@colorado.edu
The goal of this analysis is to estimate the underlying precipitation surface in India using sparse spatial observations. This is crucial for applications such as floodplain management, natural hazard mitigation, and climate modeling. The study utilizes regression models, crossvalidation, and spatial mapping techniques.
We’ll fit a Generalized Linear Model (and Local GLM) using binomial family. Then, let \(P_i\) be the precipitation at station \(i\). We define the binary variable \(B_{P_i}\) as:
\[ B_{P_i} = \begin{cases} 1 & \text{if } P_i \geq 150 \text{ cm} \\ 0 & \text{if } P_i < 150 \text{ cm} \end{cases} \]
Regressions will be executed with the fitBHSM() function, which returns:
Binned empirical plot.
Best fitted model.
MCMC (fit, samples, predicted values)
Graph for posterior distributions of parameters (nugget effect, sill variance) and covariables.
Diagnostics
Spatially map for fitted and SE values.
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 543056 29.1 1201447 64.2 NA 700282 37.4
## Vcells 1002658 7.7 8388608 64.0 36864 1963597 15.0
rm(list = ls())
# Load necessary libraries
spatialAnalysis.Lib = c("sp", "sf","geosphere", "geoR", "maps", "fields",
"rnaturalearth", "rnaturalearthdata")
smoothing.Lib = c("mgcv", "locfit", "akima")
statisticalAnalysis.Lib = c("MASS", "psych", "hydroGOF", "fields", "leaps", "car")
dataManipulation.Lib = c("dplyr", "reshape2", "reshape", "tidyr")
dataVisualization.Lib = c("ggplot2", "GGally", "ggpubr", "scales",
"RColorBrewer")
modeling.Lib = c("nnet", "MASS", "verification", "spBayes", "rstan")
list.packages = unique(c(spatialAnalysis.Lib, statisticalAnalysis.Lib,
smoothing.Lib, modeling.Lib,
dataManipulation.Lib, dataVisualization.Lib))
# Load all required packages
sapply(list.packages, require, character.only = TRUE)
## sp sf geosphere geoR
## TRUE TRUE TRUE TRUE
## maps fields rnaturalearth rnaturalearthdata
## TRUE TRUE TRUE TRUE
## MASS psych hydroGOF leaps
## TRUE TRUE TRUE TRUE
## car mgcv locfit akima
## TRUE TRUE TRUE TRUE
## nnet verification spBayes rstan
## TRUE TRUE TRUE TRUE
## dplyr reshape2 reshape tidyr
## TRUE TRUE TRUE TRUE
## ggplot2 GGally ggpubr scales
## TRUE TRUE TRUE TRUE
## RColorBrewer
## TRUE
# Path -------------------------------------------------------------------------
path = "/Users/caje3244/OneDrive - UCB-O365/2025 Spring/CVEN 6833/"
path.code = file.path(path, "Codes/")
path.plot = file.path(path, "Figures/")
# load functions to perform regression
source(file = file.path(path.code, "libraryRegression.R"))
# Data -------------------------------------------------------------------------
topography = read.table("http://civil.colorado.edu/~balajir/CVEN6833/HWs/HW-1/india-grid-topo.txt")
colnames(topography) = c("lon", "lat", "elev")
# new climatol data with calculated distances from Geodesic method
climaGeo = readRDS(file = paste0(path.code, "RData/hw1_climatol_distGeodesic.rds"))
climaGeo = na.omit(climaGeo)
climaGeo$prec = climaGeo$prec/10 # mm to cm
coords = dplyr::select(climaGeo, c(lon, lat))
# scale all vars dependent of lon and lat
climaGeo = climaGeo %>%
dplyr::mutate(across(!c(prec), scale, center = TRUE, scale = TRUE)) %>%
dplyr::mutate(across(!c(prec), as.numeric))
scale.color = c("#d73027","#313695")
scale.label = unique(cut_interval(climaGeo$prec, n=2))
pairs.ggplot(climaGeo)
## Set threshold
climaGeo$prec = ifelse(climaGeo$prec > 150, 1, 0)
set.seed(2526321)
# v0) basic linear model
str(data)
## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"),
## envir = .GlobalEnv, overwrite = TRUE)
basic.model = lm(prec ~ ., data = climaGeo)
# If the VIF is larger than 1/(1-R2)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity
## lon lat elev mdCoast mdArabian mdBengal
## TRUE FALSE FALSE FALSE TRUE TRUE
cat("Since for some variables VIF is larger than 1/(1-R2), let's remove:
mdArabian and mdBengal and just keep the mdCoast (minimun distance to any coast) \n")
## Since for some variables VIF is larger than 1/(1-R2), let's remove:
## mdArabian and mdBengal and just keep the mdCoast (minimun distance to any coast)
# v1) linear model
climaGeo = dplyr::select(climaGeo, -c(mdArabian, mdBengal))
basic.model = lm(prec ~ ., data = climaGeo)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity
## lon lat elev mdCoast
## FALSE FALSE FALSE FALSE
# v2) basic linear + interacterions
cat("Cool, for all the variables VIF < 1/(1-R2)
Let's add some interactiones between the variables \n")
## Cool, for all the variables VIF < 1/(1-R2)
## Let's add some interactiones between the variables
climaGeo = climaGeo %>%
dplyr::mutate(lld = lon * lat * mdCoast,
ll2d = lon * lat^2 * mdCoast,
l2ld = lon^2 * lat * mdCoast,
ed = elev * mdCoast
)
basic.model = lm(prec ~ ., data = climaGeo)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity
## lon lat elev mdCoast lld ll2d l2ld ed
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
cat("Since for all the variables VIF < 1/(1-R2)
Let's proceed with BHSM \n")
## Since for all the variables VIF < 1/(1-R2)
## Let's proceed with BHSM
source(file = file.path(path.code, "libraryRegression.R"))
fitting.BHSM = fit_BHSM(model.type = "GLM",
family = "binomial",
data = climaGeo,
coords = coords,
topography = topography,
type = "variogram")
##
## ------------------------ Fitting GLM model
##
## ------------ Fitting GLM with binomial + logit link
##
## ------------ Fitting GLM with binomial + probit link
##
## ------------ Fitting GLM with binomial + cauchit link
##
## Model link comparison:
## Model summary:
## model MSE RMSE R2 AIC BIC
## logit logit 22.065688 4.697413 -175.63058 1471.248 1492.280
## probit probit 5.538158 2.353329 -43.33163 1131.187 1152.219
## cauchit cauchit 175.929668 13.263848 -1407.27510 1981.958 2002.990
##
## Best link function: probit
## MSE = 5.538
## RMSE = 2.353
## R2 = -43.332
## AIC = 1131.187
## BIC = 1152.219
##
## ------------------------ Summary
##
## ------------ Fitting BHSM with exp potential function
##
## ------------ Fitting BHSM with spherical potential function
##
## ------------ Fitting BHSM with gaussian potential function
##
## Model link comparison:
## Model summary:
## model MSE RMSE R2
## exp exp 2.842053e-06 0.001685839 0.5043667
## spherical spherical 2.842047e-06 0.001685837 0.5043677
## gaussian gaussian 2.777806e-06 0.001666675 0.5155710
##
## Best potential function: gaussian
## MSE = 0.000
## RMSE = 0.002
## R2 = 0.516
##
## ------------------------ Running MCMC for BHSM
## Note 1: If your coordinates are normalized, phi in (0.05, 3) usually works.
## Note 2: Modify if required the params for binomial family----------------------------------------
## General model description
## ----------------------------------------
## Model fit with 246 observations.
##
## Number of covariates 6 (including intercept if specified).
##
## Using the exponential spatial correlation model.
##
## Number of MCMC samples 2000.
##
## Priors and hyperpriors:
## beta flat.
## sigma.sq IG hyperpriors shape=2.00000 and scale=1.00000
## tau.sq IG hyperpriors shape=0.01000 and scale=0.00100
## phi Unif hyperpriors a=0.05000 and b=3.00000
## -------------------------------------------------
## Sampling
## -------------------------------------------------
## Sampled: 200 of 2000, 10.00%
## Report interval Metrop. Acceptance rate: 61.00%
## Overall Metrop. Acceptance rate: 61.00%
## -------------------------------------------------
## Sampled: 400 of 2000, 20.00%
## Report interval Metrop. Acceptance rate: 68.50%
## Overall Metrop. Acceptance rate: 64.75%
## -------------------------------------------------
## Sampled: 600 of 2000, 30.00%
## Report interval Metrop. Acceptance rate: 67.00%
## Overall Metrop. Acceptance rate: 65.50%
## -------------------------------------------------
## Sampled: 800 of 2000, 40.00%
## Report interval Metrop. Acceptance rate: 63.00%
## Overall Metrop. Acceptance rate: 64.88%
## -------------------------------------------------
## Sampled: 1000 of 2000, 50.00%
## Report interval Metrop. Acceptance rate: 70.50%
## Overall Metrop. Acceptance rate: 66.00%
## -------------------------------------------------
## Sampled: 1200 of 2000, 60.00%
## Report interval Metrop. Acceptance rate: 74.50%
## Overall Metrop. Acceptance rate: 67.42%
## -------------------------------------------------
## Sampled: 1400 of 2000, 70.00%
## Report interval Metrop. Acceptance rate: 66.50%
## Overall Metrop. Acceptance rate: 67.29%
## -------------------------------------------------
## Sampled: 1600 of 2000, 80.00%
## Report interval Metrop. Acceptance rate: 70.00%
## Overall Metrop. Acceptance rate: 67.62%
## -------------------------------------------------
## Sampled: 1800 of 2000, 90.00%
## Report interval Metrop. Acceptance rate: 72.50%
## Overall Metrop. Acceptance rate: 68.17%
## -------------------------------------------------
## Sampled: 2000 of 2000, 100.00%
## Report interval Metrop. Acceptance rate: 69.50%
## Overall Metrop. Acceptance rate: 68.30%
## -------------------------------------------------
## -------------------------------------------------
## Recovering beta and w
## -------------------------------------------------
## Sampled: 99 of 1001, 9.89%
## Sampled: 199 of 1001, 19.88%
## Sampled: 299 of 1001, 29.87%
## Sampled: 399 of 1001, 39.86%
## Sampled: 499 of 1001, 49.85%
## Sampled: 599 of 1001, 59.84%
## Sampled: 699 of 1001, 69.83%
## Sampled: 799 of 1001, 79.82%
## Sampled: 899 of 1001, 89.81%
## Sampled: 999 of 1001, 99.80%
## -------------------------------------------------
## Recovering beta
## -------------------------------------------------
## Sampled: 99 of 1001, 9.89%
## Sampled: 199 of 1001, 19.88%
## Sampled: 299 of 1001, 29.87%
## Sampled: 399 of 1001, 39.86%
## Sampled: 499 of 1001, 49.85%
## Sampled: 599 of 1001, 59.84%
## Sampled: 699 of 1001, 69.83%
## Sampled: 799 of 1001, 79.82%
## Sampled: 899 of 1001, 89.81%
## Sampled: 999 of 1001, 99.80%
##
## ------------------------ Spatial Mapping
Even the best GLM (probit) had a very poor \(R^2\) and no spatial structure, motivating the application of BHSM.
fitting.BHSM$emp.plot
Three potential functions were tested (for spatial covariance). The best
kernel was get it by Gaussian model, where BHSM explains over 51.6% of
variance (a significant leap from GLM).
fitting.BHSM$model
## Nonlinear regression model
## model: value ~ model(distance, c0, c1, a)
## data: data
## c0 c1 a
## 9.717e-03 5.457e-03 1.088e+02
## residual sum-of-squares: 9.722e-05
##
## Number of iterations to convergence: 8
## Achieved convergence tolerance: 1.49e-08
fitting.BHSM$plot.pparms
From figure, well-formed posterior densities suggest stable MCMC convergence.
\(\sigma^2\) (sigma.sq): sill (spatial variance) is between 0.1 and 0.6
\(\tau2\) (tau.sq): small nugget effect (0.01–0.03).
\(\phi\) (phi): show a moderate spatial range (0.05–0.25).
fitting.BHSM$plot.pcovars
The covariates explain part of the variability, but spatial effects carry additional predictive power.
Intercept is centered around 0, suggesting a balanced baseline probability.
lon: slight positive (location-based).
lat and elev shown mixed weak effects.
Nonlinear interaction effects with low-to-moderate impact.
fitting.BHSM$mod.diag
Diagnostics confirm model residuals are well-behaved:
a) Normality: Residuals tightly follow normal quantiles.
b) Histogram: Centered and symmetrical.
c) Independence: No significant autocorrelation remaining.
d) Homoscedasticity: Centered in zero.
fitting.BHSM$pred.sptMap
fitting.BHSM$se.sptMap
* Standard error map shows very low SE (\(\sim
e{-16}\)) across India. This suggests extremely high model
confidence, though may indicate over-smoothing or high model certainty
due to small-scale nugget.
BHSM outperforms GLM in capturing spatial variability and improving predictive accuracy. So, use BHSM for spatial prediction and applications needing uncertainty quantification.
Is good idea to keep GLM as a baseline but not for practical use.
BHSM should be complemented with cross-validation to verify predictive robustness.