Catalina Jerez
catalina.jerez@colorado.edu
Perform a Bayesian Hierarchical Spatial Model (see Verdin et al. 2015, for tips).
Regressions will be executed with the fitBHSM() function, which returns:
Binned empirical plot.
Best fitted model.
MCMC (fit, samples, predicted values)
Graph for posterior distributions of parameters (nugget effect, sill variance) and covariables.
Scatterplot of simulated vs predicted.
Diagnostics
Spatially map for fitted and SE values.
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 543052 29.1 1201435 64.2 NA 700282 37.4
## Vcells 1002800 7.7 8388608 64.0 36864 1963597 15.0
rm(list = ls())
# Load necessary libraries
spatialAnalysis.Lib = c("sp", "sf","geosphere", "geoR", "maps", "fields",
"rnaturalearth", "rnaturalearthdata")
smoothing.Lib = c("mgcv", "locfit", "akima")
statisticalAnalysis.Lib = c("MASS", "psych", "hydroGOF", "fields", "leaps", "car")
dataManipulation.Lib = c("dplyr", "reshape2", "reshape", "tidyr")
dataVisualization.Lib = c("ggplot2", "GGally", "ggpubr", "scales",
"RColorBrewer")
modeling.Lib = c("nnet", "MASS", "verification", "spBayes", "rstan")
list.packages = unique(c(spatialAnalysis.Lib, statisticalAnalysis.Lib,
smoothing.Lib, modeling.Lib,
dataManipulation.Lib, dataVisualization.Lib))
# Load all required packages
sapply(list.packages, require, character.only = TRUE)
## sp sf geosphere geoR
## TRUE TRUE TRUE TRUE
## maps fields rnaturalearth rnaturalearthdata
## TRUE TRUE TRUE TRUE
## MASS psych hydroGOF leaps
## TRUE TRUE TRUE TRUE
## car mgcv locfit akima
## TRUE TRUE TRUE TRUE
## nnet verification spBayes rstan
## TRUE TRUE TRUE TRUE
## dplyr reshape2 reshape tidyr
## TRUE TRUE TRUE TRUE
## ggplot2 GGally ggpubr scales
## TRUE TRUE TRUE TRUE
## RColorBrewer
## TRUE
# Path -------------------------------------------------------------------------
path = "/Users/caje3244/OneDrive - UCB-O365/2025 Spring/CVEN 6833/"
path.code = file.path(path, "Codes/")
path.plot = file.path(path, "Figures/")
# load functions to perform regression
source(file = file.path(path.code, "libraryRegression.R"))
# Data -------------------------------------------------------------------------
topography = read.table("http://civil.colorado.edu/~balajir/CVEN6833/HWs/HW-1/india-grid-topo.txt")
colnames(topography) = c("lon", "lat", "elev")
# new climatol data with calculated distances from Geodesic method
climaGeo = readRDS(file = paste0(path.code, "RData/hw1_climatol_distGeodesic.rds"))
climaGeo = na.omit(climaGeo)
climaGeo$prec = climaGeo$prec/10 # mm to cm
coords = dplyr::select(climaGeo, c(lon, lat))
# scale all vars dependent of lon and lat
climaGeo = climaGeo %>%
dplyr::mutate(across(!c(prec), scale, center = TRUE, scale = TRUE)) %>%
dplyr::mutate(across(!c(prec), as.numeric))
scale.color = c("#d73027","#313695")
scale.label = unique(cut_interval(climaGeo$prec, n=2))
pairs.ggplot(climaGeo)
## Basic analysis
set.seed(2526321)
# v0) basic linear model
str(data)
## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"),
## envir = .GlobalEnv, overwrite = TRUE)
basic.model = lm(prec ~ ., data = climaGeo)
# If the VIF is larger than 1/(1-R2)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity
## lon lat elev mdCoast mdArabian mdBengal
## TRUE FALSE FALSE FALSE TRUE TRUE
cat("Since for some variables VIF is larger than 1/(1-R2), let's remove:
mdArabian and mdBengal and just keep the mdCoast (minimun distance to any coast) \n")
## Since for some variables VIF is larger than 1/(1-R2), let's remove:
## mdArabian and mdBengal and just keep the mdCoast (minimun distance to any coast)
# v1) linear model
climaGeo = dplyr::select(climaGeo, -c(mdArabian, mdBengal))
basic.model = lm(prec ~ ., data = climaGeo)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity
## lon lat elev mdCoast
## FALSE FALSE FALSE FALSE
# v2) basic linear + interacterions
cat("Cool, for all the variables VIF < 1/(1-R2)
Let's add some interactiones between the variables \n")
## Cool, for all the variables VIF < 1/(1-R2)
## Let's add some interactiones between the variables
climaGeo = climaGeo %>%
dplyr::mutate(lld = lon * lat * mdCoast,
ll2d = lon * lat^2 * mdCoast,
l2ld = lon^2 * lat * mdCoast,
ed = elev * mdCoast
)
basic.model = lm(prec ~ ., data = climaGeo)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity
## lon lat elev mdCoast lld ll2d l2ld ed
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
cat("Since for all the variables VIF < 1/(1-R2)
Let's proceed with BHSM \n")
## Since for all the variables VIF < 1/(1-R2)
## Let's proceed with BHSM
fitting.BHSM = fit_BHSM(model.type = "GLM",
family = "gaussian",
data = climaGeo,
coords = coords,
topography = topography,
type = "variogram")
##
## ------------------------ Fitting GLM model
##
## ------------ Fitting GLM with gaussian + identity link
##
## Model link comparison:
## Model summary:
## model MSE RMSE R2 AIC BIC
## identity identity 1313.579 36.24333 0.6681692 2474.523 2492.05
##
## Best link function: identity
## MSE = 1313.579
## RMSE = 36.243
## R2 = 0.668
## AIC = 2474.523
## BIC = 2492.050
##
## ------------------------ Summary
##
## ------------ Fitting BHSM with exp potential function
##
## ------------ Fitting BHSM with spherical potential function
##
## ------------ Fitting BHSM with gaussian potential function
##
## Model link comparison:
## Model summary:
## model MSE RMSE R2
## exp exp 15521.30 124.5845 0.8110962
## spherical spherical 16685.15 129.1710 0.7969314
## gaussian gaussian 22221.87 149.0700 0.7295460
##
## Best potential function: exp
## MSE = 15521.295
## RMSE = 124.584
## R2 = 0.811
##
## ------------------------ Running MCMC for BHSM
## Note 1: If your coordinates are normalized, phi in (0.05, 3) usually works.
## Note 2: Modify if required the params for binomial family----------------------------------------
## General model description
## ----------------------------------------
## Model fit with 246 observations.
##
## Number of covariates 6 (including intercept if specified).
##
## Using the exponential spatial correlation model.
##
## Number of MCMC samples 2000.
##
## Priors and hyperpriors:
## beta flat.
## sigma.sq IG hyperpriors shape=2.00000 and scale=1.00000
## tau.sq IG hyperpriors shape=2.00000 and scale=1.00000
## phi Unif hyperpriors a=0.05000 and b=3.00000
## -------------------------------------------------
## Sampling
## -------------------------------------------------
## Sampled: 200 of 2000, 10.00%
## Report interval Metrop. Acceptance rate: 37.50%
## Overall Metrop. Acceptance rate: 37.50%
## -------------------------------------------------
## Sampled: 400 of 2000, 20.00%
## Report interval Metrop. Acceptance rate: 26.50%
## Overall Metrop. Acceptance rate: 32.00%
## -------------------------------------------------
## Sampled: 600 of 2000, 30.00%
## Report interval Metrop. Acceptance rate: 24.00%
## Overall Metrop. Acceptance rate: 29.33%
## -------------------------------------------------
## Sampled: 800 of 2000, 40.00%
## Report interval Metrop. Acceptance rate: 29.50%
## Overall Metrop. Acceptance rate: 29.38%
## -------------------------------------------------
## Sampled: 1000 of 2000, 50.00%
## Report interval Metrop. Acceptance rate: 29.50%
## Overall Metrop. Acceptance rate: 29.40%
## -------------------------------------------------
## Sampled: 1200 of 2000, 60.00%
## Report interval Metrop. Acceptance rate: 34.50%
## Overall Metrop. Acceptance rate: 30.25%
## -------------------------------------------------
## Sampled: 1400 of 2000, 70.00%
## Report interval Metrop. Acceptance rate: 23.50%
## Overall Metrop. Acceptance rate: 29.29%
## -------------------------------------------------
## Sampled: 1600 of 2000, 80.00%
## Report interval Metrop. Acceptance rate: 24.50%
## Overall Metrop. Acceptance rate: 28.69%
## -------------------------------------------------
## Sampled: 1800 of 2000, 90.00%
## Report interval Metrop. Acceptance rate: 26.50%
## Overall Metrop. Acceptance rate: 28.44%
## -------------------------------------------------
## Sampled: 2000 of 2000, 100.00%
## Report interval Metrop. Acceptance rate: 23.50%
## Overall Metrop. Acceptance rate: 27.95%
## -------------------------------------------------
## -------------------------------------------------
## Recovering beta and w
## -------------------------------------------------
## Sampled: 99 of 1001, 9.89%
## Sampled: 199 of 1001, 19.88%
## Sampled: 299 of 1001, 29.87%
## Sampled: 399 of 1001, 39.86%
## Sampled: 499 of 1001, 49.85%
## Sampled: 599 of 1001, 59.84%
## Sampled: 699 of 1001, 69.83%
## Sampled: 799 of 1001, 79.82%
## Sampled: 899 of 1001, 89.81%
## Sampled: 999 of 1001, 99.80%
## -------------------------------------------------
## Recovering beta
## -------------------------------------------------
## Sampled: 99 of 1001, 9.89%
## Sampled: 199 of 1001, 19.88%
## Sampled: 299 of 1001, 29.87%
## Sampled: 399 of 1001, 39.86%
## Sampled: 499 of 1001, 49.85%
## Sampled: 599 of 1001, 59.84%
## Sampled: 699 of 1001, 69.83%
## Sampled: 799 of 1001, 79.82%
## Sampled: 899 of 1001, 89.81%
## Sampled: 999 of 1001, 99.80%
##
## ------------------------ Spatial Mapping
fitting.BHSM$emp.plot
The binned empirical variogram shows an increasing trend, where the
closer points exhibit lower variance, while the farther apart points
show greater variability.
The exponential model (dashed red line on figure above) was selected based on lowest RMSE (124.58) and highest \(R^2\) (0.81), indicating a good spatial fit.
fitting.BHSM$model
## Nonlinear regression model
## model: value ~ model(distance, c0, c1, a)
## data: data
## c0 c1 a
## 488.5 1151.8 111.3
## residual sum-of-squares: 543245
##
## Number of iterations to convergence: 16
## Achieved convergence tolerance: 1.49e-08
fitting.BHSM$plot.pparms
From posterior distributions of parameters (\(\sigma^2\), \(\tau^2\), \(\phi\)), most of the variability is due to spatial structure rather than random noise.
\(\sigma^2\) (sigma.sq): sill is high, reflecting an important spatial variability.
\(\tau2\) (tau.sq): nugget effect (random noise) is low, supporting spatial smoothness.
\(\phi\) (phi): control spatial decay, it’s concentrated between 0.05 and 0.25
fitting.BHSM$plot.pcovars
Posterior distributions of Latitude (lat) and Elevation (elev) are narrow and centered away from zero, indicating these are strong and statistically significant predictors of precipitation.
Longitude (lon) and Intercept, also show narrow and peaked distributions, which suggest a contribution to the model.
The interaction term ed, show a posterior density wide and flatter. Then, ed implies a high uncertainty and likely a weak influence on precipitation.
And lld is slightly more peaked, i.e., moderate impact, but not as influential as lat or elev.
fitting.BHSM$scatter
The fit show a very tight clustering along the diagonal, with near-perfect linear fit.
Metrics results (RMSE = 0 and \(R^2\) = 1) likely reflect overfitting or high model confidence, possibly due to spatial correlation being fully captured.
fitting.BHSM$mod.diag
a) Normality: The Q-Q plot through both tails indicates
heavy-tailed residuals. The central section fits well, but since there
are outliers in both tails, it means that some observations are not well
described by the model and are probably due to localized spatial
effects.
b) Histogram: Shows a peak strongly centered on zero, reflecting that the majority of residuals are near zero, validating high accuracy. But thin spikes indicate the model has the potential to overfit, fitting data too closely. Residuals are skewed in their distribution, reflecting potential non-linearity or issues of variance that might need to be transformed.
c) Independence: ACF plot indicates strong autocorrelation at lag 1, with spatial correlation not being fully explained in the residuals. Periodic structure indicates some spatial dependency is left unmodeled.
d) Homoscedasticity: The residual plot indicates an uneven spread, with greater variance in extreme values of precipitation. It has a slight funnel shape, so there is heteroscedasticity, i.e., the errors grow as precipitation values increase.
fitting.BHSM$pred.sptMap
The upper figure shows that a higher precipitation is observed in the northern and coastal regions, which is in line with the expected climatic patterns. Central and western India exhibit lower precipitation, consistent with the drier climatic conditions of these regions.
fitting.BHSM$se.sptMap
Standard Error Map show extremely low values (in the order of 1e-14), indicating high model certainty.