Catalina Jerez
catalina.jerez@colorado.edu

Perform a Bayesian Hierarchical Spatial Model (see Verdin et al. 2015, for tips).

Plot the posterior histogram/PDF of the parameters and
Spatial maps of posterior mean and standard error

Regressions will be executed with the fitBHSM() function, which returns:

Binned empirical plot.
Best fitted model.
MCMC (fit, samples, predicted values)
Graph for posterior distributions of parameters (nugget effect, sill variance) and covariables.
Scatterplot of simulated vs predicted.
Diagnostics
Spatially map for fitted and SE values.

1 Library

gc()

##           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells  543052 29.1    1201435 64.2         NA   700282 37.4
## Vcells 1002800  7.7    8388608 64.0      36864  1963597 15.0

rm(list = ls())
# Load necessary libraries
spatialAnalysis.Lib     = c("sp", "sf","geosphere", "geoR", "maps", "fields",
                             "rnaturalearth", "rnaturalearthdata")
smoothing.Lib           = c("mgcv", "locfit", "akima")
statisticalAnalysis.Lib = c("MASS", "psych", "hydroGOF", "fields", "leaps", "car")
dataManipulation.Lib    = c("dplyr", "reshape2", "reshape", "tidyr")
dataVisualization.Lib   = c("ggplot2", "GGally", "ggpubr", "scales",
                            "RColorBrewer")
modeling.Lib            = c("nnet", "MASS", "verification", "spBayes", "rstan")
list.packages           = unique(c(spatialAnalysis.Lib, statisticalAnalysis.Lib,
                                   smoothing.Lib, modeling.Lib,
                                   dataManipulation.Lib, dataVisualization.Lib))
# Load all required packages
sapply(list.packages, require, character.only = TRUE)

##                sp                sf         geosphere              geoR 
##              TRUE              TRUE              TRUE              TRUE 
##              maps            fields     rnaturalearth rnaturalearthdata 
##              TRUE              TRUE              TRUE              TRUE 
##              MASS             psych          hydroGOF             leaps 
##              TRUE              TRUE              TRUE              TRUE 
##               car              mgcv            locfit             akima 
##              TRUE              TRUE              TRUE              TRUE 
##              nnet      verification           spBayes             rstan 
##              TRUE              TRUE              TRUE              TRUE 
##             dplyr          reshape2           reshape             tidyr 
##              TRUE              TRUE              TRUE              TRUE 
##           ggplot2            GGally            ggpubr            scales 
##              TRUE              TRUE              TRUE              TRUE 
##      RColorBrewer 
##              TRUE

2 Load data and pre-process

# Path -------------------------------------------------------------------------
path      = "/Users/caje3244/OneDrive - UCB-O365/2025 Spring/CVEN 6833/"
path.code = file.path(path, "Codes/")
path.plot = file.path(path, "Figures/")

# load functions to perform regression
source(file = file.path(path.code, "libraryRegression.R"))

# Data -------------------------------------------------------------------------
topography = read.table("http://civil.colorado.edu/~balajir/CVEN6833/HWs/HW-1/india-grid-topo.txt")
colnames(topography)    = c("lon", "lat", "elev")

# new climatol data with calculated distances from Geodesic method
climaGeo = readRDS(file = paste0(path.code, "RData/hw1_climatol_distGeodesic.rds"))
climaGeo = na.omit(climaGeo)
climaGeo$prec = climaGeo$prec/10 # mm to cm
coords = dplyr::select(climaGeo, c(lon, lat))
# scale all vars dependent of lon and lat 
climaGeo = climaGeo %>%
  dplyr::mutate(across(!c(prec), scale, center = TRUE, scale = TRUE)) %>%
  dplyr::mutate(across(!c(prec), as.numeric))

scale.color  = c("#d73027","#313695")
scale.label  = unique(cut_interval(climaGeo$prec, n=2))
pairs.ggplot(climaGeo)

## Basic analysis

set.seed(2526321)
# v0) basic linear model 
str(data)

## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"), 
##     envir = .GlobalEnv, overwrite = TRUE)

basic.model = lm(prec ~ ., data = climaGeo)
# If the VIF is larger than 1/(1-R2)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity

##       lon       lat      elev   mdCoast mdArabian  mdBengal 
##      TRUE     FALSE     FALSE     FALSE      TRUE      TRUE

cat("Since for some variables VIF is larger than 1/(1-R2), let's remove:
mdArabian and mdBengal and just keep the mdCoast (minimun distance to any coast) \n")

## Since for some variables VIF is larger than 1/(1-R2), let's remove:
## mdArabian and mdBengal and just keep the mdCoast (minimun distance to any coast)

# v1) linear model 
climaGeo    = dplyr::select(climaGeo, -c(mdArabian, mdBengal))
basic.model = lm(prec ~ ., data = climaGeo)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity

##     lon     lat    elev mdCoast 
##   FALSE   FALSE   FALSE   FALSE

# v2) basic linear + interacterions
cat("Cool, for all the variables VIF < 1/(1-R2)
Let's add some interactiones between the variables \n")

## Cool, for all the variables VIF < 1/(1-R2)
## Let's add some interactiones between the variables

climaGeo    = climaGeo %>%
  dplyr::mutate(lld  = lon * lat * mdCoast,
                ll2d = lon * lat^2 * mdCoast, 
                l2ld = lon^2 * lat * mdCoast,
                ed   = elev * mdCoast
                )

basic.model = lm(prec ~ ., data = climaGeo)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity

##     lon     lat    elev mdCoast     lld    ll2d    l2ld      ed 
##   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE

cat("Since for all the variables VIF < 1/(1-R2)
Let's proceed with BHSM \n")

## Since for all the variables VIF < 1/(1-R2)
## Let's proceed with BHSM

3 Bayesian Hierarquical Spatial Model

3.1 Fitting model

fitting.BHSM = fit_BHSM(model.type = "GLM", 
                        family     = "gaussian",
                        data       = climaGeo, 
                        coords     = coords, 
                        topography = topography,
                        type       = "variogram")

## 
##  ------------------------ Fitting GLM model
## 
## ------------ Fitting GLM with gaussian + identity link
## 
## Model link comparison:
## Model summary:
##             model      MSE     RMSE        R2      AIC     BIC
## identity identity 1313.579 36.24333 0.6681692 2474.523 2492.05
## 
## Best link function: identity
##   MSE = 1313.579
##   RMSE = 36.243
##   R2 = 0.668
##   AIC = 2474.523
##   BIC = 2492.050
## 
##  ------------------------ Summary
## 
## ------------ Fitting BHSM with exp potential function 
## 
## ------------ Fitting BHSM with spherical potential function 
## 
## ------------ Fitting BHSM with gaussian potential function 
## 
## Model link comparison:
## Model summary:
##               model      MSE     RMSE        R2
## exp             exp 15521.30 124.5845 0.8110962
## spherical spherical 16685.15 129.1710 0.7969314
## gaussian   gaussian 22221.87 149.0700 0.7295460
## 
## Best potential function: exp
##   MSE = 15521.295
##   RMSE = 124.584
##   R2 = 0.811
## 
##  ------------------------ Running MCMC for BHSM
##    Note 1: If your coordinates are normalized, phi in (0.05, 3) usually works.
##    Note 2: Modify if required the params for binomial family----------------------------------------
##  General model description
## ----------------------------------------
## Model fit with 246 observations.
## 
## Number of covariates 6 (including intercept if specified).
## 
## Using the exponential spatial correlation model.
## 
## Number of MCMC samples 2000.
## 
## Priors and hyperpriors:
##  beta flat.
##  sigma.sq IG hyperpriors shape=2.00000 and scale=1.00000
##  tau.sq IG hyperpriors shape=2.00000 and scale=1.00000
##  phi Unif hyperpriors a=0.05000 and b=3.00000
## -------------------------------------------------
##      Sampling
## -------------------------------------------------
## Sampled: 200 of 2000, 10.00%
## Report interval Metrop. Acceptance rate: 37.50%
## Overall Metrop. Acceptance rate: 37.50%
## -------------------------------------------------
## Sampled: 400 of 2000, 20.00%
## Report interval Metrop. Acceptance rate: 26.50%
## Overall Metrop. Acceptance rate: 32.00%
## -------------------------------------------------
## Sampled: 600 of 2000, 30.00%
## Report interval Metrop. Acceptance rate: 24.00%
## Overall Metrop. Acceptance rate: 29.33%
## -------------------------------------------------
## Sampled: 800 of 2000, 40.00%
## Report interval Metrop. Acceptance rate: 29.50%
## Overall Metrop. Acceptance rate: 29.38%
## -------------------------------------------------
## Sampled: 1000 of 2000, 50.00%
## Report interval Metrop. Acceptance rate: 29.50%
## Overall Metrop. Acceptance rate: 29.40%
## -------------------------------------------------
## Sampled: 1200 of 2000, 60.00%
## Report interval Metrop. Acceptance rate: 34.50%
## Overall Metrop. Acceptance rate: 30.25%
## -------------------------------------------------
## Sampled: 1400 of 2000, 70.00%
## Report interval Metrop. Acceptance rate: 23.50%
## Overall Metrop. Acceptance rate: 29.29%
## -------------------------------------------------
## Sampled: 1600 of 2000, 80.00%
## Report interval Metrop. Acceptance rate: 24.50%
## Overall Metrop. Acceptance rate: 28.69%
## -------------------------------------------------
## Sampled: 1800 of 2000, 90.00%
## Report interval Metrop. Acceptance rate: 26.50%
## Overall Metrop. Acceptance rate: 28.44%
## -------------------------------------------------
## Sampled: 2000 of 2000, 100.00%
## Report interval Metrop. Acceptance rate: 23.50%
## Overall Metrop. Acceptance rate: 27.95%
## -------------------------------------------------
## -------------------------------------------------
##      Recovering beta and w
## -------------------------------------------------
## Sampled: 99 of 1001, 9.89%
## Sampled: 199 of 1001, 19.88%
## Sampled: 299 of 1001, 29.87%
## Sampled: 399 of 1001, 39.86%
## Sampled: 499 of 1001, 49.85%
## Sampled: 599 of 1001, 59.84%
## Sampled: 699 of 1001, 69.83%
## Sampled: 799 of 1001, 79.82%
## Sampled: 899 of 1001, 89.81%
## Sampled: 999 of 1001, 99.80%

## -------------------------------------------------
##      Recovering beta
## -------------------------------------------------
## Sampled: 99 of 1001, 9.89%
## Sampled: 199 of 1001, 19.88%
## Sampled: 299 of 1001, 29.87%
## Sampled: 399 of 1001, 39.86%
## Sampled: 499 of 1001, 49.85%
## Sampled: 599 of 1001, 59.84%
## Sampled: 699 of 1001, 69.83%
## Sampled: 799 of 1001, 79.82%
## Sampled: 899 of 1001, 89.81%
## Sampled: 999 of 1001, 99.80%
## 
##  ------------------------ Spatial Mapping

3.2 Binned empirical (semi)variogram

fitting.BHSM$emp.plot

The binned empirical variogram shows an increasing trend, where the closer points exhibit lower variance, while the farther apart points show greater variability.

3.3 Best model for variogram

The exponential model (dashed red line on figure above) was selected based on lowest RMSE (124.58) and highest \(R^2\) (0.81), indicating a good spatial fit.

fitting.BHSM$model

## Nonlinear regression model
##   model: value ~ model(distance, c0, c1, a)
##    data: data
##     c0     c1      a 
##  488.5 1151.8  111.3 
##  residual sum-of-squares: 543245
## 
## Number of iterations to convergence: 16 
## Achieved convergence tolerance: 1.49e-08

3.4 Posterior distribution of parameters

fitting.BHSM$plot.pparms

From posterior distributions of parameters (\(\sigma^2\), \(\tau^2\), \(\phi\)), most of the variability is due to spatial structure rather than random noise.

\(\sigma^2\) (sigma.sq): sill is high, reflecting an important spatial variability.
\(\tau2\) (tau.sq): nugget effect (random noise) is low, supporting spatial smoothness.
\(\phi\) (phi): control spatial decay, it’s concentrated between 0.05 and 0.25

3.5 Posterior distribution of covariables

fitting.BHSM$plot.pcovars

Posterior distributions of Latitude (lat) and Elevation (elev) are narrow and centered away from zero, indicating these are strong and statistically significant predictors of precipitation.
Longitude (lon) and Intercept, also show narrow and peaked distributions, which suggest a contribution to the model.
The interaction term ed, show a posterior density wide and flatter. Then, ed implies a high uncertainty and likely a weak influence on precipitation.
And lld is slightly more peaked, i.e., moderate impact, but not as influential as lat or elev.

3.6 Scatterplot of observed vs modeled precipitation

fitting.BHSM$scatter

The fit show a very tight clustering along the diagonal, with near-perfect linear fit.
Metrics results (RMSE = 0 and \(R^2\) = 1) likely reflect overfitting or high model confidence, possibly due to spatial correlation being fully captured.

3.7 Model diagnostics

fitting.BHSM$mod.diag

a) Normality: The Q-Q plot through both tails indicates heavy-tailed residuals. The central section fits well, but since there are outliers in both tails, it means that some observations are not well described by the model and are probably due to localized spatial effects.

b) Histogram: Shows a peak strongly centered on zero, reflecting that the majority of residuals are near zero, validating high accuracy. But thin spikes indicate the model has the potential to overfit, fitting data too closely. Residuals are skewed in their distribution, reflecting potential non-linearity or issues of variance that might need to be transformed.

c) Independence: ACF plot indicates strong autocorrelation at lag 1, with spatial correlation not being fully explained in the residuals. Periodic structure indicates some spatial dependency is left unmodeled.

d) Homoscedasticity: The residual plot indicates an uneven spread, with greater variance in extreme values of precipitation. It has a slight funnel shape, so there is heteroscedasticity, i.e., the errors grow as precipitation values increase.

3.8 Spatial mapping

fitting.BHSM$pred.sptMap

The upper figure shows that a higher precipitation is observed in the northern and coastal regions, which is in line with the expected climatic patterns. Central and western India exhibit lower precipitation, consistent with the drier climatic conditions of these regions.

fitting.BHSM$se.sptMap

Standard Error Map show extremely low values (in the order of 1e-14), indicating high model certainty.

CVEN 6833: Advanced Data Analysis Techniques

Homework 2
Problem 1: Bayesian Hierarquical Spatial Model

1 Library

2 Load data and pre-process

3 Bayesian Hierarquical Spatial Model

3.1 Fitting model

3.2 Binned empirical (semi)variogram

3.3 Best model for variogram

3.4 Posterior distribution of parameters

3.5 Posterior distribution of covariables

3.6 Scatterplot of observed vs modeled precipitation

3.7 Model diagnostics

3.8 Spatial mapping

CVEN 6833: Advanced Data Analysis Techniques

Homework 2 Problem 1: Bayesian Hierarquical Spatial Model

1 Library

2 Load data and pre-process

3 Bayesian Hierarquical Spatial Model

3.1 Fitting model

3.2 Binned empirical (semi)variogram

3.3 Best model for variogram

3.4 Posterior distribution of parameters

3.5 Posterior distribution of covariables

3.6 Scatterplot of observed vs modeled precipitation

3.7 Model diagnostics

3.8 Spatial mapping

Homework 2
Problem 1: Bayesian Hierarquical Spatial Model