Catalina Jerez
catalina.jerez@colorado.edu


The goal of this analysis is to estimate the underlying precipitation surface in India using sparse spatial observations. This is crucial for applications such as floodplain management, natural hazard mitigation, and climate modeling. The study utilizes regression models, crossvalidation, and spatial mapping techniques.

We’ll fit a Generalized Linear Model (and Local GLM) using binomial family. Then, let \(P_i\) be the precipitation at station \(i\). We define the binary variable \(B_{P_i}\) as:

\[ B_{P_i} = \begin{cases} 1 & \text{if } P_i \geq 150 \text{ cm} \\ 0 & \text{if } P_i < 150 \text{ cm} \end{cases} \]

Regressions will be executed with the fitBHSM() function, which returns:

  1. Binned empirical plot.

  2. Best fitted model.

  3. MCMC (fit, samples, predicted values)

  4. Graph for posterior distributions of parameters (nugget effect, sill variance) and covariables.

  5. Diagnostics

  6. Spatially map for fitted and SE values.

1 Library

gc()
##           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells  543056 29.1    1201447 64.2         NA   700282 37.4
## Vcells 1002658  7.7    8388608 64.0      36864  1963597 15.0
rm(list = ls())
# Load necessary libraries
spatialAnalysis.Lib     = c("sp", "sf","geosphere", "geoR", "maps", "fields",
                             "rnaturalearth", "rnaturalearthdata")
smoothing.Lib           = c("mgcv", "locfit", "akima")
statisticalAnalysis.Lib = c("MASS", "psych", "hydroGOF", "fields", "leaps", "car")
dataManipulation.Lib    = c("dplyr", "reshape2", "reshape", "tidyr")
dataVisualization.Lib   = c("ggplot2", "GGally", "ggpubr", "scales",
                            "RColorBrewer")
modeling.Lib            = c("nnet", "MASS", "verification", "spBayes", "rstan")
list.packages           = unique(c(spatialAnalysis.Lib, statisticalAnalysis.Lib,
                                   smoothing.Lib, modeling.Lib,
                                   dataManipulation.Lib, dataVisualization.Lib))
# Load all required packages
sapply(list.packages, require, character.only = TRUE)
##                sp                sf         geosphere              geoR 
##              TRUE              TRUE              TRUE              TRUE 
##              maps            fields     rnaturalearth rnaturalearthdata 
##              TRUE              TRUE              TRUE              TRUE 
##              MASS             psych          hydroGOF             leaps 
##              TRUE              TRUE              TRUE              TRUE 
##               car              mgcv            locfit             akima 
##              TRUE              TRUE              TRUE              TRUE 
##              nnet      verification           spBayes             rstan 
##              TRUE              TRUE              TRUE              TRUE 
##             dplyr          reshape2           reshape             tidyr 
##              TRUE              TRUE              TRUE              TRUE 
##           ggplot2            GGally            ggpubr            scales 
##              TRUE              TRUE              TRUE              TRUE 
##      RColorBrewer 
##              TRUE

2 Load data and pre-process

# Path -------------------------------------------------------------------------
path      = "/Users/caje3244/OneDrive - UCB-O365/2025 Spring/CVEN 6833/"
path.code = file.path(path, "Codes/")
path.plot = file.path(path, "Figures/")

# load functions to perform regression
source(file = file.path(path.code, "libraryRegression.R"))

# Data -------------------------------------------------------------------------
topography = read.table("http://civil.colorado.edu/~balajir/CVEN6833/HWs/HW-1/india-grid-topo.txt")
colnames(topography)    = c("lon", "lat", "elev")

# new climatol data with calculated distances from Geodesic method
climaGeo = readRDS(file = paste0(path.code, "RData/hw1_climatol_distGeodesic.rds"))
climaGeo = na.omit(climaGeo)
climaGeo$prec = climaGeo$prec/10 # mm to cm
coords = dplyr::select(climaGeo, c(lon, lat))
# scale all vars dependent of lon and lat 
climaGeo = climaGeo %>%
  dplyr::mutate(across(!c(prec), scale, center = TRUE, scale = TRUE)) %>%
  dplyr::mutate(across(!c(prec), as.numeric))

scale.color  = c("#d73027","#313695")
scale.label  = unique(cut_interval(climaGeo$prec, n=2))
pairs.ggplot(climaGeo)

## Set threshold

climaGeo$prec = ifelse(climaGeo$prec > 150, 1, 0)

2.1 Basic analysis

set.seed(2526321)
# v0) basic linear model 
str(data)
## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"), 
##     envir = .GlobalEnv, overwrite = TRUE)
basic.model = lm(prec ~ ., data = climaGeo)
# If the VIF is larger than 1/(1-R2)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity 
##       lon       lat      elev   mdCoast mdArabian  mdBengal 
##      TRUE     FALSE     FALSE     FALSE      TRUE      TRUE
cat("Since for some variables VIF is larger than 1/(1-R2), let's remove:
mdArabian and mdBengal and just keep the mdCoast (minimun distance to any coast) \n")
## Since for some variables VIF is larger than 1/(1-R2), let's remove:
## mdArabian and mdBengal and just keep the mdCoast (minimun distance to any coast)
# v1) linear model 
climaGeo    = dplyr::select(climaGeo, -c(mdArabian, mdBengal))
basic.model = lm(prec ~ ., data = climaGeo)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity 
##     lon     lat    elev mdCoast 
##   FALSE   FALSE   FALSE   FALSE
# v2) basic linear + interacterions
cat("Cool, for all the variables VIF < 1/(1-R2)
Let's add some interactiones between the variables \n")
## Cool, for all the variables VIF < 1/(1-R2)
## Let's add some interactiones between the variables
climaGeo    = climaGeo %>%
  dplyr::mutate(lld  = lon * lat * mdCoast,
                ll2d = lon * lat^2 * mdCoast, 
                l2ld = lon^2 * lat * mdCoast,
                ed   = elev * mdCoast
                )

basic.model = lm(prec ~ ., data = climaGeo)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity 
##     lon     lat    elev mdCoast     lld    ll2d    l2ld      ed 
##   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
cat("Since for all the variables VIF < 1/(1-R2)
Let's proceed with BHSM \n")
## Since for all the variables VIF < 1/(1-R2)
## Let's proceed with BHSM

3 Bayesian Hierarquical Spatial Model

3.1 Fitting model

source(file = file.path(path.code, "libraryRegression.R"))
fitting.BHSM = fit_BHSM(model.type = "GLM", 
                        family     = "binomial",
                        data       = climaGeo, 
                        coords     = coords, 
                        topography = topography,
                        type       = "variogram")
## 
##  ------------------------ Fitting GLM model
## 
## ------------ Fitting GLM with binomial + logit link
## 
## ------------ Fitting GLM with binomial + probit link
## 
## ------------ Fitting GLM with binomial + cauchit link
## 
## Model link comparison:
## Model summary:
##           model        MSE      RMSE          R2      AIC      BIC
## logit     logit  22.065688  4.697413  -175.63058 1471.248 1492.280
## probit   probit   5.538158  2.353329   -43.33163 1131.187 1152.219
## cauchit cauchit 175.929668 13.263848 -1407.27510 1981.958 2002.990
## 
## Best link function: probit
##   MSE = 5.538
##   RMSE = 2.353
##   R2 = -43.332
##   AIC = 1131.187
##   BIC = 1152.219
## 
##  ------------------------ Summary
## 
## ------------ Fitting BHSM with exp potential function 
## 
## ------------ Fitting BHSM with spherical potential function 
## 
## ------------ Fitting BHSM with gaussian potential function 
## 
## Model link comparison:
## Model summary:
##               model          MSE        RMSE        R2
## exp             exp 2.842053e-06 0.001685839 0.5043667
## spherical spherical 2.842047e-06 0.001685837 0.5043677
## gaussian   gaussian 2.777806e-06 0.001666675 0.5155710
## 
## Best potential function: gaussian
##   MSE = 0.000
##   RMSE = 0.002
##   R2 = 0.516
## 
##  ------------------------ Running MCMC for BHSM
##    Note 1: If your coordinates are normalized, phi in (0.05, 3) usually works.
##    Note 2: Modify if required the params for binomial family----------------------------------------
##  General model description
## ----------------------------------------
## Model fit with 246 observations.
## 
## Number of covariates 6 (including intercept if specified).
## 
## Using the exponential spatial correlation model.
## 
## Number of MCMC samples 2000.
## 
## Priors and hyperpriors:
##  beta flat.
##  sigma.sq IG hyperpriors shape=2.00000 and scale=1.00000
##  tau.sq IG hyperpriors shape=0.01000 and scale=0.00100
##  phi Unif hyperpriors a=0.05000 and b=3.00000
## -------------------------------------------------
##      Sampling
## -------------------------------------------------
## Sampled: 200 of 2000, 10.00%
## Report interval Metrop. Acceptance rate: 61.00%
## Overall Metrop. Acceptance rate: 61.00%
## -------------------------------------------------
## Sampled: 400 of 2000, 20.00%
## Report interval Metrop. Acceptance rate: 68.50%
## Overall Metrop. Acceptance rate: 64.75%
## -------------------------------------------------
## Sampled: 600 of 2000, 30.00%
## Report interval Metrop. Acceptance rate: 67.00%
## Overall Metrop. Acceptance rate: 65.50%
## -------------------------------------------------
## Sampled: 800 of 2000, 40.00%
## Report interval Metrop. Acceptance rate: 63.00%
## Overall Metrop. Acceptance rate: 64.88%
## -------------------------------------------------
## Sampled: 1000 of 2000, 50.00%
## Report interval Metrop. Acceptance rate: 70.50%
## Overall Metrop. Acceptance rate: 66.00%
## -------------------------------------------------
## Sampled: 1200 of 2000, 60.00%
## Report interval Metrop. Acceptance rate: 74.50%
## Overall Metrop. Acceptance rate: 67.42%
## -------------------------------------------------
## Sampled: 1400 of 2000, 70.00%
## Report interval Metrop. Acceptance rate: 66.50%
## Overall Metrop. Acceptance rate: 67.29%
## -------------------------------------------------
## Sampled: 1600 of 2000, 80.00%
## Report interval Metrop. Acceptance rate: 70.00%
## Overall Metrop. Acceptance rate: 67.62%
## -------------------------------------------------
## Sampled: 1800 of 2000, 90.00%
## Report interval Metrop. Acceptance rate: 72.50%
## Overall Metrop. Acceptance rate: 68.17%
## -------------------------------------------------
## Sampled: 2000 of 2000, 100.00%
## Report interval Metrop. Acceptance rate: 69.50%
## Overall Metrop. Acceptance rate: 68.30%
## -------------------------------------------------
## -------------------------------------------------
##      Recovering beta and w
## -------------------------------------------------
## Sampled: 99 of 1001, 9.89%
## Sampled: 199 of 1001, 19.88%
## Sampled: 299 of 1001, 29.87%
## Sampled: 399 of 1001, 39.86%
## Sampled: 499 of 1001, 49.85%
## Sampled: 599 of 1001, 59.84%
## Sampled: 699 of 1001, 69.83%
## Sampled: 799 of 1001, 79.82%
## Sampled: 899 of 1001, 89.81%
## Sampled: 999 of 1001, 99.80%
## -------------------------------------------------
##      Recovering beta
## -------------------------------------------------
## Sampled: 99 of 1001, 9.89%
## Sampled: 199 of 1001, 19.88%
## Sampled: 299 of 1001, 29.87%
## Sampled: 399 of 1001, 39.86%
## Sampled: 499 of 1001, 49.85%
## Sampled: 599 of 1001, 59.84%
## Sampled: 699 of 1001, 69.83%
## Sampled: 799 of 1001, 79.82%
## Sampled: 899 of 1001, 89.81%
## Sampled: 999 of 1001, 99.80%
## 
##  ------------------------ Spatial Mapping

Even the best GLM (probit) had a very poor \(R^2\) and no spatial structure, motivating the application of BHSM.

3.2 Binned empirical variogram

fitting.BHSM$emp.plot

Three potential functions were tested (for spatial covariance). The best kernel was get it by Gaussian model, where BHSM explains over 51.6% of variance (a significant leap from GLM).

3.3 Best model for variogram

fitting.BHSM$model
## Nonlinear regression model
##   model: value ~ model(distance, c0, c1, a)
##    data: data
##        c0        c1         a 
## 9.717e-03 5.457e-03 1.088e+02 
##  residual sum-of-squares: 9.722e-05
## 
## Number of iterations to convergence: 8 
## Achieved convergence tolerance: 1.49e-08

3.4 Posterior distribution of parameters

fitting.BHSM$plot.pparms

From figure, well-formed posterior densities suggest stable MCMC convergence.

  • \(\sigma^2\) (sigma.sq): sill (spatial variance) is between 0.1 and 0.6

  • \(\tau2\) (tau.sq): small nugget effect (0.01–0.03).

  • \(\phi\) (phi): show a moderate spatial range (0.05–0.25).

3.5 Posterior distribution of covariables

fitting.BHSM$plot.pcovars

The covariates explain part of the variability, but spatial effects carry additional predictive power.

  • Intercept is centered around 0, suggesting a balanced baseline probability.

  • lon: slight positive (location-based).

  • lat and elev shown mixed weak effects.

  • Nonlinear interaction effects with low-to-moderate impact.

3.6 Model diagnostics

fitting.BHSM$mod.diag

Diagnostics confirm model residuals are well-behaved:

a) Normality: Residuals tightly follow normal quantiles.

b) Histogram: Centered and symmetrical.

c) Independence: No significant autocorrelation remaining.

d) Homoscedasticity: Centered in zero.

3.7 Spatial mapping

fitting.BHSM$pred.sptMap

  • Predicted precipitation map shows smooth probability surfaces with high-probability areas align with known wet regions (e.g., northeastern and southeastern of India display high probability of exceeding 150 cm).
fitting.BHSM$se.sptMap

* Standard error map shows very low SE (\(\sim e{-16}\)) across India. This suggests extremely high model confidence, though may indicate over-smoothing or high model certainty due to small-scale nugget.

4 Final remarks