Catalina Jerez
catalina.jerez@colorado.edu

The goal of this analysis is to estimate the underlying precipitation surface in India using sparse spatial observations. This is crucial for applications such as floodplain management, natural hazard mitigation, and climate modeling. The study utilizes regression models, crossvalidation, and spatial mapping techniques.

We’ll fit a Generalized Linear Model (and Local GLM) using binomial family. Then, let \(P_i\) be the precipitation at station \(i\). We define the binary variable \(B_{P_i}\) as:

\[ B_{P_i} = \begin{cases} 1 & \text{if } P_i \geq 150 \text{ cm} \\ 0 & \text{if } P_i < 150 \text{ cm} \end{cases} \]

Regressions will be executed with the fitRegression() function, which returns:

Best fitted model.
Scatterplot of simulated vs predicted.
Model diagnostics
Spatially map for fitted and SE values.

1 Library

gc()

##           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells  543113 29.1    1201611 64.2         NA   700280 37.4
## Vcells 1002380  7.7    8388608 64.0      36864  1963567 15.0

rm(list = ls())
# Load necessary libraries
spatialAnalysis.Lib     = c("sp", "sf","geosphere", "geoR", "maps", "fields",
                             "rnaturalearth", "rnaturalearthdata")
smoothing.Lib           = c("mgcv", "locfit", "akima")
statisticalAnalysis.Lib = c("MASS", "psych", "hydroGOF", "fields", "leaps", "car")
dataManipulation.Lib    = c("dplyr", "reshape2", "reshape", "tidyr")
dataVisualization.Lib   = c("ggplot2", "GGally", "ggpubr", "scales",
                            "RColorBrewer")
modeling.Lib            = c("nnet", "MASS", "verification")
list.packages           = unique(c(spatialAnalysis.Lib, statisticalAnalysis.Lib,
                                   smoothing.Lib, modeling.Lib,
                                   dataManipulation.Lib, dataVisualization.Lib))
# Load all required packages
sapply(list.packages, require, character.only = TRUE)

##                sp                sf         geosphere              geoR 
##              TRUE              TRUE              TRUE              TRUE 
##              maps            fields     rnaturalearth rnaturalearthdata 
##              TRUE              TRUE              TRUE              TRUE 
##              MASS             psych          hydroGOF             leaps 
##              TRUE              TRUE              TRUE              TRUE 
##               car              mgcv            locfit             akima 
##              TRUE              TRUE              TRUE              TRUE 
##              nnet      verification             dplyr          reshape2 
##              TRUE              TRUE              TRUE              TRUE 
##           reshape             tidyr           ggplot2            GGally 
##              TRUE              TRUE              TRUE              TRUE 
##            ggpubr            scales      RColorBrewer 
##              TRUE              TRUE              TRUE

2 Load data and pre-process

# Path -------------------------------------------------------------------------
path      = "/Users/caje3244/OneDrive - UCB-O365/2025 Spring/CVEN 6833/"
path.code = file.path(path, "Codes/")
path.plot = file.path(path, "Figures/")

# load functions to perform regression
source(file = file.path(path.code, "libraryRegression.R"))

# Data -------------------------------------------------------------------------
topography = read.table("http://civil.colorado.edu/~balajir/CVEN6833/HWs/HW-1/india-grid-topo.txt")
colnames(topography)    = c("lon", "lat", "elev")

# new climatol data with calculated distances from Geodesic method
climaGeo = readRDS(file = paste0(path.code, "RData/hw1_climatol_distGeodesic.rds"))
climaGeo = na.omit(climaGeo)
climaGeo$prec = climaGeo$prec/10 # mm to cm
coords = dplyr::select(climaGeo, c(lon, lat))
# scale all vars dependent of lon and lat 
climaGeo = climaGeo %>%
  dplyr::mutate(across(!c(prec), scale, center = TRUE, scale = TRUE)) %>%
  dplyr::mutate(across(!c(prec), as.numeric))

scale.color  = c("#d73027","#313695")
scale.label  = unique(cut_interval(climaGeo$prec, n=2))
pairs.ggplot(climaGeo)

## Set threshold

climaGeo$prec = ifelse(climaGeo$prec > 150, 1, 0)

2.1 Basic analysis

set.seed(250504)
# v0) basic linear model 
basic.model = lm(prec ~ ., data = climaGeo)
# If the VIF is larger than 1/(1-R2)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity

##       lon       lat      elev   mdCoast mdArabian  mdBengal 
##      TRUE     FALSE     FALSE     FALSE      TRUE      TRUE

cat("Since for some variables VIF is larger than 1/(1-R2), let's remove:
mdArabian and mdBengal and just keep the mdCoast (minimun distance to any coast) \n")

## Since for some variables VIF is larger than 1/(1-R2), let's remove:
## mdArabian and mdBengal and just keep the mdCoast (minimun distance to any coast)

# v1) linear model 
climaGeo    = dplyr::select(climaGeo, -c(mdArabian, mdBengal))
basic.model = lm(prec ~ ., data = climaGeo)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity

##     lon     lat    elev mdCoast 
##   FALSE   FALSE   FALSE   FALSE

# v2) basic linear + interacterions
cat("Cool, for all the variables VIF < 1/(1-R2)
Let's add some interactiones between the variables \n")

## Cool, for all the variables VIF < 1/(1-R2)
## Let's add some interactiones between the variables

climaGeo    = climaGeo %>%
  dplyr::mutate(lld  = lon * lat * mdCoast,
                ll2d = lon * lat^2 * mdCoast, 
                l2ld = lon^2 * lat * mdCoast,
                ed   = elev * mdCoast
                )

basic.model = lm(prec ~ ., data = climaGeo)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity

##     lon     lat    elev mdCoast     lld    ll2d    l2ld      ed 
##   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE

cat("Since for all the variables VIF < 1/(1-R2)
Let's proceed with other regression models \n")

## Since for all the variables VIF < 1/(1-R2)
## Let's proceed with other regression models

3 Generalized Linear Model

3.1 Fitting GLM

fitting.glm_Bin = fitRegression(model.type = "GLM", data = climaGeo, 
                                coords = coords, topography = topography,
                                suffix.fig = "HW2/P2/FigP2_GLM", family = "binomial",
                                path = path)

## 
##  ------------------------ Fitting GLM model
## 
## ------------ Fitting GLM with binomial + logit link
## 
## ------------ Fitting GLM with binomial + probit link
## 
## ------------ Fitting GLM with binomial + cauchit link
## 
## Model link comparison:
## Model summary:
##           model        MSE      RMSE          R2      AIC      BIC
## logit     logit  22.065688  4.697413  -175.63058 1471.248 1492.280
## probit   probit   5.538158  2.353329   -43.33163 1131.187 1152.219
## cauchit cauchit 175.929668 13.263848 -1407.27510 1981.958 2002.990
## 
## Best link function: probit
##   MSE = 5.538
##   RMSE = 2.353
##   R2 = -43.332
##   AIC = 1131.187
##   BIC = 1152.219
## 
##  ------------------------ Summary

## 
##  ------------------------ ANOVA
## Analysis of Deviance Table
## 
## Model: binomial, link: probit
## 
## Response: prec
## 
## Terms added sequentially (first to last)
## 
## 
##         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                      245    204.825              
## lon      1   22.978       244    181.846 1.638e-06 ***
## mdCoast  1   40.460       243    141.386 2.007e-10 ***
## lld      1   44.499       242     96.888 2.545e-11 ***
## ll2d     1    2.348       241     94.539    0.1254    
## l2ld     1    2.682       240     91.858    0.1015    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
##  ------------------------ Spatial Mapping

3.2 Best fitted model

fitting.glm_Bin$model

## 
## Call:  glm(formula = prec ~ ., family = family.func(link = links[j]), 
##     data = data, start = start.values, maxit = 1000)
## 
## Coefficients:
## (Intercept)          lon      mdCoast          lld         ll2d         l2ld  
##    -2.12014      0.58005     -0.43261     -2.07166      0.04324      0.50146  
## 
## Degrees of Freedom: 245 Total (i.e. Null);  240 Residual
## Null Deviance:       204.8 
## Residual Deviance: 91.86     AIC: 103.9

3.3 Scatterplot of observed vs modeled precipitation

fitting.glm_Bin$scatter

3.4 Model diagnostics

fitting.glm_Bin$anova

fitting.glm_Bin$mod.diag

From ANOVA: lon, mdCoast, and lld are statistically significant.

And model diagnostics indicates:

Q-Q plot shows minor deviations from normality.
Histogram and ACF indicate residuals have some structure.
Residuals vs Predicted shows heteroskedasticity and curvature, which suggests nonlinearity.

3.5 Spatial mapping

fitting.glm_Bin$pred.sptMap

fitting.glm_Bin$se.sptMap

4 Local GLM

4.1 Fitting locGLM

source(file = file.path(path.code, "libraryRegression.R"))
fitting.loc_glm = fitRegression(model.type = "locPoly", data = climaGeo, 
                                coords = coords, topography = topography,
                                suffix.fig = "HW2/P2/FigP2_LocGLM", family = "binomial",
                                path = path)

## 
##  ------------------------ Fitting locPoly model
## 
## ------------ Fitting local Polynomial with binomial + logit link
## Kernel: bisq | Alpha: 0.050 | Degree: 1
## 
## ------------ Fitting local Polynomial with binomial + probit link
## Kernel: bisq | Alpha: 0.050 | Degree: 1
## 
## ------------ Fitting local Polynomial with binomial + cloglog link
## Kernel: bisq | Alpha: 0.050 | Degree: 1
## 
## Model link comparison:
## Model summary:
##           model          MSE         RMSE R2
## logit     logit 1.392645e-11 3.731816e-06  1
## probit   probit 1.392645e-11 3.731816e-06  1
## cloglog cloglog 1.392645e-11 3.731816e-06  1
## 
## Best link function: logit
##   MSE = 0.000
##   RMSE = 0.000
##   R2 = 1.000
## 
##  ------------------------ Summary

## 
##  ------------------------ Spatial Mapping

4.2 Best fitted model

fitting.loc_glm$model

## Call:
## locfit(formula = prec ~ ., data = data, alpha = best.alpha, maxk = 1000, 
##     deg = best.deg, kern = best.kern, ev = dat(), family = family, 
##     link = links[j])
## 
## Number of observations:          246 
## Family:  Logistic 
## Fitted Degrees of freedom:       0 
## Residual scale:                  1

4.2.1 Scatterplot of observed vs modeled precipitation

fitting.loc_glm$scatter

4.3 Model diagnostics

fitting.loc_glm$mod.diag

4.4 Spatial mapping

fitting.loc_glm$pred.sptMap

fitting.loc_glm$se.sptMap

5 Summary

GLM is interpretable and statistically valid, but has weak predictive performance.
The local GLM fits the data extremely well, but MSE = 0 and perfect \(R^2\) suggest zero generalization to unseen data.

CVEN 6833: Advanced Data Analysis Techniques

Homework 2
Problem 2: Underlying precipitation surface under GLM

1 Library

2 Load data and pre-process

2.1 Basic analysis

3 Generalized Linear Model

3.1 Fitting GLM

3.2 Best fitted model

3.3 Scatterplot of observed vs modeled precipitation

3.4 Model diagnostics

3.5 Spatial mapping

4 Local GLM

4.1 Fitting locGLM

4.2 Best fitted model

4.2.1 Scatterplot of observed vs modeled precipitation

4.3 Model diagnostics

4.4 Spatial mapping

5 Summary

CVEN 6833: Advanced Data Analysis Techniques

Homework 2 Problem 2: Underlying precipitation surface under GLM

1 Library

2 Load data and pre-process

2.1 Basic analysis

3 Generalized Linear Model

3.1 Fitting GLM

3.2 Best fitted model

3.3 Scatterplot of observed vs modeled precipitation

3.4 Model diagnostics

3.5 Spatial mapping

4 Local GLM

4.1 Fitting locGLM

4.2 Best fitted model

4.2.1 Scatterplot of observed vs modeled precipitation

4.3 Model diagnostics

4.4 Spatial mapping

5 Summary

Homework 2
Problem 2: Underlying precipitation surface under GLM