Catalina Jerez
catalina.jerez@colorado.edu
The goal of this analysis is to estimate the underlying precipitation surface in India using sparse spatial observations. This is crucial for applications such as floodplain management, natural hazard mitigation, and climate modeling. The study utilizes regression models, crossvalidation, and spatial mapping techniques.
We’ll fit a Generalized Linear Model (and Local GLM) using binomial family. Then, let \(P_i\) be the precipitation at station \(i\). We define the binary variable \(B_{P_i}\) as:
\[ B_{P_i} = \begin{cases} 1 & \text{if } P_i \geq 150 \text{ cm} \\ 0 & \text{if } P_i < 150 \text{ cm} \end{cases} \]
Regressions will be executed with the fitRegression() function, which returns:
Best fitted model.
Scatterplot of simulated vs predicted.
Model diagnostics
Spatially map for fitted and SE values.
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 543113 29.1 1201611 64.2 NA 700280 37.4
## Vcells 1002380 7.7 8388608 64.0 36864 1963567 15.0
rm(list = ls())
# Load necessary libraries
spatialAnalysis.Lib = c("sp", "sf","geosphere", "geoR", "maps", "fields",
"rnaturalearth", "rnaturalearthdata")
smoothing.Lib = c("mgcv", "locfit", "akima")
statisticalAnalysis.Lib = c("MASS", "psych", "hydroGOF", "fields", "leaps", "car")
dataManipulation.Lib = c("dplyr", "reshape2", "reshape", "tidyr")
dataVisualization.Lib = c("ggplot2", "GGally", "ggpubr", "scales",
"RColorBrewer")
modeling.Lib = c("nnet", "MASS", "verification")
list.packages = unique(c(spatialAnalysis.Lib, statisticalAnalysis.Lib,
smoothing.Lib, modeling.Lib,
dataManipulation.Lib, dataVisualization.Lib))
# Load all required packages
sapply(list.packages, require, character.only = TRUE)
## sp sf geosphere geoR
## TRUE TRUE TRUE TRUE
## maps fields rnaturalearth rnaturalearthdata
## TRUE TRUE TRUE TRUE
## MASS psych hydroGOF leaps
## TRUE TRUE TRUE TRUE
## car mgcv locfit akima
## TRUE TRUE TRUE TRUE
## nnet verification dplyr reshape2
## TRUE TRUE TRUE TRUE
## reshape tidyr ggplot2 GGally
## TRUE TRUE TRUE TRUE
## ggpubr scales RColorBrewer
## TRUE TRUE TRUE
# Path -------------------------------------------------------------------------
path = "/Users/caje3244/OneDrive - UCB-O365/2025 Spring/CVEN 6833/"
path.code = file.path(path, "Codes/")
path.plot = file.path(path, "Figures/")
# load functions to perform regression
source(file = file.path(path.code, "libraryRegression.R"))
# Data -------------------------------------------------------------------------
topography = read.table("http://civil.colorado.edu/~balajir/CVEN6833/HWs/HW-1/india-grid-topo.txt")
colnames(topography) = c("lon", "lat", "elev")
# new climatol data with calculated distances from Geodesic method
climaGeo = readRDS(file = paste0(path.code, "RData/hw1_climatol_distGeodesic.rds"))
climaGeo = na.omit(climaGeo)
climaGeo$prec = climaGeo$prec/10 # mm to cm
coords = dplyr::select(climaGeo, c(lon, lat))
# scale all vars dependent of lon and lat
climaGeo = climaGeo %>%
dplyr::mutate(across(!c(prec), scale, center = TRUE, scale = TRUE)) %>%
dplyr::mutate(across(!c(prec), as.numeric))
scale.color = c("#d73027","#313695")
scale.label = unique(cut_interval(climaGeo$prec, n=2))
pairs.ggplot(climaGeo)
## Set threshold
climaGeo$prec = ifelse(climaGeo$prec > 150, 1, 0)
set.seed(250504)
# v0) basic linear model
basic.model = lm(prec ~ ., data = climaGeo)
# If the VIF is larger than 1/(1-R2)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity
## lon lat elev mdCoast mdArabian mdBengal
## TRUE FALSE FALSE FALSE TRUE TRUE
cat("Since for some variables VIF is larger than 1/(1-R2), let's remove:
mdArabian and mdBengal and just keep the mdCoast (minimun distance to any coast) \n")
## Since for some variables VIF is larger than 1/(1-R2), let's remove:
## mdArabian and mdBengal and just keep the mdCoast (minimun distance to any coast)
# v1) linear model
climaGeo = dplyr::select(climaGeo, -c(mdArabian, mdBengal))
basic.model = lm(prec ~ ., data = climaGeo)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity
## lon lat elev mdCoast
## FALSE FALSE FALSE FALSE
# v2) basic linear + interacterions
cat("Cool, for all the variables VIF < 1/(1-R2)
Let's add some interactiones between the variables \n")
## Cool, for all the variables VIF < 1/(1-R2)
## Let's add some interactiones between the variables
climaGeo = climaGeo %>%
dplyr::mutate(lld = lon * lat * mdCoast,
ll2d = lon * lat^2 * mdCoast,
l2ld = lon^2 * lat * mdCoast,
ed = elev * mdCoast
)
basic.model = lm(prec ~ ., data = climaGeo)
vif(basic.model) > (1/(1-summary(basic.model)$r.squared)) # check multicolinearity
## lon lat elev mdCoast lld ll2d l2ld ed
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
cat("Since for all the variables VIF < 1/(1-R2)
Let's proceed with other regression models \n")
## Since for all the variables VIF < 1/(1-R2)
## Let's proceed with other regression models
fitting.glm_Bin = fitRegression(model.type = "GLM", data = climaGeo,
coords = coords, topography = topography,
suffix.fig = "HW2/P2/FigP2_GLM", family = "binomial",
path = path)
##
## ------------------------ Fitting GLM model
##
## ------------ Fitting GLM with binomial + logit link
##
## ------------ Fitting GLM with binomial + probit link
##
## ------------ Fitting GLM with binomial + cauchit link
##
## Model link comparison:
## Model summary:
## model MSE RMSE R2 AIC BIC
## logit logit 22.065688 4.697413 -175.63058 1471.248 1492.280
## probit probit 5.538158 2.353329 -43.33163 1131.187 1152.219
## cauchit cauchit 175.929668 13.263848 -1407.27510 1981.958 2002.990
##
## Best link function: probit
## MSE = 5.538
## RMSE = 2.353
## R2 = -43.332
## AIC = 1131.187
## BIC = 1152.219
##
## ------------------------ Summary
##
## ------------------------ ANOVA
## Analysis of Deviance Table
##
## Model: binomial, link: probit
##
## Response: prec
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 245 204.825
## lon 1 22.978 244 181.846 1.638e-06 ***
## mdCoast 1 40.460 243 141.386 2.007e-10 ***
## lld 1 44.499 242 96.888 2.545e-11 ***
## ll2d 1 2.348 241 94.539 0.1254
## l2ld 1 2.682 240 91.858 0.1015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------ Spatial Mapping
fitting.glm_Bin$model
##
## Call: glm(formula = prec ~ ., family = family.func(link = links[j]),
## data = data, start = start.values, maxit = 1000)
##
## Coefficients:
## (Intercept) lon mdCoast lld ll2d l2ld
## -2.12014 0.58005 -0.43261 -2.07166 0.04324 0.50146
##
## Degrees of Freedom: 245 Total (i.e. Null); 240 Residual
## Null Deviance: 204.8
## Residual Deviance: 91.86 AIC: 103.9
fitting.glm_Bin$scatter
fitting.glm_Bin$anova
fitting.glm_Bin$mod.diag
From ANOVA: lon, mdCoast, and lld are statistically significant.
And model diagnostics indicates:
Q-Q plot shows minor deviations from normality.
Histogram and ACF indicate residuals have some structure.
Residuals vs Predicted shows heteroskedasticity and curvature, which suggests nonlinearity.
fitting.glm_Bin$pred.sptMap
fitting.glm_Bin$se.sptMap
source(file = file.path(path.code, "libraryRegression.R"))
fitting.loc_glm = fitRegression(model.type = "locPoly", data = climaGeo,
coords = coords, topography = topography,
suffix.fig = "HW2/P2/FigP2_LocGLM", family = "binomial",
path = path)
##
## ------------------------ Fitting locPoly model
##
## ------------ Fitting local Polynomial with binomial + logit link
## Kernel: bisq | Alpha: 0.050 | Degree: 1
##
## ------------ Fitting local Polynomial with binomial + probit link
## Kernel: bisq | Alpha: 0.050 | Degree: 1
##
## ------------ Fitting local Polynomial with binomial + cloglog link
## Kernel: bisq | Alpha: 0.050 | Degree: 1
##
## Model link comparison:
## Model summary:
## model MSE RMSE R2
## logit logit 1.392645e-11 3.731816e-06 1
## probit probit 1.392645e-11 3.731816e-06 1
## cloglog cloglog 1.392645e-11 3.731816e-06 1
##
## Best link function: logit
## MSE = 0.000
## RMSE = 0.000
## R2 = 1.000
##
## ------------------------ Summary
##
## ------------------------ Spatial Mapping
fitting.loc_glm$model
## Call:
## locfit(formula = prec ~ ., data = data, alpha = best.alpha, maxk = 1000,
## deg = best.deg, kern = best.kern, ev = dat(), family = family,
## link = links[j])
##
## Number of observations: 246
## Family: Logistic
## Fitted Degrees of freedom: 0
## Residual scale: 1
fitting.loc_glm$scatter
fitting.loc_glm$mod.diag
fitting.loc_glm$pred.sptMap
fitting.loc_glm$se.sptMap
GLM is interpretable and statistically valid, but has weak predictive performance.
The local GLM fits the data extremely well, but MSE = 0 and perfect \(R^2\) suggest zero generalization to unseen data.