Also employ any other clustering method – K-medoid clustering or Hierarchical Clustering.
Cluster the same using latitude, longitude, elevation, distance to Bay of Bengal and distance to Arabian Sea.
Compare and summarize your findings.
#### Clear Memory and Set Options
rm(list=ls())
options(warn=-1)
save <- FALSE # Change to TRUE if you want to save the plot
# Set working directory (Modify if necessary)
script_dir <- dirname(rstudioapi::getActiveDocumentContext()$path)
setwd(script_dir)
library(fpc)
source("HW2_Library.R")
## Package 'sm', version 2.2-6.0: type help(sm) for summary information
## Loading required package: lattice
## Loading required package: KernSmooth
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## Loading required package: randomForest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: spam
## Spam version 2.11-1 (2025-01-20) is loaded.
## Type 'help( Spam)' or 'demo( spam)' for a short introduction
## and overview of this package.
## Help for individual functions is also obtained by adding the
## suffix '.spam' to the function name, e.g. 'help( chol.spam)'.
##
## Attaching package: 'spam'
## The following objects are masked from 'package:base':
##
## backsolve, forwardsolve
## Loading required package: viridisLite
##
## Try help(fields) to get started.
## locfit 1.5-9.11 2025-01-27
##
## Attaching package: 'MASS'
## The following object is masked from 'package:MPV':
##
## cement
## The following object is masked from 'package:sm':
##
## muscle
## Loading required package: stats4
##
## Attaching package: 'stats4'
## The following object is masked from 'package:spam':
##
## mle
## Loading required package: splines
## Loading required package: boot
##
## Attaching package: 'boot'
## The following objects are masked from 'package:VGAM':
##
## logit, simplex
## The following object is masked from 'package:MPV':
##
## motor
## The following object is masked from 'package:lattice':
##
## melanoma
## The following object is masked from 'package:sm':
##
## dogs
## Loading required package: CircStats
##
## Attaching package: 'CircStats'
## The following objects are masked from 'package:VGAM':
##
## dcard, rcard
## Loading required package: dtw
## Loading required package: proxy
##
## Attaching package: 'proxy'
## The following object is masked from 'package:spam':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loaded dtw v1.23-1. See ?dtw for help, citation("dtw") for use in publication.
##
## Attaching package: 'verification'
## The following object is masked from 'package:VGAM':
##
## exponential
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
## --------------------------------------------------------------
## Analysis of Geostatistical Data
## For an Introduction to geoR go to http://www.leg.ufpr.br/geoR
## geoR version 1.9-4 (built on 2024-02-14) is now loaded
## --------------------------------------------------------------
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
## The following object is masked from 'package:maps':
##
## map
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:fields':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
##
## Attaching package: 'cluster'
## The following object is masked from 'package:maps':
##
## votes.repub
##
## Attaching package: 'kohonen'
## The following object is masked from 'package:mclust':
##
## map
## The following object is masked from 'package:maps':
##
## map
# Color palette
myPalette <- colorRampPalette(rev(brewer.pal(9, "RdBu")), space="Lab")
nrows=68
ncols=65
ntime <- 67 #Jun-Sep 1950 - 2016
nyrs <- ntime
nglobe <- nrows*ncols
N <- nrows*ncols
### Lat - Long grid..
#ygrid=seq(6.5,38.5,by=0.25)
ygrid=seq(6.5,38.5,by=0.5)
ny=length(ygrid)
#xgrid=seq(66.5,100,by=0.25)
xgrid=seq(66.5,100,by=0.5)
nx=length(xgrid)
xygrid=matrix(0,nrow=nx*ny,ncol=2)
i=0
for(iy in 1:ny){
for(ix in 1:nx){
i=i+1
xygrid[i,1]=ygrid[iy]
xygrid[i,2]=xgrid[ix]
}
}
# Read India Gridded Rainfall data..
data=readBin("data/India-Rain-JJAS-05deg-1950-2016.r4",what="numeric", n=( nrows * ncols * ntime), size=4,endian="swap")
#data=readBin("data.r4",what="numeric", n=( nrows * ncols * ntime), size=4,endian="swap")
data <- array(data = data, dim=c( nrows, ncols, ntime ) )
data1=data[,,1]
# the lat -long data grid..
index=1:(nx*ny)
## pull only data and the locations with non-missing data
index1=index[data1 != "NaN"] # only non-missing data.
xygrid1=xygrid[index1,]
nsites=length(index1)
data2=data1[index1]
### Rain data matrix - rows are seasonal (i.e. one value per year)
## and columns are locations
raindata=matrix(NA,nrow=nyrs, ncol=nsites)
for(i in 1:nyrs){
data1=data[,,i]
index1=index[ data1 != "NaN"]
data2=data1[index1]
raindata[i,]=data2
}
index = 1:dim(raindata)[2]
xx = apply(raindata,2,mean)
index2 = index1[xx > 0]
index3 = index[xx > 0]
xygrid1=xygrid[index2,]
rainavg = raindata[,index3]
indexgrid = index2
rm("data") #remove the object data to clear up space
seasonal_avg <- colMeans(rainavg)
X1 <- cbind(xygrid1, seasonal_avg)
colnames(X1) <- c("Latitude", "Longitude", "SeasonalAvg")
distance_data <- read.table("data/Rajgrid-lon-lat-elev-dista-distb.txt", header=FALSE)
names(distance_data) <- c("Longitude", "Latitude", "Elevation", "Dist_ArabianSea", "Dist_BayOfBengal")
#### Merge the two datasets
X <- merge(X1, distance_data, by=c("Longitude", "Latitude"))
X <- X[, c("Longitude", "Latitude", "Elevation", "Dist_ArabianSea", "Dist_BayOfBengal", "SeasonalAvg")]
prec_obs <- X$SeasonalAvg # Extract precipitation column
precip <- X[,1:3]
names(precip) = c("Long","Lat","Elev")
lon = precip[,1]
lat = precip[,2]
precip$Prec = prec_obs
precip_scale = scale(precip)
Niter=50
WSS = matrix(nrow=Niter, ncol=10)
for (j in 1:Niter) {
WSS[j,1]= (nrow(precip_scale)-1)*sum(apply(precip_scale,2,var))#with elev
for (i in 2:10) {
WSS[j,i] = sum(kmeans(precip_scale, centers=i)$withinss)
}}
wss = apply(WSS, 2, FUN = mean)
g1 = ggplot() +
geom_line(mapping = aes(1:10, wss)) +
geom_point(mapping = aes(1:10, wss), shape = 21, size = 3, color = "gray30", fill ="cadetblue1")+
labs(title = "WSS vs. # of Clusters",x = "Number of Clusters",y = "WSS (Within-Group Sum of Squares)")+
theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))+
scale_x_continuous(breaks = 1:10)
show(g1)
slope=wss[2:length(wss)]-wss[1:(length(wss)-1)]
kbest = which(slope== max(slope)) # optimal number of clusters
print(sprintf("The optimal number of clusters is %s", kbest))
## [1] "The optimal number of clusters is 8"
fit <- kmeans(precip_scale, kbest)
# get cluster means
means = aggregate(precip,by=list(fit$cluster),FUN=mean)
# append cluster assignment
Cluster=factor(fit$cluster)
data_1 = data.frame(precip,Cluster)
# Plot Precipitation Clusters Over India
n_clusters <- length(unique(data_1$Cluster))
cluster_colors <- rainbow(n_clusters)
cluster_numeric <- as.numeric(data_1$Cluster)
# Plot clusters over India
plot(data_1$Long, data_1$Lat,
col = cluster_colors[cluster_numeric],
pch = 21, bg = cluster_colors[cluster_numeric],
xlim = c(65, 100), ylim = c(5, 40),
xlab = "Longitude", ylab = "Latitude",
main = "Clustering of Seasonal Average Precipitation over India")
maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5)
grid()
points(means$Long, means$Lat,
pch = 2, # triangle point
cex = 1, # size
lwd = 2, # stroke thickness
col = "black") # outline color
text(means$Long, means$Lat, labels = means$Group.1,
pos = 3, cex = 0.8)
legend("topright", legend = paste("Cluster", 1:n_clusters),
fill = cluster_colors[1:n_clusters], border = "black",
title = "Cluster", bg = "white", box.lwd = 0.5)
fit <- pamk(precip_scale, krange = 2:10, scaling = F)
print(sprintf("The optimal number of clusters is %s",fit$nc))
## [1] "The optimal number of clusters is 2"
# get cluster means
means = aggregate(precip, by = list(fit$pamobject$cluster), FUN = mean)
# append cluster assignment
Cluster=factor(fit$pamobject$clustering)
data_2 = data.frame(precip,Cluster)
# Plot Precipitation Clusters Over India
n_clusters <- length(unique(data_2$Cluster))
cluster_colors <- rainbow(n_clusters)
cluster_numeric <- as.numeric(data_2$Cluster)
# Plot clusters over India
plot(data_2$Long, data_2$Lat,
col = cluster_colors[cluster_numeric],
pch = 21, bg = cluster_colors[cluster_numeric],
xlim = c(65, 100), ylim = c(5, 40),
xlab = "Longitude", ylab = "Latitude",
main = "Clustering of Seasonal Average Precipitation over India")
maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5)
grid()
points(means$Long, means$Lat,
pch = 2, # triangle point
cex = 1, # size
lwd = 2, # stroke thickness
col = "black") # outline color
text(means$Long, means$Lat, labels = means$Group.1,
pos = 3, cex = 0.8)
legend("topright", legend = paste("Cluster", 1:n_clusters),
fill = cluster_colors[1:n_clusters], border = "black",
title = "Cluster", bg = "white", box.lwd = 0.5)
# Compute distance matrix
dist_matrix <- dist(precip_scale)
# Perform hierarchical clustering (Ward's method)
hc <- hclust(dist_matrix, method = "ward.D2")
# Find best number of clusters using silhouette width
sil_widths <- numeric(9)
for (k in 2:10) {
clusters <- cutree(hc, k)
sil <- silhouette(clusters, dist_matrix)
sil_widths[k - 1] <- mean(sil[, 3])
}
hc_kbest <- which.max(sil_widths) + 1
print(sprintf("Optimal number of clusters for hierarchical clustering is %s", hc_kbest))
## [1] "Optimal number of clusters for hierarchical clustering is 2"
# Cut dendrogram into kbest clusters
hc_clusters <- cutree(hc, k = hc_kbest)
# Get cluster means
means = aggregate(precip, by = list(hc_clusters), FUN = mean)
# Append cluster assignment
Cluster = factor(hc_clusters)
data_3 = data.frame(precip, Cluster)
# Plot Precipitation Clusters Over India
n_clusters <- length(unique(data_3$Cluster))
cluster_colors <- rainbow(n_clusters)
cluster_numeric <- as.numeric(data_3$Cluster)
# Plot clusters over India
plot(data_3$Long, data_3$Lat,
col = cluster_colors[cluster_numeric],
pch = 21, bg = cluster_colors[cluster_numeric],
xlim = c(65, 100), ylim = c(5, 40),
xlab = "Longitude", ylab = "Latitude",
main = "Hierarchical Clustering of Seasonal Average Precipitation over India")
# Add world map
maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5)
# Add grid
grid()
# Add centroids
points(means$Long, means$Lat,
pch = 2, cex = 1, lwd = 2, col = "black")
# Add labels for each centroid
text(means$Long, means$Lat, labels = means$Group.1,
pos = 3, cex = 0.8)
# Add legend
legend("topright", legend = paste("Cluster", 1:n_clusters),
fill = cluster_colors[1:n_clusters], border = "black",
title = "Cluster", bg = "white", box.lwd = 0.5)
## Cluster the seasonal average precipitation
precip_d <- X[,1:5]
names(precip_d) = c("Long", "Lat", "Elev", "DistArabian", "DistBay")
lon = precip_d[,1]
lat = precip_d[,2]
precip_d$Prec = prec_obs
precip_d_scale = scale(precip_d)
Niter=50
WSS = matrix(nrow=Niter, ncol=10)
for (j in 1:Niter) {
WSS[j,1]= (nrow(precip_d_scale)-1)*sum(apply(precip_d_scale,2,var))#with elev,distA,distB
for (i in 2:10) {
WSS[j,i] = sum(kmeans(precip_d_scale, centers=i)$withinss)
}}
wss = apply(WSS, 2, FUN = mean)
g1 = ggplot() +
geom_line(mapping = aes(1:10, wss)) +
geom_point(mapping = aes(1:10, wss), shape = 21, size = 3, color = "gray30", fill ="cadetblue1")+
labs(title = "WSS vs. # of Clusters",x = "Number of Clusters",y = "WSS (Within-Group Sum of Squares)")+
theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))+
scale_x_continuous(breaks = 1:10)
show(g1)
slope=wss[2:length(wss)]-wss[1:(length(wss)-1)]
kbest = which(slope== max(slope)) # optimal number of clusters
print(sprintf("The optimal number of clusters is %s", kbest))
## [1] "The optimal number of clusters is 9"
fit <- kmeans(precip_d_scale, kbest)
# get cluster means
means = aggregate(precip_d,by=list(fit$cluster),FUN=mean)
# append cluster assignment
Cluster=factor(fit$cluster)
data_1 = data.frame(precip_d,Cluster)
# Plot Precipitation Clusters Over India
n_clusters <- length(unique(data_1$Cluster))
cluster_colors <- rainbow(n_clusters)
cluster_numeric <- as.numeric(data_1$Cluster)
# Plot clusters over India
plot(data_1$Long, data_1$Lat,
col = cluster_colors[cluster_numeric],
pch = 21, bg = cluster_colors[cluster_numeric],
xlim = c(65, 100), ylim = c(5, 40),
xlab = "Longitude", ylab = "Latitude",
main = "Clustering of Seasonal Average Precipitation over India")
maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5)
grid()
points(means$Long, means$Lat,
pch = 2, # triangle point
cex = 1, # size
lwd = 2, # stroke thickness
col = "black") # outline color
text(means$Long, means$Lat, labels = means$Group.1,
pos = 3, cex = 0.8)
legend("topright", legend = paste("Cluster", 1:n_clusters),
fill = cluster_colors[1:n_clusters], border = "black",
title = "Cluster", bg = "white", box.lwd = 0.5)
fit <- pamk(precip_d_scale, krange = 2:10, scaling = F)
print(sprintf("The optimal number of clusters is %s",fit$nc))
## [1] "The optimal number of clusters is 4"
# get cluster means
means = aggregate(precip_d, by = list(fit$pamobject$cluster), FUN = mean)
# append cluster assignment
Cluster=factor(fit$pamobject$clustering)
data_2 = data.frame(precip_d,Cluster)
# Plot Precipitation Clusters Over India
n_clusters <- length(unique(data_2$Cluster))
cluster_colors <- rainbow(n_clusters)
cluster_numeric <- as.numeric(data_2$Cluster)
# Plot clusters over India
plot(data_2$Long, data_2$Lat,
col = cluster_colors[cluster_numeric],
pch = 21, bg = cluster_colors[cluster_numeric],
xlim = c(65, 100), ylim = c(5, 40),
xlab = "Longitude", ylab = "Latitude",
main = "Clustering of Seasonal Average Precipitation over India")
maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5)
grid()
points(means$Long, means$Lat,
pch = 2, # triangle point
cex = 1, # size
lwd = 2, # stroke thickness
col = "black") # outline color
text(means$Long, means$Lat, labels = means$Group.1,
pos = 3, cex = 0.8)
legend("topright", legend = paste("Cluster", 1:n_clusters),
fill = cluster_colors[1:n_clusters], border = "black",
title = "Cluster", bg = "white", box.lwd = 0.5)
# Compute distance matrix
dist_matrix_d <- dist(precip_d_scale)
# Perform hierarchical clustering (Ward's method)
hc <- hclust(dist_matrix_d, method = "ward.D2")
# Find best number of clusters using silhouette width
sil_widths <- numeric(9)
for (k in 2:10) {
clusters <- cutree(hc, k)
sil <- silhouette(clusters, dist_matrix_d)
sil_widths[k - 1] <- mean(sil[, 3])
}
hc_kbest_d <- which.max(sil_widths) + 1
print(sprintf("Optimal number of clusters for hierarchical clustering is %s", hc_kbest_d))
## [1] "Optimal number of clusters for hierarchical clustering is 3"
# Cut dendrogram into kbest clusters
hc_clusters <- cutree(hc, k = hc_kbest_d)
# Get cluster means
means = aggregate(precip_d, by = list(hc_clusters), FUN = mean)
# Append cluster assignment
Cluster = factor(hc_clusters)
data_3 = data.frame(precip_d, Cluster)
# Plot Precipitation Clusters Over India
n_clusters <- length(unique(data_3$Cluster))
cluster_colors <- rainbow(n_clusters)
cluster_numeric <- as.numeric(data_3$Cluster)
# Plot clusters over India
plot(data_3$Long, data_3$Lat,
col = cluster_colors[cluster_numeric],
pch = 21, bg = cluster_colors[cluster_numeric],
xlim = c(65, 100), ylim = c(5, 40),
xlab = "Longitude", ylab = "Latitude",
main = "Hierarchical Clustering of Seasonal Average Precipitation over India")
# Add world map
maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5)
# Add grid
grid()
# Add centroids
points(means$Long, means$Lat,
pch = 2, cex = 1, lwd = 2, col = "black")
# Add labels for each centroid
text(means$Long, means$Lat, labels = means$Group.1,
pos = 3, cex = 0.8)
# Add legend
legend("topright", legend = paste("Cluster", 1:n_clusters),
fill = cluster_colors[1:n_clusters], border = "black",
title = "Cluster", bg = "white", box.lwd = 0.5)
(i)Clustering based on Latitude, Longitude, Elevation Kmeans: (Optimal Clusters:9) Clusters form compact spatial groupings K-medoids: (Optimal Clusters:2) A coarse split, possibly separating east-west or coastal-inland. Hierarchical: (Optimal Clusters:2) Similar to K-medoids, divides the region into two broad spatial regimes.
K-means captured more nuanced structure, while K-medoids and Hierarchical produced broader separations. This suggests that elevation and spatial clustering benefit from K-means’ ability to capture finer granularity.
(ii)Clustering with Latitude, Longitude, Elevation and distance to Bay of Bengal and distance to Arabian Sea K-means: (Optimal Clusters:9) Similar structure as (i), but with clusters slightly aligned by coastal distance. K-medoids: (Optimal Clusters:4) More refined than (i); coastal proximity played a clearer role. Hierarchical: (Optimal Clusters:3) Broader separation, clearer separation of coastal vs inland clusters.
Adding distance to coasts improved the distinction between coastal and inland regions, which matters for rainfall patterns driven by monsoonal flows.
K-means consistently finds more detailed structure, which is good for nuanced regional classification. K-medoids and Hierarchical methods find more coarse divisions.
Including distance to Arabian Sea and Bay of Bengal enhanced clustering, capturing the influence of coastal proximity on precipitation patterns.