Cluster Analysis

  1. (i)Cluster the seasonal average precipitation using latitude, longitude and elevation (over average value at each location) The clustering involves:
  1. identify the best number of clusters, say, Kbest Select a desired number of clusters, say, j; cluster the data and computer the WSS; repeat for j=1,10; plot j versus WSS; and select the best number of clusters Kbest, where the WSS starts to saturate.
  2. cluster the data into Kbest clusters and display them spatially.
  1. Also employ any other clustering method – K-medoid clustering or Hierarchical Clustering.

  2. Cluster the same using latitude, longitude, elevation, distance to Bay of Bengal and distance to Arabian Sea.

Compare and summarize your findings.

#### Clear Memory and Set Options
rm(list=ls())
options(warn=-1)
save <- FALSE  # Change to TRUE if you want to save the plot

# Set working directory (Modify if necessary)
script_dir <- dirname(rstudioapi::getActiveDocumentContext()$path)
setwd(script_dir)

library(fpc)

source("HW2_Library.R")
## Package 'sm', version 2.2-6.0: type help(sm) for summary information
## Loading required package: lattice
## Loading required package: KernSmooth
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## Loading required package: randomForest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: spam
## Spam version 2.11-1 (2025-01-20) is loaded.
## Type 'help( Spam)' or 'demo( spam)' for a short introduction 
## and overview of this package.
## Help for individual functions is also obtained by adding the
## suffix '.spam' to the function name, e.g. 'help( chol.spam)'.
## 
## Attaching package: 'spam'
## The following objects are masked from 'package:base':
## 
##     backsolve, forwardsolve
## Loading required package: viridisLite
## 
## Try help(fields) to get started.
## locfit 1.5-9.11   2025-01-27
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:MPV':
## 
##     cement
## The following object is masked from 'package:sm':
## 
##     muscle
## Loading required package: stats4
## 
## Attaching package: 'stats4'
## The following object is masked from 'package:spam':
## 
##     mle
## Loading required package: splines
## Loading required package: boot
## 
## Attaching package: 'boot'
## The following objects are masked from 'package:VGAM':
## 
##     logit, simplex
## The following object is masked from 'package:MPV':
## 
##     motor
## The following object is masked from 'package:lattice':
## 
##     melanoma
## The following object is masked from 'package:sm':
## 
##     dogs
## Loading required package: CircStats
## 
## Attaching package: 'CircStats'
## The following objects are masked from 'package:VGAM':
## 
##     dcard, rcard
## Loading required package: dtw
## Loading required package: proxy
## 
## Attaching package: 'proxy'
## The following object is masked from 'package:spam':
## 
##     as.matrix
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
## Loaded dtw v1.23-1. See ?dtw for help, citation("dtw") for use in publication.
## 
## Attaching package: 'verification'
## The following object is masked from 'package:VGAM':
## 
##     exponential
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
## --------------------------------------------------------------
##  Analysis of Geostatistical Data
##  For an Introduction to geoR go to http://www.leg.ufpr.br/geoR
##  geoR version 1.9-4 (built on 2024-02-14) is now loaded
## --------------------------------------------------------------
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:maps':
## 
##     map
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:fields':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## 
## Attaching package: 'cluster'
## The following object is masked from 'package:maps':
## 
##     votes.repub
## 
## Attaching package: 'kohonen'
## The following object is masked from 'package:mclust':
## 
##     map
## The following object is masked from 'package:maps':
## 
##     map
# Color palette
myPalette <- colorRampPalette(rev(brewer.pal(9, "RdBu")), space="Lab")
Read data
nrows=68
ncols=65

ntime <- 67   #Jun-Sep 1950 - 2016

nyrs <- ntime 

nglobe <- nrows*ncols
N <- nrows*ncols


### Lat - Long grid..
#ygrid=seq(6.5,38.5,by=0.25)
ygrid=seq(6.5,38.5,by=0.5)
ny=length(ygrid)

#xgrid=seq(66.5,100,by=0.25)
xgrid=seq(66.5,100,by=0.5)
nx=length(xgrid)

xygrid=matrix(0,nrow=nx*ny,ncol=2)

i=0
for(iy in 1:ny){
for(ix in 1:nx){
i=i+1
xygrid[i,1]=ygrid[iy]
xygrid[i,2]=xgrid[ix]
}

}

# Read India Gridded Rainfall data..

data=readBin("data/India-Rain-JJAS-05deg-1950-2016.r4",what="numeric", n=( nrows * ncols * ntime), size=4,endian="swap")

#data=readBin("data.r4",what="numeric", n=( nrows * ncols * ntime), size=4,endian="swap")

data <- array(data = data, dim=c( nrows, ncols, ntime ) )

data1=data[,,1]
    
# the lat -long data grid..
index=1:(nx*ny)

## pull only data and the locations with non-missing data
index1=index[data1 != "NaN"]    # only non-missing data.
xygrid1=xygrid[index1,]

nsites=length(index1)

data2=data1[index1]

### Rain data matrix - rows are seasonal (i.e. one value per year)
## and columns are locations
raindata=matrix(NA,nrow=nyrs, ncol=nsites)

for(i in 1:nyrs){
data1=data[,,i]
index1=index[ data1 != "NaN"]
data2=data1[index1]
raindata[i,]=data2
}


index = 1:dim(raindata)[2]

xx = apply(raindata,2,mean)
index2 = index1[xx > 0]
index3 = index[xx > 0]

xygrid1=xygrid[index2,]
rainavg = raindata[,index3]

indexgrid = index2
rm("data")  #remove the object data to clear up space

seasonal_avg <- colMeans(rainavg)

X1 <- cbind(xygrid1, seasonal_avg)
colnames(X1) <- c("Latitude", "Longitude", "SeasonalAvg")

distance_data <- read.table("data/Rajgrid-lon-lat-elev-dista-distb.txt", header=FALSE)
names(distance_data) <- c("Longitude", "Latitude", "Elevation", "Dist_ArabianSea", "Dist_BayOfBengal")

#### Merge the two datasets
X <- merge(X1, distance_data, by=c("Longitude", "Latitude"))
X <- X[, c("Longitude", "Latitude", "Elevation", "Dist_ArabianSea", "Dist_BayOfBengal", "SeasonalAvg")]
prec_obs <- X$SeasonalAvg  # Extract precipitation column


precip <- X[,1:3]
names(precip) = c("Long","Lat","Elev")
lon = precip[,1]
lat = precip[,2]
precip$Prec = prec_obs
precip_scale = scale(precip)
(i)Cluster the seasonal average precipitation
Niter=50
WSS = matrix(nrow=Niter, ncol=10)
for (j in 1:Niter) {
  WSS[j,1]= (nrow(precip_scale)-1)*sum(apply(precip_scale,2,var))#with elev
  for (i in 2:10) {
    WSS[j,i] = sum(kmeans(precip_scale, centers=i)$withinss)
  }}
wss = apply(WSS, 2, FUN = mean)

g1 = ggplot() +
  geom_line(mapping = aes(1:10, wss)) +
  geom_point(mapping = aes(1:10, wss), shape = 21, size = 3, color = "gray30", fill ="cadetblue1")+
  labs(title = "WSS vs. # of Clusters",x = "Number of Clusters",y = "WSS (Within-Group Sum of Squares)")+
  theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))+
  scale_x_continuous(breaks = 1:10)
show(g1)

(a) Identify the best number of clusters Kbest
slope=wss[2:length(wss)]-wss[1:(length(wss)-1)]
kbest = which(slope== max(slope))     # optimal number of clusters
print(sprintf("The optimal number of clusters is %s", kbest))
## [1] "The optimal number of clusters is 8"
(b) Cluster the data into Kbest clusters and display them spatially
fit <- kmeans(precip_scale, kbest)
# get cluster means
means = aggregate(precip,by=list(fit$cluster),FUN=mean)
# append cluster assignment
Cluster=factor(fit$cluster)
data_1 = data.frame(precip,Cluster) 


# Plot Precipitation Clusters Over India
n_clusters <- length(unique(data_1$Cluster))
cluster_colors <- rainbow(n_clusters)
cluster_numeric <- as.numeric(data_1$Cluster)

# Plot clusters over India
plot(data_1$Long, data_1$Lat,
     col = cluster_colors[cluster_numeric],
     pch = 21, bg = cluster_colors[cluster_numeric],
     xlim = c(65, 100), ylim = c(5, 40),
     xlab = "Longitude", ylab = "Latitude",
     main = "Clustering of Seasonal Average Precipitation over India")
maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5)
grid()

points(means$Long, means$Lat,
       pch = 2,                     # triangle point
       cex = 1,                   # size
       lwd = 2,                     # stroke thickness
       col = "black")              # outline color


text(means$Long, means$Lat, labels = means$Group.1,
     pos = 3, cex = 0.8)

legend("topright", legend = paste("Cluster", 1:n_clusters),
       fill = cluster_colors[1:n_clusters], border = "black",
       title = "Cluster", bg = "white", box.lwd = 0.5)

(ii) Repeat with a different clustering method - K-medoid clustering

(a)Identify the best number of clusters Kbest
fit <- pamk(precip_scale, krange = 2:10, scaling = F)
print(sprintf("The optimal number of clusters is %s",fit$nc))
## [1] "The optimal number of clusters is 2"
(b) Cluster the data into Kbest clusters and display them spatially
# get cluster means
means = aggregate(precip, by = list(fit$pamobject$cluster), FUN = mean)
# append cluster assignment 
Cluster=factor(fit$pamobject$clustering)
data_2 = data.frame(precip,Cluster) 

# Plot Precipitation Clusters Over India
n_clusters <- length(unique(data_2$Cluster))
cluster_colors <- rainbow(n_clusters)
cluster_numeric <- as.numeric(data_2$Cluster)

# Plot clusters over India
plot(data_2$Long, data_2$Lat,
     col = cluster_colors[cluster_numeric],
     pch = 21, bg = cluster_colors[cluster_numeric],
     xlim = c(65, 100), ylim = c(5, 40),
     xlab = "Longitude", ylab = "Latitude",
     main = "Clustering of Seasonal Average Precipitation over India")
maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5)
grid()

points(means$Long, means$Lat,
       pch = 2,                     # triangle point
       cex = 1,                   # size
       lwd = 2,                     # stroke thickness
       col = "black")              # outline color

text(means$Long, means$Lat, labels = means$Group.1,
     pos = 3, cex = 0.8)

legend("topright", legend = paste("Cluster", 1:n_clusters),
       fill = cluster_colors[1:n_clusters], border = "black",
       title = "Cluster", bg = "white", box.lwd = 0.5)

(ii) Repeat with a different clustering method - Hierarchical Clustering

(a)Identify the best number of clusters Kbest
# Compute distance matrix
dist_matrix <- dist(precip_scale)

# Perform hierarchical clustering (Ward's method)
hc <- hclust(dist_matrix, method = "ward.D2")

# Find best number of clusters using silhouette width
sil_widths <- numeric(9)
for (k in 2:10) {
  clusters <- cutree(hc, k)
  sil <- silhouette(clusters, dist_matrix)
  sil_widths[k - 1] <- mean(sil[, 3])
}

hc_kbest <- which.max(sil_widths) + 1
print(sprintf("Optimal number of clusters for hierarchical clustering is %s", hc_kbest))
## [1] "Optimal number of clusters for hierarchical clustering is 2"
# Cut dendrogram into kbest clusters
hc_clusters <- cutree(hc, k = hc_kbest)

# Get cluster means
means = aggregate(precip, by = list(hc_clusters), FUN = mean)

# Append cluster assignment
Cluster = factor(hc_clusters)
data_3 = data.frame(precip, Cluster)

# Plot Precipitation Clusters Over India
n_clusters <- length(unique(data_3$Cluster))
cluster_colors <- rainbow(n_clusters)
cluster_numeric <- as.numeric(data_3$Cluster)

# Plot clusters over India
plot(data_3$Long, data_3$Lat,
     col = cluster_colors[cluster_numeric],
     pch = 21, bg = cluster_colors[cluster_numeric],
     xlim = c(65, 100), ylim = c(5, 40),
     xlab = "Longitude", ylab = "Latitude",
     main = "Hierarchical Clustering of Seasonal Average Precipitation over India")

# Add world map
maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5)

# Add grid
grid()

# Add centroids
points(means$Long, means$Lat,
       pch = 2, cex = 1, lwd = 2, col = "black")

# Add labels for each centroid
text(means$Long, means$Lat, labels = means$Group.1,
     pos = 3, cex = 0.8)

# Add legend
legend("topright", legend = paste("Cluster", 1:n_clusters),
       fill = cluster_colors[1:n_clusters], border = "black",
       title = "Cluster", bg = "white", box.lwd = 0.5)

(iii) Cluster the same using latitude, longitude, elevation, distance to Bay of Bengal and distance to Arabian Sea.

(1)k-means

## Cluster the seasonal average precipitation
precip_d <- X[,1:5]
names(precip_d) = c("Long", "Lat", "Elev", "DistArabian", "DistBay")
lon = precip_d[,1]
lat = precip_d[,2]
precip_d$Prec = prec_obs
precip_d_scale = scale(precip_d)


Niter=50
WSS = matrix(nrow=Niter, ncol=10)
for (j in 1:Niter) {
  WSS[j,1]= (nrow(precip_d_scale)-1)*sum(apply(precip_d_scale,2,var))#with elev,distA,distB
  for (i in 2:10) {
    WSS[j,i] = sum(kmeans(precip_d_scale, centers=i)$withinss)
  }}
wss = apply(WSS, 2, FUN = mean)

g1 = ggplot() +
  geom_line(mapping = aes(1:10, wss)) +
  geom_point(mapping = aes(1:10, wss), shape = 21, size = 3, color = "gray30", fill ="cadetblue1")+
  labs(title = "WSS vs. # of Clusters",x = "Number of Clusters",y = "WSS (Within-Group Sum of Squares)")+
  theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))+
  scale_x_continuous(breaks = 1:10)
show(g1)

(1a) Identify the best number of clusters Kbest
slope=wss[2:length(wss)]-wss[1:(length(wss)-1)]
kbest = which(slope== max(slope))     # optimal number of clusters
print(sprintf("The optimal number of clusters is %s", kbest))
## [1] "The optimal number of clusters is 9"
(b) Cluster the data into Kbest clusters and display them spatially
fit <- kmeans(precip_d_scale, kbest)
# get cluster means
means = aggregate(precip_d,by=list(fit$cluster),FUN=mean)
# append cluster assignment
Cluster=factor(fit$cluster)
data_1 = data.frame(precip_d,Cluster) 


# Plot Precipitation Clusters Over India
n_clusters <- length(unique(data_1$Cluster))
cluster_colors <- rainbow(n_clusters)
cluster_numeric <- as.numeric(data_1$Cluster)

# Plot clusters over India
plot(data_1$Long, data_1$Lat,
     col = cluster_colors[cluster_numeric],
     pch = 21, bg = cluster_colors[cluster_numeric],
     xlim = c(65, 100), ylim = c(5, 40),
     xlab = "Longitude", ylab = "Latitude",
     main = "Clustering of Seasonal Average Precipitation over India")
maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5)
grid()

points(means$Long, means$Lat,
       pch = 2,                     # triangle point
       cex = 1,                   # size
       lwd = 2,                     # stroke thickness
       col = "black")              # outline color


text(means$Long, means$Lat, labels = means$Group.1,
     pos = 3, cex = 0.8)

legend("topright", legend = paste("Cluster", 1:n_clusters),
       fill = cluster_colors[1:n_clusters], border = "black",
       title = "Cluster", bg = "white", box.lwd = 0.5)

(2) K-medoid
(2a) Identify the best number of clusters Kbest
fit <- pamk(precip_d_scale, krange = 2:10, scaling = F)
print(sprintf("The optimal number of clusters is %s",fit$nc))
## [1] "The optimal number of clusters is 4"
(2b) Cluster the data into Kbest clusters and display them spatially
# get cluster means
means = aggregate(precip_d, by = list(fit$pamobject$cluster), FUN = mean)
# append cluster assignment 
Cluster=factor(fit$pamobject$clustering)
data_2 = data.frame(precip_d,Cluster) 

# Plot Precipitation Clusters Over India
n_clusters <- length(unique(data_2$Cluster))
cluster_colors <- rainbow(n_clusters)
cluster_numeric <- as.numeric(data_2$Cluster)

# Plot clusters over India
plot(data_2$Long, data_2$Lat,
     col = cluster_colors[cluster_numeric],
     pch = 21, bg = cluster_colors[cluster_numeric],
     xlim = c(65, 100), ylim = c(5, 40),
     xlab = "Longitude", ylab = "Latitude",
     main = "Clustering of Seasonal Average Precipitation over India")
maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5)
grid()

points(means$Long, means$Lat,
       pch = 2,                     # triangle point
       cex = 1,                   # size
       lwd = 2,                     # stroke thickness
       col = "black")              # outline color

text(means$Long, means$Lat, labels = means$Group.1,
     pos = 3, cex = 0.8)

legend("topright", legend = paste("Cluster", 1:n_clusters),
       fill = cluster_colors[1:n_clusters], border = "black",
       title = "Cluster", bg = "white", box.lwd = 0.5)

(3) Hierarchical
(3a) Identify the best number of clusters Kbest
# Compute distance matrix
dist_matrix_d <- dist(precip_d_scale)

# Perform hierarchical clustering (Ward's method)
hc <- hclust(dist_matrix_d, method = "ward.D2")

# Find best number of clusters using silhouette width
sil_widths <- numeric(9)
for (k in 2:10) {
  clusters <- cutree(hc, k)
  sil <- silhouette(clusters, dist_matrix_d)
  sil_widths[k - 1] <- mean(sil[, 3])
}

hc_kbest_d <- which.max(sil_widths) + 1
print(sprintf("Optimal number of clusters for hierarchical clustering is %s", hc_kbest_d))
## [1] "Optimal number of clusters for hierarchical clustering is 3"
(2b) Cluster the data into Kbest clusters and display them spatially
# Cut dendrogram into kbest clusters
hc_clusters <- cutree(hc, k = hc_kbest_d)

# Get cluster means
means = aggregate(precip_d, by = list(hc_clusters), FUN = mean)

# Append cluster assignment
Cluster = factor(hc_clusters)
data_3 = data.frame(precip_d, Cluster)

# Plot Precipitation Clusters Over India
n_clusters <- length(unique(data_3$Cluster))
cluster_colors <- rainbow(n_clusters)
cluster_numeric <- as.numeric(data_3$Cluster)

# Plot clusters over India
plot(data_3$Long, data_3$Lat,
     col = cluster_colors[cluster_numeric],
     pch = 21, bg = cluster_colors[cluster_numeric],
     xlim = c(65, 100), ylim = c(5, 40),
     xlab = "Longitude", ylab = "Latitude",
     main = "Hierarchical Clustering of Seasonal Average Precipitation over India")

# Add world map
maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5)

# Add grid
grid()

# Add centroids
points(means$Long, means$Lat,
       pch = 2, cex = 1, lwd = 2, col = "black")

# Add labels for each centroid
text(means$Long, means$Lat, labels = means$Group.1,
     pos = 3, cex = 0.8)

# Add legend
legend("topright", legend = paste("Cluster", 1:n_clusters),
       fill = cluster_colors[1:n_clusters], border = "black",
       title = "Cluster", bg = "white", box.lwd = 0.5)

Clustering Summary and Comparison

(i)Clustering based on Latitude, Longitude, Elevation Kmeans: (Optimal Clusters:9) Clusters form compact spatial groupings K-medoids: (Optimal Clusters:2) A coarse split, possibly separating east-west or coastal-inland. Hierarchical: (Optimal Clusters:2) Similar to K-medoids, divides the region into two broad spatial regimes.

K-means captured more nuanced structure, while K-medoids and Hierarchical produced broader separations. This suggests that elevation and spatial clustering benefit from K-means’ ability to capture finer granularity.

(ii)Clustering with Latitude, Longitude, Elevation and distance to Bay of Bengal and distance to Arabian Sea K-means: (Optimal Clusters:9) Similar structure as (i), but with clusters slightly aligned by coastal distance. K-medoids: (Optimal Clusters:4) More refined than (i); coastal proximity played a clearer role. Hierarchical: (Optimal Clusters:3) Broader separation, clearer separation of coastal vs inland clusters.

Adding distance to coasts improved the distinction between coastal and inland regions, which matters for rainfall patterns driven by monsoonal flows.

Conclusions:

K-means consistently finds more detailed structure, which is good for nuanced regional classification. K-medoids and Hierarchical methods find more coarse divisions.

Including distance to Arabian Sea and Bay of Bengal enhanced clustering, capturing the influence of coastal proximity on precipitation patterns.