--- title: "hw2_prob9" author: "Haiying Peng" date: "2025-04-04" output: html_document: default pdf_document: default --- #### Cluster Analysis 9. (i)Cluster the seasonal average precipitation using latitude, longitude and elevation (over average value at each location) The clustering involves: (a) identify the best number of clusters, say, Kbest Select a desired number of clusters, say, j; cluster the data and computer the WSS; repeat for j=1,10; plot j versus WSS; and select the best number of clusters Kbest, where the WSS starts to saturate. (b) cluster the data into Kbest clusters and display them spatially. (ii) Also employ any other clustering method – K-medoid clustering or Hierarchical Clustering. (iii) Cluster the same using latitude, longitude, elevation, distance to Bay of Bengal and distance to Arabian Sea. Compare and summarize your findings. ```{r,warning=FALSE} #### Clear Memory and Set Options rm(list=ls()) options(warn=-1) save <- FALSE # Change to TRUE if you want to save the plot # Set working directory (Modify if necessary) script_dir <- dirname(rstudioapi::getActiveDocumentContext()$path) setwd(script_dir) library(fpc) source("HW2_Library.R") # Color palette myPalette <- colorRampPalette(rev(brewer.pal(9, "RdBu")), space="Lab") ``` ##### Read data ```{r} nrows=68 ncols=65 ntime <- 67 #Jun-Sep 1950 - 2016 nyrs <- ntime nglobe <- nrows*ncols N <- nrows*ncols ### Lat - Long grid.. #ygrid=seq(6.5,38.5,by=0.25) ygrid=seq(6.5,38.5,by=0.5) ny=length(ygrid) #xgrid=seq(66.5,100,by=0.25) xgrid=seq(66.5,100,by=0.5) nx=length(xgrid) xygrid=matrix(0,nrow=nx*ny,ncol=2) i=0 for(iy in 1:ny){ for(ix in 1:nx){ i=i+1 xygrid[i,1]=ygrid[iy] xygrid[i,2]=xgrid[ix] } } # Read India Gridded Rainfall data.. data=readBin("data/India-Rain-JJAS-05deg-1950-2016.r4",what="numeric", n=( nrows * ncols * ntime), size=4,endian="swap") #data=readBin("data.r4",what="numeric", n=( nrows * ncols * ntime), size=4,endian="swap") data <- array(data = data, dim=c( nrows, ncols, ntime ) ) data1=data[,,1] # the lat -long data grid.. index=1:(nx*ny) ## pull only data and the locations with non-missing data index1=index[data1 != "NaN"] # only non-missing data. xygrid1=xygrid[index1,] nsites=length(index1) data2=data1[index1] ### Rain data matrix - rows are seasonal (i.e. one value per year) ## and columns are locations raindata=matrix(NA,nrow=nyrs, ncol=nsites) for(i in 1:nyrs){ data1=data[,,i] index1=index[ data1 != "NaN"] data2=data1[index1] raindata[i,]=data2 } index = 1:dim(raindata)[2] xx = apply(raindata,2,mean) index2 = index1[xx > 0] index3 = index[xx > 0] xygrid1=xygrid[index2,] rainavg = raindata[,index3] indexgrid = index2 rm("data") #remove the object data to clear up space seasonal_avg <- colMeans(rainavg) X1 <- cbind(xygrid1, seasonal_avg) colnames(X1) <- c("Latitude", "Longitude", "SeasonalAvg") distance_data <- read.table("data/Rajgrid-lon-lat-elev-dista-distb.txt", header=FALSE) names(distance_data) <- c("Longitude", "Latitude", "Elevation", "Dist_ArabianSea", "Dist_BayOfBengal") #### Merge the two datasets X <- merge(X1, distance_data, by=c("Longitude", "Latitude")) X <- X[, c("Longitude", "Latitude", "Elevation", "Dist_ArabianSea", "Dist_BayOfBengal", "SeasonalAvg")] prec_obs <- X$SeasonalAvg # Extract precipitation column precip <- X[,1:3] names(precip) = c("Long","Lat","Elev") lon = precip[,1] lat = precip[,2] precip$Prec = prec_obs precip_scale = scale(precip) ``` ##### (i)Cluster the seasonal average precipitation ```{r} Niter=50 WSS = matrix(nrow=Niter, ncol=10) for (j in 1:Niter) { WSS[j,1]= (nrow(precip_scale)-1)*sum(apply(precip_scale,2,var))#with elev for (i in 2:10) { WSS[j,i] = sum(kmeans(precip_scale, centers=i)$withinss) }} wss = apply(WSS, 2, FUN = mean) g1 = ggplot() + geom_line(mapping = aes(1:10, wss)) + geom_point(mapping = aes(1:10, wss), shape = 21, size = 3, color = "gray30", fill ="cadetblue1")+ labs(title = "WSS vs. # of Clusters",x = "Number of Clusters",y = "WSS (Within-Group Sum of Squares)")+ theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))+ scale_x_continuous(breaks = 1:10) show(g1) ``` ##### (a) Identify the best number of clusters Kbest ```{r} slope=wss[2:length(wss)]-wss[1:(length(wss)-1)] kbest = which(slope== max(slope)) # optimal number of clusters print(sprintf("The optimal number of clusters is %s", kbest)) ``` ##### (b) Cluster the data into Kbest clusters and display them spatially ```{r} fit <- kmeans(precip_scale, kbest) # get cluster means means = aggregate(precip,by=list(fit$cluster),FUN=mean) # append cluster assignment Cluster=factor(fit$cluster) data_1 = data.frame(precip,Cluster) # Plot Precipitation Clusters Over India n_clusters <- length(unique(data_1$Cluster)) cluster_colors <- rainbow(n_clusters) cluster_numeric <- as.numeric(data_1$Cluster) # Plot clusters over India plot(data_1$Long, data_1$Lat, col = cluster_colors[cluster_numeric], pch = 21, bg = cluster_colors[cluster_numeric], xlim = c(65, 100), ylim = c(5, 40), xlab = "Longitude", ylab = "Latitude", main = "Clustering of Seasonal Average Precipitation over India") maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5) grid() points(means$Long, means$Lat, pch = 2, # triangle point cex = 1, # size lwd = 2, # stroke thickness col = "black") # outline color text(means$Long, means$Lat, labels = means$Group.1, pos = 3, cex = 0.8) legend("topright", legend = paste("Cluster", 1:n_clusters), fill = cluster_colors[1:n_clusters], border = "black", title = "Cluster", bg = "white", box.lwd = 0.5) ``` #### (ii) Repeat with a different clustering method - K-medoid clustering ##### (a)Identify the best number of clusters Kbest ```{r} fit <- pamk(precip_scale, krange = 2:10, scaling = F) print(sprintf("The optimal number of clusters is %s",fit$nc)) ``` ##### (b) Cluster the data into Kbest clusters and display them spatially ```{r} # get cluster means means = aggregate(precip, by = list(fit$pamobject$cluster), FUN = mean) # append cluster assignment Cluster=factor(fit$pamobject$clustering) data_2 = data.frame(precip,Cluster) # Plot Precipitation Clusters Over India n_clusters <- length(unique(data_2$Cluster)) cluster_colors <- rainbow(n_clusters) cluster_numeric <- as.numeric(data_2$Cluster) # Plot clusters over India plot(data_2$Long, data_2$Lat, col = cluster_colors[cluster_numeric], pch = 21, bg = cluster_colors[cluster_numeric], xlim = c(65, 100), ylim = c(5, 40), xlab = "Longitude", ylab = "Latitude", main = "Clustering of Seasonal Average Precipitation over India") maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5) grid() points(means$Long, means$Lat, pch = 2, # triangle point cex = 1, # size lwd = 2, # stroke thickness col = "black") # outline color text(means$Long, means$Lat, labels = means$Group.1, pos = 3, cex = 0.8) legend("topright", legend = paste("Cluster", 1:n_clusters), fill = cluster_colors[1:n_clusters], border = "black", title = "Cluster", bg = "white", box.lwd = 0.5) ``` #### (ii) Repeat with a different clustering method - Hierarchical Clustering ##### (a)Identify the best number of clusters Kbest ```{r} # Compute distance matrix dist_matrix <- dist(precip_scale) # Perform hierarchical clustering (Ward's method) hc <- hclust(dist_matrix, method = "ward.D2") # Find best number of clusters using silhouette width sil_widths <- numeric(9) for (k in 2:10) { clusters <- cutree(hc, k) sil <- silhouette(clusters, dist_matrix) sil_widths[k - 1] <- mean(sil[, 3]) } hc_kbest <- which.max(sil_widths) + 1 print(sprintf("Optimal number of clusters for hierarchical clustering is %s", hc_kbest)) ``` ```{r} # Cut dendrogram into kbest clusters hc_clusters <- cutree(hc, k = hc_kbest) # Get cluster means means = aggregate(precip, by = list(hc_clusters), FUN = mean) # Append cluster assignment Cluster = factor(hc_clusters) data_3 = data.frame(precip, Cluster) # Plot Precipitation Clusters Over India n_clusters <- length(unique(data_3$Cluster)) cluster_colors <- rainbow(n_clusters) cluster_numeric <- as.numeric(data_3$Cluster) # Plot clusters over India plot(data_3$Long, data_3$Lat, col = cluster_colors[cluster_numeric], pch = 21, bg = cluster_colors[cluster_numeric], xlim = c(65, 100), ylim = c(5, 40), xlab = "Longitude", ylab = "Latitude", main = "Hierarchical Clustering of Seasonal Average Precipitation over India") # Add world map maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5) # Add grid grid() # Add centroids points(means$Long, means$Lat, pch = 2, cex = 1, lwd = 2, col = "black") # Add labels for each centroid text(means$Long, means$Lat, labels = means$Group.1, pos = 3, cex = 0.8) # Add legend legend("topright", legend = paste("Cluster", 1:n_clusters), fill = cluster_colors[1:n_clusters], border = "black", title = "Cluster", bg = "white", box.lwd = 0.5) ``` #### (iii) Cluster the same using latitude, longitude, elevation, distance to Bay of Bengal and distance to Arabian Sea. #### (1)k-means ```{r} ## Cluster the seasonal average precipitation precip_d <- X[,1:5] names(precip_d) = c("Long", "Lat", "Elev", "DistArabian", "DistBay") lon = precip_d[,1] lat = precip_d[,2] precip_d$Prec = prec_obs precip_d_scale = scale(precip_d) Niter=50 WSS = matrix(nrow=Niter, ncol=10) for (j in 1:Niter) { WSS[j,1]= (nrow(precip_d_scale)-1)*sum(apply(precip_d_scale,2,var))#with elev,distA,distB for (i in 2:10) { WSS[j,i] = sum(kmeans(precip_d_scale, centers=i)$withinss) }} wss = apply(WSS, 2, FUN = mean) g1 = ggplot() + geom_line(mapping = aes(1:10, wss)) + geom_point(mapping = aes(1:10, wss), shape = 21, size = 3, color = "gray30", fill ="cadetblue1")+ labs(title = "WSS vs. # of Clusters",x = "Number of Clusters",y = "WSS (Within-Group Sum of Squares)")+ theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))+ scale_x_continuous(breaks = 1:10) show(g1) ``` ##### (1a) Identify the best number of clusters Kbest ```{r} slope=wss[2:length(wss)]-wss[1:(length(wss)-1)] kbest = which(slope== max(slope)) # optimal number of clusters print(sprintf("The optimal number of clusters is %s", kbest)) ``` ##### (b) Cluster the data into Kbest clusters and display them spatially ```{r} fit <- kmeans(precip_d_scale, kbest) # get cluster means means = aggregate(precip_d,by=list(fit$cluster),FUN=mean) # append cluster assignment Cluster=factor(fit$cluster) data_1 = data.frame(precip_d,Cluster) # Plot Precipitation Clusters Over India n_clusters <- length(unique(data_1$Cluster)) cluster_colors <- rainbow(n_clusters) cluster_numeric <- as.numeric(data_1$Cluster) # Plot clusters over India plot(data_1$Long, data_1$Lat, col = cluster_colors[cluster_numeric], pch = 21, bg = cluster_colors[cluster_numeric], xlim = c(65, 100), ylim = c(5, 40), xlab = "Longitude", ylab = "Latitude", main = "Clustering of Seasonal Average Precipitation over India") maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5) grid() points(means$Long, means$Lat, pch = 2, # triangle point cex = 1, # size lwd = 2, # stroke thickness col = "black") # outline color text(means$Long, means$Lat, labels = means$Group.1, pos = 3, cex = 0.8) legend("topright", legend = paste("Cluster", 1:n_clusters), fill = cluster_colors[1:n_clusters], border = "black", title = "Cluster", bg = "white", box.lwd = 0.5) ``` ##### (2) K-medoid ##### (2a) Identify the best number of clusters Kbest ```{r} fit <- pamk(precip_d_scale, krange = 2:10, scaling = F) print(sprintf("The optimal number of clusters is %s",fit$nc)) ``` ##### (2b) Cluster the data into Kbest clusters and display them spatially ```{r} # get cluster means means = aggregate(precip_d, by = list(fit$pamobject$cluster), FUN = mean) # append cluster assignment Cluster=factor(fit$pamobject$clustering) data_2 = data.frame(precip_d,Cluster) # Plot Precipitation Clusters Over India n_clusters <- length(unique(data_2$Cluster)) cluster_colors <- rainbow(n_clusters) cluster_numeric <- as.numeric(data_2$Cluster) # Plot clusters over India plot(data_2$Long, data_2$Lat, col = cluster_colors[cluster_numeric], pch = 21, bg = cluster_colors[cluster_numeric], xlim = c(65, 100), ylim = c(5, 40), xlab = "Longitude", ylab = "Latitude", main = "Clustering of Seasonal Average Precipitation over India") maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5) grid() points(means$Long, means$Lat, pch = 2, # triangle point cex = 1, # size lwd = 2, # stroke thickness col = "black") # outline color text(means$Long, means$Lat, labels = means$Group.1, pos = 3, cex = 0.8) legend("topright", legend = paste("Cluster", 1:n_clusters), fill = cluster_colors[1:n_clusters], border = "black", title = "Cluster", bg = "white", box.lwd = 0.5) ``` ##### (3) Hierarchical ##### (3a) Identify the best number of clusters Kbest ```{r} # Compute distance matrix dist_matrix_d <- dist(precip_d_scale) # Perform hierarchical clustering (Ward's method) hc <- hclust(dist_matrix_d, method = "ward.D2") # Find best number of clusters using silhouette width sil_widths <- numeric(9) for (k in 2:10) { clusters <- cutree(hc, k) sil <- silhouette(clusters, dist_matrix_d) sil_widths[k - 1] <- mean(sil[, 3]) } hc_kbest_d <- which.max(sil_widths) + 1 print(sprintf("Optimal number of clusters for hierarchical clustering is %s", hc_kbest_d)) ``` ##### (2b) Cluster the data into Kbest clusters and display them spatially ```{r} # Cut dendrogram into kbest clusters hc_clusters <- cutree(hc, k = hc_kbest_d) # Get cluster means means = aggregate(precip_d, by = list(hc_clusters), FUN = mean) # Append cluster assignment Cluster = factor(hc_clusters) data_3 = data.frame(precip_d, Cluster) # Plot Precipitation Clusters Over India n_clusters <- length(unique(data_3$Cluster)) cluster_colors <- rainbow(n_clusters) cluster_numeric <- as.numeric(data_3$Cluster) # Plot clusters over India plot(data_3$Long, data_3$Lat, col = cluster_colors[cluster_numeric], pch = 21, bg = cluster_colors[cluster_numeric], xlim = c(65, 100), ylim = c(5, 40), xlab = "Longitude", ylab = "Latitude", main = "Hierarchical Clustering of Seasonal Average Precipitation over India") # Add world map maps::map("world", add = TRUE, wrap = c(0, 360), lwd = 1.5) # Add grid grid() # Add centroids points(means$Long, means$Lat, pch = 2, cex = 1, lwd = 2, col = "black") # Add labels for each centroid text(means$Long, means$Lat, labels = means$Group.1, pos = 3, cex = 0.8) # Add legend legend("topright", legend = paste("Cluster", 1:n_clusters), fill = cluster_colors[1:n_clusters], border = "black", title = "Cluster", bg = "white", box.lwd = 0.5) ``` #### Clustering Summary and Comparison (i)Clustering based on Latitude, Longitude, Elevation Kmeans: (Optimal Clusters:9) Clusters form compact spatial groupings K-medoids: (Optimal Clusters:2) A coarse split, possibly separating east-west or coastal-inland. Hierarchical: (Optimal Clusters:2) Similar to K-medoids, divides the region into two broad spatial regimes. K-means captured more nuanced structure, while K-medoids and Hierarchical produced broader separations. This suggests that elevation and spatial clustering benefit from K-means' ability to capture finer granularity. (ii)Clustering with Latitude, Longitude, Elevation and distance to Bay of Bengal and distance to Arabian Sea K-means: (Optimal Clusters:9) Similar structure as (i), but with clusters slightly aligned by coastal distance. K-medoids: (Optimal Clusters:4) More refined than (i); coastal proximity played a clearer role. Hierarchical: (Optimal Clusters:3) Broader separation, clearer separation of coastal vs inland clusters. Adding distance to coasts improved the distinction between coastal and inland regions, which matters for rainfall patterns driven by monsoonal flows. ##### Conclusions: K-means consistently finds more detailed structure, which is good for nuanced regional classification. K-medoids and Hierarchical methods find more coarse divisions. Including distance to Arabian Sea and Bay of Bengal enhanced clustering, capturing the influence of coastal proximity on precipitation patterns.