--- title: "hw2_Prob6" author: "Haiying Peng" date: "2025-03-20" output: html_document --- ### PCA + Multinomial Regression 6. Construction safety outcomes data have been compiled from accident reports and described in Tixier et al. (2016). The data, attribute information and relevant R commands can be found at http://civil.colorado.edu/~balajir/CVEN6833/const-data For the safety outcome of body part, with five categories (see Table 2 of Tixier et al., 2016). Twenty leading attributes, all binary (1 or 0 – i.e., present or absent) are provided for each accident record. (a) Perform a PCA on the attribute data – show the Eigen spectrum and the leading four Eigen vectors (b) Fit a best multinomial regression with the principal components as predictors to model the categorical probability of injuries to the five category of body parts and compute the RPSS (c) Repeat (b) for the Injury Severity Tixier et al. (2016) modeled these using Random Forests and Stochastic Boosting. They could not get good skill for Injury Severity. Compare your results with theirs. ```{r} #### Clear Memory and Set Options rm(list=ls()) options(warn=-1) save <- TRUE # Change to TRUE if you want to save the plot # Set working directory (Modify if necessary) script_dir <- dirname(rstudioapi::getActiveDocumentContext()$path) setwd(script_dir) source("HW2_Library.R") ``` #### (a) Perform a PCA on the attribute data - show the Eigen spectrum and the leading four Eigen vectors ```{r} ## Read Data # Retain top 20 variables to.retain = 20 imp.vars.body = as.character(read.csv("./data/gbm.importance.50.body.csv", header = T)[1:to.retain,"var"]) imp.vars.severity = as.character(read.csv("./data/gbm.importance.50.code.csv", header = T)[1:to.retain,"var"]) ## Read Body Part data and select the important variables data = read.csv("./data/data.body.part.csv") colnames(data) <- as.character(colnames(data)) index = which(colnames(data)%in%imp.vars.body) Xpredictors = data[,index] bpart = data[,1] bpart=as.character(bpart) bpart[bpart == "head"]=1 bpart[bpart == "neck"]=2 bpart[bpart == "trunk"]=3 bpart[bpart == "upper extremities"]=4 bpart[bpart == "lower extremities"]=5 ################### ### Perform PCA ### ################### #get variance matrix.. zs=var(Xpredictors) #do an Eigen decomposition.. zsvd=svd(zs) #Principal Components... pcs=t(t(zsvd$u) %*% t(Xpredictors)) # eigenvector * X matrix #Eigen Values.. - fraction variance lambdas=(zsvd$d/sum(zsvd$d)) ## Eigen Spectrum plot(1:20, lambdas[1:20], type="l", xlab="Modes", ylab="Frac. Var. Explained") points(1:20, lambdas[1:20], col="red") print(sprintf("The first 4 PCs explain %s%% of the variance", round(sum(lambdas[1:4])*100))) print(sprintf("The first 6 PCs explain %s%% of the variance", round(sum(lambdas[1:6])*100))) ``` ```{r} ## Eigen Vectors ### plot zsvd$u[,1:4] ### against the variables ### Identify the variables that have a strong influence ### on the injuries. 
#### (b) Fit a best multinomial regression with the principal components as predictors to model the categorical probability of injuries to the five categories of body parts and compute the RPSS

```{r}
## Convert the body-part class labels to a numeric response
bpartbin = as.numeric(bpart)

## Combine the response with the principal components
pcs = cbind(bpartbin, pcs)
pcs = data.frame(pcs)

## Fit the full multinomial regression on all PCs
zz = multinom(bpartbin ~ ., data = pcs)

# BIC penalty k = log(n) for the stepwise selection below
N = log(length(bpartbin))
```

```{r}
## Stepwise selection of the best model (BIC criterion)
zbest = stepAIC(zz, k = N)
print(summary(zbest))
```

#### Compute the RPSS

```{r}
#################### RPSS
ypred = predict(zbest, type = "probs")

# RPSS calculations
acc2 = as.numeric(bpart)
N = length(acc2)

# Identify the category name for each numeric class
acc2_name = ""
for (i in 1:5) {
  ind = which(acc2 == i)
  acc2_name[i] = as.character(data$body.part[ind[1]])
}

# Empirical (climatological) probabilities
# (the verification package gives the same results)
p1 <- length(acc2[acc2 == 1]) / N
p2 <- length(acc2[acc2 == 2]) / N
p3 <- length(acc2[acc2 == 3]) / N
p4 <- length(acc2[acc2 == 4]) / N
p5 <- length(acc2[acc2 == 5]) / N
prob = c(p1, p2, p3, p4, p5)

RPSS = rps(acc2, ypred, baseline = prob)$rpss

prob = as.data.frame(prob)
rownames(prob) = acc2_name
print("Empirical probabilities are:")
print(prob)
```

```{r}
print(sprintf('Body Part Multinomial Regression RPSS: %s', RPSS))
```
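As an optional cross-check (not required by the assignment), the RPSS can also be computed by hand from cumulative probabilities, which makes explicit what `rps()` from the verification package is doing: the ranked probability score accumulates the squared distance between the cumulative forecast and the cumulative (step-function) observation, and the skill score compares this against a climatological baseline. The sketch below assumes `ypred`, `acc2` and `RPSS` from the chunks above are still in memory; the 1/(K-1) normalization is conventional and cancels in the skill score in any case.

```{r}
## Optional manual RPS/RPSS computation (not part of the original workflow)
manual_rps <- function(obs, pred) {
  K <- ncol(pred)
  # Cumulative forecast probabilities across the ordered categories
  cum_pred <- t(apply(pred, 1, cumsum))
  # Cumulative observation: 0 below the observed category, 1 at and above it
  cum_obs <- t(sapply(obs, function(k) as.numeric(seq_len(K) >= k)))
  mean(rowSums((cum_pred - cum_obs)^2)) / (K - 1)
}

rps_model <- manual_rps(acc2, ypred)

# Climatological baseline: the empirical class frequencies issued for every record
clim_prob <- as.numeric(table(factor(acc2, levels = 1:5))) / length(acc2)
clim_pred <- matrix(clim_prob, nrow = length(acc2), ncol = 5, byrow = TRUE)
rps_clim <- manual_rps(acc2, clim_pred)

RPSS_manual <- 1 - rps_model / rps_clim
print(sprintf("Manual RPSS: %s (rps() gave %s)", round(RPSS_manual, 4), round(RPSS, 4)))
```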
#### (c) Repeat (b) for the Injury Severity

```{r}
## Read the Injury Severity data and select the important variables
data1 = read.csv("data/data.worst.case.severity.csv")
index = which(colnames(data1) %in% imp.vars.severity)
Xpredictors = data1[, index]

## Recode the severity labels as numeric classes
sev = data1[, 1]
sev = as.character(sev)
sev[sev == "1st Aid"] = 1
sev[sev == "Medical Case"] = 2
sev[sev == "Lost Work Time"] = 3
sev[sev == "Permanent Disablement"] = 4
sev[sev == "Fatality"] = 5

### Repeat the PCA steps from above

# Covariance matrix of the predictors
zs = var(Xpredictors)

# Eigen decomposition of the covariance matrix
zsvd = svd(zs)

# Principal components: project the data onto the eigenvectors
pcs1 = t(t(zsvd$u) %*% t(Xpredictors))

# Eigenvalues as fraction of variance explained
lambdas = (zsvd$d / sum(zsvd$d))

## Eigen spectrum
plot(1:20, lambdas[1:20], type = "l", xlab = "Modes", ylab = "Frac. Var. Explained")
points(1:20, lambdas[1:20], col = "red")

print(sprintf("The first 4 PCs explain %s%% of the variance", round(sum(lambdas[1:4]) * 100)))
print(sprintf("The first 6 PCs explain %s%% of the variance", round(sum(lambdas[1:6]) * 100)))
```

```{r}
## Eigenvectors
## Plot zsvd$u[, 1:4] against the variables to identify the attributes
## that have a strong influence on the injuries.
tu = expand.grid(Eofs = gl(4, 1, labels = c("EOF no.1", "EOF no.2", "EOF no.3", "EOF no.4")),
                 variable = imp.vars.severity)

# Stack the first four eigenvector loadings, variable by variable,
# to match the row ordering produced by expand.grid()
a = vector(length = 80)
a[1:4] = zsvd$u[1, 1:4]
for (i in 2:20) {
  ind1 = 4 * (i - 1) + 1
  ind = ind1 + 3
  a[ind1:ind] = zsvd$u[i, 1:4]
}
tu$Eof = a

show(ggplot(tu, aes(x = variable, y = Eof, color = Eofs, group = Eofs)) +
       geom_point(size = 3) +
       geom_line() +
       labs(title = "Eigen vectors injury precursors", x = "variable", y = "Influence", color = "") +
       theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5)) +
       theme(axis.text.x = element_text(color = "black", size = 11, angle = 90, vjust = .8, hjust = 0.8)) +
       coord_flip())
```

```{r}
## Fit the multinomial regression
## Convert the severity class labels to a numeric response
sevbin = as.numeric(sev)
pcs1 = cbind(sevbin, pcs1)
pcs1 = data.frame(pcs1)
zz = multinom(sevbin ~ ., data = pcs1)
```

```{r}
## Stepwise selection of the best model (BIC criterion, k = log(n))
N = log(length(sevbin))
zbest = stepAIC(zz, k = N)
print(summary(zbest))
```

```{r}
#################### RPSS
ypred = predict(zbest, type = "probs")

# RPSS calculations
acc2 = as.numeric(sev)
N = length(acc2)

# Identify the category name for each numeric class
acc2_name = ""
for (i in 1:4) {
  ind = which(acc2 == i)
  acc2_name[i] = as.character(data1$worst.case.severity[ind[1]])
}

# Empirical (climatological) probabilities
# (the verification package gives the same results)
p1 <- length(acc2[acc2 == 1]) / N
p2 <- length(acc2[acc2 == 2]) / N
p3 <- length(acc2[acc2 == 3]) / N
p4 <- length(acc2[acc2 == 4]) / N
prob <- c(p1, p2, p3, p4)

RPSS = rps(acc2, ypred, baseline = prob)$rpss

prob = as.data.frame(prob)
rownames(prob) = acc2_name
print("Empirical probabilities are:")
print(prob)
```

```{r}
print(sprintf('Injury Severity Multinomial Regression RPSS: %s', RPSS))
```

##### Compare results with Tixier et al. (2016)

1. Body Part Injury Prediction

   The PCA + multinomial regression model achieved an RPSS of 0.282, higher than the Random Forest result (0.172) reported by Tixier et al. and slightly below their Stochastic Boosting (SGTB) result (0.324). This indicates that the approach is competitive and offers a reasonable compromise between interpretability and predictive accuracy.

2. Injury Severity Prediction

   The PCA + multinomial regression model achieved an RPSS of 0.358, substantially outperforming both Random Forests (−0.1) and SGTB (−0.65) as reported in Tixier et al. (2016). This is a notable improvement, since the ensemble models showed no predictive skill for injury severity. The PCA-based approach appears to capture meaningful structure in the data, perhaps because PCA suppresses noise and concentrates the signal in a few leading modes, whereas Tixier et al. had limited success with this outcome.

Conclusion

While the ensemble methods perform well for predicting the injured body part, the PCA-based regression proved similarly effective and even outperformed them for injury severity. This demonstrates the value of PCA not only for reducing dimensionality but also for improving model performance and interpretability, particularly for outcomes that are otherwise difficult to predict.
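The RPSS values above are computed on the same records used to fit the models. As an optional extension (not required by the assignment), a simple 5-fold cross-validation of the body-part model gives a rough sense of how much of that skill survives out of sample. The sketch below assumes the `pcs` data frame from part (b) is still in memory, uses the full set of principal components in every fold rather than re-running `stepAIC`, and assumes every body-part class appears in every training fold.

```{r}
## Optional 5-fold cross-validated RPSS for the body-part model (not part of the assignment)
set.seed(1)
n <- nrow(pcs)
K.class <- 5
folds <- sample(rep(1:5, length.out = n))

cv.pred <- matrix(NA, nrow = n, ncol = K.class)
for (f in 1:5) {
  # Refit the multinomial regression on the training folds only
  fit <- multinom(bpartbin ~ ., data = pcs[folds != f, ], trace = FALSE)
  # Predict class probabilities for the held-out fold
  cv.pred[folds == f, ] <- predict(fit, newdata = pcs[folds == f, ], type = "probs")
}

# Climatological baseline from the full sample
clim <- as.numeric(table(factor(pcs$bpartbin, levels = 1:K.class))) / n
RPSS.cv <- rps(pcs$bpartbin, cv.pred, baseline = clim)$rpss
print(sprintf("Cross-validated Body Part RPSS: %s", round(RPSS.cv, 3)))
```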