PCA + Multinomial Regression

Construction safety outcome data were compiled from accident reports and are described in Tixier et al. (2016). The data, attribute information, and relevant R commands can be found at http://civil.colorado.edu/~balajir/CVEN6833/const-data. The safety outcome of body part has five categories (see Table 2 of Tixier et al., 2016).

Twenty leading attributes, all binary (1 or 0, i.e., present or absent), are provided for each accident record.

  (a) Perform a PCA on the attribute data; show the Eigen spectrum and the leading four Eigen vectors.

  (b) Fit a best multinomial regression with the principal components as predictors to model the categorical probability of injury to the five body-part categories, and compute the RPSS.

  (c) Repeat (b) for Injury Severity. Tixier et al. (2016) modeled these outcomes using Random Forests and Stochastic Boosting and could not get good skill for Injury Severity. Compare your results with theirs.

  (d) Apply CART and compare the results.

Solution

#### CLEAR MEMORY
rm(list=ls())
#### Prevent Warnings
options(warn=-1)

#### SOURCE LIBRARIES (suppress package loading messages) AND SET WORKING DIRECTORY
mainDir="C:/Users/DELL/Dropbox/Boulder/Courses/Advance data analysis/HW/HW2"
setwd(mainDir)
suppressPackageStartupMessages(source("Lib_HW2.R"))

Read data

## read data
# retain top 20 variables
to.retain=20
imp.vars.body=as.character(read.csv("./data/gbm.importance.50.body.csv",header=TRUE)[1:to.retain,"var"])
imp.vars.severity = as.character(read.csv("./data/gbm.importance.50.code.csv",header=TRUE)[1:to.retain, "var"]) 

## Read Body part data and select the important variables
data = read.csv("./data/data.body.part.csv")
index=which(colnames(data)%in%imp.vars.body)
Xpredictors = data[,index]
bpart=data[,1]

## Read the Injury Severity data and select the important variables
data1 = read.csv("./data/data.severity.csv")

(a) Perform a PCA on the attribute data and plot the Eigen spectrum

###################
### Perform PCA ###
###################

#get variance matrix..
zs=var(Xpredictors)

#do an Eigen decomposition..
zsvd=svd(zs)

#Principal Components: project the data onto the eigenvectors
pcs=t(t(zsvd$u) %*% t(Xpredictors))  # equivalent to Xpredictors %*% zsvd$u

#Eigen Values.. - fraction variance 
lambdas=(zsvd$d/sum(zsvd$d))

#Plot the eigen variance spectrum for the first 20 modes
#Eigen spectrum
ggplot() +
 geom_line(mapping = aes(1:20, lambdas[1:20])) +
 geom_point(mapping = aes(1:20, lambdas[1:20]), shape = 21, size = 3, color = "gray30", fill ="cadetblue1") +
 labs(title = "Eigen Spectrum injury precursors",x = "Modes",y = "Frac. Var. explained")+
  theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))

sprintf("First 3 PCs explained %s percent of the variance",round(sum(lambdas[1:3])*100))
## [1] "First 3 PCs explained 39 percent of the variance"
sprintf("First 4 PCs explained %s percent of the variance",round(sum(lambdas[1:4])*100))
## [1] "First 4 PCs explained 46 percent of the variance"
sprintf("First 6 PCs explained %s percent of the variance",round(sum(lambdas[1:6])*100))
## [1] "First 6 PCs explained 57 percent of the variance"
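As a quick sanity check (not part of the assignment), the covariance-SVD route above can be compared against base R's prcomp on a toy binary matrix: the eigenvalues should equal prcomp's squared standard deviations, the eigenvectors should agree up to sign, and the projection t(t(U) %*% t(X)) is simply X %*% U.

```r
set.seed(1)
# toy binary attribute matrix standing in for Xpredictors (illustrative only)
X <- matrix(rbinom(200 * 6, 1, 0.3), nrow = 200)

zs   <- var(X)   # sample covariance matrix
zsvd <- svd(zs)  # SVD of a symmetric PSD matrix = its eigen-decomposition
pr   <- prcomp(X)

# eigenvalues of the covariance matrix equal prcomp's component variances
stopifnot(all.equal(zsvd$d, unname(pr$sdev^2)))
# eigenvectors agree up to sign
stopifnot(all.equal(abs(zsvd$u), unname(abs(pr$rotation))))
# the projection used above is just X %*% U
stopifnot(all.equal(t(t(zsvd$u) %*% t(X)), X %*% zsvd$u))
```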

Plot the leading four Eigen vectors

#plot the leading four eigen vectors
tu = expand.grid(Eofs   = gl(4, 1, labels = c("EOF no.1","EOF no.2","EOF no.3","EOF no.4")),
                  variable       = imp.vars.body)
# flatten the leading four eigenvector loadings row-wise to match the expand.grid order
a = as.vector(t(zsvd$u[, 1:4]))
tu$Eof= a
ggplot(tu, aes(x = variable, y = Eof, color = Eofs, group = Eofs)) + 
  geom_point(size=3) + geom_line()+
  labs(title = "Eigen vectors injury precursors",x = "variable",y = "Influence",color="")+
  theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))+
  theme(axis.text.x=element_text(color = "black", size=11, angle=90, vjust=.8, hjust=0.8))+
  coord_flip()

(b) Fit a best multinomial regression

### fit Multinomial regression

pcs=cbind(bpart,pcs)
pcs = data.frame(pcs)

N = log(length(bpart))
## or model with all the PCS
zz = multinom(bpart ~ ., data=pcs)
## # weights:  110 (84 variable)
## initial  value 6830.454500 
## iter  10 value 4802.845600
## iter  20 value 4681.045845
## iter  30 value 4667.484616
## iter  40 value 4665.052067
## iter  50 value 4664.133716
## iter  60 value 4663.837152
## iter  70 value 4663.749986
## iter  80 value 4663.668413
## final  value 4663.655477 
## converged
## stepwise selection with a BIC penalty (k = log n) to obtain the best model
## (the selection call is assumed to be MASS::stepAIC, loaded via Lib_HW2.R)
zbest = stepAIC(zz, k = N, trace = 0)
summary(zbest)
## Call:
## multinom(formula = bpart ~ V2 + V3 + V4 + V6 + V7 + V8 + V9 + 
##     V12 + V14 + V15 + V17 + V20 + V21, data = pcs)
## 
## Coefficients:
##   (Intercept)        V2         V3         V4        V6        V7
## 2   0.2795602 -3.909752 12.7226743  0.6675859 2.2823349  1.776654
## 3  -1.8715403 -2.117657  0.7444662  2.2291403 2.7728561 -2.859202
## 4  -0.2512364 -3.945427  7.8531051  1.0534055 0.6796384  1.412624
## 5   1.0090588 -1.829775  4.9939418 -0.6960140 0.6249908  3.050353
##            V8          V9        V12        V14        V15        V17
## 2  1.00568870  0.58910532 -1.1178187 -1.0133073   1.302331 -1.3559690
## 3 -5.06051710  4.54543505  9.5003324  9.2493738 -11.271684 21.3996690
## 4  0.09345112 -0.07983149 -0.2164937 -2.2052052  -1.017061 -0.7406517
## 5 -1.07466342 -0.83864614  0.2290801  0.4808086   1.357942 -0.9804030
##           V20         V21
## 2    1.332043  -1.2893075
## 3 -174.901741 223.4749927
## 4   -4.155083  -0.4797571
## 5    2.700105  -3.2937500
## 
## Std. Errors:
##   (Intercept)        V2        V3        V4        V6        V7        V8
## 2  0.06766681 0.5863416 1.7828655 0.2888557 0.4100698 0.3204268 0.4177351
## 3  0.14450319 0.6873848 1.6222056 0.8975402 1.0366390 1.7731184 1.9411282
## 4  0.07671925 0.6121483 1.8827514 0.4086169 0.4716272 0.3400028 0.4634082
## 5  0.05908701 0.1822436 0.3584732 0.2217303 0.3923408 0.2956986 0.3686182
##          V9       V12       V14       V15       V17       V20       V21
## 2 0.3914876 0.4917368 0.4553335 0.4936806 0.5059824  1.109333  1.303356
## 3 1.8188829 2.3524723 3.2689684 3.4802727 5.9370087 54.749495 68.312069
## 4 0.4305687 0.6170746 0.7693763 0.3503601 0.6167398  3.912979  1.861419
## 5 0.3404493 0.2998880 0.3229168 0.3660987 0.3358988  1.041735  1.135893
## 
## Residual Deviance: 9390.197 
## AIC: 9502.197
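For interpretation: multinom reports each row of coefficients as log-odds of that category relative to the first (reference) category. A minimal base-R sketch of how such a coefficient matrix maps to class probabilities via the softmax (the numbers are illustrative, not the fitted model):

```r
# multinom-style coefficients: rows = classes 2..K, columns = (Intercept, predictors)
multinom_probs <- function(beta, x) {
  eta <- c(0, beta %*% c(1, x))  # reference class has linear predictor 0
  as.numeric(exp(eta) / sum(exp(eta)))
}

beta <- rbind(c(0.28, -3.91),    # illustrative values in the spirit of the summary above
              c(-1.87, -2.12))
p <- multinom_probs(beta, x = 0)
# at x = 0 only the intercepts matter; the probabilities sum to 1
stopifnot(abs(sum(p) - 1) < 1e-12)
```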

Compute the RPSS

#################### RPSS
ypred = predict(zbest, type = "probs")
# RPSS calculations
acc2 = as.numeric(data$body.part)
N = length(acc2)
acc2_name =""
#identify the numeric index
for (i in 1:5) {
  ind=which(acc2==i)
  acc2_name[i]=as.character(data$body.part[ind[1]])
}
### using verification package you get the same results..
# Empirical probabilities
p1 <- length(acc2[acc2 == 1])/N
p2 <- length(acc2[acc2 == 2])/N
p3 <- length(acc2[acc2 == 3])/N
p4 <- length(acc2[acc2 == 4])/N
p5 <- length(acc2[acc2 == 5])/N

prob = c(p1, p2, p3,p4,p5)
RPSS=rps(acc2, ypred, baseline=prob)$rpss
prob=as.data.frame(prob)
rownames(prob)=acc2_name
print("Empirical probabilities are:")
## [1] "Empirical probabilities are:"
print(prob)
##                         prob
## head              0.23963242
## lower extremities 0.19486334
## neck              0.02403393
## trunk             0.12111216
## upper extremities 0.42035815
sprintf("The RPSS of the body part multinomial regression is %s",RPSS)
## [1] "The RPSS of the body part multinomial regression is 0.229479769549887"
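For reference, what verification::rps computes can be sketched in base R: the ranked probability score (RPS) is the mean over cases of the squared differences between cumulative forecast probabilities and cumulative one-hot outcomes, and RPSS = 1 - RPS/RPS_baseline (toy three-category example):

```r
# hand-rolled ranked probability score for categorical forecasts
rps_score <- function(obs, pred) {
  K <- ncol(pred)
  onehot <- t(sapply(obs, function(k) as.numeric(seq_len(K) == k)))
  cum_f  <- t(apply(pred,   1, cumsum))
  cum_o  <- t(apply(onehot, 1, cumsum))
  mean(rowSums((cum_f - cum_o)^2))
}

obs  <- c(1, 2, 3, 2)
pred <- matrix(c(0.7, 0.2, 0.1,
                 0.2, 0.6, 0.2,
                 0.1, 0.3, 0.6,
                 0.3, 0.5, 0.2), nrow = 4, byrow = TRUE)
# climatological baseline: empirical category frequencies repeated for every case
clim <- matrix(rep(as.numeric(table(factor(obs, levels = 1:3))) / 4, each = 4), nrow = 4)
rpss <- 1 - rps_score(obs, pred) / rps_score(obs, clim)
```

Here rps_score(obs, pred) is 0.12 against a baseline RPS of 0.375, giving RPSS = 0.68; a forecast no better than climatology gives RPSS near 0.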

(c) Repeat (b) for the Injury Severity

Read data

# find the column numbers corresponding to the important variables
index=which(colnames(data1)%in%imp.vars.severity)
Xpredictors = data1[,index]
sev=data1[,1]

Perform a PCA on the attribute data and plot the Eigen spectrum

#get variance matrix..
zs=var(Xpredictors)

#do an Eigen decomposition..
zsvd=svd(zs)

#Principal Components: project the data onto the eigenvectors
pcs1=t(t(zsvd$u) %*% t(Xpredictors))  # equivalent to Xpredictors %*% zsvd$u

#Eigen Values.. - fraction variance 
lambdas=(zsvd$d/sum(zsvd$d))

#Plot the eigen variance spectrum for the first 20 modes
#Eigen spectrum
ggplot() +
 geom_line(mapping = aes(1:20, lambdas[1:20])) +
 geom_point(mapping = aes(1:20, lambdas[1:20]), shape = 21, size = 3, color = "gray30", fill ="cadetblue1") +
 labs(title = "Eigen Spectrum 20 leading attributes",x = "Modes",y = "Frac. Var. explained")+
  theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))

sprintf("First 3 PCs explained %s percent of the variance",round(sum(lambdas[1:3])*100))
## [1] "First 3 PCs explained 38 percent of the variance"
sprintf("First 4 PCs explained %s percent of the variance",round(sum(lambdas[1:4])*100))
## [1] "First 4 PCs explained 47 percent of the variance"
sprintf("First 6 PCs explained %s percent of the variance",round(sum(lambdas[1:6])*100))
## [1] "First 6 PCs explained 60 percent of the variance"
sprintf("First 10 PCs explained %s percent of the variance",round(sum(lambdas[1:10])*100))
## [1] "First 10 PCs explained 78 percent of the variance"

Plot the leading four eigen vectors

#plot the leading four eigen vectors
tu = expand.grid(Eofs   = gl(4, 1, labels = c("EOF no.1","EOF no.2","EOF no.3","EOF no.4")),
                  variable       = imp.vars.severity)
# flatten the leading four eigenvector loadings row-wise to match the expand.grid order
a = as.vector(t(zsvd$u[, 1:4]))
tu$Eof= a
ggplot(tu, aes(x = variable, y = Eof, color = Eofs, group = Eofs)) + 
  geom_point(size=3) + geom_line()+
  labs(title = "Eigen vectors of 20 leading attributes",x = "variable",y = "Influence",color="")+
  theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))+
  theme(axis.text.x=element_text(color = "black", size=11, angle=90, vjust=.8, hjust=0.8))+
  coord_flip()

Fit Multinomial regression

pcs1=cbind(sev,pcs1)
pcs1 = data.frame(pcs1)

## or model with all the PCS
zz = multinom(sev ~ ., data=pcs1)
## # weights:  88 (63 variable)
## initial  value 2470.376552 
## iter  10 value 1437.098629
## iter  20 value 1431.364632
## iter  30 value 1430.827448
## iter  40 value 1430.770986
## iter  50 value 1430.767913
## final  value 1430.767458 
## converged
N = log(length(sev))
## stepwise selection with a BIC penalty, as for body part (assumed MASS::stepAIC)
zbest1 = stepAIC(zz, k = N, trace = 0)
summary(zbest1)
## Call:
## multinom(formula = sev ~ 1, data = pcs1)
## 
## Coefficients:
##   (Intercept)
## 2   -2.627830
## 3   -1.883903
## 4   -2.148727
## 
## Std. Errors:
##   (Intercept)
## 2  0.10568406
## 3  0.07551759
## 4  0.08487577
## 
## Residual Deviance: 2977.115 
## AIC: 2983.115
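A useful sanity check on the selected intercept-only model: its intercepts are the log-odds of each severity category relative to the first (1\(^{st}\) Aid), so inverting the softmax should recover the empirical class frequencies (intercept values taken from the summary above):

```r
# intercepts of the intercept-only multinomial fit above
b <- c(-2.627830, -1.883903, -2.148727)
p <- exp(c(0, b)) / sum(exp(c(0, b)))
# these match the empirical probabilities of the four severity categories
round(p, 4)   # 0.7458 0.0539 0.1134 0.0870
```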

Compute the RPSS

#################### RPSS
ypred = predict(zbest1, type = "probs")
# RPSS calculations
acc2 = as.numeric(sev)
N = length(acc2)
acc2_name =""
#identify the numeric index
for (i in 1:4) {
  ind=which(acc2==i)
  acc2_name[i]=as.character(data1$injury.severity[ind[1]])
}
### using verification package you get the same results..
# Empirical probabilities
p1 <- length(acc2[acc2 == 1])/N
p2 <- length(acc2[acc2 == 2])/N
p3 <- length(acc2[acc2 == 3])/N
p4 <- length(acc2[acc2 == 4])/N

prob <- c(p1, p2, p3,p4)
RPSS=rps(acc2, ypred, baseline=prob)$rpss
prob=as.data.frame(prob)
rownames(prob)=acc2_name
print("Empirical probabilities are:")
## [1] "Empirical probabilities are:"
print(prob)
##                      prob
## 1st Aid        0.74579125
## Lost Work Time 0.05387205
## Medical Case   0.11335578
## Pain           0.08698092
sprintf("The RPSS of the injury severity multinomial regression is %s",RPSS)
## [1] "The RPSS of the injury severity multinomial regression is -5.48756595719624e-11"

(d) Fit tree to Body part data and visualize

#Create Complex Tree - or full tree
myTree = tree(body.part ~., data = data, model = T)
summary(myTree)
## 
## Classification tree:
## tree(formula = body.part ~ ., data = data, model = T)
## Variables actually used in tree construction:
## [1] "small.particle"      "hazardous.substance" "object.on.the.floor"
## Number of terminal nodes:  4 
## Residual mean deviance:  2.397 = 10160 / 4240 
## Misclassification error rate: 0.4618 = 1960 / 4244
### treeFun is a helper defined in Lib_HW2.R (plots and returns the tree)
treeFit = treeFun(myTree, toPlot=T, title="Body Part Tree")

summary(treeFit)
## 
## Classification tree:
## tree(formula = body.part ~ ., data = data, model = T)
## Variables actually used in tree construction:
## [1] "small.particle"      "hazardous.substance" "object.on.the.floor"
## Number of terminal nodes:  4 
## Residual mean deviance:  2.397 = 10160 / 4240 
## Misclassification error rate: 0.4618 = 1960 / 4244
#Use the tree to predict class probabilities for the original data
treePred <- predict(treeFit)

N = length(data$body.part)
bpart=as.integer(data$body.part)
bpart_name=""
#identify the numeric index
for (i in 1:5) {
  ind=which(bpart==i)
  bpart_name[i]=as.character(data$body.part[ind[1]])
}
### using verification package you get the same results..
# Empirical probabilities
p1 = length(bpart[bpart== 1])/N
p2 = length(bpart[bpart == 2])/N
p3 = length(bpart[bpart == 3])/N
p4 = length(bpart[bpart == 4])/N
p5 = length(bpart[bpart == 5])/N
prob <- c(p1, p2, p3,p4,p5)
RPSS=rps(bpart, treePred, baseline=prob )$rpss
prob=as.data.frame(prob)
rownames(prob)=bpart_name
print("Empirical probabilities are:")
## [1] "Empirical probabilities are:"
print(prob)
##                         prob
## head              0.23963242
## lower extremities 0.19486334
## neck              0.02403393
## trunk             0.12111216
## upper extremities 0.42035815
sprintf("The RPSS of the body part CART model is %s",RPSS)
## [1] "The RPSS of the body part CART model is 0.172425236818633"
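For intuition about what tree() does at each node: CART greedily chooses the split with the largest deviance reduction. A minimal base-R sketch for binary attributes (a toy illustration, not the treeFun helper from Lib_HW2.R):

```r
# multinomial deviance of a node: -2 * sum_k n_k * log(p_k)
node_dev <- function(y) {
  n <- table(y)
  p <- n / length(y)
  -2 * sum(n * log(p + (p == 0)))  # treat 0 * log(0) as 0
}

# pick the binary column whose 0/1 split most reduces total deviance
best_split <- function(X, y) {
  drops <- sapply(seq_len(ncol(X)), function(j) {
    left <- X[, j] == 0
    node_dev(y) - node_dev(y[left]) - node_dev(y[!left])
  })
  which.max(drops)
}

set.seed(2)
y <- factor(rep(c("head", "trunk"), each = 30))
X <- cbind(informative = as.numeric(y == "trunk"),  # perfectly predictive attribute
           noise       = rbinom(60, 1, 0.5))
best_split(X, y)   # selects column 1, the informative attribute
```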

Fit tree to Injury Severity category and visualize

#Create Complex Tree - or full tree
myTree = tree(injury.severity ~., data = data1, model = T)
summary(myTree)
## 
## Classification tree:
## tree(formula = injury.severity ~ ., data = data1, model = T)
## Variables actually used in tree construction:
## character(0)
## Number of terminal nodes:  1 
## Residual mean deviance:  1.672 = 2977 / 1781 
## Misclassification error rate: 0.2542 = 453 / 1782
print("CART cannot produce a useful tree for the Injury Severity category: the fitted tree has a single terminal node")
## [1] "CART cannot produce a useful tree for the Injury Severity category: the fitted tree has a single terminal node"

Final comments

Eigen vectors of first 20 leading attributes for the safety outcome of body part

  • Although the first 3 PCs may appear sufficient from the Eigen spectrum, they explain only 39% of the data variance. More PCs are needed to capture a substantial fraction of the variance: the first 4 PCs explain only 46%, while the first 6 PCs explain 57%.

  • The Eigen vector plot of the first 20 injury precursors shows that the variables wire (EOF no. 1), powered tool (EOF no. 2), stairs (EOF no. 3), and hazardous substance (EOF no. 4) have the strongest influence.

  • The best multinomial regression retained 13 principal components, in agreement with the Eigen spectrum plot.
  • Upper extremities were the most frequently injured body part according to the empirical probabilities (42%).

Eigen vectors of first 20 leading attributes for Injury Severity

  • As with the Eigen spectrum for the body-part attributes, a large number of PCs is needed to represent a substantial fraction of the data variance: the first 4 PCs explain only 47%, while the first 6 PCs explain 60%.

  • The Eigen vector plot of the first 20 leading variables shows that hazardous substance (EOF no. 1), uneven surface (EOF no. 2), piping (EOF no. 3), and adverse low temps (EOF no. 4) have the strongest influence.

  • The best multinomial regression retained no predictors (an intercept-only model), indicating that injury severity is poorly explained by the 20 leading attributes of the accident records; consistently, its RPSS is essentially zero.

  • According to the empirical probabilities recovered by the intercept-only multinomial model, 1\(^{st}\) Aid is the most likely injury severity (74.6%).

CART on Body Part

  • Both the unpruned and pruned trees for the body part model used 3 variables: small particle (the most important), hazardous substance, and object on the floor. Only one of these variables (hazardous substance) was also prominent in the PCA eigenvectors.

  • Based on the ranked probability skill score (RPSS, where higher is better and 0 corresponds to a climatological forecast), both models have positive skill, but the higher RPSS of the PCA + multinomial regression (0.23) indicates better performance than CART (0.17).

CART on Injury Severity

  • As with the multinomial regression, the fitted tree has only one terminal node (all variables are eliminated), so there is no tree to plot.