Synopsis

Human Activity Recognition (HAR) is an emerging field in which wearable devices are commonly used to quantify how much of an activity is performed. In our analysis, we instead look at how well weight lifting exercises were performed in a study. Each participant had accelerometer data collected from devices on different parts of the body while performing barbell exercises in five different ways. We developed machine learning algorithms that predict, from the accelerometer data, which of the five ways an exercise was performed. Our final model, a random forest trained with 10-fold cross-validation repeated 5 times, gave us a 100% In Sample accuracy and a 99.0% Out of Sample accuracy.

Loading the Data

We first load the required packages: caret for machine learning and doParallel for registering a parallel backend so that training can utilize multi-core CPUs. We then download the training and test datasets and read them into data frame objects.

library(caret)
library(doParallel)
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
               destfile="pml-training.csv", mode="wb")
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
              destfile="pml-testing.csv", mode="wb")
# Treat both "#DIV/0!" spreadsheet artifacts and "NA" as missing values
dataTrain <- read.csv("pml-training.csv", na.strings=c("#DIV/0!","NA"))
dataTest <- read.csv("pml-testing.csv", na.strings=c("#DIV/0!","NA"))

Summarizing, Cleaning and Partitioning the Data

First, we see that the training dataset has 19,622 rows and 160 columns. Our outcome variable, classe, takes the values A, B, C, D or E; each letter (class) refers to one of the five different ways the barbell exercise can be performed. We check for missing values and find that 100 of these columns have at least 19,216 missing values each.

dim(dataTrain)
## [1] 19622   160
missCols <- colSums(is.na(dataTrain))
names(missCols[missCols>=19216])
##   [1] "kurtosis_roll_belt"       "kurtosis_picth_belt"     
##   [3] "kurtosis_yaw_belt"        "skewness_roll_belt"      
##   [5] "skewness_roll_belt.1"     "skewness_yaw_belt"       
##   [7] "max_roll_belt"            "max_picth_belt"          
##   [9] "max_yaw_belt"             "min_roll_belt"           
##  [11] "min_pitch_belt"           "min_yaw_belt"            
##  [13] "amplitude_roll_belt"      "amplitude_pitch_belt"    
##  [15] "amplitude_yaw_belt"       "var_total_accel_belt"    
##  [17] "avg_roll_belt"            "stddev_roll_belt"        
##  [19] "var_roll_belt"            "avg_pitch_belt"          
##  [21] "stddev_pitch_belt"        "var_pitch_belt"          
##  [23] "avg_yaw_belt"             "stddev_yaw_belt"         
##  [25] "var_yaw_belt"             "var_accel_arm"           
##  [27] "avg_roll_arm"             "stddev_roll_arm"         
##  [29] "var_roll_arm"             "avg_pitch_arm"           
##  [31] "stddev_pitch_arm"         "var_pitch_arm"           
##  [33] "avg_yaw_arm"              "stddev_yaw_arm"          
##  [35] "var_yaw_arm"              "kurtosis_roll_arm"       
##  [37] "kurtosis_picth_arm"       "kurtosis_yaw_arm"        
##  [39] "skewness_roll_arm"        "skewness_pitch_arm"      
##  [41] "skewness_yaw_arm"         "max_roll_arm"            
##  [43] "max_picth_arm"            "max_yaw_arm"             
##  [45] "min_roll_arm"             "min_pitch_arm"           
##  [47] "min_yaw_arm"              "amplitude_roll_arm"      
##  [49] "amplitude_pitch_arm"      "amplitude_yaw_arm"       
##  [51] "kurtosis_roll_dumbbell"   "kurtosis_picth_dumbbell" 
##  [53] "kurtosis_yaw_dumbbell"    "skewness_roll_dumbbell"  
##  [55] "skewness_pitch_dumbbell"  "skewness_yaw_dumbbell"   
##  [57] "max_roll_dumbbell"        "max_picth_dumbbell"      
##  [59] "max_yaw_dumbbell"         "min_roll_dumbbell"       
##  [61] "min_pitch_dumbbell"       "min_yaw_dumbbell"        
##  [63] "amplitude_roll_dumbbell"  "amplitude_pitch_dumbbell"
##  [65] "amplitude_yaw_dumbbell"   "var_accel_dumbbell"      
##  [67] "avg_roll_dumbbell"        "stddev_roll_dumbbell"    
##  [69] "var_roll_dumbbell"        "avg_pitch_dumbbell"      
##  [71] "stddev_pitch_dumbbell"    "var_pitch_dumbbell"      
##  [73] "avg_yaw_dumbbell"         "stddev_yaw_dumbbell"     
##  [75] "var_yaw_dumbbell"         "kurtosis_roll_forearm"   
##  [77] "kurtosis_picth_forearm"   "kurtosis_yaw_forearm"    
##  [79] "skewness_roll_forearm"    "skewness_pitch_forearm"  
##  [81] "skewness_yaw_forearm"     "max_roll_forearm"        
##  [83] "max_picth_forearm"        "max_yaw_forearm"         
##  [85] "min_roll_forearm"         "min_pitch_forearm"       
##  [87] "min_yaw_forearm"          "amplitude_roll_forearm"  
##  [89] "amplitude_pitch_forearm"  "amplitude_yaw_forearm"   
##  [91] "var_accel_forearm"        "avg_roll_forearm"        
##  [93] "stddev_roll_forearm"      "var_roll_forearm"        
##  [95] "avg_pitch_forearm"        "stddev_pitch_forearm"    
##  [97] "var_pitch_forearm"        "avg_yaw_forearm"         
##  [99] "stddev_yaw_forearm"       "var_yaw_forearm"

Next, we see that the first 7 columns contain identifiers, timestamps and window markers that are not relevant to the accelerometer readings.

colnames(dataTrain[1:7])
## [1] "X"                    "user_name"            "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
## [7] "num_window"

We clean both the training and test datasets by removing the first 7 columns, and since we do not want predictors that are almost entirely missing, we also remove those 100 columns. The final datasets each have 53 columns.

dataTrain <- dataTrain[, complete.cases(t(dataTrain))]   # Keep only columns with no missing values
dataTrain <- dataTrain[, -c(1:7)]                        # Drop the 7 identifier/timestamp columns
dataTest <- dataTest[, complete.cases(t(dataTest))]
dataTest <- dataTest[, -c(1:7)]
dim(dataTrain)
## [1] 19622    53
dim(dataTest)
## [1] 20 53

Finally, we partition the training set into training and testing subsets to be used for training our models and estimating our errors. The data is split with 60% training and 40% testing.

set.seed(24340)
inTrain <- createDataPartition(y=dataTrain$classe, p=0.6, list=FALSE)
training <- dataTrain[inTrain, ]
testing <- dataTrain[-inTrain, ]

Fitting Predictive Models

We fit various machine learning algorithms using the caret package: linear discriminant analysis, naive Bayes, CART, random forest, partial least squares and stochastic gradient boosting. In this first step, we use caret's default tuning settings, which resample via bootstrapping (25 iterations). The fitted models are stored in a list. We also calculate a confusion matrix for each model on the training subset to obtain its In Sample Error, which we use to choose a final model. Parallel processing is used to speed up computation time.

cl <- makeCluster(detectCores())
registerDoParallel(cl)      # Register parallel backend

# Fit each algorithm with caret's defaults (bootstrap resampling, 25 reps)
# and compute its confusion matrix on the training subset
methods <- c("lda", "nb", "rpart", "rf", "pls", "gbm")
model <- list()
modelError <- list()
for (i in seq_along(methods)) {
    set.seed(24340)
    model[[i]] <- train(classe ~ ., data=training, method=methods[i])
    modelError[[i]] <- confusionMatrix(predict(model[[i]], newdata=training),
                                       training$classe)
}

We generate a table that summarizes our predictive models.

Algorithm                      Resampling             In Sample Error (1-Accuracy)
Linear Discriminant Analysis   boot (25 iterations)   28.8%
Naive Bayes                    boot (25 iterations)   24.5%
CART                           boot (25 iterations)   50.2%
Random Forest                  boot (25 iterations)    0.0%
Partial Least Squares          boot (25 iterations)   60.8%
Stochastic Gradient Boosting   boot (25 iterations)    2.4%

The table is populated with attributes fetched from each model object. We can see that the Random Forest model has a 0.0% In Sample Error (100% accuracy). This model is sufficient for our purposes because it perfectly predicts the outcome variable classe on the training subset. We will tune this model to build our final model.
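
As a sketch of how such a table can be assembled from the caret objects (the layout below is illustrative, not necessarily the exact code used to render the table above):

# Pull each algorithm's label and resampling method from its train object,
# and the In Sample Error from its confusion matrix
data.frame(
    Algorithm  = sapply(model, function(m) m$modelInfo$label),
    Resampling = sapply(model, function(m) m$control$method),
    InSampleError = sapply(modelError,
                           function(cm) round(100 * (1 - cm$overall["Accuracy"]), 1))
)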

Final Model Results

Our final model will be a random forest tuned with 10-fold cross-validation repeated 5 times; we will then estimate its Out of Sample Error on the testing subset.

ctrl <- trainControl(method="repeatedcv", number=10, repeats=5)
set.seed(24340)
model[[7]] <- train(classe ~ ., data=training, method="rf", trControl=ctrl)
modelError[[7]] <- confusionMatrix(predict(model[[7]], newdata=testing), testing$classe)
stopCluster(cl)     # Stop parallel backend

We show the results of our final model.

Algorithm       Resampling                         Out of Sample Error (1-Accuracy)
Random Forest   repeatedcv (10 folds, 5 repeats)   1.0%

This table shows that our final model (with repeated cross-validation) has a 1.0% Out of Sample Error (99.0% accuracy) on the testing subset when predicting the classe outcome. The result is very positive.

          Reference
Prediction    A    B    C    D    E
         A 2225   14    0    0    0
         B    5 1497   15    0    1
         C    1    6 1345   22    0
         D    0    1    8 1263    1
         E    1    0    0    1 1440

The confusion matrix above shows which predictions on the testing subset were correct and which were not. Our predictions are on the rows and the actual (reference) values are on the columns; the off-diagonal elements are the errors.
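
The Out of Sample Error quoted above can be read directly from the stored confusion matrix object:

1 - modelError[[7]]$overall["Accuracy"]   # Out of Sample Error on the testing subset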

## Random Forest 
## 
## 11776 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## 
## Summary of sample sizes: 10597, 10599, 10598, 10599, 10598, 10597, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9901155  0.9874947  0.002940465  0.003721159
##   27    0.9913889  0.9891065  0.002680881  0.003392092
##   52    0.9840687  0.9798463  0.003778697  0.004779172
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.

The summary of our final model shows how the accuracy varies with the number of randomly selected predictors.

Relationship Between Randomly Selected Predictors and Accuracy

The plot shows the relationship between the number of randomly selected predictors and the accuracy. As we can see, the accuracy is highest when mtry, the number of variables available for splitting at each tree node, is 27 (as also stated in the summary above).
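
If you are reproducing this analysis, the plot can be generated with caret's plot method for train objects:

plot(model[[7]])   # Accuracy as a function of mtry for the final model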

We then use our final model to predict the outcome classe on our separate test dataset. The results are satisfactory as all 20 observations are predicted correctly.
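
For completeness, this prediction step is a single predict call on the cleaned test dataset:

predict(model[[7]], newdata=dataTest)   # Predict classe for the 20 test observations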

Conclusions

The HAR weight lifting exercises dataset has accelerometer readings that allow us to predict one of the five different ways of lifting barbells. Our data analysis yielded the random forest algorithm with 10-fold cross-validation repeated 5 times as our best model. The results were very positive with a 0% In Sample Error and 1.0% Out of Sample Error. Our final model was also able to correctly predict all outcomes of the separate test dataset.
