Synopsis
Human Activity Recognition is an emerging field in which wearable devices are commonly used to quantify how much time is spent on an activity. In our analysis, we instead look at how well weight lifting exercises were performed in a study. Each individual in the experiment had accelerometer data collected from devices on different parts of the body while performing barbell exercises in five different ways. We developed machine learning models that predict the way an exercise was performed from the accelerometer data. Our final model, a random forest tuned with 10-fold cross-validation repeated 5 times, gave a 100% In Sample accuracy and a 99.0% Out of Sample accuracy.
Loading the Data
We first load the required packages: caret for machine learning and doParallel for registering a parallel backend so that training can use multi-core CPUs. We then download the train and test datasets and read them into data frames, treating both "NA" and "#DIV/0!" entries as missing values.
library(caret)
library(doParallel)
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
destfile="pml-training.csv", mode="wb")
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
destfile="pml-testing.csv", mode="wb")
dataTrain <- read.csv("pml-training.csv", na.strings=c("#DIV/0!","NA"))
dataTest <- read.csv("pml-testing.csv", na.strings=c("#DIV/0!","NA"))
Summarizing, Cleaning and Partitioning the Data
First, we see that the train dataset has 19622 rows and 160 columns. Our outcome variable, classe, takes the values A, B, C, D or E. Each letter (class) refers to one of the five different ways the barbell exercise can be performed. We check for missing values and find that 100 of these columns have at least 19216 missing values each.
dim(dataTrain)
## [1] 19622 160
missCols <- colSums(is.na(dataTrain))
names(missCols[missCols>=19216])
##  [1] "kurtosis_roll_belt" "kurtosis_picth_belt"
##  [3] "kurtosis_yaw_belt" "skewness_roll_belt"
##  [5] "skewness_roll_belt.1" "skewness_yaw_belt"
##  [7] "max_roll_belt" "max_picth_belt"
##  [9] "max_yaw_belt" "min_roll_belt"
## [11] "min_pitch_belt" "min_yaw_belt"
## [13] "amplitude_roll_belt" "amplitude_pitch_belt"
## [15] "amplitude_yaw_belt" "var_total_accel_belt"
## [17] "avg_roll_belt" "stddev_roll_belt"
## [19] "var_roll_belt" "avg_pitch_belt"
## [21] "stddev_pitch_belt" "var_pitch_belt"
## [23] "avg_yaw_belt" "stddev_yaw_belt"
## [25] "var_yaw_belt" "var_accel_arm"
## [27] "avg_roll_arm" "stddev_roll_arm"
## [29] "var_roll_arm" "avg_pitch_arm"
## [31] "stddev_pitch_arm" "var_pitch_arm"
## [33] "avg_yaw_arm" "stddev_yaw_arm"
## [35] "var_yaw_arm" "kurtosis_roll_arm"
## [37] "kurtosis_picth_arm" "kurtosis_yaw_arm"
## [39] "skewness_roll_arm" "skewness_pitch_arm"
## [41] "skewness_yaw_arm" "max_roll_arm"
## [43] "max_picth_arm" "max_yaw_arm"
## [45] "min_roll_arm" "min_pitch_arm"
## [47] "min_yaw_arm" "amplitude_roll_arm"
## [49] "amplitude_pitch_arm" "amplitude_yaw_arm"
## [51] "kurtosis_roll_dumbbell" "kurtosis_picth_dumbbell"
## [53] "kurtosis_yaw_dumbbell" "skewness_roll_dumbbell"
## [55] "skewness_pitch_dumbbell" "skewness_yaw_dumbbell"
## [57] "max_roll_dumbbell" "max_picth_dumbbell"
## [59] "max_yaw_dumbbell" "min_roll_dumbbell"
## [61] "min_pitch_dumbbell" "min_yaw_dumbbell"
## [63] "amplitude_roll_dumbbell" "amplitude_pitch_dumbbell"
## [65] "amplitude_yaw_dumbbell" "var_accel_dumbbell"
## [67] "avg_roll_dumbbell" "stddev_roll_dumbbell"
## [69] "var_roll_dumbbell" "avg_pitch_dumbbell"
## [71] "stddev_pitch_dumbbell" "var_pitch_dumbbell"
## [73] "avg_yaw_dumbbell" "stddev_yaw_dumbbell"
## [75] "var_yaw_dumbbell" "kurtosis_roll_forearm"
## [77] "kurtosis_picth_forearm" "kurtosis_yaw_forearm"
## [79] "skewness_roll_forearm" "skewness_pitch_forearm"
## [81] "skewness_yaw_forearm" "max_roll_forearm"
## [83] "max_picth_forearm" "max_yaw_forearm"
## [85] "min_roll_forearm" "min_pitch_forearm"
## [87] "min_yaw_forearm" "amplitude_roll_forearm"
## [89] "amplitude_pitch_forearm" "amplitude_yaw_forearm"
## [91] "var_accel_forearm" "avg_roll_forearm"
## [93] "stddev_roll_forearm" "var_roll_forearm"
## [95] "avg_pitch_forearm" "stddev_pitch_forearm"
## [97] "var_pitch_forearm" "avg_yaw_forearm"
## [99] "stddev_yaw_forearm" "var_yaw_forearm"
Next, we see that the first 7 columns contain data that is not relevant to the accelerometers.
colnames(dataTrain[1:7])
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
## [7] "num_window"
We clean both the train and test datasets in two steps. Since we do not want predictors that are almost completely filled with missing values, we first remove the 100 columns that contain them. We then remove the first 7 columns. The final datasets have 53 columns each.
# Keep only columns without missing values, then drop the 7 non-sensor columns
dataTrain <- dataTrain[, colSums(is.na(dataTrain)) == 0]
dataTrain <- dataTrain[, -c(1:7)]
dataTest <- dataTest[, colSums(is.na(dataTest)) == 0]
dataTest <- dataTest[, -c(1:7)]
dim(dataTrain)
## [1] 19622 53
dim(dataTest)
## [1] 20 53
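The cleaning logic and the column arithmetic can be checked on a small scale. The sketch below uses a made-up data frame (`toy` and its values are hypothetical, not taken from the dataset) and drops 2 leading columns instead of 7. It then computes the share of rows missing at the 19216 threshold and the number of columns left in the real data.

```r
# Hypothetical toy data frame: 2 bookkeeping columns, 1 sparse column,
# 1 complete sensor column and the outcome
toy <- data.frame(
  X             = 1:4,
  user_name     = c("u1", "u1", "u2", "u2"),
  max_roll_belt = c(NA, NA, NA, 2.0),
  roll_belt     = c(1.4, 1.5, 1.6, 1.7),
  classe        = c("A", "A", "B", "B")
)

# Count missing values per column
missToy <- colSums(is.na(toy))
print(missToy)

# Keep only complete columns, then drop the leading bookkeeping columns
# (2 here, 7 in the analysis)
toyClean <- toy[, missToy == 0]
toyClean <- toyClean[, -c(1:2)]
print(colnames(toyClean))

# Arithmetic for the real dataset
shareMissing <- 19216 / 19622   # minimum share of missing rows in a dropped column
remaining    <- 160 - 100 - 7   # columns left after cleaning
print(round(shareMissing, 3))
print(remaining)
```

A dropped column is therefore missing in roughly 98% of rows or more, and the 53 remaining columns are 52 predictors plus the outcome.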
Finally, we partition the training set into training and testing subsets to be used for training our models and estimating our errors. The data is split with 60% training and 40% testing.
set.seed(24340)
inTrain <- createDataPartition(y = dataTrain$classe, p = 0.6, list = FALSE)
training <- dataTrain[inTrain, ]
testing <- dataTrain[-inTrain, ]
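createDataPartition samples within each level of the outcome, so both subsets keep roughly the same class proportions. The base R sketch below illustrates that idea on made-up labels. It approximates the behavior and is not caret's implementation; `labels`, `pickTrain` and `trainIdx` are hypothetical names.

```r
set.seed(24340)

# Hypothetical outcome vector with unequal class sizes
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 60% of the row indices within one class
pickTrain <- function(idx, p = 0.6) {
  idx[sample.int(length(idx), size = round(p * length(idx)))]
}

# Apply the sampling class by class and pool the selected indices
byClass  <- split(seq_along(labels), labels)
trainIdx <- sort(unlist(lapply(byClass, pickTrain), use.names = FALSE))

# Class counts in the 60% and 40% subsets
print(table(labels[trainIdx]))
print(table(labels[-trainIdx]))
```

Each class contributes 60% of its rows to the training indices, so the 50/30/20 mix of the toy labels carries over to both subsets.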
Fitting Predictive Models
We fit several machine learning algorithms with the caret package: linear discriminant analysis, naive Bayes, CART, random forest, partial least squares and stochastic gradient boosting. In this first step, we use caret's defaults, including bootstrap resampling (25 iterations). The models are stored in a list. For each model we also compute the confusion matrix on the training subset to obtain the In Sample Error, which is used to choose a final model. Parallel processing is used to reduce computation time.
cl <- makeCluster(detectCores())
registerDoParallel(cl) # Register parallel backend
model <- list()
modelError <- list()
set.seed(24340)
model[[1]] <- train(classe ~ ., data=training, method="lda")
modelError[[1]] <- confusionMatrix(predict(model[[1]], newdata=training), training$classe)
set.seed(24340)
model[[2]] <- train(classe ~ ., data=training, method="nb")
modelError[[2]] <- confusionMatrix(predict(model[[2]], newdata=training), training$classe)
set.seed(24340)
model[[3]] <- train(classe ~ ., data=training, method="rpart")
modelError[[3]] <- confusionMatrix(predict(model[[3]], newdata=training), training$classe)
set.seed(24340)
model[[4]] <- train(classe ~ ., data=training, method="rf")
modelError[[4]] <- confusionMatrix(predict(model[[4]], newdata=training), training$classe)
set.seed(24340)
model[[5]] <- train(classe ~ ., data=training, method="pls")
modelError[[5]] <- confusionMatrix(predict(model[[5]], newdata=training), training$classe)
set.seed(24340)
model[[6]] <- train(classe ~ ., data=training, method="gbm")
modelError[[6]] <- confusionMatrix(predict(model[[6]], newdata=training), training$classe)
We generate a table that summarizes our predictive models.
Algorithm | Resampling | In Sample Error (1-Accuracy) |
---|---|---|
Linear Discriminant Analysis | boot (25 iterations) | 28.8% |
Naive Bayes | boot (25 iterations) | 24.5% |
CART | boot (25 iterations) | 50.2% |
Random Forest | boot (25 iterations) | 0.0% |
Partial Least Squares | boot (25 iterations) | 60.8% |
Stochastic Gradient Boosting | boot (25 iterations) | 2.4% |
The table is generated from attributes of each model object. The Random Forest model has a 0.0% In Sample Error (100% accuracy), meaning it predicts the outcome variable classe perfectly on the training subset, followed by Stochastic Gradient Boosting at 2.4%. We therefore select the random forest and tune it to build our final model.
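The selection step can be replayed from the numbers in the table. In this sketch the errors are copied by hand from the table above rather than read from the model objects, and `inSampleError`, `ranked` and `bestModel` are hypothetical names.

```r
# In-sample error (percent) per algorithm, copied from the table above
inSampleError <- c(
  "Linear Discriminant Analysis" = 28.8,
  "Naive Bayes"                  = 24.5,
  "CART"                         = 50.2,
  "Random Forest"                = 0.0,
  "Partial Least Squares"        = 60.8,
  "Stochastic Gradient Boosting" = 2.4
)

# Rank the algorithms from lowest to highest error
ranked <- sort(inSampleError)
print(ranked)

# The first entry is the candidate for the final model
bestModel <- names(ranked)[1]
print(bestModel)
```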
Final Model Results
Our final model is a random forest tuned with 10-fold cross-validation repeated 5 times. We then obtain our Out of Sample Error by predicting on the held-out testing subset.
ctrl <- trainControl(method="repeatedcv", number=10, repeats=5)
set.seed(24340)
model[[7]] <- train(classe ~ ., data=training, method="rf", trControl=ctrl)
modelError[[7]] <- confusionMatrix(predict(model[[7]], newdata=testing), testing$classe)
stopCluster(cl) # Stop parallel backend
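Repeated 10-fold cross-validation splits the training subset into 10 folds, holds each fold out once, and repeats the whole procedure 5 times with a new random split. The base R sketch below shows the resulting resample sizes for the 11776 training rows. It assigns folds purely at random, which is an assumption for illustration: caret balances folds by class, so its sizes can differ by a row or two. `foldIds`, `heldOutSizes` and `fitSizes` are hypothetical names.

```r
set.seed(24340)

n       <- 11776   # rows in the training subset
k       <- 10      # folds
repeats <- 5

# For each repeat, randomly assign every row to one of k folds
foldIds <- lapply(seq_len(repeats), function(r) sample(rep_len(seq_len(k), n)))

# Each fold is held out once per repeat, giving k * repeats resamples
heldOutSizes <- unlist(lapply(foldIds, function(f) as.vector(table(f))))
fitSizes     <- n - heldOutSizes

print(length(fitSizes))   # number of resamples
print(range(fitSizes))    # rows used for fitting in each resample
```

Every candidate value of the tuning parameter is therefore evaluated on 50 resamples, each fitted on about 90% of the training subset.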
We show the results of our final model.
Algorithm | Resampling | Out of Sample Error (1-Accuracy) |
---|---|---|
Random Forest | repeatedcv (10 folds, 5 repeats) | 1.0% |
This table shows that our final model (with cross-validation) has a 1.0% Out of Sample Error (99.0% accuracy) on the testing subset when predicting the classe outcome. As the confusion matrix below shows, 7770 of the 7846 observations are classified correctly.
Class | A | B | C | D | E |
---|---|---|---|---|---|
A | 2225 | 14 | 0 | 0 | 0 |
B | 5 | 1497 | 15 | 0 | 1 |
C | 1 | 6 | 1345 | 22 | 0 |
D | 0 | 1 | 8 | 1263 | 1 |
E | 1 | 0 | 0 | 1 | 1440 |
The confusion matrix above shows which predictions on the testing subset were correct and which were not. Following the layout of caret's confusionMatrix output, our predictions are on the rows and the actual values are on the columns. The off-diagonal elements are the errors.
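The reported error follows directly from the matrix: correct predictions sit on the diagonal, and everything else is an error. A minimal sketch, with the counts copied by hand from the table above (`cm`, `correct`, `total` and `accuracy` are hypothetical names):

```r
# Confusion matrix as reported above (rows and columns in order A to E)
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(2225,   14,    0,    0,    0,
                  5, 1497,   15,    0,    1,
                  1,    6, 1345,   22,    0,
                  0,    1,    8, 1263,    1,
                  1,    0,    0,    1, 1440),
             nrow = 5, byrow = TRUE,
             dimnames = list(classes, classes))

# Correct predictions are on the diagonal
correct  <- sum(diag(cm))
total    <- sum(cm)
accuracy <- correct / total

print(c(correct = correct, total = total))
print(round(100 * (1 - accuracy), 1))   # out-of-sample error in percent
```

The 76 off-diagonal observations out of 7846 give the 1.0% Out of Sample Error.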
## Random Forest
##
## 11776 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
##
## Summary of sample sizes: 10597, 10599, 10598, 10599, 10598, 10597, ...
##
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
##    2    0.9901155  0.9874947  0.002940465  0.003721159
##   27    0.9913889  0.9891065  0.002680881  0.003392092
##   52    0.9840687  0.9798463  0.003778697  0.004779172
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
The summary of our final model shows how the accuracy varies with mtry, the number of randomly selected predictors.
The plot shows the relationship between the number of randomly selected predictors and the accuracy. As we can see, the accuracy is highest when mtry, the number of variables available for splitting at each tree node, is 27 (as also stated in the summary above).
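The selection rule stated in the summary, picking the mtry with the largest cross-validated accuracy, can be replayed on the reported values. The accuracies below are copied by hand from the summary; `cvResults` and `best` are hypothetical names.

```r
# Cross-validated accuracy per mtry, copied from the model summary above
cvResults <- data.frame(
  mtry     = c(2, 27, 52),
  Accuracy = c(0.9901155, 0.9913889, 0.9840687)
)

# Selection rule: keep the row with the largest accuracy
best <- cvResults[which.max(cvResults$Accuracy), ]
print(best$mtry)

# Gap between the best and the worst candidate, in percentage points
print(round(100 * diff(range(cvResults$Accuracy)), 2))
```

The three candidates differ by less than one percentage point, so the model is not very sensitive to mtry in this range. The candidates themselves are consistent with an evenly spaced grid over the 52 predictors (2, 27 and 52).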
We then use our final model to predict the outcome classe on our separate test dataset. The results are satisfactory as all 20 observations are predicted correctly.
Conclusions
The HAR weight lifting exercises dataset has accelerometer readings that allow us to predict which of the five different ways of lifting a barbell was performed. Our data analysis selected the random forest algorithm with 10-fold cross-validation repeated 5 times as our best model, with a 0% In Sample Error and a 1.0% Out of Sample Error. Our final model was also able to correctly predict all 20 outcomes of the separate test dataset.