This report was created for a Canadian startup that builds wearable fitness trackers used in gyms and an accompanying mobile application. My solution yielded the best actual results among all report submissions from select individuals with highly qualified backgrounds. Some of the code has intentionally been removed.
We look at nine datasets containing raw acceleration and rotational velocity data measured by the 6-axis IMU inside a wearable device during a set of bent-over rows performed with a barbell. We developed machine learning models that predict the time interval in which the exercise was performed in each dataset. Our final model, a random forest tuned with 10-fold cross-validation repeated 5 times, gave a 100% In Sample accuracy and a 95.0% Out of Sample accuracy. We use this model to generate plots for each of the 8 remaining datasets indicating when the exercise was likely performed. At the time of this report we did not have a separate labelled test dataset, so both accuracy figures come from subsets of the single labelled training file.
Loading the Data
We load the doParallel package to register a parallel backend so that model training can use multi-core CPUs. We then load the 9 datasets into data frame objects. Each dataset has 800 observations spanning 34 seconds, and each time capture includes the 6-axis sensor readings.
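The loading step can be sketched roughly as follows. This is only an illustration: the directory layout, the file names other than example.csv, the toy column names and the helper names `load_datasets` and `start_backend` are assumptions, not the report's actual code.

```r
# Read every CSV in a directory into a named list of data frames.
# `data_dir` and the file names are hypothetical.
load_datasets <- function(data_dir) {
  files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
  datasets <- lapply(files, read.csv)
  names(datasets) <- tools::file_path_sans_ext(basename(files))
  datasets
}

# Register a parallel backend for later training calls.
# Returns the cluster so it can be stopped with parallel::stopCluster().
start_backend <- function(cores = 2) {
  cl <- parallel::makeCluster(cores)
  doParallel::registerDoParallel(cl)
  cl
}

# Small demonstration with two toy files written to a temporary directory
demo_dir <- file.path(tempdir(), "imu_demo")
dir.create(demo_dir, showWarnings = FALSE)
toy <- data.frame(time = c(0, 0.04), ax = c(0.1, 0.2))
write.csv(toy, file.path(demo_dir, "example.csv"), row.names = FALSE)
write.csv(toy, file.path(demo_dir, "dataset1.csv"), row.names = FALSE)

datafiles <- load_datasets(demo_dir)
names(datafiles)
```

Keeping the datasets in a named list makes it easy to pick out the training file by name and to loop over the rest when predicting.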
Partitioning the Data
In order to build predictive models, we use example.csv as the training set. This training set is very small, which makes the prediction task difficult. We are told that the exercise was performed between 13.2 and 20.5 seconds, and we use this information to create an outcome column. A value of 1 for the exercise outcome indicates that the exercise was being performed at a time capture, and 0 indicates no exercise.
Finally, the training set is partitioned into training (60%) and testing (40%) subsets, on which our predictive models will be based.
```r
# Use example.csv as training set
# Add outcome variable 'exercise' based on condition
trainset <- datafiles[]
trainset$exercise <- ifelse(trainset>=13.2 & trainset<=20.5, 1, 0)
trainset$exercise <- as.factor(trainset$exercise)
trainset <- NULL

# Partition training set into training and testing subsets
set.seed(12170)
inTrain <- createDataPartition(y = trainset$exercise, p = 0.6, list = FALSE)
training <- trainset[inTrain, ]
testing <- trainset[-inTrain, ]
```
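Because parts of the code above were removed, here is a self-contained toy version of the same idea in base R. The data frame and its `time` column are assumptions made for illustration, and the stratified split is written by hand to show roughly what a class-balanced 60/40 partition does. It is not a reproduction of the caret call.

```r
set.seed(12170)

# Toy stand-in: 800 evenly spaced time stamps over 34 seconds.
# The column name `time` is assumed; the real column name is not shown in the report.
toy <- data.frame(time = seq(0, 34, length.out = 800))

# Label each time capture: 1 inside the known exercise window, 0 outside
toy$exercise <- factor(ifelse(toy$time >= 13.2 & toy$time <= 20.5, 1, 0))

# Stratified split: sample 60% of the row indices within each class
idx_by_class <- split(seq_len(nrow(toy)), toy$exercise)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = ceiling(0.6 * length(idx)))
}))

training <- toy[in_train, ]
testing  <- toy[-in_train, ]

# Class balance is preserved in both subsets
table(training$exercise)
table(testing$exercise)
```

Sampling within each class keeps the share of exercise and non-exercise rows similar in both subsets, which matters when one class is much smaller than the other.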
Fitting Predictive Models
We implement several machine learning algorithms: linear discriminant analysis, naive Bayes, CART, random forest, partial least squares and stochastic gradient boosting. In this first step, we use the default training settings, such as bootstrap resampling. The models are stored in a list. For each model we also compute the confusion matrix on the training subset to find the In Sample Error, which is used to choose a final model. Parallel processing is used to reduce computation time.
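A minimal sketch of this step, assuming the caret package and synthetic data in place of the real sensor readings. The predictor names are made up, and only two of the six algorithms are fitted to keep the run short. In caret the remaining algorithms would typically correspond to the method strings "nb", "rf", "pls" and "gbm", each of which needs its own backing package.

```r
library(caret)

set.seed(12170)

# Synthetic stand-in for the training subset; predictor names are hypothetical
n <- 200
training <- data.frame(ax = rnorm(n), ay = rnorm(n), az = rnorm(n))
training$exercise <- factor(ifelse(training$ax + rnorm(n, sd = 0.5) > 0, 1, 0))

# Named vector of caret method strings (a subset, for speed)
methods <- c(lda = "lda", cart = "rpart")

# Fit each model with caret's defaults (25 bootstrap resamples) and keep them in a list
models <- lapply(methods, function(m) {
  train(exercise ~ ., data = training, method = m)
})

# In-sample error: 1 - accuracy when predicting the data the model was trained on
in_sample_error <- sapply(models, function(fit) {
  pred <- predict(fit, newdata = training)
  cm <- confusionMatrix(pred, training$exercise)
  1 - unname(cm$overall["Accuracy"])
})

# Summary table assembled from attributes stored on each fitted model
summary_table <- data.frame(
  Algorithm = sapply(models, function(fit) fit$modelInfo$label),
  Resampling = sapply(models, function(fit) {
    paste0(fit$control$method, " (", fit$control$number, " iterations)")
  }),
  InSampleErrorPct = round(100 * in_sample_error, 1)
)
summary_table
```

Storing the fits in a named list means the error calculation and the summary table can both be produced with a single pass over the list.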
We generate a table that summarizes our predictive models.
| Algorithm | Resampling | In Sample Error (1 - Accuracy) |
|---|---|---|
| Linear Discriminant Analysis | boot (25 iterations) | 14.3% |
| Naive Bayes | boot (25 iterations) | 7.7% |
| CART | boot (25 iterations) | 10.0% |
| Random Forest | boot (25 iterations) | 0.0% |
| Partial Least Squares | boot (25 iterations) | 21.6% |
| Stochastic Gradient Boosting | boot (25 iterations) | 2.3% |
The table is populated automatically from attributes of each fitted model. The Random Forest model has a 0.0% In Sample Error (100% accuracy), meaning that it predicts the outcome variable exercise perfectly on the training subset. We therefore select it and tune it to build our final model.
Final Model Results
Our final model will use tuning that includes 10-fold cross-validation repeated 5 times to obtain our Out of Sample Error on the testing subset.
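The tuning setup described here might look like the following sketch. It assumes the caret and randomForest packages and uses synthetic data with invented predictor names, so the numbers it produces are unrelated to the report's results.

```r
library(caret)

set.seed(12170)

# Synthetic stand-in with six predictors (names are hypothetical)
n <- 300
dat <- data.frame(
  ax = rnorm(n), ay = rnorm(n), az = rnorm(n),
  gx = rnorm(n), gy = rnorm(n), gz = rnorm(n)
)
dat$exercise <- factor(ifelse(dat$ax + dat$gy + rnorm(n, sd = 0.5) > 0, 1, 0))

# 60/40 split into training and testing subsets
in_train <- createDataPartition(y = dat$exercise, p = 0.6, list = FALSE)
training <- dat[in_train, ]
testing  <- dat[-in_train, ]

# 10-fold cross-validation repeated 5 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Random forest; caret tries a small grid of mtry values by default
final_model <- train(exercise ~ ., data = training, method = "rf", trControl = ctrl)

# Out-of-sample error: 1 - accuracy on the held-out testing subset
pred <- predict(final_model, newdata = testing)
cm <- confusionMatrix(pred, testing$exercise)
out_of_sample_error <- 1 - unname(cm$overall["Accuracy"])
out_of_sample_error
```

The cross-validation only guides the choice of mtry inside the training subset. The error reported at the end comes from rows the model never saw during training.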
We show the results of our final model.
| Algorithm | Resampling | Out of Sample Error (1 - Accuracy) |
|---|---|---|
| Random Forest | repeatedcv (10 folds, 5 repeats) | 5.0% |
This table shows that our final model (with cross-validation) has a 5.0% Out of Sample Error (95.0% accuracy) with the testing subset in predicting the exercise outcome. The result is very positive.
The confusion matrix above shows which predictions on the testing subset were correct and which were not. Our predictions are on the columns and the rows are the actual values. The non-diagonal elements are the errors.
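To make the relationship between the confusion matrix and the error rate concrete, here is a small worked example in base R. The ten labels are invented for illustration and are not taken from the report's data.

```r
# Invented actual and predicted labels for ten time captures
actual    <- factor(c(0, 0, 0, 1, 1, 1, 0, 1, 0, 0), levels = c(0, 1))
predicted <- factor(c(0, 0, 1, 1, 1, 0, 0, 1, 0, 0), levels = c(0, 1))

# Rows are actual values, columns are predictions
cm <- table(Actual = actual, Predicted = predicted)
cm

# Diagonal cells are correct predictions; off-diagonal cells are errors
accuracy <- sum(diag(cm)) / sum(cm)
error <- 1 - accuracy
error
```

Here eight of the ten predictions fall on the diagonal, so the accuracy is 0.8 and the error is 0.2. The two off-diagonal cells hold one false positive and one false negative.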
```
## Random Forest
##
## 481 samples
##   6 predictor
##   2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 432, 433, 433, 433, 433, 432, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
##   2     0.9472225  0.8386610  0.02456231   0.07552316
##   4     0.9393306  0.8186186  0.02777259   0.08375304
##   6     0.9334784  0.8024068  0.03293069   0.09561860
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
```
The summary of our final model shows how the cross-validated accuracy varies with mtry, the number of randomly selected predictors.
The plot shows the relationship between the number of randomly selected predictors and the accuracy. As we can see, the accuracy is highest when mtry, the number of variables available for splitting at each tree node, is 2 (as also stated in the summary above).
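The relationship can be reproduced directly from the numbers printed in the model summary. The sketch below copies those values into a data frame by hand and plots them with base R, so it is a stand-in for the original plot rather than the report's plotting code.

```r
# Resampling results copied from the model summary
tuning <- data.frame(
  mtry       = c(2, 4, 6),
  Accuracy   = c(0.9472225, 0.9393306, 0.9334784),
  AccuracySD = c(0.02456231, 0.02777259, 0.03293069)
)

# Row with the highest cross-validated accuracy
best <- tuning[which.max(tuning$Accuracy), ]
best$mtry

# Accuracy against the number of randomly selected predictors
plot(
  tuning$mtry, tuning$Accuracy,
  type = "b",
  xlab = "Randomly selected predictors (mtry)",
  ylab = "Accuracy (repeated cross-validation)"
)
```

Note that the gap between mtry = 2 and mtry = 6 (about 0.014) is smaller than the accuracy standard deviations reported in the same summary, so the preference for mtry = 2 is modest.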
We then use our final model to predict the outcome exercise on each of our 8 datasets. Please see the plots below.
For each plot, we can see when the exercise was likely performed. The predictions for some datasets, such as 1, 5 and 7, appear visually more accurate than for others.
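One possible way to turn per-capture predictions into a single time interval is to take the longest uninterrupted run of predicted 1s. This post-processing step is an assumption added for illustration and was not part of the report's method. The helper name `predicted_interval` and the toy inputs are hypothetical.

```r
# Return the start and end time of the longest run of predicted 1s.
# `time` and `pred` must have the same length; `pred` holds 0/1 labels.
predicted_interval <- function(time, pred) {
  runs   <- rle(as.character(pred))
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  is_one <- runs$values == "1"
  if (!any(is_one)) {
    return(c(start = NA_real_, end = NA_real_))
  }
  # Index of the longest run among those labelled 1
  longest <- which(is_one)[which.max(runs$lengths[is_one])]
  c(start = time[starts[longest]], end = time[ends[longest]])
}

# Toy example: an isolated false positive at t = 1 and a main run from t = 4 to t = 7
time <- seq(0, 9, by = 1)
pred <- factor(c(0, 1, 0, 0, 1, 1, 1, 1, 0, 0), levels = c(0, 1))
predicted_interval(time, pred)
```

In the toy example the isolated prediction at t = 1 is ignored and the function returns the interval from 4 to 7. Datasets whose predictions are more isolated in the plots would give a cleaner interval under this approach.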
Our analysis selected a random forest tuned with 10-fold cross-validation repeated 5 times as the best model. The results were very positive, with a 0% In Sample Error and a 5.0% Out of Sample Error. The final model was also able to predict the exercise time interval for each of the 8 datasets despite the very small amount of training data. Exercise predictions for datasets 1, 5 and 7 were more isolated in the plots, which suggests better accuracy. An additional labelled testing set would allow a more complete assessment of accuracy.