Top Skills for a Data Scientist – 2019


We look at the key skills required of a data scientist, updated for 2019. We do so by gathering 2,000 job posts and using text mining to extract the skills they mention. We also use algorithms to find out how the skills relate to one another. Finally, we look at the implications of identifying the in-demand skills that will affect the workforce and the economy.

Open Data Day: Code and the City

I was a participant in a hackathon called Code and the City. The event was held in celebration of Open Data Day. Along with industry sponsors like Soti, Amazon, Microsoft and Cisco, the event sponsors included the City of Mississauga and Sheridan College.


The idea was to use open data to answer a problem set that would benefit the City of Mississauga, which has a population of almost 800,000:

How can Mississauga gain greater awareness and engagement with the community in a digital environment?


Wearable Fitness Tracker Predictive Modeling


This report was created for a Canadian startup that builds wearable fitness trackers for use in gyms, along with an accompanying mobile application. My solution yielded the best actual results among all report submissions from select individuals with highly qualified backgrounds. Some of the code has been intentionally removed.


Human Activity Recognition and Machine Learning


Human Activity Recognition is emerging as a new field where wearable devices are commonly used to quantify the amount of time an activity is performed. In our analysis, we instead look at how well weight lifting exercises were performed in a study. Accelerometer data was collected from devices worn on different parts of each participant's body while they performed barbell exercises in five different ways. We developed machine learning models that predict the way an exercise was performed from the accelerometer data. Our final model, a random forest trained with 10-fold cross-validation repeated 5 times, gave 100% in-sample accuracy and 99.0% out-of-sample accuracy.
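
As a rough illustration of that resampling setup, here is a minimal sketch of a random forest trained with 10-fold cross-validation repeated 5 times in caret. It is not the code from the analysis: the built-in iris data stands in for the accelerometer data, the seed is arbitrary, and method = "rf" assumes the randomForest package is installed.

```r
library(caret)

set.seed(42)  # arbitrary seed so the resampling folds are reproducible

# 10-fold cross-validation, repeated 5 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Assumption: iris is a stand-in for the accelerometer data, which is not shown here
fit <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)

# Cross-validated accuracy for each candidate mtry; this estimates out-of-sample accuracy
print(fit$results[, c("mtry", "Accuracy")])

# Accuracy on the training data itself is the (optimistic) in-sample figure
in_sample <- mean(predict(fit, iris) == iris$Species)
print(in_sample)
```

The gap between the two printed figures is the reason to report both: the in-sample number is measured on data the model has already seen, while the cross-validated number is measured on held-out folds.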


Parallelize Machine Learning in R with Multi-Core CPUs

R supports parallel computation through the core parallel package. The doParallel package provides a backend built on top of it. The caret package is used for developing and testing machine learning models in R. caret, like other packages such as plyr, supports multi-core CPU speedups if a parallel backend is registered before the supported functions are called.

The train function of the caret package has built-in support for parallel backends, but you have to set one up and register it yourself. If you don’t register a backend, train falls back to single-core computation. With a registered parallel backend, caret model training will use multiple CPU cores, since allowParallel = TRUE is already the default in trainControl.
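
A minimal sketch of registering a backend before calling train might look like the following. The dataset (iris), the model (method = "rf", which assumes the randomForest package is installed), the 5-fold setup, and the choice to leave one core free are all illustrative assumptions rather than requirements.

```r
library(caret)
library(doParallel)

# Assumption: leave one core free for the rest of the system; fall back to 1 worker
cores <- parallel::detectCores()
n_workers <- if (is.na(cores)) 1 else max(1, cores - 1)

# Create the cluster and register it as the parallel backend
cl <- parallel::makeCluster(n_workers)
registerDoParallel(cl)
workers <- foreach::getDoParWorkers()  # number of workers train can now use

set.seed(42)
# allowParallel = TRUE is the default; it is spelled out here only for clarity
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
fit <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)

# Release the workers and return to sequential execution
parallel::stopCluster(cl)
foreach::registerDoSEQ()
```

On a dataset as small as iris the cost of starting the workers will likely outweigh any speedup; the benefit shows up with larger data or heavier resampling schemes.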