Synopsis

We look at the key skills required of a data scientist, updated for 2019. We do so by gathering 2,000 job postings and using text mining to extract the skills they mention. We also use algorithms to find out how the skills relate to one another. Finally, we look at what identifying in-demand skills implies for the workforce and the economy.

Text Mining

We treat the gathered job postings as a corpus, which we clean and process. We convert the corpus to lower case, then remove punctuation, stop words and extra white space. After cleaning, we create a term-document matrix, which stores the frequency of each term in each document of the corpus. With this matrix, we can visualize our data to better understand the key skills most commonly found in the job postings. The tm package for R is used for text mining.
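The cleaning steps and the term-document matrix can be sketched in a few lines. This is only an illustration in plain Python, not the tm code used for the analysis; the stop-word list, the function names and the two sample postings are all made up for the example.

```python
import string
from collections import Counter

# Assumption: a tiny stop-word list for illustration; a real one is much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "with"}

# Map every punctuation character to a space so "python,sql" still splits.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean(text):
    """Lower-case, remove punctuation, drop stop words, collapse white space."""
    tokens = text.lower().translate(PUNCT_TO_SPACE).split()
    return [t for t in tokens if t not in STOP_WORDS]

def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts terms[i] in docs[j]."""
    counts = [Counter(clean(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

# Hypothetical job-post snippets, not taken from the real data set.
posts = [
    "Experience with Python, SQL and machine learning.",
    "Machine learning in Python; knowledge of statistics.",
]
terms, tdm = term_document_matrix(posts)
for term, row in zip(terms, tdm):
    print(term, row)
```

Each row of the matrix is one term and each column one posting, so summing a row gives the total frequency of that term across the corpus.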

Word Cloud

We generate a word cloud of the top 100 skills, in which the size of the text reflects each skill's rank. In the cloud, machine learning skills dominate.

Skills Ranking

We plot the top 100 skills required of a data scientist, as found in the job postings.
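The ranking behind such a plot is a frequency count. As a rough sketch in plain Python (the function name and the sample tokens are hypothetical, and the input is assumed to be already cleaned and tokenized):

```python
from collections import Counter

def top_terms(token_lists, n=100):
    """Rank terms by total frequency across all documents."""
    totals = Counter()
    for tokens in token_lists:
        totals.update(tokens)
    return totals.most_common(n)

# Hypothetical cleaned postings, one token list per posting.
cleaned = [
    ["python", "sql", "python"],
    ["python", "r", "sql"],
]
ranking = top_terms(cleaned, n=2)
print(ranking)
```

The same counts could feed a bar chart or set the text sizes in a word cloud.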

Correlation

We also create a correlation plot that shows how each individual skill is connected to another. Only phrases that appear at least 150 times are shown.
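One way to compute such connections is the Pearson correlation between the rows of the term-document matrix, after discarding terms below a frequency threshold. The sketch below assumes that approach; the function names and the small matrix are invented for illustration, and the example uses a threshold of 2 instead of 150 so that it runs on toy data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / sqrt(vx * vy)

def correlated_pairs(tdm, min_count=150):
    """tdm maps term -> per-document counts; keep only frequent terms."""
    frequent = {t: row for t, row in tdm.items() if sum(row) >= min_count}
    names = sorted(frequent)
    return {
        (a, b): pearson(frequent[a], frequent[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Hypothetical counts over four postings.
counts = {
    "python": [1, 0, 1, 0],
    "sql": [1, 0, 1, 0],
    "excel": [0, 1, 0, 1],
    "rare": [1, 0, 0, 0],
}
pairs = correlated_pairs(counts, min_count=2)
for pair, r in pairs.items():
    print(pair, round(r, 2))
```

Terms that tend to appear in the same postings get a correlation near 1, while terms that rarely appear together get a negative value.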

Clustering

We can also use algorithms to find the key skills for a data scientist. One such algorithm is K-means clustering, which is a type of unsupervised learning. The idea behind it is to create groups (clusters) of similar data. The algorithm itself determines which data points are similar and assigns them to a group. The number of groups must be chosen in advance, and different techniques can be used to find the optimum number.
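To make the idea concrete, here is a minimal K-means sketch in plain Python. It is not the code used for this analysis: the random initialization, the function names and the sample vectors (standing in for rows of a term-document matrix) are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means (Lloyd's algorithm); returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct points picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical term vectors forming two obvious groups.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(vectors, k=2)
print(labels)
```

The label numbers themselves carry no meaning; only which points share a label matters.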

Hierarchical clustering is another algorithm that can be implemented. It works similarly to K-means, except that the number of groups is not fixed in advance and a linkage method must be selected. In our example, we use Ward’s method, which is the closest to K-means clustering because both aim to minimize the variance within each group. The result is a tree-like structure called a dendrogram. In the dendrogram, the grouping of the most frequent terms allows us to extract key skills based on the groups. Various parameters can be changed to alter the dendrogram.
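The merging process behind a Ward dendrogram can be sketched as follows. This is a naive illustration in plain Python, not the implementation used for the analysis; the function names and sample points are hypothetical. At each step it merges the two clusters whose union increases the within-cluster sum of squares the least, which for clusters A and B is |A||B| / (|A| + |B|) times the squared distance between their centroids.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ward_merges(points):
    """Return the merge history as (members_a, members_b, cost) tuples."""
    # Each cluster id maps to (member indices, centroid).
    clusters = {i: ([i], list(p)) for i, p in enumerate(points)}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                # Ward cost: increase in within-cluster sum of squares.
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        ma, ca = clusters.pop(a)
        mb, cb = clusters.pop(b)
        n = len(ma) + len(mb)
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = [(len(ma) * u + len(mb) * v) / n for u, v in zip(ca, cb)]
        clusters[next_id] = (ma + mb, centroid)
        merges.append((sorted(ma), sorted(mb), cost))
        next_id += 1
    return merges

# Hypothetical term vectors: two tight pairs far from each other.
sample = [(0, 0), (0, 1), (5, 5), (5, 6)]
history = ward_merges(sample)
for left, right, cost in history:
    print(left, right, cost)
```

Reading the history from first merge to last gives the dendrogram from the leaves up, with the cost of each merge as the height of the join.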

Implications

Determining in-demand skills across various occupations has multiple benefits for the workforce and economy. For job seekers, it allows them to update their skills or transition into a new field to find employment. For students, it allows them to focus on courses and programs that will get them gainful employment in the future.

Identifying in-demand skills also spurs research and development. For example, knowing that data science skills are in demand encourages growth in artificial intelligence research, because a growing workforce supports both industry and academic research. This in turn benefits the economy with new businesses, employment and products.

Knowing the in-demand skills allows employers and educators to work together. Employers can tell educators which skills they need, and educators can introduce modules to teach those skills. The end result would be employment for the students. However, for this to work, employers must do more to communicate the skills they need. Educators in turn must be willing to teach those skills, acting not only as traditional educators but also as agents of employment. Close collaboration is needed in the future if we intend to strengthen our economy and workforce.

Saif
