Synopsis
We are developing an application that predicts the next word from the words preceding it, using text mining and natural language processing (NLP). It is similar to the predictive-typing software available on mobile platforms such as SwiftKey. The end product will be a web application that takes an incomplete phrase from the user and predicts the next word. Building the application requires an appropriate data collection; here we use the English-language sets from HC Corpora. This milestone report presents our initial exploratory analysis of the data and outlines our future goals in a concise and understandable manner.
Raw Data Summary
The HC Corpora English dataset consists of three line-separated text files: Blogs, News, and Twitter. Each file contains text collected from its respective source across the Internet. Let’s have a look at the raw data statistics:
Table 1: Raw dataset summary
Dataset | Size (bytes) | Line Count | Word Count | Average Words/Line
---|---|---|---|---
Blogs | 210160014 | 899288 | 38154238 | 42.4
News | 205811889 | 1010242 | 35010782 | 34.7
Twitter | 167105338 | 2360148 | 30218125 | 12.8
The figure below shows how the number of words per line varies across the three datasets.
Figure 1: Distribution of words per line of each individual dataset
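For reference, the statistics in Table 1 can be reproduced with a short R script along the following lines; the file names are assumptions and may differ from the local copy of the corpus.

```r
# Sketch of how the raw statistics in Table 1 could be computed.
# The file names below are assumptions; adjust them to the local copy of the corpus.
files <- c(Blogs   = "en_US.blogs.txt",
           News    = "en_US.news.txt",
           Twitter = "en_US.twitter.txt")

raw_stats <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(sapply(strsplit(lines, "\\s+"), length))
  data.frame(SizeBytes       = file.size(f),
             LineCount       = length(lines),
             WordCount       = words,
             AvgWordsPerLine = round(words / length(lines), 1))
})
do.call(rbind, raw_stats)
```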
Exploratory Data Analysis
Sampling
Because the datasets are very large and our hardware resources are limited, we take a random 10% sample of each dataset (Blogs, News, Twitter). The samples are then combined into a single corpus.
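A minimal sketch of this sampling step; the random seed, variable names, and file names are illustrative assumptions.

```r
# Keep roughly 10% of the lines of each file and combine them into one sample corpus.
set.seed(1234)        # illustrative seed for a reproducible sample
sample_rate <- 0.10

sample_lines <- function(path, rate) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]  # keep each line with probability `rate`
}

corpus_sample <- c(sample_lines("en_US.blogs.txt",   sample_rate),
                   sample_lines("en_US.news.txt",    sample_rate),
                   sample_lines("en_US.twitter.txt", sample_rate))
writeLines(corpus_sample, "sample_corpus.txt")
```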
Cleaning
Profanity was removed from the corpus using the word list from pattern-for-python. We also removed punctuation, numbers, extra whitespace, and foreign characters, and converted all text to lowercase. These steps produced the clean, tokenized corpus needed for the next stage, building n-grams.
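A sketch of the cleaning and tokenization steps, assuming the profanity list has been exported to a plain-text file (`profanity.txt`, one word per line) and that `corpus_sample` holds the sampled lines from the previous step.

```r
# Lowercase, strip non-alphabetic characters, collapse whitespace,
# tokenize each line, and drop profanity.
clean_lines <- function(x, profanity) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)   # removes numbers, punctuation, and foreign characters
  x <- gsub("\\s+", " ", x)       # collapses repeated whitespace
  toks <- strsplit(trimws(x), " ")
  lapply(toks, function(w) w[w != "" & !(w %in% profanity)])
}

profanity <- readLines("profanity.txt", encoding = "UTF-8")
tokenized <- clean_lines(corpus_sample, profanity)   # list: one token vector per line
```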
N-Grams
As explained on Wikipedia, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. For our application, we use unigrams, bigrams, and trigrams (1-, 2-, and 3-grams). The corpus is split into three n-gram data structures in which the n-grams are sorted by frequency. These n-grams are central to our modeling: the phrase the user enters in the final application will be segmented and compared against the n-gram data structures to predict the next word. The n-gram frequency tables also let us examine the distribution of words and word pairs. The following are the most frequent n-grams in our sample corpus.
Figure 2: Top 15 n-grams by their frequency
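A minimal sketch of how such sorted n-gram frequency tables can be built from the tokenized lines produced above; n-grams are kept within line boundaries.

```r
# Build an n-gram frequency table, sorted from most to least frequent.
ngram_freq <- function(tokenized, n) {
  grams <- unlist(lapply(tokenized, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

unigrams <- ngram_freq(tokenized, 1)
bigrams  <- ngram_freq(tokenized, 2)
trigrams <- ngram_freq(tokenized, 3)
head(unigrams, 15)   # the top 15 unigrams, as plotted in Figure 2
```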
While the total count of unigram (single-word) instances in the sample corpus is 20392546, most of these words are not unique. In fact, we can tabulate how many unique words are needed to cover a given percentage of all word instances in the sample corpus. The table below shows this information and how sharply the ratio of unique words to covered instances grows between coverage levels. We can use this information to make our n-gram data structures smaller and more efficient for the final application while still maintaining reasonable accuracy.
Table 2: Unique words needed to cover a given percentage of word instances in the sample corpus
Coverage of Word Instances (%) | Unique Word Count | Word Instances Covered | Ratio (Unique Words / Instances Covered)
---|---|---|---
50 | 136 | 10196273 | 0.0000133
60 | 372 | 12235528 | 0.0000304
70 | 967 | 14274782 | 0.0000677
80 | 2495 | 16314037 | 0.0001529
90 | 7945 | 18353291 | 0.0004329
100 | 215074 | 20392546 | 0.0105467
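A sketch of how the coverage figures in Table 2 can be derived from the sorted unigram frequency table (`unigrams` from the previous step); the column names mirror the table above.

```r
# For a given coverage percentage, count how many of the most frequent
# unique words are needed to reach that share of all word instances.
coverage <- function(pct, freqs) {
  target <- pct / 100 * sum(freqs)
  cum    <- cumsum(as.numeric(freqs))   # freqs is already sorted, most frequent first
  needed <- which(cum >= target)[1]
  data.frame(CoveragePct      = pct,
             UniqueWordCount  = needed,
             InstancesCovered = ceiling(target),
             Ratio            = round(needed / target, 7))
}

do.call(rbind, lapply(c(50, 60, 70, 80, 90, 100), coverage, freqs = unigrams))
```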
Further Goals
With the n-gram data structures complete, we still need to build our prediction model using an appropriate algorithm. The final Shiny web application, which will take an incomplete phrase from the user and predict the next word, must then be implemented, along with a presentation slide deck.
Along the way, optimization must be explored, since the Shiny server has limited computing resources. The n-gram data structures will need to be reduced in size, and the prediction model must be fast. Stemming the raw data and using different sample sizes will also be considered, weighing coverage, speed, and accuracy.