We are developing an application that predicts the next word from the words that precede it, using text mining and natural language processing (NLP). This is similar to the predictive keyboards available on mobile platforms, such as SwiftKey. The end product will be a web application that takes an incomplete phrase from the user and predicts the next word. To build the application, we need an appropriate data collection; here we use the English-language sets from HC Corpora. This milestone report gives a concise account of our initial exploratory analysis of the data and our goals for the remaining work.

Raw Data Summary

The HC Corpora English dataset includes three line-separated text files: Blogs, News and Twitter. Each file contains text collected from its respective type of source across the Internet. The raw data statistics are summarized below:

Table 1: Raw dataset summary

Dataset   Size (bytes)   Line Count   Word Count   Average Words/Line
Blogs     210160014      899288       38154238     42.4
News      205811889      1010242      35010782     34.7
Twitter   167105338      2360148      30218125     12.8
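As an illustration of how figures like those in Table 1 can be derived, here is a minimal Python sketch. It is not the code used for this report: the function names are hypothetical, and it assumes that a "word" is any whitespace-delimited token, which may differ from the counting method behind the table.

```python
from pathlib import Path


def summarize_lines(lines):
    """Return (line_count, word_count, average_words_per_line)."""
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        # Assumption: a word is any whitespace-delimited token.
        word_count += len(line.split())
    average = word_count / line_count if line_count else 0.0
    return line_count, word_count, round(average, 1)


def summarize_file(path):
    """Summarize one line-separated text file, including its size in bytes."""
    path = Path(path)
    # Stream the file line by line so large files do not have to fit in memory.
    with path.open(encoding="utf-8", errors="replace") as handle:
        line_count, word_count, average = summarize_lines(handle)
    return {
        "size_bytes": path.stat().st_size,
        "lines": line_count,
        "words": word_count,
        "avg_words_per_line": average,
    }
```

Calling `summarize_file` once per dataset file would give one row of the table each.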

The figure below shows how the number of words per line varies within each dataset.

Figure 1: Distribution of words per line of each individual dataset

Exploratory Data Analysis


Because the datasets are very large and our hardware resources are limited, we take a random 10% sample of each dataset (Blogs, News, Twitter). The sampled datasets are then combined into a single corpus.
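The sampling step can be sketched as follows in Python. This is an assumption-laden illustration rather than the report's own code: the function names and the seed are hypothetical, and each line is kept independently with probability 0.10, so the sample size is only approximately 10% of each dataset.

```python
import random


def sample_lines(lines, fraction=0.10, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


def build_sample_corpus(datasets, fraction=0.10, seed=42):
    """Sample every dataset and concatenate the samples into one corpus.

    `datasets` maps a dataset name (e.g. "blogs") to an iterable of lines.
    """
    corpus = []
    for offset, name in enumerate(sorted(datasets)):
        # Use a different seed per dataset so the samples are not correlated.
        corpus.extend(sample_lines(datasets[name], fraction, seed + offset))
    return corpus
```

Per-line sampling works on a stream, so a full file never has to be held in memory; drawing an exact 10% with `random.sample` would be an alternative if the exact size matters.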


The corpus contained profanity, which we removed using the pattern-for-python list. We also removed punctuation, numbers, extra whitespace and foreign characters, and converted everything to lowercase. These steps gave us the clean, tokenized corpus needed for our next step, n-grams.
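A minimal Python sketch of these cleaning steps is shown below. It is illustrative only: `clean_line` and `banned_words` are hypothetical names, the banned-word set stands in for the profanity list, and "foreign characters" is interpreted here as anything outside ASCII, which is an assumption.

```python
import re


def clean_line(line, banned_words=frozenset()):
    """Return a list of lowercase, letters-only tokens for one line of text."""
    text = line.lower()
    # Assumption: "foreign characters" means anything outside ASCII.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop apostrophes so contractions stay one token ("it's" becomes "its").
    text = text.replace("'", "")
    # Replace digits, punctuation and any other non-letter with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    tokens = text.split()
    # Remove words on the banned list (a stand-in for the profanity list).
    return [token for token in tokens if token not in banned_words]
```

For example, `clean_line("It's 3 PM, DARN it!!", {"darn"})` returns `["its", "pm", "it"]`.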


An n-gram is a contiguous sequence of n items from a given sequence of text or speech, as explained on Wikipedia. For our application, we use unigrams, bigrams and trigrams (1-, 2- and 3-grams). Our corpus is split into three n-gram data structures, each sorted by n-gram frequency. The n-grams are important for our modeling because the phrase the user enters in our final application will be segmented and compared against these data structures to help predict the next word. N-gram frequency tables also let us see the distribution of words, word pairs and word triples. The following are the most frequent n-grams in our sample corpus.

Figure 2: Top 15 n-grams by their frequency
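To make the n-gram frequency tables concrete, here is a small Python sketch of how they can be built from a tokenized corpus. The function names are hypothetical, and the sketch assumes that n-grams do not span line boundaries, which the report does not specify.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every contiguous n-token tuple from a list of tokens."""
    # Zipping n shifted views of the list pairs each token with its successors.
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(token_lines, n):
    """Count n-grams over a corpus given as a list of token lists."""
    counts = Counter()
    for tokens in token_lines:
        # Assumption: n-grams are counted within a line, never across lines.
        counts.update(ngrams(tokens, n))
    return counts
```

A ranking like the one in Figure 2 would then come from `count_ngrams(corpus, n).most_common(15)` for n = 1, 2 and 3.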

Although the sample corpus contains 20392546 unigram (single-word) instances, most of them are repeats of a comparatively small vocabulary. The table below shows how many unique words are needed to cover a given percentage of all word instances in the sample corpus, and how sharply that number grows as coverage approaches 100%: 136 unique words account for 50% of all word instances, while 7945 are needed for 90%. We can use this information to make our n-gram data structures smaller and more efficient for our final application while still maintaining reasonable accuracy.

Table 2: Unique words needed to cover a given percentage of word instances in the sample corpus

Coverage (%)   Unique Words   Word Instances Covered   Ratio (Unique Words / Instances Covered)
50             136            10196273                 0.0000133
60             372            12235528                 0.0000304
70             967            14274782                 0.0000677
80             2495           16314037                 0.0001529
90             7945           18353291                 0.0004329
100            215074         20392546                 0.0105467
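The coverage figures in Table 2 can be computed from a unigram frequency table by walking the words from most to least frequent until the running total reaches the target. The Python sketch below shows one way to do this; the function name is hypothetical and this is not the report's own code.

```python
from collections import Counter


def words_needed_for_coverage(counts, percentage):
    """Return (unique_words, instances_covered) needed to reach `percentage`.

    `counts` is a Counter mapping each word to its number of occurrences.
    """
    total = sum(counts.values())
    covered = 0
    for rank, (_, frequency) in enumerate(counts.most_common(), start=1):
        covered += frequency
        # Integer comparison avoids floating-point rounding at the boundary.
        if covered * 100 >= total * percentage:
            return rank, covered
    return 0, 0  # empty frequency table
```

The ratio column is then the first returned value divided by the number of instances covered; for the 50% row, 136 / 10196273 is approximately 0.0000133. Note that this function returns the instances actually covered by the selected words, which can slightly exceed the exact percentage target.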

Further Goals

With the n-gram data structures complete, we still need to build our prediction model using an appropriate algorithm. We must then implement the final Shiny web application, which will take an incomplete phrase from the user and predict the next word. A presentation slide deck will also be prepared.
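The prediction algorithm has not been chosen yet. Purely as an assumption for illustration, one simple candidate is a backoff lookup: try the last two words of the phrase against the trigram counts, fall back to the last word against the bigram counts, and finally fall back to the most frequent single word. The Python sketch below uses hypothetical names and is not a committed design.

```python
from collections import Counter, defaultdict


def build_model(token_lines, max_n=3):
    """Map each context (0 to max_n - 1 preceding words) to next-word counts."""
    model = defaultdict(Counter)
    for tokens in token_lines:
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k < 0:
                    break  # not enough preceding words for this context size
                context = tuple(tokens[i - k:i])  # the k words before `word`
                model[context][word] += 1
    return model


def predict_next(model, phrase, max_n=3):
    """Predict the next word, backing off from longer to shorter contexts."""
    tokens = phrase.lower().split()
    for k in range(max_n - 1, -1, -1):
        if k > len(tokens):
            continue  # the phrase is shorter than this context size
        context = tuple(tokens[len(tokens) - k:]) if k else ()
        candidates = model.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # empty model
```

With a model built from the sample corpus, `predict_next(model, "i love")` would return the word that most often follows "i love" there, or a shorter-context guess if that pair was never seen.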

Along the way, we will need to explore optimizations, since the Shiny server has limited computing resources. The n-gram data structures will need to be reduced in size, and the prediction model should be fast. We will also consider stemming the raw data and trying different sample sizes, weighing coverage, speed and accuracy.

