Synopsis
We are developing an application that predicts the next word from the words preceding it, using text mining and natural language processing (NLP). It is similar to the predictive-typing software available on mobile platforms such as SwiftKey. The end product will be a web application that takes an incomplete phrase from the user and predicts the next word. Building the application requires an appropriate data collection; here we use the English-language sets from HC Corpora. This milestone report presents our initial exploratory analysis of the data and outlines our future goals in a concise and understandable manner.
Raw Data Summary
The HC Corpora English dataset consists of three line-separated text files: Blogs, News, and Twitter. Each file contains text collected from its respective source across the Internet. Let’s have a look at the raw data statistics:
Table 1: Raw dataset summary
Dataset | Size (bytes) | Line Count | Word Count | Average Words/Line
---|---|---|---|---
Blogs | 210160014 | 899288 | 38154238 | 42.4
News | 205811889 | 1010242 | 35010782 | 34.7
Twitter | 167105338 | 2360148 | 30218125 | 12.8
The figure below shows how the number of words per line varies across the three datasets.
Figure 1: Distribution of words per line of each individual dataset
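For reference, the statistics in Table 1 can be reproduced with a short R script along the following lines; the file names are assumptions and may differ from the local copy of the corpus.

```r
# Sketch of how the raw statistics in Table 1 could be computed.
# The file names below are assumptions; adjust them to the local copy of the corpus.
files <- c(Blogs   = "en_US.blogs.txt",
           News    = "en_US.news.txt",
           Twitter = "en_US.twitter.txt")

raw_stats <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(sapply(strsplit(lines, "\\s+"), length))
  data.frame(SizeBytes       = file.size(f),
             LineCount       = length(lines),
             WordCount       = words,
             AvgWordsPerLine = round(words / length(lines), 1))
})
do.call(rbind, raw_stats)
```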
Exploratory Data Analysis
Sampling
Because the datasets are very large and our hardware resources are limited, we take a random 10% sample of each dataset (Blogs, News, Twitter). The samples are then combined into a single corpus.
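A minimal sketch of this sampling step; the random seed, variable names, and file names are illustrative assumptions.

```r
# Keep roughly 10% of the lines of each file and combine them into one sample corpus.
set.seed(1234)        # illustrative seed for a reproducible sample
sample_rate <- 0.10

sample_lines <- function(path, rate) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]  # keep each line with probability `rate`
}

corpus_sample <- c(sample_lines("en_US.blogs.txt",   sample_rate),
                   sample_lines("en_US.news.txt",    sample_rate),
                   sample_lines("en_US.twitter.txt", sample_rate))
writeLines(corpus_sample, "sample_corpus.txt")
```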
Cleaning
Profanity was removed from the corpus using the word list from pattern-for-python. We also removed punctuation, numbers, extra whitespace, and foreign characters, and converted all text to lowercase. These steps produced the clean, tokenized corpus needed for the next stage, building n-grams.
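A sketch of the cleaning and tokenization steps, assuming the profanity list has been exported to a plain-text file (`profanity.txt`, one word per line) and that `corpus_sample` holds the sampled lines from the previous step.

```r
# Lowercase, strip non-alphabetic characters, collapse whitespace,
# tokenize each line, and drop profanity.
clean_lines <- function(x, profanity) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)   # removes numbers, punctuation, and foreign characters
  x <- gsub("\\s+", " ", x)       # collapses repeated whitespace
  toks <- strsplit(trimws(x), " ")
  lapply(toks, function(w) w[w != "" & !(w %in% profanity)])
}

profanity <- readLines("profanity.txt", encoding = "UTF-8")
tokenized <- clean_lines(corpus_sample, profanity)   # list: one token vector per line
```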
N-Grams
As explained on Wikipedia, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. For our application, we use unigrams, bigrams, and trigrams (1-, 2-, and 3-grams). The corpus is split into three n-gram data structures in which the n-grams are sorted by frequency. These n-grams are central to our modeling: the phrase the user enters in the final application will be segmented and compared against the n-gram data structures to predict the next word. The n-gram frequency tables also let us examine the distribution of words and word pairs. The following are the most frequent n-grams in our sample corpus.
Figure 2: Top 15 n-grams by their frequency
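A minimal sketch of how such sorted n-gram frequency tables can be built from the tokenized lines produced above; n-grams are kept within line boundaries.

```r
# Build an n-gram frequency table, sorted from most to least frequent.
ngram_freq <- function(tokenized, n) {
  grams <- unlist(lapply(tokenized, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

unigrams <- ngram_freq(tokenized, 1)
bigrams  <- ngram_freq(tokenized, 2)
trigrams <- ngram_freq(tokenized, 3)
head(unigrams, 15)   # the top 15 unigrams, as plotted in Figure 2
```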
While the total count of unigram (single-word) instances in the sample corpus is 20392546, most of these words are not unique. In fact, we can tabulate how many unique words are needed to cover a given percentage of all word instances in the sample corpus. The table below shows this information and how sharply the ratio of unique words to covered instances grows between coverage levels. We can use this information to make our n-gram data structures smaller and more efficient for the final application while still maintaining reasonable accuracy.
Table 2: Unique words needed to cover a given percentage of word instances in the sample corpus
Coverage of Word Instances (%) | Unique Word Count | Word Instances Covered | Ratio (Unique Words / Instances Covered)
---|---|---|---
50 | 136 | 10196273 | 0.0000133
60 | 372 | 12235528 | 0.0000304
70 | 967 | 14274782 | 0.0000677
80 | 2495 | 16314037 | 0.0001529
90 | 7945 | 18353291 | 0.0004329
100 | 215074 | 20392546 | 0.0105467
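A sketch of how the coverage figures in Table 2 can be derived from the sorted unigram frequency table (`unigrams` from the previous step); the column names mirror the table above.

```r
# For a given coverage percentage, count how many of the most frequent
# unique words are needed to reach that share of all word instances.
coverage <- function(pct, freqs) {
  target <- pct / 100 * sum(freqs)
  cum    <- cumsum(as.numeric(freqs))   # freqs is already sorted, most frequent first
  needed <- which(cum >= target)[1]
  data.frame(CoveragePct      = pct,
             UniqueWordCount  = needed,
             InstancesCovered = ceiling(target),
             Ratio            = round(needed / target, 7))
}

do.call(rbind, lapply(c(50, 60, 70, 80, 90, 100), coverage, freqs = unigrams))
```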
Further Goals
With the n-gram data structures complete, we still need to build our prediction model using an appropriate algorithm. The final Shiny web application, which will take an incomplete phrase from the user and predict the next word, must then be implemented, along with a presentation slide deck.
Along the way, optimization must be explored, since the Shiny server has limited computing resources. The n-gram data structures will need to be reduced in size, and the prediction model must be fast. Stemming the raw data and using different sample sizes will also be considered, weighing coverage, speed, and accuracy.