SUMMARY

In this report I briefly illustrate the exploratory analysis performed on three datasets, comprising text from blogs, news and tweets.

The ultimate goal is to produce a light application able to predict text (words) given some preceding text, mimicking the predictive typing feature of the software keyboards of modern portable devices.

As a playground, a fairly substantial dataset was made available, comprising text from heterogeneous sources (blogs, news, twitter). These datasets are the foundation for developing an understanding of language processing and in turn devising a strategy for achieving the goal; perhaps more importantly, in practice they constitute our training and testing datasets.

I decided to invest a significant amount of time in exploring the data, and delved (too) deeply into data cleaning, assuming that this effort would pay off by making any algorithm more robust.

At this stage in the project I will mostly review my exploratory analysis of the data, and outline my current thoughts about the strategy for developing the algorithm behind the text-predicting application.

Performance issues: it is worth mentioning that one of the main challenges has been dealing smartly with the computational load, which turned out to be a serious limiting factor even on a powerful workstation.
I did not use the suggested tm suite for the heavy lifting, relying instead on perl and, within R, mainly on dplyr, NLP and RWeka.

Current Thoughts About Predictive Algorithm Strategy

My current thoughts about the strategy, very much in flux, are that an n-gram-based approach would be the most effective.
In particular, I am leaning towards a weighted combination of 2-, 3-, 4- and 5-grams (linear interpolation), perhaps assisted by additional information drawn from an analysis of the association of words within sentences or of their distance within them.
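
To fix ideas, the linear interpolation I have in mind would score a candidate next word \(w\) as a weighted sum of the maximum-likelihood estimates from the different n-gram orders:

\[
\hat{P}(w \mid w_{i-4} w_{i-3} w_{i-2} w_{i-1}) = \lambda_5\, P_5(w \mid w_{i-4} \ldots w_{i-1}) + \lambda_4\, P_4(w \mid w_{i-3} \ldots w_{i-1}) + \lambda_3\, P_3(w \mid w_{i-2} w_{i-1}) + \lambda_2\, P_2(w \mid w_{i-1})
\]

with \(\lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 = 1\), where each \(P_k\) is the relative frequency estimated from the observed k-gram counts. The weights \(\lambda_k\) would still have to be tuned, e.g. on held-out data.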

An important issue that I have not yet had a chance to ponder sufficiently is the handling of “zeros”, i.e. words not included in the dictionary of the training set or, more importantly for an n-gram approach, words never seen following a given (n-1)-gram. In practice, based on my readings, this problem is tackled with some form of smoothing, that is, assigning a probability to the “zeros” (and in turn re-allocating some probability mass away from the observed n-grams).
I have not yet had a chance to explore the feasibility and effectiveness of methods like Good-Turing or Stupid Backoff.
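
To make the second of these concrete: Stupid Backoff simply falls back to progressively shorter histories, discounting the score by a fixed factor (commonly \(\alpha = 0.4\)) at each back-off step. Below is a minimal sketch of how it could be wired on top of n-gram count tables; the tables object and its prefix/word/count columns are hypothetical placeholders, not objects actually built in this analysis.

# Minimal Stupid Backoff scorer -- a sketch, NOT the code used in this analysis.
# Assumed (hypothetical) input: 'tables' is a list where tables[[k]] holds the
# k-gram counts as a data.frame with columns 'prefix' (the k-1 preceding words,
# space-separated), 'word' (the k-th word) and 'count'; tables[[1]] has only
# 'word' and 'count'. alpha = 0.4 is the penalty commonly quoted for Stupid Backoff.
sb_score <- function(prefix_words, candidate, tables, alpha = 0.4) {
    top_k <- length(prefix_words) + 1
    k <- top_k
    while (k >= 2) {
        tab    <- tables[[k]]
        prefix <- paste(tail(prefix_words, k - 1), collapse = " ")
        hits   <- tab[tab$prefix == prefix, ]
        cnt    <- hits$count[hits$word == candidate]
        if (length(cnt) == 1) {
            # relative frequency at this order, discounted once per back-off step taken
            return(alpha^(top_k - k) * cnt / sum(hits$count))
        }
        k <- k - 1   # back off to a shorter history
    }
    # last resort: unigram relative frequency (zero if the word was never seen)
    uni <- tables[[1]]
    alpha^(top_k - 1) * sum(uni$count[uni$word == candidate]) / sum(uni$count)
}

# e.g. (with hypothetical count tables ng1..ng4):
# sb_score(c("thanks", "for", "the"), "follow", tables = list(ng1, ng2, ng3, ng4))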

PRELIMINARIES

Back to the Top

Libraries needed for data processing and plotting:

#-----------------------------
# NLP
library("tm")
library("SnowballC")
library("openNLP")
library("NLP")

# To help java fail less :-(
options( java.parameters = "-Xmx6g")
library("RWeka")   # [NGramTokenizer], [Weka_control]

#-----------------------------
# general
library("dplyr")
library("magrittr")
library("devtools")

library("ggplot2")
library("gridExtra")
# library("RColorBrewer")

library("pander")

#-----------------------------
# my functions
source("./scripts/my_functions.R")
#-----------------------------

THE DATA

Back to TOC

The datasets are read in separately into character vectors, using a compact user-defined function, readByLine() (see the Appendix for the short source).

# NOT EVALUATED because too computationally heavy (loading saved products)
in.blogs.ORIG <- readByLine("./data/en_US.blogs.ORIGINAL.txt.gz", check_nl = FALSE, skipNul = TRUE)
in.news.ORIG <- readByLine("./data/en_US.news.ORIGINAL.txt.gz", check_nl = FALSE, skipNul = TRUE)
in.twitter.ORIG <- readByLine("./data/en_US.twitter.ORIGINAL.txt.gz", check_nl = FALSE, skipNul = TRUE)

Basic statistics of the three datasets in their original form:

# NOT EVALUATED
stats.blogs   <- as.numeric(system("gzip -dc ./data/en_US.blogs.ORIGINAL.txt.gz | wc | awk '{print $1; print $2; print $3}'", intern = TRUE))
stats.news    <- as.numeric(system("gzip -dc ./data/en_US.news.ORIGINAL.txt.gz | wc | awk '{print $1; print $2; print $3}'", intern = TRUE))
stats.twitter <- as.numeric(system("gzip -dc ./data/en_US.twitter.ORIGINAL.txt.gz | wc | awk '{print $1; print $2; print $3}'", intern = TRUE))

stats.ORIG.df <- data.frame( blogs = stats.blogs, news = stats.news, twitter = stats.twitter, 
                             row.names = c("lines", "words", "characters"), stringsAsFactors = FALSE)
saveRDS(stats.ORIG.df, "data/stats_ORIGINAL.RDS")
             blogs       news        twitter
lines        899288      1010242     2360148
words        37334114    34365936    30359804
characters   210160014   205811889   167105338

PREPROCESSING (before loading into R)

Back to TOC

After a quick review of the data with various R functions and packages, I decided to perform some cleaning of the text with standard Linux command line tools, mostly perl scripts. Broadly speaking I performed three categories of transformations:

  • Removal of “weird” characters.
  • Homogenization of characters.
  • Regularization of text, e.g. dealing with hashtags, profanities, emoticons, acronyms, numbers and “messy punctuation”,

which I will describe below.

Weird Characters

The first task was to analyze the mix of individual characters present in the three datasets, with the goal of doing some homogenization and tidying up of non-alphanumeric characters, such as quotes, which can come in different forms.

The method used is not elegant but effective enough: a simple perl command strips out the ordinary characters, leaving a stream of odd characters that is then parsed and cleaned to produce a list of odd characters sorted by their counts.

perl -pe 's|[\d\w\$\,\.\!\?\(\);:\/\\\-=&%#_\~<>]||g; s|\s||g; s|[\^@"\+\*\[\]]||g;' | \
          perl -pe "s/\'//g;" | \
          egrep -v '^$' | \
          split_to_singles.awk | \
          sort -k 1 | uniq -c | sort -k 1nr

# split_to_singles.awk is a short awk script not worth including here (it's on GitHub)

The number of unique odd characters found is 2159 for blogs, 310 for news and 2087 for twitter.

The following is the census of odd characters appearing more than 500 times in each of the datasets (the full sorted lists are available on the GitHub repo in the data directory).

   blogs           news              twitter
-----------      ----------         ------------------------
 387317 [’]      102911 [’]         27440 [“]        726 [»]
 109154 [”]       48115 [—]         26895 [”]        718 [«]
 108769 [“]       47090 [“]         11419 [’]        715 [😔]
  50176 [–]       43992 [”]          5746 [♥]        686 [😉]
  41129 […]        8650 [–]          5241 […]        680 [😳]
  23836 [‘]        6991 [ø]          3838 [|]        639 [{]
  18757 [—]        6723 [“]          2353 [❤]        617 [•]
   3963 [é]        6544 [”]          2314 [–]        593 [‘]
   2668 [£]        6267 [’]          1799 [—]        578 [�]
   1301 [′]        4898 [‘]          1333 [😊]        561 [💜]
    914 [´]        3641 [–]          1211 [👍]        560 [😃]
    755 [″]        3319 [é]          1149 [😂]        544 [😏]
    643 [€]        3062 […]           977 [é]        506 [☀]
    624 [ā]        2056 [—]           963 [😁]        503 [😜]
    605 [½]        1408 [•]           955 [☺]
    598 [á]        1152 [�]           926 [😒]
    582 [ö]         971 [•]           802 [`]
    555 [è]         837 [½]           758 [😍]
    518 [°]         711 [`]           751 [😘]
                    537 [ñ]           741 [}]

Homogenization of Characters

Back to TOC

For this preliminary stage I decided not to worry about accented letters and characters from non-Latin alphabets (e.g. Asian scripts, emoji), but I thought it would be helpful to standardize a small set of very frequent characters whose “meaning” is substantially equivalent:

                 blogs     news      twitter   TOTAL
quotes     [‘]   23836     4898      593       29327
           [’]   387317    102911    11419     501647
           [“]   108769    47090     27440     183299
           [”]   109154    43992     26895     180041
           [«]   0         0         718       718
           [»]   0         0         726       726
dashes     [–]   50176     8650      2314      61140
           [—]   48115     18757     1799      68671
ellipsis   […]   41129     5241      3062      49432

Contractions, Profanities, Emoticons, Hashtags, etc…

Back to TOC

I have put a major effort into understanding the idiosyncrasies of the textual data, with the expectation that a deep cleaning would make a difference in the prediction context.

One example of what I have in mind is that transforming frequent “items” that come in many variations but carry a broadly similar meaning (e.g. dates, money amounts, possessive pronouns) into generic categorical “tags” could strengthen the predictive ability of any algorithm.

Most of the work was done with perl “offline” (you can’t beat it for regex work).
So that the application input can be matched to the data on which the application is built, all operations were also ported to R, either directly or by calling an external perl script. The main transformations applied to the text are:

  • Contractions (e.g. don’t, isn’t, I’ll): these seem to be most commonly regarded as stopwords and hence removed. My take has been that they can carry meaning and are worth preserving, as are their non-contracted counterparts. I homogenized all of them into forms like “I_will” and “do_not”, with an underscore gluing the parts together.
  • Profanity filtering: I based my cleaning on the “7 dirty words”, plus some words derived from them.
    • To preserve their potential predictive value, I replace them with a tag, <PROFANITY>.
    • User input is also filtered, but the information carried by a possible profanity can still be used.
  • Emoticons: recognized with regexes and marked with a tag, <EMOTICON>.

Other transformations done on the text before loading the data into R:

  • Regularization/homogenization of characters:
    • Mostly cleaning (not necessarily removing) odd characters, e.g. apostrophes, quotes, etc.
    • Sequences of characters: inline and end-of-line ellipses, and other “non-sense”.
    • Substitution of “|”, which seems to be equivalent to an end of sentence (i.e. a period).
    • Substitution of <==/<-- and ==>/--> with ;.
    • Cleaning sequences of ! and ?.
  • Hashtags: Recognized and replaced with a generic tag HASHTAG
  • Acronyms: limited to variations of U.S., also replaced with a tag, <USA>.
  • Number-related:
    • (likely) dollar amounts by the presence of $: marked with <MONEY> tag.
    • dates (e.g. 12/34/5678): marked with <DATE> tag.
    • hours (e.g. 1:30 p.m.): marked with <HOUR> tag.
    • percentages: marked with <PERCENTAGE> tag.
  • Repeated consecutive characters: handled by type.
    • $ signs, assumed to stand for a money amount: replaced with the <MONEY> tag.
    • *, which within words usually disguises a profanity: replaced with the <PROFANITY> tag.
    • -: context/surroundings dependent replacement with regular punctuation.
    • Some character sequences were entirely deleted: multiple <, >, =, #.

The datasets were cleaned with perl scripts, available on GitHub (look for the regularize_text-pass[1-6].pl and remove_tags_content.pl scripts).
A summary of what they do is given in the Appendix.
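
To give a flavor of the number-related tagging described above, here is a minimal R sketch of roughly equivalent substitutions. The real work was done by the perl scripts; the function name and the patterns below are simplified illustrations, not the actual regexes used.

# Crude approximations of the perl tagging rules, for illustration only.
tag_numbers <- function(x) {
    x <- gsub("\\$ ?[0-9][0-9,]*(\\.[0-9]+)?", "<MONEY>", x, perl = TRUE)            # dollar amounts
    x <- gsub("\\b[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}\\b", "<DATE>", x, perl = TRUE)    # dates, e.g. 12/34/5678
    x <- gsub("\\b[0-9]{1,2}:[0-9]{2}( ?[ap]\\.?m\\.?)?", "<HOUR>", x, perl = TRUE)  # hours, e.g. 1:30 p.m.
    x <- gsub("\\b[0-9]+(\\.[0-9]+)? ?%", "<PERCENTAGE>", x, perl = TRUE)            # percentages
    x
}

tag_numbers("paid $1,200 on 12/05/2014 at 1:30 p.m., up 12%")
# expected: "paid <MONEY> on <DATE> at <HOUR>, up <PERCENTAGE>"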

Excluding rows with fewer than 6 words

Back to TOC

During my initial attempts, the problem of excessively short rows of text emerged immediately. In particular, because I decided to perform tokenization on individual sentences, not directly on individual rows, the tokenizer tripped and failed on empty “sentences” resulting from short rows.

I therefore decided to set a cutoff on the minimum acceptable row length. After some empirical testing and row-length analysis with command-line tools, I set the threshold at \(\ge 6\) words.
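
The filter itself reduces to a one-liner along these lines (the file names here are placeholders, not the actual files):

gzip -dc cleaned_input.txt.gz | awk 'NF >= 6' | gzip > cleaned_filtered.txt.gz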


MOVING TO R

Loading the data

Back to TOC

The rest of the analysis presented here is based on the cleaned datasets resulting from the processing described in the previous sections.

# NOT EVALUATED because too computationally heavy (loading saved products)
in.blogs.REG   <- readByLine("./data/blogs_REG.txt.gz", check_nl = FALSE, skipNul = TRUE)
in.news.REG    <- readByLine("./data/news_REG.txt.gz", check_nl = FALSE, skipNul = TRUE)
in.twitter.REG <- readByLine("./data/twitter_REG.txt.gz", check_nl = FALSE, skipNul = TRUE)

Basic statistics of the three datasets after preprocessing:

# NOT EVALUATED
stats.blogs   <- as.numeric(system("gzip -dc ./data/blogs_REG.txt.gz | wc | awk '{print $1; print $2; print $3}'", intern = TRUE))
stats.news    <- as.numeric(system("gzip -dc ./data/news_REG.txt.gz | wc | awk '{print $1; print $2; print $3}'", intern = TRUE))
stats.twitter <- as.numeric(system("gzip -dc ./data/twitter_REG.txt.gz | wc | awk '{print $1; print $2; print $3}'", intern = TRUE))

stats.REG.df <- data.frame( blogs = stats.blogs, news = stats.news, twitter = stats.twitter, 
                             row.names = c("lines", "words", "characters"), stringsAsFactors = FALSE)
saveRDS(stats.REG.df, "data/stats_REG.RDS")

After preprocessing we have the following stats:

             blogs       news        twitter
lines        793096      957228      2072644
words        36606082    33908035    29072455
characters   207683984   206471971   162152503

Further Data Cleaning in R

Back to TOC

There are some common, customary, operations performed on a text dataset before proceeding to analyze it.

  • Make text lowercase.
  • Strip extra white spaces.
  • Remove numbers.
  • Remove punctuation.
  • Remove stopwords.

Given that the goal is to predict words in a typing context, I think that removing stopwords does not make much sense.
Working with text stripped of stopwords may still be useful if one wanted to feed the prediction algorithm some information about word associations within sentences, which may help discriminate meaningfully between the different next-word candidates “proposed” by an n-gram-based algorithm.

For the same reason, I do not think that removing punctuation would be wise or make much sense either.

Text transformations

Back to TOC

The next step is applying the following three additional transformations:

  • conversion to lower case.
  • removal of numbers.
  • removal of redundant white spaces.

This is done as follows (with a big, obligatory acknowledgement and thank you to Hadley Wickham and Stefan Bache for bringing us the pipe %>%!).

# NOT EVALUATED because too computationally heavy (loading saved products)
in.blogs.REG <- tolower(in.blogs.REG) %>% removeNumbers() %>% stripWhitespace()
in.news.REG <- tolower(in.news.REG) %>% removeNumbers() %>% stripWhitespace()
in.twitter.REG <- tolower(in.twitter.REG) %>% removeNumbers() %>% stripWhitespace()

# re-uppercases TAGS
in.blogs.REG <- gsub('<(emoticon|hashtag|dollaramount|period|hour|profanity|usa|percentage|date|money|ass|space|decade|ordinal|number|telephonenumber|timeinterval)>', 
                  '<\\U\\1>', in.blogs.REG, ignore.case = TRUE, perl = TRUE)
in.news.REG <- gsub('<(emoticon|hashtag|dollaramount|period|hour|profanity|usa|percentage|date|money|ass|space|decade|ordinal|number|telephonenumber|timeinterval)>', 
                  '<\\U\\1>', in.news.REG, ignore.case = TRUE, perl = TRUE)
in.twitter.REG <- gsub('<(emoticon|hashtag|dollaramount|period|hour|profanity|usa|percentage|date|money|ass|space|decade|ordinal|number|telephonenumber|timeinterval)>', 
                  '<\\U\\1>', in.twitter.REG, ignore.case = TRUE, perl = TRUE)

ANALYSIS

Step 1 : Sentence Annotation (R)

Back to TOC

As noted, after some tests I settled on an approach whereby n-gram tokenization is performed on individual sentences, rather than directly on the individual rows as loaded from the dataset.

This is motivated by the fact that the tokenizer I adopted, the NGramTokenizer() of the RWeka package (chosen because I found its performance more satisfactory), does not seem to interrupt its construction of n-grams at what are very likely sentence boundaries.

With next word prediction in mind, it makes a lot of sense to restrict n-grams to sequences of words within the boundaries of a sentence.

Therefore, after cleaning, transforming and filtering the data, the first real operation I perform is the annotation of sentences, for which I have been using the openNLP sentence annotator Maxent_Sent_Token_Annotator(), with its default settings, and the function annotate() from the NLP package.

sent_token_annotator <- Maxent_Sent_Token_Annotator()
sent_token_annotator
# An annotator inheriting from classes
#   Simple_Sent_Token_Annotator Annotator
# with description
#   Computes sentence annotations using the Apache OpenNLP Maxent sentence detector
#   employing the default model for language 'en'.

I want the data in the form of a vector of individual sentences, so I opted for sapply() combined with a function that wraps the operations needed to prepare a row of data for annotation, performs the annotation itself, and finally returns a vector of sentences.

find_sentences <- function(x) {
    s <- paste(x, collapse = " ") %>% as.String()
    a <- NLP::annotate(s , sent_token_annotator) 
    as.vector(s[a])
}
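
As a quick sanity check (not part of the pipeline), feeding the wrapper a toy string should return one element per sentence, along these lines:

find_sentences("This is the first sentence. And this is a second one!")
# expected, roughly: "This is the first sentence."  "And this is a second one!"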

To work around performance issues that I could not fully resolve, the sentence annotation was done on subsets (chunks) of the data of \(100000\) lines each; done this way it was much faster than forcing the annotation over each full dataset. The following code takes each dataset and splits it into sentences chunk by chunk, writing out the sentences for each chunk to a separate file; the files were then concatenated outside of R.

# NOT EVALUATED because too computationally heavy (loading saved products)
chunk_size <- 100000

for( what in c("blogs", "news", "twitter")) {
    
    data.what <- get(paste0("in.", what, ".REG"))
    len.what   <- length(data.what)
    cat(" - length ", len.what, "\n")
    
    n_chunks <- floor(len.what/chunk_size) + 1
    n1 <- ((1:n_chunks)-1)*chunk_size + 1
    n2 <- (1:n_chunks)*chunk_size
    n2[n_chunks] <- len.what
    Ns_by_chunk <- rep(0, n_chunks)
    print(n1)
    print(n2)
    
    names <- paste(what, "sentences", sprintf("%02d", (1:n_chunks)), sep = ".")
    fnames <- paste(names, ".gz", sep = "")
    print(names)
    print(fnames)
    
    # loop over chunks of size 'chunk_size'
    for(i in 1:n_chunks) {
        name1 <- names[i]
        fname1 <- fnames[i]
        idx <- n1[i]:n2[i]
        
        cat("   ", name1, length(idx), idx[1], idx[length(idx)], "\n")

        # find sentences
        assign( name1, sapply(data.what[idx], FUN = find_sentences, USE.NAMES = FALSE) %>% unlist )
        con <- gzfile(fname1, open = "w")

        # write sentences for this chunk to file
        writeLines(get(name1), con = con)
        close(con)
    }
    rm(i, n1, n2, n_chunks)
}

The stats table, with the number of sentences added, is now as follows:

                     blogs       news        twitter
lines                793096      957228      2072644
words                36606082    33908035    29072455
characters           207683984   206471971   162152503
sentences            2136323     1819982     3257495
sentences_per_line   2.694       1.901       1.572

The lists of sentences from the three datasets were then merged into a single master list.

Some further cleaning of text (perl)

After the tokenization into sentences, some more fixes were applied with perl scripts (sources posted on GitHub):

gzip -dc all.sentences.ALL.gz | \
    ./sentences_cleaning-v1.pl | \
    awk '{if(NF > 1){print $0}}' | \
    ./sentences_cleaning-v2.pl | \
    ./sentences_cleaning-v2.pl | \
    ./sentences_cleaning-v2.pl | \
    ./sentences_cleaning-v3.pl | \
    ./sentences_cleaning-v2.pl

Repeating script #2 turned out to be easier than writing more complex regular expressions.

The processed sentences were then filtered based on:

  • Number of words \(\ge 3\) (NF in awk).
    • This criterion is simply due to the fact that we want sentences long enough to yield 3-grams.
  • A maximum value of the ratio \(\frac{length - NF + 1}{NF} \le 7.0\), where length is the number of characters in the line (awk’s length).
    • This ratio cuts out particularly pathological sentences, including nonsense “words” (typically from the twitter text).

Both criteria are applied with a single awk command:

awk '{if(NF < 3){next}; ratio = (length - NF + 1)/NF; if(ratio > 7.0){next}; print $0}' 
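
As a quick sanity check of the cutoff: a typical English sentence with an average word length of about 5 characters has length \(\approx 6\,NF - 1\) (counting single spaces), giving a ratio of roughly \(\frac{6NF - 1 - NF + 1}{NF} = 5\), comfortably below 7; a 2-“word” line of 40 characters of twitter gibberish instead scores \((40 - 2 + 1)/2 = 19.5\) and is dropped.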

Removing stop words (R)

After more thinking I partially reconsidered my decision against removing stop words, and decided to remove a controlled, selected list of them.

my_stop_words <- c("a", "an", "as", "at", "no", "of", "on", "or", 
                   "by", "so", "up", "or", "no", "in", "to", "rt")

# fixing extra spaces left by removing stop words
all.sentences <- removeWords(all.sentences.ALL, my_stop_words) %>% 
                        gsub(" +", " ", . , perl = TRUE) %>% 
                        gsub("^ +", "", . , perl = TRUE) %>% 
                        gsub(" +$", "", . , perl = TRUE)

Step 2 : n-grams Tokenization

Back to TOC

For the n-gram tokenization I used the RWeka tokenizer NGramTokenizer(), passing it a list of token delimiters.

I have not been able to run NGramTokenizer() on the full vector of sentences for each dataset. It fails with some variation of a memory-allocation-related error (which honestly does not make much sense to me, considering that I am running it on machines with 12 GB of RAM).

So I am processing the data in chunks of 25,000 sentences, as exemplified by the block of code below (the n-gram data for the following sections are loaded from a saved previous analysis).

I extracted n-grams for \(n = 3, 4, 5\), with the code shown below:

token_delim <- " \\r\\n\\t.,;:\"()?!"
nl.chunk <- 25000

gc()
cat(" *** Tokenizing n-grams in WHOLE dataset [", my_date(), "]----------------------------------------\n")

len.all.sentences <- length(all.sentences)
cat(" *** Number of sentences in the WHOLE data set : ", len.all.sentences, "\n")

# define variable used to filter sentences long enough for n-grams of length N
subs <- strsplit(all.sentences, split = "[ ;,.\"\t\r\n()!?]+")
nstr.subs  <- sapply(subs, FUN = function(x) { length(unlist(x)) }, USE.NAMES = FALSE)
rm(subs)

for( ngram_size in 3:5 ) {
    cat(" *** Tokenizing : WHOLE : ", ngram_size, "-grams ----------------------------------------\n")
    
    good.sentences <- all.sentences[nstr.subs >= ngram_size]
    len.good <- length(good.sentences)
    cat("   Sentences with good length ( >=", ngram_size, ") : ", sprintf("%7d", len.good), "\n")
    cat("   Sentences with good length ( >=", ngram_size, ") : ", sprintf("%7d", len.good), 
           "(of ", sprintf("%7d", len.all.sentences), ")\n")

    n_chunks <- floor(len.good/nl.chunk) + 1
    n1 <- ((1:n_chunks)-1)*nl.chunk + 1
    n2 <- (1:n_chunks)*nl.chunk
    n2[n_chunks] <- len.good

    names <- paste("n", sprintf("%1d", ngram_size), "grams.blogs.", sprintf("%03d", (1:n_chunks)), sep = "")
    fnames <- paste("output/", names, ".gz", sep = "")
    
    for(i in 1:n_chunks) {
        name1 <- names[i]
        fname1 <- fnames[i]
        idx <- n1[i]:n2[i]
    
        cat("  [", sprintf("%3d", i), "/", sprintf("%3d", n_chunks), "]  ", 
               name1, length(idx), idx[1], idx[length(idx)], "\n")
    
        # tokenize to n-grams
        assign( name1, NGramTokenizer(good.sentences[idx], 
                Weka_control(min = ngram_size, max = ngram_size, delimiters = token_delim)) )
    
        # write to file n-grams from this chunk
        con <- gzfile(fname1, open = "w")
        writeLines(get(name1), con = con)
        close(con)

        gc()
    }

    # Combining chunks into one n-gram vector
    size.ngrams <- rep(0, n_chunks)
    total_length <- 0 
    for(i in 1:n_chunks) {
        name1 <- names[i]
        this_length <- length(get(name1))
        size.ngrams[i] <- this_length
        total_length <- total_length + this_length
        cat("  [", sprintf("%3d", i), "/", sprintf("%3d", n_chunks), 
               "]  length of ", name1, " = ", this_length, "\n")
    }
    cat("    Total Length = ", total_length, "\n")
    
    name_for_all_ngrams <- paste("n", sprintf("%1d", ngram_size), "grams.blogs.all", sep = "")
    temp_all_ngrams <- vector(mode = "character", length = total_length)
    ivec <- c(0, cumsum(size.ngrams))
    for(i in 1:n_chunks) {
        i1 <- ivec[i] + 1
        i2 <- ivec[i+1]
        name <- names[i]
        cat("   ", i, i1, i2, name, "\n")
        temp_all_ngrams[i1:i2] <- get(name)
    }

    assign( name_for_all_ngrams, temp_all_ngrams )

    # write to file all n-grams
    fname <- paste("output/", "n", sprintf("%1d", ngram_size), "grams.blogs.all.gz", sep = "")
    con <- gzfile(fname, open = "w")
    writeLines(temp_all_ngrams, con = con)
    close(con)

    # cleaning
    rm(good.sentences, len.good, temp_all_ngrams)
    
    rm(i, n1, n2, n_chunks)
    ls(pattern = "^n[1-6]grams.blogs.[0-9]")
    rm(list = ls(pattern = "^n[1-6]grams.blogs.[0-9]") )
    gc()
}

Cleaning and re-classification of n-grams

A small fraction of the n-grams so produced contain non-ASCII characters, and it makes sense to simply drop these n-grams.
For instance with:

gzip -dc n5grams.all.gz | grep -P -v '([^\x00-\x7F]+)' 

Another perl script (ngrams-reprocess_clean_and_classify.pl, available on GitHub) then:

  • Cleans up some problematic words or rejects some pathologically bad n-grams.
  • Reclassifies n-grams based on their true, adjusted number of words.

It is invoked as follows:

gzip -dc n5grams.filtered_for_nonASCII.gz | ./scripts/ngrams-reprocess_clean_and_classify.pl -go -print

This script produces a set of 7 files (tmp_n*) containing:

  • good 1-grams
  • good 2-grams
  • good 3-grams
  • good 4-grams
  • good 5-grams
  • n > 5 grams
  • trashed n-grams (full of problems not worth dealing with)

Finally, the n-grams of each order produced by reprocessing all the original n-grams with the above script were merged.

A look at the n-grams

Back to TOC

NOTE: the following results refer to the analysis of a subset of sentences (20%).

From the n-gram vectors we can compute frequencies, which will be an important basis for the prediction algorithm.

For now we can take a peek at the most frequent 3-grams and 4-grams in the three datasets.

n3g.blogs.freq <- as.data.frame(table(n3grams.blogs.all), stringsAsFactors = FALSE)
n3g.blogs.freq <- n3g.blogs.freq[order(n3g.blogs.freq$Freq, decreasing = TRUE), ]
row.names(n3g.blogs.freq) <- NULL

n4g.blogs.freq <- as.data.frame(table(n4grams.blogs.all), stringsAsFactors = FALSE)
n4g.blogs.freq <- n4g.blogs.freq[order(n4g.blogs.freq$Freq, decreasing = TRUE), ]
row.names(n4g.blogs.freq) <- NULL

colnames(n3g.blogs.freq) <- c("ngram", "count")
colnames(n4g.blogs.freq) <- c("ngram", "count")

n3g.news.freq <- as.data.frame(table(n3grams.news.all), stringsAsFactors = FALSE)
n3g.news.freq <- n3g.news.freq[order(n3g.news.freq$Freq, decreasing = TRUE), ]
row.names(n3g.news.freq) <- NULL

n4g.news.freq <- as.data.frame(table(n4grams.news.all), stringsAsFactors = FALSE)
n4g.news.freq <- n4g.news.freq[order(n4g.news.freq$Freq, decreasing = TRUE), ]
row.names(n4g.news.freq) <- NULL

colnames(n3g.news.freq) <- c("ngram", "count")
colnames(n4g.news.freq) <- c("ngram", "count")

n3g.twitter.freq <- as.data.frame(table(n3grams.twitter.all), stringsAsFactors = FALSE)
n3g.twitter.freq <- n3g.twitter.freq[order(n3g.twitter.freq$Freq, decreasing = TRUE), ]
row.names(n3g.twitter.freq) <- NULL

n4g.twitter.freq <- as.data.frame(table(n4grams.twitter.all), stringsAsFactors = FALSE)
n4g.twitter.freq <- n4g.twitter.freq[order(n4g.twitter.freq$Freq, decreasing = TRUE), ]
row.names(n4g.twitter.freq) <- NULL

colnames(n3g.twitter.freq) <- c("ngram", "count")
colnames(n4g.twitter.freq) <- c("ngram", "count")

3-grams

tmp3.df <- cbind(head(n3g.blogs.top500, 20), 
                 head(n3g.news.top500, 20), 
                 head(n3g.twitter.top500, 20)) 
tmp4.df <- cbind(head(n4g.blogs.top500, 20), 
                 head(n4g.news.top500, 20), 
                 head(n4g.twitter.top500, 20)) 
colnames(tmp4.df) <- c("ngram_blogs", "count", "ngram_news", "count", "ngram_twitter", "count")
# saveRDS(tmp3.df, file = "tmp_n3g_table.RDS")
  • blogs
print(tmp3.df[, 1:2], print.gap = 3, right = FALSE)
#      ngram           count
# 1    one of the      4416 
# 2    a lot of        3613 
# 3    to be a         2078 
# 4    it was a        2076 
# 5    as well as      2067 
# 6    some of the     1988 
# 7    the end of      1974 
# 8    out of the      1954 
# 9    be able to      1927 
# 10   i want to       1882 
# 11   a couple of     1828 
# 12   the fact that   1596 
# 13   this is a       1592 
# 14   the rest of     1539 
# 15   going to be     1521 
# 16   part of the     1478 
# 17   i_am going to   1448 
# 18   i do_not know   1425 
# 19   one of my       1408 
# 20   i had to        1373
  • news
print(tmp3.df[, 3:4], print.gap = 3, right = FALSE)
#      ngram                             count
# 1    the united states                 1324 
# 2    the first time                    1249 
# 3    for the first                     1021 
# 4    more than <DOLLARAMOUNT>          1000 
# 5    the end the                        896 
# 6    it would be                        751 
# 7    it was the                         722 
# 8    the fact that                      690 
# 9    <DOLLARAMOUNT> - <DOLLARAMOUNT>    680 
# 10   this is the                        679 
# 11   the rest the                       676 
# 12   said he was                        667 
# 13   he said he                         655 
# 14   i do_not think                     652 
# 15   the new york                       651 
# 16   he said the                        628 
# 17   i do_not know                      626 
# 18   for more than                      622 
# 19   the same time                      578 
# 20   when he was                        565
  • twitter
print(tmp3.df[, 5:6], print.gap = 3, right = FALSE)
#      ngram                           count
# 1    thanks for the                  7135 
# 2    thank you for                   2590 
# 3    i love you                      2474 
# 4    for the follow                  2334 
# 5    for the rt                      1311 
# 6    let me know                     1301 
# 7    i do_not know                   1265 
# 8    i feel like                     1179 
# 9    i wish i                        1154 
# 10   thanks for following            1048 
# 11   you for the                     1013 
# 12   i can_not wait                   968 
# 13   <HASHTAG> <HASHTAG> <HASHTAG>    963 
# 14   how are you                      960 
# 15   for the <HASHTAG>                958 
# 16   can_not wait for                 919 
# 17   rt : i                           915 
# 18   i think i                        895 
# 19   if you want                      867 
# 20   what do you                      858

4-grams

  • blogs
print(tmp4.df[, 1:2], print.gap = 3, right = FALSE)
#      ngram_blogs          count
# 1    the end of the       1011 
# 2    the rest of the       913 
# 3    at the end of         872 
# 4    at the same time      700 
# 5    when it comes to      611 
# 6    one of the most       610 
# 7    to be able to         578 
# 8    for the first time    565 
# 9    in the middle of      519 
# 10   if you want to        469 
# 11   is one of the         462 
# 12   i do_not want to      461 
# 13   a bit of a            403 
# 14   i was going to        395 
# 15   on the other hand     393 
# 16   i would like to       375 
# 17   one of my favorite    350 
# 18   as well as the        325 
# 19   i was able to         304 
# 20   is going to be        302
  • news
print(tmp4.df[, 3:4], print.gap = 3, right = FALSE)
#      ngram_news                                      count
# 1    for the first time                              791  
# 2    more than <DOLLARAMOUNT> million                398  
# 3    the first time since                            195  
# 4    more than <DOLLARAMOUNT> billion                150  
# 5    for more than years                             138  
# 6    feet <DATE> for <DOLLARAMOUNT>                  137  
# 7    square feet <DATE> for                          137  
# 8    <DOLLARAMOUNT> million <DOLLARAMOUNT> million   136  
# 9    for the most part                               133  
# 10   the past two years                              132  
# 11   told the associated press                       132  
# 12   the united states and                           131  
# 13   i do_not know if                                126  
# 14   the end the year                                126  
# 15   the end the day                                 124  
# 16   g fat g saturated                               118  
# 17   the new york times                              118  
# 18   dow jones industrial average                    114  
# 19   be reached for comment                          112  
# 20   i do_not know what                              112
  • twitter
print(tmp4.df[, 5:6], print.gap = 3, right = FALSE)
#      ngram_twitter                             count
# 1    thanks for the follow                     1882 
# 2    thanks for the rt                         1031 
# 3    thank you for the                          916 
# 4    for the first time                         513 
# 5    i wish i could                             410 
# 6    thanks for the <HASHTAG>                   375 
# 7    rt : rt :                                  358 
# 8    thanks for the mention                     358 
# 9    let me know if                             330 
# 10   <HASHTAG> <HASHTAG> <HASHTAG> <HASHTAG>    322 
# 11   that awkward moment when                   299 
# 12   what do you think                          292 
# 13   thank you much for                         276 
# 14   hope all is well                           266 
# 15   can_not wait for the                       262 
# 16   thanks for the shout                       257 
# 17   for the shout out                          254 
# 18   i thought it was                           243 
# 19   thank you for following                    240 
# 20   thank you for your                         232

It is apparent that some work will still be necessary on the validation of the n-grams, or better still on further text transformations, in particular for the twitter dataset, which “suffers” from the tendency to use shorthand slang (e.g. “rt” for “re-tweet”) that adds a lot of “noise” to the data.

Some Summary Plots for 4-grams

Back to TOC

Top-30 all mixed

# ECHO FALSE
mycolors <- c("deepskyblue3", "firebrick2", "forestgreen")
data2pl <- n4g.high.sorted[1:30, ]
bp_mix <- ggplot(data2pl, aes(x = reorder(ngram, count), y = count)) + theme_bw() + coord_flip() + xlab("") + 
    theme(plot.title = element_text(face = "bold", size = 20)) + 
    theme(axis.text = element_text(size = 10)) + 
    scale_fill_manual(values = mycolors) + 
    ggtitle("Top 30 4-grams : all sources") + 
    geom_bar(stat = "identity", aes(fill = flag)) + 
    geom_text(aes(label = count, hjust = 1.1, size = 10), col = "white", position = position_stack())

bp_mix

Top-20 by data source

# ECHO FALSE
data.blogs <- n4g.blogs.top500[1:20, ]
bp_blogs <- ggplot(data.blogs, aes(x = reorder(ngram, count), y = count)) + theme_bw() + coord_flip() + xlab("") + 
    theme(plot.title = element_text(face = "bold", size = 20)) + 
    theme(axis.text = element_text(size = 10)) + 
    theme(legend.position = "none") + 
    ggtitle("Top 20 4-grams : blogs") + 
    geom_bar(stat = "identity", fill = mycolors[1]) + 
    geom_text(aes(label = count, y = 100, size = 10), col = "white") 

data.news <- n4g.news.top500[1:20, ]
bp_news <- ggplot(data.news, aes(x = reorder(ngram, count), y = count)) + theme_bw() + coord_flip() + xlab("") + 
    theme(plot.title = element_text(face = "bold", size = 20)) + 
    theme(axis.text = element_text(size = 10)) + 
    theme(legend.position = "none") + 
    ggtitle("Top 20 4-grams : news") + 
    geom_bar(stat = "identity", fill = mycolors[2]) + 
    geom_text(aes(label = count, y = 100, size = 10), col = "white") 

data.twitter <- n4g.twitter.top500[1:20, ]
bp_twitter <- ggplot(data.twitter, aes(x = reorder(ngram, count), y = count)) + theme_bw() + coord_flip() + xlab("") + 
    theme(plot.title = element_text(face = "bold", size = 20)) + 
    theme(axis.text = element_text(size = 10)) + 
    theme(legend.position = "none") + 
    ggtitle("Top 20 4-grams : twitter") + 
    geom_bar(stat = "identity", fill = mycolors[3]) + 
    geom_text(aes(label = count, y = 100, size = 10), col = "white") 

grid.arrange(bp_blogs, bp_news, bp_twitter, nrow = 3)


APPENDIX

Back to the Top

User Defined Functions

These are two handy functions used in the analysis.

  • The first for reading the data.
  • The second is passed to sapply() to annotate sentences, which allows working row by row instead of converting the whole dataset into a single document.
#===================================================================================================
# modified readLines

readByLine <- function(fname, check_nl = TRUE, skipNul = TRUE) {
    if( check_nl ) {
        cmd.nl <- paste("gzip -dc", fname, "| wc -l | awk '{print $1}'", sep = " ")
        nl     <- as.integer(system(cmd.nl, intern = TRUE))   # coerce the shell output to an integer for readLines()
    } else {
        nl   <- -1L
    }
    con <- gzfile(fname, open = "r")
    on.exit(close(con))
    readLines(con, n = nl, skipNul = skipNul) 
}

#===================================================================================================
# to use w/ sapply for finer sentence splitting.

find_sentences <- function(x) {
    s <- paste(x, collapse = " ") %>% as.String()
    a <- NLP::annotate(s , sent_token_annotator) 
    as.vector(s[a])
}

#===================================================================================================

More functions can be reviewed directly in the repository.

Summary of regularization done with perl scripts

Script sources are in this GitHub folder.

regularize_text-pass1.pl

  • CHARACTER “HOMOGENIZATION”
    • Normalize odd characters
    • Currency symbols
    • HTML tags
    • encoded apostrophe
  • CHARACTER SEQUENCES
    • Cleaning of BEGIN / END of LINE
    • EOL ellipsis
    • EOL non-sense
    • word bracketed by *
    • substitution of “|”, which seems to be equivalent to an end of sentence (i.e. a period)
    • substitution of <==/<-- and ==>/--> with “;”
    • sequences of !, ?
  • HASHTAGS
    • recognized (most) hashtags

regularize_text-pass2.pl

  • NUMBER RELATED
    • dates
    • hours am/pm
    • hours a.m./p.m.
    • dollar amounts
    • percentages
  • ACRONYMS
    • acronyms: US
  • EMOTICONS
    • regular
    • (reverse) [not done]
  • ELLIPSIS
    • INLINE ellipsis
  • REPEATED CHARACTERS
    • DOLLAR
    • STAR
    • “+”
    • “-”
    • “a – b” ==> replace with “,”
    • “a–b” ==> replace with “,”
    • “! – A” ==> REMOVE
    • “A–A[a ’]” ==> replace with “;”
    • “a– A” ==> replace with “;”
    • “a– a” ==> replace with “,”
    • LEAVE ONE (followed by space) : ,
    • LEAVE ONE : _ (the % is handled separately when dealing with percentages)
    • REMOVE ENTIRELY if 2+ : < > = #

regularize_text-pass3.pl

  • CONTRACTIONS
    • ’ll ==> _will / " will" ==> _will
    • n’t ==> _not
    • ’re ==> _are
    • ’ve ==> _have
    • some additional ad hoc (e.g. won’t ==> will_not)
    • ’s ==> _s
    • additional possibly useful/meaningful replacements (e.g. y’all)
  • WHITE SPACES
    • squeezing extra white spaces
    • fixing some punctuation and white space
  • PROFANITIES
    • catch the “7 ones” and replace with tag

regularize_text-pass4.pl

  • ABBREVIATIONS: find and mark with tags standard/common abbreviations
    • month names
    • Mr, Mrs, Dr, …
  • MORE
    • Find and replace something like -word
    • Remove genitives
    • Clean line endings preceded by spurious spaces
    • Replace: ’’ ==> "

regularize_text-pass5.pl

  • WEIRD CHARACTERS
    • Replace weird characters with tag
  • TAGS CLEANING
    • Clean consecutive tags
    • Remove selected tags
    • Remove quotes from single quoted words
  • MORE
    • Clear row BEGINNINGS with non-alpha characters

regularize_text-pass6.pl

  • MORE TAG-RELATED REGULARIZATIONS
    • FIX missed HASHTAGS at line beginning
    • FIX additional number capture
    • Emptying tags that were defined to capture the original expression

Summary of post-sentence-tokenization cleaning (with perl scripts)

Script sources are in this GitHub folder.

sentences-cleaning_1.pl

  • CONTRACTIONS
    • ’ll ==> _will / " will" ==> _will
    • n’t ==> _not
    • ’re ==> _are
    • ’ve ==> _have
    • some additional ad hoc
    • ’s ==> _s
    • additional possibly useful/meaningful replacements (e.g. y’all)

sentences-cleaning_2.pl

  • Remove spaces at the end of a sentence
  • More catching of profanities

  • TAGS
    • Consecutive identical TAGS
    • Remove TAGS enclosed in parenthesis
    • Remove USELESS TAGS
  • Remove excess space
    • Removed excess space at the beginning
    • Removed excess space at the end
    • Removed excess space in the middle
  • Clean non-alpha BEGINNING of sentences
    • Clean sentences “bracketed” by quotes or parentheses (removing the “bracketing” character)
    • Remove beginning quotes not paired in the rest of the sentence.
    • Bracketed TAGS - just remove them
    • Non-alpha preceding a TAG (done separately to make life simpler)
    • Catching up with more fixes for beginning
    • Clean sentences “bracketed” by quotes or parentheses (removing the “bracketing” character)
  • Clean non-alpha ENDING of sentences
    • Clean extra spaces before “good” sentence endings
    • Cleaning orphan ", ’, ), ] at the end of a sentence that have no match earlier in the sentence
    • Some dirty ENDS

sentences-cleaning_3.pl

  • CLEAN (again) lines BEGINNING with TAGS and non-alpha

Mysterious issue with NGramTokenizer

Because the NGramTokenizer would fail with a java memory error when fed the full vector of sentences, but ran fine when fed chunks of 100,000 sentences, I thought that turning this into a basic loop, handling the splitting into chunks, collecting the output and finally returning a single vector of n-grams, would work and be compact and smarter.

It turns out that it fails… and this puzzles me deeply.
Is R somehow handling the “stuff” in the loop in the same way it would if I run the tokenizer with the full vector?

Any clue?

# NOT EVALUATED
nl.chunk <- 100000
N <- ceiling(length(sel.blogs.sentences)/nl.chunk)
alt.n3grams.blogs <- vector("list", N)

system.time({
for( i in 1:N ) {
    n1 <- (i-1)*nl.chunk + 1
    n2 <- min(i*nl.chunk, end.blogs)
    cat(" ", i, n1, n2, "\n")
    alt.n3grams.blogs[[i]] <- NGramTokenizer(sel.blogs.sentences[n1:n2], 
                                             Weka_control(min = 3, max = 3, 
                                                          delimiters = token_delim)) 
}
})