
This report illustrates the work done after obtaining the first master list of n-grams, \(n = 3, 4, 5\).

It is still very much in progress, and for now not much more than a skeleton.


This phase is summarized in the Milestone Report, available here

  • I cleaned the most pathological issues of the text in perl before feeding it to R.
  • In perl I did some more useful regex parsing to catch things like emoticons, hashtags, abbreviations, number-related pieces, profanities (I learned an awful lot more about regex than I thought possible!!! I confess I did not know about the lookahead/lookbehind magic).
  • In R I tokenized the text into sentences (with NLP::annotate), because it seemed to improve n-gram tokenization, both in terms of speed and quality.
  • In perl I did another pass at cleaning the sentences.
  • With cleaned sentences, back to R for n-gram tokenization, 2/3/4/5-grams (did not use 2-grams.)
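One plausible reason sentence splitting helps is that a sliding window applied sentence by sentence never produces n-grams that straddle a sentence boundary. The following is only a minimal base-R sketch of that idea, not the tokenizer actually used; `make_ngrams` is a hypothetical helper and the sentences are toy data.

```r
# Build all n-grams of order n from ONE sentence, joined with ";"
# (the separator the report mentions for its n-gram files).
# make_ngrams is a hypothetical helper, not part of the original code.
make_ngrams <- function(sentence, n) {
    words <- strsplit(trimws(sentence), " +")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) {
        return(character(0))    # sentence too short for this order
    }
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = ";"),
           character(1))
}

# Toy input: two already-cleaned sentences.
sentences <- c("the cat sat on the mat", "hello there")

# Tokenizing sentence by sentence: no 3-gram mixes words
# from the first sentence with words from the second.
trigrams <- unlist(lapply(sentences, make_ngrams, n = 3))
print(trigrams)
```

The second sentence has only two words, so it contributes no 3-grams at all instead of borrowing a word from its neighbour.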

N-grams analysis

  • Back to perl for validation of the n-grams. If it seems like a lot of back and forth… in good part it is because some issues became more visible (literally speaking) after each tokenization.
  • In perl I extracted a census of words from the sentences (I would call that a dictionary).
    At first I was doing this from the n-grams until it dawned on me that it was overcounting words because of the shifting-window nature of n-grams.
  • I handled the census of n-grams with good old command line! Pretty much piping into sort to uniq -c to sort -k 1nr and finally for formatting (n-grams were “;” separated words, and the above pipe leaves space separated count and string) an easy awk.

The final table is something like this:


It was faster (or at least not slower) than doing this in R, with the added convenience of much easier “access” to the files for inspection and such.
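For comparison, here is a rough R counterpart of the counting pipeline, together with a tiny demonstration of why a word census taken from n-grams overcounts. It is a sketch on toy data, not the commands actually run; all variable names are made up for the example.

```r
# Toy n-gram list, words separated by ";" as in the report.
ngrams <- c("i;am;here", "you;are;there", "i;am;here",
            "we;are;here", "i;am;here", "you;are;there")

# table() plays the role of `sort | uniq -c`.
counts <- table(ngrams)
census <- data.frame(ngram = names(counts),
                     count = as.integer(counts),
                     stringsAsFactors = FALSE)
# Order by decreasing count, like `sort -k 1nr`.
census <- census[order(-census$count, census$ngram), ]
rownames(census) <- NULL
print(census)

# Word counts taken from the 3-grams of a 4-word sentence
# versus counts taken from the sentence itself.
sentence_words <- c("a", "b", "c", "d")
trigrams <- c("a;b;c", "b;c;d")
from_ngrams <- table(unlist(strsplit(trigrams, ";", fixed = TRUE)))
from_sentence <- table(sentence_words)
print(from_ngrams)     # interior words "b" and "c" are counted twice
print(from_sentence)   # every word is counted once
```

Interior words fall inside more than one window, which is why the dictionary is better built from the sentences.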

More perl for a couple more “advanced” operations:

  • Rejecting n-grams containing words not occurring at a minimum number of times in the dictionary.
    In the end I cut at 5, leaving 126,000 words in the dictionary (out of > 400,000) but accounting for 99.5% of the coverage.
  • Making a census of n-gram “roots” (i.e. the first 4 words in a 5-gram), which I used (sort of) in the prediction process.
    Kind of like using the probability (frequency) of a word in a corpus as a discounting factor.
  • This also gives relative frequencies of 5-th word for a given “root”, also potentially useful/usable information.
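The dictionary cut and the rejection of n-grams can be sketched as follows. The words and counts are toy values, chosen only so that the coverage comes out at 99.5%; the column names `ID`, `word`, `count` follow the dictionary layout described later in this report, and everything else is an assumption made for illustration.

```r
# Toy dictionary (ID, word, count); counts are invented.
dict <- data.frame(ID = 1:6,
                   word = c("the", "of", "cat", "mat", "zyzzyva", "qwertyish"),
                   count = c(500L, 300L, 120L, 75L, 3L, 2L),
                   stringsAsFactors = FALSE)

# Keep only words occurring at least min_count times.
min_count <- 5
keep <- dict$count >= min_count

# Coverage: share of all word occurrences that the kept words account for.
coverage <- sum(dict$count[keep]) / sum(dict$count)
print(coverage)

# Reject every n-gram containing at least one dropped word.
ngrams <- data.frame(w1 = c("the", "the", "of"),
                     w2 = c("cat", "zyzzyva", "the"),
                     w3 = c("mat", "cat", "mat"),
                     stringsAsFactors = FALSE)
kept_words <- dict$word[keep]
all_known <- apply(ngrams, 1, function(row) all(row %in% kept_words))
ngrams_kept <- ngrams[all_known, ]
print(ngrams_kept)
```

Dropping a third of the toy vocabulary costs only 0.5% of the occurrences, which is the same kind of trade-off as the real cut at 5.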

The new 5-gram table looks like this now:

  • First column is the total number of times the 4-subgram appears in the 5-grams set
  • Second column is how many different 5-grams have that 4-subgram as their root.
  • Third column is the count of this particular 5-gram.

For this particular “root” for instance these are its 5-grams.
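A toy version of such a table, with the three columns described above plus the relative frequency of the fifth word, can be built with base R's `ave()`. The n-grams and counts are invented, and the column names (`root_total`, `root_variants`, `rel_freq`) are hypothetical labels, not the ones used in the real files.

```r
# Toy 5-gram census: 4-word root, predicted 5th word, count.
fivegrams <- data.frame(
    root  = c("see you later this", "see you later this",
              "see you later this", "thanks for the great"),
    pred  = c("week", "evening", "afternoon", "follow"),
    count = c(40L, 8L, 2L, 5L),
    stringsAsFactors = FALSE)

# First column: total count of the root over all its 5-grams.
fivegrams$root_total <- ave(fivegrams$count, fivegrams$root, FUN = sum)
# Second column: number of distinct 5-grams sharing the root.
fivegrams$root_variants <- ave(fivegrams$count, fivegrams$root, FUN = length)
# Relative frequency of the 5th word given the root.
fivegrams$rel_freq <- fivegrams$count / fivegrams$root_total

fivegrams <- fivegrams[, c("root_total", "root_variants", "count",
                           "root", "pred", "rel_freq")]
print(fivegrams)
```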


For efficiency I also (again in perl) checked 3-grams against 4- and 5-grams to keep only the 3-grams that were not contained in those higher order sets. Same for 4-grams against 5-grams.
This decreased the size of the 3-grams and 4-grams data sets significantly:

  • for n-grams occurring at least 2 times: 3-grams went from 6.6M to 4.1M, 4-grams from 4.2M to 2.9M,
  • for n-grams occurring at least 4 times: 3-grams went from 1.9M to 0.75M, 4-grams from 700k to 337k.
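The pruning can be sketched in R as below. Note the assumption: here a 3-gram counts as "contained" if it appears as any contiguous window of a 4-gram, which may or may not be the exact criterion used in the perl script. `sub_ngrams` is a hypothetical helper and the data are toy values.

```r
# All contiguous sub-n-grams of order n inside one ";"-separated n-gram.
# sub_ngrams is a hypothetical helper written for this example.
sub_ngrams <- function(ngram, n) {
    words <- strsplit(ngram, ";", fixed = TRUE)[[1]]
    if (length(words) < n) {
        return(character(0))
    }
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = ";"),
           character(1))
}

fourgrams <- c("i;am;going;home", "see;you;next;week")
trigrams  <- c("i;am;going", "am;going;home", "you;next;week", "going;to;be")

# Every 3-gram that already lives inside some 4-gram ...
covered <- unique(unlist(lapply(fourgrams, sub_ngrams, n = 3)))
# ... is dropped from the 3-gram set.
trigrams_kept <- trigrams[!(trigrams %in% covered)]
print(trigrams_kept)
```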

I also cut 3- and 4-grams at a minimum count of 4, and 5-grams at 2.
With these cuts, the final tally was 755k 3-grams, 337K 4-grams, 1.4M 5-grams.

I coded words as numbers, and looked these up in the dictionary.
In the end I did not rely on hashes for this because it did not quite provide such a spectacular performance improvement over more mundane arrays.

I split the dictionary data frame (ID, word, count) into two vectors, one for words and one for counts, both indexed on the ID (the fact that some IDs were not used because I dropped some words is not really a huge waste of memory, these are very small arrays).

Getting ID from words by a plain dict$word == "text", and then count as count[ID] is fast enough.
Moreover, creating this pair of arrays from the data frame turned out to be massively quicker than creating the hash: practically not measurable, compared to a few seconds. Not worth the hassle, especially considering that the delay in creating the hash when the application starts can be annoying.
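A minimal sketch of this lookup scheme, on a toy dictionary with a gap in the IDs. The names `vec_words` and `vec_counts` match the arguments of the main function; `word_to_id` is a hypothetical stand-in for the lookup helper.

```r
# Toy dictionary; ID 3 is missing, as if that word had been dropped.
dict <- data.frame(ID = c(1L, 2L, 4L, 5L),
                   word = c("the", "cat", "sat", "mat"),
                   count = c(500L, 120L, 80L, 75L),
                   stringsAsFactors = FALSE)

# Two plain vectors indexed directly by ID; unused IDs stay NA / 0.
max_id <- max(dict$ID)
vec_words  <- rep(NA_character_, max_id)
vec_counts <- rep(0L, max_id)
vec_words[dict$ID]  <- dict$word
vec_counts[dict$ID] <- dict$count

# word -> ID by a plain vector comparison; 0 means "not in dictionary".
word_to_id <- function(dict, word) {
    id <- dict$ID[dict$word == word]
    if (length(id) > 0) id[1] else 0L
}

id <- word_to_id(dict, "sat")
print(id)               # ID of "sat"
print(vec_counts[id])   # its count, by direct indexing
print(vec_words[id])    # and back from ID to word
print(word_to_id(dict, "dog"))   # unknown word
```

Building the two vectors is a pair of vectorized assignments, which is consistent with the setup cost being negligible.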

Prediction algorithm

I might come back to illustrate the prediction algorithm based on these data.
I was satisfied by its performance, both for speed and accuracy, but not too happy about the latter because it tends to give higher priority to short words. One easy improvement would be to increase the weight of “full n-gram matches” when they exist, because right now they can be diluted by the results of sub-n-gram matches.
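As I read the code, partial matches are ranked with a binary score: each of the (up to) four root positions gets a flag, and the flags are weighted by powers of two so that a match on the word closest to the prediction counts the most. The toy example below only illustrates that scoring idea; the flag matrix is invented.

```r
# Weights 1, 2, 4, 8 for root positions w1..w4 (w4 is next to the prediction).
bs_w2 <- 2^(0:3)

# Toy match flags for three candidate 5-grams.
flags <- rbind(full      = c(TRUE,  TRUE,  TRUE,  TRUE),
               last_two  = c(FALSE, FALSE, TRUE,  TRUE),
               first_two = c(TRUE,  TRUE,  FALSE, FALSE))

# Number of matching words per candidate.
nm <- rowSums(flags)
# Binary score: dot product of the flags with the position weights.
bs2 <- as.vector((flags * 1) %*% bs_w2)
names(bs2) <- rownames(flags)

print(nm)
print(bs2)
```

Two candidates with the same number of matching words (2) get very different scores (12 versus 3) depending on where the matches fall. A full match scores 15, and raising its weight further is the "easy improvement" mentioned above.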


The main code on which the application runs is shown below and can also be reviewed on GitHub:

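# The code below relies on dplyr for mutate(), filter(), select(), arrange(),
# group_by(), summarise() and the %>% pipe.
library(dplyr)

# Main function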
predict_engine <- function(input_text = NULL, n3w = NULL, n4w = NULL, n5w = NULL, dict = NULL, vec_words, vec_counts, how_many = 5, debug = FALSE) {
    dict_size <- sum(dict$count)

    OutputColumns <- c(1, 2)

    # return some default data frame.
    fake1_wpred <- c("BAZINGA!", "", "You've", "kidding me", "")
    fake1_metric1 <- c("3.141592653", "", "Got to be", "with this!", "")
    fake1_metric2 <- c("2.718281828", "", "", "", "")

    fake2_wpred <- c("BAZINGA!", "Sorry...", "I Could", "find")
    fake2_metric1 <- c("3.141592653", "", "Not", "a match")
    fake2_metric2 <- c("2.718281828", "", "", "")

    fake3_wpred <- c("BAZINGA!", "", "EMPTY", "Please")
    fake3_metric1 <- c("3.141592653", "", "INPUT", "Try Again")
    fake3_metric2 <- c("2.718281828", "", "", "")

    fall_back1 <- data.frame( prediction = fake1_wpred, metric1 = fake1_metric1, metric2 = fake1_metric2, stringsAsFactors = FALSE)
    fall_back2 <- data.frame( prediction = fake2_wpred, metric1 = fake2_metric1, metric2 = fake2_metric2, stringsAsFactors = FALSE)
    fall_back3 <- data.frame( prediction = fake3_wpred, metric1 = fake3_metric1, metric2 = fake3_metric2, stringsAsFactors = FALSE)

    debug_df_col1 <- c("PROBLEM", " ", " ", " ", " ")
    debug_df_col2 <- c("CAUGHT", " ", " ", " ", " ")
    debug_df_col3 <- c(" ", " ", " ", " ", " ")
    debug_df <- data.frame( prediction = debug_df_col1, metric1 = debug_df_col2, metric2 = debug_df_col3, stringsAsFactors = FALSE)

    # input_refmt1 <- system2("./scripts/", input = input_text, stdout = TRUE)
    # input_refmt2 <- system2("./scripts/", input = shQuote(input_text), stdout = TRUE)
    # input_refmt3 <- text_string_refmt(input_text)
    input_refmt <- text_string_refmt(input_text)
    # input_refmt <- input_text;
    input_length <- 0.0
    input_length <- nchar(input_refmt)

    if( is.null(input_length) ) {
        candidates <- debug_df
        candidates[2, 1] <- "input_length"; candidates[2, 2] <- "NULL"
        top10 <- head(candidates[, OutputColumns], how_many)
        return(top10)
    }

    if( is.na(input_length) ) {
        candidates <- debug_df
        candidates[2, 1] <- "input_length"; candidates[2, 2] <- "NA"
        top10 <- head(candidates[, OutputColumns], how_many)
        return(top10)
    }

    if( input_length == 0 ) { 
        candidates <- debug_df
        candidates[2, 1] <- "input_length"; candidates[2, 2] <- 0
        top10 <- head(candidates[, OutputColumns], how_many)
        return(top10)
    }

    words_all <- unlist(strsplit(input_refmt, split = "[ ;,.\"\t\r\n()!?:]+"))
    nwords <- min(5, length(words_all))
    if( is.null(nwords) ) {
        candidates <- debug_df
        candidates[2, 1] <- "nwords"; candidates[2, 2] <- "NULL"
        top10 <- head(candidates[, OutputColumns], how_many)
        return(top10)
    }

    if( is.na(nwords) ) {
        candidates <- debug_df
        candidates[2, 1] <- "nwords"; candidates[2, 2] <- "NA"
        top10 <- head(candidates[, OutputColumns], how_many)
        return(top10)
    }

    if( nwords == 0 ) {
        candidates <- debug_df
        candidates[2, 1] <- "nwords"; candidates[2, 2] <- 0
        top10 <- head(candidates[, OutputColumns], how_many)
        return(top10)
    }

    # keeps only the last 4 words (the longest root a 5-gram can use)
    # if there are 5 or more after regularization
    if( nwords >= 5 ) {
        # words_select <- words_all[(length(words_all)-4):length(words_all)]
        words_select <- tail(words_all, 4)
    } else {
        words_select <- words_all
    }
    # checks matches and counts on dictionary
    check_counts <- 0.0
    w_counts <- sapply(words_select, function(x) { get_word_count(dict, x) }, USE.NAMES = FALSE )
    w_IDs    <- sapply(words_select, function(x) { get_word_ID(dict, x) }, USE.NAMES = FALSE )

    check_counts <- sum(w_counts)

    if( is.null(check_counts) ) {
        candidates <- debug_df
        candidates[2, 1] <- "check_counts"; candidates[2, 2] <- "NULL"
        candidates[3, 1] <- "nwords"      ; candidates[3, 2] <- nwords
        candidates[ , 3] <- w_IDs[1:5]
        top10 <- head(candidates[, OutputColumns], how_many)
        return(top10)
    }
    if( is.na(check_counts) ) {
        candidates <- debug_df
        candidates[2, 1] <- "check_counts"; candidates[2, 2] <- "NA"
        candidates[3, 1] <- "nwords"      ; candidates[3, 2] <- nwords
        candidates[ , 3] <- w_IDs[1:5]
        top10 <- head(candidates[, OutputColumns], how_many)
        return(top10)
    }

    # none of the input words is in the dictionary
    if( check_counts == 0 ) { 
        candidates <- fall_back1
        top10 <- head(candidates[, OutputColumns], how_many)
        return(top10)
    }

    bs_exp <- 0:3
    bs_w2 <- 2^bs_exp 
    c_threshold <- 1e7

    threshold_n34 <- 10
    flag_n34 <- FALSE
    threshold_bs2 <- 20
    flag_bs2 <- FALSE
    weight_ng5 <- 1.5
    weight_ng4 <- 1.0
    weight_ng3 <- 0.5
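    N5_candidates <- 0.0
    N4_candidates <- 0.0
    N3_candidates <- 0.0

    # State flags, initialized here because they are tested further down
    # even when no n-gram branch has set them.
    FLAG_DONE    <- FALSE
    FLAG_EMPTY_5 <- TRUE
    FLAG_EMPTY_4 <- TRUE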

     # candidates <- debug_df
     # candidates[2, 1] <- "N5_candidates"; candidates[2, 2] <- N5_candidates
     # candidates[3, 1] <- "N4_candidates"; candidates[3, 2] <- N4_candidates
     # candidates[4, 1] <- "N3_candidates"; candidates[4, 2] <- N3_candidates
     # candidates[5, 1] <- "FLAG_DONE";     candidates[5, 2] <- FLAG_DONE
     # candidates[6, 1] <- "input_length";  candidates[6, 2] <- input_length
     # candidates[7, 1] <- "check_counts";  candidates[7, 2] <- check_counts
     # candidates[8, 1] <- "nwords"      ;  candidates[8, 2] <- nwords
     # candidates[1:3 , 3] <- c(FLAG_EMPTY_5, FLAG_EMPTY_4, FLAG_EMPTY_3)
     # top10 <- head(candidates[, OutputColumns], how_many)
     # return(top10)

    n5s <- select_ngrams_xnew(ngrams = n5w, 
                              columns = 1:4, 
                              words = tail(w_IDs, 4), 
                              counts = tail(w_counts, 4), 
                              c_threshold = c_threshold) 

    N5_candidates <- nrow(n5s$matches)
    if( N5_candidates > 0 ) { 

        n5text <- mutate(n5s$matches, 
                         logPdict_pred = -1*(log10(vec_counts[pred]/dict_size)), 
                         logPdict = as.numeric(logPdict),
                         orderW = weight_ng5)
        n5text$nm <- rowSums(n5s$flags)
        binscore5_w2 <- apply(n5s$flags, 1, function(x) { x %*% bs_w2 } )
        n5text$bs2 <- binscore5_w2
        test_n34 <- sum( n5text$nm  >=  3 )
        test_bs2 <- sum( n5text$bs2 >= 10 | n5text$bs2 == 7 | n5text$bs2 == 6 )
        if( test_n34 >= threshold_n34 ) {
            flag_n34 <- TRUE
            n5text <- filter(n5text, nm >= 3)
        } else if( test_bs2 >= threshold_bs2 ) {
            flag_bs2 <- TRUE
            n5text <- filter(n5text, bs2 >= 10 | bs2 == 7 | bs2 == 6 )
        }
        # paste() replaces stringr's str_join(), which newer stringr versions no longer provide
        n5text <- mutate(n5text, 
                         root = paste(vec_words[w1], vec_words[w2], vec_words[w3], vec_words[w4], sep = " "), 
                         wpred = vec_words[pred])
        n5text <- mutate(n5text, 
                         ngImp = logPdict - logPng, 
                         predImp = logPdict_pred + log10(ngFreq/Nng), 
                         mixImp = ngImp + predImp)
        N5_select <- nrow(n5text)
        if( N5_select > 0 ) {
            filter5 <- ( n5text$logPdict > 12.5 & n5text$logPdict_pred > 2.0 ) 
            filter5 <- ifelse(n5text$bs2 >= 11, TRUE, filter5)
            n5pred <- filter(n5text, filter5) %>%
                select(wpred, pred, logPdict, logPdict_pred, bs2, mixImp, orderW) %>%
                arrange(desc(bs2), desc(logPdict)) 
            n5pred_aggr <- group_by(n5pred, wpred) %>% 
                summarise( sum_bs2 = sum(bs2), sum_logPdict = sum(logPdict)) %>% 
                arrange(desc(sum_bs2), desc(sum_logPdict))
            N5_candidates <- nrow(n5pred_aggr)
        }
    }

    if( N5_candidates > 0 ) { 
        FLAG_EMPTY_5 <- FALSE 
    }

    if( N5_candidates >= 2 ) {
        FLAG_DONE <- TRUE

        candidates <- group_by(n5pred, wpred) %>% 
            summarise( sum_bs2 = sum(bs2*orderW), sum_logPdict = sum(logPdict), sum_mixImp = sum(mixImp*orderW)) %>% 
            arrange(desc(sum_bs2), desc(sum_logPdict))
    }

    if( FLAG_DONE == FALSE ) {

        n4s <- select_ngrams_xnew(ngrams = n4w, 
                                  columns = 1:3, 
                                  words = tail(w_IDs, 3), 
                                  counts = tail(w_counts, 3), 
                                  c_threshold = c_threshold) 
        n4text <- mutate(n4s$matches, 
                         logPdict_pred = -1*(log10(vec_counts[pred]/dict_size)), 
                         logPdict = as.numeric(logPdict),
                         orderW = weight_ng4)
        n4text$nm <- rowSums(n4s$flags)
        binscore4_w2 <- apply(n4s$flags, 1, function(x) { x %*% c(bs_w2[2:4], 0) } )
        n4text$bs2 <- binscore4_w2
        if( flag_n34 ) {
            n4text <- filter(n4text, nm  >=  3)
        } else if( flag_bs2 ) {
            n4text <- filter(n4text, bs2 >= 10)
        }
        # paste() replaces stringr's str_join(), which newer stringr versions no longer provide
        n4text <- mutate(n4text, 
                         root = paste(vec_words[w1], vec_words[w2], vec_words[w3], sep = " "), 
                         wpred = vec_words[pred])
        n4text <- mutate(n4text, 
                         ngImp = logPdict - logPng, 
                         predImp = logPdict_pred + log10(ngFreq/Nng), 
                         mixImp = ngImp + predImp)
        N4_select <- nrow(n4text)
        if( N4_select > 0 ) {
            # n4pred <- filter(n4text, logPdict > 8.0 & logPdict_pred > 2.0) %>% 
            filter4 <- ( n4text$logPdict > 8.0 & n4text$logPdict_pred > 2.0 ) 
            # fall back on filter4 itself (the original referred to filter5 here)
            filter4 <- ifelse(n4text$bs2 >= 10, TRUE, filter4)
            n4pred <- filter(n4text, filter4) %>%
                select(wpred, pred, logPdict, logPdict_pred, bs2, mixImp, orderW) %>%
                arrange(desc(bs2), desc(logPdict)) 
            n4pred_aggr <- group_by(n4pred, wpred) %>% 
                summarise( sum_bs2 = sum(bs2), sum_logPdict = sum(logPdict)) %>% 
                arrange(desc(sum_bs2), desc(sum_logPdict))

            N4_candidates <- nrow(n4pred_aggr)
        }
        if( N4_candidates > 0 ) { 
            FLAG_EMPTY_4 <- FALSE 
        }

        if( N4_candidates >= 2 ) {
            FLAG_DONE <- TRUE
            candidates <- group_by(n4pred, wpred) %>% 
                summarise( sum_bs2 = sum(bs2*orderW), sum_logPdict = sum(logPdict), sum_mixImp = sum(mixImp*orderW)) %>% 
                arrange(desc(sum_bs2), desc(sum_logPdict))
        }
    }

    if( FLAG_DONE == FALSE ) {

        if( FLAG_EMPTY_5 == FALSE & FLAG_EMPTY_4 == FALSE ) {
            candidates <- rbind(n5pred, n4pred) %>% group_by(wpred) %>% 
                summarise( sum_bs2 = sum(bs2*orderW), sum_logPdict = sum(logPdict), sum_mixImp = sum(mixImp*orderW)) %>% 
                arrange(desc(sum_bs2), desc(sum_logPdict))
            FLAG_DONE <- TRUE
        } else if( FLAG_EMPTY_5 == FALSE & FLAG_EMPTY_4 == TRUE ) {
            candidates <- group_by(n5pred, wpred) %>% 
                summarise( sum_bs2 = sum(bs2*orderW), sum_logPdict = sum(logPdict), sum_mixImp = sum(mixImp*orderW)) %>% 
                arrange(desc(sum_bs2), desc(sum_logPdict))
            FLAG_DONE <- TRUE
        } else if( FLAG_EMPTY_5 == TRUE & FLAG_EMPTY_4 == FALSE ) {
            candidates <- group_by(n4pred, wpred) %>% 
                summarise( sum_bs2 = sum(bs2*orderW), sum_logPdict = sum(logPdict), sum_mixImp = sum(mixImp*orderW)) %>% 
                arrange(desc(sum_bs2), desc(sum_logPdict))
            FLAG_DONE <- TRUE
        }
    }

    if( FLAG_DONE == FALSE ) {

        n3s <- select_ngrams_xnew(ngrams = n3w, 
                                  columns = 1:2, 
                                  words = tail(w_IDs, 2), 
                                  counts = tail(w_counts, 2), 
                                  c_threshold = c_threshold) 
        if( nrow(n3s$matches) > 0 ) { 
            n3text <- mutate(n3s$matches, 
                             logPdict_pred = -1*(log10(vec_counts[pred]/dict_size)), 
                             logPdict = as.numeric(logPdict),
                             orderW = weight_ng3)
            n3text$nm <- rowSums(n3s$flags)
            binscore3_w2 <- apply(n3s$flags, 1, function(x) { x %*% c(bs_w2[3:4], 0, 0) } )
            n3text$bs2 <- binscore3_w2
            # paste() replaces stringr's str_join(), which newer stringr versions no longer provide
            n3text <- mutate(n3text, 
                             root = paste(vec_words[w1], vec_words[w2], sep = " "), 
                             wpred = vec_words[pred])
            n3text <- mutate(n3text, 
                             ngImp = logPdict - logPng, 
                             predImp = logPdict_pred + log10(ngFreq/Nng), 
                             mixImp = ngImp + predImp)
            # N3_matches <- nrow(n3s$matches)
            N3_matches <- nrow(n3text)
            if( N3_matches > 0 ) { select3 <- flag_best_matches(data = n3text, order = 2) }
            N3_select <- sum(select3)
            if( N3_select > 0 ) {
                # wpred is kept in the selection because the final aggregation groups on it
                n3pred <- n3text[select3, ] %>% 
                    filter(logPdict > 6.0 & logPdict_pred > 2.0) %>% 
                    select(wpred, pred, logPdict, logPdict_pred, bs2, mixImp, orderW) %>%
                    arrange(desc(bs2), desc(logPdict)) 
                n3pred_aggr <- group_by(n3pred, pred) %>% 
                    summarise( sum_bs2 = sum(bs2), sum_logPdict = sum(logPdict)) %>% 
                    arrange(desc(sum_bs2), desc(sum_logPdict))
                N3_candidates <- nrow(n3pred_aggr)
            }
            if( N3_candidates > 0 ) {
                candidates <- group_by(n3pred, wpred) %>% 
                    summarise( sum_bs2 = sum(bs2*orderW), sum_logPdict = sum(logPdict), sum_mixImp = sum(mixImp*orderW)) %>% 
                    arrange(desc(sum_bs2), desc(sum_logPdict))
                FLAG_DONE <- TRUE
            }
        }
    }
    # nothing matched at any order: return the "could not find a match" table
    if( FLAG_DONE == FALSE ) {
        candidates <- fall_back2
    }

    colnames(candidates) <- c("prediction", "score", "score_alt") 
    candidates$prediction <- sapply(candidates$prediction, function(x) {reverse_refmt(input = x)}, USE.NAMES = FALSE)

    top10 <- head(candidates[, OutputColumns], how_many)

    # top10 <- mutate(top10, prediction = reverse_refmt(prediction))
    # top10$prediction <- sapply(top10$prediction, function(x) {reverse_refmt(input = x)}, USE.NAMES = FALSE)

    return(top10)
}


get_word_ID <- function( dict, word ) {
    ID <- dict[dict$word == word, 1]
    # Return first ID value to avoid duplication (which should not actually be possible)
    # If not found return 0
    if( length(ID) > 0 ) {
        output <- ID[1] 
    } else {
        output <- 0 
    }
    return(output)
}

get_word_count <- function( dict, word ) {
    count <- dict[dict$word == word, 3]
    # Return first count value to avoid duplication (which should not actually be possible)
    # If not found return 0
    if( length(count) > 0 ) {
        output <- count[1] 
    } else {
        output <- 0 
    }
    return(output)
}

flag_best_matches <- function(data = NULL, order = NULL) {

    # tier1_bs2  <- c(7, 11, 13, 14, 15)   # nm >= 3
    # tier2_bs2  <- c(6, 10, 12)
    # tier3_bs2  <- c(5, 8, 9)
    # tier4_bs2  <- c(1, 2, 3, 4)          # these are pretty poor

    check_A <- rep(FALSE, nrow(data))
    check_B <- rep(FALSE, nrow(data))
    check_C <- rep(FALSE, nrow(data))
    best <- check_A

    check_A <- ( data$nm == order )
    n_A <- sum( check_A )
    if( order > 2 ) {
        check_B <- ( data$nm == order-1 )
        n_B <- sum( check_B )
    }
    if( order > 3 ) {
        check_C <- ( data$nm == order-2 )
        n_C <- sum( check_C )
    }

    check_AB  <- ( check_A | check_B )
    check_ABC <- ( check_A | check_B | check_C )
    n_AB  <- sum( check_AB )
    n_ABC <- sum( check_ABC )

    # prefer the strictest tier that still yields at least 10 rows
    best <- check_ABC
    if( n_A >= 10 ) { 
        best <- check_A
    } else if( n_AB >= 10 ) {
        best <- check_AB
    }

    return(best)
}

select_ngrams_xnew <- function(ngrams = NULL, columns = NULL, words = NULL, counts = NULL, c_threshold = 1e7) {
    max_ncol <- ncol(ngrams) - 3
    if( length(columns) > max_ncol ) {
        columns <- 1:max_ncol
    }
    if( length(words) > max_ncol ) {
        words <- tail(words, max_ncol)
        # keep counts aligned with the trimmed words
        counts <- tail(counts, max_ncol)
    }
    diff <- length(columns) - length(words) 
    if( diff > 0 ) {
        i_sub <- (diff + 1):length(columns)
        columns <- columns[i_sub]
    }
    if( diff < 0 ) {
        i_sub <- (abs(diff) + 1):length(words)
        words <- words[i_sub]
        counts <- counts[i_sub]
    }
    n_max <- length(columns)

    # one logical flag vector per root position, FALSE unless the word matches
    name <- paste0("temp_flag_", sprintf("%1d", 1:4))
    for(i in 1:4) {
        assign( name[i], rep(FALSE, nrow(ngrams)) )
    }
    for(i in 1:n_max) {
        if(counts[i] > 0 && counts[i] <= c_threshold) {
            assign( name[i],  as.logical( words[i] == ngrams[, columns[i]] ) )
        }
    }
    df_flags <- data.frame(as.logical(temp_flag_1), as.logical(temp_flag_2), 
                           as.logical(temp_flag_3), as.logical(temp_flag_4) )
    colnames(df_flags) <- paste0("flag", 1:4)
    selection <- ( temp_flag_1 | temp_flag_2 | temp_flag_3 | temp_flag_4 )
    df_sel_ngrams <- filter(ngrams, selection)
    df_sel_flags <- filter(df_flags, selection)

    # df_output <- data.frame(df_sel_ngrams, df_sel_flags)
    # return( df_output )
    return( list( matches = df_sel_ngrams, flags = df_sel_flags ) )
}


text_string_refmt <- function(input = NULL) {

    # Nothing to reformat for empty input: return it unchanged.
    if( nchar(input) == 0 ) { 
        return(input)
    }
    tmp <- input

    # tmp <- gsub('([[:space:]\(\!\".?\-\/])#((?!(?:[a-z_]*[0-9]){5})[_[:alnum:]]*[[:alpha:]][_[:alnum:]]*)\b', "\\1 <HASHTAG> ", tmp, perl = TRUE);
    tmp <- gsub('([[:space:]])#((?!(?:[a-z_]*[0-9]){5})[_[:alnum:]]*[[:alpha:]][_[:alnum:]]*)\\b', "\\1 <HASHTAG> ", tmp, perl = TRUE);
    tmp <- gsub('^#((?!(?:[a-z_]*[0-9]){5})[_[:alnum:]]*[[:alpha:]][_[:alnum:]]*)\\b', "<HASHTAG> ", tmp, perl = TRUE)

    # 'll ==> _will / " will" ==> _will
    tmp <- gsub("([^[:alnum:]])-+([^[:alnum:]])", "\\1 \\2", tmp, perl = TRUE)

    tmp <- gsub("\\b(I|you|he|it|she|we|they|who|there)'ll\\b", "\\1_will", tmp, perl = TRUE)
    tmp <- gsub("\\b(I|you|he|it|she|we|they|who|there) +will\\b", "\\1_will",  tmp, perl = TRUE)

    # n't ==> _not
    tmp <- gsub("\\b(do|does|did|has|have|had|is|are|was|were|could|would|should|must|need)n\'t\\b", "\\1_not", tmp, perl = TRUE)
    tmp <- gsub("\\b(do|does|did|has|have|had|is|are|was|were|could|would|should|must|need) +not\\b", "\\1_not", tmp, perl = TRUE)

    tmp <- gsub("\\b(they|you|we)'re\\b", "\\1_are", tmp, perl = TRUE)
    tmp <- gsub("\\b(they|you|we) +are\\b", "\\1_are", tmp, perl = TRUE)

    # 've ==> _have
    tmp <- gsub("\\b(I|you|we|they|would|should|could|must|might)'ve\\b", "\\1_have", tmp, perl = TRUE)
    tmp <- gsub("\\b(I|you|we|they|would|should|could|must|might) +have\\b", "\\1_have", tmp, perl = TRUE)

    # ad hoc
    tmp <- gsub("\\b(can)'t\\b", "can_not", tmp, perl = TRUE)
    tmp <- gsub("\\b(can) +not\\b", "can_not", tmp, perl = TRUE)

    tmp <- gsub("\\b(won)'t\\b", "will_not", tmp, perl = TRUE)
    tmp <- gsub("\\b(will) +not\\b", "will_not", tmp, perl = TRUE)

    tmp <- gsub("\\b(ain)'t\\b", "is_not", tmp, perl = TRUE)

    tmp <- gsub("\\b(shan)'t\\b", "shall_not", tmp, perl = TRUE)
    tmp <- gsub("\\b(shall) +not\\b", "shall_not", tmp, perl = TRUE)

    tmp <- gsub("\\b(I)'m\\b", "I_am", tmp, perl = TRUE)
    tmp <- gsub("\\b(I) am\\b", "I_am", tmp, perl = TRUE)
    tmp <- gsub("\\bam I\\b", "am_I", tmp, perl = TRUE)

    tmp <- gsub("\\b(it)'s", "\\1_s", tmp, perl = TRUE)
    tmp <- gsub("\\b(he)'s", "\\1_s", tmp, perl = TRUE)
    tmp <- gsub("\\b(she)'s", "\\1_s", tmp, perl = TRUE)
    tmp <- gsub("\\b(that)'s", "\\1_s", tmp, perl = TRUE)
    tmp <- gsub("\\b(there)'s", "\\1_s", tmp, perl = TRUE)
    tmp <- gsub("\\b(what)'s", "\\1_s", tmp, perl = TRUE)
    tmp <- gsub("\\b(let)'s", "\\1_s", tmp, perl = TRUE)
    tmp <- gsub("\\b(here)'s", "\\1_s", tmp, perl = TRUE)
    tmp <- gsub("\\b(who)'s", "\\1_s", tmp, perl = TRUE)
    tmp <- gsub("\\b(how)'s", "\\1_s", tmp, perl = TRUE)
    tmp <- gsub("\\b(where)'s", "\\1_s", tmp, perl = TRUE)

    tmp <- gsub("\\b(it) +is", "\\1_is", tmp, perl = TRUE)
    tmp <- gsub("\\b(he) +is", "\\1_is", tmp, perl = TRUE)
    tmp <- gsub("\\b(she) +is", "\\1_is", tmp, perl = TRUE)
    tmp <- gsub("\\b(that) +is", "\\1_is", tmp, perl = TRUE)
    tmp <- gsub("\\b(there) +is", "\\1_is", tmp, perl = TRUE)

    tmp <- gsub("\\b((shit|piss|fuck|cunt|cocksuck|motherfuck|tits)[[:alpha:]]{0,12})\\b", "<PROFANITY>", tmp, perl = TRUE)
    tmp <- gsub("\\b([[:alnum:]]+)'s\\b", "\\1", tmp, perl = TRUE)
    tmp <- gsub("\\b(a|an|as|at|no|of|on|or|by|so|up|or|no|in|to|rt)\\b", "", tmp, perl = TRUE)

    tmp <- gsub(' +<(NUMBER|PERCENTAGE|EMOTICON|HASHTAG|TELEPHONENUMBER)> +', ' ', tmp, perl = TRUE)
    tmp <- gsub(' +<(NUMBER|PERCENTAGE|EMOTICON|HASHTAG|TELEPHONENUMBER)>([^[:alpha:]])', '\\2', tmp, perl = TRUE)

    tmp <- gsub("[,;:]", " ", tmp, perl = TRUE)

    tmp <- gsub(" +", " ", tmp, perl = TRUE)
    tmp <- gsub("^ +", "", tmp, perl = TRUE)
    tmp <- gsub(" +$", "", tmp, perl = TRUE)

    refmt <- tmp
    return(refmt)
}

reverse_refmt <- function(input = NULL) {

    # Nothing to convert back for empty input: return it unchanged.
    if( nchar(input) == 0 ) { 
        return(input)
    }
    tmp <- input
    # rev_refmt <- system2("./scripts/", input = input, stdout = TRUE) 

    tmp <- gsub("\\b(i|you|he|it|she|we|they|who|there)_will\\b", "\\1 will", tmp, perl = TRUE)

    tmp <- gsub("\\b(do|does|did|has|have|had|is|are|was|were|could|would|should|must|need)_not\\b", "\\1 not", tmp, perl = TRUE)
    tmp <- gsub("\\bdon_t\\b", "don't", tmp, perl = TRUE)

    tmp <- gsub("\\b(they|you|we)_are\\b", "\\1 are", tmp, perl = TRUE)

    tmp <- gsub("\\b(i|you|we|they|would|should|could|must|might)_have\\b", "\\1 have", tmp, perl = TRUE)

    tmp <- gsub("\\b(can|will|is|shall)_not\\b", "\\1 not", tmp, perl = TRUE)
    tmp <- gsub("\\bi_am\\b", "I am", tmp, perl = TRUE)
    tmp <- gsub("\\bam_i\\b", "am I", tmp, perl = TRUE)

    tmp <- gsub("\\b(it|he|she|that|there|what|let|here|who|how|where)_s\\b", "\\1's", tmp, perl = TRUE)

    tmp <- gsub("\\b(it|he|she|that|there)_is\\b", "\\1 is", tmp, perl = TRUE)

    tmp <- gsub("\\bo_clock\\b", "o'clock", tmp, perl = TRUE)

    rev_refmt <- tmp
    return(rev_refmt)
}
