Giovanni Fossati
April 26 2015
Rice University
Astrophysicist
tm suite. Too heavy and cumbersome.
perl and in R, mainly dplyr, NLP and RWeka.
perl “offline” (can't beat it for regex work).
R either directly or by relying on an external perl script.
As noted, I put a major effort into understanding the idiosyncrasies of the textual data, with the expectation that a deep cleaning would truly make a difference in the prediction context.
Among the main transformations applied to the text:
<PROFANITY>.
<EMOJ>.
<HASHTAG>.
My final algorithm is basically a linear interpolation of 3-/4-/5-grams.
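To make the idea of linear interpolation concrete, here is a minimal sketch. It is written in Python for illustration only, not in the R/perl used for the project, and it is not the actual implementation. The toy corpus, the function names (build_ngram_counts, interpolated_scores) and the weights 0.2/0.3/0.5 are all assumptions made up for this example. Each n-gram order gives a maximum-likelihood estimate of the next word, and the estimates are blended with fixed weights.

```python
from collections import Counter, defaultdict


def build_ngram_counts(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts


def interpolated_scores(history, models, weights):
    """Blend maximum-likelihood next-word estimates from several n-gram orders."""
    scores = Counter()
    for n, counts in models.items():
        if len(history) < n - 1:
            continue  # not enough words typed for this order
        context = tuple(history[-(n - 1):])
        followers = counts.get(context)
        if not followers:
            continue  # context never seen at this order
        total = sum(followers.values())
        for word, c in followers.items():
            scores[word] += weights[n] * c / total
    return scores


# Toy corpus, purely illustrative.
corpus = ("i want to go to the store and i want to go to the park "
          "and i want to go home").split()

models = {n: build_ngram_counts(corpus, n) for n in (3, 4, 5)}
weights = {3: 0.2, 4: 0.3, 5: 0.5}  # hypothetical weights, chosen to sum to 1

scores = interpolated_scores("i want to go".split(), models, weights)
best = scores.most_common(1)[0][0]
print(best, dict(scores))
```

In this toy case all three orders see the same two candidates after "i want to go", so the blended score simply reproduces their shared relative frequencies. With real data the orders disagree, and the weights decide how much the longer, sparser contexts are trusted over the shorter ones.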
perl script).
agrep.
It was great, but extremely computationally expensive.
agrep to evaluate the closeness of n-grams to the input text.
I hope it will work well once deployed and that you will enjoy it (and I wish I had more space to illustrate the technical details).