Giovanni Fossati
April 26 2015
Rice University
Astrophysicist
tm suite. Too heavy and cumbersome.
perl and in R mainly dplyr, NLP and RWeka.
perl “offline” (can't beat it for regex work).
R either directly or by relying on an external perl script.
As noted I put a major effort into understanding the idiosyncrasies of the textual data, with the expectation that a deep cleaning would truly make a difference in the prediction context.
Among the main transformations applied to the text:
<PROFANITY>.
<EMOJ>.
<HASHTAG>.
My final algorithm is basically a linear interpolation of 3-/4-/5-grams.
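To make the idea of linear interpolation concrete, here is a minimal sketch in Python. It is an illustration only, not the R/perl code behind the app: the function names, the toy corpus and the weights (0.2, 0.3, 0.5) are all assumptions made for the example. Each n-gram order proposes next words with their relative frequency given the last n-1 words of the input, and the proposals are summed with one weight per order.

```python
from collections import Counter, defaultdict


def count_ngrams(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table


def interpolated_scores(tables, weights, history):
    """Weighted sum of the relative frequencies proposed by each n-gram order."""
    scores = Counter()
    for n, table in tables.items():
        context = tuple(history[-(n - 1):])
        followers = table.get(context)
        if not followers:
            continue  # this order has never seen the context, so it adds nothing
        total = sum(followers.values())
        for word, count in followers.items():
            scores[word] += weights[n] * count / total
    return scores


def predict(tables, weights, history, k=3):
    """Return the k highest-scoring candidate next words."""
    ranked = interpolated_scores(tables, weights, history).most_common(k)
    return [word for word, _ in ranked]


# Toy corpus and hypothetical weights, for illustration only.
corpus = ("i want to go to the store and i want to go to the park "
          "and i want to go home").split()
tables = {n: count_ngrams(corpus, n) for n in (3, 4, 5)}
weights = {3: 0.2, 4: 0.3, 5: 0.5}

print(predict(tables, weights, "we want to go".split()))
```

In this sketch an order that has never seen the context is simply skipped and the remaining weights are not renormalized, so the scores rank candidates but are not proper probabilities. A production version would also need smoothing and weights tuned on held-out text.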
perl script).
agrep.
It was great, but extremely computationally expensive.
agrep to evaluate the closeness of n-grams to the input text.
I hope it will work well once deployed and that you will enjoy it (and I wish I had more space to illustrate the technical details).