This repo contains Python implementations of methods for extracting features from text, which I have used in my research, mostly for user input classification tasks.
Two approaches are implemented:
- One based on word embeddings, described as part of the baseline methods in [1].
- A typical statistical n-gram language modeling approach that estimates the conditional probability of a sentence given a class.
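
To make the two ideas concrete, here is a minimal sketch in plain Python. It is not the code from this repo: `mean_embedding`, `BigramClassModel`, the toy vectors and the toy sentences are all hypothetical names and data chosen for illustration. The embedding part uses a simple unweighted mean of word vectors, and the language model part assumes bigrams with add-one smoothing; the actual implementations may aggregate, weight or smooth differently.

```python
import math
from collections import Counter


def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens (zero vector if none are known)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


class BigramClassModel:
    """Bigram language model for a single class, with add-one smoothing."""

    def __init__(self, sentences):
        self.context_counts = Counter()  # how often each token is a bigram context
        self.bigram_counts = Counter()
        self.vocab = {"</s>"}
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            self.vocab.update(tokens)
            for prev, cur in zip(padded, padded[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, tokens):
        """Smoothed log-probability of a sentence under this class model."""
        padded = ["<s>"] + tokens + ["</s>"]
        v = len(self.vocab) + 1  # +1 reserves mass for unseen words
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + v
            total += math.log(numerator / denominator)
        return total


# Toy data, invented for this sketch (not taken from the dataset).
vectors = {"free": [1.0, 0.0], "prize": [0.0, 1.0]}
train = {
    "spam": [["win", "a", "free", "prize"], ["free", "prize", "now"]],
    "ham": [["see", "you", "at", "lunch"], ["lunch", "at", "noon"]],
}
models = {label: BigramClassModel(sents) for label, sents in train.items()}

sentence = ["free", "prize"]
embedding_features = mean_embedding(sentence, vectors, dim=2)
ngram_features = {label: m.log_prob(sentence) for label, m in models.items()}
predicted = max(ngram_features, key=ngram_features.get)
```

In this sketch the per-class log-probabilities can be used directly to pick the most likely class, or passed on as features to another classifier alongside the embedding vector.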
To do....
A toy example is provided to play around with. The dataset used is a randomly selected subset of the "SMS Spam Collection" dataset, available at the UCI Machine Learning Repository.
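
As a rough illustration of how such data could be read, the sketch below assumes the subset keeps the layout of the original UCI file, with one message per line in the form `label<TAB>text`. The function name `parse_labelled_lines` is hypothetical, the tokenization (lowercasing and splitting on whitespace) is an assumption, and the sample lines are made up for the example rather than taken from the dataset.

```python
def parse_labelled_lines(lines):
    """Turn 'label<TAB>message' lines into (label, tokens) pairs."""
    examples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        label, _, text = line.partition("\t")
        # Assumed tokenization: lowercase, split on whitespace.
        examples.append((label, text.lower().split()))
    return examples


# Made-up sample lines in the assumed format.
sample = ["ham\tSee you at lunch\n", "spam\tWin a FREE prize now\n", "\n"]
examples = parse_labelled_lines(sample)
```

With a real file, the same function could be applied to an open file handle, since iterating over it yields one line at a time.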
[1] Cedric De Boom, Steven Van Canneyt, Thomas Demeester, and Bart Dhoedt. 2016. Representation learning for very short texts using weighted word embedding aggregation. Pattern Recogn. Lett. 80, C (September 2016), 150-156. DOI: https://doi.org/10.1016/j.patrec.2016.06.012