- To creatively implement a general-purpose text classifier
- To write code with sound object-oriented design
- To learn test-driven development (TDD)
Dependencies: cmake
To run on Linux/Mac:
- Clone the repo
- In the project directory, I recommend creating a build folder: `mkdir build`
- Run `cd build`
- Run `cmake ..`
- Finally, run `make`
A CountVectorizer is a text feature-extraction class from the popular ML library scikit-learn. In their words, a CountVectorizer: "Converts a collection of text documents to a matrix of token counts" (source).
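Conceptually, it is a table with one column per distinct word and one row per document, where each cell holds the number of times that word occurs in that document.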
My implementation stores the vocabulary in a word array that acts as the header, alongside a vector of pointers to sentences.
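The layout described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `CountVectorizerSketch`, `SentenceRow`, `addSentence`, and `count` are hypothetical, and splitting on whitespace is an assumption about tokenization.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the project's actual API.
struct SentenceRow {
    std::vector<int> counts;  // counts[i] = occurrences of vocabulary[i]
};

class CountVectorizerSketch {
public:
    // Split one sentence on whitespace (an assumed tokenizer) and store its counts.
    void addSentence(const std::string& text) {
        auto row = std::make_unique<SentenceRow>();
        row->counts.assign(vocabulary_.size(), 0);
        std::istringstream stream(text);
        std::string word;
        while (stream >> word) {
            const std::size_t column = columnFor(word);
            // A new word extends the header, so this row grows to match.
            if (column >= row->counts.size()) row->counts.resize(column + 1, 0);
            ++row->counts[column];
        }
        rows_.push_back(std::move(row));
    }

    // Count of `word` in sentence `index`; 0 if the word is unknown
    // or entered the vocabulary after that sentence was stored.
    int count(std::size_t index, const std::string& word) const {
        assert(index < rows_.size());
        for (std::size_t i = 0; i < vocabulary_.size(); ++i) {
            if (vocabulary_[i] == word) {
                const std::vector<int>& counts = rows_[index]->counts;
                return i < counts.size() ? counts[i] : 0;
            }
        }
        return 0;
    }

    const std::vector<std::string>& vocabulary() const { return vocabulary_; }
    std::size_t sentenceCount() const { return rows_.size(); }

private:
    // Return the column for `word`, appending it to the header if it is new.
    std::size_t columnFor(const std::string& word) {
        for (std::size_t i = 0; i < vocabulary_.size(); ++i) {
            if (vocabulary_[i] == word) return i;
        }
        vocabulary_.push_back(word);
        return vocabulary_.size() - 1;
    }

    std::vector<std::string> vocabulary_;             // the "header" word array
    std::vector<std::unique_ptr<SentenceRow>> rows_;  // pointers to sentences
};
```

With this sketch, adding "the cat sat" and then "the dog saw the cat" gives the header `the, cat, sat, dog, saw`, and the second row holds the counts `2, 1, 0, 1, 1`. The linear vocabulary lookup keeps the example short; a hash map from word to column would be the usual choice for larger inputs.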
The CountVectorizer pairs readily with any number of classification algorithms. At the time of writing, two algorithms have been built: a simple weighted average classifier and a Bayesian classifier (inspired by scikit-learn's naive Bayes model) (source).
The CountVectorizer is a standalone class that is only concerned with holding the data. The classifiers inherit from a base classifier to reduce redundancy.
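One way this split could look is sketched below. It is an illustration under stated assumptions rather than the project's real classes: `BaseClassifierSketch`, `NaiveBayesSketch`, `fit`, `predict`, `train`, and `score` are hypothetical names, labels are assumed to be binary, the count data is passed as a plain matrix instead of a CountVectorizer, and the Bayesian variant shown is multinomial naive Bayes with add-one smoothing, which may differ from what the project implements.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using CountRow = std::vector<int>;  // token counts for one sentence

// Hypothetical base class: holds the logic every classifier would share.
class BaseClassifierSketch {
public:
    virtual ~BaseClassifierSketch() = default;

    // Input validation lives here once; subclasses only supply the math.
    void fit(const std::vector<CountRow>& rows, const std::vector<bool>& labels) {
        assert(!rows.empty() && rows.size() == labels.size());
        train(rows, labels);
        trained_ = true;
    }

    bool predict(const CountRow& row) const {
        assert(trained_);
        return score(row) > 0.0;
    }

protected:
    virtual void train(const std::vector<CountRow>& rows,
                       const std::vector<bool>& labels) = 0;
    // A positive score means the "true" class is more likely.
    virtual double score(const CountRow& row) const = 0;

private:
    bool trained_ = false;
};

// Assumed variant: two-class multinomial naive Bayes with add-one smoothing.
class NaiveBayesSketch : public BaseClassifierSketch {
protected:
    void train(const std::vector<CountRow>& rows,
               const std::vector<bool>& labels) override {
        const std::size_t width = rows.front().size();
        // Every word starts at 1 in both classes (add-one smoothing).
        std::vector<std::vector<double>> wordTotals(2, std::vector<double>(width, 1.0));
        double classTotals[2] = {static_cast<double>(width), static_cast<double>(width)};
        double docs[2] = {1.0, 1.0};  // smoothed document counts per class
        for (std::size_t r = 0; r < rows.size(); ++r) {
            assert(rows[r].size() == width);
            const int c = labels[r] ? 1 : 0;
            docs[c] += 1.0;
            for (std::size_t w = 0; w < width; ++w) {
                wordTotals[c][w] += rows[r][w];
                classTotals[c] += rows[r][w];
            }
        }
        priorLogRatio_ = std::log(docs[1] / docs[0]);
        logRatio_.assign(width, 0.0);
        for (std::size_t w = 0; w < width; ++w) {
            logRatio_[w] = std::log(wordTotals[1][w] / classTotals[1]) -
                           std::log(wordTotals[0][w] / classTotals[0]);
        }
    }

    double score(const CountRow& row) const override {
        double total = priorLogRatio_;
        const std::size_t width = std::min(row.size(), logRatio_.size());
        for (std::size_t w = 0; w < width; ++w) total += row[w] * logRatio_[w];
        return total;
    }

private:
    std::vector<double> logRatio_;  // per-word log-likelihood ratio
    double priorLogRatio_ = 0.0;    // log ratio of the class priors
};
```

Because `fit` and `predict` are implemented once in the base class, a second classifier such as the weighted average one would only need its own `train` and `score`.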
The training data must be in a specific format to be used with this library. The model takes two input files: a "features" file and a "labels" file. The format is this: