Text Classification with CountVectorizer

Project Goals

To creatively implement a general-purpose text classifier
To write code with sound object-oriented design
To learn test-driven development (TDD)

Project Overview:

Getting Started

Dependencies: cmake

To run on Linux/Mac:

Clone the repo
In the project directory, I recommend creating a build folder: mkdir build
Run cd build
Run cmake ..
Finally, run make

Usage

What is a CountVectorizer?

A CountVectorizer is a text-storage data structure from the popular ML library scikit-learn. In their words, a CountVectorizer "converts a collection of text documents to a matrix of token counts" (source).

Conceptually, it looks something like this:
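As a minimal sketch of the idea (not this library's actual code), the snippet below builds a document-by-word count matrix from whitespace-separated text. The names CountMatrix and buildCountMatrix are hypothetical and exist only for illustration.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical illustration type: one column per distinct word,
// one row of counts per document.
struct CountMatrix {
    std::vector<std::string> vocabulary;    // column labels, in sorted order
    std::vector<std::vector<int>> counts;   // counts[d][w] = occurrences of word w in document d
};

// Hypothetical helper: tokenizes on whitespace only (no lowercasing,
// no punctuation handling) to keep the sketch short.
CountMatrix buildCountMatrix(const std::vector<std::string>& documents) {
    // First pass: collect every distinct token (std::map keeps them sorted).
    std::map<std::string, std::size_t> columnOf;
    for (const auto& doc : documents) {
        std::istringstream words(doc);
        std::string token;
        while (words >> token) columnOf.emplace(token, 0);
    }

    // Assign each token a column index in sorted order.
    CountMatrix matrix;
    std::size_t next = 0;
    for (auto& entry : columnOf) {
        entry.second = next++;
        matrix.vocabulary.push_back(entry.first);
    }

    // Second pass: fill one row of counts per document.
    for (const auto& doc : documents) {
        std::vector<int> row(matrix.vocabulary.size(), 0);
        std::istringstream words(doc);
        std::string token;
        while (words >> token) ++row[columnOf[token]];
        matrix.counts.push_back(row);
    }
    return matrix;
}
```

For the two documents "the cat sat" and "the cat saw the dog", the vocabulary is cat, dog, sat, saw, the, and the rows are 1 0 1 0 1 and 1 1 0 1 2.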

How is it implemented?

My implementation has a word array as a header, and a vector of pointers to sentences:
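One way such a layout could look in C++ is sketched below. This is an assumption-laden illustration, not the code in this repository: the class name CountVectorizerSketch, the SentenceCounts struct, and every method name are hypothetical, and the real implementation may store sentences differently.

```cpp
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row type: counts[i] is the number of times wordArray_[i]
// occurs in this sentence.
struct SentenceCounts {
    std::vector<int> counts;
};

// Hypothetical sketch: a word array acts as the header, and each stored
// sentence is reached through a pointer held in a vector.
class CountVectorizerSketch {
public:
    void addSentence(const std::string& sentence) {
        auto row = std::make_unique<SentenceCounts>();
        row->counts.assign(wordArray_.size(), 0);
        std::istringstream words(sentence);
        std::string token;
        while (words >> token) {
            const std::size_t col = columnFor(token);
            // A new word extends the header, so grow this row to match.
            if (col >= row->counts.size()) row->counts.resize(col + 1, 0);
            ++row->counts[col];
        }
        sentences_.push_back(std::move(row));
    }

    // Rows added before a word was first seen are shorter than the header;
    // their missing entries are treated as zero.
    int count(std::size_t sentence, const std::string& word) const {
        const auto& counts = sentences_.at(sentence)->counts;
        for (std::size_t i = 0; i < wordArray_.size(); ++i) {
            if (wordArray_[i] == word) return i < counts.size() ? counts[i] : 0;
        }
        return 0;  // word never seen
    }

    std::size_t vocabularySize() const { return wordArray_.size(); }

private:
    // Linear search keeps the sketch short; a hash map would scale better.
    std::size_t columnFor(const std::string& token) {
        for (std::size_t i = 0; i < wordArray_.size(); ++i) {
            if (wordArray_[i] == token) return i;
        }
        wordArray_.push_back(token);
        return wordArray_.size() - 1;
    }

    std::vector<std::string> wordArray_;                      // header
    std::vector<std::unique_ptr<SentenceCounts>> sentences_;  // one pointer per sentence
};
```

Holding rows behind pointers means the outer vector can grow without copying the per-sentence count arrays.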

How can it be used for text classification?

The CountVectorizer pairs readily with many classification algorithms. At the time of writing, two have been built: a simple weighted average classifier and a Bayesian classifier (inspired by scikit-learn's naive Bayes models). (source)
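To illustrate how count vectors can feed a Bayesian classifier, here is a minimal sketch of multinomial naive Bayes with add-one (Laplace) smoothing. It is a textbook formulation rather than this library's classifier, and the class name NaiveBayesSketch and its methods are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sketch of multinomial naive Bayes over token-count rows.
// Assumes every row has the same length and every class appears at least once.
class NaiveBayesSketch {
public:
    // rows[i] holds the token counts of document i;
    // labels[i] is its class, an integer in [0, numClasses).
    void fit(const std::vector<std::vector<int>>& rows,
             const std::vector<int>& labels, std::size_t numClasses) {
        const std::size_t vocab = rows.empty() ? 0 : rows[0].size();
        std::vector<double> docsPerClass(numClasses, 0.0);
        // Start every token count at 1 (add-one smoothing), so the
        // per-class total starts at the vocabulary size.
        std::vector<std::vector<double>> tokenTotals(
            numClasses, std::vector<double>(vocab, 1.0));
        std::vector<double> classTotals(numClasses, static_cast<double>(vocab));

        for (std::size_t i = 0; i < rows.size(); ++i) {
            const std::size_t c = static_cast<std::size_t>(labels[i]);
            docsPerClass[c] += 1.0;
            for (std::size_t j = 0; j < vocab; ++j) {
                tokenTotals[c][j] += rows[i][j];
                classTotals[c] += rows[i][j];
            }
        }

        logPrior_.assign(numClasses, 0.0);
        logLikelihood_ = tokenTotals;
        for (std::size_t c = 0; c < numClasses; ++c) {
            logPrior_[c] = std::log(docsPerClass[c] / static_cast<double>(rows.size()));
            for (std::size_t j = 0; j < vocab; ++j) {
                logLikelihood_[c][j] = std::log(tokenTotals[c][j] / classTotals[c]);
            }
        }
    }

    // Returns the class with the highest log-posterior score.
    int predict(const std::vector<int>& row) const {
        int best = 0;
        double bestScore = -std::numeric_limits<double>::infinity();
        for (std::size_t c = 0; c < logPrior_.size(); ++c) {
            double score = logPrior_[c];
            for (std::size_t j = 0; j < row.size() && j < logLikelihood_[c].size(); ++j) {
                score += row[j] * logLikelihood_[c][j];
            }
            if (score > bestScore) {
                bestScore = score;
                best = static_cast<int>(c);
            }
        }
        return best;
    }

private:
    std::vector<double> logPrior_;                    // log P(class)
    std::vector<std::vector<double>> logLikelihood_;  // log P(word | class)
};
```

Working in log space avoids underflow when many small probabilities are multiplied, and the smoothing keeps a word unseen in one class from forcing that class's score to negative infinity.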

How is the code structured?

The CountVectorizer is a standalone class that is only concerned with holding the data. The classifiers inherit from a base classifier to reduce redundancy.
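The inheritance arrangement described above might be sketched as follows. All names here (BaseClassifierSketch, MajorityClassifier, accuracy) are hypothetical, and the placeholder subclass is deliberately trivial; it is not one of the two classifiers shipped with this project.

```cpp
#include <cstddef>
#include <vector>

using CountRow = std::vector<int>;  // token counts for one sentence

// Hypothetical base class: declares the interface every classifier
// implements and hosts logic they can all share.
class BaseClassifierSketch {
public:
    virtual ~BaseClassifierSketch() = default;

    virtual void fit(const std::vector<CountRow>& rows,
                     const std::vector<int>& labels) = 0;
    virtual int predict(const CountRow& row) const = 0;

    // Written once here, reused by every subclass: the fraction of rows
    // whose predicted label matches the given label.
    double accuracy(const std::vector<CountRow>& rows,
                    const std::vector<int>& labels) const {
        if (rows.empty()) return 0.0;
        std::size_t correct = 0;
        for (std::size_t i = 0; i < rows.size(); ++i) {
            if (predict(rows[i]) == labels[i]) ++correct;
        }
        return static_cast<double>(correct) / static_cast<double>(rows.size());
    }
};

// Placeholder subclass for binary labels (0 or 1): always predicts the
// most frequent training label, ignoring the input row.
class MajorityClassifier : public BaseClassifierSketch {
public:
    void fit(const std::vector<CountRow>& /*rows*/,
             const std::vector<int>& labels) override {
        std::size_t ones = 0;
        for (int label : labels) {
            if (label == 1) ++ones;
        }
        majority_ = (2 * ones > labels.size()) ? 1 : 0;
    }

    int predict(const CountRow& /*row*/) const override { return majority_; }

private:
    int majority_ = 0;
};
```

Because accuracy only calls the virtual predict, any new classifier gets evaluation for free by overriding fit and predict.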

A note on data

The training data must be in a specific format to be used with this library. The model takes two input files: a "features" file and a "labels" file. The format is this:

NOTE: This is part of a dataset that was created for the Paper 'From Group to Individual Labels using Deep Features', Kotzias et. al,. KDD 2015

tjdolan121/TextClassifierLib