doc_classification_tfidf

Code for document classification with TF-IDF features and support vector machines (SVMs).

To download newsgroup data:

python get_data_newsgroups.py

This script creates the folder DATA/newsgroup containing the files train_data.tsv and test_data.tsv. Each line holds three tab-separated fields: base64-encoded document, target_name, label.
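
The line format described above can be illustrated with a minimal sketch. The helper names `encode_row` and `decode_row` are hypothetical and not part of this repository; UTF-8 text and an integer label are assumptions.

```python
import base64


def encode_row(text, target_name, label):
    """Build one TSV line: base64(document) <TAB> target_name <TAB> label."""
    # Assumption: documents are UTF-8 text; base64 keeps tabs and newlines
    # inside a document from breaking the one-record-per-line layout.
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"{encoded}\t{target_name}\t{label}"


def decode_row(line):
    """Split one TSV line back into (document text, target_name, label)."""
    encoded, target_name, label = line.rstrip("\n").split("\t")
    text = base64.b64decode(encoded).decode("utf-8")
    # Assumption: the label column is an integer class index.
    return text, target_name, int(label)


row = encode_row("Hello\tworld\nsecond line", "sci.space", 14)
text, target_name, label = decode_row(row)
```

Because the document is base64-encoded, the tab and newline inside the example text survive the round trip without being mistaken for field or record separators.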

To train a classifier on the train data:

python train.py \
--filename DATA/newsgroup/train_data.tsv \
--output_dir output_folder \
--vectorizer_type tfidf \
--feature_selection_svc

This creates output_folder, where the trained classifier is saved in Python pickle format.
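
For orientation, here is a rough sketch of what a TF-IDF plus SVM pipeline with SVC-based feature selection could look like in scikit-learn. It is not the code in train.py: the pipeline layout, the L1-penalised selector, the value of `C`, and the toy documents are all assumptions made for illustration.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline() -> Pipeline:
    """TF-IDF features -> SVC-based feature selection -> linear SVM."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        # Assumption: selection uses an L1-penalised linear SVC, which drives
        # many weights to zero; SelectFromModel keeps the remaining features.
        ("select", SelectFromModel(LinearSVC(penalty="l1", dual=False, C=10.0))),
        ("svc", LinearSVC(dual=False)),
    ])


# Toy training data, invented for this example.
docs = [
    "the rocket launch reached orbit",
    "nasa rocket orbit mission",
    "orbit launch of the new rocket",
    "the hockey team won the game",
    "hockey game goal in overtime",
    "the team scored a goal in the game",
]
labels = ["sci.space"] * 3 + ["rec.sport.hockey"] * 3

model = build_pipeline().fit(docs, labels)

# Round-trip through pickle, the format the training script is said to use.
restored = pickle.loads(pickle.dumps(model))

new_docs = ["rocket orbit", "hockey goal"]
predictions = restored.predict(new_docs)
```

Pickling the whole pipeline, rather than the classifier alone, keeps the fitted vectorizer and feature selector together with the SVM, so prediction on raw text needs only the one file.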

To evaluate the classifier on a labeled test set:

python test.py \
--filename DATA/newsgroup/test_data.tsv \
--model_path output_folder/model.p \
--output_file output_folder/results

This will write the predicted labels to the results file in the output directory.
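
Once predicted labels are available next to the true labels from the test set, a score can be computed with a few lines of standard-library Python. The sketch below works on plain lists because the layout of the results file is not documented here; `accuracy` and `confusion_counts` are hypothetical helpers, not functions from this repository.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of documents whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


# Invented labels, for illustration only.
true_labels = ["a", "b", "a", "c"]
predicted_labels = ["a", "b", "c", "c"]

score = accuracy(true_labels, predicted_labels)
pairs = confusion_counts(true_labels, predicted_labels)
```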

To use the classifier on new, unlabeled data:

python predict.py \
--filename DATA/newsgroup/test_data \
--model_path output_folder/model.p \
--output_file output_folder/results_predict

Here test_data is a plain-text file with one base64-encoded document per line.
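
An input file of this shape can be produced from raw text as sketched below. The helpers `write_unlabeled` and `read_unlabeled` are hypothetical, and UTF-8 source text is an assumption.

```python
import base64
import os
import tempfile


def write_unlabeled(path, documents):
    """Write one base64-encoded document per line."""
    with open(path, "w", encoding="ascii") as handle:
        for text in documents:
            # Assumption: documents are UTF-8 text.
            encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
            handle.write(encoded + "\n")


def read_unlabeled(path):
    """Read the file back into a list of decoded documents."""
    with open(path, encoding="ascii") as handle:
        return [
            base64.b64decode(line.strip()).decode("utf-8")
            for line in handle
            if line.strip()
        ]


documents = ["First document.\nIt spans two lines.", "Second document"]
path = os.path.join(tempfile.mkdtemp(), "test_data")
write_unlabeled(path, documents)
```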
