Deep Learning for Text Pairs Relation Classification

This repository is my bachelor's graduation project, and it also serves as a study of TensorFlow and deep learning (CNN, RNN, etc.).

The main objective of the project is to determine whether two given sentences are similar in meaning (a binary classification problem) using neural networks (FastText, CNN, LSTM, etc.).

Requirements

Python 3.6
Tensorflow 1.15.0
Tensorboard 1.15.0
Sklearn 0.19.1
Numpy 1.16.2
Gensim 3.8.3
Tqdm 4.49.0

Project

The project structure is below:

.
├── Model
│   ├── test_model.py
│   ├── text_model.py
│   └── train_model.py
├── data
│   ├── word2vec_100.model.* [Need Download]
│   ├── Test_sample.json
│   ├── Train_sample.json
│   └── Validation_sample.json
├── utils
│   ├── checkmate.py
│   ├── data_helpers.py
│   └── param_parser.py
├── LICENSE
├── README.md
└── requirements.txt

Innovation

Data part

Support both Chinese and English data (using jieba or nltk).
Support your own pre-trained word vectors (using gensim).
Add embedding visualization based on TensorBoard (metadata.tsv must be created first).
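
As a rough illustration of the metadata.tsv step, the sketch below writes one token per line in embedding-row order. The function name `write_metadata`, the example vocabulary, and the temporary output path are assumptions made for this example, not code from the repository. A single-column metadata file for the TensorBoard projector should not contain a header row.

```python
import os
import tempfile


def write_metadata(vocab, path):
    """Write one token per line, in the same order as the embedding rows.

    With a single column, the projector expects no header row,
    so the tokens are written as-is.
    """
    with open(path, "w", encoding="utf-8") as f:
        for token in vocab:
            f.write(token + "\n")


# Hypothetical vocabulary: index i must match row i of the embedding matrix.
vocab = ["<PAD>", "invention", "fiber", "calcium"]
metadata_path = os.path.join(tempfile.mkdtemp(), "metadata.tsv")
write_metadata(vocab, metadata_path)
```

In a real run the vocabulary would come from the word2vec model used for the embedding layer, so that labels and vectors line up.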

Model part

Add the correct L2 loss calculation operation.
Add gradient clipping to prevent exploding gradients.
Add learning rate decay (exponential decay).
Add a new Highway Layer (which proved useful for model performance).
Add a Batch Normalization Layer.
Add several performance measures (especially AUC), since the data is imbalanced.
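
To make two of the items above concrete, here is a minimal pure-Python sketch of exponential learning rate decay and gradient clipping. It assumes the same formulas that TensorFlow documents for `tf.train.exponential_decay` and `tf.clip_by_global_norm`. Whether the repository clips by global norm is an assumption, and all hyperparameter values are illustrative only.

```python
import math


def exponential_decay(base_lr, step, decay_steps, decay_rate, staircase=False):
    """lr = base_lr * decay_rate ** (step / decay_steps)."""
    exponent = step / decay_steps
    if staircase:
        # Decay in discrete jumps instead of continuously.
        exponent = math.floor(exponent)
    return base_lr * decay_rate ** exponent


def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


# Illustrative values, not the repository's defaults.
lr = exponential_decay(base_lr=0.001, step=1000, decay_steps=500, decay_rate=0.95)
clipped, norm = clip_by_global_norm([[3.0, 4.0], [0.0, 12.0]], clip_norm=5.0)
```

Clipping by the global norm keeps the direction of the overall update unchanged and only shrinks its length, which is why it is a common choice for recurrent models.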

Code part

Choose to train the model from scratch or restore it from a checkpoint in train.py.
Create a prediction file that includes the predicted values and predicted labels for the test set in test.py.
Add other useful data preprocessing functions in data_helpers.py.
Use logging to record the whole run (including parameter display, model training info, etc.).
Provide the ability to save the best n checkpoints in checkmate.py, whereas tf.train.Saver can only save the last n checkpoints.
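
The idea of keeping the best n checkpoints can be sketched with a small min-heap. This is not the API of checkmate.py: the class `BestCheckpointTracker`, its methods, and the checkpoint names are hypothetical, and the sketch only tracks which paths to keep rather than touching files on disk.

```python
import heapq


class BestCheckpointTracker:
    """Remember the n best (score, path) pairs; a higher score is better."""

    def __init__(self, num_to_keep):
        self.num_to_keep = num_to_keep
        self._heap = []  # min-heap: the worst kept checkpoint is at index 0

    def update(self, score, path):
        """Record a checkpoint and return the path to delete, or None."""
        if len(self._heap) < self.num_to_keep:
            heapq.heappush(self._heap, (score, path))
            return None
        if score <= self._heap[0][0]:
            return path  # the new checkpoint is not good enough to keep
        _, evicted = heapq.heapreplace(self._heap, (score, path))
        return evicted

    def best(self):
        """Return kept checkpoints, best first."""
        return sorted(self._heap, reverse=True)


# Hypothetical validation AUC values per training step.
tracker = BestCheckpointTracker(num_to_keep=2)
deleted = []
for step, auc in [(100, 0.71), (200, 0.80), (300, 0.75), (400, 0.78)]:
    deleted.append(tracker.update(auc, f"model-{step}"))
```

By contrast, a saver that keeps the last n checkpoints would retain `model-300` and `model-400` here, regardless of their validation scores.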

Data

See the data format in the /data folder, which includes the data sample files. For example:

{"front_testid": "4270954", "behind_testid": "7075962", "front_features": ["invention", "inorganic", "fiber", "based", "calcium", "sulfate", "dihydrate", "calcium"], "behind_features": ["vcsel", "structure", "thermal", "management", "structure", "designed"], "label": 0}

"testid": just the id.
"features": the word segments (after removing stopwords).
"label": 0 or 1. 1 means the two sentences are similar, and 0 means they are not.
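
A minimal sketch of reading records in this format follows. It assumes the sample files store one JSON object per line, and the helper name `load_pairs` is hypothetical. The actual loading logic lives in data_helpers.py and may differ.

```python
import json


def load_pairs(lines):
    """Parse JSON-lines records into (front_tokens, behind_tokens, label) tuples."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        pairs.append(
            (record["front_features"], record["behind_features"], int(record["label"]))
        )
    return pairs


# The sample record shown above, used as in-memory input.
sample_lines = [
    '{"front_testid": "4270954", "behind_testid": "7075962", '
    '"front_features": ["invention", "inorganic", "fiber", "based", '
    '"calcium", "sulfate", "dihydrate", "calcium"], '
    '"behind_features": ["vcsel", "structure", "thermal", "management", '
    '"structure", "designed"], "label": 0}'
]
pairs = load_pairs(sample_lines)
front_tokens, behind_tokens, label = pairs[0]
```

To read from disk, pass an open file object instead of the list, for example `load_pairs(open(path, encoding="utf-8"))`.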

Text Segment

You can use the nltk package if you are dealing with English text data.
You can use the jieba package if you are dealing with Chinese text data.

Data Format

This repository can be used with other datasets (text pair similarity classification) in two ways:

Convert your dataset into the same format as the sample.
Modify the data preprocessing code in data_helpers.py.

Which option fits best depends on your data and task.
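
For the first option, a conversion step might look like the sketch below. The regex tokenizer and the tiny stopword set are stand-ins chosen so the example has no external dependencies. In practice you would swap in nltk or jieba segmentation and a full stopword list. The function names and the input sentences are hypothetical.

```python
import json
import re

# Illustrative stopword set only; replace with a real list for your language.
STOPWORDS = {"a", "an", "the", "of", "on", "is", "for"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def to_record(front_id, front_text, behind_id, behind_text, label):
    """Build one record with the same keys as the sample files."""
    return {
        "front_testid": str(front_id),
        "behind_testid": str(behind_id),
        "front_features": tokenize(front_text),
        "behind_features": tokenize(behind_text),
        "label": int(label),
    }


record = to_record("1", "A fiber based on calcium sulfate",
                   "2", "The VCSEL thermal structure", 0)
line = json.dumps(record)  # one JSON object per output line
```

Writing one such line per text pair produces a file with the same keys as the sample records.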

Pre-trained Word Vectors

You can download the Word2vec model file (dim=100). Make sure the files are unzipped and placed under the /data folder.

You can pre-train your own word vectors (based on your corpus) in several ways:

Use the gensim package.
Use the GloVe tools.
Or even use a FastText network.

🤔 Before you open a new issue, please check the data sample files under the data folder and read the other open issues first, because someone may have already asked the same question.

Usage

See Usage.

Network Structure

FastText

References:

Bag of Tricks for Efficient Text Classification

TextANN

References:

Personal ideas 🙃

TextCNN

References:

TextRNN

Warning: The model is usable but not finished yet 🤪!

TODO

Add BN-LSTM cell unit.
Add attention.

References:

Recurrent Neural Network for Text Classification with Multi-Task Learning

TextCRNN

References:

Personal ideas 🙃

TextRCNN

References:

Personal ideas 🙃

TextHAN

References:

Hierarchical Attention Networks for Document Classification

TextSANN

Warning: The model is usable but not finished yet 🤪!

TODO

Add attention penalization loss.
Add visualization.

References:

A Structured Self-Attentive Sentence Embedding

TextABCNN

Warning: Only the ABCNN-1 model is implemented so far 🤪!

TODO

Add ABCNN-3 model.

References:

ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs

About Me

黄威，Randolph

SCU SE Bachelor; USTC CS Ph.D.

Email: chinawolfman@hotmail.com

My Blog: randolph.pro

LinkedIn: randolph's linkedin
