This project implements Arabic part-of-speech tagging as part of the "Natural Language Processing" course of my master's degree. The project uses the Arabic PUD dataset from Universal Dependencies and implements:
- A deep learning model (BiLSTM) for sequence labeling
- A pre-deep-learning model (KNN) for multi-class classification
- Arabic PUD Dataset
- Arabic Word Embedding
- Structure BiLSTM sequential labeling classification model
- Results
- Requirements
- References and Resources
During preprocessing, the following steps are applied:
- Remove tanween and tashkeel (Arabic diacritics)
- Remove sentences that contain non-Arabic words (i.e. English characters)
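The two preprocessing steps above can be sketched with regular expressions; the exact patterns used in the project are not shown, so the Unicode range for diacritics and the Latin-letter check below are assumptions:

```python
import re

# Arabic diacritics (tashkeel and tanween) fall in the Unicode range U+064B..U+0652 (assumed range)
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Presence of any Latin letter marks a sentence as containing non-Arabic words
LATIN = re.compile(r"[A-Za-z]")

def remove_diacritics(text):
    """Strip tanween and tashkeel marks from a sentence."""
    return DIACRITICS.sub("", text)

def is_arabic_only(text):
    """Reject sentences that contain English characters."""
    return LATIN.search(text) is None

sentences = ["مَرْحَبًا بكم", "hello عالم"]
clean = [remove_diacritics(s) for s in sentences if is_arabic_only(s)]
```

The second sentence is dropped because it contains Latin characters; the first survives with its diacritics removed.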
The most common tag in the dataset is NOUN, while the least common tag is X.
Each tag symbolizes a part of speech; refer to the image below for a description of each tag.
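Tag frequencies like the NOUN/X extremes above can be computed with a simple counter; the tag list below is a toy stand-in for the full Arabic PUD annotations, not the real data:

```python
from collections import Counter

# Toy tag sequence standing in for the flattened Arabic PUD tag column (illustrative only)
tags = ["NOUN", "VERB", "NOUN", "ADP", "NOUN", "VERB", "ADP", "NOUN", "X"]

counts = Counter(tags)
most_common_tag, _ = counts.most_common(1)[0]
least_common_tag = min(counts, key=counts.get)
```

On the real dataset the same two lines identify NOUN and X as the most and least frequent tags.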
Word embedding provides a dense representation of words and their relative meanings.
The word embedding technique used in this project is the N-Gram Word2Vec Skip-Gram model from the AraVec project, trained on Twitter data with vector size 300.
The dataset is split into 70% for training and 30% for testing.
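The 70/30 split can be done with scikit-learn's `train_test_split`, which the project already lists among its requirements; the arrays and `random_state` here are placeholders:

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for the encoded sentences and tags
X = list(range(10))
y = [0, 1] * 5

# test_size=0.3 yields the 70% train / 30% test split used in the project
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```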
Preprocessing and visualization
- conllu
- matplotlib.pyplot
- pandas
- re
- seaborn
- numpy
- tensorflow (Tokenizer, pad_sequences)
- sklearn (preprocessing.LabelEncoder, model_selection.train_test_split)
- gensim
- tensorflow
- keras.models.Sequential
- keras.layers (Dense, Embedding, Bidirectional, LSTM, TimeDistributed, InputLayer)
- sklearn.neighbors.KNeighborsClassifier
- sklearn.metrics
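The Keras layers listed above can be assembled into a BiLSTM sequence-labeling model as sketched below; the vocabulary size, sentence length, and LSTM width are assumptions, while the 300-dimensional embedding matches the AraVec vectors and 17 is the number of UPOS tags in Universal Dependencies:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Bidirectional, LSTM, TimeDistributed

MAX_LEN = 50        # assumed padded sentence length
VOCAB_SIZE = 10000  # assumed vocabulary size
EMBED_DIM = 300     # matches the AraVec vector size
N_TAGS = 17         # number of UPOS tags in Universal Dependencies

model = Sequential([
    # Maps each word index to a dense 300-dimensional vector
    Embedding(VOCAB_SIZE, EMBED_DIM),
    # Reads the sentence in both directions, emitting one vector per token
    Bidirectional(LSTM(64, return_sequences=True)),
    # Predicts a tag distribution independently at every time step
    TimeDistributed(Dense(N_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# One tag distribution per token: output shape (batch, MAX_LEN, N_TAGS)
out = model(np.zeros((2, MAX_LEN), dtype="int32"))
```

`return_sequences=True` is what turns the LSTM into a sequence labeler rather than a sentence classifier: it keeps the per-token outputs instead of only the final state.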