This project implements Arabic part-of-speech tagging as part of the "Natural Language Processing" course of my master's degree. The project uses the Arabic PUD dataset from Universal Dependencies and implements:
- A deep learning model (BiLSTM) for sequence labeling
- A pre-deep-learning model (KNN) for multi-class classification
- Arabic PUD Dataset
- Arabic Word Embedding
- Structure BiLSTM sequential labeling classification model
- Results
- Requirements
- References and Resources
During preprocessing, the following steps are applied:
- Remove tanween and tashkeel (Arabic diacritics)
- Remove sentences that contain non-Arabic words (i.e. English characters)
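The two preprocessing steps above can be sketched with regular expressions; the exact patterns used in the project are not shown, so the Unicode range for diacritics and the Latin-letter check below are assumptions:

```python
import re

# Arabic diacritics (tashkeel and tanween) fall in the Unicode range U+064B..U+0652 (assumed range)
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Presence of any Latin letter marks a sentence as containing non-Arabic words
LATIN = re.compile(r"[A-Za-z]")

def remove_diacritics(text):
    """Strip tanween and tashkeel marks from a sentence."""
    return DIACRITICS.sub("", text)

def is_arabic_only(text):
    """Reject sentences that contain English characters."""
    return LATIN.search(text) is None

sentences = ["مَرْحَبًا بكم", "hello عالم"]
clean = [remove_diacritics(s) for s in sentences if is_arabic_only(s)]
```

The second sentence is dropped because it contains Latin characters; the first survives with its diacritics removed.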
The most common tag in the dataset is NOUN, while the least common tag is X.
Each tag symbolizes a part of speech; refer to the image below for a description of each tag.
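Tag frequencies like the NOUN/X extremes above can be computed with a simple counter; the tag list below is a toy stand-in for the full Arabic PUD annotations, not the real data:

```python
from collections import Counter

# Toy tag sequence standing in for the flattened Arabic PUD tag column (illustrative only)
tags = ["NOUN", "VERB", "NOUN", "ADP", "NOUN", "VERB", "ADP", "NOUN", "X"]

counts = Counter(tags)
most_common_tag, _ = counts.most_common(1)[0]
least_common_tag = min(counts, key=counts.get)
```

On the real dataset the same two lines identify NOUN and X as the most and least frequent tags.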
Word embedding provides a dense representation of words and their relative meanings.
The word embedding technique used in this project is the N-Gram Word2Vec Skip-Gram model from the AraVec project, trained on Twitter data with vector size 300.
The dataset is split into 70% for training and 30% for testing.
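The 70/30 split can be done with scikit-learn's `train_test_split`, which the project already lists among its requirements; the arrays and `random_state` here are placeholders:

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for the encoded sentences and tags
X = list(range(10))
y = [0, 1] * 5

# test_size=0.3 yields the 70% train / 30% test split used in the project
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```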
Preprocessing and visualization
- conllu
- matplotlib.pyplot
- pandas
- re
- seaborn
- numpy
- tensorflow (Tokenizer, pad_sequences)
- sklearn (preprocessing.LabelEncoder, model_selection.train_test_split)
- gensim
- tensorflow
- keras.models.Sequential
- keras.layers (Dense, Embedding, Bidirectional, LSTM, TimeDistributed, InputLayer)
- sklearn.neighbors.KNeighborsClassifier
- sklearn.metrics
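The Keras layers listed above can be assembled into a BiLSTM sequence-labeling model as sketched below; the vocabulary size, sentence length, and LSTM width are assumptions, while the 300-dimensional embedding matches the AraVec vectors and 17 is the number of UPOS tags in Universal Dependencies:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Bidirectional, LSTM, TimeDistributed

MAX_LEN = 50        # assumed padded sentence length
VOCAB_SIZE = 10000  # assumed vocabulary size
EMBED_DIM = 300     # matches the AraVec vector size
N_TAGS = 17         # number of UPOS tags in Universal Dependencies

model = Sequential([
    # Maps each word index to a dense 300-dimensional vector
    Embedding(VOCAB_SIZE, EMBED_DIM),
    # Reads the sentence in both directions, emitting one vector per token
    Bidirectional(LSTM(64, return_sequences=True)),
    # Predicts a tag distribution independently at every time step
    TimeDistributed(Dense(N_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# One tag distribution per token: output shape (batch, MAX_LEN, N_TAGS)
out = model(np.zeros((2, MAX_LEN), dtype="int32"))
```

`return_sequences=True` is what turns the LSTM into a sequence labeler rather than a sentence classifier: it keeps the per-token outputs instead of only the final state.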