ArXiv Keyword Predictor

Overview

This Python repository builds a neural network that attempts to predict the keywords of an arXiv manuscript from the words in its title and abstract.

How it Works

The Data

The data comes from a Kaggle dataset, available at https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts. The dataset contains only three columns: the title, abstract, and keywords associated with each manuscript.
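As a rough illustration of reading such a file, the sketch below parses a tiny in-memory CSV. The column names (`titles`, `summaries`, `terms`), the stringified-list encoding of the keywords, and the sample rows are all assumptions made for this example, so check them against the downloaded file before relying on them.

```python
import ast
import csv
import io

# Made-up rows for illustration only; the column names are assumptions.
SAMPLE_CSV = '''titles,summaries,terms
"A Study of Graph Networks","We study graph neural networks.","['cs.LG', 'stat.ML']"
"Image Segmentation Survey","We survey segmentation methods.","['cs.CV']"
'''

def load_rows(handle):
    """Read rows and parse the stringified keyword list into a Python list."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "title": row["titles"],
            "abstract": row["summaries"],
            # Assumes keywords are stored as a Python-style list literal.
            "keywords": ast.literal_eval(row["terms"]),
        })
    return rows

rows = load_rows(io.StringIO(SAMPLE_CSV))
```

To read a real file, pass an open file handle to the hypothetical `load_rows` helper instead of the `io.StringIO` wrapper.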

Data Cleaning/Tokenization

The data was tokenized using the Keras Tokenizer. The titles and abstracts were tokenized so that only the 50 most common words were kept. A list of filler words was also removed, so that simple words like 'a', 'the', and 'where' were not included in tokenization. Keywords were tokenized as well, keeping only the 4 most common keywords.
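The idea of a frequency-limited vocabulary with filler words removed can be sketched without any dependencies. This is not the repo's code, which uses the Keras Tokenizer (where the `num_words` argument plays a role similar to `top_n` here). The helper names `build_vocab` and `encode`, the filler-word list, and the sample titles are all hypothetical.

```python
from collections import Counter
import re

# Hypothetical filler-word list; the repo's actual list is not shown here.
FILLER_WORDS = {"a", "the", "where", "of", "and", "in", "for"}

def build_vocab(texts, top_n, stop_words=frozenset()):
    """Map the top_n most frequent non-filler words to integer ids starting at 1."""
    counts = Counter(
        word
        for text in texts
        for word in re.findall(r"[a-z0-9]+", text.lower())
        if word not in stop_words
    )
    return {word: i for i, (word, _) in enumerate(counts.most_common(top_n), start=1)}

def encode(text, vocab):
    """Convert text to a list of ids, dropping words outside the vocabulary."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [vocab[w] for w in words if w in vocab]

# Made-up titles for illustration only.
titles = [
    "The structure of deep networks",
    "Deep networks for vision",
    "Where deep learning fails",
]
vocab = build_vocab(titles, top_n=2, stop_words=FILLER_WORDS)
encoded = [encode(t, vocab) for t in titles]
```

With `top_n=2`, only the two most frequent words survive, so every other word is silently dropped from the encoded sequences. The same trade-off applies to the 50-word limit described above.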

Neural Network

The neural network takes the tokenized titles and abstracts in separate streams, then joins them together to produce an output. Custom loss functions were used to weight the loss properly between classes, since the dataset is imbalanced. The layout of the network is shown in model.png.
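The repo's custom loss functions are not reproduced here. As one common way to weight a multi-label loss on imbalanced data, the sketch below scales the positive term of binary cross-entropy by each class's negative-to-positive ratio. The function names, the weighting scheme, and the toy labels are assumptions for illustration.

```python
import math

def class_weights(labels):
    """Per-class positive weights: the ratio of negative to positive examples."""
    n_samples = len(labels)
    n_classes = len(labels[0])
    weights = []
    for c in range(n_classes):
        positives = sum(row[c] for row in labels)
        # Guard against division by zero for classes with no positive examples.
        weights.append((n_samples - positives) / max(positives, 1))
    return weights

def weighted_bce(y_true, y_pred, pos_weights, eps=1e-7):
    """Mean binary cross-entropy where positive terms are scaled per class."""
    total, count = 0.0, 0
    for true_row, pred_row in zip(y_true, y_pred):
        for y, p, w in zip(true_row, pred_row, pos_weights):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count

# Toy labels: class 0 is common (3 of 4 samples), class 1 is rare (1 of 4).
labels = [[1, 0], [1, 0], [1, 0], [0, 1]]
weights = class_weights(labels)

# The same confident miss costs more on the rare class than on the common one.
loss_miss_rare = weighted_bce([[0, 1]], [[0.1, 0.1]], weights)
loss_miss_common = weighted_bce([[1, 0]], [[0.1, 0.1]], weights)
```

Under this scheme a missed rare keyword is penalized more heavily than a missed common one, which discourages the network from simply predicting the majority classes.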

Performance

Below is the classification report output by the program on an 80-20 train-test split.

Average	Precision	Recall	F1-Score
Micro Avg	0.80	0.73	0.76
Macro Avg	0.56	0.51	0.53
Weighted Avg	0.79	0.73	0.76
Samples Avg	0.83	0.79	0.78
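The gap between the micro and macro averages is what one would expect on an imbalanced dataset. Micro averaging pools the counts across all classes, while macro averaging gives every class equal weight, so poorly predicted rare classes pull it down. The sketch below shows the two computations on made-up labels. The helper names are hypothetical, and the numbers are unrelated to the results above.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def per_class_counts(y_true, y_pred):
    """Return a (tp, fp, fn) tuple for each class column."""
    counts = []
    for c in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        counts.append((tp, fp, fn))
    return counts

def micro_average(counts):
    """Pool the counts over all classes, then compute the metrics once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return prf(tp, fp, fn)

def macro_average(counts):
    """Compute the metrics per class, then take the unweighted mean."""
    scores = [prf(*c) for c in counts]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

# Toy data: the common class 0 is always right, the rare class 1 is missed.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]

counts = per_class_counts(y_true, y_pred)
micro = micro_average(counts)
macro = macro_average(counts)
```

In this toy case the single missed rare label barely affects the micro scores but halves the macro scores, because the rare class counts as much as the common one.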
