Vietnamese Natural Language Processing Resources

Create a pull request or open an issue to add your work to this list.

Large Language Models
Corpus
Text Processing Toolkit
Pre-trained Language Model
Sentiment Analysis
Named Entity Recognition
Speech Processing

Large Language Models

GemSUra: Pretrained Large Language Models based on Gemma built by URA (HCMUT).
Ghost-7b: Fine-tuned from HuggingFaceH4/zephyr-7b-beta on a small synthetic dataset (about 200MB) that is 50% English and 50% Vietnamese.
PhoGPT: An open-source, state-of-the-art 7.5B-parameter generative model series for Vietnamese, comprising the base pre-trained monolingual model PhoGPT-7B5 and its instruction-following variant PhoGPT-7B5-Instruct.
Sailor: Sailor is a suite of Open Language Models tailored for South-East Asia (SEA), focusing on languages such as 🇮🇩Indonesian, 🇹🇭Thai, 🇻🇳Vietnamese, 🇲🇾Malay, and 🇱🇦Lao.
SeaLLM: The state-of-the-art multilingual LLM for Southeast Asian (SEA) languages 🇬🇧 🇨🇳 🇻🇳 🇮🇩 🇹🇭 🇲🇾 🇰🇭 🇱🇦 🇲🇲 🇵🇭.
ToRoLaMa: The Vietnamese Instruction-Following and Chat Model.
Vistral-7B-Chat-function-calling: This model was fine-tuned on Vistral-7B-chat for function calling.
Vistral-7B-Chat: Towards a State-of-the-Art Large Language Model for Vietnamese
ViGPTQA: LLMs for Vietnamese Question Answering
VBD-LLaMA2-Chat: A Conversationally-tuned LLaMA2 for Vietnamese.
Vietnamese LLaMA 2: A 7B version of LLaMA 2 with 140GB of Vietnamese text by BKAI Foundation Models Lab.
VinaLLaMA: Another collection of Vietnamese LLaMA-tuned models.
Vietcuna: A series of Vicuna tuned models for Vietnamese.
Llama2_vietnamese: A fine-tuned Large Language Model (LLM) for the Vietnamese language based on the Llama 2 model.
Vietnamese_LLMs: This project aims to create high-quality Vietnamese instruction datasets and tune several open-source large language models (LLMs). So far, they have released various models, including LLaMa and BLOOMZ. Additionally, they have released five instruction datasets, most of which were generated by GPT-4.

Corpus

For more recent additions, search HuggingFace for datasets that include Vietnamese: https://huggingface.co/datasets?language=language:vi&sort=trending

Math Instruction datasets: A series of translated datasets by 5CD AI Team.
LLaVA - Visual Question Answering: A series of translated datasets by 5CD AI Team.
CoT Instruction datasets: A series of translated datasets by 5CD AI Team.
DPO Instruction datasets: A series of translated datasets by 5CD AI Team.
Retrieve-Rerank datasets: A series of translated datasets by 5CD AI Team.
Coding Instruction datasets: A series of translated datasets by 5CD AI Team.
Chat Instruction datasets: A series of translated datasets by 5CD AI Team.
VN News Corpus: 50GB of uncompressed text crawled from a wide range of news websites and topics.
10000 Vietnamese Books: 10000 Vietnamese Books from 195x.
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Bactrian-X: The Bactrian-X dataset is a collection of 3.4M instruction-response pairs in 52 languages.
OSCAR: 68GB of text data with 12,036,845,359 words.
Common Crawl: Open repository of web crawl data.
WikiDumps: You can download directly or use scripts from viwik18, viwik19.
Vietnamese Treebank: VLSP Project.
Vietnamese Stopwords: Vietnamese stopwords.
Vietnamese Dictionary: Vietnamese dictionary.
vietnamese-wordnet: Vietnamese wordnet.
VietnameseWAC: The dataset comprises a substantial collection of Vietnamese text, consisting of 129,781,089 tokens and 106,464,835 words, which have been automatically segmented and labeled as per Kilgarriff, A., and Le-Hong, P., 2012.
Vietlex Corpus: Vietlex's Vietnamese Corpus, a pioneering effort in Vietnam since 1998, contains about 80 million syllables from various sources.
Lexical Database of Vietnamese: A lexical database of Vietnamese contains various lexical information derived from two Vietnamese corpora.

Text Processing Toolkit

coccoc-tokenizer: High performance tokenizer for Vietnamese language. It is written in C++ with Python and Java bindings.
RDRSegmenter: Fast and accurate Vietnamese word segmenter (LREC 2018).
RDRPOSTagger: Fast and accurate POS and morphological tagging toolkit (EACL 2014).
VnCoreNLP: A Vietnamese natural language processing toolkit (NAACL 2018).
vlp-tok: Vietnamese text processing library developed in the Scala programming language.
ETNLP: A toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings.
VietnameseTextNormalizer: Vietnamese Text Normalizer.
nnvlp: Neural network-based Vietnamese language processing toolkit.
jPTDP: Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018).
vi_spacy: Vietnamese language model compatible with Spacy.
underthesea: Underthesea - Vietnamese NLP toolkit.
vnlp: GATE plugin for Vietnamese language processing.
pyvi: Python Vietnamese toolkit.
JVnTextPro: Java-based Vietnamese text processing tool.
DongDu: C++ implementation of Vietnamese word segmentation tool.
VLSP Toolkit: Vietnamese tokenizer from VLSP.
vTools: Vietnamese NLP toolkit: Tokenizer, Sentence detector, POS tagger, Phrase chunker.
JNSP: Java Implementation of Ngram Statistic Package.

Pre-trained Language Model

RoBERTa Vietnamese: Pre-trained embedding using RoBERTa architecture on Vietnamese corpus.
PhoBERT: Pre-trained language models for Vietnamese (another implementation of RoBERTa for Vietnamese).
ALBERT for Vietnamese: "A Lite" version of BERT for Vietnamese.
Vietnamese ELECTRA: Electra pre-trained model using Vietnamese corpus.
word2vecVN: Pre-trained Word2Vec models for Vietnamese.

Sentiment Analysis

Benchmark

VLSP 2016 Shared Task: Sentiment Analysis

Train: 5100 sentences (1700 positive, 1700 neutral, 1700 negative).

Test: 1050 sentences (350 positive, 350 neutral, 350 negative).

Model	F1	Paper	Code
Perceptron/SVM/Maxent	80.05	DSKTLAB: Vietnamese Sentiment Analysis for Product Reviews
SVM/MLNN/LSTM	71.44	A Simple Supervised Learning Approach to Sentiment Classification at VLSP 2016
Ensemble: Random forest, SVM, Naive Bayes	71.22	A Lightweight Ensemble Method for Sentiment Classification Task
Ensemble: SVM, LR, LSTM, CNN	69.71	An Ensemble of Shallow and Deep Learning Algorithms for Vietnamese Sentiment Analysis
SVM	67.54	Sentiment Analysis for Vietnamese using Support Vector Machines with application to Facebook comments
SVM/MLNN	67.23	A Multi-layer Neural Network-based System for Vietnamese Sentiment Analysis at the VLSP 2016 Evaluation Campaign
Multi-channel LSTM-CNN	59.61	Multi-channel LSTM-CNN model for Vietnamese sentiment analysis	official
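
The list does not say how the F1 column above is averaged. As a minimal sketch, assuming a macro-averaged F1 over the three classes of the VLSP 2016 data (positive, neutral, negative), the score could be computed as follows. The function name `macro_f1` and the label strings are illustrative choices, not part of any shared-task tooling, and the result is a fraction in [0, 1] rather than the percentage shown in the table.

```python
from collections import Counter

# Assumed label names; the shared task may encode classes differently.
LABELS = ("positive", "neutral", "negative")


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted mean of the per-class F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # predicted p, but the gold label was something else
            fn[g] += 1      # gold label g was missed

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["positive", "neutral", "negative", "positive"]
pred = ["positive", "neutral", "positive", "negative"]
print(macro_f1(gold, pred))
```

Because the train and test sets are balanced across the three classes, macro and micro averaging give similar values here, but they can diverge on unbalanced corpora.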

VLSP 2018 Shared Task: Aspect Based Sentiment Analysis

Restaurant Dataset: 2961 reviews (train), 1290 reviews (development), 500 reviews (test).

Model	Aspect (F1)	Aspect Polarity (F1)	Paper
CNN	0.80		Deep Learning for Aspect Detection on Vietnamese Reviews
SVM	0.77	0.61	NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis
SVM	0.54	0.48	Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task

Hotel Dataset: 3000 reviews (training), 2000 reviews (development), 600 reviews (test).

Model	Aspect (F1)	Aspect Polarity (F1)	Paper
SVM	0.70	0.61	NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis
CNN	0.69		Deep Learning for Aspect Detection on Vietnamese Reviews
SVM	0.56	0.53	Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task

Vietnamese Student's Feedback Corpus (UIT-VSFC)

UIT-VSFC consists of over 16,000 sentences for sentiment analysis and topic classification.

Model	Sentiment (F1)	Topic (F1)	Paper	Code
Bi-LSTM/Word2Vec	0.896	0.92	Deep Learning versus Traditional Classifiers on Vietnamese Student’s Feedback Corpus
Maximum Entropy Classifier	0.88	0.84	UIT-VSFC: Vietnamese Student’s Feedback Corpus for Sentiment Analysis

Named Entity Recognition

Benchmark

VLSP 2016 Shared Task: Named Entity Recognition

Model	F1	Paper	Code
PhoBERT_large	94.7	PhoBERT: Pre-trained language models for Vietnamese	official
vELECTRA + BiLSTM + Attention	94.07	Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models
PhoBERT_base	93.6	PhoBERT: Pre-trained language models for Vietnamese	official
XLM-R	92.0	PhoBERT: Pre-trained language models for Vietnamese
VnCoreNLP-NER + ETNLP	91.3	ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task
BiLSTM-CNN-CRF + ETNLP	91.1	ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task
VNER: Attentive Neural Network	89.6	Attentive Neural Network for Named Entity Recognition in Vietnamese
BiLSTM-CNN-CRF	88.3	VnCoreNLP: A Vietnamese Natural Language Processing Toolkit	official
LSTM + CRF	66.07	An investigation of Vietnamese Nested Entity Recognition Models

VLSP 2018 Shared Task: Named Entity Recognition

Model	F1	Paper
vELECTRA + BiGRU	90.31	Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models
VIETNER: CRF (ngrams + word shapes + cluster + w2v)	76.63	A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Evaluation Campaign
ZA-NER	74.70	ZA-NER: Vietnamese Named Entity Recognition at VLSP 2018 Evaluation Campaign
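
NER results such as those above are usually reported as entity-level F1, where a predicted entity counts only if both its boundaries and its type match the gold annotation. The sketch below illustrates that idea under stated assumptions: tags use the flat BIO scheme, and the helper names `bio_spans` and `entity_f1` are hypothetical. It does not handle the nested entities in the VLSP 2018 data and is not the official evaluation script of either shared task.

```python
def bio_spans(tags):
    """Extract (type, start, end) spans from a BIO tag sequence; end is exclusive."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # An "I-" tag with no matching open span is ignored (a strict choice).
    return spans


def entity_f1(gold_sentences, pred_sentences):
    """Entity-level F1 over parallel lists of BIO tag sequences, in [0, 1]."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)        # exact match on type and both boundaries
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(entity_f1(gold, pred))
```

In this example the prediction finds one of the two gold entities with no false positives, so precision is 1.0 and recall is 0.5.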

Speech Processing

Corpus:

VLSP 2020 - ASR challenge - training set: announcement, unofficial mirror link on huggingface
VIVOS: official link, mirror link on huggingface
Bud500: announcement, mirror link on huggingface
FOSD (FPT open speech dataset): official link, unofficial mirror link on huggingface
LSVSC (Large-scale Vietnamese speech corpus): announcement, unofficial mirror link on huggingface
Infore: official link, unofficial mirror link for dataset 1 on huggingface, unofficial mirror link for dataset 2 on huggingface
unofficial mirror link Vivos + InfoRe 1 + InfoRe 2
VietTTS-v1: A synthesized dataset for Vietnamese TTS task (35.1 hrs)
Mozilla CommonVoice
Google FLEURS

Project

vietTTS: Tacotron + HiFiGAN vocoder for Vietnamese datasets.
