- Introduction
- Dataset
- Public Test Results
- Main Approaches
- Project Structure
- Private Test Results
- Technologies Used
- Acknowledgments
This repository contains my submission for Group B of the DSC UIT 2024 challenge, which addresses the task of Multimodal Sarcasm Detection on Vietnamese Social Media Texts. In this challenge, teams are required to develop multimodal models capable of detecting sarcasm in Vietnamese social media posts that combine text and images.
Challenge Topic: Multimodal Sarcasm Detection on Vietnamese Social Media Texts
The goal is to propose a multimodal approach for sarcasm detection on a dataset of Vietnamese text-image posts gathered from social media platforms.
Submissions are evaluated based on Precision, Recall, and F1 score for predicted labels against the provided labels. Rankings are based on the F1 score.
- Only the ViMMSD dataset is permitted for use.
- Public and private test data must not be manually annotated, nor should any data augmentation be applied.
- Only pre-trained models from the approved list are allowed.
- Teams must report the pre-trained embeddings and language models used.
- Train, Dev, and Test Datasets: Available on Kaggle (search “DSC”) or by request.
- Labels: `multi-sarcasm`, `not-sarcasm`, `image-sarcasm`, and `text-sarcasm`.
- Class Distribution (imbalanced dataset):
  - Train Set: 10,805 instances (6,062 `not-sarcasm`, 4,224 `multi-sarcasm`, 77 `text-sarcasm`, 442 `image-sarcasm`)
  - Test Set: 1,413 instances for the public test, 1,504 instances for the private test
Summary of public test set results:
For more details, refer to this notebook or this CSV file.
- ViLT Paper Summary: Summary Notebook
- During testing, I observed that the ViLT inference code's output was simply a softmax layer, where the node with the highest probability produced a one-word answer.
- I modified this setup by replacing the output head with a linear layer of four nodes (one per label), then fine-tuned the model on the training dataset with a CrossEntropy loss (a fine-tuning sketch follows this list).
- Initially, the training dataset was split into training and validation sets in a stratified manner. However, I later switched to a `GroupShuffleSplit` approach to ensure that identical posts (those sharing the same caption) did not appear in both the training and validation sets (a split sketch follows this list).
- The model's input consists of the image caption (without any preprocessing) alongside the image itself.
- Training began with the model’s layers frozen, allowing only the output layer to be trained. Once stable, I incrementally unfroze additional layers and reduced the learning rate for more refined fine-tuning.
- Various combinations of learning rates, batch sizes, epochs, and layer-unfreezing strategies were tested, with the best F1 score achieved on the public test set reaching approximately 0.39.
- For more details on these experiments, refer to vilt v1 and vilt v2.
- Fine-tuning source code for ViLT is available on Colab
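Below is a minimal sketch of the group-aware split described above, assuming the training metadata is loaded into a pandas DataFrame with `caption` and `label` columns (the file path and column names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Illustrative path; the competition JSON maps ids to {image, caption, label}.
df = pd.read_json("data/json/vimmsd-train.json", orient="index")

# Group by caption so posts sharing the same caption never appear
# in both the training and the validation split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, val_idx = next(splitter.split(df, groups=df["caption"]))

train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]
print(len(train_df), len(val_df))
```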
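And a minimal sketch of the four-node classification head and the freeze-then-unfreeze schedule, assuming the Hugging Face `dandelin/vilt-b32-mlm` checkpoint with a custom head on top of `ViltModel` (the actual Colab code may differ):

```python
import torch
import torch.nn as nn
from transformers import ViltProcessor, ViltModel

LABELS = ["not-sarcasm", "multi-sarcasm", "text-sarcasm", "image-sarcasm"]

# Processor builds input_ids / pixel_values from (caption, image) pairs.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
backbone = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")

class ViltSarcasmClassifier(nn.Module):
    """ViLT backbone with a linear head of four output nodes (one per label)."""

    def __init__(self, backbone, num_labels=len(LABELS)):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(backbone.config.hidden_size, num_labels)

    def forward(self, **inputs):
        pooled = self.backbone(**inputs).pooler_output
        return self.head(pooled)  # logits; CrossEntropyLoss applies softmax internally

model = ViltSarcasmClassifier(backbone)
criterion = nn.CrossEntropyLoss()

# Stage 1: freeze the backbone and train only the new head.
for p in model.backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)

# Later stages: unfreeze the top encoder layers and lower the learning rate.
for p in model.backbone.encoder.layer[-2:].parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```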
- This approach was motivated by a few key observations:
  - The `multi-sarcasm` label includes both `text-sarcasm` and `image-sarcasm` components.
  - The `not-sarcasm` label includes both `not-text-sarcasm` and `not-image-sarcasm`.
  - The dataset is highly imbalanced.
- Given these factors, I decided on a multilabel classification approach:
  - The model now outputs two classes: `image-sarcasm` and `text-sarcasm`. Here, we assume that `multi-sarcasm` is the combination of both `text-sarcasm` and `image-sarcasm`, while `not-sarcasm` is represented by `not-text-sarcasm` and `not-image-sarcasm`.
  - This approach helps balance the previously imbalanced dataset.
- I adapted the ViLT model by modifying its output layer to a linear layer with two nodes for this multilabel setup.
- After reviewing A Survey on Multilabel Learning in Deep Learning, I tested various loss functions (a focal-loss sketch follows below):
  - `Focal Loss` (for multilabel)
  - `BinaryCrossEntropy` (for multilabel)
  - `ZLPR` (for multilabel)
  - `WeightedCrossEntropy` (for multilabel - label powerset)
  - `F1 Soft` (for multilabel - label powerset)
- Using the same strategies applied to the base ViLT model, `Focal Loss` and `WeightedCrossEntropy (WCE)` produced the best results, with F1 scores of 0.38 and 0.39, respectively.
- For further details, you can check my notes on ViLT with Focal Loss and ViLT with Weighted CrossEntropy.
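A minimal sketch of the two-node multilabel setup with a binary focal loss, assuming sigmoid outputs for `text-sarcasm` and `image-sarcasm` and the label mapping described above (the gamma/alpha values and the threshold are illustrative, not the tuned ones):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Targets are two independent binary labels: [text-sarcasm, image-sarcasm].
#   multi-sarcasm -> [1, 1]   text-sarcasm -> [1, 0]
#   image-sarcasm -> [0, 1]   not-sarcasm  -> [0, 0]

class MultilabelFocalLoss(nn.Module):
    """Binary focal loss applied independently to each of the two outputs."""

    def __init__(self, gamma=2.0, alpha=0.25):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, logits, targets):
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)  # probability assigned to the true class
        return (self.alpha * (1 - p_t) ** self.gamma * bce).mean()

def to_competition_label(probs, threshold=0.5):
    """Map the two sigmoid outputs back to the four competition labels."""
    text, image = probs[0] > threshold, probs[1] > threshold
    if text and image:
        return "multi-sarcasm"
    if text:
        return "text-sarcasm"
    if image:
        return "image-sarcasm"
    return "not-sarcasm"
```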
- Based on the assumption that icons and hashtags provide useful clues for prediction, I conducted an analysis of the dataset to extract some hidden features.
- I tested several traditional algorithms, including Logistic Regression, SVM, and tree-based models. However, the results were not promising, so this approach was discontinued.
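Purely to illustrate this (discontinued) direction, here is a toy sketch of extracting simple icon/hashtag counts from the captions and fitting a Logistic Regression baseline; the feature set, path, and column names are assumptions, not the exact features I analyzed:

```python
import re
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def hidden_features(caption: str) -> dict:
    # Simple surface cues: counts of emojis, hashtags, exclamation marks, and words.
    return {
        "n_emoji": len(EMOJI_RE.findall(caption)),
        "n_hashtag": caption.count("#"),
        "n_exclaim": caption.count("!"),
        "n_words": len(caption.split()),
    }

df = pd.read_json("data/json/vimmsd-train.json", orient="index")  # illustrative path
X = pd.DataFrame([hidden_features(c) for c in df["caption"]])
y = df["label"]

X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=42)
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
print(f1_score(y_va, clf.predict(X_va), average="macro"))
```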
- After testing various models, I selected Vintern as my Vision Language Model (VLM) based on several constraints:
  - Model size (1-2B) to fit resource limitations
  - Support for the Vietnamese language
  - Availability of public pretrained weights
  - A user-friendly API
- Unlike ViLT, Vintern offers a modern chat-response JSON format for inputs, which limited my ability to deeply modify the model (e.g., changing output layers). However, Vintern, along with the Hugging Face API, includes LoRA fine-tuning scripts, which significantly enhanced predictions when combined with instruction-based fine-tuning.
- Fine-tuning and prompt strategies improved the model's robustness:
- Most results achieved F1 scores between 0.40 and 0.41 for the best epoch across prompts.
- Fine-tuning Strategy (a sample JSONL record and a LoRA sketch follow this list):
  - Each prompt was trained one epoch at a time, saving the model after each epoch to manage GPU memory and reduce the risk of overfitting.
  - To optimize VRAM, only LoRA fine-tuning was used, and images were resized to smaller dimensions.
  - I used the same train-validation group split as for ViLT.
  - JSONL extraction file for Vintern.
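For illustration, a single record in the instruction-style JSONL file might look like the following; the field names follow a common conversation schema and are assumptions, since the exact format of my extraction file may differ:

```python
import json

# One illustrative JSONL record for instruction fine-tuning (hypothetical schema).
record = {
    "id": "train_00001",
    "image": "resized_images/train_00001.jpg",
    "conversations": [
        {"from": "human",
         "value": "<image>\n[instruction describing the four labels + the post caption]"},
        {"from": "gpt", "value": "multi-sarcasm"},
    ],
}
print(json.dumps(record, ensure_ascii=False))  # one line per record in the .jsonl file
```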
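And a minimal sketch of attaching LoRA adapters with the Hugging Face `peft` library; the checkpoint name, target modules, and hyperparameters are assumptions, since the actual fine-tuning followed Vintern's own scripts:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_ID = "5CD-AI/Vintern-1B-v2"  # illustrative checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# Attach low-rank adapters to the attention projections of the language model;
# the module names below are assumed and depend on the underlying architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```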
- Prompt Explanations (an illustrative prompt skeleton follows this list):
  - Prompt 1: Simple prompt structure.
  - Prompt 2: Indicates that `multi-sarcasm` = `text-sarcasm` + `image-sarcasm`.
  - Prompt 3: Provides a detailed explanation of each label.
  - Prompt 5: Chain of Thought (CoT) prompt, guiding the model from text and image understanding to label inference.
  - Prompt 6: Similar to Prompt 2, but denotes icons and removes redundant dots, slashes, and commas from each word.
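As an illustration only, a Prompt 2-style instruction could be structured as follows; this is a paraphrased English skeleton, not the exact Vietnamese prompt used in the competition:

```python
# Hypothetical Prompt 2-style template: states that multi-sarcasm combines
# text-sarcasm and image-sarcasm, then asks for exactly one label.
PROMPT_2_TEMPLATE = """<image>
You are given a Vietnamese social media post (image + caption).
Classify its sarcasm into exactly one label:
- text-sarcasm: the sarcasm comes from the caption only.
- image-sarcasm: the sarcasm comes from the image only.
- multi-sarcasm: both text-sarcasm and image-sarcasm are present.
- not-sarcasm: none of the above.
Caption: "{caption}"
Answer with only the label."""

print(PROMPT_2_TEMPLATE.format(caption="Trời hôm nay đẹp quá!"))
```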
- For access to the Vintern fine-tuning script used in this competition, please contact me.
- Building on the multilabel classification concept, I implemented a one-vs-all (OVA) approach with Vintern to predict each label individually.
- Results for each label are available in the `result_ova` folder:
  - Image OVA: Predicts `image-sarcasm` vs. `not-image-sarcasm`
  - Text OVA: Predicts `text-sarcasm` vs. `not-text-sarcasm`
  - Multi OVA: Predicts `multi-sarcasm` vs. `not-multi-sarcasm`
  - Yes-No OVA: Predicts `sarcasm` vs. `not-sarcasm`
  - Prompt 4: Uses a Chain of Thought (CoT) prompt to predict `sarcasm` vs. `not-sarcasm`
- Among these, only the results from the Yes-No OVA and Prompt 4 produced positive outcomes.
- To enhance results, I explored a stacking approach by combining multiple models.
- First, I combined the results from various OVA models:
  - `text OVA` + `image OVA`
  - `text OVA` + `image OVA` + `multi OVA` + `yes-no OVA`
  - `multi OVA` + `yes-no OVA`
- I used Logistic Regression, SVM, and Random Forest to stack the OVA results and also wrote stacking rules manually; however, neither approach yielded positive results (a minimal stacking sketch follows).
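A minimal sketch of the learned-stacking idea, assuming each OVA model's binary predictions have been collected as columns of a DataFrame (the path and column names are illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Each column holds one OVA model's 0/1 prediction on the validation split;
# "label" is the ground-truth four-class label.
df = pd.read_csv("merge_val/ova_predictions.csv")  # illustrative file
feature_cols = ["text_ova", "image_ova", "multi_ova", "yes_no_ova"]

stacker = LogisticRegression(max_iter=1000, class_weight="balanced")
stacker.fit(df[feature_cols], df["label"])

pred = stacker.predict(df[feature_cols])
print(f1_score(df["label"], pred, average="macro"))
```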
- Next, I stacked the `yes-no OVA` results with the Base Vintern results by:
  - Counting the frequency of each pair of predicted labels from both models.
  - Manually evaluating these label pairs against the ground truth to assess correctness.
  - Identifying the pairs that yielded correct predictions and updating the Base Vintern results accordingly (a sketch of this pairing logic follows below).
- You can see the approach in this notebook.
- As shown above, this approach improved the results!
- You can further explore this approach here.
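A minimal sketch of the pair-frequency merge, assuming both models' predictions and the ground truth are aligned in one DataFrame; the column names and the override rule are illustrative, not the exact rules from the notebook:

```python
import pandas as pd

# base  = Base Vintern four-class prediction
# yesno = yes-no OVA prediction (sarcasm / not-sarcasm)
df = pd.read_csv("merge_val/base_vintern_vs_yesno.csv")  # illustrative file

# 1. Count how often each (base, yesno) prediction pair occurs.
pair_counts = df.groupby(["base", "yesno"]).size()

# 2. For each pair, check how often the base prediction matches the ground truth.
pair_accuracy = (
    df.assign(correct=df["base"] == df["label"])
      .groupby(["base", "yesno"])["correct"].mean()
)
print(pair_counts, pair_accuracy, sep="\n")

# 3. Override the base prediction for pairs that are usually wrong
#    (a single illustrative rule shown here).
overrides = {("multi-sarcasm", "not-sarcasm"): "not-sarcasm"}
df["merged"] = df.apply(
    lambda row: overrides.get((row["base"], row["yesno"]), row["base"]), axis=1
)
```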
- After additional testing, I overfitted the Base Vintern by training for more epochs (epochs 4, 5, 6, 7 for Vintern 1 and epochs 3, 4, 5 for Vintern 2).
- This decision was based on the assumption that the `yes-no OVA` could strengthen results for `not-sarcasm` and `multi-sarcasm`, so I focused on improving `image-sarcasm` and `text-sarcasm`.
- Note:
  - Text sarcasm instances were limited, making it challenging to improve results.
  - Attempts to increase epochs for text sarcasm alone negatively impacted the model.
  - I also experimented with Weighted CrossEntropy to emphasize `text-sarcasm`, but results did not improve.
- I then stacked the Base Vintern model with `yes-no OVA` and additional image-sarcasm models (e.g., `vintern_v1_epoch5`).
- Using the same strategies, I manually evaluated label pairs.
- This approach led to further improvements!
- You can explore this strategy in more detail here.
- Continuing with the stacking concept, I attempted to merge three or more models with the `yes-no OVA`, including:
  - Models with high scores on the training set.
  - Models with high scores on the test set.
  - Models with stable scores on both train and test sets.
  - The top-performing models across train and test, plus one model with consistent scores.
- Although many combinations did not perform well, some produced stable and high F1 scores.
- Below is the result for merging [Base Vintern] + [Vintern e1 epoch 5] + [Vintern e1.1 epoch 7] + [Yes-no epoch 2].
- You can check the results in this notebook.
- I also experimented with traditional machine learning methods to determine the optimal weights for each model in the stack.
- Details of this approach can be found here.
- However, due to instability in the results, I ultimately decided to discontinue this approach.
- `data/`: Contains all data files and folders.
  - `csv/`: CSV files.
  - `json/`: JSON files.
  - `jsonl/`: JSONL files, primarily for Vintern.
  - `*-images/`: Image folders from the competition across 3 phases.
  - `resized_images/`: Resized images for Vintern fine-tuning.
  - `weights/`: Fine-tuned weights for ViLT and Vintern.
  - `train_val_group_split.txt`: File with train and validation IDs (split from the original training set of the competition).
- `emojis/`: All files related to icon-emojis.
- `extraction/`: Contains extraction-related files.
  - `jsonl_extraction/`: JSONL extraction for Vintern.
  - JSON file extraction for preprocessing labels for Vintern epoch 6.
- `images/`: Contains images used in the README.
- `logs/`: Logs for Vintern and ViLT fine-tuning.
- `manually_process/`: Contains files for manual processing.
  - `text_hidden_features`: Analysis information on labels.
  - `split_test_val.ipynb`: Splits train and validation IDs.
- `merge_approach/`: Files for evaluating and executing the stacking approach.
  - `merge_ova_test.ipynb` and `merge_ova_train.ipynb`: Evaluate the stacking approach and save results to files.
  - `submission.ipynb`: Converts results to `results.json` and `results.zip` formats.
  - `merge_and_submit.ipynb`: End-to-end submission file; input is the file name, output is `results.zip`.
  - `log_merge_and_submit.txt`: Log file for `merge_and_submit.ipynb`.
  - `ml_approach.ipynb`: Uses machine learning for stacking models.
- `merge_test/` and `merge_val/`: Result files for merging on the test and validation sets.
- `result_ova/` and `results_test_val/`: Result files for the OVA and Base Vintern approaches.
- `submissions/`: Submission format files (`results.json` and `results.zip`).
- `result_private_test/` and `submissions_private_test/`: Vintern-related and merged results for the private test.
- `compare_val_test.ipynb`: Displays results for the validation and public test sets.
- `private_test_analysis.ipynb`: Merges and saves results for the private test set.
- `models.csv`: CSV file with results on the validation and public test sets.
- `web_image.html`: Hosts and renders images, mainly for image analysis.
Final model choices were based on stability and high F1 scores across train and test sets:
Selected Models:
- vintern_v1_e3: `te_b2_1epoch5_no_epoch2`
- vintern_v1_1_e4: `te_b2_1epoch5_b3_1.1_e7_no_e2`
- vintern_v1_1_1_e6: `te_b2_1epoch5_b3_1.1_e7_no_e2`
Final private test score: ~0.40 F1.
- Kaggle
- Google Colab
- PyTorch
- Hugging Face