Released on September 27, 2024
MalayMMLU is the first multitask language understanding (MLU) benchmark for the Malay language. The benchmark comprises 24,213 questions spanning both primary (Year 1-6) and secondary (Form 1-5) education levels in Malaysia, covering 5 broad categories that further divide into 22 subjects.
Category | Subjects |
---|---|
STEM | Computer Science (Secondary), Biology (Secondary), Chemistry (Secondary), Computer Literacy (Secondary), Mathematics (Primary, Secondary), Additional Mathematics (Secondary), Design and Technology (Primary, Secondary), Core Science (Primary, Secondary), Information and Communication Technology (Primary), Automotive Technology (Secondary) |
Language | Malay Language (Primary, Secondary) |
Social Science | Geography (Secondary), Local Studies (Primary), History (Primary, Secondary)
Others | Life Skills (Primary, Secondary), Principles of Accounting (Secondary), Economics (Secondary), Business (Secondary), Agriculture (Secondary) |
Humanities | Quran and Sunnah (Secondary), Islam (Primary, Secondary), Sports Science Knowledge (Secondary) |
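Each entry is a multiple-choice question tagged with the category, subject, and education-level metadata above. A minimal sketch for peeking at the data with pandas; the file path and column names here are assumptions for illustration, not the repository's documented schema:

```python
# Sketch: inspect the benchmark with pandas. The path and column names are
# assumptions -- check the repository's data folder for the actual schema.
import pandas as pd

df = pd.read_json("data/MalayMMLU.json")  # hypothetical path

print(len(df))  # should be 24,213 questions
# Hypothetical columns: question text, options, gold answer, and the
# category/subject/level metadata summarized in the table above.
print(df[["question", "options", "answer", "category", "subject", "level"]].head())
```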
Accuracy (%) by category; a ✓ in the Vision column marks multimodal models.

| Organization | Model | Vision | Language | Humanities | STEM | Social Science | Others | Average |
|---|---|---|---|---|---|---|---|---|
| | Random | | 38.01 | 42.09 | 36.31 | 36.01 | 38.07 | 38.02 |
| OpenAI | GPT-4o | ✓ | 87.12 | 88.12 | 83.83 | 82.58 | 83.09 | 84.98 |
| | GPT-4 | ✓ | 82.90 | 83.91 | 78.80 | 77.29 | 77.33 | 80.11 |
| | GPT-4o mini | ✓ | 82.03 | 81.50 | 78.51 | 75.67 | 76.30 | 78.78 |
| | GPT-3.5 | | 69.62 | 71.01 | 67.17 | 66.70 | 63.73 | 67.78 |
| Meta | LLaMA-3.1 (70B) | | 78.75 | 82.59 | 78.96 | 77.20 | 75.32 | 78.44 |
| | LLaMA-3.1 (8B) | | 65.47 | 67.17 | 64.10 | 62.59 | 62.13 | 64.24 |
| | LLaMA-3 (8B) | | 63.93 | 66.21 | 62.26 | 62.97 | 61.38 | 63.46 |
| | LLaMA-2 (13B) | | 45.58 | 50.72 | 44.13 | 44.55 | 40.87 | 45.26 |
| | LLaMA-2 (7B) | | 47.47 | 52.74 | 48.71 | 50.72 | 48.19 | 49.61 |
| | LLaMA-3.2 (3B) | | 58.52 | 60.66 | 56.65 | 54.06 | 52.75 | 56.45 |
| | LLaMA-3.2 (1B) | | 38.88 | 43.30 | 40.65 | 40.56 | 39.55 | 40.46 |
| Qwen (Alibaba) | Qwen-2.5 (72B) | | 79.09 | 79.95 | 80.88 | 75.80 | 75.05 | 77.79 |
| | Qwen-2.5 (32B) | | 76.96 | 76.70 | 79.74 | 72.35 | 70.88 | 74.83 |
| | Qwen-2-VL (7B) | ✓ | 68.16 | 63.62 | 67.58 | 60.38 | 59.08 | 63.49 |
| | Qwen-2-VL (2B) | ✓ | 58.22 | 55.56 | 57.51 | 53.67 | 55.10 | 55.83 |
| | Qwen-1.5 (14B) | | 64.47 | 60.64 | 61.97 | 57.66 | 58.05 | 60.47 |
| | Qwen-1.5 (7B) | | 60.13 | 59.14 | 58.62 | 54.26 | 54.67 | 57.18 |
| | Qwen-1.5 (4B) | | 48.39 | 52.01 | 51.37 | 50.00 | 49.10 | 49.93 |
| | Qwen-1.5 (1.8B) | | 42.70 | 43.37 | 43.68 | 43.12 | 44.42 | 43.34 |
| Zhipu | GLM-4-Plus | | 78.04 | 75.63 | 77.49 | 74.07 | 72.66 | 75.48 |
| | GLM-4-Air | | 67.88 | 69.56 | 70.20 | 66.06 | 66.18 | 67.60 |
| | GLM-4-Flash | | 63.52 | 65.69 | 66.31 | 63.21 | 63.59 | 64.12 |
| | GLM-4 | | 63.39 | 56.72 | 54.40 | 57.24 | 55.00 | 58.07 |
| | GLM-4 (9B) | | 58.51 | 60.48 | 56.32 | 55.04 | 53.97 | 56.87 |
| Google | Gemma-2 (9B) | | 75.83 | 72.83 | 75.07 | 69.72 | 70.33 | 72.51 |
| | Gemma (7B) | | 45.53 | 50.92 | 46.13 | 47.33 | 46.27 | 47.21 |
| | Gemma (2B) | | 46.50 | 51.15 | 49.20 | 48.06 | 48.79 | 48.46 |
| SAIL (Sea) | Sailor (14B) | | 78.40 | 72.88 | 69.63 | 69.47 | 68.67 | 72.29 |
| | Sailor (7B) | | 74.54 | 68.62 | 62.79 | 64.69 | 63.61 | 67.58 |
| Cohere for AI | Command R (32B) | | 71.68 | 71.49 | 66.68 | 67.19 | 63.64 | 68.47 |
| OpenGVLab | InternVL2 (40B) | ✓ | 70.36 | 68.49 | 64.88 | 65.93 | 60.54 | 66.51 |
| Damo (Alibaba) | SeaLLM-v2.5 (7B) | | 69.75 | 67.94 | 65.29 | 62.66 | 63.61 | 65.89 |
| Mistral | Pixtral (12B) | ✓ | 64.81 | 62.68 | 64.72 | 63.93 | 59.49 | 63.25 |
| | Mistral Small (22B) | | 65.19 | 65.03 | 63.36 | 61.58 | 59.99 | 63.05 |
| | Mistral-v0.3 (7B) | | 56.97 | 59.29 | 57.14 | 58.28 | 56.56 | 57.71 |
| | Mistral-v0.2 (7B) | | 56.23 | 59.86 | 57.10 | 56.65 | 55.22 | 56.92 |
| Microsoft | Phi-3 (14B) | | 60.07 | 58.89 | 60.91 | 58.73 | 55.24 | 58.72 |
| | Phi-3 (3.8B) | | 52.24 | 55.52 | 54.81 | 53.70 | 51.74 | 53.43 |
| 01.AI | Yi-1.5 (9B) | | 56.20 | 53.36 | 57.47 | 50.53 | 49.75 | 53.08 |
| Stability AI | StableLM 2 (12B) | | 53.40 | 54.84 | 51.45 | 51.79 | 50.16 | 52.45 |
| | StableLM 2 (1.6B) | | 43.92 | 51.10 | 45.27 | 46.14 | 46.75 | 46.48 |
| Baichuan | Baichuan-2 (7B) | | 40.41 | 47.35 | 44.37 | 46.33 | 43.54 | 44.30 |
| Mesolitica | MaLLaM-v2 (5B) | | 42.57 | 46.44 | 42.24 | 40.82 | 38.74 | 42.08 |
| Yellow.ai | Komodo (7B) | | 43.62 | 45.53 | 39.34 | 39.75 | 39.48 | 41.72 |
To get started, clone the repository and install the dependencies:

```bash
git clone https://github.com/UMxYTL-AI-Labs/MalayMMLU.git
cd MalayMMLU
pip install -r requirements.txt
```
We provide example evaluation scripts in `scripts/`. The usage of `src/evaluate.py` is as follows:
```text
usage: evaluate.py [-h] [--by_letter] --base_model BASE_MODEL --output_folder OUTPUT_FOLDER [--playground PLAYGROUND] [--task TASK] [--shot SHOT] [--token TOKEN]

options:
  -h, --help            show this help message and exit
  --by_letter           Use this flag to calculate first token accuracy
  --base_model BASE_MODEL
                        Path to pretrained model
  --output_folder OUTPUT_FOLDER
                        Folder where the output will be saved
  --playground PLAYGROUND
                        Set this to True to enable playground mode (default: False).
  --task TASK           Specify the task to be executed (default: 'MalayMMLU').
  --shot SHOT           Provide the number of shots: 0, 1, 2 or 3 (default: 0).
  --token TOKEN         Specify the HuggingFace token
```
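The `--by_letter` flag scores a model by which option letter receives the highest probability at the first generated token, instead of parsing a full generated answer. Below is a minimal sketch of that idea, assuming a HuggingFace causal LM; it is illustrative only (the prompt template is made up and the toy question is not from the benchmark), not the repository's exact implementation:

```python
# Sketch of first-token accuracy (the idea behind --by_letter). Illustrative
# only; see src/evaluate.py for the actual implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Toy question (not from MalayMMLU) in a made-up prompt template.
prompt = (
    "Soalan: Apakah ibu negara Malaysia?\n"
    "A. Johor Bahru\nB. Kuala Lumpur\nC. Kuching\nD. Ipoh\n"
    "Jawapan:"
)
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # distribution over the next token

# Score only the option letters and pick the most probable one.
letters = ["A", "B", "C", "D"]
ids = [tok.encode(" " + l, add_special_tokens=False)[0] for l in letters]
pred = letters[torch.stack([logits[i] for i in ids]).argmax().item()]
print(pred)  # counts as correct if it matches the gold letter
```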
`PRED_FILE`: filename of the prediction file, e.g. `output/MalayMMLU_result_Meta-Llama-3-8B-Instruct_True_0shot.csv`. The `True`/`False` segment of the filename records whether the predictions were generated with `--by_letter`.

For example, to evaluate with first-token accuracy (`--by_letter`):
```bash
SHOT=0

# prediction
python src/evaluate.py --by_letter --shot $SHOT --task=MalayMMLU \
    --base_model=meta-llama/Meta-Llama-3-8B-Instruct \
    --output_folder=output/ --token $TOKEN

# calculate accuracy
PRED_FILE=output/MalayMMLU_result_Meta-Llama-3-8B-Instruct_True_0shot.csv
python src/calculate_accuracies.py --pred_files $PRED_FILE \
    --shot=$SHOT \
    --output_dir=output/

# calculate accuracy for all prediction files in a folder
PRED_DIR=output/
python src/calculate_accuracies.py --all --pred_dir $PRED_DIR \
    --shot=$SHOT \
    --output_dir=results/
```
To evaluate without first-token accuracy, omit `--by_letter`:

```bash
# prediction
python src/evaluate.py --shot $SHOT --task=MalayMMLU \
    --base_model=meta-llama/Meta-Llama-3-8B-Instruct \
    --output_folder=output/ --token $TOKEN

# calculate accuracy
PRED_FILE=output/MalayMMLU_result_Meta-Llama-3-8B-Instruct_False_0shot.csv
python src/calculate_accuracies.py --pred_files $PRED_FILE \
    --shot=$SHOT \
    --output_dir=output/

# calculate accuracy for all prediction files in a folder
PRED_DIR=output/
python src/calculate_accuracies.py --all --pred_dir $PRED_DIR \
    --shot=$SHOT \
    --output_dir=output/
```
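For a quick sanity check of a prediction file, a rough pandas sketch of the kind of per-category grouping `calculate_accuracies.py` performs might look like this; the column names are assumptions about the CSV layout, not the script's documented format:

```python
# Rough sketch of per-category accuracy from a prediction CSV. The column
# names ("prediction", "answer", "category") are hypothetical.
import pandas as pd

df = pd.read_csv("output/MalayMMLU_result_Meta-Llama-3-8B-Instruct_True_0shot.csv")
df["correct"] = df["prediction"] == df["answer"]

print((df.groupby("category")["correct"].mean() * 100).round(2))  # per category
print(f"Average: {df['correct'].mean() * 100:.2f}")
```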
The steps and usage are similar for `evaluate_pixtral.py`, `evaluate_qwen_vl.py`, and `evaluate_intern_vl.py`.
To evaluate OpenAI models, use `src/evaluate_gpt.py`.

`API_KEY`: OpenAI API key

```bash
# prediction
python src/evaluate_gpt.py --model gpt-3.5-turbo --api_key $API_KEY --shot $SHOT
```
- Download the prediction file (a `jsonl` file) from the OpenAI platform.
- Rename the file in the following format: `MalayMMLU_{$MODEL}_{$SHOT}shot.jsonl`, e.g. `MalayMMLU_gpt3_0shot.jsonl`.

For example:
```bash
# calculate accuracy
PRED_FILE=MalayMMLU_gpt3_0shot.jsonl
python src/calculate_accuracies.py --pred_files $PRED_FILE \
    --shot=$SHOT \
    --output_dir=output/ --closed

# calculate accuracy for all prediction files in a folder
python src/calculate_accuracies.py --all --pred_dir $PRED_DIR \
    --shot=$SHOT \
    --output_dir=output/ --closed
```
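The downloaded prediction file holds one JSON object per line. A minimal sketch for peeking at the model answers inside it, assuming the standard OpenAI Batch API output schema (which may differ from what `calculate_accuracies.py --closed` actually parses):

```python
# Sketch: read model answers out of the downloaded jsonl, assuming the
# standard OpenAI Batch API output schema; calculate_accuracies.py --closed
# does the real parsing.
import json

with open("MalayMMLU_gpt3_0shot.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        answer = rec["response"]["body"]["choices"][0]["message"]["content"]
        print(rec.get("custom_id"), answer)
```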
The steps and usage are similar for `evaluate_glm.py`.
If you use MalayMMLU in your work, please cite:

```bibtex
@InProceedings{MalayMMLU2024,
  author    = {Poh, Soon Chang and Yang, Sze Jue and Tan, Jeraelyn Ming Li and Chieng, Lawrence Leroy Tze Yao and Tan, Jia Xuan and Yu, Zhenyu and Foong, Chee Mun and Chan, Chee Seng},
  title     = {MalayMMLU: A Multitask Benchmark for the Low-Resource Malay Language},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024},
  month     = {November},
  year      = {2024},
}
```
Suggestions and opinions (both positive and negative) are greatly welcome. Please contact the authors by email at cs.chan at um.edu.my.
The codebase is built upon IndoMMLU.