MCQ_via_LLM_Dataset: Generating Multiple Choice Questions from Scientific Literature via Large Language Models
Welcome to the repository for the MCQ_via_LLM_Dataset, a specialized dataset designed for generating high-quality Multiple Choice Questions (MCQs) from scientific literature using Large Language Models (LLMs). This dataset serves as a crucial resource for researchers and developers working at the intersection of natural language processing and scientific data interpretation.
In recent years, the rapid advancement of LLMs has opened new horizons in natural language processing, particularly for the automated generation of educational content. Our study introduces a systematic approach to creating high-quality MCQs from scientific literature by leveraging these models.
The primary objective of this research is to explore the potential of LLMs in generating diverse and accurate MCQs from scientific texts, with a specific focus on materials science. We have curated a specialized dataset by extracting information from extensive scientific literature, emphasizing five critical tasks:
- Common Science Knowledge Q&A: Questions designed to test foundational scientific knowledge.
- Digital Data Extraction: MCQs that assess the ability to extract and interpret data from digital formats.
- Detailed Understanding: Questions that require a deep comprehension of scientific concepts and literature.
- Reasoning and Interpretation: MCQs focused on logical reasoning and the interpretation of scientific findings.
- Safety Judgments: Questions assessing knowledge and application of safety protocols in scientific contexts.
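To make the record structure concrete, here is a minimal sketch of how a single MCQ entry could be represented in Python. The field names (`task`, `question`, `choices`, `answer`, `source`) are illustrative assumptions rather than the dataset's actual schema; the per-task README files describe the authoritative format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical layout for one MCQ record. Every field name below is an
# assumption for illustration; the released files may use different keys.
@dataclass
class MCQRecord:
    task: str           # one of the five task types, e.g. "safety_judgments"
    question: str       # question stem derived from the source passage
    choices: List[str]  # answer options, typically four (A-D)
    answer: str         # correct option label, e.g. "B"
    source: str = ""    # provenance, e.g. a DOI or paper identifier
```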
Our approach uses carefully crafted prompts to steer LLMs toward automatically generating MCQs. The process includes several key steps:
- Data Collection: Extracting relevant scientific content from a wide range of literature in the field of materials science.
- Prompt Engineering: Designing prompts that guide LLMs to generate high-quality, contextually accurate MCQs (a minimal prompt sketch follows this list).
- Validation: Conducting rigorous validation to ensure the relevance, accuracy, and educational value of the generated MCQs.
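As a rough illustration of the prompt-engineering step, the sketch below fills a template with a source passage and a task type, then hands it to an LLM. Both the template text and the `call_llm` callable are assumptions made for exposition; the exact prompts used to build the dataset are not reproduced here.

```python
# Illustrative prompt template; not the actual prompt used for the dataset.
MCQ_PROMPT = """You are given an excerpt from a materials-science paper.

Excerpt:
{passage}

Write one multiple-choice question of type "{task}" with four options
(A-D), exactly one of which is correct. End with "Answer: <letter>".
"""

def generate_mcq(passage: str, task: str, call_llm) -> str:
    """Fill the template and send it to an LLM via the caller-supplied
    call_llm function (a stand-in for whichever LLM API is used)."""
    prompt = MCQ_PROMPT.format(passage=passage, task=task)
    return call_llm(prompt)
```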
The MCQ_via_LLM_Dataset offers several significant contributions:
- High-Quality Dataset: A unique dataset tailored for generating MCQs from scientific literature, suitable for both educational and research purposes.
- Benchmark for LLM Evaluation: The dataset serves as a benchmark to evaluate the problem-solving capabilities of various LLMs in the domain of materials science (a scoring sketch follows this list).
- Insights into LLM Performance: Our experimental results provide a comprehensive analysis of the strengths and weaknesses of different LLMs, offering valuable insights for future applications in scientific and educational contexts.
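To show how the dataset can act as a benchmark, here is a hedged sketch that poses each question to a model and reports per-task accuracy. The record keys and the `model_answer` callable are illustrative assumptions, and the answer field is assumed to store the correct option letter.

```python
from collections import defaultdict

def evaluate(records, model_answer):
    """Score a model on MCQ records.

    records: iterable of dicts with "task", "question", "choices", and
             "answer" keys (assumed layout; "answer" assumed to be a letter).
    model_answer: hypothetical callable wrapping the LLM under evaluation.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        options = "\n".join(
            f"{chr(65 + i)}. {choice}" for i, choice in enumerate(rec["choices"])
        )
        prompt = f"{rec['question']}\n{options}\nAnswer with a single letter."
        prediction = model_answer(prompt)
        total[rec["task"]] += 1
        if prediction.strip().upper().startswith(rec["answer"].strip().upper()):
            correct[rec["task"]] += 1
    return {task: correct[task] / total[task] for task in total}
```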
The experimental evaluation of our dataset reveals the potential of LLMs in producing diverse and high-quality MCQs. The results highlight the current capabilities of LLMs in handling different types of scientific data and generating meaningful educational content. Additionally, our study sheds light on the limitations and challenges faced by these models, paving the way for future research and improvements in the field.
We envision several potential future directions based on our findings:
- Enhancing LLM Capabilities: Further refining LLMs to improve their ability to generate accurate and contextually relevant MCQs.
- Expanding Dataset Domains: Extending the dataset to cover other scientific fields beyond materials science.
- Real-World Applications: Exploring the use of LLM-generated MCQs in educational settings and online learning platforms.
To use the MCQ_via_LLM_Dataset, please follow the instructions below:
- Clone the Repository:

```bash
git clone https://github.com/logos000/MCQ_via_LLM_Dataset.git
```
- Explore the Dataset: The dataset files are organized into different folders based on the five critical tasks. Review the README files for detailed descriptions and usage guidelines.
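Below is a minimal loading sketch, assuming the records are stored one JSON object per line (JSONL). The folder name and file extension are placeholders; adjust them to the layout described in the per-task README files.

```python
import json
from pathlib import Path

def load_task(task_dir: str) -> list:
    """Load all JSONL files under one task folder (assumed layout)."""
    records = []
    for path in Path(task_dir).glob("*.jsonl"):
        with path.open(encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f if line.strip())
    return records

# "detailed_understanding" is a hypothetical folder name used for illustration.
questions = load_task("MCQ_via_LLM_Dataset/detailed_understanding")
print(f"Loaded {len(questions)} records")
```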
For any questions or inquiries, please feel free to contact us at sl186@illinois.edu.