This project is an implementation of Direct Preference Optimization, an alternative to RLHF for aligning Large Language Models (LLMs) with human preferences. The algorithm is described in the research paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
Direct Preference Optimization (DPO) is a promising and efficient technique for fine-tuning Large Language Models (LLMs) so that they align with human preferences. Compared to traditional Reinforcement Learning from Human Feedback (RLHF), DPO eliminates the need for a separate reward model and simplifies the training process, leading to better stability and computational efficiency.
The key insight in Direct Preference Optimization is replacing the complex reward modeling process in RLHF with a simple closed-form loss function that directly optimizes for human preferences. Given a preference dataset, it increases the log probability of the tokens in the human-preferred responses and decreases the log probability of the tokens in the human-dispreferred responses, which means the language model itself encodes an implicit reward function that is directly optimized for human preferences. Thanks to this clever reparameterization, training becomes much simpler and more efficient than RLHF: it does not require a separate reward model, and it is more stable because it avoids reinforcement learning algorithms like PPO for fine-tuning.
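To make the "implicit reward" idea concrete, here is a rough sketch from the DPO derivation: up to a term that depends only on the prompt $x$, the reward the model is implicitly optimizing is a scaled log-probability ratio between the model being trained, $\pi_\theta$, and a frozen reference model, $\pi_\text{ref}$ (both defined formally below):

$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$$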
The DPO loss function is defined as follows:

$$\mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]$$

where:
- $\pi_{\theta}$ is the language model we want to fine-tune
- $\pi_\text{ref}$ is a reference model, usually a frozen version of the original pre-trained language model
- $D$ is the dataset of preferences
- $x$ is a prompt sampled from the dataset $D$
- $y_w$ is the human-preferred response to the prompt $x$
- $y_l$ is the human-dispreferred response to the prompt $x$
- $\beta$ is a hyperparameter that controls the amount of divergence from the reference model $\pi_\text{ref}$
The DPO loss function can be broken down into two main terms inside the sigmoid: the first term is the log-probability ratio of the human-preferred response $y_w$ under $\pi_\theta$ versus $\pi_\text{ref}$, and the second term is the same ratio for the human-dispreferred response $y_l$. Widening the gap between these two terms pushes the model to assign more probability to preferred responses than to dispreferred ones, relative to the reference model.

The hyperparameter $\beta$ scales this gap and thereby controls how far the fine-tuned model may drift from the reference model: higher values keep the model closer to $\pi_\text{ref}$, while lower values allow it to diverge more.
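As a rough illustration, here is a minimal sketch of this loss in PyTorch. The function name, argument names, and the `beta` default are illustrative assumptions rather than this repository's actual API; the sketch assumes you have already computed the summed log probability of each response under both $\pi_\theta$ and $\pi_\text{ref}$:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Illustrative DPO loss (not the repository's actual API).

    Each argument has shape (batch,) and holds the summed token
    log probability of a full response (y_w or y_l) under either
    the trainable policy pi_theta or the frozen reference pi_ref.
    """
    # log(pi_theta(y_w|x) / pi_ref(y_w|x)) -- preferred-response term
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    # log(pi_theta(y_l|x) / pi_ref(y_l|x)) -- dispreferred-response term
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # -log sigmoid(beta * (chosen - rejected)), averaged over the batch
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

In practice the reference-model log probabilities would be computed under `torch.no_grad()`, so that gradients only flow into $\pi_\theta$.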
For a detailed explanation, you can check out my blog post Unveiling the Hidden Reward System in Language Models: A Dive into DPO.