Learning from human feedback is crucial in aligning large language models (LLMs) with human values and intentions [51], ensuring they are helpful, honest, and harmless [5]. Reinforcement learning from human feedback (RLHF) [18, 62, 73] is a popular method for fine-tuning language models to achieve effective alignment. While the classical RLHF approach [62, 70] has shown impressive results, it presents optimization challenges due to its multi-stage procedure, which involves training a reward model and then optimizing a policy model to maximize that reward [13].
Recently, researchers have been exploring simpler offline algorithms. Direct Preference Optimization (DPO) [66] is one such approach. DPO reparameterizes the reward function in RLHF to directly learn a policy model from preference data, eliminating the need for an explicit reward model. It has gained widespread practical adoption due to its simplicity and stability. In DPO, the implicit reward is formulated using the log ratio of the likelihood of a response between the current policy model and the supervised fine-tuned (SFT) model. However, this reward formulation is not directly aligned with the metric used to guide generation, which is approximately the average log likelihood of a response generated by the policy model. We hypothesize that this discrepancy between training and inference may lead to suboptimal performance.
Figure 1: SimPO and DPO mainly differ in their reward formulation, as indicated in the shaded box. SimPO outperforms DPO significantly across a range of settings on AlpacaEval 2 and Arena-Hard.
Table 1: Length-controlled (LC) and raw win rate (WR), and generation lengths of top models on the AlpacaEval 2 Leaderboard. Bold are the models we trained.
In this work, we propose SimPO, a simple yet effective offline preference optimization algorithm (Figure 1). The core of our algorithm aligns the reward function in the preference optimization objective with the generation metric. SimPO consists of two major components: (1) a length-normalized reward, calculated as the average log probability of all tokens in a response using the policy model, and (2) a target reward margin to ensure the reward difference between winning and losing responses exceeds this margin. In summary, SimPO has the following properties:
• Simplicity: SimPO does not require a reference model, making it more lightweight and easier to implement compared to DPO and other reference-based methods.
• Significant performance advantage: Despite its simplicity, SimPO significantly outperforms DPO and its latest variants (e.g., a recent reference-free objective ORPO [42]). The performance advantage is consistent across various training setups and extensive chat-based evaluations, including AlpacaEval 2 [55, 28] and the challenging Arena-Hard [54] benchmark. It achieves up to a 6.4 point improvement on AlpacaEval 2 and a 7.5 point improvement on Arena-Hard compared to DPO (Figure 1).
• Minimal length exploitation: SimPO does not significantly increase response length compared to the SFT or DPO models (Table 1), indicating minimal length exploitation [28, 71, 85].
Extensive analysis shows that SimPO utilizes preference data more effectively, leading to a more accurate likelihood ranking of winning and losing responses on a held-out validation set, which in turn translates to a better policy model. As shown in Table 1, our Gemma-2-9B-it-SimPO model achieves state-of-the-art performance, with a 72.4% length-controlled win rate on AlpacaEval 2 and a 59.1% win rate on Arena-Hard, establishing it as the strongest open-source model under 10B parameters. Most notably, when evaluated on Chatbot Arena [17] with real user votes, our model significantly improved upon the initial Gemma-2-9B-it model, advancing from 36th to 25th place and ranking first among all <10B models on the leaderboard.
In this section, we first introduce the background of DPO (§2.1). Then we identify the discrepancy between DPO’s reward and the likelihood metric used for generation, and propose an alternative reference-free reward formulation that mitigates this issue (§2.2). Finally, we derive the SimPO objective by incorporating a target reward margin term into the Bradley-Terry model (§2.3).
2.1 Background: Direct Preference Optimization (DPO)
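DPO [66] is one of the most popular preference optimization methods. Instead of learning an explicit reward model [62], DPO reparameterizes the reward function $r$ using a closed-form expression with the optimal policy:
$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x), \tag{1}$$
where $\pi_\theta$ is the policy model, $\pi_{\text{ref}}$ is the reference policy, typically the supervised fine-tuned (SFT) model, and $Z(x)$ is the partition function. By incorporating this reward formulation into the Bradley-Terry (BT) ranking objective [11], $p(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l)\right)$, DPO expresses the probability of preference data with the policy model rather than the reward model, yielding the following objective:
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right], \tag{2}$$
where $(x, y_w, y_l)$ are preference pairs consisting of the prompt, the winning response, and the losing response from the preference dataset $\mathcal{D}$.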
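As a rough illustration of the DPO objective for a single preference pair, the following standard-library sketch computes the per-example loss from sequence-level log probabilities. The function names and the default `beta=0.1` are assumptions made for this example, not values taken from the paper; the partition term $\beta \log Z(x)$ is omitted because it cancels in the reward difference.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss (hypothetical helper, illustrative only).

    Each argument is a summed sequence log probability log pi(y | x).
    beta=0.1 is an assumed default, not a value reported in the paper.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref); beta * log Z(x) cancels.
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    return -log_sigmoid(reward_w - reward_l)


# When the policy equals the reference model, both rewards are 0
# and the loss equals log(2).
loss_at_init = dpo_loss(-10.0, -10.0, -10.0, -10.0)
```

Note that both the policy and the reference log probabilities are needed for every pair, which is the extra cost that a reference-free reward avoids.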
2.2 A Simple Reference-Free Reward Aligned with Generation
Discrepancy between reward and generation for DPO. Using Eq. (1) as the implicit reward has the following drawbacks: (1) it requires a reference model $\pi_{\text{ref}}$ during training, which incurs additional memory and computational costs; and (2) it creates a mismatch between the reward optimized in training and the log-likelihood optimized during inference, where no reference model is involved. This means that in DPO, for any triple $(x, y_w, y_l)$, satisfying the reward ranking $r(x, y_w) > r(x, y_l)$ does not necessarily mean that the likelihood ranking $p_\theta(y_w \mid x) > p_\theta(y_l \mid x)$ is met (here $p_\theta$ is the average log-likelihood in Eq. (3)). In our experiments, we observed that only roughly half of the triples from the training set satisfy this condition when trained with DPO (Figure 4b). This observation aligns with a concurrent work [14], which finds that existing models trained with DPO exhibit random ranking accuracy in terms of average log-likelihood, even after extensive preference optimization.
Length-normalized reward formulation. One solution is to use the summed token log probability as the reward, but this suffers from length bias: longer sequences tend to have lower log probabilities. Consequently, when $y_w$ is longer than $y_l$, optimizing the summed log probability as a reward forces the model to artificially inflate probabilities for longer sequences to ensure $y_w$ receives a higher reward than $y_l$. This overcompensation increases the risk of degeneration. To address this issue, we consider using the average log-likelihood as the implicit reward:
$$p_\theta(y \mid x) = \frac{1}{|y|} \log \pi_\theta(y \mid x) = \frac{1}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta(y_i \mid x, y_{<i}). \tag{3}$$
This metric is commonly used for ranking options in beam search [35, 53] and multiple-choice tasks within language models [12, 41, 62]. Naturally, we consider replacing the reward formulation in DPO with $p_\theta$ in Eq. (3), so that it aligns with the likelihood metric that guides generation. This results in a length-normalized reward:
$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x) = \frac{\beta}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta(y_i \mid x, y_{<i}), \tag{4}$$
where $\beta$ is a constant that controls the scaling of the reward difference. We find that normalizing the reward with response lengths is crucial; removing the length normalization term from the reward formulation results in a bias toward generating longer but lower-quality sequences (see Section 4.4 for more details). Consequently, this reward formulation eliminates the need for a reference model, enhancing memory and computational efficiency compared to reference-dependent algorithms.
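To make the length bias concrete, the sketch below contrasts the summed and the average token log probability on two toy responses. The token log probabilities, the helper names, and `beta=2.0` are made up for illustration and are not taken from the paper's experiments.

```python
def summed_log_likelihood(token_logps: list[float]) -> float:
    """Sum of per-token log probabilities, i.e. log pi(y | x)."""
    return sum(token_logps)


def average_log_likelihood(token_logps: list[float]) -> float:
    """Length-normalized log likelihood: (1 / |y|) * log pi(y | x)."""
    return sum(token_logps) / len(token_logps)


def length_normalized_reward(token_logps: list[float], beta: float = 2.0) -> float:
    """beta times the average log likelihood (beta=2.0 is an assumed value)."""
    return beta * average_log_likelihood(token_logps)


# Toy data (assumed): a long winning response whose tokens are individually
# more likely, and a short losing response whose tokens are less likely.
winning_long = [-0.5] * 8   # 8 tokens
losing_short = [-1.0] * 2   # 2 tokens

# The summed score ranks the short losing response higher (-2.0 > -4.0) ...
sum_prefers_loser = summed_log_likelihood(losing_short) > summed_log_likelihood(winning_long)
# ... while the average score ranks the winning response higher (-0.5 > -1.0).
avg_prefers_winner = average_log_likelihood(winning_long) > average_log_likelihood(losing_short)
```

Under the summed score, the model would have to raise the per-token probabilities of the long response substantially just to overcome its length, which is the overcompensation described above.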
2.3 The SimPO Objective
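Target reward margin. Additionally, we introduce a target reward margin term, $\gamma > 0$, to the Bradley-Terry objective to ensure that the reward for the winning response, $r(x, y_w)$, exceeds the reward for the losing response, $r(x, y_l)$, by at least $\gamma$:
$$p(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l) - \gamma\right). \tag{5}$$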
The margin between two classes is known to influence the generalization capabilities of classifiers [1, 10, 22, 31]. In standard training settings with random model initialization, increasing the target margin typically improves generalization. In preference optimization, the two classes are the winning and losing responses for a single input. In practice, we observe that generation quality initially improves with an increasing target margin but degrades when the margin becomes too large (§4.3). One of DPO’s variants, IPO [6], also formulates a target reward margin similar to SimPO. However, its full objective is not as effective as SimPO (§4.1).
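Objective. Finally, we obtain the SimPO objective by plugging Eq. (4) into Eq. (5):
$$\mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma\right)\right]. \tag{6}$$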
In summary, SimPO employs an implicit reward formulation that directly aligns with the generation metric, eliminating the need for a reference model. Additionally, it introduces a target reward margin to help separate the winning and losing responses. In Appendix F, we provide a gradient analysis of SimPO and DPO to further understand the differences between the two methods.
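The objective can be sketched for a single preference pair as follows. This is a minimal standard-library illustration under stated assumptions, not the authors' training code: the function names are hypothetical, the inputs are per-token log probabilities from the policy model, and the defaults `beta=2.0` and `gamma=1.0` are illustrative values only.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(token_logps_w: list[float], token_logps_l: list[float],
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """Per-example SimPO-style loss (hypothetical helper, illustrative only).

    token_logps_w / token_logps_l hold log pi_theta(y_i | x, y_<i) for the
    winning and losing responses. No reference model is involved.
    """
    # Length-normalized rewards: beta * average token log probability.
    reward_w = beta * sum(token_logps_w) / len(token_logps_w)
    reward_l = beta * sum(token_logps_l) / len(token_logps_l)
    # Bradley-Terry with a target margin: the reward gap must exceed gamma.
    return -log_sigmoid(reward_w - reward_l - gamma)


# Toy pair (assumed values): the reward gap is exactly gamma,
# so the argument of the sigmoid is 0 and the loss equals log(2).
loss = simpo_loss([-0.5, -0.5], [-1.0], beta=2.0, gamma=1.0)
```

With `gamma=0.0` the same pair yields a smaller loss, since the margin no longer has to be cleared; a larger `gamma` keeps pushing the rewards apart even when the ranking is already correct.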
Preventing catastrophic forgetting without KL regularization. Although SimPO does not impose KL regularization, we find that a combination of practical factors ensures effective learning from preference data while maintaining generalization, leading to an empirically low KL divergence from the reference model. These factors are: (1) a small learning rate, (2) a preference dataset that covers diverse domains and tasks, and (3) the intrinsic robustness of LLMs to learn from new data without forgetting prior knowledge. We present KL divergence experiments in Section 4.4.
Models and training settings. We perform preference optimization with two families of models, Llama-3-8B [2] and Mistral-7B [44], under two setups: Base and Instruct. In this section, our goal is to understand the performance of SimPO vs. other preference optimization methods in different experimental setups. Our strongest model is based on Gemma-2-9B (Instruct setup) with a stronger reward model, RLHFlow/ArmoRM-Llama3-8B-v0.1 [84] (Table 1). We will present and discuss these results in Appendix J.
For the Base setup, we follow the training pipeline of Zephyr [80]. First, we train a base model (i.e., mistralai/Mistral-7B-v0.1, or meta-llama/Meta-Llama-3-8B) on the UltraChat-200k dataset [25] to obtain an SFT model. Then, we perform preference optimization on the UltraFeedback dataset [23] using the SFT model as the starting point. This setup provides a high level of transparency, as the SFT models are trained on open-source data.
Table 2: Evaluation details for AlpacaEval 2 [55], Arena-Hard [54], and MT-Bench [99]. The baseline model refers to the model compared against. GPT-4 Turbo corresponds to GPT-4-Preview-1106.
For the Instruct setup, we use an off-the-shelf instruction-tuned model (i.e., meta-llama/Meta-Llama-3-8B-Instruct, or mistralai/Mistral-7B-Instruct-v0.2) as the SFT models. These models have undergone extensive instruction-tuning processes, making them more powerful and robust than the SFT models in the Base setup. However, they are also more opaque because their RLHF procedure is not publicly disclosed. To mitigate the distribution shift between SFT models and the preference optimization process, we generate the preference dataset using the SFT models following [79]. This makes our Instruct setup closer to an on-policy setting. Specifically, we use prompts from the UltraFeedback dataset and regenerate the chosen and rejected response pairs $(y_w, y_l)$ with the SFT models. For each prompt $x$, we generate 5 responses using the SFT model with a sampling temperature of 0.8. We then use llm-blender/PairRM [45] to score the 5 responses, selecting the highest-scoring one as $y_w$ and the lowest-scoring one as $y_l$. We only generated data in a single pass instead of iteratively as in [79].
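The pair-construction step for the Instruct setup can be sketched as below. The function name, the dictionary keys, and the scores are assumptions for illustration: the scores stand in for the output of a reward or ranking model such as PairRM, which is not called here.

```python
def build_preference_pair(prompt: str, responses: list[str],
                          scores: list[float]) -> dict[str, str]:
    """Pick the highest-scoring response as chosen and the lowest as rejected.

    `scores` are placeholder floats standing in for a scorer's output;
    this helper and its field names are hypothetical.
    """
    if len(responses) != len(scores) or len(responses) < 2:
        raise ValueError("need at least two responses with one score each")
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return {
        "prompt": prompt,
        "chosen": responses[best],     # plays the role of y_w
        "rejected": responses[worst],  # plays the role of y_l
    }


# Five sampled responses per prompt, as in the setup described above.
pair = build_preference_pair(
    "Explain overfitting.",
    ["resp A", "resp B", "resp C", "resp D", "resp E"],
    [0.1, 0.7, -0.3, 0.4, 0.2],  # made-up scores
)
```

Taking only the two extremes discards the three middle responses; that matches the single-pass procedure described above, though other selection rules are possible.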
In summary, we have four setups: Llama-3-Base, Llama-3-Instruct, Mistral-Base, and Mistral-Instruct. We believe these configurations represent the state-of-the-art, placing our models among the top performers on various leaderboards. We encourage future research to adopt these settings for better and fairer comparisons of different algorithms. Additionally, we find that tuning hyperparameters is crucial for achieving optimal performance with all the offline preference optimization algorithms, including DPO and SimPO. Generally, for SimPO, setting $\beta$ between 2.0 and 2.5 and $\gamma$ between 0.5 and 1.5 leads to good performance across all setups. For more details, please refer to Appendix B.
Evaluation benchmarks. We primarily assess our models using three of the most popular open-ended instruction-following benchmarks: MT-Bench [99], AlpacaEval 2 [55], and Arena-Hard v0.1 [54]. These benchmarks evaluate the models’ versatile conversational abilities across a diverse set of queries and have been widely adopted by the community (details in Table 2). AlpacaEval 2 consists of 805 questions from 5 datasets, and MT-Bench covers 8 categories with 80 questions. The most recently released Arena-Hard is an enhanced version of MT-Bench, incorporating 500 well-defined technical problem-solving queries. We report scores following each benchmark’s evaluation protocol. For AlpacaEval 2, we report both the raw win rate (WR) and the length-controlled win rate (LC) [28]. The LC metric is specifically designed to be robust against model verbosity. For Arena-Hard, we report the win rate (WR) against the baseline model. For MT-Bench, we report the average MT-Bench score with GPT-4 and GPT-4-Preview-1106 as the judge models. For decoding details, please refer to Appendix B. We also evaluate on downstream tasks from the Huggingface Open Leaderboard benchmarks [9], with additional details in Appendix C.
Table 3: Various preference optimization objectives given preference data $\mathcal{D} = (x, y_w, y_l)$, where $x$ is an input, and $y_w$ and $y_l$ are the winning and losing responses.
Baselines. We compare SimPO with other offline preference optimization methods listed in Table 3. RRHF [91] and SLiC-HF [96] are ranking losses. RRHF uses length-normalized log-likelihood, similar to SimPO’s reward function, while SLiC-HF uses log-likelihood directly and includes an SFT objective. IPO [6] is a theoretically grounded approach that avoids DPO’s assumption that pairwise preferences can be replaced with pointwise rewards. CPO [88] uses sequence likelihood as a reward and trains alongside an SFT objective. KTO [29] learns from non-paired preference data. ORPO [42] introduces a reference-model-free odds ratio term to directly contrast winning and losing responses with the policy model and jointly trains with the SFT objective. R-DPO [64] is a modified version of DPO that includes an additional regularization term to prevent exploitation of length. We thoroughly tune the hyperparameters for each baseline and report the best performance. We find that many variants of DPO do not empirically present an advantage over standard DPO. Further details can be found in Appendix B.
Table 4: AlpacaEval 2 [55], Arena-Hard [54], and MT-Bench [99] results under the four settings. LC and WR denote length-controlled and raw win rate, respectively. We train SFT models for Base settings on the UltraChat dataset. For Instruct settings, we use off-the-shelf models as the SFT model.
In this section, we present the main results of our experiments, highlighting the superior performance of SimPO on various benchmarks and ablation studies (§4.1). We provide an in-depth understanding of the following components: (1) length normalization (§4.2), (2) the margin term (§4.3), and (3) why SimPO outperforms DPO (§4.4). Unless otherwise specified, the ablation studies are conducted using the Mistral-Base setting.
4.1 Main Results and Ablations
SimPO consistently and significantly outperforms existing preference optimization methods. As shown in Table 4, while all preference optimization algorithms enhance performance over the SFT model, SimPO, despite its simplicity, achieves the best overall performance across all benchmarks and settings. These consistent and significant improvements highlight the robustness and effectiveness of SimPO. Notably, SimPO outperforms the best baseline by 3.6 to 4.8 points on the AlpacaEval 2 LC win rate across various settings. On Arena-Hard, SimPO consistently achieves superior performance, though it is occasionally surpassed by CPO [88]. We find that CPO generates responses that are, on average, 50% longer than those generated by SimPO (see Table 10). Arena-Hard might favor longer generations due to the absence of a length penalty in its evaluation.
Table 5: Ablation studies under Mistral-Base and Mistral-Instruct settings. We ablate each key design of SimPO: (1) removing length normalization in Eq. (4) (i.e., w/o LN); (2) setting target reward margin to be 0 in Eq. (6) (i.e., $\gamma = 0$).
Figure 2: Effect of length normalization (LN). (a) Relationship between reward margin and length difference between winning and losing responses. (b) Spearman correlation between average log probability and response length for SimPO. (c) Spearman correlation for SimPO without LN.
Benchmark quality varies. Although all three benchmarks are widely adopted, we find that MT-Bench exhibits poor separability across different methods. Minor differences between methods on MT-Bench may be attributed to randomness, likely due to the limited scale of its evaluation data and its single-instance scoring protocol. This finding aligns with observations reported in [54]. In contrast, AlpacaEval 2 and Arena-Hard provide more meaningful distinctions between different methods. We observe that the win rate on Arena-Hard is significantly lower than on AlpacaEval 2, indicating that Arena-Hard is a more challenging benchmark.
The Instruct setting introduces significant performance gains. Across all benchmarks, we observe that the Instruct setting consistently outperforms the Base setting. This improvement is likely due to the higher quality of SFT models used for initialization and the generation of more high-quality preference data by these models.
Both key designs in SimPO are crucial. In Table 5, we demonstrate results from ablating each key design of SimPO: (1) removing length normalization in Eq. (4) (i.e., w/o LN); (2) setting the target reward margin to be 0 in Eq. (6) (i.e., $\gamma = 0$). Removing the length normalization has the most negative impact on the results. Our examination reveals that this leads to the generation of long and repetitive patterns, substantially degrading the overall quality of the output (see Appendix E). Setting $\gamma$ to 0 also leads to a performance degradation compared to SimPO, indicating that it is not the optimal target reward margin. In the following subsections, we conduct in-depth analyses to better understand both design choices.
4.2 Length Normalization (LN) Prevents Length Exploitation
LN leads to an increase in the reward difference for all preference pairs, regardless of their length. The Bradley-Terry objective in Eq. (5) essentially aims to optimize the reward difference $\Delta r = r(x, y_w) - r(x, y_l)$ to exceed the target margin $\gamma$. We investigate the relationship between the learned reward differences and the length difference between the winning and losing responses from the training set of UltraFeedback. We measure the difference of reward (Eq. (4)) using the SFT model, the SimPO model, and a model trained with SimPO but without length normalization. We present the results in Figure 2a and observe that SimPO with LN consistently achieves a positive reward margin for all response pairs, regardless of their length difference, and consistently improves the margin over the SFT model. In contrast, SimPO without LN results in a negative reward difference for preference pairs when the winning response is shorter than the losing response, indicating that the model learns poorly for these instances.
Figure 3: Study of the margin $\gamma$. (a) Reward accuracy and AlpacaEval 2 LC win rate under different $\gamma$ values. (b) Reward difference distribution under different $\gamma$ values. (c) Log likelihood distribution on chosen responses under different $\gamma$ values.
Removing LN results in a strong positive correlation between the reward and response length, leading to length exploitation. Figures 2b and 2c illustrate the average log likelihood ($p_\theta$ in Eq. (3)) versus response length on a held-out set for models trained with SimPO and SimPO without LN. The model trained without LN exhibits a much stronger positive Spearman correlation between likelihood and response length compared to SimPO, indicating a tendency to exploit length bias and generate longer sequences (see Appendix E). In contrast, SimPO results in a Spearman correlation coefficient similar to the SFT model (see Figure 6a).
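The correlation analysis above can be reproduced in spirit with a small rank-correlation routine. The following is a standard-library sketch of Spearman's coefficient (Pearson correlation of average ranks, with ties sharing the mean rank); the function names and the toy lengths and log likelihoods are assumptions, not data from the paper.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a: list[float], b: list[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a: list[float], b: list[float]) -> float:
    """Spearman rank correlation between two equal-length sequences."""
    return pearson(average_ranks(a), average_ranks(b))


# Toy held-out set (assumed): average log likelihood rises with length,
# the pattern that would signal length exploitation.
lengths = [10.0, 20.0, 30.0, 40.0]
avg_logps = [-1.2, -1.0, -0.9, -0.5]
rho = spearman(lengths, avg_logps)
```

In practice a library routine such as `scipy.stats.spearmanr` would normally be used; the hand-rolled version is shown only to keep the example dependency-free.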
4.3 The Impact of Target Reward Margin in SimPO
Influence of $\gamma$ on reward accuracy and win rate. We investigate how the target reward margin $\gamma$ in SimPO affects the reward accuracy on a held-out set and win rate on AlpacaEval 2, presenting the results in Figure 3a. Reward accuracy is measured as the percentage of preference pairs where the winning response receives a higher reward than the losing response (i.e., $r(x, y_w) > r(x, y_l)$). We observe that reward accuracy increases with $\gamma$ on both benchmarks, indicating that enforcing a larger target reward margin effectively improves reward accuracy. However, the win rate on AlpacaEval 2 first increases and then decreases with $\gamma$, suggesting that generation quality is not solely determined by the reward margin.
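Reward accuracy as defined above reduces to a simple count over preference pairs. A minimal sketch, with a hypothetical function name and made-up reward values:

```python
def reward_accuracy(rewards_w: list[float], rewards_l: list[float]) -> float:
    """Fraction of pairs whose winning response has the strictly higher reward."""
    pairs = list(zip(rewards_w, rewards_l))
    if not pairs:
        raise ValueError("need at least one preference pair")
    return sum(r_w > r_l for r_w, r_l in pairs) / len(pairs)


# Toy rewards (assumed): the second pair is ranked incorrectly.
accuracy = reward_accuracy([1.0, 0.2, -0.5, 0.9], [0.5, 0.4, -1.0, 0.1])
```

Ties are counted as incorrect here because the comparison is strict; that is an assumption of this sketch rather than something the text specifies.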
Impact of $\gamma$ on the reward distribution. We visualize the distribution of the learned reward margin $r(x, y_w) - r(x, y_l)$ and the reward of winning responses $r(x, y_w)$ under varying $\gamma$ values in Figure 3b and Figure 3c. Notably, increasing $\gamma$ tends to flatten both distributions and reduce the average log likelihood of winning sequences. This initially improves performance but can eventually lead to model degeneration. We hypothesize that there is a trade-off between accurately approximating the true reward distribution and maintaining a well-calibrated likelihood when setting the $\gamma$ value. Further exploration of this balance is deferred to future work.
4.4 In-Depth Analysis of DPO vs. SimPO
In this section, we compare SimPO to DPO in terms of (1) likelihood-length correlation, (2) reward formulation, (3) reward accuracy, and (4) algorithm efficiency. We demonstrate that SimPO outperforms DPO in terms of reward accuracy and efficiency.
DPO reward implicitly facilitates length normalization. Although the DPO reward expression (with the partition function excluded) lacks an explicit term for length normalization, the logarithmic ratio between the policy model and the reference model can serve to implicitly counteract length bias. As shown in Table 6 and Figure 4a, employing DPO reduces the Spearman correlation coefficient between average log likelihood and response length compared to the approach without any length normalization (referred to as “SimPO w/o LN”). However, it still exhibits a stronger positive correlation when compared to SimPO.
Figure 4: Comparison between SimPO and DPO, measured on UltraFeedback. (a) Spearman correlation between average log probability and response length for DPO. (b) Contingency table of rankings based on DPO rewards and the average log likelihood (measured on the training set). (c) Reward accuracy of DPO and SimPO.
Figure 5: Comparison between SimPO and DPO (continued). (a) With different $\beta$ in DPO and SimPO, KL divergence from the policy model to the reference model on winning responses $y_w$. (b) AlpacaEval 2 LC win rate of DPO and SimPO with different $\beta$. (c) Runtime and memory usage for DPO and SimPO.
Table 6: Spearman correlation between average log likelihood of different models and response length on a held-out set.
DPO reward mismatches generation likelihood. There is a divergence between DPO’s reward formulation, $r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$, and the average log likelihood metric, $p_\theta(y \mid x) = \frac{1}{|y|} \log \pi_\theta(y \mid x)$, which directly impacts generation. As shown in Figure 4b, among the instances on the UltraFeedback training set where $r_\theta(x, y_w) > r_\theta(x, y_l)$, almost half of the pairs have $p_\theta(y_w \mid x) < p_\theta(y_l \mid x)$. In contrast, SimPO directly employs the average log likelihood (scaled by $\beta$) as the reward expression, thereby eliminating the discrepancy completely, as demonstrated in Figure 6b.
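A contingency table like the one described for Figure 4b can be tallied as follows. This is a sketch with hypothetical names and toy numbers; it only shows how agreement between the DPO reward ranking and the average log likelihood ranking might be counted, and does not reproduce the paper's measurements.

```python
from collections import Counter


def ranking_contingency(dpo_rewards_w: list[float], dpo_rewards_l: list[float],
                        avg_logp_w: list[float], avg_logp_l: list[float]) -> Counter:
    """Count pairs by (reward ranks winner first, likelihood ranks winner first)."""
    table: Counter = Counter()
    for r_w, r_l, p_w, p_l in zip(dpo_rewards_w, dpo_rewards_l,
                                  avg_logp_w, avg_logp_l):
        table[(r_w > r_l, p_w > p_l)] += 1
    return table


def likelihood_agreement(table: Counter) -> float:
    """Among reward-correct pairs, the share that is also likelihood-correct."""
    reward_correct = table[(True, True)] + table[(True, False)]
    if reward_correct == 0:
        raise ValueError("no pair satisfies the reward ranking")
    return table[(True, True)] / reward_correct


# Toy values (assumed): one pair falls in each cell of the 2x2 table.
table = ranking_contingency(
    [1.0, 0.5, -0.2, 0.3], [0.2, 0.1, 0.4, 0.6],       # DPO rewards (w, l)
    [-0.8, -1.1, -0.9, -1.5], [-1.0, -0.7, -1.2, -1.0],  # avg log likelihoods (w, l)
)
agreement = likelihood_agreement(table)
```

For a reward defined as a scaled average log likelihood, the two rankings coincide by construction, so every pair lands on the diagonal of this table.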
DPO lags behind SimPO in terms of reward accuracy. In Figure 4c, we compare the reward accuracy of SimPO and DPO, assessing how well their final learned rewards align with preference labels on a held-out set. SimPO consistently achieves higher reward accuracy than DPO, suggesting that our reward design facilitates better generalization and leads to higher quality generations.
KL divergence of SimPO and DPO. In Figure 5a, we present the KL divergence between the policy models trained with DPO and SimPO and the reference model under different $\beta$ values, measured on the winning responses from a held-out set during training. Figure 5b shows the corresponding AlpacaEval 2 LC win rate. Although SimPO does not apply any form of regularization against the reference model, the KL divergence of SimPO is reasonably small. Increasing $\beta$ reduces the KL divergence for both DPO and SimPO, with DPO exhibiting a more pronounced reduction at higher $\beta$ values. In this particular setting (Mistral-Base), Figure 5b demonstrates that a smaller $\beta$ can improve AlpacaEval 2 performance, despite the higher KL divergence. We hypothesize that when the reference model is weak, strictly constraining the policy model to the reference model may not be beneficial. As a caveat, while we did not observe any training collapse or degeneration with proper tuning, in principle, SimPO could potentially lead to reward hacking without explicit regularization against the reference model. In such a scenario, the model might achieve a low loss but degenerate.
SimPO is more memory and compute-efficient than DPO. Another benefit of SimPO is its efficiency as it does not use a reference model. Figure 5c illustrates the overall run time and per-GPU peak memory usage of SimPO and DPO in the Llama-3-Base setting using 8×H100 GPUs. Compared to a vanilla DPO implementation, SimPO cuts run time by roughly 20% and reduces GPU memory usage by about 10%, thanks to eliminating forward passes with the reference model.
Reinforcement learning from human feedback. RLHF is a technique that aligns large language models with human preferences and values [18, 102, 62, 7]. The classical RLHF pipeline typically comprises three phases: supervised fine-tuning [101, 76, 33, 21, 48, 25, 82, 15, 86], reward model training [32, 60, 16, 56, 37, 50], and policy optimization [70, 4]. Proximal Policy Optimization (PPO) [70] is a widely used algorithm in the third stage of RLHF. The RLHF framework is also widely applied to various applications, such as mitigating toxicity [3, 49, 97], ensuring safety [24], enhancing helpfulness [78, 83], searching and navigating the web [61], and improving model reasoning abilities [36]. Recently, [13] has highlighted challenges across the whole RLHF pipeline from preference data collection to model training. Further research has also demonstrated that RLHF can lead to biased outcomes, such as verbose outputs from the model [28, 71, 85].
Offline vs. iterative preference optimization. Given that online preference optimization algorithms are complex and difficult to optimize [100, 69], researchers have been exploring more efficient and simpler alternative offline algorithms. Direct Preference Optimization (DPO) [66] is a notable example. However, the absence of an explicit reward model in DPO limits its ability to sample preference pairs from the optimal policy. To address this, researchers have explored augmenting preference data using a trained SFT policy [96] or a refined SFT policy with rejection sampling [59], enabling the policy to learn from data generated by the optimal policy. Further studies have extended this approach to an iterative training setup, by continuously updating the reference model with the most recent policy model or generating new preference pairs at each iteration [27, 46, 67, 87, 92]. In this work, we focus exclusively on offline settings, avoiding any iterative training processes.
Preference optimization objectives. A variety of preference optimization objectives have been proposed besides DPO. Ranking objectives allow for comparisons among more than two instances [26, 58, 72, 91]. Another line of work explores simpler preference optimization objectives that do not rely on a reference model [42, 89], similar to SimPO. [8] proposes a method to jointly optimize instructions and responses, finding it effectively improves DPO. [98] focuses on post-training extrapolation between the SFT and the aligned model to further enhance model performance. In this work, we compare SimPO to a series of offline algorithms, including RRHF [91], SLiC-HF [96], DPO [66], IPO [6], CPO [88], KTO [29], ORPO [42], and R-DPO [64], and find that SimPO can outperform them in both efficiency and performance. Recently, [75] proposed a generalized preference optimization framework unifying different offline algorithms, and SimPO can be seen as a special case.
In this work, we propose SimPO, a simple and effective preference optimization algorithm that consistently outperforms existing approaches across various training setups. By aligning the reward function with the generation likelihood and introducing a target reward margin, SimPO eliminates the need for a reference model and achieves strong performance without exploiting the length bias. Extensive analysis demonstrates that the key designs in SimPO are crucial and validates the efficiency and effectiveness of SimPO. A detailed discussion of the limitations can be found in Appendix A.
The authors would like to thank Li Dong, Tianyu Gao, Tanya Goyal, Di Jin, Yuchen Lin, Kaifeng Lyu, Sadhika Malladi, Eric Mitchell, Lewis Tunstall, Haoxiang Wang, Wei Xiong, Zhen Xu, Libing Yang, Zhiyu Zhao, and members of the Princeton NLP group for their valuable feedback and discussions. We thank Niklas Muennighoff for his advice on training and reproducing training KTO models. We thank Haoran Xu for helping verify our CPO runs. Mengzhou Xia is supported by an Apple Scholars in AIML Fellowship. This research is also funded by the National Science Foundation (IIS-2211779) and a Sloan Research Fellowship.
[1] Alan Agresti. Categorical data analysis, volume 792. John Wiley & Sons, 2012.
[2] AI@Meta. Llama 3 model card. 2024.
[3] Afra Amini, Tim Vieira, and Ryan Cotterell. Direct preference optimization with an offset. arXiv preprint arXiv:2402.10571, 2024.
[4] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. Advances in neural information processing systems, 30, 2017.
[5] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, John Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Christopher Olah, and Jared Kaplan. A general language assistant as a laboratory for alignment. ArXiv, abs/2112.00861, 2021.
[6] Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences. ArXiv, abs/2310.12036, 2023.
[7] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
[8] Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, and Aditya Grover. Comparing bad apples to good oranges: Aligning large language models via joint preference optimization. arXiv preprint arXiv:2404.00530, 2024.
[9] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open LLM leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
[10] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, 1992.
[11] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39:324, 1952.
[12] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.
[13] Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
[14] Angelica Chen, Sadhika Malladi, Lily H Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, and Kyunghyun Cho. Preference learning algorithms do not learn preference rankings. NeurIPS, 2024.
[15] Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. AlpaGasus: Training a better Alpaca with fewer data. In ICLR, 2024.
[16] Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, and Bryan Catanzaro. ODIN: Disentangled reward mitigates hacking in RLHF. arXiv preprint arXiv:2402.07319, 2024.
[17] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132, 2024.
[18] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
[19] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge. ArXiv, abs/1803.05457, 2018.
[20] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[21] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023.
[22] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273–297, 1995.
[23] Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback: Boosting language models with high-quality feedback. In ICML, 2024.
[24] Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe RLHF: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773, 2023.
[25] Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In EMNLP, 2023.
[26] Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, SHUM KaShun, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023.
[27] Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863, 2024.
[28] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. ArXiv, abs/2404.04475, 2024.
[29] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model alignment as prospect theoretic optimization. ArXiv, abs/2402.01306, 2024.
[30] Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, 2018.
[31] David Firth and Heather Turner. Bradley-terry models in R: the BradleyTerry2 package. Journal of Statistical Software, 48(9), 2012.
[32] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.
[33] Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A dialogue model for academic research. Blog post, April, 1:6, 2023.
[34] Ulrich Germann. Greedy decoding for statistical machine translation in almost linear time. In NAACL, 2003.
[35] Alex Graves. Sequence transduction with recurrent neural networks. ArXiv, abs/1211.3711, 2012.
[36] Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, and Roberta Raileanu. Teaching large language models to reason with reinforcement learning. arXiv preprint arXiv:2403.04642, 2024.
[37] Alex Havrilla, Sharath Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, and Roberta Raileanu. GLoRe: When, where, and how to improve LLM reasoning via global and local refinements. arXiv preprint arXiv:2402.10963, 2024.
[38] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
[39] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019.
[40] Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1638–1649, 2018.
[41] Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface form competition: Why the highest probability answer isn’t always right. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7038–7051, 2021.
[42] Jiwoo Hong, Noah Lee, and James Thorne. ORPO: Monolithic preference optimization without reference model. ArXiv, abs/2403.07691, 2024.
[43] Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. ArXiv, abs/2307.04657, 2023.
[44] Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. ArXiv, abs/2310.06825, 2023.
[45] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In ACL, 2023.
[46] Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, and Chanjun Park. sDPO: Don’t use your data all at once. ArXiv, abs/2403.19270, 2024.
[47] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[48] Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Minh Nguyen, Oliver Stanley, Richárd Nagyfi, et al. Openassistant conversations-democratizing large language model alignment. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
[49] Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. In International Conference on Machine Learning, pages 17506–17533. PMLR, 2023.
[50] Nathan Lambert, Valentina Pyatkin, Jacob Daniel Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Raghavi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hanna Hajishirzi. RewardBench: Evaluating reward models for language modeling. ArXiv, abs/2403.13787, 2024.
[51] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.
[52] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, 2012.
[53] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. Deep reinforcement learning for dialogue generation. In Jian Su, Kevin Duh, and Xavier Carreras, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Austin, Texas, November 2016. Association for Computational Linguistics.
[54] Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The Arena-Hard pipeline, April 2024.
[55] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
[56] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
[57] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In ACL, pages 3214–3252, 2022.
[58] Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, et al. LiPO: Listwise preference optimization through learning-to-rank. arXiv preprint arXiv:2402.01878, 2024.
[59] Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J Liu, and Jialu Liu. Statistical rejection sampling improves preference optimization. In The Twelfth International Conference on Learning Representations, 2024.
[60] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
[61] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
[62] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with human feedback. In NeurIPS, 2022.
[63] Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228, 2024.
[64] Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. ArXiv, abs/2403.19159, 2024.
[65] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[66] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2023.
[67] Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, and Tengyang Xie. Direct nash optimization: Teaching language models to self-improve with general preferences. ArXiv, abs/2404.03715, 2024.
[68] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[69] Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, and Yelong Shen. Efficient RLHF: Reducing the memory usage of PPO. arXiv preprint arXiv:2309.00754, 2023.
[70] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[71] Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in RLHF. arXiv preprint arXiv:2310.03716, 2023.
[72] Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. Preference ranking optimization for human alignment. In AAAI, 2024.
[73] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
[74] Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, et al. Understanding the performance gap between online and offline alignment algorithms. arXiv preprint arXiv:2405.08448, 2024.
[75] Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, and Bilal Piot. Generalized preference optimization: A unified approach to offline alignment. arXiv preprint arXiv:2402.05749, 2024.
[76] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023.
[77] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
[78] Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Finetuning language models for factuality. In The Twelfth International Conference on Learning Representations, 2024.
[79] Hoang Tran, Chris Glaze, and Braden Hancock. Iterative DPO alignment. Technical report, Snorkel AI, 2023.
[80] Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of LM alignment. ArXiv, abs/2310.16944, 2023.
[81] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zi-Han Lin, Yuk-Kit Cheng, Sanmi Koyejo, Dawn Xiaodong Song, and Bo Li. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. In NeurIPS, 2023.
[82] Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. OpenChat: Advancing open-source language models with mixed-quality data. In ICLR, 2024.
[83] Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, and Tong Zhang. Arithmetic control of LLMs for diverse user preferences: Directional preference alignment with multi-objective rewards. ArXiv, abs/2402.18571, 2024.
[84] Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. arXiv preprint arXiv:2406.12845, 2024.
[85] Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? exploring the state of instruction tuning on open resources. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
[89] Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682, 2023.
[90] Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is DPO superior to PPO for LLM alignment? a comprehensive study. arXiv preprint arXiv:2404.10719, 2024.
[91] Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. RRHF: Rank responses to align language models with human feedback. In NeurIPS, 2023.
[92] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.
[93] Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. WildBench: Benchmarking LLMs with challenging tasks from real users in the wild. arXiv e-prints, pages arXiv–2406, 2024.
[94] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguistics.
[95] Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations, 2024.
[96] Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. SLiC-HF: Sequence likelihood calibration with human feedback. ArXiv, abs/2305.10425, 2023.
[97] Chujie Zheng, Pei Ke, Zheng Zhang, and Minlie Huang. Click: Controllable text generation with sequence likelihood contrastive learning. In Findings of ACL, 2023.
[98] Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, and Nanyun Peng. Weak-to-strong extrapolation expedites alignment. arXiv preprint arXiv:2404.16792, 2024.
[99] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS Datasets and Benchmarks Track, 2023.
[100] Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, et al. Secrets of RLHF in large language models part I: PPO. arXiv preprint arXiv:2307.04964, 2023.
[101] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. LIMA: Less is more for alignment. NeurIPS, 2023.
[102] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
More in-depth theoretical analysis. Despite the empirical success and intuitive motivation of SimPO, a more rigorous theoretical analysis is necessary to fully understand the factors contributing to its effectiveness. SimPO also introduces an additional hyperparameter, the target reward margin, which requires manual tuning. Future work could explore how to determine the optimal margin automatically and provide a more theoretical understanding of SimPO.
Safety and honesty. SimPO is designed to optimize the generation quality of language models by pushing the margin between the average log likelihood of the winning response and the losing response to exceed a target reward margin. However, it does not explicitly consider safety and honesty aspects, which are crucial for real-world applications. Future work should explore integrating safety and honesty constraints into SimPO to ensure that the generated responses are not only high-quality but also safe and honest. The dataset used in this work, UltraFeedback [23], primarily focuses on helpfulness, and future research may consider a more comprehensive study utilizing larger-scale preference datasets [43, 95] and evaluation benchmarks [81] that place a strong emphasis on safety aspects. Nonetheless, we observe that this method consistently achieves high TruthfulQA [57] performance compared to other objectives in Table 9, suggesting its potential for safety alignment.
Performance drop on math. We observed that preference optimization algorithms generally decrease downstream task performance, particularly on reasoning-heavy tasks like GSM8K, as shown in Table 9. SimPO occasionally results in performance comparable to or worse than DPO. We hypothesize that this may be related to the choice of training datasets, the hyperparameters used for training, or a mismatch of chat templates used for downstream task evaluations. One explanation is that the preference optimization objective may not be effectively increasing the likelihood of preferred sequences despite increasing the reward margin. Pal et al. [63] first observed this phenomenon and pointed out that it can hinder learning from math preference pairs where changing one token can flip the label (e.g., changing 2 + 2 = 4 to 2 + 2 = 5). They propose a simple regularization strategy that adds a reference-model-calibrated supervised fine-tuning loss back to the preference optimization objective, which effectively mitigates this issue. Future work may consider integrating this regularization strategy into SimPO to improve performance on reasoning-heavy tasks.
We find that hyperparameter tuning is crucial for achieving optimal performance of preference optimization methods. However, the importance of careful hyperparameter tuning may have been underestimated in prior research, potentially leading to suboptimal baseline results. To ensure a fair comparison, we conduct thorough hyperparameter tuning for all methods compared in our experiments.
General training hyperparameters. For the Base training setups, we train SFT models using the UltraChat-200k dataset [25] with the following hyperparameters: a learning rate of 2e-5, a batch size of 128, a max sequence length of 2048, and a cosine learning rate schedule with 10% warmup steps for 1 epoch. All the models are trained with an Adam optimizer [47].
For the preference optimization stage, we conduct preliminary experiments to search for batch sizes in [32, 64, 128] and training epochs in [1, 2, 3]. We find that a batch size of 128 and a single training epoch generally yield the best results across all methods. Therefore, we fix these values for all preference optimization experiments. Additionally, we set the max sequence length to 2048 and apply a cosine learning rate schedule with 10% warmup steps on the preference optimization dataset.
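The schedule described above can be sketched as follows. This is a minimal illustration, not the training code used in the experiments: the function name `cosine_lr_with_warmup` is hypothetical, and the linear shape of the warmup and the decay to zero are assumptions, since the text only specifies a cosine schedule with 10% warmup steps.

```python
import math


def cosine_lr_with_warmup(step, total_steps, peak_lr, warmup_ratio=0.1, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step.

    Assumptions (not stated in the text): warmup is linear, and the cosine
    phase decays from peak_lr down to min_lr (zero by default).
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Illustrative values only: 100 optimizer steps and a peak learning rate of 1e-6.
schedule = [cosine_lr_with_warmup(s, total_steps=100, peak_lr=1e-6) for s in range(100)]
```

With these illustrative values, the first 10 steps ramp up to the peak learning rate and the remaining 90 steps follow the cosine decay.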
Method-specific training hyperparameters. We have noticed that the optimal learning rate varies across preference optimization methods and greatly influences benchmark performance. Therefore, we individually search the learning rates in the range of [3e-7, 5e-7, 6e-7, 1e-6] for each method. Table 7 shows the method-specific hyperparameter search ranges for the baselines. Table 8 shows SimPO’s hyperparameters used under each setting.
Table 7: Various preference optimization objectives and hyperparameter search range.
Figure 6: (a) Likelihood-length correlation plot for Mistral-SFT fine-tuned on UltraChat-200k. (b) Contingency table rankings based on SimPO rewards and the average log likelihood (measured on the training set).
Decoding hyperparameters. For AlpacaEval 2, we use a sampling decoding strategy to generate responses, with a temperature of 0.7 for the Mistral-Base setting, a temperature of 0.5 for the Mistral-Instruct setting following Snorkel-Mistral-PairRM-DPO, and a temperature of 0.9 for both Llama 3 settings. For Arena-Hard, we use the default greedy decoding for all settings and methods. For MT-Bench, we follow the official decoding configuration, which defines different sampling temperatures for different categories.
Computation environment. All the training experiments in this paper were conducted on 8×H100 GPUs based on the alignment-handbook repo.
We present the standard deviation of AlpacaEval 2 and the 95% confidence interval of Arena-Hard in Table 10. All these metrics are reasonable and do not exhibit any significant outliers or instability.
Length normalization decreases generation length and improves generation quality. Removing length normalization from the SimPO objective results in an approach similar to Contrastive Preference Optimization (CPO) [88], which interpolates reward maximization with a supervised fine-tuning loss and has demonstrated strong performance in machine translation. However, without the supervised fine-tuning loss, the reward maximization objective without length normalization is suboptimal in preference optimization.
We analyze the generation length of models trained with or without length normalization on AlpacaEval 2 and Arena-Hard. As shown in Figure 6, in most cases length normalization decreases the generation length by up to 25% compared to training without it. Even though their generations are shorter, the models trained with length normalization consistently achieve much higher win rates on both benchmarks. This suggests that length normalization can effectively control the verbosity of the generated responses while also improving generation quality.
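To make the role of length normalization concrete, the sketch below computes a sequence-level reward from per-token log probabilities, either averaged (as in SimPO) or summed (length normalization removed). The function name `sequence_reward` and the token log probabilities are hypothetical; the example only shows that the two rewards can rank the same pair of responses differently and says nothing about which ranking is correct.

```python
def sequence_reward(token_logps, beta=1.0, length_normalize=True):
    """Reward of one response from its per-token log probabilities.

    With length_normalize=True this is beta times the average token log
    probability; otherwise it is beta times the summed log probability.
    """
    total = sum(token_logps)
    if length_normalize:
        return beta * total / len(token_logps)
    return beta * total


# Made-up numbers for illustration: a short response with less likely tokens
# and a long response with more likely tokens.
short_response = [-0.5] * 4    # sum = -2.0, average = -0.5
long_response = [-0.2] * 20    # sum is about -4.0, average is about -0.2

summed = (sequence_reward(short_response, length_normalize=False),
          sequence_reward(long_response, length_normalize=False))
averaged = (sequence_reward(short_response), sequence_reward(long_response))
```

Under the summed reward the short response scores higher, while under the averaged reward the long response does, so the choice of normalization changes which response the objective favors.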
Table 9: Downstream task evaluation results of tasks on the huggingface open leaderboard.
Length is not a reliable indicator of generation quality. We further analyze the generation length of models trained with different methods on AlpacaEval 2 and Arena-Hard, as shown in Table 10. Generally, we find that no single method consistently generates longer or shorter responses across all settings. Additionally, even though some methods may generate longer responses, they do not necessarily achieve better win rates on the benchmarks. This indicates that the length of the generated responses is not a reliable indicator of generation quality.
SimPO demonstrates minimal exploitation of response length. We observe that SimPO has a shorter generation length than DPO in the Llama-3-Instruct case but a longer generation length in other settings, with up to 26% longer responses on AlpacaEval 2. In contrast, SimPO increases length by only around 5% on Arena-Hard compared to DPO. It is fair to say that the generation length heavily depends on the evaluation benchmark. A stronger indicator is that SimPO consistently achieves a higher length-controlled win rate than raw win rate on AlpacaEval 2, demonstrating minimal exploitation of response length.
The two weighting terms represent the gradient weights in SimPO and DPO, respectively. The differences are twofold: (1) comparing the gradient weights, SimPO’s gradient weight does not involve the reference model and has a straightforward interpretation: the weights will be higher for samples where the policy model incorrectly assigns higher likelihood to the losing response than to the winning response; (2) comparing the gradient updates, SimPO’s gradients on the winning and losing responses are length-normalized, while DPO’s are not. This corresponds to the empirical finding [64] that DPO may exploit length bias: longer sequences with more tokens will receive larger gradient updates in DPO, dominating the training process.
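The comparison of gradient weights can be illustrated numerically. The sketch below assumes the standard forms of the two objectives, which are not written out in this section: a SimPO weight of sigmoid(β/|y_l| · log π(y_l|x) − β/|y_w| · log π(y_w|x) + γ) and a DPO weight of sigmoid(β · log(π(y_l|x)/π_ref(y_l|x)) − β · log(π(y_w|x)/π_ref(y_w|x))). The function names and all numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simpo_gradient_weight(logp_w, len_w, logp_l, len_l, beta, gamma):
    """Assumed SimPO weight: depends only on the policy's length-normalized
    log likelihoods of the winning (w) and losing (l) responses."""
    return sigmoid(beta * logp_l / len_l - beta * logp_w / len_w + gamma)


def dpo_gradient_weight(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Assumed DPO weight: depends on policy-to-reference log ratios and
    applies no length normalization."""
    return sigmoid(beta * (logp_l - ref_logp_l) - beta * (logp_w - ref_logp_w))


# Policy wrongly prefers the losing response (average log prob -1 vs. -2).
wrong = simpo_gradient_weight(logp_w=-20.0, len_w=10, logp_l=-10.0, len_l=10,
                              beta=2.0, gamma=0.5)
# Policy correctly prefers the winning response.
right = simpo_gradient_weight(logp_w=-10.0, len_w=10, logp_l=-20.0, len_l=10,
                              beta=2.0, gamma=0.5)
```

Here `wrong` is larger than `right`, matching the interpretation that samples the policy currently ranks incorrectly receive a larger weight. Under the assumed DPO form, the weight is exactly 0.5 whenever the policy equals the reference model, regardless of how the policy ranks the pair.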
We present the win rate heatmaps of Mistral-Base and Mistral-Instruct on AlpacaEval 2 and Arena-Hard in Figure 7 and Figure 8, respectively. Based on this analysis, we present qualitative examples of responses generated by a SimPO model, a DPO model, and the baseline model GPT-4-1106-Preview on AlpacaEval 2.
Figure 9: An AlpacaEval 2 generation from the Mistral-Base model after training with DPO.
Table 10: Detailed results of AlpacaEval 2 and Arena-Hard. LC means length-controlled win rate, WR means raw win rate, and STD means standard deviation of win rate. Length is the average generation length. For Arena-Hard, we report the win rate and 95% confidence interval.
Table 11: Average response lengths on AlpacaEval 2 and Arena-Hard trained with Mistral-Base or Mistral-Instruct.
Figure 10: An AlpacaEval 2 generation from the Mistral-Base model after training with SimPO. Compared to the output generated by the DPO model, as shown in Figure 9, the generation by SimPO is better structured with hierarchical discussions, making the information more clearly presented and readable.
Figure 11: A case study on AlpacaEval 2 demonstrates that Llama-3-Instruct, trained with SimPO, provides a better formatted and more detailed answer than both Llama-3-Base, also trained with SimPO, and the baseline model GPT-4-1106-Preview. This illustrates how the instruction setting typically outperforms the base setting.
Table 12: Results of Llama-3-Instruct (8B) setting, utilizing preference labels annotated by a stronger reward model (ArmoRM [84], we term it as version 0.2).
Table 13: Downstream task evaluation results of tasks on the huggingface open leaderboard.
In this section, we update the Llama-3-Instruct setting, primarily by utilizing a stronger reward model to annotate our generated preference dataset.
Enhanced reward model yields significantly better results. In our previous version, we used PairRM [45] as our reward model to rank generated candidate responses. The results, presented in Table 12, show that switching the reward model from PairRM [45] to ArmoRM [84] for ranking the data markedly improves model performance. This underscores the importance of a high-quality preference optimization dataset for enhancing performance. Notably, SimPO achieves a 53.7 LC win rate on AlpacaEval 2 and 36.5 on Arena-Hard, surpassing the previous version by 9.0 and 2.7 points, respectively.
We use the following hyperparameters for SimPO under the Llama-3 Instruct v0.2 setting: and . The other hyperparameters (e.g., learning rate, batch size, max sequence lengths) are kept the same as the original Llama-3-8B-Instruct setting.
Strong SFT model and high-quality policy data diminish algorithm differences. With a strong SFT model like Llama-3-8B-Instruct, and as the preference optimization data quality improves, the differences between algorithms become less pronounced. For instance, DPO achieves a raw win rate similar to SimPO's, and DPO, IPO, and R-DPO all exhibit comparable raw win rates on Arena-Hard. However, SimPO maintains an advantage by producing shorter sequences, resulting in a significantly better LC win rate on AlpacaEval 2.
Stronger downstream task performance. The v0.2 version also shows improved performance in downstream tasks across various objectives. However, DPO, IPO, R-DPO, and SimPO continue to experience a decline in reasoning-intensive domains such as GSM8K. In contrast, objectives that include an SFT component maintain their performance in mathematical tasks.
Incorporating SFT regularization in SimPO. Several reference-free algorithms, including RRHF [91], SLiC-HF [96], CPO [88], and ORPO [42], employ SFT regularization in their objectives. SFT regularization can be an effective method to prevent reward hacking, ensuring that the solution maintains low loss without resulting in degraded generations. We also experiment with the integration of an SFT loss in SimPO, yielding the following objective:
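One plausible way to write such an objective is sketched below. It assumes the SimPO loss takes the form described in the introduction (length-normalized rewards scaled by β, with a target reward margin γ) and adds a log-likelihood term on the winning response weighted by a coefficient λ. The symbol λ and the exact placement of the SFT term are assumptions made for illustration.

```latex
\mathcal{L}_{\text{SimPO w/ SFT}}(\pi_\theta)
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[
      \log \sigma\Big(
        \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
        - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)
        - \gamma
      \Big)
      + \lambda \log \pi_\theta(y_w \mid x)
    \Big]
```

Setting λ = 0 in this form recovers the SimPO objective without SFT regularization.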
As shown in Table 14, the addition of the SFT regularization leads to a decrease in performance on AlpacaEval 2. However, we note that SFT regularization provides substantial benefits to certain tasks such as GSM8K, as shown in Table 12. These contrasting results suggest that the impact of SFT in preference optimization may vary depending on the training setup and the nature of the task. Further comprehensive studies on this topic are left for future research.
Since the release of the paper, we have had inquiries from researchers about whether the key design elements of SimPO (length normalization and target reward margin) could benefit DPO. Applying each element to DPO separately gives the following two objectives:
An intuitive understanding of how length normalization could benefit DPO is that, despite DPO’s reward design being implicitly normalized by the reference model, the policy model might still exploit length bias from the data, resulting in a disproportionately high probability for longer sequences. Applying length normalization could help mitigate this effect.
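The two DPO variants discussed above can be sketched as a single per-example function with two switches. This is an illustrative reading of the description, not the authors' implementation: the function name `dpo_variant_loss` and its arguments are hypothetical, and length normalization is assumed to divide each policy-to-reference log ratio by the length of the corresponding response.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_variant_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
                     len_w: int, len_l: int, beta: float,
                     gamma: float = 0.0,
                     length_normalize: bool = False) -> float:
    """Per-example DPO loss with optional SimPO-style modifications.

    pol_*, ref_*: summed token log probabilities of the winning (w) and
                  losing (l) responses under the policy and reference models.
    gamma:        target reward margin (0.0 gives no explicit margin).
    length_normalize: if True, divide each log ratio by the response length
                  (assumed form of length normalization for DPO).
    """
    # DPO implicit reward: log ratio between policy and reference model.
    ratio_w = pol_w - ref_w
    ratio_l = pol_l - ref_l
    if length_normalize:
        ratio_w = ratio_w / len_w
        ratio_l = ratio_l / len_l
    return -log_sigmoid(beta * (ratio_w - ratio_l) - gamma)
```

With `gamma=0.0` and `length_normalize=False` the function reduces to the standard DPO loss, so each modification can be toggled independently.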
We train models with the objectives mentioned above and compare their performance to that of DPO and SimPO, as shown in Table 15.
The results indicate that, unlike SimPO, length normalization and target reward margin do not consistently benefit DPO. Specifically, length normalization significantly improves DPO performance only in the Mistral-Base setting, where the preference optimization dataset shows a strong length bias. However, it does not provide a benefit in the Mistral-Instruct setting, where the lengths of winning and losing responses are comparable. This is likely because DPO already includes an implicit instance-wise target reward margin via the reference model, as shown in the derivation below.
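The algebra behind the implicit margin is short. The following is a reconstruction from the standard DPO objective, not the authors' typeset derivation, and the symbol $\gamma_{\text{ref}}$ is introduced here only for illustration. Separating the policy and reference log probabilities shows that the reference model contributes a per-example offset that plays the role of a margin:

```latex
\begin{aligned}
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}})
&= -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left(
   \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
 - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
   \right) \right] \\
&= -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left(
   \beta \log \pi_\theta(y_w \mid x) - \beta \log \pi_\theta(y_l \mid x)
 - \gamma_{\text{ref}}(x, y_w, y_l) \right) \right], \\
\gamma_{\text{ref}}(x, y_w, y_l)
&= \beta \left( \log \pi_{\text{ref}}(y_w \mid x) - \log \pi_{\text{ref}}(y_l \mid x) \right).
\end{aligned}
```

Unlike the constant margin in SimPO, this offset varies with each preference pair and is fixed by the reference model rather than chosen as a hyperparameter.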
Table 15: Applying length normalization (LN) and target reward margin (
Performance degradation on other benchmarks for Llama-3-SimPO checkpoints. After releasing the Llama-3-SimPO checkpoints, we received extensive feedback about performance degradation on benchmarks measuring specific capabilities, such as MMLU and GSM8K. To investigate this issue, we continued training the Llama-3-8B-Instruct model with different learning rates, as reported in Table 16. We find that using a higher learning rate results in a stronger model in chat-oriented benchmarks, at the cost of catastrophic forgetting on GSM8K and MMLU. With a smaller learning rate, the model’s performance on chat benchmarks is slightly worse, but its performance on GSM8K and MMLU is better retained. This demonstrates a trade-off between chat-oriented benchmarks and other benchmarks when continuing training from a strong instruction-tuned model.
Table 16: Results on AlpacaEval 2, ZeroEval GSM, and ZeroEval MMLU when continuing training from Llama-3-8B-Instruct with different learning rates. * indicates the released checkpoint.
Applying SimPO to Gemma 2 models presents a different trend. We evaluate SimPO using Google’s recently released Gemma-2-9B-it model [77], a strong open-source model. For training data, we generate up to 5 responses per prompt from the UltraFeedback dataset [23] and use the ArmoRM model [84] to annotate preferences between responses. We compare our SimPO model against a DPO-trained variant, both fine-tuned from the Gemma-2-9B-it base model. As shown in Appendix J, SimPO demonstrates superior performance on chat benchmarks like AlpacaEval 2 and Arena-Hard while maintaining the model’s original zero-shot capabilities on tasks like GSM8K and MMLU. Notably, we find that varying the learning rate during fine-tuning has minimal impact on the model’s performance. These results suggest an underlying difference in properties between the Llama-3 checkpoints and the Gemma 2 checkpoints, which may merit further investigation.
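The data construction step described above (several sampled responses per prompt, scored by a reward model) can be sketched as follows. This is a hypothetical helper, not the released pipeline: the function name `build_preference_pair` is invented here, the scores stand in for reward model outputs, and taking the highest-scoring response as the winner and the lowest-scoring one as the loser is an assumption about how the annotated preferences are turned into pairs.

```python
from typing import Dict, List, Optional


def build_preference_pair(prompt: str, responses: List[str],
                          scores: List[float]) -> Optional[Dict[str, str]]:
    """Turn scored candidate responses into one (chosen, rejected) pair.

    scores[i] is assumed to be a reward model score for responses[i].
    Returns None when all candidates tie, since no preference can be formed.
    """
    if len(responses) != len(scores) or len(responses) < 2:
        raise ValueError("need at least two responses with one score each")
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] == scores[worst]:
        return None  # all scores equal: skip this prompt
    return {
        "prompt": prompt,
        "chosen": responses[best],     # assumed winner: highest score
        "rejected": responses[worst],  # assumed loser: lowest score
    }
```

For example, `build_preference_pair("q", ["a", "b", "c"], [0.1, 0.9, 0.5])` selects `"b"` as the chosen response and `"a"` as the rejected one.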
Gemma-2-9B-it-SimPO significantly improved the ranking of the Gemma-2-9B-it model on Chatbot Arena. During the development stage, we relied solely on automated metrics to evaluate the model’s performance. To determine if these metrics aligned with real user preferences, we submitted our best-performing model, Gemma-2-9B-it-SimPO, to the Chatbot Arena leaderboard hosted by LMSYS [17]. We find that our model improved the original Gemma-2-9B-it ranking from 36th to 25th, making the SimPO variant the top-ranked <10B model on the Chatbot Arena leaderboard based on real user votes as of September 16th, 2024.
Table 17: Benchmark performance of Gemma-2-9B trained with DPO and SimPO on UltraFeedback (responses regenerated with Gemma-2-9B-it, following the same dataset construction process as Llama-3-Instruct (8B) described in Section 3). SimPO results in better instruction following performance than DPO without degrading math abilities (GSM) or general knowledge (MMLU) of the original model. * indicates the released checkpoint.