Merge branch 'main' into push-generation-with-tiny

Fix typos (#4690 )
--- a/docs/source/grpo_trainer.md
+++ b/docs/source/grpo_trainer.md
@@ -137,6 +137,33 @@ $$

 This constant is recommended to be the maximum completion length. To use this formulation, set `loss_type="dr_grpo"` in the [`GRPOConfig`].

 Alternatively, in the [SAPO paper](https://huggingface.co/papers/2511.20347), the Qwen team proposes replacing the "hard" clipping mechanism of GRPO with a smooth, temperature-controlled soft gating mechanism. While GRPO zeroes out gradients when the policy deviates too far from the reference, SAPO uses a soft trust region that smoothly decays the gradient weight. This allows the model to retain useful learning signals from "near-on-policy" tokens while suppressing noise from extreme deviations.

 The loss function is defined as:

 $$
 \mathcal{L}_{\text{SAPO}}(\theta) = - \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} f_{i,t} \left( \frac{\pi_\theta(o_{i,t} | q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,<t})} \right) \hat{A}_{i,t}
 $$

 The soft-gating function  \\( f_{i,t} \\) is defined using the sigmoid function  \\( \sigma \\) as:

 $$
 f_{i,t}(x) = \sigma \left( \tau_{i,t} (x - 1) \right) \cdot \frac{4}{\tau_{i,t}}
 $$

 The temperature  \\( \tau_{i,t} \\) is chosen based on the sign of the advantage  \\( \hat{A}_{i,t} \\):

 $$
 \tau_{i,t} = \begin{cases} 
 \tau_{\text{pos}}, & \text{if } \hat{A}_{i,t} > 0 \\
 \tau_{\text{neg}}, & \text{otherwise}
 \end{cases}
 $$

 The authors recommend using asymmetric temperatures,  \\( \tau_{\text{neg}} > \tau_{\text{pos}} \\) (defaults are  \\( \tau_{\text{pos}}=1.0, \tau_{\text{neg}}=1.05 \\)). This ensures that the model is penalized more strictly for "bad" actions to prevent instability, while being more permissive with "good" actions.

 To use this formulation, set `loss_type="sapo"` in the [`GRPOConfig`].
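
The soft gate above can be explored numerically with a small standalone sketch. The helper names `sapo_gate` and `sapo_gate_grad` are hypothetical and are not part of TRL; the code only evaluates the formula for \\( f_{i,t} \\) and its derivative with respect to the probability ratio, using the default temperatures quoted above.

```python
import math


def sigmoid(z: float) -> float:
    # Plain logistic function; adequate for the small inputs used here.
    return 1.0 / (1.0 + math.exp(-z))


def select_tau(advantage: float, tau_pos: float = 1.0, tau_neg: float = 1.05) -> float:
    # The temperature depends only on the sign of the advantage.
    return tau_pos if advantage > 0 else tau_neg


def sapo_gate(ratio: float, advantage: float) -> float:
    # f(x) = sigmoid(tau * (x - 1)) * 4 / tau
    tau = select_tau(advantage)
    return sigmoid(tau * (ratio - 1.0)) * 4.0 / tau


def sapo_gate_grad(ratio: float, advantage: float) -> float:
    # df/dx = 4 * s * (1 - s), with s = sigmoid(tau * (x - 1)).
    # This is the weight applied to the gradient of the ratio.
    tau = select_tau(advantage)
    s = sigmoid(tau * (ratio - 1.0))
    return 4.0 * s * (1.0 - s)


for ratio in (0.5, 1.0, 1.5, 2.0):
    print(ratio, sapo_gate_grad(ratio, advantage=1.0), sapo_gate_grad(ratio, advantage=-1.0))
```

The gradient weight equals 1 when the ratio is exactly 1 (fully on-policy) and decays smoothly as the ratio moves away from 1. For the same deviation it decays slightly faster with \\( \tau_{\text{neg}} \\) than with \\( \tau_{\text{pos}} \\).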

 ## Logged metrics

 While training and evaluating, we record the following reward metrics:
@@ -159,14 +186,10 @@ While training and evaluating, we record the following reward metrics:
 - `frac_reward_zero_std`: The fraction of samples in the generation batch with a reward std of zero, implying there is little diversity for that prompt (all answers are correct or incorrect).
 - `entropy`: Average entropy of token predictions across generated completions. (If `mask_truncated_completions=True`, tokens of masked sequences are excluded.)
 - `kl`: The average KL divergence between the model and the reference model, calculated over generated completions. Logged only if `beta` is nonzero.
 - `clip_ratio/region_mean`: The ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities where the GRPO objective is clipped to stay within the trust region:
  $$
  \text{clip}\left( r_{i,t}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \qquad r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}\,.
  $$
  A higher value means more tokens are clipped, which constrains how much the policy $\pi_\theta$ can change.
 - `clip_ratio/low_mean`: The average ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the lower bound of the trust region:  \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\)
 - `clip_ratio/low_min`: The minimum ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the lower bound of the trust region:  \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\)
 - `clip_ratio/high_mean`: The average ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the upper bound of the trust region:  \\(r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}\\)
 - `clip_ratio/region_mean`: The ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities where the GRPO objective is clipped to stay within the trust region:  \\( \text{clip}\left( r_{i,t}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \quad r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \\). A higher value means more tokens are clipped, which constrains how much the policy $\pi_\theta$ can change.
 - `clip_ratio/low_mean`: The average ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the lower bound of the trust region:  \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\).
 - `clip_ratio/low_min`: The minimum ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the lower bound of the trust region:  \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\).
 - `clip_ratio/high_mean`: The average ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the upper bound of the trust region:  \\(r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}\\).
 - `clip_ratio/high_max`: The maximum ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the upper bound of the trust region:  \\(r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}\\).
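
To make the clip-ratio definitions above concrete, here is a minimal sketch that computes the clipped fractions for a single batch of probability ratios. The function `clip_fractions` is hypothetical and is not the trainer's implementation: it follows only the inequalities stated in the list and ignores any masking, advantage-sign conditions, or cross-device aggregation the trainer may apply.

```python
def clip_fractions(ratios: list[float], epsilon_low: float, epsilon_high: float) -> dict[str, float]:
    # Fraction of ratios below the lower bound of the trust region.
    low = sum(r < 1.0 - epsilon_low for r in ratios) / len(ratios)
    # Fraction of ratios above the upper bound of the trust region.
    high = sum(r > 1.0 + epsilon_high for r in ratios) / len(ratios)
    # A ratio cannot violate both bounds, so the region fraction is the sum.
    return {"low": low, "high": high, "region": low + high}


ratios = [0.70, 0.95, 1.00, 1.10, 1.40]  # illustrative values only
fractions = clip_fractions(ratios, epsilon_low=0.2, epsilon_high=0.2)
print(fractions)
```

Here one ratio falls below 0.8 and one exceeds 1.2, so two of the five ratios lie outside the trust region.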

 ## Customization
--- a/docs/source/paper_index.md
+++ b/docs/source/paper_index.md
@@ -1,11 +1,38 @@
 # Paper Index

 > [!WARNING]
 > Section under construction. Feel free to contribute!
 > Section under construction. Feel free to contribute! See https://github.com/huggingface/trl/issues/4407.

 ## Group Relative Policy Optimization

 Papers relating to the [`GRPOTrainer`]
 Papers relating to the [`GRPOTrainer`].

 ### DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

 **📜 Paper**: https://huggingface.co/papers/2402.03300

 Introduces Group Relative Policy Optimization (GRPO) and shows strong math-reasoning gains from math-centric pretraining plus group-relative PPO-style optimization. Used in TRL via [`GRPOTrainer`].

 ```python
 from trl import GRPOConfig, GRPOTrainer

 # The paper doesn't specify its hyperparameters, so here we provide hyperparameters from "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning" instead.
 training_args = GRPOConfig(
    loss_type="grpo",
    beta=0.001,  # "the KL coefficient to 0.001"
    epsilon=10.0, # "the GRPO clip ratio ϵ to 10"
    num_generations=16,  # "For each question, we sample 16 outputs..."
    max_completion_length=32_768,  # "...with a maximum length of 32,768"
    steps_per_generation=16,  # "To accelerate training, each rollout generates 8,192 outputs, which are randomly split into 16 minibatches"
    # "resulting in a training batch size of 512". One way to achieve this setting with 1 device is per_device_train_batch_size=4, gradient_accumulation_steps=128
    per_device_train_batch_size=4,
    gradient_accumulation_steps=128,  
 )
 trainer = GRPOTrainer(
    ...,
    args=training_args,
 )
 ```
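
As a quick sanity check of the batch-size arithmetic quoted in the comments above, the following sketch recomputes the numbers. The single-device setup is the assumption stated in the config comment; other combinations of devices, per-device batch size, and accumulation steps reach the same total.

```python
# Numbers quoted in the config comments above.
outputs_per_rollout = 8_192
num_minibatches = 16  # passed as steps_per_generation

# Each rollout is split into 16 minibatches, which gives the training batch size.
train_batch_size = outputs_per_rollout // num_minibatches

# One way to reach that batch size on a single device (assumed here).
num_devices = 1
per_device_train_batch_size = 4
gradient_accumulation_steps = 128
effective_batch_size = num_devices * per_device_train_batch_size * gradient_accumulation_steps

# With 16 completions sampled per question, each rollout covers this many questions.
num_generations = 16
questions_per_rollout = outputs_per_rollout // num_generations

print(train_batch_size, effective_batch_size, questions_per_rollout)
```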

 ### Group Sequence Policy Optimization

@@ -86,7 +113,7 @@ training_args = GRPOConfig(
    per_device_train_batch_size=512, # mini-batch size for training in the paper, DAPO paper: section 4.1
    num_generations=16, # number of sample responses in the paper, DAPO paper: section 4.1
    max_completion_length=20480, #  maximum number of tokens for generation in the paper, DAPO paper: section 4.1
    beta=0.0 # section 2.3, DAPO paper
    beta=0.0, # section 2.3, DAPO paper

 )
 # Soft Overlong Punishment
@@ -411,16 +438,16 @@ from trl import GRPOConfig

 training_args = GRPOConfig(
    ...,
    beta=0.001,  # the paper don't specify the value used, so we use the value from "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning"
    beta=0.001,  # the paper doesn't specify the value used, so we use the value from "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning"
    use_bias_correction_kl=True,
 )
 ```

 ## Direct Policy Optimization

 Papers relating to the [`DPOTrainer`]
 Papers relating to the [`DPOTrainer`].

 ### Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model
 ### Direct Preference Optimization: Your Language Model is Secretly a Reward Model

 **📜 Paper**: https://huggingface.co/papers/2305.18290

@@ -441,7 +468,7 @@ training_args = DPOConfig(

 **📜 Paper**: https://huggingface.co/papers/2310.12036

 A new general objective,  \\( \Psi \\)$PO, bypasses both key approximations in reinforcement learning from human preferences, allowing for theoretical analysis and empirical superiority over DPO. To reproduce the paper's setting, use this configuration: To reproduce the paper's setting, use this configuration:
 A new general objective,  \\( \Psi \\)PO, bypasses both key approximations in reinforcement learning from human preferences, allowing for theoretical analysis and empirical superiority over DPO. To reproduce the paper's setting, use this configuration:

 ```python
 from trl import DPOConfig
@@ -641,6 +668,46 @@ training_args = DPOConfig(

 These parameters only appear in the [published version](https://aclanthology.org/2025.tacl-1.22.pdf)

 ### Statistical Rejection Sampling Improves Preference Optimization

 **📜 Paper**: https://huggingface.co/papers/2309.06657

 Proposes **RSO**, selecting stronger preference pairs via statistical rejection sampling to boost offline preference optimization; complements DPO/SLiC. They also introduce a new loss defined as:

 $$
 \mathcal{L}_{\text{hinge-norm}}(\pi_\theta)
 = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
 \left[
 \max\left(0,\; 1 - \left[\gamma \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \gamma \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right]\right)
 \right]
 $$

 To train with RSO-filtered data and the hinge-norm loss, you can use the following code:

 ```python
 from trl import DPOConfig, DPOTrainer

 dataset = ...

 def rso_accept(example):  # replace with your actual filter/score logic
    return example["rso_keep"]

 train_dataset = dataset.filter(rso_accept)

 training_args = DPOConfig(
    loss_type="hinge",
    beta=0.05,  # corresponds to gamma in the paper
 )

 trainer = DPOTrainer(
    ...,
    args=training_args,
    train_dataset=train_dataset,
 )
 trainer.train()

 ```
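
For intuition, the hinge-norm loss defined above can be evaluated for a single preference pair with a few lines of plain Python. This is an illustrative sketch rather than the `DPOTrainer` implementation: the function name `hinge_norm_loss` and the log-probability values are made up, and the inputs are assumed to be summed sequence log-probabilities.

```python
def hinge_norm_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    gamma: float,
) -> float:
    # Log-ratios between the policy and the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin between the chosen and rejected completions.
    margin = gamma * chosen_logratio - gamma * rejected_logratio
    # Hinge: no loss once the margin reaches 1.
    return max(0.0, 1.0 - margin)


# Policy identical to the reference: zero margin, so the loss is 1.
loss_at_init = hinge_norm_loss(-20.0, -20.0, -20.0, -20.0, gamma=0.05)

# Policy favors the chosen completion strongly enough to reach a margin of 1.
loss_separated = hinge_norm_loss(-10.0, -30.0, -20.0, -20.0, gamma=0.05)

print(loss_at_init, loss_separated)
```

With a small \\( \gamma \\) such as 0.05, the log-ratio gap between the chosen and rejected completions must reach 20 before the loss vanishes.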

 ## Kahneman–Tversky Optimization

 Papers relating to the [`experimental.kto.KTOTrainer`]
@@ -721,6 +788,26 @@ SFTConfig(
 )
 ```

 ## Parameter-Efficient Fine-Tuning (PEFT)

 For general details on using PEFT with TRL, please refer to the [PEFT Integration](peft_integration) guide.

 ### LoRA: Low-Rank Adaptation of Large Language Models

 **📜 Paper**: https://huggingface.co/papers/2106.09685

 Low-Rank Adaptation (LoRA) reduces the number of trainable parameters and GPU memory usage in large-scale pre-trained models while maintaining or improving performance on downstream tasks. TRL integrates LoRA via the [PEFT library](https://huggingface.co/docs/peft/index) and can be easily enabled in any TRL trainer by passing a [`~peft.LoraConfig`] to the `peft_config` argument. Here is an example of using LoRA with the [`SFTTrainer`]:

 ```python
 from trl import SFTTrainer
 from peft import LoraConfig

 trainer = SFTTrainer(
    ...,
    peft_config=LoraConfig(),
 )
 ```

 ## Reinforce Leave-One-Out

 Papers relating to the [`RLOOTrainer`]
@@ -818,9 +905,11 @@ dataset = dataset.map(add_margin)
 ```

 ## Distillation

 Papers relating to training a student model with the help of a teacher model.

 ### On-Policy Distillation

 **📰 Blog**: https://thinkingmachines.ai/blog/on-policy-distillation/

 On-Policy Distillation involves a student model generating rollouts for each batch of training data. We subsequently obtain the probability distributions for each token of the rollouts from both the student and teacher models. The student model is then optimized to minimize the reverse Kullback-Leibler (KL) divergence between its own token distributions and those of the teacher model.
--- a/docs/source/rloo_trainer.md
+++ b/docs/source/rloo_trainer.md
@@ -143,15 +143,10 @@ While training and evaluating, we record the following reward metrics:
 - `frac_reward_zero_std`: The fraction of samples in the generation batch with a reward std of zero, implying there is little diversity for that prompt (all answers are correct or incorrect).
 - `entropy`: Average entropy of token predictions across generated completions. (If `mask_truncated_completions=True`, tokens of masked sequences are excluded.)
 - `kl`: The average KL divergence between the model and the reference model, calculated over generated completions. Logged only if `beta` is nonzero.
 - `clip_ratio/region_mean`: The ratio of sequence probabilities where the RLOO objective is clipped to stay within the trust region:
  $$
  \text{clip}\left( r_{i}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \qquad r_{i}(\theta) = \frac{\pi_\theta(o_{i} \mid q)}{\pi_{\theta_{\text{old}}}(o_{i} \mid q)}\,.
  $$

    A higher value means more samples are clipped, which constrains how much the policy $\pi_\theta$ can change.
 - `clip_ratio/low_mean`: The average ratio of sequence probabilities that were clipped on the lower bound of the trust region:  \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\)
 - `clip_ratio/low_min`: The minimum ratio of sequence probabilities that were clipped on the lower bound of the trust region:  \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\)
 - `clip_ratio/high_mean`: The average ratio of sequence probabilities that were clipped on the upper bound of the trust region:  \\(r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}\\)
 - `clip_ratio/region_mean`: The ratio of sequence probabilities where the RLOO objective is clipped to stay within the trust region:  \\( \text{clip}\left( r_{i}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \quad r_{i}(\theta) = \frac{\pi_\theta(o_{i} \mid q)}{\pi_{\theta_{\text{old}}}(o_{i} \mid q)} \\). A higher value means more samples are clipped, which constrains how much the policy $\pi_\theta$ can change.
 - `clip_ratio/low_mean`: The average ratio of sequence probabilities that were clipped on the lower bound of the trust region:  \\(r_{i}(\theta) < 1 - \epsilon_\mathrm{low}\\).
 - `clip_ratio/low_min`: The minimum ratio of sequence probabilities that were clipped on the lower bound of the trust region:  \\(r_{i}(\theta) < 1 - \epsilon_\mathrm{low}\\).
 - `clip_ratio/high_mean`: The average ratio of sequence probabilities that were clipped on the upper bound of the trust region:  \\(r_{i}(\theta) > 1 + \epsilon_\mathrm{high}\\).
 - `clip_ratio/high_max`: The maximum ratio of sequence probabilities that were clipped on the upper bound of the trust region:  \\(r_{i}(\theta) > 1 + \epsilon_\mathrm{high}\\).

 ## Customization
--- a/examples/datasets/prm800k.py
+++ b/examples/datasets/prm800k.py
@@ -68,7 +68,7 @@ def process_example(example):
                labels = previous_labels[:] + [label]
                outputs.append({"prompt": prompt, "completions": completions, "labels": labels})

        # Now, exapand the previous completions and labels
        # Now, expand the previous completions and labels
        if step["chosen_completion"] is not None:
            chosen_completion = step["completions"][step["chosen_completion"]]
            label = chosen_completion["rating"] == 1
--- a/examples/notebooks/grpo_ministral3_vl.ipynb
+++ b/examples/notebooks/grpo_ministral3_vl.ipynb
@@ -381,7 +381,7 @@
        "            extraction_config=[LatexExtractionConfig()],\n",
        "        )\n",
        "        if len(gold_parsed) == 0:\n",
        "            # Skip unparseable examples\n",
        "            # Skip unparsable examples\n",
        "            correctness.append(True)  # Treat as correct to avoid penalizing\n",
        "            print(\"Failed to parse gold solution: \", sol)\n",
        "            continue\n",
@@ -459,8 +459,8 @@
        "    # Parameters that control the data preprocessing\n",
        "    per_device_train_batch_size=2,\n",
        "    max_completion_length=1024, # default: 256            # Max completion length produced during training\n",
        "    num_generations=2, # 2, # default: 8                  # Number of generations produced during trainig for comparison\n",
        "    max_prompt_length=2048, # default: 512                # Max prompt lenght of the input prompt used for generation during training\n",
        "    num_generations=2, # 2, # default: 8                  # Number of generations produced during training for comparison\n",
        "    max_prompt_length=2048, # default: 512                # Max prompt length of the input prompt used for generation during training\n",
        "\n",
        "    fp16=False,\n",
        "    bf16=False,\n",
--- a/examples/notebooks/grpo_qwen3_vl.ipynb
+++ b/examples/notebooks/grpo_qwen3_vl.ipynb
@@ -327,7 +327,7 @@
        "            extraction_config=[LatexExtractionConfig()],\n",
        "        )\n",
        "        if len(gold_parsed) == 0:\n",
        "            # Skip unparseable examples\n",
        "            # Skip unparsable examples\n",
        "            correctness.append(True)  # Treat as correct to avoid penalizing\n",
        "            print(\"Failed to parse gold solution: \", sol)\n",
        "            continue\n",
@@ -405,7 +405,7 @@
        "    # Parameters that control the data preprocessing\n",
        "    per_device_train_batch_size=2,\n",
        "    max_completion_length=1024, # default: 256            # Max completion length produced during training\n",
        "    num_generations=2, # 2, # default: 8                  # Number of generations produced during trainig for comparison\n",
        "    num_generations=2, # 2, # default: 8                  # Number of generations produced during training for comparison\n",
        "\n",
        "    fp16=True,\n",
        "\n",
--- a/examples/notebooks/grpo_rnj_1_instruct.ipynb
+++ b/examples/notebooks/grpo_rnj_1_instruct.ipynb
@@ -341,8 +341,8 @@
        "    # Parameters that control the data preprocessing\n",
        "    per_device_train_batch_size=8,\n",
        "    max_completion_length=256, # default: 256             # Max completion length produced during training\n",
        "    num_generations=8, # default: 8                       # Number of generations produced during trainig for comparison\n",
        "    max_prompt_length=512,  # default: 512                # Max prompt lenght of the input prompt used for generation during training\n",
        "    num_generations=8, # default: 8                       # Number of generations produced during training for comparison\n",
        "    max_prompt_length=512,  # default: 512                # Max prompt length of the input prompt used for generation during training\n",
        "\n",
        "    # Parameters related to reporting and saving\n",
        "    output_dir=output_dir,                                # Where to save model checkpoints and logs\n",
--- a/examples/notebooks/sft_ministral3_vl.ipynb
+++ b/examples/notebooks/sft_ministral3_vl.ipynb
@@ -286,7 +286,7 @@
        "id": "bF4GtNO2ne1k"
      },
      "source": [
        "Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use eval dataset to mantain memory usage low but you can configure it."
        "Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use eval dataset to maintain memory usage low but you can configure it."
      ]
    },
    {
--- a/examples/notebooks/sft_qwen_vl.ipynb
+++ b/examples/notebooks/sft_qwen_vl.ipynb
@@ -244,7 +244,7 @@
        "id": "bF4GtNO2ne1k"
      },
      "source": [
        "Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use eval dataset to mantain memory usage low but you can configure it."
        "Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use eval dataset to maintain memory usage low but you can configure it."
      ]
    },
    {
--- a/examples/notebooks/sft_trl_lora_qlora.ipynb
+++ b/examples/notebooks/sft_trl_lora_qlora.ipynb
@@ -429,7 +429,7 @@
        "id": "Gz4ggYeeLWAg"
      },
      "source": [
        "Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use eval dataset to mantain memory usage low but you can configure it."
        "Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use eval dataset to maintain memory usage low but you can configure it."
      ]
    },
    {
--- a/tests/test_data_utils.py
+++ b/tests/test_data_utils.py
@@ -585,7 +585,7 @@ class TestApplyChatTemplate(TrlTestCase):
        # Define test case
        test_case = {
            "prompt": [
                {"content": "Whats the temperature in London?", "role": "user"},
                {"content": "What's the temperature in London?", "role": "user"},
            ]
        }
        # Test with tools
--- a/tests/test_rewards.py
+++ b/tests/test_rewards.py
@@ -113,8 +113,8 @@ class TestAccuracyReward:
        assert rewards[0] == 0.0

    @require_math_latex
    def test_accuracy_reward_unparseable_gold(self):
        """Test accuracy_reward with an unparseable gold solution."""
    def test_accuracy_reward_unparsable_gold(self):
        """Test accuracy_reward with an unparsable gold solution."""
        completion = [
            [{"content": "Answer is forty two."}],
            [{"content": r"Some other content. \boxed{43}."}],
@@ -181,7 +181,7 @@ class TestReasoningAccuracyReward:
        assert rewards[1] == 0.0

    @require_math_latex
    def test_unparseable_gold_solution_yields_none_reward(self):
    def test_unparsable_gold_solution_yields_none_reward(self):
        completions = [
            [{"content": r"<think> Reasoning content </think> \boxed{42}"}],
        ]
--- a/trl/experimental/bco/bco_trainer.py
+++ b/trl/experimental/bco/bco_trainer.py
@@ -41,12 +41,12 @@ from transformers import (
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
    TrainingArguments,
    is_comet_available,
    is_sklearn_available,
    is_wandb_available,
 )
 from transformers.trainer_callback import TrainerCallback
 from transformers.trainer_utils import EvalLoopOutput, has_length
 from transformers.utils import is_peft_available

--- a/trl/experimental/cpo/cpo_trainer.py
+++ b/trl/experimental/cpo/cpo_trainer.py
@@ -38,10 +38,10 @@ from transformers import (
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
    is_comet_available,
    is_wandb_available,
 )
 from transformers.trainer_callback import TrainerCallback
 from transformers.trainer_utils import EvalLoopOutput
 from transformers.utils import is_peft_available, is_torch_fx_proxy

--- a/trl/experimental/gkd/gkd_trainer.py
+++ b/trl/experimental/gkd/gkd_trainer.py
@@ -30,8 +30,8 @@ from transformers import (
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
 )
 from transformers.trainer_callback import TrainerCallback
 from transformers.trainer_utils import EvalPrediction
 from transformers.utils import is_liger_kernel_available, is_peft_available

--- a/trl/experimental/gold/gold_trainer.py
+++ b/trl/experimental/gold/gold_trainer.py
@@ -29,7 +29,7 @@ from accelerate import PartialState
 from accelerate.utils import DistributedType, broadcast_object_list, gather_object, is_peft_model
 from datasets import Dataset, IterableDataset
 from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
 from transformers import AutoTokenizer, is_bitsandbytes_available
 from transformers import AutoTokenizer, TrainerCallback, TrainerControl, TrainerState, is_bitsandbytes_available
 from transformers.data.data_collator import DataCollator
 from transformers.feature_extraction_utils import FeatureExtractionMixin
 from transformers.generation.configuration_utils import GenerationConfig
@@ -38,7 +38,6 @@ from transformers.integrations.integration_utils import is_wandb_available
 from transformers.modeling_utils import PreTrainedModel
 from transformers.processing_utils import ProcessorMixin
 from transformers.tokenization_utils_base import PreTrainedTokenizerBase
 from transformers.trainer_callback import TrainerCallback, TrainerControl, TrainerState
 from transformers.trainer_utils import EvalPrediction
 from transformers.utils import (
    is_flash_attn_2_available,
--- a/trl/experimental/minillm/minillm_trainer.py
+++ b/trl/experimental/minillm/minillm_trainer.py
@@ -18,8 +18,13 @@ import torch
 import torch.nn as nn
 import torch.nn.functional as F
 from datasets import Dataset, IterableDataset
 from transformers import AutoModelForCausalLM, PreTrainedModel, PreTrainedTokenizerBase, ProcessorMixin
 from transformers.trainer_callback import TrainerCallback
 from transformers import (
    AutoModelForCausalLM,
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
 )
 from transformers.utils import is_peft_available

 from ...models import prepare_deepspeed
--- a/trl/experimental/orpo/orpo_trainer.py
+++ b/trl/experimental/orpo/orpo_trainer.py
@@ -38,11 +38,11 @@ from transformers import (
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
    is_comet_available,
    is_torch_xla_available,
    is_wandb_available,
 )
 from transformers.trainer_callback import TrainerCallback
 from transformers.trainer_utils import EvalLoopOutput
 from transformers.utils import is_peft_available, is_torch_fx_proxy

--- a/trl/experimental/prm/prm_trainer.py
+++ b/trl/experimental/prm/prm_trainer.py
@@ -30,8 +30,8 @@ from transformers import (
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
 )
 from transformers.trainer_callback import TrainerCallback
 from transformers.trainer_utils import EvalPrediction
 from transformers.utils import is_peft_available

--- a/trl/trainer/dpo_trainer.py
+++ b/trl/trainer/dpo_trainer.py
@@ -39,6 +39,7 @@ from transformers import (
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
 )
 from transformers.data.data_collator import DataCollatorMixin
 from transformers.integrations import (
@@ -47,7 +48,6 @@ from transformers.integrations import (
    is_wandb_available,
 )
 from transformers.models.auto.modeling_auto import MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
 from transformers.trainer_callback import TrainerCallback
 from transformers.trainer_utils import EvalLoopOutput
 from transformers.utils import is_liger_kernel_available, is_peft_available

@@ -434,7 +434,7 @@ class DPOTrainer(BaseTrainer):
            if args.per_device_train_batch_size == 1:
                logger.warning(
                    "You are using a per_device_train_batch_size of 1 with padding-free training. Using a batch size "
                    "of 1 anihilate the benefits of padding-free training. Please consider increasing the batch size "
                    "of 1 annihilates the benefits of padding-free training. Please consider increasing the batch size "
                    "to at least 2."
                )
        self.padding_free = args.padding_free
--- a/trl/trainer/reward_config.py
+++ b/trl/trainer/reward_config.py
@@ -37,7 +37,7 @@ class RewardConfig(TrainingArguments):
        model_init_kwargs (`dict[str, Any]`, *optional*):
            Keyword arguments for [`~transformers.AutoModelForCausalLM.from_pretrained`], used when the `model`
            argument of the [`RewardTrainer`] is provided as a string. If you're training a MoE architecture and want
            to include the load balancing/auxilliary loss as a part of the final loss, remember to set
            to include the load balancing/auxiliary loss as a part of the final loss, remember to set
            `output_router_logits=True` in this dictionary.
        chat_template_path (`str`, *optional*):
            If specified, sets the model's chat template. This can either be the path to a tokenizer (local directory
@@ -116,7 +116,9 @@ class RewardConfig(TrainingArguments):
        default=None,
        metadata={
            "help": "Keyword arguments for `AutoModelForCausalLM.from_pretrained`, used when the `model` argument of "
            "the `RewardTrainer` is provided as a string."
            "the `RewardTrainer` is provided as a string. If you're training a MoE architecture and want to include "
            "the load balancing/auxiliary loss as a part of the final loss, remember to set "
            "`output_router_logits=True` in this dictionary."
        },
    )
    chat_template_path: str | None = field(
--- a/trl/trainer/reward_trainer.py
+++ b/trl/trainer/reward_trainer.py
@@ -34,9 +34,9 @@ from transformers import (
    DataCollator,
    PreTrainedModel,
    PreTrainedTokenizerBase,
    TrainerCallback,
 )
 from transformers.data.data_collator import DataCollatorMixin
 from transformers.trainer_callback import TrainerCallback
 from transformers.trainer_utils import EvalPrediction
 from transformers.utils import is_peft_available

--- a/trl/trainer/sft_config.py
+++ b/trl/trainer/sft_config.py
@@ -37,7 +37,7 @@ class SFTConfig(TrainingArguments):
        model_init_kwargs (`dict[str, Any]`, *optional*):
            Keyword arguments for [`~transformers.AutoModelForCausalLM.from_pretrained`], used when the `model`
            argument of the [`SFTTrainer`] is provided as a string. If you're training a MoE architecture and want to
            include the load balancing/auxilliary loss as a part of the final loss, remember to set
            include the load balancing/auxiliary loss as a part of the final loss, remember to set
            `output_router_logits=True` in this dictionary.
        chat_template_path (`str`, *optional*):
            If specified, sets the model's chat template. This can either be the path to a tokenizer (local directory
@@ -146,7 +146,7 @@ class SFTConfig(TrainingArguments):
        metadata={
            "help": "Keyword arguments for `AutoModelForCausalLM.from_pretrained`, used when the `model` argument of "
            "the `SFTTrainer` is provided as a string. If you're training a MoE architecture and want to include the "
            "load balancing/auxilliary loss as a part of the final loss, remember to set `output_router_logits=True` "
            "load balancing/auxiliary loss as a part of the final loss, remember to set `output_router_logits=True` "
            "in this dictionary."
        },
    )
--- a/trl/trainer/sft_trainer.py
+++ b/trl/trainer/sft_trainer.py
@@ -508,7 +508,7 @@ class SFTTrainer(BaseTrainer):
              using `<ModelArchitecture>.from_pretrained` (where `<ModelArchitecture>` is derived from the model
              config) with the keyword arguments in `args.model_init_kwargs`.
            - A [`~transformers.PreTrainedModel`] object.
            If you're training a model with an MoE architecture and want to include the load balancing/auxilliary loss
            If you're training a model with an MoE architecture and want to include the load balancing/auxiliary loss
            as a part of the final loss, remember to set the `output_router_logits` config of the model to `True`.
        args ([`SFTConfig`], *optional*):
            Configuration for this trainer. If `None`, a default configuration is used.
@@ -753,7 +753,7 @@ class SFTTrainer(BaseTrainer):
            if args.per_device_train_batch_size == 1 and not args.packing:
                logger.warning(
                    "You are using a per_device_train_batch_size of 1 with padding-free training. Using a batch size "
                    "of 1 anihilate the benefits of padding-free training. Please consider increasing the batch size "
                    "of 1 annihilates the benefits of padding-free training. Please consider increasing the batch size "
                    "to at least 2."
                )
Author	SHA1	Message	Date
Quentin Gallouédec	70b9360292	Merge branch 'main' into push-generation-with-tiny	23 hours ago
Quentin Gallouédec	e5503ea400	Fix typos (#4690 )	1 day ago
Quentin Gallouédec	73a6470f1c	Merge branch 'main' into push-generation-with-tiny	1 day ago
Quentin Gallouédec	3432f7be1d	Import `TrainerCallback` from top-level transformers (#4694 )	1 day ago
Susant	036ae820b3	[docs] Adds GRPO, RSO and LoRA to Paper Index (#4441 ) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>	3 days ago
casinca	9ee39654a9	Docs(`grpo_trainer.md`): Added Qwen SAPO details under `Loss Types` (#4681 ) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>	4 days ago