6 Commits

Author · SHA1 · Message · Date
Quentin Gallouédec · 70b9360292 · Merge branch 'main' into push-generation-with-tiny · 23 hours ago
Quentin Gallouédec · e5503ea400 · Fix typos (#4690) · 1 day ago
Quentin Gallouédec · 73a6470f1c · Merge branch 'main' into push-generation-with-tiny · 1 day ago
Quentin Gallouédec · 3432f7be1d · Import `TrainerCallback` from top-level transformers (#4694) · 1 day ago
Susant · 036ae820b3 · [docs] Adds GRPO, RSO and LoRA to Paper Index (#4441) · 3 days ago
casinca · 9ee39654a9 · Docs(`grpo_trainer.md`): Added Qwen SAPO details under `Loss Types` (#4681) · 4 days ago
24 changed files with 170 additions and 57 deletions
1. docs/source/grpo_trainer.md (+31, -8)
2. docs/source/paper_index.md (+96, -7)
3. docs/source/rloo_trainer.md (+4, -9)
4. examples/datasets/prm800k.py (+1, -1)
5. examples/notebooks/grpo_ministral3_vl.ipynb (+3, -3)
6. examples/notebooks/grpo_qwen3_vl.ipynb (+2, -2)
7. examples/notebooks/grpo_rnj_1_instruct.ipynb (+2, -2)
8. examples/notebooks/sft_ministral3_vl.ipynb (+1, -1)
9. examples/notebooks/sft_qwen_vl.ipynb (+1, -1)
10. examples/notebooks/sft_trl_lora_qlora.ipynb (+1, -1)
11. tests/test_data_utils.py (+1, -1)
12. tests/test_rewards.py (+3, -3)
13. trl/experimental/bco/bco_trainer.py (+1, -1)
14. trl/experimental/cpo/cpo_trainer.py (+1, -1)
15. trl/experimental/gkd/gkd_trainer.py (+1, -1)
16. trl/experimental/gold/gold_trainer.py (+1, -2)
17. trl/experimental/minillm/minillm_trainer.py (+7, -2)
18. trl/experimental/orpo/orpo_trainer.py (+1, -1)
19. trl/experimental/prm/prm_trainer.py (+1, -1)
20. trl/trainer/dpo_trainer.py (+2, -2)
21. trl/trainer/reward_config.py (+4, -2)
22. trl/trainer/reward_trainer.py (+1, -1)
23. trl/trainer/sft_config.py (+2, -2)
24. trl/trainer/sft_trainer.py (+2, -2)

docs/source/grpo_trainer.md (+31, -8)

@@ -137,6 +137,33 @@ $$

This constant is recommended to be the maximum completion length. To use this formulation, set `loss_type="dr_grpo"` in the [`GRPOConfig`].

Alternatively, in the [SAPO paper](https://huggingface.co/papers/2511.20347), the Qwen team proposes replacing the "hard" clipping mechanism of GRPO with a smooth, temperature-controlled soft gating mechanism. While GRPO zeroes out gradients when the policy deviates too far from the old policy, SAPO uses a soft trust region that smoothly decays the gradient weight. This allows the model to retain useful learning signals from "near-on-policy" tokens while suppressing noise from extreme deviations.

The loss function is defined as:

$$
\mathcal{L}_{\text{SAPO}}(\theta) = - \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} f_{i,t} \left( \frac{\pi_\theta(o_{i,t} | q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,<t})} \right) \hat{A}_{i,t}
$$

The soft-gating function \\( f_{i,t} \\) is defined using the sigmoid function \\( \sigma \\) as:

$$
f_{i,t}(x) = \sigma \left( \tau_{i,t} (x - 1) \right) \cdot \frac{4}{\tau_{i,t}}
$$

The temperature \\( \tau_{i,t} \\) is chosen based on the sign of the advantage \\( \hat{A}_{i,t} \\):

$$
\tau_{i,t} = \begin{cases}
\tau_{\text{pos}}, & \text{if } \hat{A}_{i,t} > 0 \\
\tau_{\text{neg}}, & \text{otherwise}
\end{cases}
$$

They recommend using asymmetric temperatures, \\( \tau_{\text{neg}} > \tau_{\text{pos}} \\) (defaults are \\( \tau_{\text{pos}}=1.0, \tau_{\text{neg}}=1.05 \\)). This ensures that the model is penalized more strictly for "bad" actions to prevent instability, while being more permissive with "good" actions.

To use this formulation, set `loss_type="sapo"` in the [`GRPOConfig`].
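Numerically, the gating behavior defined above can be sketched in plain Python (a hypothetical `soft_gate` helper for illustration only, not part of TRL):

```python
import math

def soft_gate(ratio, advantage, tau_pos=1.0, tau_neg=1.05):
    """Soft trust-region weight f(x) = sigmoid(tau * (x - 1)) * 4 / tau,
    with the temperature tau chosen by the sign of the advantage."""
    tau = tau_pos if advantage > 0 else tau_neg
    sig = 1.0 / (1.0 + math.exp(-tau * (ratio - 1.0)))
    return sig * 4.0 / tau

# On-policy tokens (ratio = 1) get weight sigmoid(0) * 4 / tau = 2 / tau,
# and the weight varies smoothly between 0 and 4 / tau instead of being hard-clipped.
print(soft_gate(1.0, advantage=1.0))   # 2.0 with tau_pos = 1.0
print(soft_gate(1.0, advantage=-1.0))  # slightly lower, since tau_neg > tau_pos
```

Note the asymmetry: with \\( \tau_{\text{neg}} > \tau_{\text{pos}} \\), negative-advantage tokens are down-weighted faster as the ratio drifts off-policy.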

## Logged metrics

While training and evaluating, we record the following reward metrics:
@@ -159,14 +186,10 @@ While training and evaluating, we record the following reward metrics:
- `frac_reward_zero_std`: The fraction of samples in the generation batch with a reward std of zero, implying there is little diversity for that prompt (all answers are correct or incorrect).
- `entropy`: Average entropy of token predictions across generated completions. (If `mask_truncated_completions=True`, tokens of masked sequences are excluded.)
- `kl`: The average KL divergence between the model and the reference model, calculated over generated completions. Logged only if `beta` is nonzero.
- `clip_ratio/region_mean`: The ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities where the GRPO objective is clipped to stay within the trust region:
$$
\text{clip}\left( r_{i,t}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \qquad r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}\,.
$$
A higher value means more tokens are clipped, which constrains how much the policy $\pi_\theta$ can change.
- `clip_ratio/low_mean`: The average ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the lower bound of the trust region: \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\)
- `clip_ratio/low_min`: The minimum ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the lower bound of the trust region: \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\)
- `clip_ratio/high_mean`: The average ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the upper bound of the trust region: \\(r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}\\)
- `clip_ratio/region_mean`: The ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities where the GRPO objective is clipped to stay within the trust region: \\( \text{clip}\left( r_{i,t}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \quad r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \\). A higher value means more tokens are clipped, which constrains how much the policy $\pi_\theta$ can change.
- `clip_ratio/low_mean`: The average ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the lower bound of the trust region: \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\).
- `clip_ratio/low_min`: The minimum ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the lower bound of the trust region: \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\).
- `clip_ratio/high_mean`: The average ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the upper bound of the trust region: \\(r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}\\).
- `clip_ratio/high_max`: The maximum ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the upper bound of the trust region: \\(r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}\\).
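As a standalone illustration of how these clip fractions relate to the importance ratios (plain Python, not TRL's actual implementation; the epsilon values and ratios are made up):

```python
def clip_fractions(ratios, eps_low=0.2, eps_high=0.2):
    """Fraction of importance ratios falling outside the trust region
    [1 - eps_low, 1 + eps_high], mirroring the logged metric names above."""
    low = [r < 1 - eps_low for r in ratios]
    high = [r > 1 + eps_high for r in ratios]
    n = len(ratios)
    return {
        "clip_ratio/low_mean": sum(low) / n,
        "clip_ratio/high_mean": sum(high) / n,
        "clip_ratio/region_mean": (sum(low) + sum(high)) / n,
    }

# With eps=0.2 the region is [0.8, 1.2]: 0.7 clips low, 1.4 clips high
print(clip_fractions([0.7, 0.95, 1.0, 1.1, 1.4]))  # 0.2 low, 0.2 high, 0.4 region
```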

## Customization


docs/source/paper_index.md (+96, -7)

@@ -1,11 +1,38 @@
# Paper Index

> [!WARNING]
> Section under construction. Feel free to contribute!
> Section under construction. Feel free to contribute! See https://github.com/huggingface/trl/issues/4407.

## Group Relative Policy Optimization

Papers relating to the [`GRPOTrainer`]
Papers relating to the [`GRPOTrainer`].

### DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

**📜 Paper**: https://huggingface.co/papers/2402.03300

Introduces Group Relative Policy Optimization (GRPO) and shows strong math-reasoning gains from math-centric pretraining plus group-relative PPO-style optimization. Used in TRL via [`GRPOTrainer`].

```python
from trl import GRPOConfig, GRPOTrainer

# The paper doesn't specify its hyperparameters, so here we provide hyperparameters
# from "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning" instead.
training_args = GRPOConfig(
    loss_type="grpo",
    beta=0.001,  # "the KL coefficient to 0.001"
    epsilon=10.0,  # "the GRPO clip ratio ϵ to 10"
    num_generations=16,  # "For each question, we sample 16 outputs..."
    max_completion_length=32_768,  # "...with a maximum length of 32,768"
    steps_per_generation=16,  # "To accelerate training, each rollout generates 8,192 outputs, which are randomly split into 16 minibatches"
    # "resulting in a training batch size of 512". One way to achieve this setting with 1 device is
    # per_device_train_batch_size=4, gradient_accumulation_steps=128
    per_device_train_batch_size=4,
    gradient_accumulation_steps=128,
)
trainer = GRPOTrainer(
    ...,
    args=training_args,
)
```
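The batch-size comments above can be sanity-checked with simple arithmetic (assuming a single training device):

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 128
num_devices = 1  # assumption: single device, matching the comment above

# Effective training batch size: 4 * 128 * 1 = 512
train_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
assert train_batch_size == 512  # "resulting in a training batch size of 512"

# With steps_per_generation=16, one rollout feeds 16 minibatches of 512 outputs
steps_per_generation = 16
assert train_batch_size * steps_per_generation == 8192  # "each rollout generates 8,192 outputs"
```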

### Group Sequence Policy Optimization

@@ -86,7 +113,7 @@ training_args = GRPOConfig(
per_device_train_batch_size=512, # mini-batch size for training in the paper, DAPO paper: section 4.1
num_generations=16, # number of sample responses in the paper, DAPO paper: section 4.1
max_completion_length=20480, # maximum number of tokens for generation in the paper, DAPO paper: section 4.1
beta=0.0 # section 2.3, DAPO paper
beta=0.0, # section 2.3, DAPO paper

)
# Soft Overlong Punishment
@@ -411,16 +438,16 @@ from trl import GRPOConfig

training_args = GRPOConfig(
...,
beta=0.001, # the paper don't specify the value used, so we use the value from "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning"
beta=0.001, # the paper doesn't specify the value used, so we use the value from "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning"
use_bias_correction_kl=True,
)
```

## Direct Policy Optimization

Papers relating to the [`DPOTrainer`]
Papers relating to the [`DPOTrainer`].

### Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model
### Direct Preference Optimization: Your Language Model is Secretly a Reward Model

**📜 Paper**: https://huggingface.co/papers/2305.18290

@@ -441,7 +468,7 @@ training_args = DPOConfig(

**📜 Paper**: https://huggingface.co/papers/2310.12036

A new general objective, \\( \Psi \\)$PO, bypasses both key approximations in reinforcement learning from human preferences, allowing for theoretical analysis and empirical superiority over DPO. To reproduce the paper's setting, use this configuration: To reproduce the paper's setting, use this configuration:
A new general objective, \\( \Psi \\)PO, bypasses both key approximations in reinforcement learning from human preferences, allowing for theoretical analysis and empirical superiority over DPO. To reproduce the paper's setting, use this configuration:

```python
from trl import DPOConfig
@@ -641,6 +668,46 @@ training_args = DPOConfig(

These parameters only appear in the [published version](https://aclanthology.org/2025.tacl-1.22.pdf).

### Statistical Rejection Sampling Improves Preference Optimization

**📜 Paper**: https://huggingface.co/papers/2309.06657

Proposes **RSO**, selecting stronger preference pairs via statistical rejection sampling to boost offline preference optimization; complements DPO/SLiC. They also introduce a new loss defined as:

$$
\mathcal{L}_{\text{hinge-norm}}(\pi_\theta)
= \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\left[
\max\left(0,\; 1 - \left[\gamma \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \gamma \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right]\right)
\right]
$$

To train with RSO-filtered data and the hinge-norm loss, you can use the following code:

```python
from trl import DPOConfig, DPOTrainer

dataset = ...

def rso_accept(example):  # replace with your actual filter/score logic
    return example["rso_keep"]

train_dataset = dataset.filter(rso_accept)

training_args = DPOConfig(
    loss_type="hinge",
    beta=0.05,  # corresponds to gamma in the paper
)

trainer = DPOTrainer(
    ...,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```

## Kahneman–Tversky Optimization

Papers relating to the [`experimental.kto.KTOTrainer`]
@@ -721,6 +788,26 @@ SFTConfig(
)
```

## Parameter-Efficient Fine-Tuning (PEFT)

For general details on using PEFT with TRL, please refer to the [PEFT Integration](peft_integration) guide.

### LoRA: Low-Rank Adaptation of Large Language Models

**📜 Paper**: https://huggingface.co/papers/2106.09685

Low-Rank Adaptation (LoRA) reduces the number of trainable parameters and GPU memory usage in large-scale pre-trained models while maintaining or improving performance on downstream tasks. TRL integrates LoRA via the [PEFT library](https://huggingface.co/docs/peft/index) and can be easily enabled in any TRL trainer by passing a [`~peft.LoraConfig`] to the `peft_config` argument. Here is an example of using LoRA with the [`SFTTrainer`]:

```python
from peft import LoraConfig
from trl import SFTTrainer

trainer = SFTTrainer(
    ...,
    peft_config=LoraConfig(),
)
```

## Reinforce Leave-One-Out

Papers relating to the [`RLOOTrainer`]
@@ -818,9 +905,11 @@ dataset = dataset.map(add_margin)
```

## Distillation

Papers relating to training a student model with the help of a teacher model.

### On-Policy Distillation

**📰 Blog**: https://thinkingmachines.ai/blog/on-policy-distillation/

On-Policy Distillation involves a student model generating rollouts for each batch of training data. We subsequently obtain the probability distributions for each token of the rollouts from both the student and teacher models. The student model is then optimized to minimize the reverse Kullback-Leibler (KL) divergence between its own token distributions and those of the teacher model.
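As a toy illustration of the per-token objective (the KL divergence KL(student ‖ teacher) over the vocabulary at a single rollout position), here is a hand-rolled sketch in plain Python, not the trainer's actual code:

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher): penalizes the student for placing probability
    mass where the teacher assigns little."""
    return sum(s * math.log(s / t) for s, t in zip(student_probs, teacher_probs) if s > 0)

# Toy 3-token vocabulary at a single rollout position
student = [0.7, 0.2, 0.1]
teacher = [0.6, 0.3, 0.1]
print(reverse_kl(student, teacher))  # small positive value (~0.027)
print(reverse_kl(teacher, teacher))  # 0.0 when the distributions match
```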


docs/source/rloo_trainer.md (+4, -9)

@@ -143,15 +143,10 @@ While training and evaluating, we record the following reward metrics:
- `frac_reward_zero_std`: The fraction of samples in the generation batch with a reward std of zero, implying there is little diversity for that prompt (all answers are correct or incorrect).
- `entropy`: Average entropy of token predictions across generated completions. (If `mask_truncated_completions=True`, tokens of masked sequences are excluded.)
- `kl`: The average KL divergence between the model and the reference model, calculated over generated completions. Logged only if `beta` is nonzero.
- `clip_ratio/region_mean`: The ratio of sequence probabilities where the RLOO objective is clipped to stay within the trust region:
$$
\text{clip}\left( r_{i}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \qquad r_{i}(\theta) = \frac{\pi_\theta(o_{i} \mid q)}{\pi_{\theta_{\text{old}}}(o_{i} \mid q)}\,.
$$

A higher value means more samples are clipped, which constrains how much the policy $\pi_\theta$ can change.
- `clip_ratio/low_mean`: The average ratio of sequence probabilities that were clipped on the lower bound of the trust region: \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\)
- `clip_ratio/low_min`: The minimum ratio of sequence probabilities that were clipped on the lower bound of the trust region: \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\)
- `clip_ratio/high_mean`: The average ratio of sequence probabilities that were clipped on the upper bound of the trust region: \\(r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}\\)
- `clip_ratio/region_mean`: The ratio of sequence probabilities where the RLOO objective is clipped to stay within the trust region: \\( \text{clip}\left( r_{i}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \quad r_{i}(\theta) = \frac{\pi_\theta(o_{i} \mid q)}{\pi_{\theta_{\text{old}}}(o_{i} \mid q)} \\). A higher value means more samples are clipped, which constrains how much the policy $\pi_\theta$ can change.
- `clip_ratio/low_mean`: The average ratio of sequence probabilities that were clipped on the lower bound of the trust region: \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\).
- `clip_ratio/low_min`: The minimum ratio of sequence probabilities that were clipped on the lower bound of the trust region: \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\).
- `clip_ratio/high_mean`: The average ratio of sequence probabilities that were clipped on the upper bound of the trust region: \\(r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}\\).
- `clip_ratio/high_max`: The maximum ratio of sequence probabilities that were clipped on the upper bound of the trust region: \\(r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}\\).

## Customization


examples/datasets/prm800k.py (+1, -1)

@@ -68,7 +68,7 @@ def process_example(example):
labels = previous_labels[:] + [label]
outputs.append({"prompt": prompt, "completions": completions, "labels": labels})

# Now, exapand the previous completions and labels
# Now, expand the previous completions and labels
if step["chosen_completion"] is not None:
chosen_completion = step["completions"][step["chosen_completion"]]
label = chosen_completion["rating"] == 1


examples/notebooks/grpo_ministral3_vl.ipynb (+3, -3)

@@ -381,7 +381,7 @@
" extraction_config=[LatexExtractionConfig()],\n",
" )\n",
" if len(gold_parsed) == 0:\n",
" # Skip unparseable examples\n",
" # Skip unparsable examples\n",
" correctness.append(True) # Treat as correct to avoid penalizing\n",
" print(\"Failed to parse gold solution: \", sol)\n",
" continue\n",
@@ -459,8 +459,8 @@
" # Parameters that control the data preprocessing\n",
" per_device_train_batch_size=2,\n",
" max_completion_length=1024, # default: 256 # Max completion length produced during training\n",
" num_generations=2, # 2, # default: 8 # Number of generations produced during trainig for comparison\n",
" max_prompt_length=2048, # default: 512 # Max prompt lenght of the input prompt used for generation during training\n",
" num_generations=2, # 2, # default: 8 # Number of generations produced during training for comparison\n",
" max_prompt_length=2048, # default: 512 # Max prompt length of the input prompt used for generation during training\n",
"\n",
" fp16=False,\n",
" bf16=False,\n",


examples/notebooks/grpo_qwen3_vl.ipynb (+2, -2)

@@ -327,7 +327,7 @@
" extraction_config=[LatexExtractionConfig()],\n",
" )\n",
" if len(gold_parsed) == 0:\n",
" # Skip unparseable examples\n",
" # Skip unparsable examples\n",
" correctness.append(True) # Treat as correct to avoid penalizing\n",
" print(\"Failed to parse gold solution: \", sol)\n",
" continue\n",
@@ -405,7 +405,7 @@
" # Parameters that control the data preprocessing\n",
" per_device_train_batch_size=2,\n",
" max_completion_length=1024, # default: 256 # Max completion length produced during training\n",
" num_generations=2, # 2, # default: 8 # Number of generations produced during trainig for comparison\n",
" num_generations=2, # 2, # default: 8 # Number of generations produced during training for comparison\n",
"\n",
" fp16=True,\n",
"\n",


examples/notebooks/grpo_rnj_1_instruct.ipynb (+2, -2)

@@ -341,8 +341,8 @@
" # Parameters that control the data preprocessing\n",
" per_device_train_batch_size=8,\n",
" max_completion_length=256, # default: 256 # Max completion length produced during training\n",
" num_generations=8, # default: 8 # Number of generations produced during trainig for comparison\n",
" max_prompt_length=512, # default: 512 # Max prompt lenght of the input prompt used for generation during training\n",
" num_generations=8, # default: 8 # Number of generations produced during training for comparison\n",
" max_prompt_length=512, # default: 512 # Max prompt length of the input prompt used for generation during training\n",
"\n",
" # Parameters related to reporting and saving\n",
" output_dir=output_dir, # Where to save model checkpoints and logs\n",


examples/notebooks/sft_ministral3_vl.ipynb (+1, -1)

@@ -286,7 +286,7 @@
"id": "bF4GtNO2ne1k"
},
"source": [
"Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use eval dataset to mantain memory usage low but you can configure it."
"Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use an eval dataset to keep memory usage low, but you can configure one."
]
},
{


examples/notebooks/sft_qwen_vl.ipynb (+1, -1)

@@ -244,7 +244,7 @@
"id": "bF4GtNO2ne1k"
},
"source": [
"Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use eval dataset to mantain memory usage low but you can configure it."
"Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use an eval dataset to keep memory usage low, but you can configure one."
]
},
{


examples/notebooks/sft_trl_lora_qlora.ipynb (+1, -1)

@@ -429,7 +429,7 @@
"id": "Gz4ggYeeLWAg"
},
"source": [
"Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use eval dataset to mantain memory usage low but you can configure it."
"Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use an eval dataset to keep memory usage low, but you can configure one."
]
},
{


tests/test_data_utils.py (+1, -1)

@@ -585,7 +585,7 @@ class TestApplyChatTemplate(TrlTestCase):
# Define test case
test_case = {
"prompt": [
{"content": "Whats the temperature in London?", "role": "user"},
{"content": "What's the temperature in London?", "role": "user"},
]
}
# Test with tools


tests/test_rewards.py (+3, -3)

@@ -113,8 +113,8 @@ class TestAccuracyReward:
assert rewards[0] == 0.0

@require_math_latex
def test_accuracy_reward_unparseable_gold(self):
"""Test accuracy_reward with an unparseable gold solution."""
def test_accuracy_reward_unparsable_gold(self):
"""Test accuracy_reward with an unparsable gold solution."""
completion = [
[{"content": "Answer is forty two."}],
[{"content": r"Some other content. \boxed{43}."}],
@@ -181,7 +181,7 @@ class TestReasoningAccuracyReward:
assert rewards[1] == 0.0

@require_math_latex
def test_unparseable_gold_solution_yields_none_reward(self):
def test_unparsable_gold_solution_yields_none_reward(self):
completions = [
[{"content": r"<think> Reasoning content </think> \boxed{42}"}],
]


trl/experimental/bco/bco_trainer.py (+1, -1)

@@ -41,12 +41,12 @@ from transformers import (
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
    TrainingArguments,
    is_comet_available,
    is_sklearn_available,
    is_wandb_available,
)
from transformers.trainer_callback import TrainerCallback
from transformers.trainer_utils import EvalLoopOutput, has_length
from transformers.utils import is_peft_available



trl/experimental/cpo/cpo_trainer.py (+1, -1)

@@ -38,10 +38,10 @@ from transformers import (
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
    is_comet_available,
    is_wandb_available,
)
from transformers.trainer_callback import TrainerCallback
from transformers.trainer_utils import EvalLoopOutput
from transformers.utils import is_peft_available, is_torch_fx_proxy



trl/experimental/gkd/gkd_trainer.py (+1, -1)

@@ -30,8 +30,8 @@ from transformers import (
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
)
from transformers.trainer_callback import TrainerCallback
from transformers.trainer_utils import EvalPrediction
from transformers.utils import is_liger_kernel_available, is_peft_available



trl/experimental/gold/gold_trainer.py (+1, -2)

@@ -29,7 +29,7 @@ from accelerate import PartialState
from accelerate.utils import DistributedType, broadcast_object_list, gather_object, is_peft_model
from datasets import Dataset, IterableDataset
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoTokenizer, is_bitsandbytes_available
from transformers import AutoTokenizer, TrainerCallback, TrainerControl, TrainerState, is_bitsandbytes_available
from transformers.data.data_collator import DataCollator
from transformers.feature_extraction_utils import FeatureExtractionMixin
from transformers.generation.configuration_utils import GenerationConfig
@@ -38,7 +38,6 @@ from transformers.integrations.integration_utils import is_wandb_available
from transformers.modeling_utils import PreTrainedModel
from transformers.processing_utils import ProcessorMixin
from transformers.tokenization_utils_base import PreTrainedTokenizerBase
from transformers.trainer_callback import TrainerCallback, TrainerControl, TrainerState
from transformers.trainer_utils import EvalPrediction
from transformers.utils import (
is_flash_attn_2_available,


trl/experimental/minillm/minillm_trainer.py (+7, -2)

@@ -18,8 +18,13 @@ import torch
import torch.nn as nn
import torch.nn.functional as F
from datasets import Dataset, IterableDataset
from transformers import AutoModelForCausalLM, PreTrainedModel, PreTrainedTokenizerBase, ProcessorMixin
from transformers.trainer_callback import TrainerCallback
from transformers import (
    AutoModelForCausalLM,
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
)
from transformers.utils import is_peft_available

from ...models import prepare_deepspeed


trl/experimental/orpo/orpo_trainer.py (+1, -1)

@@ -38,11 +38,11 @@ from transformers import (
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
    is_comet_available,
    is_torch_xla_available,
    is_wandb_available,
)
from transformers.trainer_callback import TrainerCallback
from transformers.trainer_utils import EvalLoopOutput
from transformers.utils import is_peft_available, is_torch_fx_proxy



trl/experimental/prm/prm_trainer.py (+1, -1)

@@ -30,8 +30,8 @@ from transformers import (
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
)
from transformers.trainer_callback import TrainerCallback
from transformers.trainer_utils import EvalPrediction
from transformers.utils import is_peft_available



trl/trainer/dpo_trainer.py (+2, -2)

@@ -39,6 +39,7 @@ from transformers import (
    PreTrainedModel,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    TrainerCallback,
)
from transformers.data.data_collator import DataCollatorMixin
from transformers.integrations import (
@@ -47,7 +48,6 @@ from transformers.integrations import (
is_wandb_available,
)
from transformers.models.auto.modeling_auto import MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
from transformers.trainer_callback import TrainerCallback
from transformers.trainer_utils import EvalLoopOutput
from transformers.utils import is_liger_kernel_available, is_peft_available

@@ -434,7 +434,7 @@ class DPOTrainer(BaseTrainer):
if args.per_device_train_batch_size == 1:
logger.warning(
"You are using a per_device_train_batch_size of 1 with padding-free training. Using a batch size "
"of 1 anihilate the benefits of padding-free training. Please consider increasing the batch size "
"of 1 annihilates the benefits of padding-free training. Please consider increasing the batch size "
"to at least 2."
)
self.padding_free = args.padding_free


trl/trainer/reward_config.py (+4, -2)

@@ -37,7 +37,7 @@ class RewardConfig(TrainingArguments):
model_init_kwargs (`dict[str, Any]`, *optional*):
Keyword arguments for [`~transformers.AutoModelForCausalLM.from_pretrained`], used when the `model`
argument of the [`RewardTrainer`] is provided as a string. If you're training a MoE architecture and want
to include the load balancing/auxilliary loss as a part of the final loss, remember to set
to include the load balancing/auxiliary loss as a part of the final loss, remember to set
`output_router_logits=True` in this dictionary.
chat_template_path (`str`, *optional*):
If specified, sets the model's chat template. This can either be the path to a tokenizer (local directory
@@ -116,7 +116,9 @@ class RewardConfig(TrainingArguments):
default=None,
metadata={
"help": "Keyword arguments for `AutoModelForCausalLM.from_pretrained`, used when the `model` argument of "
"the `RewardTrainer` is provided as a string."
"the `RewardTrainer` is provided as a string. If you're training a MoE architecture and want to include "
"the load balancing/auxiliary loss as a part of the final loss, remember to set "
"`output_router_logits=True` in this dictionary."
},
)
chat_template_path: str | None = field(


trl/trainer/reward_trainer.py (+1, -1)

@@ -34,9 +34,9 @@ from transformers import (
    DataCollator,
    PreTrainedModel,
    PreTrainedTokenizerBase,
    TrainerCallback,
)
from transformers.data.data_collator import DataCollatorMixin
from transformers.trainer_callback import TrainerCallback
from transformers.trainer_utils import EvalPrediction
from transformers.utils import is_peft_available



trl/trainer/sft_config.py (+2, -2)

@@ -37,7 +37,7 @@ class SFTConfig(TrainingArguments):
model_init_kwargs (`dict[str, Any]`, *optional*):
Keyword arguments for [`~transformers.AutoModelForCausalLM.from_pretrained`], used when the `model`
argument of the [`SFTTrainer`] is provided as a string. If you're training a MoE architecture and want to
include the load balancing/auxilliary loss as a part of the final loss, remember to set
include the load balancing/auxiliary loss as a part of the final loss, remember to set
`output_router_logits=True` in this dictionary.
chat_template_path (`str`, *optional*):
If specified, sets the model's chat template. This can either be the path to a tokenizer (local directory
@@ -146,7 +146,7 @@ class SFTConfig(TrainingArguments):
metadata={
"help": "Keyword arguments for `AutoModelForCausalLM.from_pretrained`, used when the `model` argument of "
"the `SFTTrainer` is provided as a string. If you're training a MoE architecture and want to include the "
"load balancing/auxilliary loss as a part of the final loss, remember to set `output_router_logits=True` "
"load balancing/auxiliary loss as a part of the final loss, remember to set `output_router_logits=True` "
"in this dictionary."
},
)


trl/trainer/sft_trainer.py (+2, -2)

@@ -508,7 +508,7 @@ class SFTTrainer(BaseTrainer):
using `<ModelArchitecture>.from_pretrained` (where `<ModelArchitecture>` is derived from the model
config) with the keyword arguments in `args.model_init_kwargs`.
- A [`~transformers.PreTrainedModel`] object.
If you're training a model with an MoE architecture and want to include the load balancing/auxilliary loss
If you're training a model with an MoE architecture and want to include the load balancing/auxiliary loss
as a part of the final loss, remember to set the `output_router_logits` config of the model to `True`.
args ([`SFTConfig`], *optional*):
Configuration for this trainer. If `None`, a default configuration is used.
@@ -753,7 +753,7 @@ class SFTTrainer(BaseTrainer):
if args.per_device_train_batch_size == 1 and not args.packing:
logger.warning(
"You are using a per_device_train_batch_size of 1 with padding-free training. Using a batch size "
"of 1 anihilate the benefits of padding-free training. Please consider increasing the batch size "
"of 1 annihilates the benefits of padding-free training. Please consider increasing the batch size "
"to at least 2."
)


