* apply_rotary_pos_emb should be called
* fix position_embeddings usage in granitemoehybrid
* setting `self.rotary_emb` to None only in hybrid models. Safer, since all modules are highly modular.
* minor
* adding `position_embedding_type` to the config.
* review cleanup
* modeling too
* rewrite conditionally applying rope
* resolve rotary_emb issue
no need to gather all parameters to save model when using deepspeed and LoRA
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Ferdinand Mom <47445085+3outeille@users.noreply.github.com>
* Fix BLT training_ci overfit test by disabling cache and adjusting training thresholds
* Fix BLT training_ci overfit test by disabling cache and adjusting training thresholds
* Fix BLT training_ci overfit test by disabling cache and adjusting training thresholds
* Format BLT tests with ruff
* Fix BLT training CI with custom weight initialization and overfit test
* Fix BLT training CI with custom weight initialization and overfit test
* Fix BLT training CI with custom weight initialization and overfit test
* Fix BLT training CI with custom weight initialization and overfit test
* Fix BLT training CI with custom weight initialization and overfit test
* Fix BLT training CI with custom weight initialization and overfit test
* Update BLT init logic and adjust repo checks for non-functional model wrappers
* Fix repo/config checks by marking BLT Text/Vision models as placeholders
* Fix repo/config checks by marking BLT Text/Vision models as placeholders
* Fix repo/config checks by marking BLT Text/Vision models as placeholders
* Document BLT weight initialization sources and restore default overfit thresholds
* Align BLT weight init with nn.init
* Fix BLT init weights and remove modular conversion issues
* fixes circle ci failures
* fix
* fix
* fix recurrent_gemma overfit generation with cache
* Fix recurrent_gemma overfit generation with cache
* rerun circleci
* rerun circleci
* Log RecurrentGemma cache exception in training mixin
* ci: rerun
* ci: rerun
* ci: rerun
---------
Co-authored-by: Ferdinand Mom <47445085+3outeille@users.noreply.github.com>
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Overview
Transformers provides multiple inference optimization techniques to make models fast, affordable, and accessible. Options include alternative attention mechanisms for reduced memory traffic, code compilation for faster execution, and optimized kernels for throughput. Stack these techniques for maximum performance.
> [!NOTE]
> Memory and speed are closely related but not the same. Shrinking your memory footprint makes a model "faster" because there is less data to move around. Pure speed optimizations don't always reduce memory and sometimes increase usage. Choose the appropriate optimization based on your use case and hardware.
This guide gives you a quick start on Transformers inference optimizations. Use the sections below to pick the technique that fits your use case and hardware.
## Compilation
[torch.compile](./perf_torch_compile) reduces Python overhead, fuses operations, and creates kernels tuned for your shapes and hardware. The first run warms it up and subsequent runs use the faster compiled path.
Pass a [fixed-size cache](./kv_cache#fixed-size-cache) to [`~GenerationMixin.generate`] to trigger `torch.compile` automatically.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", device_map="auto")

inputs = tokenizer("The future of AI is", return_tensors="pt").to(model.device)
# a fixed-size ("static") cache keeps shapes stable so generate can use the compiled path
outputs = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
```

> Avoid calling `torch.compile(model)` outside of [`~GenerationMixin.generate`] to prevent the model from recompiling every step.
## Attention backends
Alternative [attention backends](./attention_interface) lower memory traffic. For example, FlashAttention tiles attention computations and avoids large intermediate tensors to reduce memory footprint.
Set `attn_implementation` in [`~PreTrainedModel.from_pretrained`] to load an optimized attention backend.
```py
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", attn_implementation="flash_attention_2")
```
## Kernels
Kernels fuse operations to boost throughput and reduce memory usage. The [Kernels](https://huggingface.co/docs/kernels/en/index) library loads optimized compute kernels from the [Hub](https://huggingface.co/kernels-community) in a flexible and version-safe way.
The example below loads an optimized FlashAttention-2 kernel without installing the package.
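As a sketch, passing a Hub kernel reference to `attn_implementation` fetches the kernel at load time (this assumes the `kernels` package is installed and a compatible accelerator is available):

```python
from transformers import AutoModelForCausalLM

# the "org/repo" reference is resolved against the kernels-community Hub org,
# so the flash-attn package itself does not need to be installed locally
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    attn_implementation="kernels-community/flash-attn",
)
```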
## Quantization

[Quantization](./quantization/overview) shrinks each parameter, which lowers the memory footprint and speeds up inference because there is less data to move.

Pass a quantization config to the `quantization_config` argument in [`~PreTrainedModel.from_pretrained`]. Each quantization backend has its own config with its own arguments. The example below quantizes a model to 4 bits and configures the compute dtype with the [bitsandbytes](./quantization/bitsandbytes) backend.
```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", quantization_config=quantization_config)
```
## Caching

[Caching](./kv_cache) speeds up generation by reusing past keys and values instead of recomputing them for every token. To reduce the memory cost of storing past keys and values, Transformers supports offloading the cache to the CPU so that only the current layer's cache stays on the GPU.

Use the `cache_implementation` argument in [`~GenerationMixin.generate`] to set a cache strategy.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", device_map="auto")

inputs = tokenizer("The future of AI is", return_tensors="pt").to(model.device)
# offload past keys and values to the CPU; only the current layer's cache stays on the GPU
outputs = model.generate(**inputs, max_new_tokens=32, cache_implementation="offloaded")
```

## Parallelism

[Parallelism](./perf_infer_gpu_multi) distributes a model across devices so models too big for a single device can still run quickly. The trade-off is higher total memory use, due to sharding overhead and the communication needed to sync results.
[Tensor parallelism](./perf_infer_gpu_multi) splits a model layer across devices. Set `tp_plan="auto"` in [`~PreTrainedModel.from_pretrained`] to enable it.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", tp_plan="auto")
print(model._tp_plan)
```
## Continuous batching
[Continuous batching](./continuous_batching) maximizes throughput by keeping the GPU busy with dynamic scheduling and chunked prefill. [Serving](./serving) applications use it to process multiple incoming requests concurrently.
Use [`~ContinuousMixin.generate_batch`] to enable continuous batching.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# generate_batch schedules each request dynamically instead of padding them into one static batch
inputs = [tokenizer(prompt).input_ids for prompt in ["Hello, my name is", "The capital of France is"]]
outputs = model.generate_batch(inputs=inputs, generation_config=GenerationConfig(max_new_tokens=32))
```
@@ -105,6 +127,7 @@ class KernelConfig(PushToHubMixin):
2. Each kernel value is either a string of the form 'org/repo:layer_name' or a dict mapping device types ("cuda", "rocm", "xpu", "npu") to such strings.
3. Each device key in a dict is one of "cuda", "rocm", "xpu", or "npu".
4. Each repo_name is a valid repository and layer name in the format 'org/repo:layer_name' (i.e., a string containing both a slash and a colon).
5. If a local path is detected, it should be in the format '/abs/path:layer_name'. The absolute path must include the `package_name`, like "/home/user/layer_norm".
Args:
model: The model instance whose modules are checked for registered kernel_layer_name attributes.
@@ -114,14 +137,13 @@ class KernelConfig(PushToHubMixin):
or if a repo_name is not a valid 'org/repo:layer_name' string.
"""
MAPPING_FORMAT = """
For a single device, from a remote repo
{
"RMSNorm":
"kernels-community/layer_norm:LlamaRMSNorm",
...
},
or
For multiple devices, from a remote repo
{
"RMSNorm": {
"cuda":
@@ -132,6 +154,23 @@ class KernelConfig(PushToHubMixin):
},
...
}
For a single device, from a local path
{
"RMSNorm":
"/abs/path:LlamaRMSNorm",
...
},
For multiple devices, from a local path
{
"RMSNorm": {
"cuda":
"/abs/path:LlamaRMSNorm",
"rocm":
"/abs/path:LlamaRMSNorm",
...
},
...
}
"""
self.store_registered_layer_names(model)
# Validate that the kernel mapping is a dict
@@ -149,7 +188,7 @@ class KernelConfig(PushToHubMixin):
if isinstance(kernel, str):
if "/" not in kernel or ":" not in kernel:
raise ValueError(
f"Kernel mapping for '{layer_name}' must be a valid repo name with a layer name (e.g., 'org/repo:layer_name' or '/abs/path:layer_name'), got: {kernel}"
)
elif isinstance(kernel, dict):
@@ -159,9 +198,8 @@ class KernelConfig(PushToHubMixin):
if not isinstance(repo_name, str) or "/" not in repo_name or ":" not in repo_name:
raise ValueError(
f"Kernel mapping for '{layer_name}' must be a valid repo name with a layer name (e.g., 'org/repo:layer_name' or '/abs/path:layer_name'), got: {repo_name}"
)
else:
raise ValueError(f"Kernel mapping must follow the format: {MAPPING_FORMAT}, got: {kernel}")
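As a self-contained illustration of the string format these checks enforce (a hypothetical standalone helper, not part of `KernelConfig`):

```python
def split_kernel_ref(ref: str) -> tuple[str, str]:
    """Split 'org/repo:layer_name' or '/abs/path:layer_name' into (source, layer_name)."""
    if not isinstance(ref, str) or "/" not in ref or ":" not in ref:
        raise ValueError(
            f"must be 'org/repo:layer_name' or '/abs/path:layer_name', got: {ref!r}"
        )
    # rpartition splits on the last ':' so Windows-style or nested colons stay in the source
    source, _, layer_name = ref.rpartition(":")
    return source, layer_name

# Remote repo form
assert split_kernel_ref("kernels-community/layer_norm:LlamaRMSNorm") == (
    "kernels-community/layer_norm",
    "LlamaRMSNorm",
)
# Local path form
assert split_kernel_ref("/home/user/liger_kernels:LigerRMSNorm") == (
    "/home/user/liger_kernels",
    "LigerRMSNorm",
)
```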
@@ -174,18 +212,13 @@ class KernelConfig(PushToHubMixin):
...
},
or
or for local path:
{
"RMSNorm": {
"cuda":
"kernels-community/layer_norm:LlamaRMSNorm",
"rocm":
"kernels-community/layer_norm:LlamaRMSNorm",
...
},
"RMSNorm":
"/home/user/liger_kernels:LigerRMSNorm",
...
}
},
into a nested mapping:
@@ -200,6 +233,20 @@ class KernelConfig(PushToHubMixin):
}
}
or for local path:
{
"RMSNorm": {
"cuda": {
Mode.INFERENCE: LocalLayerRepository(
repo_path=Path("/home/user/liger_kernels"),
package_name="liger_kernels",
layer_name="LigerRMSNorm",
)
}
}
}
that's compatible with the kernels library.
The device is inferred from the model's parameters if not provided.
@@ -217,11 +264,17 @@ class KernelConfig(PushToHubMixin):
@require_torch_large_accelerator(memory=48) # Tested on A100 but requires around 48GiB
@require_bitsandbytes
@slow
def test_model_21b_a3b_generation(self):
EXPECTED_TEXT_COMPLETION = "User: Hey, are you conscious? Can you talk to me?\nAssistant: \nI don't have consciousness in the way humans do. I don't feel emotions, have thoughts, or experience awareness. However, I'm"  # fmt: skip
text = tokenizer.decode(generated_ids[0], skip_special_tokens=True).strip("\n")
self.assertEqual(EXPECTED_TEXT_COMPLETION, text)
def test_shortened_model_generation(self):
# This is gibberish, which is expected since the model is just the first x layers of the original 28B model
EXPECTED_TEXT_COMPLETION = 'User: Hey, are you conscious? Can you talk to me?\nAssistant: 不了的 tongues说话 dagat绵席裹着头phones<mask:11>odikèkèk<mask:11><mask:11>bun褶席席地说起来这么说的话的话retti upside upsideolate疡疡疡' # fmt: skip