* simplify using custom resolution for sam3 and sam3_video inference
* revert auto format
* use setters and properties
* Fix docstring
* update dict to correctly save image_size to file for backward compatibility
* Fix Apertus model crash on float16 hardware
Initialize the XIELU activation with the correct dtype from the config (config.dtype instead of the default bfloat16) to prevent promotion to float32 and subsequent crashes on Turing/float16 GPUs.
* refactor: Move `ACT2CLS` import to top-level in Apertus models.
* remove null values from saved preprocessor file for fast image processor
* preserve explicit None values != class default
* Fix flava test
* extend to video processor
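The two commits above ("remove null values" but "preserve explicit None values != class default") describe one serialization rule. A hedged sketch, with an invented `CLASS_DEFAULTS` table standing in for the processor class attributes:

```python
# Illustrative defaults table; the real values live on the processor class.
CLASS_DEFAULTS = {
    "size": {"height": 224, "width": 224},  # non-None default
    "crop_size": None,                      # None default
    "resample": 2,
}


def to_serializable(attrs):
    """Drop keys whose value is None *and* whose class default is also None.

    An explicit None that differs from a non-None class default carries
    information and must survive in the saved file.
    """
    out = {}
    for key, value in attrs.items():
        if value is None and CLASS_DEFAULTS.get(key) is None:
            continue  # redundant null, omit from the saved file
        out[key] = value
    return out


saved = to_serializable({"size": None, "crop_size": None, "resample": 3})
```

Here `"size": None` survives because the class default is non-None, while `"crop_size"` is dropped as a redundant null.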
* CB example: more args
* Remove useless sync
* Better new tokens, and no more BS1 on outputs
* Add dynamic to compile to avoid many graphs
* Sort prefix to maximize cache hits
* More robust ways to retrieve results in test
* Style
* Update src/transformers/generation/continuous_batching/continuous_api.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
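The "sort prefix to maximize cache hits" idea from the continuous-batching PR above can be illustrated with plain lists. This is a hedged sketch of the scheduling intuition, not the CB scheduler code: sorting pending requests by their token prefix places prompts sharing a prefix next to each other, so a prefix cache gets maximal reuse.

```python
# (request_id, prompt token ids) pairs; ids and tokens are invented.
requests = [
    ("r1", [1, 2, 9]),
    ("r2", [7, 7]),
    ("r3", [1, 2, 3]),
]

# Lexicographic sort on the token lists groups shared prefixes together.
ordered = sorted(requests, key=lambda r: r[1])
```

After sorting, `r3` and `r1` (shared prefix `[1, 2]`) are adjacent, so the cached prefix computed for one can be reused for the other.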
* stack lists of tensors in BatchFeature, improve error messages, add tests
* remove unnecessary stack in fast image processors and video processors
* make style
* fix tests
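The BatchFeature change above (stack lists of tensors, improve error messages) can be sketched like this. The function name and error text are illustrative, not the library's actual implementation:

```python
import torch


def stack_values(values):
    """Stack a list of same-shape tensors into one batched tensor.

    Raises a descriptive error instead of torch's generic one when
    the shapes do not match.
    """
    if isinstance(values, list) and values and all(isinstance(v, torch.Tensor) for v in values):
        shapes = {tuple(v.shape) for v in values}
        if len(shapes) > 1:
            raise ValueError(f"Cannot stack tensors with mismatched shapes: {sorted(shapes)}")
        return torch.stack(values)
    return values  # non-tensor inputs pass through unchanged


batch = stack_values([torch.zeros(3, 2), torch.zeros(3, 2)])
```

Doing the stacking once in `BatchFeature` is what lets the per-processor stacking calls mentioned in the next commit be removed.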
* parallelize and cleanup
* simplify offloading
* fix
* oops
* add env variable to deactivate
* revert threading -> safetensors does not release the GIL
* comment
* create helper
* move it to accelerate integration
* Make sam3 tests pass on XPU
* Update flm2 tests GT for XPU
* Remove the skip tests of local mask for XPU
* Pass position_ids to varlen FA2
* Change modular also
* Skip FA2 bwd tests
* Make style
* Increase rtol
* Adapt to the main branch
* fix cuda 1
---------
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Automatic release
* Install transformers from within the build
* setuptools
* Check build doesn't need to exist anymore
* Check build doesn't need to exist anymore
* -y
* torch install for pipeline
* TestPypi upload
* Fine tune
* Fine tune
* Update release instructions
* Update .github/workflows/release.yml
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* raise
* comment
* fix
* add test
* fix
* add back return
* small
* raise after report
* typos
* fix
* patch
* switch name
* doc
* oops, that was commented out
* add mask generation fine-tuning docs
* initial commit
* update video text to text
* fix autoprocessor
* bump model, update API
* add torch.compile
* Add results
* Update docs/source/en/tasks/image_text_to_text.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/image_text_to_text.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/image_text_to_text.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/mask_generation.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/mask_generation.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/mask_generation.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/mask_generation.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/video_text_to_text.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/mask_generation.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/mask_generation.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/mask_generation.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/video_text_to_text.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/video_text_to_text.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/image_text_to_text.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/tasks/mask_generation.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update image_text_to_text.md
* Update docs/source/en/tasks/video_text_to_text.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* No more size 0 cuda graph
* Better tests for CB
* compile fix for CB test
* style
* More cleanup and cuda exclusive
* Returned to slow tests
* Change decorator order
* Restore XPU change
* Rebase fixes
* enable xpu in fp8_gemm
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
* refine the code
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
* updated
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
* fix
* style
* small fix
---------
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
* feat: Remove logits upcast when computing loss
When the CausalLM loss is used, the upcast is done in the loss function
utils, so this is redundant.
https://github.com/huggingface/transformers/issues/42709
Branch: GraniteOptionalUpcast-42709
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* chore: make fix-copies
https://github.com/huggingface/transformers/issues/42709
Branch: GraniteOptionalUpcast-42709
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
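The redundancy described above can be shown in a few lines. This is a hedged sketch of the pattern, assuming a loss helper shaped like the CausalLM loss utilities; it is not the Granite code itself:

```python
import torch


def causal_lm_loss(logits, labels):
    # The single upcast lives in the loss utility, so the model forward
    # does not need to upcast logits itself.
    logits = logits.float()
    return torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1)
    )


logits = torch.randn(2, 4, 10, dtype=torch.float16)
labels = torch.randint(0, 10, (2, 4))
loss = causal_lm_loss(logits, labels)
```

Keeping logits in half precision until the loss is computed saves one full-vocabulary float32 copy per forward pass.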
* Override Transformers defaults by GGUF defaults
In some models, GGUF uses default or fixed values different from this
library. To integrate GGUF-based models without additional configuration,
we need some kind of compatibility layer.
This commit provides additional mapping to provide GGUF-specific default
values to initialize parameters in this library.
Currently, only fixed "norm_topk_prob" value of Qwen3 MoE (True) is
defined because (a) it differs from the default value of this library
(False) and (b) if this parameter is incorrectly set, it results in
almost completely garbled output.
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
* Apply suggestions from code review
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
---------
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
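The GGUF compatibility layer above can be sketched as a small override table. The table and function names here are illustrative, not the actual integration code; only the `norm_topk_prob` fact for Qwen3 MoE comes from the commit message:

```python
# Hypothetical GGUF -> Transformers default-override mapping.
GGUF_CONFIG_DEFAULT_OVERRIDES = {
    # Qwen3 MoE GGUF files fix norm_topk_prob to True, while the library
    # default is False; getting this wrong garbles the output.
    "qwen3moe": {"norm_topk_prob": True},
}


def apply_gguf_defaults(model_type: str, config_dict: dict) -> dict:
    """Fill in GGUF-specific defaults without clobbering explicit values."""
    overrides = GGUF_CONFIG_DEFAULT_OVERRIDES.get(model_type, {})
    return {**overrides, **config_dict}


cfg = apply_gguf_defaults("qwen3moe", {"num_experts": 128})
```

Merging the overrides first means an explicit value parsed from the GGUF metadata still wins over the compatibility default.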
* fix: support tensor labels in DataCollatorWithFlattening
- Add tensor to list conversion in DataCollatorWithFlattening
- Convert input_ids and labels to list if they are tensors
- Add tests for both tensor and list labels
- Fixes #42599
* style: fix whitespace linting errors
* style: apply ruff format to test file
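The tensor-to-list normalization above can be sketched as follows. This is an illustrative simplification of a flattening collator, assuming the usual `-100` label masking at packed-example boundaries; it is not the real `DataCollatorWithFlattening`:

```python
import torch


def _to_list(x):
    # Accept either torch tensors or plain lists for input_ids / labels.
    return x.tolist() if isinstance(x, torch.Tensor) else x


def flatten_features(features):
    """Concatenate per-example sequences into one flat packed batch."""
    flat_ids, flat_labels = [], []
    for f in features:
        flat_ids += _to_list(f["input_ids"])
        labels = _to_list(f.get("labels", f["input_ids"]))
        # Mask the first token of each packed example so loss does not
        # bleed across example boundaries.
        flat_labels += [-100] + labels[1:]
    return {"input_ids": flat_ids, "labels": flat_labels}


batch = flatten_features([
    {"input_ids": torch.tensor([1, 2, 3])},  # tensor input now works
    {"input_ids": [4, 5]},                   # list input still works
])
```

Converting tensors up front keeps the rest of the flattening logic purely list-based, which is what the fix relies on.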
* Only call `torch.autocast` if it will have an effect
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* whitespace
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* fixup
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* fix copies
---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Arthur <arthur.zucker@gmail.com>
* Only default `rope_parameters` to empty `dict` if there is something to put in it
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* Add warning
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* Also catch explicit `{}`
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* Handle model with layer types and rope theta but not rope parameters
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
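The rope_parameters commits above describe one resolution rule. A hedged sketch under assumed config keys (`rope_parameters`, `rope_theta`); the function name is invented for illustration:

```python
def resolve_rope_parameters(config_dict):
    """Only create a rope_parameters dict when there is something to put in it."""
    rope = config_dict.get("rope_parameters")
    theta = config_dict.get("rope_theta")
    # Catch both missing and explicit `{}` values: with no theta either,
    # there is nothing to record, so keep the attribute as None.
    if rope in (None, {}) and theta is None:
        return None
    rope = dict(rope or {})
    if theta is not None:
        # Model has rope theta but no rope parameters: fold theta in.
        rope.setdefault("rope_theta", theta)
    return rope
```

This keeps configs without any RoPE settings from silently growing an empty dict attribute.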
* Add an alternative scenario in case patch_offsets is None
* Fixup
* Fix an error
* Simplified the function
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* switch
* remove now useless save_function
* a bit more involved than i thought
* all converters
* fix
* pretty print
* fix
* trainer
* update musicgen.md docs
* marc comments
* doc and last missed instances
* CI
---------
Co-authored-by: Wauplin <lucainp@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* On commit to bind them all
* nits
* small update
* elif
* super small nit
* BPE!
* fix
* up up up
* fix?
* one typo
* per model updates
* more model specific updates
* more per model updates
* more model specific updates
* simplify default merges
* fixup
* update
* update
* style
* fix
* fix colpali
* nits
* simpler regex + big bird fix
* fixup and fix
* fix codellama
* up
* fix pop on none
* fix parakeet
* fix llama
* big fixup
* fix markuplm
* update common
* fix mbart
* fix seamlessm4T
* fix comment
* torch tests
* nits and revert UNK idx change
* oh only one deberta
* torch tests
* add convert from spm per model!
* fix last 2 for pegasus
* fix torch tests
* fixes
* fix tests
* check versioned files
* fix processor auto test
* fix custom tok clip
* try this fix
* modeling rag
* fix rag
* roformer the Tokenizers way
* up
* update
* fix unk
* update
* fix roberta
* if there is no mapped class and no tokenizer.json it breaks -> just have the mapped class ready!
* fix the rest
* fix copies
* fix doc and copies
* fix mbart50
* fix deberta_v2 test
* fix and simplify whisper :)
* fix big bird default, it was wrong
* fix final
* fixup
* small nit
* a weird way to fix fuyu?
* default xlm roberta to fix kosmos behaviour!
* remove small errors
* last fix?
* fix pixtral
* style
* fix
* quality
* fix?
* remove something
* remove code that should not have been there!
* fix ?
* fixup
* update
* fix for custom code
* add a custom model path to make sure custom stuff is registered
* fix trust remote code
* exceeded
* don't
* oops for cohere
* why is this one also affected
* fixup
* fixup
* nits
* fix idefics3 tests
* okay read the processor
* fix the layout.... models
* nits
* codellama needs the bos passed
* fix dpr
* fix?
* fixup
* distilbert defaults
* fix
* clvp update to PythonTokenizer
* bloom
* style
* layoutxlm
* style
* olmo
* only pop when we don't convert from tokenizer.json
* fixup
* hub issue
* id
* fix
---------
Co-authored-by: itazap <ita.zaporozhets@huggingface.co>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-164-45.ec2.internal>
* Fixed UT and kernel loading logic.
* Revision based on comments
* Simplify code
* make style
* simplify CB part
* retrigger ci
---------
Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* FIX Error when trying to load non-LoRA PEFT
This PR fixes a bug that prevented non-LoRA PEFT adapters to be loaded
into a transformers model. A test for this has been added.
Additionally, also testing if a non-LoRA adapter can be added to a
transformers model. This was not broken but still lacked test coverage.
* Reviewer feedback: Remove check completely
* Let transformers know when a model is being traced via jax.jit
Move check earlier
* Add docstring
* add docstring
---------
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* nit on dac!
* first commit
* fixes
* passing all API tests
* nit
* update
* test update
* make style
* make
* fixes
* fix
* test update
* return attention mask by default
* doc update
* make
* make
* fix
* do not load everything in advance
* fix
* fix
* fix
* fix
* fix memory leaks during conversion
* oops
* fix device_map
* add doc
* fix
* doc
* make it a method
* doc
* first shot at test
* fix test
* fix
* revert test: cpu mem too hard to track correctly
* fix
* begin test ci training
* add better logging
* add better logging + training loop
* fix sentence + grad_norm assert
* create circleci config for training
* fix ci to detect training_ci job
* add -s for pytest CI
* add generate assert as well
* make training ci trigger for every model change instead
* set eos_token_id to 0 otherwise it will stop generating too soon
* refactor
* moving logging and metrics to proper files
* update marker in pyproject.toml
* linting
* linting again
* reduce pytest worker
* fix deadlock in test
* add license
* dont show logs during test
* loosen threshold a bit
* Initialize cache state in RecurrentGemmaRecurrentBlock based on batch… (#42627)
* Initialize cache state in RecurrentGemmaRecurrentBlock based on batch size
* loosen a bit the threshold
* Revert "loosen a bit the threshold"
This reverts commit d3d42e1e3b.
* skipping BLT for now (until it gets fixed)
* docs: clarify recommended usage of max_new_tokens in generate()
* Move max_new_tokens recommendation into GenerationConfig docstring
* Remove Note about max_new_tokens and max_length
* Update src/transformers/generation/configuration_utils.py
Apply reviewer suggestion for clearer token explanation
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update src/transformers/generation/configuration_utils.py
Apply reviewer suggestion for clearer token explanation
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Remove Note about max_new_tokens and max_length
* Docs: clean up empty line in text_generation.md
* Docs: simplify max_length explanation in generation config
* style: fix trailing whitespace in configuration_utils.py
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
* Just-in-time (JIT) asynchronous checkpointing using SIGTERM signals and cuda streams.
* Fix failing ci tests
* Update JIT checkpoint code to remove CUDA streams and async checkpointing. Introduce sentinel file to identify incomplete checkpoints. Update trainer arg doc and tests.
* Fix sentinel file save path to checkpoint folder, update checkpoint related envs with HF_ prefix.
* Refactor JIT checkpoint logic: rename methods and variables for clarity, improve SIGTERM handling, and update related tests.
* Remove support for environment variable overrides in `TrainingArguments` and update related documentation.
* Apply style fixes
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Adapt some test case on npu
* Adapt some test case on npu
---------
Co-authored-by: mamba-chen <chenhao388@huawei.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
* Flip the default return type for `apply_chat_template` to match the underlying tokenizer
* Remove test_tokenization_for_chat tests, which no longer do anything useful
* Remove test_tokenization_for_chat tests, which no longer do anything useful
* Fix test_encode_message tests
* Fix test_encode_message tests
* nit fix
* Trigger tests
* Remove test_tokenization_for_chat
* make fixup
* Add a little test to make sure that doesn't happen again
* make fixup
* Update modeling_sam3_tracker.py
* Apply docstring to modular_sam3_tracker.py for consistency
Added custom introduction to auto_docstring for Sam3TrackerPreTrainedModel to match the typo fix in modeling_sam3_tracker.py
* Fix SAM3 tracker docstrings in modular and regenerate modeling
* fix(Qwen3VLCausalLMOutputWithPast): missing `hidden_states` and `attentions` kwargs
* revert kwargs for the base class
* make regenerated model files
* symmetrical change for `modular_qwen3_vl_moe.py`
* regenerated `modeling_qwen3_vl_moe.py`
* Add **kwargs into every Model.forward()
* Add the test back in
* And the others I missed
* Fix udop test
* Fix fastspeech2conformer test
* make fixup
* extend FA2 and other cases to XPU; we expect all model cases except CUDAGraph-specific,
CUDA compute capability-specific, and FA3-specific ones to run on XPU.
For FA3, support is still in development.
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
* fix style
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
* Update modeling_mimi.py
---------
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
* Create cache when training in case generate needs being called
* Align modular
* fixes
* cohere
* fix modular
* fix
* review
---------
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* draft
* draft
* draft
* draft
* draft
* draft
* draft
* draft
* draft
* draft
* draft
* fail to see the check
* fail to see the check
* fail to see the check
* fail to see the check
* fail to see the check
* Apply style fixes
* fail to see the check
* fail to see the check
* fail to see the check
* Apply repo. consistency fixes
* fail to see the check
* Apply repo. consistency fixes
* fail to see the check
* delete
* Apply repo. consistency fixes
* comment
* Apply repo. consistency fixes
* comment
* Apply repo. consistency fixes
* comment
* Apply repo. consistency fixes
* comment
* Apply repo. consistency fixes
* back
* check
* check
* check
* check
* check
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Add SDPA and Flash Attention support for PatchTST model
- Add _supports_sdpa = True and _supports_flash_attn = True to PatchTSTPreTrainedModel
- The existing PatchTSTAttention class already uses ALL_ATTENTION_FUNCTIONS
to select the attention implementation based on config._attn_implementation
- Fix test_modeling_patchtst.py _prepare_for_class for dynamic batch sizes
* Guard PatchTST positional init under ZeRO-3
* Force SDPA in PatchTST regression integration test
* Use sdpa attn in PatchTST regression test
* fixups re tests
---------
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
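The PatchTST commit above relies on the `ALL_ATTENTION_FUNCTIONS` dispatch pattern: the attention layer looks up its implementation from a registry keyed by `config._attn_implementation`. A hedged, self-contained sketch of that pattern with dummy implementations (the class and return values are invented for illustration):

```python
# Dummy registry standing in for transformers' ALL_ATTENTION_FUNCTIONS.
ATTENTION_REGISTRY = {
    "eager": lambda q, k, v: "eager result",
    "sdpa": lambda q, k, v: "sdpa result",
    "flash_attention_2": lambda q, k, v: "fa2 result",
}


class PatchTSTAttentionSketch:
    """Looks up its attention function from the registry at init time."""

    def __init__(self, attn_implementation="eager"):
        self.attn_fn = ATTENTION_REGISTRY[attn_implementation]

    def forward(self, q, k, v):
        # All backends compute the same thing; only the implementation differs.
        return self.attn_fn(q, k, v)


attn = PatchTSTAttentionSketch("sdpa")
```

Because the existing `PatchTSTAttention` already dispatched this way, enabling SDPA and FlashAttention reduced to flipping the `_supports_sdpa` / `_supports_flash_attn` class flags.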
* transfer commit
* Allow fullgraph and little fixes
* fix when no measurements
* CB is better at handling compile. Also can be benched.
* Style
* Add summarized by default
* Doc and better CG logic
* CG logic rollback
* Update to Fa3 thx to Anton
* style
* Fix mixed torch.Tensor and DTensor error by registering fsdp forward for trainer.mode.generate
* Apply fsdp forward register when fsdp is enabled
---------
Co-authored-by: yiminzme <yiminzme@gmail.com>
Co-authored-by: Ferdinand Mom <47445085+3outeille@users.noreply.github.com>
Fix three typos found in code comments:
- 'avaoid' → 'avoid' in modeling_utils.py
- 'weigth' → 'weight' in trainer_utils.py
- 'Templace' → 'Template' in convert_slow_tokenizer.py
These typos appeared in TODO comments and inline documentation.
* fix resume from epoch >= 1
* add test checking order of sampled data points
* add require_torch_non_multi_accelerator decorator to test method
* move the epoch setting of epoch_dataloader before iterating over it
* make fixup
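The epoch-resume fix above hinges on calling the sampler's `set_epoch` before iterating, so the shuffle order for a resumed epoch matches the original run. A minimal sketch with a toy sampler (not the accelerate/trainer code):

```python
import random


class ShuffledSampler:
    """Toy epoch-seeded sampler mimicking DistributedSampler's set_epoch contract."""

    def __init__(self, n):
        self.n = n
        self.epoch = 0

    def set_epoch(self, epoch):
        # Must be called BEFORE iterating; the epoch seeds the shuffle.
        self.epoch = epoch

    def __iter__(self):
        order = list(range(self.n))
        random.Random(self.epoch).shuffle(order)
        return iter(order)


s = ShuffledSampler(5)
s.set_epoch(1)
resumed = list(s)  # same order any run with epoch == 1 would produce
```

If `set_epoch` were called after iteration started (the bug), the resumed run would replay epoch 0's order and visit the wrong data points.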
echo "Memory monitor started with PID $(cat memory_monitor.pid)"
# Give it a moment to start
sleep 2
# Verify it's running
ps aux | grep memory_monitor | grep -v grep || echo "Warning: memory monitor may not be running"
- name: Install utilities
run: |
apt-get install -y nano
- name: Store Slack infos
# Because SSH can be enabled dynamically if the workflow failed, store the Slack infos so they can be retrieved during the wait-for-ssh step
All new models should use the modular architecture pattern. Create a `modular_<model_name>.py` file using the modular model converter:
- Use the CLI, [`transformers add-new-model-like`](https://github.com/huggingface/transformers/blob/main/src/transformers/cli/add_new_model_like.py), to generate a modular skeleton and get started
- All code should be in the modular file if possible. Modeling must be in it, and it's better if the configuration is in it as well. The [Modular guide](https://huggingface.co/docs/transformers/modular_transformers#implementing-a-modular-file) shows a quick way to set up a modular file.
- Reuse existing patterns from similar models as much as possible
- You can make the model compatible with inference engines such as vLLM or SGLang, and enable zero-effort integration. See specific requirements for model implementation in ["Transformers modeling backend"](https://huggingface.co/docs/transformers/transformers_as_backend#multimodal-models)
class Llama5Tokenizer(TokenizersBackend):
        self._vocab = vocab
        if merges is not None:
            self._merges = merges or []
        else:
            self._merges = generate_merges(filtered_vocab)
### Remote code incompatibility

A lot of paths were removed and reworked; paths like `transformers.tokenization_utils` and `transformers.tokenization_utils_fast` no longer exist. These now redirect to `transformers.tokenization_utils_sentencepiece` and `transformers.tokenization_utils_tokenizers` respectively; please update imports accordingly.

_We aim for this to be fixed and released in a following release candidate in the week that follows RC0._
- related to 1., it is not possible to set proxies from your script. To handle proxies, you must set the `HTTP_PROXY` / `HTTPS_PROXY` environment variables
- `hf_transfer` and therefore `HF_HUB_ENABLE_HF_TRANSFER` have been completely dropped in favor of `hf_xet`. This should be transparent for most users. Please let us know if you notice any downside!
`typer-slim` has been added as a required dependency, used to implement both the `hf` and `transformers` CLIs.
4-bit quantization allows the model to run on GPUs such as the RTX3090, V100, and T4, which are easily accessible to most people.
For more information on quantization, and to see how models can be quantized to require even less than 4 bits of GPU VRAM, we recommend looking at the [`GPT-QModel`](https://huggingface.co/docs/transformers/main/en/main_classes/quantization#gptqmodel) implementation.
> In conclusion, it is important to remember that model quantization trades improved memory efficiency against accuracy and, in some cases, inference time.
@@ -13,103 +13,145 @@ rendered properly in your Markdown viewer.
-->
# Attention Interface
# Attention backends
This page describes how to use the `AttentionInterface` in order to register custom attention functions to use with
supported models.
All attention implementations perform the same computation. Every token is compared to every other token. The difference is *how* the computation is performed. Basic attention scales poorly because it materializes the full attention matrix in memory, creating bottlenecks that slow down inference. Optimized implementations rearrange the math to reduce memory traffic for faster, more affordable inference.
## Customizing attention function
The [`AttentionInterface`] provides optimized attention implementations. It decouples the attention implementation from the model implementation to simplify experimentation with different functions. Add new backends easily with this consistent interface.
Most recent models can now switch from one attention function used in the Attention layer to the other, thanks to a simple mapping.
By default, we provide the implementation for [`sdpa`](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html),
[`flash_attention_2`](https://github.com/Dao-AILab/flash-attention) and [`flex_attention`](https://pytorch.org/docs/stable/nn.attention.flex_attention.html#module-torch.nn.attention.flex_attention)
as well as `eager`, which is a simple matrix multiplication without any optimization on top.
This is the setting you can usually choose when instantiating a model:
| attention backend | description |
|---|---|
| `"flash_attention_3"` | improves FlashAttention-2 by also overlapping operations and fusing forward and backward passes more tightly |
| `"flash_attention_2"` | tiles computations into smaller blocks and uses fast on-chip memory |
| `"flex_attention"` | framework for specifying custom attention patterns (sparse, block-local, sliding window) without writing low-level kernels by hand |
But what if you wanted to create your own attention function? Or simply play around with existing ones, adding
a few statements here and there? You can now do so with the `AttentionInterface`! Here is an example:
Switch between attention backends at runtime without reloading the model using [`~PreTrainedModel.set_attn_implementation`].
```python
from transformers import AutoModelForCausalLM, AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward
import torch
```py
model.set_attn_implementation("sdpa")
```
model_id = "meta-llama/Llama-3.2-1B"
### Kernels
def my_new_sdpa(*args, **kwargs):
print("I just entered the attention computation")
return sdpa_attention_forward(*args, **kwargs)
Download and load compiled compute kernels directly from the [Hub](https://huggingface.co/models?other=kernels) at runtime with the [Kernels](https://huggingface.co/docs/kernels/index) library. This avoids packaging issues from mismatched PyTorch or CUDA versions.
You will see it prints "I just entered the attention computation" as many times as there are layers in the model (with this example, 16 times).
### SDPA context manager
## Dynamically switching attention function
PyTorch's scaled dot product attention (SDPA) selects the fastest attention function for CUDA backends automatically. It defaults to the PyTorch C++ implementation for other backends.
You could dynamically change the model's attention function as well:
Force SDPA to use a specific implementation with the [torch.nn.attention.sdpa_kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html) context manager.
```python
# Back to use original sdpa implementation
model.set_attn_implementation("sdpa")
```py
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
and it will stop printing the statements, as it now uses the `sdpa` attention.
This allows to quickly change an attention function, without needing to reload the model!
## Backbone-specific attention
## Different attention per backbone in multimodal models
Multimodal models use different backbones for each modality. Optimize performance by assigning specific attention functions to each backbone. Some vision backbones perform better in fp32, for example, which FlashAttention does not support.
For multimodal models different attention functions may work better for each backbone module. For example, some vision backbones perform better in fp32, but are incompatible with FlashAttention. To continue using FlashAttention while keeping the vision encoder in fp32, create a dict and map each config to an attention implementation as shown below.
Map vision backbones to different attention functions with a dict while the text backbone continues to use FlashAttention. Keys in the attention implementation must match sub-config names.
```python
```py
from transformers import AutoModelForImageTextToText
## What about new args needed in my custom attention function?
## Create a new attention function
Customize or create new attention functions by adding them to the attention registry with [`AttentionInterface.register`]. Models use these functions through the `attn_implementation` argument.
But indeed, what if the new function requires a new arg to be properly used? It's no issue! Models supporting the
`AttentionInterface` propagate kwargs all the way to the Attention layers, and to the used attention function. That way,
you can simply pass the arg (as a kwargs, i.e. you need to qualify the name of the arg) in the model's forward, and it will be correctly used in the attention. However, custom attention functions have some limitations. In particular, it must follow the signature and return format of other attention functions, i.e.
This example customizes the attention function to print a statement for each layer.
```python
import torch
from transformers import AutoModelForCausalLM, AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", attn_implementation="my_new_sdpa")
model(torch.ones(1, 5, dtype=int))
```
You can also add new arguments to the attention function. Models supporting [`AttentionInterface`] propagate kwargs to attention layers and the attention function. Pass arguments as kwargs in the model's forward function. Custom attention functions must follow this signature and return format.
```python
import torch
from transformers import AutoModelForCausalLM, AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward
If in doubt about what args/kwargs a given model sends to the attention function, simply check that model's modeling code on [GitHub](https://github.com/huggingface/transformers/tree/main/src/transformers/models)!
Check a model's [modeling code](https://github.com/huggingface/transformers/tree/main/src/transformers/models) to confirm what arguments and kwargs it sends to the attention function.
## Accessing current available implementations
### AttentionMaskInterface
Most of the time, you will simply need to `register` a new function. If, however, you need to access an existing one,
and/or perform a few checks, the preferred way is to use the global `ALL_ATTENTION_FUNCTIONS`. It behaves the same way you
would expect from a usual Python dictionary:
```python
>>> from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
Having a new attention function may mean that you need a new format of attention mask to decide what key and value tokens
the query tokens should attend to. This is now possible with the `AttentionMaskInterface`! It works in the same way as
the `AttentionInterface`:
Configure which key and value tokens queries attend to with [`AttentionMaskInterface`]. Some attention functions require this configuration. Customize the attention mask function and add it to the registry with [`AttentionMaskInterface.register`].
```python
import torch
from transformers import AttentionMaskInterface
from transformers.masking_utils import sdpa_mask
import torch
def my_new_sdpa_mask(*args, **kwargs):
print("I just entered the attention mask computation")
The reason you have to register it is because we need to automatically correct your mask format based on the attention implementation (for example, flex attention uses a BlockMask format, while sdpa uses a 4D tensor).
By default, if you do not register an attention mask function along with your attention function, mask creation will be skipped
and `attention_mask=None` will be passed along to the Attention layers.
Registered attention masks automatically correct the mask format for the attention implementation. For example, FlexAttention uses a [BlockMask](https://docs.pytorch.org/docs/stable/nn.attention.flex_attention.html?utm_source=chatgpt.com#torch.nn.attention.flex_attention.BlockMask) format, while SDPA uses a 4D tensor. Without a registered attention mask function, mask creation is skipped and `attention_mask=None` passes to the model's attention layers.
This is the default signature for an attention mask function:
```python
def custom_attention_mask(
    batch_size: int,
    cache_position: torch.Tensor,
    kv_length: int,
    kv_offset: int = 0,
    mask_function: Optional[Callable] = None,
    attention_mask: Optional[torch.Tensor] = None,
    **kwargs,
) -> Optional[torch.Tensor]:
```
The `mask_function` argument is a `Callable` in the style of [torch's mask_mod functions](https://pytorch.org/blog/flexattention/): it takes 4 indices as input and returns a boolean indicating whether the position should take part in the attention computation.
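For illustration, a causal mask written in this style is a minimal sketch; the four indices are the batch index, head index, query position, and key/value position:

```python
# mask_mod-style callable: returns True when the query position may attend to the key/value position
def causal_mask_function(batch_idx, head_idx, q_idx, kv_idx):
    return kv_idx <= q_idx
```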
If you cannot use `mask_function` to create your mask for some reason, you can try to work around it by doing something similar to our [torch export workaround](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/executorch.py).
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Chat message patterns
Chat models expect conversations as a list of dictionaries. Each dictionary uses `role` and `content` keys. The `content` key holds the message passed to the model. Large language models accept text and tools, and multimodal models combine text with images, videos, and audio.
Transformers uses a unified format where each modality type is specified explicitly, making it straightforward to mix and match inputs in a single message.
This guide covers message formatting patterns for each modality, tools, batch inference, and multi-turn conversations.
## Text
Text is the most basic content type. It's the foundation for all other patterns. Pass your message to `"content"` as a string.
```py
message = [
{
"role": "user",
"content": "Explain the French Bread Law."
}
]
```
You could also use the explicit `"type": "text"` format to keep your code consistent when you add images, videos, or audio later.
```py
message = [
{
"role": "user",
"content": [{"type": "text", "text": "Explain the French Bread Law."}]
}
]
```
## Tools
[Tools](./chat_extras) are functions a chat model can call, like getting real-time weather data, instead of generating it on its own.
The `assistant` role handles the tool request. Set `"type": "function"` in the `"tool_calls"` key and provide your tool to the `"function"` key. Append the assistant's tool request to your message.
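As a self-contained sketch (the `get_current_temperature` tool name and its arguments are hypothetical), the assistant's tool request looks like this:

```python
# hypothetical conversation: the model requests a temperature lookup
message = [{"role": "user", "content": "What is the temperature in Paris?"}]
message.append({
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {"name": "get_current_temperature", "arguments": {"city": "Paris"}},
        }
    ],
})
```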
The `tool` role handles the result. Append it in `"content"`. This value should always be a string.
```py
message.append({"role": "tool", "content": "22"})
```
## Multimodal
Multimodal models extend this format to handle images, videos, and audio. Each input specifies its `"type"` and provides the media with `"url"` or `"path"`.
### Image
Set `"type": "image"` and use `"url"` for links or `"path"` for local files.
```py
message = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/pastries.jpg"},  # replace with your image URL or use "path" for a local file
            {"type": "text", "text": "What type of pastries are these?"},
        ],
    }
]
```
## Batched
Batched inference processes multiple conversations in a single forward pass to improve throughput and efficiency. Wrap each conversation in its own list, then pass them together as a list of lists.
```py
messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://example.com/pastry.jpg"},  # placeholder URL
                {"type": "text", "text": "What type of pastry is this?"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [{"type": "text", "text": "Explain the French Bread Law."}],
        },
    ],
]
```
## Multi-turn
Conversations span multiple exchanges, alternating between `"user"` and `"assistant"` roles. Each turn adds a new message to the list, giving the model access to the full conversation history. This context helps the model generate more appropriate responses.
```py
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/pastry.jpg"},  # placeholder URL
            {"type": "text", "text": "What pastry is shown in the image?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "This is kouign amann, a laminated dough pastry (i.e., dough folded with layers of butter) that also incorporates sugar between layers so that during baking the sugar caramelizes."}],
    },
]
```
# Continuous batching
Continuous batching (CB) is an advanced technique that optimizes the inference of transformer models by dynamically grouping multiple requests into batches. It maximizes GPU utilization, increases throughput, and reduces latency by using dynamic scheduling to rearrange the batch at each step: completed requests are removed and new requests join immediately so the GPU never idles. Chunked prefill prevents expensive prefill work from stalling the batch while still allowing new requests to join.
We are particularly interested in having Continuous Batching in transformers for the following use cases:
- Evaluation of models on large datasets with variable-length inputs
- Generating outputs for multiple sequences for GRPO policies
Continuous batching works with [transformers serve](./serving), a server for deploying local models, and [`~ContinuousMixin.generate_batch`].
CB is what makes inference engines like vLLM or SGLang efficient. That said, Transformers does not aim to be a production-ready inference engine but a complete framework for model development, which is why CB is also available in `transformers serve`.
## generate_batch
If you are not familiar with some of the core concepts CB is built upon, we invite you to read the associated blog post: [Continuous Batching: Efficient Inference for Large Language Models](https://huggingface.co/blog/continuous_batching).
The main way to use CB in transformers is via the `generate_batch` method.
Unlike `generate`, CB takes already tokenized inputs, known as input IDs. Each sequence of input IDs is represented as a list of integers (`list[int]`). For a more detailed example, please refer to: [examples/continuous_batching](./path/to/example)

### `generate_batch` example

All autoregressive text models support CB through a `ContinuousMixin` that `GenerationMixin` inherits from, which adds the `generate_batch` method.
The [`~ContinuousMixin.generate_batch`] method works with all autoregressive text models. It accepts a list of tokenized inputs and a [`GenerationConfig`] to configure generation settings.
```py
import datasets
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# the model and dataset are illustrative; any autoregressive text model works
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507", dtype=torch.bfloat16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

dataset = datasets.load_dataset("openai/gsm8k", "main", split="test[:8]")
simple_batch_inputs = [tokenizer(item["question"])["input_ids"] for item in dataset]

batch_outputs = model.generate_batch(
    inputs=simple_batch_inputs,
    generation_config=GenerationConfig(max_new_tokens=32, do_sample=False),
)
# batch_outputs maps request ids to their generation outputs
for request_id, output in batch_outputs.items():
    print(request_id, tokenizer.decode(output.generated_tokens, skip_special_tokens=True))
```
If you want more control over how requests are scheduled, you can use the `ContinuousBatchingManager` class directly. This is what `transformers serve` uses, because requests arrive asynchronously and the asynchronous nature of the CB process can be leveraged for efficiency.
## ContinuousBatchingManager
Under the hood, the [`ContinuousBatchingManager`] creates a background thread that receives requests from a Python `queue.Queue` and batches them into each forward pass, filling the GPU to capacity. Every iteration checks for finished requests and schedules new ones to join the batch. The manager is thread-safe, so use it to customize request scheduling.
Call [`~ContinuousMixin.init_continuous_batching`] to initialize the manager with a [`GenerationConfig`] and [`~ContinuousBatchingManager.start`] the background thread.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
from transformers.generation.continuous_batching import RequestStatus

# the model and prompts are illustrative
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507", dtype=torch.bfloat16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
simple_batch_inputs = [tokenizer(text)["input_ids"] for text in ["Hello, how are you?", "What is the capital of France?"]]

manager = model.init_continuous_batching(generation_config=GenerationConfig(max_new_tokens=32))
manager.start()
```
Use [`~ContinuousBatchingManager.add_request`] to asynchronously submit individual requests. Provide a specific request id or the manager generates one automatically.

```py
# this is for demonstration purposes only, in practice this is most useful to do concurrently
for i, input_ids in enumerate(simple_batch_inputs):
    # if you do not specify a request_id, one will be generated for you
    manager.add_request(input_ids=input_ids, request_id=f"request_{i}")

# retrieving results can be done in another thread
for id, request in manager.get_result():
    print(id, request)
```
Use the `request_id` of a specific request to get its results. This is a blocking operation that waits until the result is ready.
```py
result = manager.get_result(request_id="request_5")
```
Stream partial results for a specific request with [`~ContinuousBatchingManager.request_id_iter`].
```py
manager.add_request(
    input_ids=input_ids,
    request_id="streaming_request",
    stream=True,
)

for chunk in manager.request_id_iter(request_id="streaming_request"):
    # FIXME: stop iteration in `request_id_iter` when finished instead of doing it externally
    if chunk.status == RequestStatus.FINISHED:
        break
```
Call [`~ContinuousBatchingManager.stop`] to terminate the manager.
```py
# stop the background thread before exiting the process
manager.stop()
```
## PagedAttention

PagedAttention breaks large key-value caches into smaller, non-contiguous fixed-size pages to avoid GPU memory fragmentation and support variable-length requests. Transformers automatically enables PagedAttention when using continuous batching.

You can also explicitly enable PagedAttention when instantiating a model rather than waiting for [`~ContinuousMixin.generate_batch`] to enable it dynamically.

```py
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    attn_implementation="paged|flash_attention_2",
    device_map="cuda",
    dtype=torch.bfloat16,
)
```

## Supported and unsupported features

Continuous batching currently supports:

- Dynamic scheduling of variable-length requests
- Chunked prefill
- Paged attention cache
- Sliding window attention
- Chat templates

The following features are not yet supported, but we plan to add them:

- Prefix caching
- Beam search
- Tool calling

Other features are unplanned, but we may consider adding them depending on community requests.

## Sliding window attention

Sliding window attention limits a token's backward context to a fixed window to save compute. Generation cost stays proportional to the window size, which reduces compute per step and simplifies continuous batching.

Transformers models like Mistral and Gemma 2 natively support sliding window attention. Manually enable it in the model config if the architecture supports it, which helps with fine-tuning or running custom experiments. Usage remains the same with [`~ContinuousMixin.generate_batch`].
You can use CB in `transformers serve` by passing the `--continuous-batching` flag when starting the server.
## How it works

The [`ContinuousMixin`] class serves as the main interface for continuous batching through [`~ContinuousMixin.generate_batch`]. This method internally creates a [`ContinuousBatchingManager`].

[`ContinuousBatchingManager`] manages requests by creating a background thread for the generation loop and adding requests to the queue. The manager is thread-safe, allowing asynchronous request additions while the model generates.

The [`Scheduler`] selects requests for processing at each step based on the token budget. [`FIFOScheduler`] is the default scheduler. It prioritizes decoding requests over prefilling requests and assigns them to specific memory blocks. [`PrefillFirstScheduler`] prioritizes prefill requests instead.

[`ContinuousBatchingManager`] runs the model forward pass for the scheduled requests. It then collects and returns the results.

## Monitoring

We have added `opentelemetry` support to continuous batching to help you monitor its performance in production. To enable it, install the `opentelemetry` extra when installing `transformers`:

```sh
# this installs `opentelemetry-api`, `opentelemetry-sdk` and `opentelemetry-exporter-otlp`
pip install transformers[open-telemetry]
```

This enables trace and metric collection in CB. You then need to set up a backend to collect and visualize the traces and metrics.
## Resources
The [Continuous batching](https://huggingface.co/blog/continuous_batching) blog post explains KV caching, chunked prefill, and ragged batching with dynamic scheduling in more detail.
Quantization reduces the size of model weights by storing them in a lower precision.
If you aren't limited by your GPU, you don't necessarily need to quantize your model because it can increase latency slightly (except for AWQ and fused AWQ modules) due to the extra step required to quantize and dequantize the weights.
> [!TIP]
> There are many quantization libraries (see the [Quantization](./quantization) guide for more details) available, such as Quanto, AQLM, VPTQ, AWQ, and GPT-QModel. Feel free to try them out and see which one works best for your use case. We also recommend reading the [Overview of natively supported quantization schemes in 🤗 Transformers](https://hf.co/blog/overview-quantization-transformers) blog post for a comparison of different approaches.
Use the Model Memory Calculator below to estimate and compare how much memory is required to load a model. For example, try estimating the memory required to load [Mistral-7B-v0.1](https://hf.co/mistralai/Mistral-7B-v0.1).
Overall, we saw that running OctoCoder in 8-bit precision substantially reduced the required GPU VRAM.
4-bit quantization allows the model to be run on GPUs such as RTX3090, V100, and T4 which are quite accessible for most people.
For more information on quantization and to see how one can quantize models to require even less GPU VRAM memory than 4-bit, we recommend looking into the [`GPT-QModel`](https://huggingface.co/docs/transformers/main/en/main_classes/quantization#gptqmodel) implementation.
> In conclusion, it is important to remember that model quantization trades improved memory efficiency against accuracy and, in some cases, inference time.
*This model was released on 2025-05-06 and added to Hugging Face Transformers on 2025-12-02.*
# FastVLM
Setting it for the entire model will result in an error.
### Formatting Prompts with Chat Templates
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor’s `apply_chat_template` method.
**Important:**
- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.
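As a minimal sketch of that structure (the image path is a placeholder):

```python
# conversation structure for apply_chat_template; "document.png" is a placeholder path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "document.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```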
*This model was released on 2025-12-01 and added to Hugging Face Transformers on 2025-12-01.*
**Official Website**: [Baidu AI Studio](https://aistudio.baidu.com/paddleocr) | **arXiv**: [Technical Report](https://arxiv.org/pdf/2510.14528)
**PaddleOCR-VL** is a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.
1. **Compact yet Powerful VLM Architecture:** We present a novel vision-language model that is specifically designed for resource-efficient inference, achieving outstanding performance in element recognition. By integrating a NaViT-style dynamic high-resolution visual encoder with the lightweight ERNIE-4.5-0.3B language model, we significantly enhance the model’s recognition capabilities and decoding efficiency. This integration maintains high accuracy while reducing computational demands, making it well-suited for efficient and practical document processing applications.
2. **SOTA Performance on Document Parsing:** PaddleOCR-VL achieves state-of-the-art performance in both page-level document parsing and element-level recognition. It significantly outperforms existing pipeline-based solutions and exhibits strong competitiveness against leading vision-language models (VLMs) in document parsing. Moreover, it excels in recognizing complex document elements, such as text, tables, formulas, and charts, making it suitable for a wide range of challenging content types, including handwritten text and historical documents.
3. **Multilingual Support:** PaddleOCR-VL supports 109 languages, covering major global languages, including but not limited to Chinese, English, Japanese, Latin, and Korean, as well as languages with different scripts and structures, such as Russian (Cyrillic script), Arabic, Hindi (Devanagari script), and Thai. This broad language coverage substantially enhances the applicability of our system to multilingual and globalized document processing scenarios.
> We currently recommend using the [PaddleOCR official method for inference](https://www.paddleocr.ai/latest/en/version3.x/pipeline_usage/PaddleOCR-VL.html), as it is faster and supports page-level document parsing.
> The example code below only supports element-level recognition.
We have four types of element-level recognition:
- Text recognition, indicated by the prompt `OCR:`.
- Formula recognition, indicated by the prompt `Formula Recognition:`.
- Table recognition, indicated by the prompt `Table Recognition:`.
- Chart recognition, indicated by the prompt `Chart Recognition:`.
The following examples are all based on text recognition, with the prompt `OCR:`.
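For example, a text-recognition message pairs an image with the `OCR:` prompt (the image URL is a placeholder):

```python
# element-level text recognition request; the URL is a placeholder
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/receipt.png"},
            {"type": "text", "text": "OCR:"},
        ],
    }
]
```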
### Single input inference
The example below demonstrates how to generate text with PaddleOCRVL using [`Pipeline`] or the [`AutoModel`].
```py
# `inputs` and `outputs` come from the preceding processor and `model.generate` calls (not shown)
result = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:-1])
print(result)
```
</hfoption>
</hfoptions>
### Batched inference
PaddleOCRVL also supports batched inference. We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Here is how you can do it with PaddleOCRVL using [`Pipeline`] or the [`AutoModel`]:
```py
# `inputs` and `generated_ids` come from the preceding processor and `model.generate` calls (not shown)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
result = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(result)
```
</hfoption>
</hfoptions>
### Using Flash Attention 2
Flash Attention 2 is a faster, optimized attention implementation. Refer to the [FlashAttention](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention) section for details.
For example:
```shell
pip install flash-attn --no-build-isolation
```
```python
from transformers import AutoModelForImageTextToText
model = AutoModelForImageTextToText.from_pretrained("PaddlePaddle/PaddleOCR-VL", dtype="bfloat16", attn_implementation="flash_attention_2")
```
# Overview
Transformers provides multiple inference optimization techniques to make models fast, affordable, and accessible. Options include alternative attention mechanisms for reduced memory traffic, code compilation for faster execution, and optimized kernels for throughput. Stack these techniques for maximum performance.
> [!NOTE]
> Memory and speed are closely related but not the same. Shrinking your memory footprint makes a model "faster" because there is less data to move around. Pure speed optimizations don't always reduce memory and sometimes increase usage. Choose the appropriate optimization based on your use case and hardware.
This guide gives you a quick start on Transformers optimizations. Use the table below to pick an optimization technique.
## Compilation
[torch.compile](./perf_torch_compile) reduces Python overhead, fuses operations, and creates kernels tuned for your shapes and hardware. The first run warms it up and subsequent runs use the faster compiled path.
Pass a [fixed size cache](./kv_cache#fixed-size-cache) to [`~GenerationMixin.generate`] to trigger `torch.compile` automatically.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# the model and prompt are illustrative
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.bfloat16, device_map="cuda")
inputs = tokenizer("Plants generate energy through", return_tensors="pt").to(model.device)
# passing a fixed-size ("static") cache triggers torch.compile automatically
outputs = model.generate(**inputs, cache_implementation="static", max_new_tokens=20)
```
> Avoid calling `torch.compile(model)` outside of [`~GenerationMixin.generate`] to prevent the model from recompiling every step.
## Attention backends
Alternative [attention backends](./attention_interface) lower memory traffic. For example, FlashAttention tiles attention computations and avoids large intermediate tensors to reduce memory footprint.
Set `attn_implementation` in [`~PreTrainedModel.from_pretrained`] to load an optimized attention backend.
```py
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", attn_implementation="flash_attention_2")
```
## Kernels
Kernels fuse operations to boost throughput and reduce memory usage. The [Kernels](https://huggingface.co/docs/kernels/en/index) library loads optimized compute kernels from the [Hub](https://huggingface.co/kernels-community) in a flexible and version-safe way.
The example below loads an optimized FlashAttention-2 kernel without installing the package.
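As a sketch (the model id is illustrative), pass the kernel's Hub repo id to `attn_implementation`:

```python
from transformers import AutoModelForCausalLM

# "kernels-community/flash-attn" is fetched from the Hub; flash-attn itself is not pip-installed
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    attn_implementation="kernels-community/flash-attn",
)
```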
## Quantization

[Quantization](./quantization/overview) shrinks the size of every parameter, which lowers memory footprint and can increase speed because there is less data to move.
Pass a quantization config to the `quantization_config` argument in [`~PreTrainedModel.from_pretrained`]. Each quantization backend has a different config with different arguments. The example below quantizes a model to 4-bits and configures the computation dtype with the [bitsandbytes](./quantization/bitsandbytes) backend.
```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# the model id is illustrative
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", quantization_config=quantization_config)
```
## Caching

[Caching](./kv_cache) speeds up generation by reusing past keys and values instead of recomputing them for every token. To offset the memory cost of storing past keys and values, Transformers supports offloading the cache to the CPU so that only the current layer's cache remains on the GPU.
Use the `cache_implementation` argument in [`~GenerationMixin.generate`] to set a cache strategy.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# the model and prompt are illustrative
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.bfloat16, device_map="cuda")
inputs = tokenizer("Explain KV caching", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, cache_implementation="offloaded", max_new_tokens=20)
```
## Parallelism

[Parallelism](./perf_infer_gpu_multi) distributes a model across devices so that models too big for one device run fast. This approach uses more total memory due to sharding overhead and communication to sync results.
[Tensor parallelism](./perf_infer_gpu_multi) splits a model layer across devices. Set `tp_plan="auto"` in [`~PreTrainedModel.from_pretrained`] to enable it.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", tp_plan="auto")
print(model._tp_plan)
```
## Continuous batching
[Continuous batching](./continuous_batching) maximizes throughput by keeping the GPU busy with dynamic scheduling and chunked prefill. [Serving](./serving) applications use it to process multiple incoming requests concurrently.
Use [`~ContinuousMixin.generate_batch`] to enable continuous batching.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# the model and prompts are illustrative
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.bfloat16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
inputs = [tokenizer(text)["input_ids"] for text in ["Hello, how are you?", "What is the capital of France?"]]
outputs = model.generate_batch(inputs=inputs, generation_config=GenerationConfig(max_new_tokens=16))
```
The most important part of DTensor is the `placement` attribute because it tells PyTorch how a tensor is distributed across devices.
- `Partial()` - Indicates a tensor is pending a reduction operation (not typically relevant for usage in Transformers).
## Resources
Read the [Tensor Parallelism (TP) in Transformers: 5 Minutes to Understand](https://huggingface.co/blog/qgallouedec/tp) blog post for a quick overview of tensor parallelism and learn how column and row parallel setups differ.
Transformers is a PyTorch-first library. It provides models that are faithful to their original implementations.
A longer, in-depth article with examples, visualizations and timelines is available [here](https://huggingface.co/spaces/transformers-community/Transformers-tenets) as our canonical reference.
> [!NOTE]
> Our philosophy evolves through practice. What follows are our current, stable principles.
# Contribute
Transformers supports many quantization methods such as QLoRA, GPTQ, LLM.int8, and AWQ. However, there are still many more quantization approaches that haven't been integrated yet. To make adding and using these quantization methods with Transformers easier, use the [`~quantizers.HfQuantizer`] class. [`~quantizers.HfQuantizer`] is designed to be an internal helper class for adding a quantization method instead of something applied to every PyTorch module.
This guide will show you how to integrate a new quantization method with [`~quantizers.HfQuantizer`].
Before integrating a new quantization method into Transformers, ensure the method meets the following requirements:
- The method can run on commonly-used hardware (CPU, GPU, etc.).
- The method is wrapped in a [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) ([`~bitsandbytes.nn.Linear8bitLt`], [`~bitsandbytes.nn.Linear4bit`]), and the quantized linear layer should have the following definition.
```py
class Linear4bit(nn.Module):
def __init__(self, ...):
...
def forward(self, x):
return my_4bit_kernel(x, self.weight, self.bias)
```
This way, Transformers models are easily quantized by replacing instances of [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) with a target class.
- The quantization method should be serializable. You can save the quantized weights locally or push them to the Hub.
- Make sure the package containing the quantization kernels/primitive is stable (no frequent breaking changes).
Some quantization methods may require "pre-quantizing" the model through data calibration.
## Create new HFQuantizer class
0. The best starting point would be to have a look at another quantization method such as Finegrained Fp8. You will have to update or create three files in total: the [config file](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/quantization_config.py), the [integration file](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/finegrained_fp8.py) and the [quantizer file](https://github.com/huggingface/transformers/blob/main/src/transformers/quantizers/quantizer_finegrained_fp8.py).
1. Create a new quantization config class inside [src/transformers/utils/quantization_config.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/utils/quantization_config.py). Add the new quantization config to the [\_import_structure](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py#L1088) inside Transformers' [src/transformers/\_\_init\_\_.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py) file.
2. Create a new file inside [src/transformers/quantizers/](https://github.com/huggingface/transformers/tree/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers) named `quantizer_your_method.py`, and make it inherit from [`~quantizers.HfQuantizer`]. Make sure to add the new quantizer and quantization config in the quantization auto-mapping in [src/transformers/quantizers/auto.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers/auto.py).
3. Define the following class attributes and property methods for your quantization method:
- `requires_calibration`: Whether the quantization method requires a data calibration process. If set to `True`, you can only support inference (with quantized weights) and not inference and quantization.
- `required_packages`: A list of strings of the required packages to use the quantized weights. You might need to define some new utility methods such as `is_auto_awq_available` in [transformers/src/utils/import_utils.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/utils/import_utils.py).
- `requires_parameters_quantization`: Only required if your quantization method requires extra attention to the underlying [nn.Parameter](https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html) object. For example, bitsandbytes uses [`~bitsandbytes.nn.Params4bit`] and [`~bitsandbytes.nn.Int8Params`], which require some extra attention when quantizing the model. Most recent quantization methods pack int2 and int4 weights inside [torch.uint8](https://pytorch.org/docs/stable/tensors.html) weights, so this flag should not really be required (set to `False` by default).
- `is_serializable`: A property method to determine whether the method is serializable or not.
- `is_trainable`: A property method to determine whether you can fine-tune models on top of the quantization method (with or without PEFT approaches).
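The int4-packed-into-uint8 layout mentioned above can be sketched as follows (a minimal illustration of the idea, not any particular backend's storage format):

```python
import torch

# Pack two int4 values (0..15) into each uint8 byte, as many quantization
# backends do before storing weights; the layout here is purely illustrative.
def pack_int4(x):
    x = x.reshape(-1, 2)
    return (x[:, 0] | (x[:, 1] << 4)).to(torch.uint8)

def unpack_int4(packed):
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return torch.stack([low, high], dim=1).reshape(-1)

vals = torch.tensor([1, 7, 15, 0], dtype=torch.uint8)
packed = pack_int4(vals)       # 2 bytes instead of 4
restored = unpack_int4(packed)
```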
4. Write the `validate_environment` and `update_dtype` methods. These methods are called before creating the quantized model to ensure users use the right configuration. Refer to other quantizers for an example of how it is implemented.
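A hedged sketch of what these two methods typically do (the package name is hypothetical; real quantizers usually also check accelerator availability and library versions):

```python
import importlib.util

import torch

def validate_environment():
    # fail early if the required kernel package is missing ("my_quant_kernels" is hypothetical)
    if importlib.util.find_spec("my_quant_kernels") is None:
        raise ImportError("Using this quantization method requires the `my_quant_kernels` package")

def update_dtype(dtype):
    # fall back to a half-precision dtype when the user did not request one
    return dtype if dtype is not None else torch.float16
```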
5. Write the `_process_model_before_weight_loading` method. In Transformers, the quantized models are initialized first on the `"meta"` device before loading the weights. This means the `_process_model_before_weight_loading` method takes care of manipulating the model skeleton to replace some modules ([nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)) with the target modules (quantization modules).
You can define module replacement logic or any other utility method by creating a new file in [transformers/src/integrations/](https://github.com/huggingface/transformers/tree/abbffc4525566a48a9733639797c812301218b83/src/transformers/integrations) and exposing the relevant methods in that folder's `__init__.py` file. The best starting point would be to have a look at another quantization method such as [quantizer_awq.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers/quantizer_awq.py).
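As a rough sketch of what such a skeleton manipulation can look like (the quantized module below is a hypothetical stand-in, not a real integration):

```python
import torch
from torch import nn

class QuantLinearStub(nn.Module):
    """Hypothetical quantized replacement for nn.Linear (skeleton only)."""
    def __init__(self, in_features, out_features, bias=True, device=None):
        super().__init__()
        # int-packed weight buffer and scales instead of a floating-point parameter
        self.register_buffer("qweight", torch.empty(out_features, in_features, dtype=torch.uint8, device=device))
        self.register_buffer("scale", torch.empty(out_features, 1, device=device))
        self.bias = nn.Parameter(torch.empty(out_features, device=device)) if bias else None

def replace_linears(module, modules_to_not_convert=()):
    # recursively swap nn.Linear modules while everything still lives on "meta"
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name not in modules_to_not_convert:
            setattr(module, name, QuantLinearStub(child.in_features, child.out_features,
                                                  child.bias is not None, device="meta"))
        else:
            replace_linears(child, modules_to_not_convert)
    return module

with torch.device("meta"):
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
model = replace_linears(model)
```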
6. Add the `get_quantize_ops` method to the quantizer class if the quantization method supports quantizing on the fly. In Transformers, we materialize each tensor and apply a sequence of operations on it; in this case, the quantization operation happens at the end. You need to create an `XXXQuantize` subclass of `ConversionOps` and add a `convert` method. In the `convert` method, you quantize the weights and return a dictionary of quantized params.
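A minimal sketch of such an op (the class name, method signature, and the rowwise int8 scheme are illustrative assumptions, not the actual `ConversionOps` API):

```python
import torch

class Int8RowwiseQuantize:
    """Illustrative quantize op: returns the quantized weight plus its scales."""
    def convert(self, name, tensor):
        # per-row absmax scale, then round to int8
        scale = tensor.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        qweight = torch.round(tensor / scale).to(torch.int8)
        return {name: qweight, f"{name}_scale": scale}

op = Int8RowwiseQuantize()
out = op.convert("model.layers.0.mlp.weight", torch.randn(4, 8))
```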
7. Add the `get_weight_conversions` method to the quantizer class if the quantization method supports loading pre-quantized weights. In Transformers, we can collect multiple tensors and apply operations on them. This is particularly useful when tensors in the checkpoint need to be regrouped to re-create the quantized tensors.
8. Write the `_process_model_after_weight_loading` method if needed. This method enables implementing additional features that require manipulating the model after loading the weights.
9. Document everything! Make sure your quantization method is documented by adding a new file under `docs/source/en/quantization`.
10. You should add tests by adding the package in our nightly Dockerfile inside `docker/transformers-quantization-latest-gpu` and then adding a new test file in `tests/quantization/xxx`. Feel free to check out existing quantization methods to see how it is implemented.
Use this when loading a **pre-quantized checkpoint** where the quantized weights are saved as several separate components (such as data, scale, and zero point), and these need to be combined into one tensor during loading. Not all quantization methods require this reconstruction step: for example, some methods like FP8 simply load weights and scales as-is, without combining them. Others, such as torchao, do require reassembling the quantized tensor from its multiple saved components.
The `WeightConverter` collects related tensors based on `source_patterns`, then passes them to your `convert` method:
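For example (a sketch with illustrative names and a simple data/scale/zero-point layout; the real signature depends on the quantizer):

```python
import torch

class RebuildQuantizedWeight:
    """Illustrative convert: reassemble a weight saved as separate components."""
    def convert(self, tensors, target_key):
        # the ".data"/".scale"/".zero_point" suffixes are assumptions for this sketch
        data = tensors[f"{target_key}.data"].float()
        scale = tensors[f"{target_key}.scale"]
        zero_point = tensors[f"{target_key}.zero_point"].float()
        return {target_key: (data - zero_point) * scale}

op = RebuildQuantizedWeight()
collected = {
    "w.data": torch.tensor([[10, 20]], dtype=torch.int8),
    "w.scale": torch.tensor(0.5),
    "w.zero_point": torch.tensor(0),
}
out = op.convert(collected, "w")  # a single reassembled tensor under "w"
```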
# GPTQ
The [GPT-QModel](https://github.com/ModelCloud/GPTQModel) project (Python package `gptqmodel`) implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they're restored to fp16 on the fly during inference. This can save memory usage by 4x because the int4 weights are dequantized in a fused kernel rather than in a GPU's global memory. Inference is also faster because a lower bitwidth takes less time to communicate.
> [!WARNING]
> AutoGPTQ is no longer supported in Transformers. Install [GPT-QModel](https://github.com/ModelCloud/GPTQModel) instead.
Install Accelerate, Transformers and Optimum first.
Then run the command below to install GPT-QModel.

```bash
pip install gptqmodel --no-build-isolation
```
Create a [`GPTQConfig`] class and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.
```py
from transformers import AutoModelForCausalLM, GPTQConfig

model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=GPTQConfig(bits=4, backend="marlin"))
```
## ExLlama
> [!WARNING]
> Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're finetuning a quantized model with PEFT.
[ExLlama](https://github.com/turboderp/exllama) is a Python/C++/CUDA implementation of the [Llama](model_doc/llama) model that is designed for faster inference with 4-bit GPTQ weights (check out these [benchmarks](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark)). The ExLlama kernel is activated by default when you create a [`GPTQConfig`] object.
To boost inference speed even further, use the [ExLlamaV2](https://github.com/turboderp/exllamav2) kernels by configuring the `exllama_config` parameter in [`GPTQConfig`].
```py
import torch
from transformers import AutoModelForCausalLM, GPTQConfig

# enable the ExLlamaV2 kernels
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config)
```
The ExLlama kernels are only supported when the entire model is on the GPU. If you're doing inference on a CPU with AutoGPTQ 0.4.2+, disable the ExLlama kernel in [`GPTQConfig`]. This overwrites the attributes related to the ExLlama kernels in the quantization config of the `config.json` file.
```py
import torch
from transformers import AutoModelForCausalLM, GPTQConfig

# disable the ExLlama kernels for CPU inference
gptq_config = GPTQConfig(bits=4, use_exllama=False)
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="cpu", quantization_config=gptq_config)
```
GPT-QModel is the actively maintained backend for GPTQ in Transformers. It was originally forked from AutoGPTQ, but has since diverged with significant improvements such as faster quantization, lower memory usage, and more accurate defaults.
GPT-QModel provides asymmetric quantization, which can potentially lower quantization errors compared to symmetric quantization. It is not backward compatible with legacy AutoGPTQ checkpoints, and not all kernels (Marlin) support asymmetric quantization.
GPT-QModel also has broader support for the latest LLM models, multimodal models (Qwen2-VL and Ovis1.6-VL), platforms (Linux, macOS, Windows 11), and hardware (AMD ROCm, Apple Silicon, Intel/AMD CPUs, and Intel Datacenter Max/Arc GPUs, etc.).
The Marlin kernels are also updated for A100 GPUs and other kernels are updated to include auto-padding for legacy models and models with non-uniform in/out-features.
## Resources
Run the GPTQ quantization with PEFT [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) for a hands-on experience.
torchao implements [torch.Tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor) for maximum flexibility in supporting new quantized torch.Tensor formats. [Safetensors](https://huggingface.co/docs/safetensors/en/index) serialization and deserialization does not work with torchao.
To avoid arbitrary user code execution, torchao sets `weights_only=True` in [torch.load](https://pytorch.org/docs/stable/generated/torch.load.html) to ensure only tensors are loaded. Any known user functions can be whitelisted with [add_safe_globals](https://pytorch.org/docs/stable/notes/serialization.html#torch.serialization.add_safe_globals).
Saving the quantized model with `save_pretrained` (in [safetensors](https://huggingface.co/docs/safetensors/en/index) format) is only supported for torchao >= v0.15. For any version below, it is only possible to manually save as unsafe `.bin` checkpoints with [torch.save](https://docs.pytorch.org/docs/stable/generated/torch.save.html).
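A minimal sketch of the `weights_only` behavior described above (a stdlib-only round-trip, not torchao-specific):

```python
import io

import torch

# Serialize a plain tensor state dict and load it back with weights_only=True,
# which restricts unpickling to tensors and other allowlisted types.
buffer = io.BytesIO()
torch.save({"weight": torch.ones(2, 2)}, buffer)
buffer.seek(0)
state = torch.load(buffer, weights_only=True)
```

Anything outside the allowlist would raise during loading unless registered with `add_safe_globals`.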
This model has a [chat template](./chat_templating) that helps users parse chat outputs. The model can also accept multiple images as input in a single conversation or message. We will now prepare the inputs.
The image inputs look like the following.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg" alt="A bee on a pink flower"/>
```
## ['In this image we can see flowers, plants and insect.']
```
## Pipeline
VLMs are often large and need to be optimized to fit on smaller hardware.
First, install dependencies.
```bash
pip install -U optimum-quanto bitsandbytes
```
To quantize a model during loading, first create a [`QuantoConfig`], then load the model as usual but pass `quantization_config` during model initialization.
```python
from transformers import AutoModelForImageTextToText, QuantoConfig

# model_id is the checkpoint id used earlier in this guide (assumption)
quanto_config = QuantoConfig(weights="int8")
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto", quantization_config=quanto_config)

# running the same generation as before now prints:
## ['In this image, we see two tabby cats resting on a large, tangled pile of fishing nets. The nets are a mix of brown, orange, and red colors, with some blue and green ropes visible in the background. The cats appear relaxed and comfortable, nestled into the fibers of the nets. One cat is in the foreground, looking slightly to the side, while the other is positioned further back, looking directly at the camera. The scene suggests a coastal or fishing-related setting, possibly near']
```
And that's it, we can use the model the same way with no changes.
Here are some more resources for the image-text-to-text task.
- [Image-text-to-text task page](https://huggingface.co/tasks/image-text-to-text) covers model types, use cases, datasets, and more.
- [Vision Language Models Explained](https://huggingface.co/blog/vlms) is a blog post that covers everything about vision language models and supervised fine-tuning using [TRL](https://huggingface.co/docs/trl/en/index).
- [Learn how to fine-tune vision language models using TRL](https://huggingface.co/blog/trl-vlm-alignment)
Mask generation models are trained on large amounts of data and operate in two modes.
- Prompting mode: In this mode, the model takes in an image and a prompt, where a prompt can be a 2D point location (XY coordinates) in the image within an object or a bounding box surrounding an object. In prompting mode, the model only returns the mask over the object
that the prompt is pointing out.
- Segment Everything mode: In segment everything, given an image, the model generates every mask in the image. To do so, a grid of points is generated and overlaid on the image for inference.
- Video Inference: The model accepts a video, and a point or box prompt in a video frame, which is tracked throughout the video. You can get more information on how to do video inference by following [SAM 2 docs](../model_doc/sam2).
The mask generation task is supported by [Segment Anything Model (SAM)](../model_doc/sam) and [Segment Anything Model 2 (SAM 2)](../model_doc/sam2), while video inference is supported by [SAM 2](../model_doc/sam2). SAM is a powerful model that consists of a Vision Transformer-based image encoder, a prompt encoder, and a two-way transformer mask decoder. Images and prompts are encoded, and the decoder takes these embeddings and generates valid masks. SAM 2 extends SAM by adding a memory module to track masks across frames.
We will fine-tune SAM2.1 on a small part of the MicroMat dataset for image matting. We need to install the [monai](https://github.com/Project-MONAI/MONAI) library to use the DICE loss, and [trackio](https://huggingface.co/docs/trackio/index) to log the masks during training.
Now we can define our dataset for loading the data. SAMDataset wraps our dataset and formats each sample the way the SAM processor expects. So instead of raw images and masks, you get processed images, bounding boxes, and ground-truth masks ready for training.
By default, the processor resizes images, so in addition to the images and masks it also returns their original sizes. We also need to binarize the mask, since it has values in [0, 255].
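The binarization step can be as simple as thresholding (a sketch; the midpoint threshold of 127 is an assumption):

```python
import torch

# Map a {0, 255}-valued mask to a {0, 1}-valued one.
mask = torch.tensor([[0, 255], [255, 0]])
binary_mask = (mask > 127).long()
```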
We need to define a data collator that turns ground-truth masks of varying sizes into batches of masks reshaped to the same shape, using nearest-neighbor interpolation. We also build batched tensors for the rest of the elements in the batch. If your masks are all the same size, feel free to skip this step.
```python
import torch
import torch.nn.functional as F

def collate_fn(batch, target_hw=(256, 256)):
    pixel_values = torch.cat([item["pixel_values"] for item in batch], dim=0)
    original_sizes = torch.stack([item["original_sizes"] for item in batch])
    input_boxes = torch.cat([item["input_boxes"] for item in batch], dim=0)
    # reshape the varying-size ground-truth masks to one shape (nearest neighbor);
    # the "ground_truth_mask" key is assumed from the SAMDataset defined above
    masks = torch.stack(
        [F.interpolate(item["ground_truth_mask"][None, None].float(), size=target_hw, mode="nearest")[0, 0] for item in batch]
    )
    return {"pixel_values": pixel_values, "original_sizes": original_sizes, "input_boxes": input_boxes, "ground_truth_mask": masks}
```
We need to log our predictions to trackio so we can monitor how the model improves during training.
Great improvement after only training for 20 epochs on a small dataset!

Text-to-speech (TTS) is the task of creating natural-sounding speech from text, where the speech can be generated in multiple
languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as [Dia](../model_doc/dia), [CSM](../model_doc/csm),
[Bark](../model_doc/bark), [MMS](../model_doc/mms), [VITS](../model_doc/vits) and [SpeechT5](../model_doc/speecht5).
You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`).

Here's an example of how you would use the `"text-to-speech"` pipeline with [CSM](https://huggingface.co/sesame/csm-1b):
```py
>>> conversation = [
...     {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
... ]
>>> output = pipe(conversation)
```
Some models, like [Dia](https://huggingface.co/nari-labs/Dia-1.6B-0626), can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music. Below is such an example:
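As a sketch of what such conditioning text looks like, speaker tags and parenthesized non-verbal cues are embedded directly in the input string (the exact tag syntax is an assumption based on Dia's format; the full pipeline call is omitted here):

```python
# Dia-style input: [S1]/[S2] mark speakers, parenthesized cues mark non-verbals.
text = "[S1] Welcome back to the show. (laughs) [S2] Glad to be here. (clears throat)"
speaker_tags = [tok for tok in text.split() if tok.startswith("[S")]
```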
[[open-in-colab]]
Video-text-to-text models, also known as video language models, are models that can process video and output text. These models can tackle various tasks, from video question answering to video captioning.
These models have nearly the same architecture as [image-text-to-text](../image_text_to_text) models except for some changes to accept video data, since video data is essentially image frames with temporal dependencies. Some image-text-to-text models take in multiple images, but this alone is inadequate for a model to accept videos.
Moreover, video-text-to-text models are often trained with all vision modalities. Each example might have videos, multiple videos, images and multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token in text like "What is happening in this video? `<video>`".
Note that these models process videos with no audio. [Any-to-any](../any-to-any) models on the other hand can process videos with audio in them.
In this guide, we provide a brief overview of video LMs and show how to use them with Transformers for inference.
To begin with, there are multiple types of video LMs:
- chat fine-tuned models for conversation
- instruction fine-tuned models
This guide focuses on inference with an instruction-tuned model, [llava-hf/llava-onevision-qwen2-0.5b-ov-hf](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf), which can take in interleaved data.
```python
import torch
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto", dtype=torch.float16)
```
Videos are a series of image frames. Depending on hardware limitations, downsampling is required; if the number of downsampled frames is too small, predictions will be low quality.
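A common downsampling strategy is uniform sampling over the frame indices, sketched below:

```python
# Uniformly pick num_frames indices from a clip with total_frames frames,
# taking the middle of each equally sized window.
def sample_frame_indices(total_frames, num_frames):
    step = total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]

indices = sample_frame_indices(100, 10)  # [5, 15, 25, ..., 95]
```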
Video-text-to-text models have processors with a video processor abstracted into them. You can pass video-inference-related arguments to the [`~ProcessorMixin.apply_chat_template`] function.
> [!WARNING]
> You can learn more about video processors [here](../main_classes/video_processor).
We can define our chat history, passing in video with a URL like below.
```python
# `video_url` is a hypothetical variable pointing to the video you want to describe
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": video_url},
            {"type": "text", "text": "Describe what is happening in this video."},
        ],
    }
]
```
We can now call [`~GenerationMixin.generate`] for inference. The model output includes our input prompt, so we only take the text after the prompt and the `assistant` part from the model output.
You can preprocess the videos by passing in messages, setting `do_sample_frames` to `True`, and passing in `num_frames`. Here we sample 10 frames.
The inputs contain `input_ids` for the tokenized text, `pixel_values_videos` for the 10 frames, and `attention_mask` indicating which tokens the model should attend to.
# The first cat is shown in a relaxed state, with its eyes closed and a content expression, while the second cat is shown in a more active state, with its mouth open wide, possibly in a yawn or a vocalization.
We can now infer with our preprocessed inputs and decode them.
```py
#"The video features a fluffy, long-haired cat with a mix of brown and white fur, lying on a beige carpeted floor. The cat's eyes are wide open, and its whiskers are prominently visible. The cat appears to be in a relaxed state, with its head slightly"
```
You can also interleave multiple videos with text directly in the chat template like below.
```python
# `video_url_1` and `video_url_2` are hypothetical variables pointing to the videos
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": video_url_1},
            {"type": "video", "url": video_url_2},
            {"type": "text", "text": "Describe similarities in these videos."},
        ],
    }
]
```
The inference remains the same as the previous example.
To learn more about chat templates and token streaming for video-text-to-text models, refer to the [image-text-to-text](../tasks/image_text_to_text) task guide because these models work similarly.
```py
#['Both videos feature a cat with a similar appearance, characterized by a fluffy white coat with black markings, a pink nose, and a pink tongue. The cat\'s eyes are wide open, and it appears to be in a state of alertness or excitement. ']
```
Quantization stores LLM weights in a lower precision to reduce their size. This lowers memory usage and makes it easier to load an LLM for inference when GPU memory is constrained. If you have enough GPU memory, you don't need to quantize the model, since the extra quantization and dequantization steps can add a small latency (except with AWQ and fused AWQ modules).
> [!TIP]
> There are a variety of quantization libraries (see the [Quantization](./quantization) guide for more details), including Quanto, AQLM, VPTQ, AWQ, and GPT-QModel. Try the library that best fits your use case. We also recommend reading the [Overview of natively supported quantization schemes in 🤗 Transformers](https://hf.co/blog/overview-quantization-transformers) blog post, which compares gptqmodel and bitsandbytes.
Use the Model Memory Calculator below to estimate and compare how much memory is required to load a model. For example, try estimating how much memory it takes to load [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
4-bit quantization lets you run the model on GPUs like the RTX 3090, V100, and T4, which are accessible to most people.
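The arithmetic behind such estimates is simple bytes-per-parameter math, sketched here for a 7B-parameter model (weights only, ignoring activation and overhead memory):

```python
# Approximate load-time weight memory at different precisions.
params = 7_000_000_000
bytes_per_param = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}
memory_gib = {name: params * b / 2**30 for name, b in bytes_per_param.items()}
```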
To learn more about quantization, to quantize models to use even less GPU VRAM than 4-bit, or for more quantization-related information, we recommend referring to the [`GPT-QModel`](https://huggingface.co/docs/transformers/main/en/main_classes/quantization#gptqmodel) implementation.
> In conclusion, model quantization is a trade-off between improved memory efficiency and model accuracy, and in some cases it can also affect inference time.
Hugging Face official and community resources to help you get started with LLaMA2.
- A [notebook](https://colab.research.google.com/drive/1SYpgFpcmtIUzdE7pxqknrM4ArCASfkFQ?usp=sharing) on how to fine-tune a Llama 2 model on a personal computer using QLoRA and TRL. 🌎
⚡️ Inference
- A [notebook](https://colab.research.google.com/drive/1TC56ArKerXUpbgRy5vM3woRsbTEVNq7h?usp=sharing) on how to quantize a Llama 2 model using GPTQ from the GPT-QModel library. 🌎
- A [notebook](https://colab.research.google.com/drive/1X1z9Q6domMKl2CnEM0QGHNwidLfR4dW2?usp=sharing) on how to run a Llama 2 chat model with 4-bit quantization on a local computer or Google Colab. 🌎
To try GPTQ quantization with PEFT, check out this [notebook](https:
</Tip>
The [GPT-QModel](https://github.com/ModelCloud/GPTQModel) library implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find the version of the weights that minimizes error. The weights are quantized to int4 but restored to fp16 on the fly during inference. This saves memory usage by 4x because the int4 weights are dequantized in a fused kernel rather than in the GPU's global memory, and inference is expected to be faster because the lower bitwidth takes less time to communicate.
```py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto")
```
## ExLlama [[exllama]]
[ExLlama](https://github.com/turboderp/exllama) is a Python/C++/CUDA implementation of the [Llama](model_doc/llama) model, designed for faster inference with 4-bit GPTQ weights (check out these [benchmarks](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark)). The ExLlama kernel is activated by default when you create a [`GPTQConfig`] object. To boost inference speed even further, you can use the [ExLlamaV2](https://github.com/turboderp/exllamav2) kernels by configuring the `exllama_config` parameter:
```py
import torch
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config)
```
<Tip warning={true}>

Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're fine-tuning a quantized model with PEFT.

</Tip>
The ExLlama kernels are only supported when the entire model is on the GPU. If you're doing inference on a CPU with AutoGPTQ 0.4.2 or later, you need to disable the ExLlama kernel. To do so, overwrite the attributes related to the ExLlama kernels in the quantization config of the `config.json` file.
```py
import torch
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(bits=4, use_exllama=False)
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="cpu", quantization_config=gptq_config)
```
Now, when you start the training with `trainer.train()`, your metadata will be logged in Neptune.
**Note:** Although you can pass your **Neptune API token** and **project name** as arguments when creating the callback, the recommended way is to save them as environment variables:
| Environment variable | Description |
| --- | --- |
| `NEPTUNE_API_TOKEN` | Your Neptune API token. To find and copy it, click your Neptune avatar and select **Get your API token**. |
| `NEPTUNE_PROJECT` | The full name of your Neptune project (`workspace-name/project-name`). To find and copy it, head to **project settings** → **Properties**. |
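For example, set them in your shell before launching training (the values below are placeholders):

```shell
export NEPTUNE_API_TOKEN="your-api-token"
export NEPTUNE_PROJECT="workspace-name/project-name"
```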
For detailed instructions and examples, see the [Neptune docs](https://docs.neptune.ai/integrations/transformers/).
By default, this returns the vocab and merges with respect to their order; by sending `vocab_scores`, we order the merges with respect to the piece scores instead:

```python
sp = self.sp
vocab = {sp.id_to_piece(index): index for index in range(sp.GetPieceSize())}
merges = generate_merges(vocab, vocab_scores)
return vocab, merges
```

The tokenizer model is then built from the extracted vocab and merges:

```python
from tokenizers.models import BPE, Unigram

vocab = [(piece.piece, piece.score) for piece in self.proto.pieces]
if model_type is None:
    model_type = Unigram if self.proto.trainer_spec.model_type == 2 else BPE
if model_type.__name__ != "BPE":
    kwargs["unk_id"] = self.proto.trainer_spec.unk_id
    kwargs["vocab"] = vocab
else:
    from .tokenization_utils_base import generate_merges

    vocab = {word: i for i, (word, score) in enumerate(vocab)}
    merges = generate_merges(vocab)
    kwargs["vocab"] = vocab
    kwargs["merges"] = merges

# control tokens are special, user defined symbols are not,
# but both user and control tokens are AddedTokens
# Add user defined symbols (type == 4) from sentencepiece (https://github.com/google/sentencepiece/blob/6225e08edb2577757163b3f5dbba4c0b670ef445/src/sentencepiece_model.proto#L299C29-L299C33)
spm_added_tokens = [(id, p.piece, p.type == 3) for id, p in enumerate(self.proto.pieces) if p.type in [3, 4]]
```