21 Commits

Author SHA1 Message Date
  Robert Shaw e2ed238885
Revert "[Fix]Load kv-cache dtype from hf_quant_config.json automatically" (#30653) 1 day ago
  Or Ozeri 174e39ead7
CPU KV Offloading: Use more CUDA streams (#29013) 1 day ago
  RioS 9ccbf6b692
[responsesAPI]add extra body parameters (#30532) 1 day ago
  Chendi.Xue ae2e503dda
[NIXL][BUG FIX] Fix a bug for PD with host_buffer after merging 29665 (#30420) 1 day ago
  Tsukasa OI 9e33a1a75b
[Model][Quantization] Override HF defaults to GGUF ones (incl. Qwen3 MoE) (#30118) 1 day ago
  Vensen add4b0ca44
[Bugfix][benchmarks] Fix input token calculation for rerank benchmark metrics (#30596) 1 day ago
  ZiTian Zhao ae88aada38
[Feature]Add EVS (Efficient Video Sampling) Support for Qwen3-VL (#29752) 1 day ago
  yifant-code 5ccf0efa84
[Bugfix] Improve error messages in ModelConfig validation (#30213) 1 day ago
  ElizaWszola 994acec0cc
[Bugfix] Fix fusion for VL models (#30244) 1 day ago
  zifeitong 48b8456ff9
[Bugfix] Revert Qwen2-VL part of change in #28271 (#30542) 1 day ago
  Drew Botwinick 5b64ac21f9
[Bugfix] Update get_processor_data to use get_all method (#30583) 1 day ago
  Bin Bao a8ec486592
[Misc] Add a script to benchmark compilation time (#29919) 1 day ago
  tjp_zju 6ecc1e411b
[Bugfix] fix _get_quant_method of FusedMoE for deepseekV3.2 on non-NV… (#30057) 2 days ago
  Shengliang Xu 0bb0bae436
Nvidia ModelOpt workaround for issue 28072 (#30164) 2 days ago
  Johannes F 060893654d
fix: Update json features supported by xGrammar (#30390) 2 days ago
  Matthias Gehre e9add129ad
[Bugfix] awq_gemm: fix argument order swap (#30364) 2 days ago
  Ilya Markov 3224ea9915
[torch.compile] Add encoder tag for compilation (#30489) 2 days ago
  Lasha Koroshinadze 3a20450d31
Add AudioFlamingo3 model support (#30539) 2 days ago
  Didier Durand 1a55cfafcb
[Doc]: fixing typos in various files (#30540) 2 days ago
  drslark add1b9d3de
[main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring (#30632) 2 days ago
  Cyrus Leung dcb31196da
[Chore] Remove redundant `RequestPrompt` (#30612) 2 days ago
64 changed files with 2488 additions and 605 deletions
Split View
  1. +1
    -1
      docs/configuration/optimization.md
  2. +1
    -1
      docs/deployment/integrations/production-stack.md
  3. +2
    -2
      docs/design/cuda_graphs.md
  4. +1
    -1
      docs/design/optimization_levels.md
  5. +3
    -3
      docs/design/paged_attention.md
  6. +2
    -1
      docs/models/supported_models.md
  7. +1
    -1
      docs/serving/parallelism_scaling.md
  8. +2
    -2
      docs/usage/security.md
  9. +73
    -44
      examples/offline_inference/audio_language.py
  10. +1
    -1
      examples/online_serving/structured_outputs/structured_outputs.py
  11. +78
    -0
      tests/compile/distributed/test_fusions_e2e.py
  12. +1
    -2
      tests/entrypoints/openai/test_chat_error.py
  13. +13
    -13
      tests/entrypoints/openai/test_serving_chat.py
  14. +3
    -3
      tests/entrypoints/openai/test_serving_responses.py
  15. +3
    -3
      tests/kernels/quantization/test_awq.py
  16. +1
    -0
      tests/models/fixtures/audioflamingo3/expected_results_batched.json
  17. +1
    -0
      tests/models/fixtures/audioflamingo3/expected_results_single.json
  18. +142
    -0
      tests/models/multimodal/generation/test_audioflamingo3.py
  19. +125
    -0
      tests/models/multimodal/processing/test_audioflamingo3.py
  20. +3
    -0
      tests/models/registry.py
  21. +5
    -0
      tests/v1/entrypoints/conftest.py
  22. +6
    -6
      tests/v1/kv_connector/unit/test_nixl_connector.py
  23. +10
    -12
      tests/v1/kv_offload/test_cpu_gpu.py
  24. +2
    -2
      tests/v1/structured_output/test_utils.py
  25. +4
    -4
      vllm/_custom_ops.py
  26. +3
    -1
      vllm/benchmarks/serve.py
  27. +326
    -0
      vllm/benchmarks/startup.py
  28. +10
    -1
      vllm/compilation/backends.py
  29. +52
    -48
      vllm/compilation/fusion.py
  30. +13
    -7
      vllm/compilation/matcher_utils.py
  31. +1
    -6
      vllm/compilation/piecewise_backend.py
  32. +20
    -4
      vllm/config/model.py
  33. +56
    -39
      vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
  34. +2
    -0
      vllm/entrypoints/cli/__init__.py
  35. +21
    -0
      vllm/entrypoints/cli/benchmark/startup.py
  36. +9
    -0
      vllm/entrypoints/openai/protocol.py
  37. +36
    -19
      vllm/entrypoints/openai/serving_chat.py
  38. +72
    -125
      vllm/entrypoints/openai/serving_engine.py
  39. +11
    -12
      vllm/entrypoints/openai/serving_responses.py
  40. +1
    -5
      vllm/entrypoints/pooling/classify/serving.py
  41. +20
    -39
      vllm/entrypoints/pooling/embed/serving.py
  42. +2
    -5
      vllm/entrypoints/pooling/pooling/serving.py
  43. +1
    -0
      vllm/entrypoints/pooling/score/protocol.py
  44. +3
    -1
      vllm/entrypoints/pooling/score/serving.py
  45. +18
    -20
      vllm/entrypoints/renderer.py
  46. +3
    -3
      vllm/entrypoints/serve/disagg/serving.py
  47. +7
    -6
      vllm/entrypoints/serve/tokenize/serving.py
  48. +2
    -2
      vllm/model_executor/layers/fused_moe/shared_fused_moe.py
  49. +1
    -1
      vllm/model_executor/layers/quantization/kernels/scaled_mm/__init__.py
  50. +18
    -1
      vllm/model_executor/layers/quantization/modelopt.py
  51. +5
    -0
      vllm/model_executor/layers/quantization/moe_wna16.py
  52. +639
    -0
      vllm/model_executor/models/audioflamingo3.py
  53. +3
    -3
      vllm/model_executor/models/qwen2_5_vl.py
  54. +12
    -1
      vllm/model_executor/models/qwen2_vl.py
  55. +424
    -12
      vllm/model_executor/models/qwen3_vl.py
  56. +4
    -0
      vllm/model_executor/models/registry.py
  57. +1
    -1
      vllm/multimodal/parse.py
  58. +22
    -0
      vllm/transformers_utils/config.py
  59. +0
    -17
      vllm/utils/deep_gemm.py
  60. +2
    -23
      vllm/utils/torch_utils.py
  61. +1
    -1
      vllm/v1/attention/backends/gdn_attn.py
  62. +7
    -7
      vllm/v1/kv_offload/cpu.py
  63. +175
    -86
      vllm/v1/kv_offload/worker/cpu_gpu.py
  64. +1
    -7
      vllm/v1/structured_output/backend_xgrammar.py

+ 1
- 1
docs/configuration/optimization.md View File

@@ -7,7 +7,7 @@ This guide covers optimization strategies and performance tuning for vLLM V1.

## Preemption

Due to the auto-regressive nature of transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
Due to the autoregressive nature of transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
In such cases, vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, you may see the following warning:



+ 1
- 1
docs/deployment/integrations/production-stack.md View File

@@ -4,7 +4,7 @@ Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine le

* **Upstream vLLM compatibility** – It wraps around upstream vLLM without modifying its code.
* **Ease of use** – Simplified deployment via Helm charts and observability through Grafana dashboards.
* **High performance** – Optimized for LLM workloads with features like multi-model support, model-aware and prefix-aware routing, fast vLLM bootstrapping, and KV cache offloading with [LMCache](https://github.com/LMCache/LMCache), among others.
* **High performance** – Optimized for LLM workloads with features like multimodel support, model-aware and prefix-aware routing, fast vLLM bootstrapping, and KV cache offloading with [LMCache](https://github.com/LMCache/LMCache), among others.

If you are new to Kubernetes, don't worry: in the vLLM production stack [repo](https://github.com/vllm-project/production-stack), we provide a step-by-step [guide](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) and a [short video](https://www.youtube.com/watch?v=EsTJbQtzj0g) to set up everything and get started in **4 minutes**!



+ 2
- 2
docs/design/cuda_graphs.md View File

@@ -41,7 +41,7 @@ These features allow the most flexibility for cudagraph capture and compilation
* `NONE` — turn CUDA Graphs off. Good for debugging.
* `PIECEWISE` — a single-mode strategy (and past default). It is the most flexible: attention or other CUDA Graphs-incompatible operations stay eager, everything else goes into CUDA Graphs. Requires piecewise compilation.
* `FULL` — a single-mode strategy, which only captures full CUDA Graphs for non-uniform batches, then uniform-decode batches reuse the CUDA Graph of non-uniform batch of the same batch_size, since they are compatible; can be good for small models or workloads with small prompts.
* `FULL_DECODE_ONLY` — full CUDA Graph for uniform decode, no cudagraph for prefill/mixed etc; suitable for decode instances in a P/D setup where prefill is not as important, this way we can save the memory needed for `PIECEWISE` CUDA Graphs.
* `FULL_DECODE_ONLY` — full CUDA Graph for uniform decode, no cudagraph for prefill/mixed etc.; suitable for decode instances in a P/D setup where prefill is not as important, this way we can save the memory needed for `PIECEWISE` CUDA Graphs.
* `FULL_AND_PIECEWISE` — (default mode) full CUDA Graph for uniform decode, piecewise CUDA Graphs for others; generally the most performant setting, especially for low latency with small models or MoEs, but also requires the most memory and takes the longest to capture.

Defaults: If you’re on v1 with piecewise compilation, we default to `FULL_AND_PIECEWISE` for better performance, (for pooling models, it's still `PIECEWISE`). Otherwise, e.g. if piecewise compilation unavailable, we default to `NONE`.
@@ -49,7 +49,7 @@ Defaults: If you’re on v1 with piecewise compilation, we default to `FULL_AND_
While `NONE` , `PIECEWISE`, and `FULL` are single-mode configurations and simply equivalent to past implementations of eager execution, piecewise CUDA Graphs, and full CUDA Graphs respectively, `FULL_DECODE_ONLY` and `FULL_AND_PIECEWISE` are newly appended dual-mode configurations, which require dispatching to switch between concrete runtime modes according to runtime batches dynamically.

!!! note
Here, the single-modes `NONE`, `PIECEWISE`, and `FULL` are treated as the runtime modes for CUDA Graphs dispatching. If using a dual-mode, the dispatcher will always dispatch to one of its member modes (plus a potantial `NONE` if no suitable CUDA Graph available), depending on the batch composition.
Here, the single-modes `NONE`, `PIECEWISE`, and `FULL` are treated as the runtime modes for CUDA Graphs dispatching. If using a dual-mode, the dispatcher will always dispatch to one of its member modes (plus a potential `NONE` if no suitable CUDA Graph available), depending on the batch composition.

While cascade attention is not cudagraph compatible, it is now compatible with all possible cudagraph mode configurations. If a batch uses cascade attention, it always gets dispatched to `PIECEWISE` mode if available (otherwise `NONE`).



+ 1
- 1
docs/design/optimization_levels.md View File

@@ -4,7 +4,7 @@

## Overview

vLLM now supports optimization levels (`-O0`, `-O1`, `-O2`, `-O3`). Optimization levels provide an intuitive mechnaism for users to trade startup time for performance. Higher levels have better performance but worse startup time. These optimization levels have associated defaults to help users get desired out of the box performance. Importantly, defaults set by optimization levels are purely defaults; explicit user settings will not be overwritten.
vLLM now supports optimization levels (`-O0`, `-O1`, `-O2`, `-O3`). Optimization levels provide an intuitive mechanism for users to trade startup time for performance. Higher levels have better performance but worse startup time. These optimization levels have associated defaults to help users get desired out-of-the-box performance. Importantly, defaults set by optimization levels are purely defaults; explicit user settings will not be overwritten.

## Level Summaries and Usage Examples
```bash


+ 3
- 3
docs/design/paged_attention.md View File

@@ -36,7 +36,7 @@ the input pointers `q`, `k_cache`, and `v_cache`, which point
to query, key, and value data on global memory that need to be read
and processed. The output pointer `out` points to global memory
where the result should be written. These four pointers actually
refer to multi-dimensional arrays, but each thread only accesses the
refer to multidimensional arrays, but each thread only accesses the
portion of data assigned to it. I have omitted all other runtime
parameters here for simplicity.

@@ -229,7 +229,7 @@ manner.

## QK

As shown the pseudo code below, before the entire for loop block, we
As shown the pseudocode below, before the entire for loop block, we
fetch the query data for one token and store it in `q_vecs`. Then,
in the outer for loop, we iterate through different `k_ptrs` that
point to different tokens and prepare the `k_vecs` in the inner for
@@ -403,7 +403,7 @@ for ... { // Iteration over different blocks.
}
```

As shown in the above pseudo code, in the outer loop, similar to
As shown in the above pseudocode, in the outer loop, similar to
`k_ptr`, `logits_vec` iterates over different blocks and reads
`V_VEC_SIZE` elements from `logits`. In the inner loop, each
thread reads `V_VEC_SIZE` elements from the same tokens as a


+ 2
- 1
docs/models/supported_models.md View File

@@ -659,6 +659,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
|--------------|--------|--------|-------------------|----------------------|---------------------------|
| `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | |
| `AudioFlamingo3ForConditionalGeneration` | AudioFlamingo3 | T + A<sup>+</sup> | `nvidia/audio-flamingo-3-hf`, `nvidia/music-flamingo-hf` | ✅︎ | ✅︎ |
| `AyaVisionForConditionalGeneration` | Aya Vision | T + I<sup>+</sup> | `CohereLabs/aya-vision-8b`, `CohereLabs/aya-vision-32b`, etc. | | ✅︎ |
| `BeeForConditionalGeneration` | Bee-8B | T + I<sup>E+</sup> | `Open-Bee/Bee-8B-RL`, `Open-Bee/Bee-8B-SFT` | | ✅︎ |
| `Blip2ForConditionalGeneration` | BLIP-2 | T + I<sup>E</sup> | `Salesforce/blip2-opt-2.7b`, `Salesforce/blip2-opt-6.7b`, etc. | | ✅︎ |
@@ -743,7 +744,7 @@ Some models are supported only via the [Transformers modeling backend](#transfor
- There's no PLE caching or out-of-memory swapping support, as described in [Google's blog](https://developers.googleblog.com/en/introducing-gemma-3n/). These features might be too model-specific for vLLM, and swapping in particular may be better suited for constrained setups.

!!! note
For `InternVLChatModel`, only InternVL2.5 with Qwen2.5 text backbone (`OpenGVLab/InternVL2.5-1B` etc), InternVL3 and InternVL3.5 have video inputs support currently.
For `InternVLChatModel`, only InternVL2.5 with Qwen2.5 text backbone (`OpenGVLab/InternVL2.5-1B` etc.), InternVL3 and InternVL3.5 have video inputs support currently.

!!! note
To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.


+ 1
- 1
docs/serving/parallelism_scaling.md View File

@@ -154,7 +154,7 @@ vllm serve /path/to/the/model/in/the/container \

## Optimizing network communication for tensor parallelism

Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand.
Efficient tensor parallelism requires fast internode communication, preferably through high-speed network adapters such as InfiniBand.
To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the
[examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) helper script.
Contact your system administrator for more information about the required flags.


+ 2
- 2
docs/usage/security.md View File

@@ -10,7 +10,7 @@ All communications between nodes in a multi-node vLLM deployment are **insecure

### Configuration Options for Inter-Node Communications

The following options control inter-node communications in vLLM:
The following options control internode communications in vLLM:

#### 1. **Environment Variables:**

@@ -28,7 +28,7 @@ The following options control inter-node communications in vLLM:

### Notes on PyTorch Distributed

vLLM uses PyTorch's distributed features for some inter-node communication. For
vLLM uses PyTorch's distributed features for some internode communication. For
detailed information about PyTorch Distributed security considerations, please
refer to the [PyTorch Security
Guide](https://github.com/pytorch/pytorch/security/policy#using-distributed-features).


+ 73
- 44
examples/offline_inference/audio_language.py View File

@@ -42,60 +42,31 @@ class ModelRequestData(NamedTuple):
# Unless specified, these settings have been tested to work on a single L4.


# Voxtral
# Make sure to install mistral-common[audio].
def run_voxtral(question: str, audio_count: int) -> ModelRequestData:
from mistral_common.audio import Audio
from mistral_common.protocol.instruct.chunk import (
AudioChunk,
RawAudio,
TextChunk,
)
from mistral_common.protocol.instruct.messages import (
UserMessage,
)
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

model_name = "mistralai/Voxtral-Mini-3B-2507"
tokenizer = MistralTokenizer.from_hf_hub(model_name)

# AudioFlamingo3
def run_audioflamingo3(question: str, audio_count: int) -> ModelRequestData:
model_name = "nvidia/audio-flamingo-3-hf"
engine_args = EngineArgs(
model=model_name,
max_model_len=8192,
max_model_len=4096,
max_num_seqs=2,
limit_mm_per_prompt={"audio": audio_count},
config_format="mistral",
load_format="mistral",
tokenizer_mode="mistral",
enforce_eager=True,
enable_chunked_prefill=False,
)

text_chunk = TextChunk(text=question)
audios = [
Audio.from_file(str(audio_assets[i].get_local_path()), strict=False)
for i in range(audio_count)
]
audio_chunks = [
AudioChunk(input_audio=RawAudio.from_audio(audio)) for audio in audios
]

messages = [UserMessage(content=[*audio_chunks, text_chunk])]

req = ChatCompletionRequest(messages=messages, model=model_name)

tokens = tokenizer.encode_chat_completion(req)
prompt_ids, audios = tokens.tokens, tokens.audios

audios_and_sr = [(au.audio_array, au.sampling_rate) for au in audios]
# AudioFlamingo3 uses <sound> token for audio
audio_placeholder = "<sound>" * audio_count

multi_modal_data = {"audio": audios_and_sr}
prompt = (
"<|im_start|>system\n"
"You are a helpful assistant.<|im_end|>\n"
"<|im_start|>user\n"
f"{audio_placeholder}{question}<|im_end|>\n"
"<|im_start|>assistant\n"
)

return ModelRequestData(
engine_args=engine_args,
prompt_token_ids=prompt_ids,
multi_modal_data=multi_modal_data,
prompt=prompt,
)


@@ -361,6 +332,63 @@ def run_ultravox(question: str, audio_count: int) -> ModelRequestData:
)


# Voxtral
# Make sure to install mistral-common[audio].
def run_voxtral(question: str, audio_count: int) -> ModelRequestData:
from mistral_common.audio import Audio
from mistral_common.protocol.instruct.chunk import (
AudioChunk,
RawAudio,
TextChunk,
)
from mistral_common.protocol.instruct.messages import (
UserMessage,
)
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

model_name = "mistralai/Voxtral-Mini-3B-2507"
tokenizer = MistralTokenizer.from_hf_hub(model_name)

engine_args = EngineArgs(
model=model_name,
max_model_len=8192,
max_num_seqs=2,
limit_mm_per_prompt={"audio": audio_count},
config_format="mistral",
load_format="mistral",
tokenizer_mode="mistral",
enforce_eager=True,
enable_chunked_prefill=False,
)

text_chunk = TextChunk(text=question)
audios = [
Audio.from_file(str(audio_assets[i].get_local_path()), strict=False)
for i in range(audio_count)
]
audio_chunks = [
AudioChunk(input_audio=RawAudio.from_audio(audio)) for audio in audios
]

messages = [UserMessage(content=[*audio_chunks, text_chunk])]

req = ChatCompletionRequest(messages=messages, model=model_name)

tokens = tokenizer.encode_chat_completion(req)
prompt_ids, audios = tokens.tokens, tokens.audios

audios_and_sr = [(au.audio_array, au.sampling_rate) for au in audios]

multi_modal_data = {"audio": audios_and_sr}

return ModelRequestData(
engine_args=engine_args,
prompt_token_ids=prompt_ids,
multi_modal_data=multi_modal_data,
)


# Whisper
def run_whisper(question: str, audio_count: int) -> ModelRequestData:
assert audio_count == 1, "Whisper only support single audio input per prompt"
@@ -382,7 +410,7 @@ def run_whisper(question: str, audio_count: int) -> ModelRequestData:


model_example_map = {
"voxtral": run_voxtral,
"audioflamingo3": run_audioflamingo3,
"gemma3n": run_gemma3n,
"granite_speech": run_granite_speech,
"midashenglm": run_midashenglm,
@@ -392,6 +420,7 @@ model_example_map = {
"qwen2_audio": run_qwen2_audio,
"qwen2_5_omni": run_qwen2_5_omni,
"ultravox": run_ultravox,
"voxtral": run_voxtral,
"whisper": run_whisper,
}



+ 1
- 1
examples/online_serving/structured_outputs/structured_outputs.py View File

@@ -112,7 +112,7 @@ PARAMS: dict[ConstraintsFormat, dict[str, Any]] = {
"messages": [
{
"role": "user",
"content": "Generate an SQL query to show the 'username' and 'email'from the 'users' table.",
"content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
}
],
"extra_body": {


+ 78
- 0
tests/compile/distributed/test_fusions_e2e.py View File

@@ -27,6 +27,7 @@ is_blackwell = lambda: current_platform.is_device_capability_family(100)
class Matches(NamedTuple):
attention_fusion: int = 0
allreduce_fusion: int = 0
rms_quant_norm_fusion: int = 0
sequence_parallel: int = 0
async_tp: int = 0

@@ -40,6 +41,7 @@ class ModelBackendTestCase(NamedTuple):

MODELS_FP8: list[ModelBackendTestCase] = []
MODELS_FP4: list[ModelBackendTestCase] = []
MODELS_GROUP_FP8: list[ModelBackendTestCase] = []
MODELS: list[ModelBackendTestCase] = [] # tp-only

if current_platform.is_cuda():
@@ -498,3 +500,79 @@ def run_model(compile_config: int | CompilationConfig, model: str, **model_kwarg
compilation_config.compile_ranges_split_points = (
llm.llm_engine.vllm_config.compilation_config.compile_ranges_split_points
)


if current_platform.is_cuda():
MODELS_GROUP_FP8 = [
ModelBackendTestCase(
model_name="Qwen/Qwen3-30B-A3B-FP8",
model_kwargs=dict(max_model_len=1024, kv_cache_dtype="fp8"),
backend=AttentionBackendEnum.TRITON_ATTN,
matches=Matches(
rms_quant_norm_fusion=48,
),
),
]

CUSTOM_OPS_QUANT_RMS_NORM = ["+quant_fp8,+rms_norm"]


@pytest.mark.parametrize(
"model_name, model_kwargs, backend, matches, custom_ops",
# Test rms norm+group quant_fp8 fusion
list[tuple[Any, ...]](flat_product(MODELS_GROUP_FP8, CUSTOM_OPS_QUANT_RMS_NORM)),
)
@pytest.mark.parametrize("inductor_graph_partition", [True, False])
def test_rms_group_quant(
model_name: str,
model_kwargs: dict[str, Any],
backend: AttentionBackendEnum,
matches: Matches,
custom_ops: str,
inductor_graph_partition: bool,
caplog_mp_spawn,
monkeypatch,
):
if inductor_graph_partition and not is_torch_equal_or_newer("2.9.0.dev"):
pytest.skip("Inductor graph partition requires torch>=2.9")

custom_ops_list = custom_ops.split(",") if custom_ops else []

if inductor_graph_partition:
mode = CUDAGraphMode.FULL_AND_PIECEWISE
splitting_ops: list[str] | None = None
else:
mode = CUDAGraphMode.FULL_DECODE_ONLY
splitting_ops = []

# Disable, compile cache to make sure custom passes run.
# Otherwise, we can't verify fusion happened through the logs.
monkeypatch.setenv("VLLM_DISABLE_COMPILE_CACHE", "1")

# To capture subprocess logs, we need to know whether spawn or fork is used.
# Force spawn as it is more general.
monkeypatch.setenv("VLLM_WORKER_MULTIPROC_METHOD", "spawn")
monkeypatch.setenv("VLLM_ATTENTION_BACKEND", backend.name)

compilation_config = CompilationConfig(
# Testing properties
custom_ops=custom_ops_list,
use_inductor_graph_partition=inductor_graph_partition,
cudagraph_mode=mode,
splitting_ops=splitting_ops,
# Common
mode=CompilationMode.VLLM_COMPILE,
pass_config=PassConfig(eliminate_noops=True, enable_fusion=True),
# Inductor caches custom passes by default as well via uuid
inductor_compile_config={"force_disable_caches": True},
)

with caplog_mp_spawn(logging.DEBUG) as log_holder:
run_model(compilation_config, model_name, **model_kwargs)

log_matches = re.findall(
r"\[fusion.py:\d+] Replaced (\d+) patterns",
log_holder.text,
)
assert len(log_matches) == 1, log_holder.text
assert int(log_matches[0]) == matches.rms_quant_norm_fusion

+ 1
- 2
tests/entrypoints/openai/test_chat_error.py View File

@@ -80,10 +80,9 @@ def _build_serving_chat(engine: AsyncLLM) -> OpenAIServingChat:
return dict(engine_prompt), {}

async def _fake_preprocess_chat(*args, **kwargs):
# return conversation, request_prompts, engine_prompts
# return conversation, engine_prompts
return (
[{"role": "user", "content": "Test"}],
[[1, 2, 3]],
[{"prompt_token_ids": [1, 2, 3]}],
)



+ 13
- 13
tests/entrypoints/openai/test_serving_chat.py View File

@@ -877,7 +877,7 @@ class TestServingChatWithHarmony:

# Test the Harmony messages for the first turn's input
req = ChatCompletionRequest(model=MODEL_NAME, messages=messages)
input_messages, _, _ = serving_chat._make_request_with_harmony(req)
input_messages, _ = serving_chat._make_request_with_harmony(req)
verify_harmony_messages(
input_messages,
[
@@ -905,7 +905,7 @@ class TestServingChatWithHarmony:

# Test the Harmony messages for the second turn's input
req_2 = ChatCompletionRequest(model=MODEL_NAME, messages=messages)
input_messages_2, _, _ = serving_chat._make_request_with_harmony(req_2)
input_messages_2, _ = serving_chat._make_request_with_harmony(req_2)
verify_harmony_messages(
input_messages_2,
[
@@ -927,7 +927,7 @@ class TestServingChatWithHarmony:

# Test the Harmony messages for the first turn's input
req = ChatCompletionRequest(model=MODEL_NAME, messages=messages, tools=tools)
input_messages, _, _ = serving_chat._make_request_with_harmony(req)
input_messages, _ = serving_chat._make_request_with_harmony(req)
verify_harmony_messages(
input_messages,
[
@@ -971,7 +971,7 @@ class TestServingChatWithHarmony:

# Test the Harmony messages for the second turn's input
req_2 = ChatCompletionRequest(model=MODEL_NAME, messages=messages)
input_messages_2, _, _ = serving_chat._make_request_with_harmony(req_2)
input_messages_2, _ = serving_chat._make_request_with_harmony(req_2)
verify_harmony_messages(
input_messages_2,
[
@@ -1008,7 +1008,7 @@ class TestServingChatWithHarmony:

# Test the Harmony messages for the first turn's input
req = ChatCompletionRequest(model=MODEL_NAME, messages=messages, tools=tools)
input_messages, _, _ = serving_chat._make_request_with_harmony(req)
input_messages, _ = serving_chat._make_request_with_harmony(req)
verify_harmony_messages(
input_messages,
[
@@ -1052,7 +1052,7 @@ class TestServingChatWithHarmony:

# Test the Harmony messages for the second turn's input
req_2 = ChatCompletionRequest(model=MODEL_NAME, messages=messages)
input_messages_2, _, _ = serving_chat._make_request_with_harmony(req_2)
input_messages_2, _ = serving_chat._make_request_with_harmony(req_2)
verify_harmony_messages(
input_messages_2,
[
@@ -1089,7 +1089,7 @@ class TestServingChatWithHarmony:

# Test the Harmony messages for the first turn's input
req = ChatCompletionRequest(model=MODEL_NAME, messages=messages, tools=tools)
input_messages, _, _ = serving_chat._make_request_with_harmony(req)
input_messages, _ = serving_chat._make_request_with_harmony(req)
verify_harmony_messages(
input_messages,
[
@@ -1133,7 +1133,7 @@ class TestServingChatWithHarmony:

# Test the Harmony messages for the second turn's input
req_2 = ChatCompletionRequest(model=MODEL_NAME, messages=messages)
input_messages_2, _, _ = serving_chat._make_request_with_harmony(req_2)
input_messages_2, _ = serving_chat._make_request_with_harmony(req_2)
verify_harmony_messages(
input_messages_2,
[
@@ -1183,7 +1183,7 @@ class TestServingChatWithHarmony:

# Test the Harmony messages for the third turn's input
req_3 = ChatCompletionRequest(model=MODEL_NAME, messages=messages)
input_messages_3, _, _ = serving_chat._make_request_with_harmony(req_3)
input_messages_3, _ = serving_chat._make_request_with_harmony(req_3)
verify_harmony_messages(
input_messages_3,
[
@@ -1246,7 +1246,7 @@ class TestServingChatWithHarmony:

# Test the Harmony messages for the fourth turn's input
req_4 = ChatCompletionRequest(model=MODEL_NAME, messages=messages)
input_messages_4, _, _ = serving_chat._make_request_with_harmony(req_4)
input_messages_4, _ = serving_chat._make_request_with_harmony(req_4)
verify_harmony_messages(
input_messages_4,
[
@@ -1295,7 +1295,7 @@ class TestServingChatWithHarmony:
},
]
req = ChatCompletionRequest(model=MODEL_NAME, messages=messages)
input_messages, _, _ = serving_chat._make_request_with_harmony(req)
input_messages, _ = serving_chat._make_request_with_harmony(req)

verify_harmony_messages(
input_messages,
@@ -1327,7 +1327,7 @@ class TestServingChatWithHarmony:
},
]
req = ChatCompletionRequest(model=MODEL_NAME, messages=messages)
input_messages, _, _ = serving_chat._make_request_with_harmony(req)
input_messages, _ = serving_chat._make_request_with_harmony(req)

verify_harmony_messages(
input_messages,
@@ -1357,7 +1357,7 @@ class TestServingChatWithHarmony:
},
]
req = ChatCompletionRequest(model=MODEL_NAME, messages=messages)
input_messages, _, _ = serving_chat._make_request_with_harmony(req)
input_messages, _ = serving_chat._make_request_with_harmony(req)

verify_harmony_messages(
input_messages,


+ 3
- 3
tests/entrypoints/openai/test_serving_responses.py View File

@@ -21,7 +21,7 @@ from vllm.entrypoints.openai.serving_responses import (
extract_tool_types,
)
from vllm.entrypoints.tool_server import ToolServer
from vllm.inputs.data import TokensPrompt as EngineTokensPrompt
from vllm.inputs.data import TokensPrompt


class MockConversationContext(ConversationContext):
@@ -237,7 +237,7 @@ class TestValidateGeneratorInput:
"""Test _validate_generator_input with valid prompt length"""
# Create an engine prompt with valid length (less than max_model_len)
valid_prompt_token_ids = list(range(5)) # 5 tokens < 100 max_model_len
engine_prompt = EngineTokensPrompt(prompt_token_ids=valid_prompt_token_ids)
engine_prompt = TokensPrompt(prompt_token_ids=valid_prompt_token_ids)

# Call the method
result = serving_responses_instance._validate_generator_input(engine_prompt)
@@ -247,7 +247,7 @@ class TestValidateGeneratorInput:

# create an invalid engine prompt
invalid_prompt_token_ids = list(range(200)) # 100 tokens >= 100 max_model_len
engine_prompt = EngineTokensPrompt(prompt_token_ids=invalid_prompt_token_ids)
engine_prompt = TokensPrompt(prompt_token_ids=invalid_prompt_token_ids)

# Call the method
result = serving_responses_instance._validate_generator_input(engine_prompt)


+ 3
- 3
tests/kernels/quantization/test_awq.py View File

@@ -41,9 +41,9 @@ def test_awq_gemm_opcheck(monkeypatch: pytest.MonkeyPatch):
qweight = torch.randint(
-2000000000, 2000000000, (8192, 256), device="cuda", dtype=torch.int32
)
scales = torch.randint(
scales = torch.empty((64, 2048), device="cuda", dtype=torch.float16)
qzeros = torch.randint(
-2000000000, 2000000000, (64, 256), device="cuda", dtype=torch.int32
)
qzeros = torch.empty((64, 2048), device="cuda", dtype=torch.float16)
split_k_iters = 8
opcheck(torch.ops._C.awq_gemm, (input, qweight, qzeros, scales, split_k_iters))
opcheck(torch.ops._C.awq_gemm, (input, qweight, scales, qzeros, split_k_iters))

+ 1
- 0
tests/models/fixtures/audioflamingo3/expected_results_batched.json View File

@@ -0,0 +1 @@
{"transcriptions": ["There is no clear relationship between the barking and the music, as they seem to be independent of each other.", "(B) To indicate that language cannot express clearly, satirizing the inversion of black and white in the world"], "token_ids": [[3862, 374, 902, 2797, 5025, 1948, 279, 293, 33452, 323, 279, 4627, 11, 438, 807, 2803, 311, 387, 9489, 315, 1817, 1008, 13, 151645], [5349, 8, 2014, 13216, 429, 4128, 4157, 3158, 9355, 11, 7578, 404, 4849, 279, 46488, 315, 3691, 323, 4158, 304, 279, 1879, 151645, 151671]]}

+ 1
- 0
tests/models/fixtures/audioflamingo3/expected_results_single.json View File

@@ -0,0 +1 @@
{"transcriptions": ["The content of the input audio is 'you can ask why over and over and over again forever even if one day we explain every physical interaction and scientific law and hope and dream and regret with a single elegant equation'."], "token_ids": [[785, 2213, 315, 279, 1946, 7699, 374, 364, 9330, 646, 2548, 3170, 916, 323, 916, 323, 916, 1549, 15683, 1496, 421, 825, 1899, 582, 10339, 1449, 6961, 16230, 323, 12344, 2329, 323, 3900, 323, 7904, 323, 22231, 448, 264, 3175, 25777, 23606, 4427, 151645]]}

+ 142
- 0
tests/models/multimodal/generation/test_audioflamingo3.py View File

@@ -0,0 +1,142 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

# Copyright 2025 The vLLM team.
# Copyright 2025 NVIDIA CORPORATION and the HuggingFace Inc. team. All rights
# reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import os

import pytest

from tests.models.registry import HF_EXAMPLE_MODELS
from vllm import LLM, SamplingParams

MODEL_NAME = "nvidia/audio-flamingo-3-hf"


def get_fixture_path(filename):
    """Return the path of *filename* inside the audioflamingo3 fixtures directory."""
    fixtures_dir = os.path.join(
        os.path.dirname(__file__), "../../fixtures/audioflamingo3"
    )
    return os.path.join(fixtures_dir, filename)


@pytest.fixture(scope="module")
def llm():
    """Module-scoped LLM instance; skips when unsupported or when loading fails."""
    # Skip early if the installed transformers version does not support the model.
    model_info = HF_EXAMPLE_MODELS.get_hf_info("AudioFlamingo3ForConditionalGeneration")
    model_info.check_transformers_version(on_fail="skip")

    try:
        return LLM(
            model=MODEL_NAME,
            trust_remote_code=True,
            dtype="bfloat16",
            enforce_eager=True,
            limit_mm_per_prompt={"audio": 1},
        )
    except Exception as e:
        pytest.skip(f"Failed to load model {MODEL_NAME}: {e}")


def test_single_generation(llm):
    """Check a single-audio transcription against the recorded fixture."""
    fixture_path = get_fixture_path("expected_results_single.json")
    if not os.path.exists(fixture_path):
        pytest.skip(f"Fixture not found: {fixture_path}")

    with open(fixture_path) as f:
        expected = json.load(f)

    audio_url = "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/Why_do_we_ask_questions_converted.wav"

    user_content = [
        {"type": "audio_url", "audio_url": {"url": audio_url}},
        {"type": "text", "text": "Transcribe the input speech."},
    ]
    messages = [{"role": "user", "content": user_content}]

    outputs = llm.chat(
        messages=messages,
        sampling_params=SamplingParams(temperature=0.0, max_tokens=128),
    )
    generated_text = outputs[0].outputs[0].text.strip()
    expected_text = expected["transcriptions"][0]

    # Accept containment in either direction to tolerate minor truncation.
    assert expected_text in generated_text or generated_text in expected_text


def test_batched_generation(llm):
    """Check batched two-audio generations against the recorded fixture."""
    fixture_path = get_fixture_path("expected_results_batched.json")
    if not os.path.exists(fixture_path):
        pytest.skip(f"Fixture not found: {fixture_path}")

    with open(fixture_path) as f:
        expected = json.load(f)

    # (audio URL, question) pairs; order matches expected["transcriptions"].
    prompts = [
        (
            "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/dogs_barking_in_sync_with_the_music.wav",
            "What is surprising about the relationship "
            "between the barking and the music?",
        ),
        (
            "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/Ch6Ae9DT6Ko_00-04-03_00-04-31.wav",
            "Why is the philosopher's name mentioned in the lyrics? "
            "(A) To express a sense of nostalgia "
            "(B) To indicate that language cannot express clearly, "
            "satirizing the inversion of black and white in the world "
            "(C) To add depth and complexity to the lyrics "
            "(D) To showcase the wisdom and influence of the philosopher",
        ),
    ]

    conversations = [
        [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": url}},
                    {"type": "text", "text": question},
                ],
            }
        ]
        for url, question in prompts
    ]

    outputs = llm.chat(
        messages=conversations,
        sampling_params=SamplingParams(temperature=0.0, max_tokens=128),
    )

    for idx, output in enumerate(outputs):
        generated_text = output.outputs[0].text.strip()
        expected_text = expected["transcriptions"][idx]
        # Accept containment in either direction to tolerate minor truncation.
        assert expected_text in generated_text or generated_text in expected_text

+ 125
- 0
tests/models/multimodal/processing/test_audioflamingo3.py View File

@@ -0,0 +1,125 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

# Copyright 2025 The vLLM team.
# Copyright 2025 NVIDIA CORPORATION and the HuggingFace Inc. team. All rights
# reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from unittest.mock import MagicMock

import numpy as np
import pytest
import torch
from transformers import PretrainedConfig

from tests.models.registry import HF_EXAMPLE_MODELS


class MockAudioFlamingo3Config(PretrainedConfig):
    """Minimal config stand-in exposing empty audio/text sub-configs."""

    model_type = "audioflamingo3"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # The processor code only needs the sub-configs to exist; empty
        # PretrainedConfig instances are sufficient here.
        self.audio_config = PretrainedConfig()
        self.text_config = PretrainedConfig()


class MockAudioFlamingo3Processor:
    """Processor stub that ignores its inputs and returns fixed outputs."""

    def __init__(self):
        self.audio_token = "<sound>"
        self.audio_token_id = 12345
        self.feature_extractor = MockFeatureExtractor()

    def __call__(self, text=None, audios=None, **kwargs):
        # Always the same token ids and a single all-zero feature array.
        outputs = {
            "input_ids": [1, 2, 3],
            "input_features": [np.zeros((3000, 80))],
        }
        return outputs


class MockFeatureExtractor:
    """Feature-extractor stub exposing the sampling parameters the tests read."""

    def __init__(self):
        # 16 kHz audio, 30-second chunks.
        self.sampling_rate = 16000
        self.chunk_length = 30


@pytest.fixture
def mock_ctx():
    """Build a MagicMock processing context wired to the mock config/processor."""
    config = MockAudioFlamingo3Config()

    ctx = MagicMock()
    # The same config object is exposed both via get_hf_config() and via
    # model_config.hf_config, mirroring how the real context behaves.
    ctx.get_hf_config.return_value = config
    ctx.model_config.hf_config = config
    ctx.get_hf_processor.return_value = MockAudioFlamingo3Processor()
    return ctx


@pytest.fixture(autouse=True)
def check_transformers_version():
    """Skip every test in this module if transformers is too old for the model."""
    HF_EXAMPLE_MODELS.get_hf_info(
        "AudioFlamingo3ForConditionalGeneration"
    ).check_transformers_version(on_fail="skip")


def test_audio_chunk_counting(mock_ctx):
    """Audio items must be split into chunks and counted per input item."""
    from vllm.model_executor.models.audioflamingo3 import (
        AudioFlamingo3DummyInputsBuilder,
        AudioFlamingo3MultiModalProcessor,
        AudioFlamingo3ProcessingInfo,
    )
    from vllm.multimodal.processing import BaseMultiModalProcessor

    info = AudioFlamingo3ProcessingInfo(mock_ctx)
    processor = AudioFlamingo3MultiModalProcessor(
        info, AudioFlamingo3DummyInputsBuilder(info)
    )

    sample_rate = 16000
    # A 30 s clip and a 45 s clip; chunk counts asserted below are 1 and 2.
    mm_data = {"audio": [np.zeros(30 * sample_rate), np.zeros(45 * sample_rate)]}
    prompt = "<|user|>Listen.<|end|>"

    def fake_base_call(self, prompt, mm_data, mm_kwargs, tok_kwargs):
        # Stand-in for the upstream HF-processor call.
        return {"input_ids": [1, 2, 3], "input_features": torch.randn(1, 80, 3000)}

    with pytest.MonkeyPatch.context() as mp:
        mp.setattr(BaseMultiModalProcessor, "_call_hf_processor", fake_base_call)
        processed = processor._call_hf_processor(prompt, mm_data, {}, {})

    chunk_counts = processed["chunk_counts"]
    assert len(chunk_counts) == 2
    assert chunk_counts[0].item() == 1
    assert chunk_counts[1].item() == 2


def test_dummy_data_generation(mock_ctx):
    """Dummy audio data must honor the requested item count and length."""
    from vllm.model_executor.models.audioflamingo3 import (
        AudioFlamingo3DummyInputsBuilder,
        AudioFlamingo3ProcessingInfo,
    )

    builder = AudioFlamingo3DummyInputsBuilder(AudioFlamingo3ProcessingInfo(mock_ctx))
    dummy_data = builder.get_dummy_mm_data(100, {"audio": 2}, None)

    assert "audio" in dummy_data
    audios = dummy_data["audio"]
    assert len(audios) == 2

    # 600 seconds of audio at 16 kHz per dummy item.
    assert len(audios[0]) == 600 * 16000

+ 3
- 0
tests/models/registry.py View File

@@ -578,6 +578,9 @@ _AUTOMATIC_CONVERTED_MODELS = {
_MULTIMODAL_EXAMPLE_MODELS = {
# [Decoder-only]
"AriaForConditionalGeneration": _HfExamplesInfo("rhymes-ai/Aria"),
"AudioFlamingo3ForConditionalGeneration": _HfExamplesInfo(
"nvidia/audio-flamingo-3-hf", min_transformers_version="5.0.0.dev"
),
"AyaVisionForConditionalGeneration": _HfExamplesInfo("CohereLabs/aya-vision-8b"),
"BeeForConditionalGeneration": _HfExamplesInfo(
"Open-Bee/Bee-8B-RL",


+ 5
- 0
tests/v1/entrypoints/conftest.py View File

@@ -76,6 +76,8 @@ def sample_json_schema():
},
"required": ["name", "age", "skills", "grade", "email", "work_history"],
"additionalProperties": False,
"minProperties": 1,
"maxProperties": 10,
}


@@ -96,6 +98,9 @@ def unsupported_json_schema():
},
"required": ["score", "tags"],
"additionalProperties": False,
"patternProperties": {
"^score$": {"type": "integer"},
},
}




+ 6
- 6
tests/v1/kv_connector/unit/test_nixl_connector.py View File

@@ -461,7 +461,7 @@ class TestNixlHandshake:
metadata = NixlConnectorMetadata()
if num_xfers > 0:
num_xfers -= 1
metadata.add_new_req(
metadata.add_new_req_to_recv(
request_id=request_id,
local_block_ids=[num_xfers + 1, num_xfers + 2, num_xfers + 3],
kv_transfer_params={
@@ -532,7 +532,7 @@ class TestNixlHandshake:
vllm_config, connector.engine_id
)
metadata = NixlConnectorMetadata()
metadata.add_new_req(
metadata.add_new_req_to_recv(
request_id="id",
local_block_ids=[1, 2, 3],
kv_transfer_params={
@@ -588,7 +588,7 @@ class TestNixlHandshake:
metadata = NixlConnectorMetadata()
total_reqs = 5
for i in range(total_reqs):
metadata.add_new_req(
metadata.add_new_req_to_recv(
request_id=f"id_{i}",
local_block_ids=[1, 2, 3],
kv_transfer_params={
@@ -752,7 +752,7 @@ def test_kv_connector_stats(dist_init):
# Create transfer metadata
request_id = "test_req_for_stats"
metadata = NixlConnectorMetadata()
metadata.add_new_req(
metadata.add_new_req_to_recv(
request_id=request_id,
local_block_ids=[1, 2, 3],
kv_transfer_params={
@@ -1515,7 +1515,7 @@ def test_handshake_failure_returns_finished(dist_init):

request_id = "test_handshake_fail"
metadata = NixlConnectorMetadata()
metadata.add_new_req(
metadata.add_new_req_to_recv(
request_id=request_id,
local_block_ids=[1, 2, 3],
kv_transfer_params={
@@ -1565,7 +1565,7 @@ def test_transfer_setup_failure_returns_finished(dist_init):

request_id = "test_transfer_fail"
metadata = NixlConnectorMetadata()
metadata.add_new_req(
metadata.add_new_req_to_recv(
request_id=request_id,
local_block_ids=[7, 8, 9],
kv_transfer_params={


+ 10
- 12
tests/v1/kv_offload/test_cpu_gpu.py View File

@@ -9,7 +9,7 @@ import torch
from vllm.platforms import current_platform
from vllm.v1.attention.backends.flash_attn import FlashAttentionBackend
from vllm.v1.kv_offload.mediums import CPULoadStoreSpec, GPULoadStoreSpec
from vllm.v1.kv_offload.worker.cpu_gpu import CpuGpuOffloadingHandler
from vllm.v1.kv_offload.worker.cpu_gpu import CpuGpuOffloadingHandlers

BACKENDS_TO_TEST = [FlashAttentionBackend]

@@ -82,7 +82,7 @@ def test_transfer(

# create handler
cpu_block_size = gpu_blocks_per_cpu_block * gpu_block_size
handler = CpuGpuOffloadingHandler(
handlers = CpuGpuOffloadingHandlers(
attn_backends=attn_backends,
gpu_block_size=gpu_block_size,
cpu_block_size=cpu_block_size,
@@ -112,8 +112,7 @@ def test_transfer(

# set transfer direction
if gpu_to_cpu:
src_kv_caches = handler.gpu_tensors
dst_kv_caches = handler.cpu_tensors
handler = handlers.gpu_to_cpu_handler
src_spec_class = GPULoadStoreSpec
dst_spec_class = CPULoadStoreSpec
src_blocks = gpu_blocks
@@ -122,8 +121,7 @@ def test_transfer(
dst_blocks_in_gpu_block_size = cpu_blocks_in_gpu_block_size
dst_size_in_gpu_blocks = num_cpu_blocks * gpu_blocks_per_cpu_block
else:
src_kv_caches = handler.cpu_tensors
dst_kv_caches = handler.gpu_tensors
handler = handlers.cpu_to_gpu_handler
src_spec_class = CPULoadStoreSpec
dst_spec_class = GPULoadStoreSpec
src_blocks = cpu_blocks
@@ -144,12 +142,12 @@ def test_transfer(
dst_spec = dst_spec_class(dst_blocks)

# clone src and dst tensors before transfer
orig_src_caches = [x.clone() for x in src_kv_caches]
orig_dst_caches = [x.clone() for x in dst_kv_caches]
orig_src_caches = [x.clone() for x in handler.src_tensors]
orig_dst_caches = [x.clone() for x in handler.dst_tensors]

# call transfer function
assert handler.transfer_async(1, (src_spec, dst_spec))
assert set(handler.transfer_events.keys()) == {1}
assert set({x[0] for x in handler._transfers}) == {1}

# wait for transfer to complete
end_time = time.time() + 10
@@ -161,15 +159,15 @@ def test_transfer(
time.sleep(0.1)

# verify src tensors did not change
for orig_tensor, tensor in zip(orig_src_caches, src_kv_caches):
for orig_tensor, tensor in zip(orig_src_caches, handler.src_tensors):
assert torch.equal(orig_tensor, tensor)

# verify dst tensors
for dst_block in range(dst_size_in_gpu_blocks):
src_block_candidate = dst_to_src.get(dst_block)
for src_cache, dst_cache, orig_dst_cache, kv_dim in zip(
src_kv_caches,
dst_kv_caches,
handler.src_tensors,
handler.dst_tensors,
orig_dst_caches,
handler.kv_dim_before_num_blocks,
):


+ 2
- 2
tests/v1/structured_output/test_utils.py View File

@@ -44,8 +44,6 @@ def unsupported_array_schemas():
@pytest.fixture
def unsupported_object_schemas():
return [
{"type": "object", "minProperties": 1},
{"type": "object", "maxProperties": 5},
{"type": "object", "propertyNames": {"pattern": "^[a-z]+$"}},
{"type": "object", "patternProperties": {"^S": {"type": "string"}}},
]
@@ -79,6 +77,8 @@ def supported_schema():
},
},
},
"minProperties": 1,
"maxProperties": 100,
}




+ 4
- 4
vllm/_custom_ops.py View File

@@ -498,15 +498,15 @@ def awq_dequantize(
def awq_gemm(
input: torch.Tensor,
qweight: torch.Tensor,
qzeros: torch.Tensor,
scales: torch.Tensor,
qzeros: torch.Tensor,
split_k_iters: int,
) -> torch.Tensor:
if envs.VLLM_USE_TRITON_AWQ:
from vllm.model_executor.layers.quantization.awq_triton import awq_gemm_triton

return awq_gemm_triton(input, qweight, qzeros, scales, split_k_iters)
return torch.ops._C.awq_gemm(input, qweight, qzeros, scales, split_k_iters)
return awq_gemm_triton(input, qweight, scales, qzeros, split_k_iters)
return torch.ops._C.awq_gemm(input, qweight, scales, qzeros, split_k_iters)


# gptq
@@ -632,8 +632,8 @@ if hasattr(torch.ops._C, "gptq_marlin_24_gemm"):
def _awq_gemm_fake(
input: torch.Tensor,
qweight: torch.Tensor,
qzeros: torch.Tensor,
scales: torch.Tensor,
qzeros: torch.Tensor,
split_k_iters: torch.SymInt,
) -> torch.Tensor:
num_in_feats = input.size(0)


+ 3
- 1
vllm/benchmarks/serve.py View File

@@ -235,7 +235,9 @@ async def get_request(


def calculate_metrics_for_embeddings(
outputs: list[RequestFuncOutput], dur_s: float, selected_percentiles: list[float]
outputs: list[RequestFuncOutput],
dur_s: float,
selected_percentiles: list[float],
) -> EmbedBenchmarkMetrics:
"""Calculate the metrics for the embedding requests.



+ 326
- 0
vllm/benchmarks/startup.py View File

@@ -0,0 +1,326 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Benchmark the cold and warm startup time of vLLM models.

This script measures total startup time (including model loading, compilation,
and cache operations) for both cold and warm scenarios:
- Cold startup: Fresh start with no caches (temporary cache directories)
- Warm startup: Using cached compilation and model info
"""

import argparse
import dataclasses
import json
import multiprocessing
import os
import shutil
import tempfile
import time
from contextlib import contextmanager
from typing import Any

import numpy as np
from tqdm import tqdm

from vllm.benchmarks.lib.utils import (
convert_to_pytorch_benchmark_format,
write_to_json,
)
from vllm.engine.arg_utils import EngineArgs


@contextmanager
def cold_startup():
"""
Context manager to measure cold startup time:
1. Uses a temporary directory for vLLM cache to avoid any pollution
between cold startup iterations.
2. Uses inductor's fresh_cache to clear torch.compile caches.
"""
from torch._inductor.utils import fresh_cache

# Use temporary directory for caching to avoid any pollution between cold startups
original_cache_root = os.environ.get("VLLM_CACHE_ROOT")
temp_cache_dir = tempfile.mkdtemp(prefix="vllm_startup_bench_cold_")
try:
os.environ["VLLM_CACHE_ROOT"] = temp_cache_dir
with fresh_cache():
yield
finally:
# Clean up temporary cache directory
shutil.rmtree(temp_cache_dir, ignore_errors=True)
if original_cache_root:
os.environ["VLLM_CACHE_ROOT"] = original_cache_root
else:
os.environ.pop("VLLM_CACHE_ROOT", None)


def run_startup_in_subprocess(engine_args_dict, result_queue):
    """
    Run LLM startup in a subprocess and return timing metrics via a queue.
    This ensures complete isolation between iterations.

    On success a dict with ``total_startup_time`` and ``compilation_time`` is
    put on the queue. On failure a ``None`` sentinel is put first, followed by
    the error message string.
    """
    try:
        # Import inside the subprocess to avoid issues with forking
        from vllm import LLM
        from vllm.engine.arg_utils import EngineArgs

        engine_args = EngineArgs(**engine_args_dict)

        # Time the full constructor, which covers the whole startup path.
        begin = time.perf_counter()
        llm = LLM(**dataclasses.asdict(engine_args))
        elapsed = time.perf_counter() - begin

        # Compilation time is only reported when the config exposes it.
        compile_seconds = 0.0
        vllm_config = getattr(llm.llm_engine, "vllm_config", None)
        if vllm_config is not None:
            compilation_config = getattr(vllm_config, "compilation_config", None)
            if compilation_config is not None:
                compile_seconds = compilation_config.compilation_time

        result_queue.put(
            {
                "total_startup_time": elapsed,
                "compilation_time": compile_seconds,
            }
        )

    except Exception as e:
        # Error protocol: None sentinel, then the error message.
        result_queue.put(None)
        result_queue.put(str(e))


def save_to_pytorch_benchmark_format(
    args: argparse.Namespace, results: dict[str, Any]
) -> None:
    """Write startup/compilation records in PyTorch benchmark format.

    For each (phase, metric) pair among cold/warm x startup/compilation, one
    ``<base>.<phase>_<metric>.pytorch.json`` file is written when the
    converter yields any records. Replaces four copy-pasted stanzas with a
    single loop; keys and output filenames are unchanged.
    """
    base_name = os.path.splitext(args.output_json)[0]

    for phase in ("cold", "warm"):
        for metric in ("startup", "compilation"):
            prefix = f"{phase}_{metric}"
            records = convert_to_pytorch_benchmark_format(
                args=args,
                metrics={f"avg_{prefix}_time": results[f"avg_{prefix}_time"]},
                extra_info={
                    f"{prefix}_times": results[f"{prefix}_times"],
                    f"{prefix}_percentiles": results[f"{prefix}_percentiles"],
                },
            )
            if records:
                write_to_json(f"{base_name}.{prefix}.pytorch.json", records)


def add_cli_args(parser: argparse.ArgumentParser):
    """Register the benchmark's CLI flags, then delegate to EngineArgs."""
    # Iteration-count flags share the same type/shape; add them in one pass.
    for flag, default, help_text in (
        ("--num-iters-cold", 5, "Number of cold startup iterations."),
        (
            "--num-iters-warmup",
            3,
            "Number of warmup iterations before benchmarking warm startups.",
        ),
        ("--num-iters-warm", 5, "Number of warm startup iterations."),
    ):
        parser.add_argument(flag, type=int, default=default, help=help_text)

    parser.add_argument(
        "--output-json",
        type=str,
        default=None,
        help="Path to save the startup time results in JSON format.",
    )

    return EngineArgs.add_cli_args(parser)


def _print_phase(
    label: str,
    avg_startup: float,
    avg_compilation: float,
    startup_percentiles: np.ndarray,
    compilation_percentiles: np.ndarray,
    percentages: list[int],
) -> None:
    """Print average and percentile statistics for one startup phase."""
    print(f"\n{label}:")
    print(f"Avg total startup time: {avg_startup:.2f} seconds")
    print(f"Avg compilation time: {avg_compilation:.2f} seconds")
    print("Startup time percentiles:")
    for percentage, percentile in zip(percentages, startup_percentiles):
        print(f" {percentage}%: {percentile:.2f} seconds")
    print("Compilation time percentiles:")
    for percentage, percentile in zip(percentages, compilation_percentiles):
        print(f" {percentage}%: {percentile:.2f} seconds")


def main(args: argparse.Namespace):
    """Benchmark cold and warm startup, print statistics, optionally dump JSON.

    Fixes unused loop variables and factors the duplicated cold/warm
    statistics printing into ``_print_phase``; output and JSON keys are
    unchanged.
    """
    # Set multiprocessing start method to 'spawn' for clean process isolation
    # This ensures each subprocess starts fresh without inheriting state
    multiprocessing.set_start_method("spawn", force=True)

    engine_args = EngineArgs.from_cli_args(args)

    def create_llm_and_measure_startup():
        """
        Create LLM instance in a subprocess and measure startup time.
        Returns timing metrics, using subprocess for complete isolation.
        """
        # Convert engine_args to dictionary for pickling
        engine_args_dict = dataclasses.asdict(engine_args)

        # Create a queue for inter-process communication
        result_queue = multiprocessing.Queue()
        process = multiprocessing.Process(
            target=run_startup_in_subprocess,
            args=(engine_args_dict, result_queue),
        )
        process.start()
        process.join()

        if result_queue.empty():
            raise RuntimeError("Subprocess did not return a result")
        result = result_queue.get()
        if result is None:
            # The worker pushes the error message right after the sentinel.
            if not result_queue.empty():
                error_msg = result_queue.get()
                raise RuntimeError(f"Subprocess failed: {error_msg}")
            raise RuntimeError("Subprocess failed with unknown error")
        return result

    os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
    print("Setting VLLM_ENABLE_V1_MULTIPROCESSING=0 to collect startup metrics.\n")

    print("Measuring cold startup time...\n")
    cold_startup_times: list[float] = []
    cold_compilation_times: list[float] = []
    for _ in tqdm(range(args.num_iters_cold), desc="Cold startup iterations"):
        with cold_startup():
            metrics = create_llm_and_measure_startup()
            cold_startup_times.append(metrics["total_startup_time"])
            cold_compilation_times.append(metrics["compilation_time"])

    # Warmup for warm startup
    print("\nWarming up for warm startup measurement...\n")
    for _ in tqdm(range(args.num_iters_warmup), desc="Warmup iterations"):
        create_llm_and_measure_startup()

    print("\nMeasuring warm startup time...\n")
    warm_startup_times: list[float] = []
    warm_compilation_times: list[float] = []
    for _ in tqdm(range(args.num_iters_warm), desc="Warm startup iterations"):
        metrics = create_llm_and_measure_startup()
        warm_startup_times.append(metrics["total_startup_time"])
        warm_compilation_times.append(metrics["compilation_time"])

    # Aggregate statistics for both phases.
    percentages = [10, 25, 50, 75, 90, 99]

    avg_cold_startup = np.mean(np.array(cold_startup_times))
    avg_cold_compilation = np.mean(np.array(cold_compilation_times))
    avg_warm_startup = np.mean(np.array(warm_startup_times))
    avg_warm_compilation = np.mean(np.array(warm_compilation_times))

    cold_startup_percentiles = np.percentile(cold_startup_times, percentages)
    cold_compilation_percentiles = np.percentile(cold_compilation_times, percentages)
    warm_startup_percentiles = np.percentile(warm_startup_times, percentages)
    warm_compilation_percentiles = np.percentile(warm_compilation_times, percentages)

    print("\n" + "=" * 60)
    print("STARTUP TIME BENCHMARK RESULTS")
    print("=" * 60)

    _print_phase(
        "COLD STARTUP",
        avg_cold_startup,
        avg_cold_compilation,
        cold_startup_percentiles,
        cold_compilation_percentiles,
        percentages,
    )
    _print_phase(
        "WARM STARTUP",
        avg_warm_startup,
        avg_warm_compilation,
        warm_startup_percentiles,
        warm_compilation_percentiles,
        percentages,
    )

    print("=" * 60)

    # Output JSON results if specified
    if args.output_json:
        results = {
            "avg_cold_startup_time": float(avg_cold_startup),
            "avg_cold_compilation_time": float(avg_cold_compilation),
            "cold_startup_times": cold_startup_times,
            "cold_compilation_times": cold_compilation_times,
            "cold_startup_percentiles": dict(
                zip(percentages, cold_startup_percentiles.tolist())
            ),
            "cold_compilation_percentiles": dict(
                zip(percentages, cold_compilation_percentiles.tolist())
            ),
            "avg_warm_startup_time": float(avg_warm_startup),
            "avg_warm_compilation_time": float(avg_warm_compilation),
            "warm_startup_times": warm_startup_times,
            "warm_compilation_times": warm_compilation_times,
            "warm_startup_percentiles": dict(
                zip(percentages, warm_startup_percentiles.tolist())
            ),
            "warm_compilation_percentiles": dict(
                zip(percentages, warm_compilation_percentiles.tolist())
            ),
        }
        with open(args.output_json, "w") as f:
            json.dump(results, f, indent=4)
        save_to_pytorch_benchmark_format(args, results)

+ 10
- 1
vllm/compilation/backends.py View File

@@ -463,21 +463,27 @@ class PiecewiseCompileInterpreter(torch.fx.Interpreter):
# the tag for the part of model being compiled,
# e.g. backbone/eagle_head
model_tag: str = "backbone"
model_is_encoder: bool = False


@contextmanager
def set_model_tag(tag: str):
def set_model_tag(tag: str, is_encoder: bool = False):
"""Context manager to set the model tag."""
global model_tag
global model_is_encoder
assert tag != model_tag, (
f"Model tag {tag} is the same as the current tag {model_tag}."
)
old_tag = model_tag
old_is_encoder = model_is_encoder

model_tag = tag
model_is_encoder = is_encoder
try:
yield
finally:
model_tag = old_tag
model_is_encoder = old_is_encoder


class VllmBackend:
@@ -523,6 +529,9 @@ class VllmBackend:
# them, e.g. backbone (default), eagle_head, etc.
self.prefix = prefix or model_tag

# Mark compilation for encoder.
self.is_encoder = model_is_encoder

# Passes to run on the graph post-grad.
self.pass_manager = resolve_obj_by_qualname(
current_platform.get_pass_manager_cls()


+ 52
- 48
vllm/compilation/fusion.py View File

@@ -23,17 +23,14 @@ from vllm.model_executor.layers.quantization.utils.quant_utils import (
kNvfp4Quant,
kStaticTensorScale,
)
from vllm.model_executor.layers.quantization.utils.w8a8_utils import (
cutlass_block_fp8_supported,
)
from vllm.platforms import current_platform
from vllm.utils.deep_gemm import (
is_deep_gemm_e8m0_used,
should_use_deepgemm_for_fp8_linear_for_nk,
)

from .inductor_pass import enable_fake_mode
from .matcher_utils import MatcherFusedAddRMSNorm, MatcherQuantFP8, MatcherRMSNorm
from .matcher_utils import (
MatcherFusedAddRMSNorm,
MatcherQuantFP8,
MatcherRMSNorm,
)
from .vllm_inductor_pass import VllmInductorPass, VllmPatternMatcherPass

logger = init_logger(__name__)
@@ -118,21 +115,18 @@ FUSED_OPS: dict[FusedRMSQuantKey, OpOverload] = {


class RMSNormQuantPattern:
def __init__(self, epsilon: float, key: FusedRMSQuantKey):
def __init__(
self,
epsilon: float,
key: FusedRMSQuantKey,
has_col_major_scales: bool = False,
is_e8m0: bool = False,
):
self.epsilon = epsilon
self.quant_dtype = key.quant.dtype
config = get_current_vllm_config()
self.model_dtype = config.model_config.dtype if config.model_config else None

# groupwise FP8 linear uses col major scales if deepgemm and cutlass
using_deepgemm = should_use_deepgemm_for_fp8_linear_for_nk(
self.model_dtype,
config.model_config.hf_config.intermediate_size,
config.model_config.hf_config.hidden_size,
)
use_col_major_scales = using_deepgemm or cutlass_block_fp8_supported()
use_e8m0 = is_deep_gemm_e8m0_used() if using_deepgemm else False

assert key in FUSED_OPS, f"unsupported fused rmsnorm+quant op for {key}"
self.FUSED_OP = FUSED_OPS[key]

@@ -142,7 +136,7 @@ class RMSNormQuantPattern:
else MatcherFusedAddRMSNorm(epsilon)
)
self.quant_matcher = MatcherQuantFP8(
key.quant, use_col_major_scales=use_col_major_scales, use_e8m0=use_e8m0
key.quant, has_col_major_scales=has_col_major_scales, is_e8m0=is_e8m0
)


@@ -260,6 +254,8 @@ class FusedAddRMSNormGroupQuantPattern(RMSNormQuantPattern):
quant_dtype: torch.dtype,
group_shape: GroupShape,
symmetric=True,
has_col_major_scales: bool = False,
is_e8m0: bool = False,
):
scale = ScaleDesc(torch.float32, False, group_shape)
key = FusedRMSQuantKey(
@@ -267,7 +263,11 @@ class FusedAddRMSNormGroupQuantPattern(RMSNormQuantPattern):
quant=QuantKey(dtype=quant_dtype, scale=scale, symmetric=symmetric),
)
self.group_shape = group_shape
super().__init__(epsilon, key)
self.has_col_major_scales = has_col_major_scales
self.is_e8m0 = is_e8m0
super().__init__(
epsilon, key, has_col_major_scales=has_col_major_scales, is_e8m0=is_e8m0
)

def register(self, pm_pass: PatternMatcherPass):
def pattern(input: torch.Tensor, weight: torch.Tensor, residual: torch.Tensor):
@@ -283,9 +283,7 @@ class FusedAddRMSNormGroupQuantPattern(RMSNormQuantPattern):
input = input.to(dtype=self.model_dtype)

result = torch.empty_like(input, dtype=self.quant_dtype)
scale = self.quant_matcher.make_scale(
input, transposed=self.quant_matcher.use_col_major_scales
)
scale = self.quant_matcher.make_scale(input, self.has_col_major_scales)
at = auto_functionalized(
self.FUSED_OP,
result=result,
@@ -296,7 +294,7 @@ class FusedAddRMSNormGroupQuantPattern(RMSNormQuantPattern):
scale_ub=None,
residual=residual,
group_size=self.group_shape[1],
is_scale_transposed=self.quant_matcher.use_col_major_scales,
is_scale_transposed=self.has_col_major_scales,
)

# result, residual, scale
@@ -318,6 +316,8 @@ class RMSNormGroupQuantPattern(RMSNormQuantPattern):
quant_dtype: torch.dtype,
group_shape: GroupShape,
symmetric=True,
has_col_major_scales: bool = False,
is_e8m0: bool = False,
):
scale = ScaleDesc(torch.float32, False, group_shape)
key = FusedRMSQuantKey(
@@ -325,7 +325,9 @@ class RMSNormGroupQuantPattern(RMSNormQuantPattern):
quant=QuantKey(dtype=quant_dtype, scale=scale, symmetric=symmetric),
)
self.group_shape = group_shape
super().__init__(epsilon, key)
super().__init__(
epsilon, key, has_col_major_scales=has_col_major_scales, is_e8m0=is_e8m0
)

def register(self, pm_pass: PatternMatcherPass):
def pattern(input: torch.Tensor, weight: torch.Tensor):
@@ -340,7 +342,7 @@ class RMSNormGroupQuantPattern(RMSNormQuantPattern):

result = torch.empty_like(input, dtype=self.quant_dtype)
scale = self.quant_matcher.make_scale(
input, transposed=self.quant_matcher.use_col_major_scales
input, transposed=self.quant_matcher.has_col_major_scales
)
at = auto_functionalized(
self.FUSED_OP,
@@ -352,7 +354,7 @@ class RMSNormGroupQuantPattern(RMSNormQuantPattern):
scale_ub=None,
residual=None,
group_size=self.group_shape[1],
is_scale_transposed=self.quant_matcher.use_col_major_scales,
is_scale_transposed=self.quant_matcher.has_col_major_scales,
)

# result, scale
@@ -489,27 +491,6 @@ class RMSNormQuantFusionPass(VllmPatternMatcherPass):
# Make sure fused add patterns are before simple rms norm,
# as the latter is a subset of the former in torch ops
for epsilon in [1e-5, 1e-6]:
# Fuse fused_add_rms_norm + fp8 group quant
# Only register group quant patterns on CUDA where the C++ op exists
if current_platform.is_cuda():
FusedAddRMSNormGroupQuantPattern(
epsilon, FP8_DTYPE, group_shape=GroupShape(1, 128)
).register(self.patterns)

# Fuse rms_norm + fp8 group quant
RMSNormGroupQuantPattern(
epsilon, FP8_DTYPE, group_shape=GroupShape(1, 128)
).register(self.patterns)

FusedAddRMSNormGroupQuantPattern(
epsilon, FP8_DTYPE, group_shape=GroupShape(1, 64)
).register(self.patterns)

# Fuse rms_norm + fp8 group quant
RMSNormGroupQuantPattern(
epsilon, FP8_DTYPE, group_shape=GroupShape(1, 64)
).register(self.patterns)

# Fuse fused_add_rms_norm + static fp8 quant
FusedAddRMSNormStaticQuantPattern(epsilon, FP8_DTYPE).register(
self.patterns
@@ -526,6 +507,29 @@ class RMSNormQuantFusionPass(VllmPatternMatcherPass):
# Fuse rms_norm + dynamic per-token fp8 quant
RMSNormDynamicQuantPattern(epsilon, FP8_DTYPE).register(self.patterns)

# Only register group quant patterns on CUDA where the C++ op exists
if current_platform.is_cuda():
for group_shape in [GroupShape(1, 128), GroupShape(1, 64)]:
for has_col_major_scales in [True, False]:
for is_e8m0 in [True, False]:
# Fuse fused_add_rms_norm + fp8 group quant
FusedAddRMSNormGroupQuantPattern(
epsilon,
FP8_DTYPE,
group_shape=group_shape,
has_col_major_scales=has_col_major_scales,
is_e8m0=is_e8m0,
).register(self.patterns)

# Fuse rms_norm + fp8 group quant
RMSNormGroupQuantPattern(
epsilon,
FP8_DTYPE,
group_shape=group_shape,
has_col_major_scales=has_col_major_scales,
is_e8m0=is_e8m0,
).register(self.patterns)

self.dump_patterns(config, self.patterns)

@VllmInductorPass.time_and_log


+ 13
- 7
vllm/compilation/matcher_utils.py View File

@@ -234,24 +234,30 @@ class MatcherQuantFP8(MatcherCustomOp):
self,
quant_key: QuantKey,
enabled: bool | None = None,
use_col_major_scales: bool = False,
use_e8m0: bool = False,
has_col_major_scales: bool = False,
is_e8m0: bool = False,
):
if enabled is None:
enabled = QuantFP8.enabled()

super().__init__(enabled)
self.quant_key = quant_key
self.use_col_major_scales = use_col_major_scales
self.use_e8m0 = use_e8m0
assert quant_key in QUANT_OPS, f"unsupported quantization scheme {quant_key}"
self.QUANT_OP = QUANT_OPS[quant_key]

self.has_col_major_scales = has_col_major_scales
self.is_e8m0 = is_e8m0

assert quant_key.dtype == current_platform.fp8_dtype(), (
"Only QuantFP8 supported by"
)
assert quant_key.scale2 is None
self.quant_fp8 = QuantFP8(quant_key.scale.static, quant_key.scale.group_shape)
self.quant_fp8 = QuantFP8(
quant_key.scale.static,
quant_key.scale.group_shape,
column_major_scales=has_col_major_scales,
use_ue8m0=is_e8m0,
)

def forward_custom(
self,
@@ -264,7 +270,7 @@ class MatcherQuantFP8(MatcherCustomOp):

if self.quant_key.scale.group_shape.is_per_group():
assert scale is None
scale = self.make_scale(input, transposed=self.use_col_major_scales)
scale = self.make_scale(input, transposed=self.has_col_major_scales)

finfo = torch.finfo(self.quant_key.dtype)
fp8_min = finfo.min
@@ -279,7 +285,7 @@ class MatcherQuantFP8(MatcherCustomOp):
eps=1e-10,
fp8_min=fp8_min,
fp8_max=fp8_max,
scale_ue8m0=self.use_e8m0,
scale_ue8m0=self.is_e8m0,
)
return result, scale



+ 1
- 6
vllm/compilation/piecewise_backend.py View File

@@ -53,12 +53,7 @@ class PiecewiseBackend:
self.is_last_graph = piecewise_compile_index == total_piecewise_compiles - 1

self.is_full_graph = total_piecewise_compiles == 1
# TODO: we need to generalize encoder compilation to other models
self.is_encoder_compilation = vllm_backend.prefix in [
"Qwen2_5_VisionPatchEmbed",
"Qwen2_5_VisionPatchMerger",
"Qwen2_5_VisionBlock",
]
self.is_encoder_compilation = vllm_backend.is_encoder

self.compile_ranges = self.compilation_config.get_compile_ranges()
if self.is_encoder_compilation:


+ 20
- 4
vllm/config/model.py View File

@@ -611,9 +611,17 @@ class ModelConfig:
@model_validator(mode="after")
def validate_model_config_after(self: "ModelConfig") -> "ModelConfig":
if not isinstance(self.tokenizer, str):
raise ValueError("tokenizer must be a string after __post_init__.")
if not isinstance(self.max_model_len, int):
raise ValueError("max_model_len must be an integer after __post_init__.")
raise ValueError(
f"tokenizer must be a string, got "
f"{type(self.tokenizer).__name__}: {self.tokenizer!r}. "
"Please provide a valid tokenizer path or HuggingFace model ID."
)
if not isinstance(self.max_model_len, int) or self.max_model_len <= 0:
raise ValueError(
f"max_model_len must be a positive integer, "
f"got {type(self.max_model_len).__name__}: {self.max_model_len!r}. "
"Example: max_model_len=2048"
)
return self

def _get_transformers_backend_cls(self) -> str:
@@ -1186,7 +1194,15 @@ class ModelConfig:
// block.attention.n_heads_in_group
)

raise RuntimeError("Couldn't determine number of kv heads")
raise RuntimeError(
"Could not determine the number of key-value attention heads "
"from model configuration. "
f"Model: {self.model}, Architecture: {self.architectures}. "
"This usually indicates an unsupported model architecture or "
"missing configuration. "
"Please check if your model is supported at: "
"https://docs.vllm.ai/en/latest/models/supported_models.html"
)

if self.is_attention_free:
return 0


+ 56
- 39
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py View File

@@ -202,17 +202,22 @@ def compute_nixl_compatibility_hash(
return compat_hash


@dataclass
class RemoteMeta:
    # Transfer coordinates for the remote side of a KV-cache pull.
    # Grouped into one object so a request with no remote source can carry
    # `remote=None` on its ReqMeta instead of per-field dummy values.
    block_ids: list[int]  # remote block ids to read; rewritten by
    # _logical_to_kernel_block_ids before the transfer is issued
    host: str  # remote host used to initiate the NIXL handshake
    port: int  # remote port used to initiate the NIXL handshake
    engine_id: str  # id of the remote engine holding the KV blocks
    request_id: str  # the request's id on the remote engine


@dataclass
class ReqMeta:
local_block_ids: list[int]
# To be used when logical block size does not match the kernel block size
local_physical_block_ids: list[int]
remote_block_ids: list[int]
remote_host: str
remote_port: int
remote_engine_id: str
remote_request_id: str
tp_size: int
remote: RemoteMeta | None = None


class NixlConnectorMetadata(KVConnectorMetadata):
@@ -223,31 +228,43 @@ class NixlConnectorMetadata(KVConnectorMetadata):
self.reqs_in_batch: set[ReqId] = set()
self.reqs_not_processed: set[ReqId] = set()

def add_new_req(
def _add_new_req(
self,
request_id: ReqId,
local_block_ids: list[int],
kv_transfer_params: dict[str, Any],
load_remote_cache: bool = True,
save_to_host: bool = False,
):
# save and load are mutually exclusive
assert load_remote_cache ^ save_to_host
_req = ReqMeta(
) -> ReqMeta:
return ReqMeta(
local_block_ids=local_block_ids,
local_physical_block_ids=local_block_ids,
remote_block_ids=kv_transfer_params["remote_block_ids"],
remote_engine_id=kv_transfer_params["remote_engine_id"],
remote_request_id=kv_transfer_params["remote_request_id"],
remote_host=kv_transfer_params["remote_host"],
remote_port=kv_transfer_params["remote_port"],
# P workers don't need to receive tp_size from proxy here.
tp_size=kv_transfer_params.get("tp_size", 1),
)
if save_to_host:
self.reqs_to_save[request_id] = _req
if load_remote_cache:
self.reqs_to_recv[request_id] = _req

def add_new_req_to_save(
    self,
    request_id: ReqId,
    local_block_ids: list[int],
    kv_transfer_params: dict[str, Any],
):
    """Queue *request_id* for saving its KV blocks (no remote read).

    Builds a ReqMeta with only local block info — `remote` stays None —
    and records it under `reqs_to_save`.
    """
    meta = self._add_new_req(local_block_ids, kv_transfer_params)
    self.reqs_to_save[request_id] = meta

def add_new_req_to_recv(
    self,
    request_id: ReqId,
    local_block_ids: list[int],
    kv_transfer_params: dict[str, Any],
):
    """Queue *request_id* for pulling KV blocks from a remote engine.

    The remote coordinates are lifted out of *kv_transfer_params* into a
    RemoteMeta and attached to the request's ReqMeta before it is recorded
    under `reqs_to_recv`.
    """
    remote = RemoteMeta(
        block_ids=kv_transfer_params["remote_block_ids"],
        engine_id=kv_transfer_params["remote_engine_id"],
        request_id=kv_transfer_params["remote_request_id"],
        host=kv_transfer_params["remote_host"],
        port=kv_transfer_params["remote_port"],
    )
    meta = self._add_new_req(local_block_ids, kv_transfer_params)
    meta.remote = remote
    self.reqs_to_recv[request_id] = meta


class NixlConnector(KVConnectorBase_V1):
@@ -666,22 +683,18 @@ class NixlConnectorScheduler:
# Loop through scheduled reqs and convert to ReqMeta.
for req_id, (req, block_ids) in self._reqs_need_recv.items():
assert req.kv_transfer_params is not None
meta.add_new_req(
meta.add_new_req_to_recv(
request_id=req_id,
local_block_ids=block_ids,
kv_transfer_params=req.kv_transfer_params,
load_remote_cache=True,
save_to_host=False,
)

for req_id, (req, block_ids) in self._reqs_need_save.items():
assert req.kv_transfer_params is not None
meta.add_new_req(
meta.add_new_req_to_save(
request_id=req_id,
local_block_ids=block_ids,
kv_transfer_params=req.kv_transfer_params,
load_remote_cache=False,
save_to_host=True,
)

meta.reqs_to_send = self._reqs_need_send
@@ -1124,10 +1137,11 @@ class NixlConnectorWorker:
# Do NIXL handshake in background and add to _ready_requests when done.
fut = self._handshake_futures.get(remote_engine_id)
if fut is None:
assert meta.remote is not None
fut = self._handshake_initiation_executor.submit(
self._nixl_handshake,
meta.remote_host,
meta.remote_port,
meta.remote.host,
meta.remote.port,
meta.tp_size,
remote_engine_id,
)
@@ -1774,6 +1788,7 @@ class NixlConnectorWorker:
# clean up metadata for completed requests
meta = self._recving_metadata.pop(req_id, None)
assert meta is not None, f"{req_id} not found in recving_metadata list"
assert meta.remote is not None
if self.use_host_buffer:
self.sync_recved_kv_to_device(req_id, meta)
if self.enable_permute_local_kv:
@@ -1781,7 +1796,7 @@ class NixlConnectorWorker:

# post processing for heteroblocksize
block_size_ratio = self.kv_topo.block_size_ratio_from_engine_id(
meta.remote_engine_id
meta.remote.engine_id
)
if (
not self.use_mla
@@ -1916,17 +1931,18 @@ class NixlConnectorWorker:
meta.local_physical_block_ids = self._logical_to_kernel_block_ids(
meta.local_block_ids
)
meta.remote_block_ids = self._logical_to_kernel_block_ids(
meta.remote_block_ids
assert meta.remote is not None
meta.remote.block_ids = self._logical_to_kernel_block_ids(
meta.remote.block_ids
)
remote_engine_id = meta.remote_engine_id
remote_engine_id = meta.remote.engine_id
logger.debug(
"start_load_kv for request %s from remote engine %s. "
"Num local_block_ids: %s. Num remote_block_ids: %s. ",
req_id,
remote_engine_id,
len(meta.local_physical_block_ids),
len(meta.remote_block_ids),
len(meta.remote.block_ids),
)
# always store metadata for failure recovery
self._recving_metadata[req_id] = meta
@@ -1965,17 +1981,18 @@ class NixlConnectorWorker:
self._reqs_to_send[req_id] = expiration_time

def _read_blocks_for_req(self, req_id: str, meta: ReqMeta):
assert meta.remote is not None
logger.debug(
"Remote agent %s available, calling _read_blocks for req %s",
meta.remote_engine_id,
meta.remote.engine_id,
req_id,
)
self._read_blocks(
request_id=req_id,
dst_engine_id=meta.remote_engine_id,
remote_request_id=meta.remote_request_id,
dst_engine_id=meta.remote.engine_id,
remote_request_id=meta.remote.request_id,
local_block_ids=meta.local_physical_block_ids,
remote_block_ids=meta.remote_block_ids,
remote_block_ids=meta.remote.block_ids,
)

def _read_blocks(


+ 2
- 0
vllm/entrypoints/cli/__init__.py View File

@@ -2,12 +2,14 @@
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
from vllm.entrypoints.cli.benchmark.serve import BenchmarkServingSubcommand
from vllm.entrypoints.cli.benchmark.startup import BenchmarkStartupSubcommand
from vllm.entrypoints.cli.benchmark.sweep import BenchmarkSweepSubcommand
from vllm.entrypoints.cli.benchmark.throughput import BenchmarkThroughputSubcommand

__all__: list[str] = [
"BenchmarkLatencySubcommand",
"BenchmarkServingSubcommand",
"BenchmarkStartupSubcommand",
"BenchmarkSweepSubcommand",
"BenchmarkThroughputSubcommand",
]

+ 21
- 0
vllm/entrypoints/cli/benchmark/startup.py View File

@@ -0,0 +1,21 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse

from vllm.benchmarks.startup import add_cli_args, main
from vllm.entrypoints.cli.benchmark.base import BenchmarkSubcommandBase


class BenchmarkStartupSubcommand(BenchmarkSubcommandBase):
    """`vllm bench startup`: measure how long vLLM takes to start a model."""

    # Registration metadata picked up by the `vllm bench` subcommand table.
    name = "startup"
    help = "Benchmark the startup time of vLLM models."

    @classmethod
    def add_cli_args(cls, parser: argparse.ArgumentParser) -> None:
        """Attach the startup benchmark's flags to *parser*."""
        # Delegate so the argument definitions live next to the benchmark
        # implementation in vllm.benchmarks.startup.
        add_cli_args(parser)

    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        """Run the startup benchmark with the parsed *args*."""
        main(args)

+ 9
- 0
vllm/entrypoints/openai/protocol.py View File

@@ -320,6 +320,7 @@ class ResponsesRequest(OpenAIBaseModel):
max_tool_calls: int | None = None
metadata: Metadata | None = None
model: str | None = None
logit_bias: dict[str, float] | None = None
parallel_tool_calls: bool | None = True
previous_response_id: str | None = None
prompt: ResponsePrompt | None = None
@@ -333,6 +334,7 @@ class ResponsesRequest(OpenAIBaseModel):
tools: list[Tool] = Field(default_factory=list)
top_logprobs: int | None = 0
top_p: float | None = None
top_k: int | None = None
truncation: Literal["auto", "disabled"] | None = "disabled"
user: str | None = None

@@ -387,6 +389,7 @@ class ResponsesRequest(OpenAIBaseModel):
_DEFAULT_SAMPLING_PARAMS = {
"temperature": 1.0,
"top_p": 1.0,
"top_k": 0,
}

def to_sampling_params(
@@ -408,6 +411,10 @@ class ResponsesRequest(OpenAIBaseModel):
top_p = default_sampling_params.get(
"top_p", self._DEFAULT_SAMPLING_PARAMS["top_p"]
)
if (top_k := self.top_k) is None:
top_k = default_sampling_params.get(
"top_k", self._DEFAULT_SAMPLING_PARAMS["top_k"]
)
stop_token_ids = default_sampling_params.get("stop_token_ids")

# Structured output
@@ -428,6 +435,7 @@ class ResponsesRequest(OpenAIBaseModel):
return SamplingParams.from_optional(
temperature=temperature,
top_p=top_p,
top_k=top_k,
max_tokens=max_tokens,
logprobs=self.top_logprobs if self.is_include_output_logprobs() else None,
stop_token_ids=stop_token_ids,
@@ -435,6 +443,7 @@ class ResponsesRequest(OpenAIBaseModel):
RequestOutputKind.DELTA if self.stream else RequestOutputKind.FINAL_ONLY
),
structured_outputs=structured_outputs,
logit_bias=self.logit_bias,
)

def is_include_output_logprobs(self) -> bool:


+ 36
- 19
vllm/entrypoints/openai/serving_chat.py View File

@@ -61,7 +61,7 @@ from vllm.entrypoints.openai.tool_parsers import ToolParser
from vllm.entrypoints.openai.tool_parsers.mistral_tool_parser import MistralToolCall
from vllm.entrypoints.openai.utils import maybe_filter_parallel_tool_calls
from vllm.entrypoints.utils import get_max_tokens, should_include_usage
from vllm.inputs.data import TokensPrompt as EngineTokensPrompt
from vllm.inputs.data import TokensPrompt
from vllm.logger import init_logger
from vllm.logprobs import Logprob
from vllm.outputs import CompletionOutput, RequestOutput
@@ -234,11 +234,7 @@ class OpenAIServingChat(OpenAIServing):
)
if error_check_ret is not None:
return error_check_ret
(
conversation,
request_prompts,
engine_prompts,
) = await self._preprocess_chat(
conversation, engine_prompts = await self._preprocess_chat(
request,
tokenizer,
request.messages,
@@ -254,11 +250,7 @@ class OpenAIServingChat(OpenAIServing):
)
else:
# For GPT-OSS.
(
conversation,
request_prompts,
engine_prompts,
) = self._make_request_with_harmony(request)
conversation, engine_prompts = self._make_request_with_harmony(request)
except (ValueError, TypeError, RuntimeError, jinja2.TemplateError) as e:
logger.exception("Error in preprocessing prompt inputs")
return self.create_error_response(f"{e} {e.__cause__}")
@@ -278,7 +270,7 @@ class OpenAIServingChat(OpenAIServing):
generators: list[AsyncGenerator[RequestOutput, None]] = []
try:
for i, engine_prompt in enumerate(engine_prompts):
prompt_text, _, _ = self._get_prompt_components(request_prompts[i])
prompt_text, _, _ = self._get_prompt_components(engine_prompt)
# If we are creating sub requests for multiple prompts, ensure that they
# have unique request ids.
sub_request_id = (
@@ -313,7 +305,7 @@ class OpenAIServingChat(OpenAIServing):

self._log_inputs(
sub_request_id,
request_prompts[i],
engine_prompt,
params=sampling_params,
lora_request=lora_request,
)
@@ -537,7 +529,7 @@ class OpenAIServingChat(OpenAIServing):
request_id: str,
model_name: str,
conversation: list[ConversationMessage],
tokenizer: TokenizerLike,
tokenizer: TokenizerLike | None,
request_metadata: RequestResponseMetadata,
) -> AsyncGenerator[str, None]:
created_time = int(time.time())
@@ -591,6 +583,11 @@ class OpenAIServingChat(OpenAIServing):

try:
if self.reasoning_parser:
if tokenizer is None:
raise ValueError(
"Tokenizer not available when `skip_tokenizer_init=True`"
)

reasoning_parser = self.reasoning_parser(
tokenizer,
chat_template_kwargs=request.chat_template_kwargs, # type: ignore
@@ -604,6 +601,11 @@ class OpenAIServingChat(OpenAIServing):
# Prepare the tool parser if it's needed
try:
if tool_choice_auto and self.tool_parser:
if tokenizer is None:
raise ValueError(
"Tokenizer not available when `skip_tokenizer_init=True`"
)

tool_parsers: list[ToolParser | None] = [
self.tool_parser(tokenizer)
] * num_choices
@@ -1317,7 +1319,7 @@ class OpenAIServingChat(OpenAIServing):
request_id: str,
model_name: str,
conversation: list[ConversationMessage],
tokenizer: TokenizerLike,
tokenizer: TokenizerLike | None,
request_metadata: RequestResponseMetadata,
) -> ErrorResponse | ChatCompletionResponse:
created_time = int(time.time())
@@ -1367,6 +1369,11 @@ class OpenAIServingChat(OpenAIServing):
reasoning = None

if self.tool_parser is not None:
if tokenizer is None:
raise ValueError(
"Tokenizer not available when `skip_tokenizer_init=True`"
)

tool_parser = self.tool_parser(tokenizer)
# NOTE: We use token_ids for openai tool parser
tool_call_info = tool_parser.extract_tool_calls(
@@ -1409,6 +1416,11 @@ class OpenAIServingChat(OpenAIServing):

if self.reasoning_parser:
try:
if tokenizer is None:
raise ValueError(
"Tokenizer not available when `skip_tokenizer_init=True`"
)

reasoning_parser = self.reasoning_parser(
tokenizer,
chat_template_kwargs=request.chat_template_kwargs, # type: ignore
@@ -1648,7 +1660,7 @@ class OpenAIServingChat(OpenAIServing):
self,
logprobs: dict[int, Logprob],
top_logprobs: int | None,
tokenizer: TokenizerLike,
tokenizer: TokenizerLike | None,
should_return_as_token_id: bool,
) -> list[ChatCompletionLogProb]:
return [
@@ -1672,7 +1684,7 @@ class OpenAIServingChat(OpenAIServing):
self,
token_ids: GenericSequence[int],
top_logprobs: GenericSequence[dict[int, Logprob] | None],
tokenizer: TokenizerLike,
tokenizer: TokenizerLike | None,
num_output_top_logprobs: int | None = None,
return_as_token_id: bool | None = None,
) -> ChatCompletionLogProbs:
@@ -1690,6 +1702,11 @@ class OpenAIServingChat(OpenAIServing):
if should_return_as_token_id:
token = f"token_id:{token_id}"
else:
if tokenizer is None:
raise ValueError(
"Tokenizer not available when `skip_tokenizer_init=True`"
)

token = tokenizer.decode(token_id)

logprobs_content.append(
@@ -1800,10 +1817,10 @@ class OpenAIServingChat(OpenAIServing):

# Render prompt token ids.
prompt_token_ids = render_for_completion(messages)
engine_prompt = EngineTokensPrompt(prompt_token_ids=prompt_token_ids)
engine_prompt = TokensPrompt(prompt_token_ids=prompt_token_ids)

# Add cache_salt if provided in the request
if request.cache_salt is not None:
engine_prompt["cache_salt"] = request.cache_salt

return messages, [prompt_token_ids], [engine_prompt]
return messages, [engine_prompt]

+ 72
- 125
vllm/entrypoints/openai/serving_engine.py View File

@@ -5,60 +5,19 @@ import json
import sys
import time
import traceback
from collections.abc import AsyncGenerator, Callable, Iterable, Mapping, Sequence
from collections.abc import AsyncGenerator, Callable, Iterable, Mapping
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from http import HTTPStatus
from typing import Any, ClassVar, Generic, TypeAlias, TypeVar

import numpy as np
import torch
from fastapi import Request
from pydantic import ConfigDict, TypeAdapter
from starlette.datastructures import Headers
from typing_extensions import TypeIs

from vllm.entrypoints.context import (
HarmonyContext,
ParsableContext,
StreamingHarmonyContext,
)
from vllm.entrypoints.openai.protocol import (
FunctionCall,
ResponseInputOutputItem,
ResponsesRequest,
)
from vllm.entrypoints.pooling.classify.protocol import (
ClassificationChatRequest,
ClassificationCompletionRequest,
ClassificationRequest,
ClassificationResponse,
)
from vllm.entrypoints.pooling.embed.protocol import (
EmbeddingChatRequest,
EmbeddingCompletionRequest,
EmbeddingRequest,
EmbeddingResponse,
)
from vllm.entrypoints.pooling.pooling.protocol import (
IOProcessorRequest,
PoolingResponse,
)
from vllm.entrypoints.pooling.score.protocol import (
RerankRequest,
ScoreRequest,
ScoreResponse,
)
from vllm.transformers_utils.tokenizer import AnyTokenizer

if sys.version_info >= (3, 12):
from typing import TypedDict
else:
from typing_extensions import TypedDict

from openai.types.responses import (
ToolChoiceFunction,
)
from pydantic import ConfigDict, TypeAdapter
from starlette.datastructures import Headers

import vllm.envs as envs
from vllm.beam_search import BeamSearchSequence, create_sort_beams_key_function
@@ -72,7 +31,12 @@ from vllm.entrypoints.chat_utils import (
parse_chat_messages_futures,
resolve_chat_template_content_format,
)
from vllm.entrypoints.context import ConversationContext
from vllm.entrypoints.context import (
ConversationContext,
HarmonyContext,
ParsableContext,
StreamingHarmonyContext,
)
from vllm.entrypoints.logger import RequestLogger
from vllm.entrypoints.openai.protocol import (
ChatCompletionNamedToolChoiceParam,
@@ -83,7 +47,10 @@ from vllm.entrypoints.openai.protocol import (
DetokenizeRequest,
ErrorInfo,
ErrorResponse,
FunctionCall,
FunctionDefinition,
ResponseInputOutputItem,
ResponsesRequest,
TokenizeChatRequest,
TokenizeCompletionRequest,
TokenizeResponse,
@@ -93,14 +60,34 @@ from vllm.entrypoints.openai.protocol import (
)
from vllm.entrypoints.openai.serving_models import OpenAIServingModels
from vllm.entrypoints.openai.tool_parsers import ToolParser, ToolParserManager
from vllm.entrypoints.pooling.classify.protocol import (
ClassificationChatRequest,
ClassificationCompletionRequest,
ClassificationRequest,
ClassificationResponse,
)
from vllm.entrypoints.pooling.embed.protocol import (
EmbeddingChatRequest,
EmbeddingCompletionRequest,
EmbeddingRequest,
EmbeddingResponse,
)
from vllm.entrypoints.pooling.pooling.protocol import (
IOProcessorRequest,
PoolingResponse,
)
from vllm.entrypoints.pooling.score.protocol import (
RerankRequest,
ScoreRequest,
ScoreResponse,
)
from vllm.entrypoints.renderer import BaseRenderer, CompletionRenderer, RenderConfig
from vllm.entrypoints.responses_utils import (
construct_input_messages,
)
from vllm.entrypoints.serve.disagg.protocol import GenerateRequest, GenerateResponse
from vllm.entrypoints.utils import _validate_truncation_size
from vllm.inputs.data import PromptType
from vllm.inputs.data import TokensPrompt as EngineTokensPrompt
from vllm.inputs.data import PromptType, TokensPrompt
from vllm.inputs.parse import (
PromptComponents,
get_prompt_components,
@@ -109,10 +96,7 @@ from vllm.inputs.parse import (
from vllm.logger import init_logger
from vllm.logprobs import Logprob, PromptLogprobs
from vllm.lora.request import LoRARequest
from vllm.multimodal import ( # noqa: F401 - Required to resolve Pydantic error in RequestProcessingMixin
MultiModalDataDict,
MultiModalUUIDDict,
)
from vllm.multimodal import MultiModalDataDict
from vllm.outputs import CompletionOutput, PoolingRequestOutput, RequestOutput
from vllm.pooling_params import PoolingParams
from vllm.reasoning import ReasoningParser, ReasoningParserManager
@@ -185,34 +169,6 @@ AnyResponse: TypeAlias = (
)


class TextTokensPrompt(TypedDict):
prompt: str
prompt_token_ids: list[int]


class EmbedsPrompt(TypedDict):
prompt_embeds: torch.Tensor


RequestPrompt: TypeAlias = list[int] | str | TextTokensPrompt | EmbedsPrompt


def is_text_tokens_prompt(prompt: RequestPrompt) -> TypeIs[TextTokensPrompt]:
return (
isinstance(prompt, dict)
and "prompt_token_ids" in prompt
and "prompt_embeds" not in prompt
)


def is_embeds_prompt(prompt: RequestPrompt) -> TypeIs[EmbedsPrompt]:
return (
isinstance(prompt, dict)
and "prompt_token_ids" not in prompt
and "prompt_embeds" in prompt
)


RequestT = TypeVar("RequestT", bound=AnyRequest)


@@ -223,8 +179,7 @@ class RequestProcessingMixin:
handling prompt preparation and engine input.
"""

request_prompts: Sequence[RequestPrompt] | None = field(default_factory=list)
engine_prompts: list[EngineTokensPrompt] | None = field(default_factory=list)
engine_prompts: list[TokensPrompt] | None = field(default_factory=list)


@dataclass(kw_only=True)
@@ -425,7 +380,7 @@ class OpenAIServing:
prompts_batch, lora_req_batch = zip(
*[
(
EngineTokensPrompt(
TokensPrompt(
prompt_token_ids=beam.tokens,
multi_modal_data=beam.multi_modal_data,
mm_processor_kwargs=beam.mm_processor_kwargs,
@@ -947,7 +902,7 @@ class OpenAIServing:
prompt: str,
tokenizer: TokenizerLike,
add_special_tokens: bool,
) -> TextTokensPrompt:
) -> TokensPrompt:
async_tokenizer = self._get_async_tokenizer(tokenizer)

if (
@@ -988,7 +943,7 @@ class OpenAIServing:
request: AnyRequest,
prompt_ids: list[int],
tokenizer: TokenizerLike | None,
) -> TextTokensPrompt:
) -> TokensPrompt:
truncate_prompt_tokens = getattr(request, "truncate_prompt_tokens", None)

if truncate_prompt_tokens is None:
@@ -1011,7 +966,7 @@ class OpenAIServing:
request: AnyRequest,
input_ids: list[int],
input_text: str,
) -> TextTokensPrompt:
) -> TokensPrompt:
token_num = len(input_ids)

# Note: EmbeddingRequest, ClassificationRequest,
@@ -1042,7 +997,7 @@ class OpenAIServing:
f"{token_num} tokens in the input for {operation}. "
f"Please reduce the length of the input."
)
return TextTokensPrompt(prompt=input_text, prompt_token_ids=input_ids)
return TokensPrompt(prompt=input_text, prompt_token_ids=input_ids)

# Note: TokenizeRequest and DetokenizeRequest doesn't have max_tokens
# and does not require model context length validation
@@ -1050,7 +1005,7 @@ class OpenAIServing:
request,
(TokenizeCompletionRequest, TokenizeChatRequest, DetokenizeRequest),
):
return TextTokensPrompt(prompt=input_text, prompt_token_ids=input_ids)
return TokensPrompt(prompt=input_text, prompt_token_ids=input_ids)

# chat completion endpoint supports max_completion_tokens
if isinstance(request, ChatCompletionRequest):
@@ -1078,7 +1033,7 @@ class OpenAIServing:
f" - {token_num})."
)

return TextTokensPrompt(prompt=input_text, prompt_token_ids=input_ids)
return TokensPrompt(prompt=input_text, prompt_token_ids=input_ids)

async def _tokenize_prompt_input_async(
self,
@@ -1086,7 +1041,7 @@ class OpenAIServing:
tokenizer: TokenizerLike,
prompt_input: str | list[int],
add_special_tokens: bool = True,
) -> TextTokensPrompt:
) -> TokensPrompt:
"""
A simpler implementation that tokenizes a single prompt input.
"""
@@ -1105,7 +1060,7 @@ class OpenAIServing:
tokenizer: TokenizerLike,
prompt_inputs: Iterable[str | list[int]],
add_special_tokens: bool = True,
) -> AsyncGenerator[TextTokensPrompt, None]:
) -> AsyncGenerator[TokensPrompt, None]:
"""
A simpler implementation that tokenizes multiple prompt inputs.
"""
@@ -1158,11 +1113,7 @@ class OpenAIServing:
chat_template_kwargs: dict[str, Any] | None = None,
tool_parser: Callable[[TokenizerLike], ToolParser] | None = None,
add_special_tokens: bool = False,
) -> tuple[
list[ConversationMessage],
Sequence[RequestPrompt],
list[EngineTokensPrompt],
]:
) -> tuple[list[ConversationMessage], list[TokensPrompt]]:
model_config = self.model_config

resolved_content_format = resolve_chat_template_content_format(
@@ -1235,9 +1186,7 @@ class OpenAIServing:
"Prompt has to be a string",
"when the tokenizer is not initialised",
)
prompt_inputs = TextTokensPrompt(
prompt=request_prompt, prompt_token_ids=[1]
)
prompt_inputs = TokensPrompt(prompt=request_prompt, prompt_token_ids=[1])
elif isinstance(request_prompt, str):
prompt_inputs = await self._tokenize_prompt_input_async(
request,
@@ -1250,14 +1199,15 @@ class OpenAIServing:
assert is_list_of(request_prompt, int), (
"Prompt has to be either a string or a list of token ids"
)
prompt_inputs = TextTokensPrompt(
prompt_inputs = TokensPrompt(
prompt=tokenizer.decode(request_prompt),
prompt_token_ids=request_prompt,
)

engine_prompt = EngineTokensPrompt(
prompt_token_ids=prompt_inputs["prompt_token_ids"]
)
engine_prompt = TokensPrompt(prompt_token_ids=prompt_inputs["prompt_token_ids"])
if "prompt" in prompt_inputs:
engine_prompt["prompt"] = prompt_inputs["prompt"]

if mm_data is not None:
engine_prompt["multi_modal_data"] = mm_data

@@ -1270,7 +1220,7 @@ class OpenAIServing:
if hasattr(request, "cache_salt") and request.cache_salt is not None:
engine_prompt["cache_salt"] = request.cache_salt

return conversation, [request_prompt], [engine_prompt]
return conversation, [engine_prompt]

async def _process_inputs(
self,
@@ -1302,7 +1252,7 @@ class OpenAIServing:
async def _render_next_turn(
self,
request: ResponsesRequest,
tokenizer: AnyTokenizer,
tokenizer: TokenizerLike | None,
messages: list[ResponseInputOutputItem],
tool_dicts: list[dict[str, Any]] | None,
tool_parser,
@@ -1313,7 +1263,7 @@ class OpenAIServing:
request_input=messages,
)

_, request_prompts, engine_prompts = await self._preprocess_chat(
_, engine_prompts = await self._preprocess_chat(
request,
tokenizer,
new_messages,
@@ -1322,20 +1272,20 @@ class OpenAIServing:
chat_template=chat_template,
chat_template_content_format=chat_template_content_format,
)
return request_prompts, engine_prompts
return engine_prompts

async def _generate_with_builtin_tools(
self,
request_id: str,
request_prompt: RequestPrompt,
engine_prompt: EngineTokensPrompt,
engine_prompt: TokensPrompt,
sampling_params: SamplingParams,
context: ConversationContext,
lora_request: LoRARequest | None = None,
priority: int = 0,
**kwargs,
):
prompt_text, _, _ = self._get_prompt_components(request_prompt)
prompt_text, _, _ = self._get_prompt_components(engine_prompt)

orig_priority = priority
sub_request = 0
while True:
@@ -1343,7 +1293,7 @@ class OpenAIServing:
sub_request_id = f"{request_id}_{sub_request}"
self._log_inputs(
sub_request_id,
request_prompt,
engine_prompt,
params=sampling_params,
lora_request=lora_request,
)
@@ -1388,10 +1338,9 @@ class OpenAIServing:
# Render the next prompt token ids.
if isinstance(context, (HarmonyContext, StreamingHarmonyContext)):
prompt_token_ids = context.render_for_completion()
engine_prompt = EngineTokensPrompt(prompt_token_ids=prompt_token_ids)
request_prompt = prompt_token_ids
engine_prompt = TokensPrompt(prompt_token_ids=prompt_token_ids)
elif isinstance(context, ParsableContext):
request_prompts, engine_prompts = await self._render_next_turn(
engine_prompts = await self._render_next_turn(
context.request,
context.tokenizer,
context.parser.response_messages,
@@ -1401,8 +1350,7 @@ class OpenAIServing:
context.chat_template_content_format,
)
engine_prompt = engine_prompts[0]
request_prompt = request_prompts[0]
prompt_text, _, _ = self._get_prompt_components(request_prompt)
prompt_text, _, _ = self._get_prompt_components(engine_prompt)

# Update the sampling params.
sampling_params.max_tokens = self.max_model_len - len(
@@ -1412,19 +1360,13 @@ class OpenAIServing:
priority = orig_priority - 1
sub_request += 1

def _get_prompt_components(
self,
prompt: RequestPrompt | PromptType,
) -> PromptComponents:
if isinstance(prompt, list):
return PromptComponents(token_ids=prompt)

return get_prompt_components(prompt) # type: ignore[arg-type]
def _get_prompt_components(self, prompt: PromptType) -> PromptComponents:
return get_prompt_components(prompt)

def _log_inputs(
self,
request_id: str,
inputs: RequestPrompt | PromptType,
inputs: PromptType,
params: SamplingParams | PoolingParams | BeamSearchParams | None,
lora_request: LoRARequest | None,
) -> None:
@@ -1486,7 +1428,7 @@ class OpenAIServing:
@staticmethod
def _parse_tool_calls_from_content(
request: ResponsesRequest | ChatCompletionRequest,
tokenizer: TokenizerLike,
tokenizer: TokenizerLike | None,
enable_auto_tools: bool,
tool_parser_cls: Callable[[TokenizerLike], ToolParser] | None,
content: str | None = None,
@@ -1526,6 +1468,11 @@ class OpenAIServing:
and enable_auto_tools
and (request.tool_choice == "auto" or request.tool_choice is None)
):
if tokenizer is None:
raise ValueError(
"Tokenizer not available when `skip_tokenizer_init=True`"
)

# Automatic Tool Call Parsing
try:
tool_parser = tool_parser_cls(tokenizer)


+ 11
- 12
vllm/entrypoints/openai/serving_responses.py View File

@@ -107,7 +107,7 @@ from vllm.entrypoints.responses_utils import (
make_response_output_items_from_parsable_context,
)
from vllm.entrypoints.tool_server import ToolServer
from vllm.inputs.data import TokensPrompt as EngineTokensPrompt
from vllm.inputs.data import TokensPrompt
from vllm.logger import init_logger
from vllm.logprobs import Logprob as SampleLogprob
from vllm.logprobs import SampleLogprobs
@@ -258,7 +258,7 @@ class OpenAIServingResponses(OpenAIServing):
self.tool_server = tool_server

def _validate_generator_input(
self, engine_prompt: EngineTokensPrompt
self, engine_prompt: TokensPrompt
) -> ErrorResponse | None:
"""Add validations to the input to the generator here."""
if self.max_model_len <= len(engine_prompt["prompt_token_ids"]):
@@ -353,11 +353,11 @@ class OpenAIServingResponses(OpenAIServing):
tokenizer = await self.engine_client.get_tokenizer()

if self.use_harmony:
messages, request_prompts, engine_prompts = (
self._make_request_with_harmony(request, prev_response)
messages, engine_prompts = self._make_request_with_harmony(
request, prev_response
)
else:
messages, request_prompts, engine_prompts = await self._make_request(
messages, engine_prompts = await self._make_request(
request, prev_response, tokenizer
)

@@ -393,7 +393,7 @@ class OpenAIServingResponses(OpenAIServing):
assert len(builtin_tool_list) == 0
available_tools = []
try:
for i, engine_prompt in enumerate(engine_prompts):
for engine_prompt in engine_prompts:
maybe_error = self._validate_generator_input(engine_prompt)
if maybe_error is not None:
return maybe_error
@@ -420,7 +420,7 @@ class OpenAIServingResponses(OpenAIServing):
context = HarmonyContext(messages, available_tools)
else:
if envs.VLLM_USE_EXPERIMENTAL_PARSER_CONTEXT:
# This is an feature in development for parsing
# This is a feature in development for parsing
# tokens during generation instead of at the end
context = ParsableContext(
response_messages=messages,
@@ -449,7 +449,6 @@ class OpenAIServingResponses(OpenAIServing):
)
generator = self._generate_with_builtin_tools(
request_id=request.request_id,
request_prompt=request_prompts[i],
engine_prompt=engine_prompt,
sampling_params=sampling_params,
context=context,
@@ -564,7 +563,7 @@ class OpenAIServingResponses(OpenAIServing):
prev_msg=self.msg_store.get(prev_response.id) if prev_response else None,
prev_response_output=prev_response.output if prev_response else None,
)
_, request_prompts, engine_prompts = await self._preprocess_chat(
_, engine_prompts = await self._preprocess_chat(
request,
tokenizer,
messages,
@@ -573,7 +572,7 @@ class OpenAIServingResponses(OpenAIServing):
chat_template=self.chat_template,
chat_template_content_format=self.chat_template_content_format,
)
return messages, request_prompts, engine_prompts
return messages, engine_prompts

def _make_request_with_harmony(
self,
@@ -586,13 +585,13 @@ class OpenAIServingResponses(OpenAIServing):
)
messages = self._construct_input_messages_with_harmony(request, prev_response)
prompt_token_ids = render_for_completion(messages)
engine_prompt = EngineTokensPrompt(prompt_token_ids=prompt_token_ids)
engine_prompt = TokensPrompt(prompt_token_ids=prompt_token_ids)

# Add cache_salt if provided in the request
if request.cache_salt is not None:
engine_prompt["cache_salt"] = request.cache_salt

return messages, [prompt_token_ids], [engine_prompt]
return messages, [engine_prompt]

async def _initialize_tool_sessions(
self,


+ 1
- 5
vllm/entrypoints/pooling/classify/serving.py View File

@@ -72,11 +72,7 @@ class ClassificationMixin(OpenAIServing):
if ret:
return ret

(
_,
_,
engine_prompts,
) = await self._preprocess_chat(
_, engine_prompts = await self._preprocess_chat(
cast(ChatCompletionRequest, chat_request),
ctx.tokenizer,
messages,


+ 20
- 39
vllm/entrypoints/pooling/embed/serving.py View File

@@ -20,7 +20,6 @@ from vllm.entrypoints.openai.serving_engine import (
EmbeddingServeContext,
OpenAIServing,
ServeContext,
TextTokensPrompt,
)
from vllm.entrypoints.openai.serving_models import OpenAIServingModels
from vllm.entrypoints.pooling.embed.protocol import (
@@ -32,7 +31,7 @@ from vllm.entrypoints.pooling.embed.protocol import (
EmbeddingResponseData,
)
from vllm.entrypoints.renderer import RenderConfig
from vllm.inputs.data import TokensPrompt as EngineTokensPrompt
from vllm.inputs.data import TokensPrompt
from vllm.logger import init_logger
from vllm.outputs import (
EmbeddingRequestOutput,
@@ -83,11 +82,7 @@ class EmbeddingMixin(OpenAIServing):
renderer = self._get_renderer(tokenizer)

if isinstance(ctx.request, EmbeddingChatRequest):
(
_,
_,
ctx.engine_prompts,
) = await self._preprocess_chat(
_, ctx.engine_prompts = await self._preprocess_chat(
ctx.request,
tokenizer,
ctx.request.messages,
@@ -209,14 +204,13 @@ class EmbeddingMixin(OpenAIServing):
async def _process_chunked_request(
self,
ctx: EmbeddingServeContext,
original_prompt: TextTokensPrompt,
token_ids: list[int],
pooling_params,
trace_headers,
prompt_idx: int,
) -> list[AsyncGenerator[PoolingRequestOutput, None]]:
"""Process a single prompt using chunked processing."""
generators: list[AsyncGenerator[PoolingRequestOutput, None]] = []
token_ids = original_prompt["prompt_token_ids"]

# Split into chunks using max_position_embeddings
max_pos_embeddings = self._get_max_position_embeddings()
@@ -228,18 +222,12 @@ class EmbeddingMixin(OpenAIServing):
chunk_request_id = f"{ctx.request_id}-prompt-{prompt_idx}-chunk-{chunk_idx}"

# Create engine prompt for this chunk
chunk_engine_prompt = EngineTokensPrompt(prompt_token_ids=chunk_tokens)

# Create chunk request prompt for logging
chunk_text = ""
chunk_request_prompt = TextTokensPrompt(
prompt=chunk_text, prompt_token_ids=chunk_tokens
)
chunk_engine_prompt = TokensPrompt(prompt_token_ids=chunk_tokens)

# Log the chunk
self._log_inputs(
chunk_request_id,
chunk_request_prompt,
chunk_engine_prompt,
params=pooling_params,
lora_request=ctx.lora_request,
)
@@ -263,7 +251,7 @@ class EmbeddingMixin(OpenAIServing):
request,
input_ids: list[int],
input_text: str,
) -> TextTokensPrompt:
) -> TokensPrompt:
"""Override to support chunked processing for embedding requests."""
token_num = len(input_ids)

@@ -328,23 +316,15 @@ class EmbeddingMixin(OpenAIServing):
)
)

return TextTokensPrompt(prompt=input_text, prompt_token_ids=input_ids)
return TokensPrompt(prompt=input_text, prompt_token_ids=input_ids)

# For other request types, use the parent's implementation
return super()._validate_input(request, input_ids, input_text)

def _is_text_tokens_prompt(self, prompt) -> bool:
"""Check if a prompt is a TextTokensPrompt (has prompt_token_ids)."""
return (
isinstance(prompt, dict)
and "prompt_token_ids" in prompt
and "prompt_embeds" not in prompt
)

async def _create_single_prompt_generator(
self,
ctx: EmbeddingServeContext,
engine_prompt: EngineTokensPrompt,
engine_prompt: TokensPrompt,
pooling_params: PoolingParams,
trace_headers: Mapping[str, str] | None,
prompt_index: int,
@@ -413,14 +393,16 @@ class EmbeddingMixin(OpenAIServing):

for i, engine_prompt in enumerate(ctx.engine_prompts):
# Check if this specific prompt needs chunked processing
if self._is_text_tokens_prompt(engine_prompt):
# Cast to TextTokensPrompt since we've verified
# prompt_token_ids
text_tokens_prompt = cast(TextTokensPrompt, engine_prompt)
if len(text_tokens_prompt["prompt_token_ids"]) > max_pos_embeddings:
if "prompt_token_ids" in engine_prompt:
prompt_token_ids = engine_prompt["prompt_token_ids"]
if len(prompt_token_ids) > max_pos_embeddings:
# Use chunked processing for this prompt
chunk_generators = await self._process_chunked_request(
ctx, text_tokens_prompt, pooling_params, trace_headers, i
ctx,
prompt_token_ids,
pooling_params,
trace_headers,
i,
)
generators.extend(chunk_generators)
continue
@@ -578,14 +560,13 @@ class EmbeddingMixin(OpenAIServing):

# Get original prompt token IDs for this prompt
original_prompt = ctx.engine_prompts[prompt_idx]
if not self._is_text_tokens_prompt(original_prompt):
if "prompt_token_ids" not in original_prompt:
return self.create_error_response(
f"Chunked prompt {prompt_idx} is not a TextTokensPrompt"
f"Chunked prompt {prompt_idx} does not contain "
"token IDs"
)

original_token_ids = cast(TextTokensPrompt, original_prompt)[
"prompt_token_ids"
]
original_token_ids = original_prompt["prompt_token_ids"]

pooling_request_output = PoolingRequestOutput(
request_id=aggregator["request_id"],


+ 2
- 5
vllm/entrypoints/pooling/pooling/serving.py View File

@@ -137,11 +137,8 @@ class OpenAIServingPooling(OpenAIServing):
)
if error_check_ret is not None:
return error_check_ret
(
_,
_,
engine_prompts,
) = await self._preprocess_chat(

_, engine_prompts = await self._preprocess_chat(
request,
tokenizer,
request.messages,


+ 1
- 0
vllm/entrypoints/pooling/score/protocol.py View File

@@ -120,6 +120,7 @@ class RerankResult(BaseModel):


class RerankUsage(BaseModel):
prompt_tokens: int
total_tokens: int




+ 3
- 1
vllm/entrypoints/pooling/score/serving.py View File

@@ -502,5 +502,7 @@ class ServingScores(OpenAIServing):
id=request_id,
model=model_name,
results=results,
usage=RerankUsage(total_tokens=num_prompt_tokens),
usage=RerankUsage(
total_tokens=num_prompt_tokens, prompt_tokens=num_prompt_tokens
),
)

+ 18
- 20
vllm/entrypoints/renderer.py View File

@@ -12,9 +12,7 @@ import torch
from pydantic import Field

from vllm.config import ModelConfig
from vllm.inputs.data import EmbedsPrompt as EngineEmbedsPrompt
from vllm.inputs.data import TextPrompt as EngineTextPrompt
from vllm.inputs.data import TokensPrompt as EngineTokensPrompt
from vllm.inputs.data import EmbedsPrompt, TextPrompt, TokensPrompt
from vllm.inputs.parse import get_prompt_components, parse_raw_prompts
from vllm.tokenizers import TokenizerLike
from vllm.utils.async_utils import AsyncMicrobatchTokenizer
@@ -97,7 +95,7 @@ class BaseRenderer(ABC):
*,
prompt_or_prompts: str | list[str] | list[int] | list[list[int]],
config: RenderConfig,
) -> list[EngineTokensPrompt]:
) -> list[TokensPrompt]:
"""
Convert text or token inputs into engine-ready TokensPrompt objects.

@@ -115,7 +113,7 @@ class BaseRenderer(ABC):
(e.g., tokenization and length handling).

Returns:
list[EngineTokensPrompt]: Engine-ready token prompts.
list[TokensPrompt]: Engine-ready token prompts.

Raises:
ValueError: If input formats are invalid or length limits exceeded.
@@ -129,7 +127,7 @@ class BaseRenderer(ABC):
prompt_or_prompts: str | list[str] | list[int] | list[list[int]] | None = None,
prompt_embeds: bytes | list[bytes] | None = None,
config: RenderConfig,
) -> list[EngineTokensPrompt | EngineEmbedsPrompt]:
) -> list[TokensPrompt | EmbedsPrompt]:
"""
Convert text/token and/or base64-encoded embeddings inputs into
engine-ready prompt objects using a unified RenderConfig.
@@ -146,7 +144,7 @@ class BaseRenderer(ABC):
(e.g., tokenization and length handling).

Returns:
list[Union[EngineTokensPrompt, EngineEmbedsPrompt]]:
list[Union[TokensPrompt, EmbedsPrompt]]:
Engine-ready prompt objects.

Raises:
@@ -161,14 +159,14 @@ class BaseRenderer(ABC):
prompt_embeds: bytes | list[bytes],
truncate_prompt_tokens: Annotated[int, Field(ge=0)] | None = None,
cache_salt: str | None = None,
) -> list[EngineEmbedsPrompt]:
) -> list[EmbedsPrompt]:
"""Load and validate base64-encoded embeddings into prompt objects."""
if not self.model_config.enable_prompt_embeds:
raise ValueError(
"You must set `--enable-prompt-embeds` to input `prompt_embeds`."
)

def _load_and_validate_embed(embed: bytes) -> EngineEmbedsPrompt:
def _load_and_validate_embed(embed: bytes) -> EmbedsPrompt:
tensor = torch.load(
io.BytesIO(pybase64.b64decode(embed, validate=True)),
weights_only=True,
@@ -185,7 +183,7 @@ class BaseRenderer(ABC):
assert tensor.dim() == 2
if truncate_prompt_tokens is not None:
tensor = tensor[-truncate_prompt_tokens:]
embeds_prompt = EngineEmbedsPrompt(prompt_embeds=tensor)
embeds_prompt = EmbedsPrompt(prompt_embeds=tensor)
if cache_salt is not None:
embeds_prompt["cache_salt"] = cache_salt
return embeds_prompt
@@ -213,7 +211,7 @@ class CompletionRenderer(BaseRenderer):
*,
prompt_or_prompts: str | list[str] | list[int] | list[list[int]],
config: RenderConfig,
) -> list[EngineTokensPrompt]:
) -> list[TokensPrompt]:
"""Implementation of prompt rendering for completion-style requests.

Uses async tokenizer pooling for improved performance. See base class
@@ -240,7 +238,7 @@ class CompletionRenderer(BaseRenderer):
prompt_or_prompts: str | list[str] | list[int] | list[list[int]] | None = None,
prompt_embeds: bytes | list[bytes] | None = None,
config: RenderConfig,
) -> list[EngineTokensPrompt | EngineEmbedsPrompt]:
) -> list[TokensPrompt | EmbedsPrompt]:
"""
Render text/token prompts and/or precomputed embedding prompts. At
least one of `prompt_or_prompts` or `prompt_embeds` must be provided.
@@ -249,7 +247,7 @@ class CompletionRenderer(BaseRenderer):
if truncate_prompt_tokens == 0:
return []

rendered: list[EngineTokensPrompt | EngineEmbedsPrompt] = []
rendered: list[TokensPrompt | EmbedsPrompt] = []

if prompt_embeds is not None:
rendered.extend(
@@ -281,10 +279,10 @@ class CompletionRenderer(BaseRenderer):

async def _create_prompt(
self,
prompt_input: EngineTextPrompt | EngineTokensPrompt,
prompt_input: TextPrompt | TokensPrompt,
config: RenderConfig,
truncate_prompt_tokens: int | None,
) -> EngineTokensPrompt:
) -> TokensPrompt:
prompt, prompt_token_ids, _ = get_prompt_components(prompt_input)

if prompt_token_ids is not None:
@@ -317,7 +315,7 @@ class CompletionRenderer(BaseRenderer):
truncate_prompt_tokens: int | None,
add_special_tokens: bool,
cache_salt: str | None,
) -> EngineTokensPrompt:
) -> TokensPrompt:
"""Tokenize text input asynchronously."""
async_tokenizer = self._get_async_tokenizer()

@@ -350,7 +348,7 @@ class CompletionRenderer(BaseRenderer):
truncate_prompt_tokens: int | None,
cache_salt: str | None,
needs_detokenization: bool | None = False,
) -> EngineTokensPrompt:
) -> TokensPrompt:
"""Optionally detokenize token IDs and build a tokens prompt."""
token_ids = self._maybe_apply_truncation(token_ids, truncate_prompt_tokens)

@@ -392,8 +390,8 @@ class CompletionRenderer(BaseRenderer):
max_length: int | None = None,
cache_salt: str | None = None,
prompt: str | None = None,
) -> EngineTokensPrompt:
"""Create validated EngineTokensPrompt."""
) -> TokensPrompt:
"""Create validated TokensPrompt."""
if max_length is not None and len(token_ids) > max_length:
raise ValueError(
f"This model's maximum context length is {max_length} tokens. "
@@ -401,7 +399,7 @@ class CompletionRenderer(BaseRenderer):
"Please reduce the length of the input messages."
)

tokens_prompt = EngineTokensPrompt(prompt_token_ids=token_ids)
tokens_prompt = TokensPrompt(prompt_token_ids=token_ids)
if cache_salt is not None:
tokens_prompt["cache_salt"] = cache_salt
if prompt is not None:


+ 3
- 3
vllm/entrypoints/serve/disagg/serving.py View File

@@ -27,7 +27,7 @@ from vllm.entrypoints.serve.disagg.protocol import (
GenerateResponse,
GenerateResponseChoice,
)
from vllm.inputs.data import TokensPrompt as EngineTokensPrompt
from vllm.inputs.data import TokensPrompt
from vllm.logger import init_logger
from vllm.logprobs import Logprob
from vllm.outputs import RequestOutput
@@ -99,7 +99,7 @@ class ServingTokens(OpenAIServing):

# TODO(NickLucche): Change to EngineCoreRequest once Renderer work is
# completed
engine_prompt = EngineTokensPrompt(prompt_token_ids=request.token_ids)
engine_prompt = TokensPrompt(prompt_token_ids=request.token_ids)
if request.features is not None:
engine_prompt["multi_modal_data"] = None

@@ -115,7 +115,7 @@ class ServingTokens(OpenAIServing):

self._log_inputs(
request_id,
request.token_ids,
TokensPrompt(prompt_token_ids=request.token_ids),
params=sampling_params,
lora_request=lora_request,
)


+ 7
- 6
vllm/entrypoints/serve/tokenize/serving.py View File

@@ -21,6 +21,7 @@ from vllm.entrypoints.openai.protocol import (
from vllm.entrypoints.openai.serving_engine import OpenAIServing
from vllm.entrypoints.openai.serving_models import OpenAIServingModels
from vllm.entrypoints.renderer import RenderConfig
from vllm.inputs import TokensPrompt
from vllm.logger import init_logger
from vllm.tokenizers import TokenizerLike

@@ -80,11 +81,8 @@ class OpenAIServingTokenization(OpenAIServing):
)
if error_check_ret is not None:
return error_check_ret
(
_,
_,
engine_prompts,
) = await self._preprocess_chat(

_, engine_prompts = await self._preprocess_chat(
request,
tokenizer,
request.messages,
@@ -141,7 +139,10 @@ class OpenAIServingTokenization(OpenAIServing):
tokenizer = await self.engine_client.get_tokenizer()

self._log_inputs(
request_id, request.tokens, params=None, lora_request=lora_request
request_id,
TokensPrompt(prompt_token_ids=request.tokens),
params=None,
lora_request=lora_request,
)

prompt_input = await self._tokenize_prompt_input_async(


+ 2
- 2
vllm/model_executor/layers/fused_moe/shared_fused_moe.py View File

@@ -30,8 +30,8 @@ class SharedFusedMoE(FusedMoE):

# Disable shared expert overlap if:
# - we are using eplb, because of correctness issues
# - we are using flashinfer with DP, since there nothint to gain
# - we are using marlin kjernels
# - we are using flashinfer with DP, since there nothing to gain
# - we are using marlin kernels
self.use_overlapped = (
use_overlapped
and not (


+ 1
- 1
vllm/model_executor/layers/quantization/kernels/scaled_mm/__init__.py View File

@@ -62,7 +62,7 @@ def choose_scaled_mm_linear_kernel(
continue

# If the current platform uses compute_capability,
# make sure the kernel supports the compute cability.
# make sure the kernel supports the compute capability.
is_supported, reason = kernel.is_supported(compute_capability)
if not is_supported:
failure_reasons.append(f"{kernel.__name__}: {reason}")


+ 18
- 1
vllm/model_executor/layers/quantization/modelopt.py View File

@@ -188,7 +188,24 @@ class ModelOptQuantConfigBase(QuantizationConfig):

def apply_vllm_mapper(self, hf_to_vllm_mapper: "WeightsMapper"):
if len(self.exclude_modules) > 0:
self.exclude_modules = hf_to_vllm_mapper.apply_list(self.exclude_modules)
# This is a workaround for the weights remapping issue:
# https://github.com/vllm-project/vllm/issues/28072
# Right now, the Nvidia ModelOpt library use just one wildcard pattern:
# module_path*
# It gets applied if the whole tree of modules rooted at module_path
# is not quantized. Here we replace such pattern by 2 patterns that are
# collectively equivalent to the original pattern:
# module_path
# module_path.*
new_exclude_modules = []
for exclude in self.exclude_modules:
if len(exclude) >= 2 and exclude[-1] == "*" and exclude[-2] != ".":
new_exclude_modules.append(exclude[:-1])
new_exclude_modules.append(exclude[:-1] + ".*")
else:
new_exclude_modules.append(exclude)

self.exclude_modules = hf_to_vllm_mapper.apply_list(new_exclude_modules)

@staticmethod
def get_config_filenames() -> list[str]:


+ 5
- 0
vllm/model_executor/layers/quantization/moe_wna16.py View File

@@ -17,6 +17,9 @@ from vllm.model_executor.layers.fused_moe.layer import (
FusedMoEMethodBase,
FusedMoeWeightScaleSupported,
)
from vllm.model_executor.layers.fused_moe.unquantized_fused_moe_method import (
UnquantizedFusedMoEMethod,
)
from vllm.model_executor.layers.linear import LinearBase, UnquantizedLinearMethod
from vllm.model_executor.layers.quantization import QuantizationMethods
from vllm.model_executor.layers.quantization.base_config import (
@@ -162,6 +165,8 @@ class MoeWNA16Config(QuantizationConfig):
self, layer: torch.nn.Module, prefix: str
) -> Optional["QuantizeMethodBase"]:
if is_layer_skipped_quant(prefix, self.modules_to_not_convert):
if isinstance(layer, FusedMoE):
return UnquantizedFusedMoEMethod(layer.moe_config)
return UnquantizedLinearMethod()
elif isinstance(layer, LinearBase):
# Avoid circular import


+ 639
- 0
vllm/model_executor/models/audioflamingo3.py View File

@@ -0,0 +1,639 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

# Copyright 2025 The vLLM team.
# Copyright 2025 NVIDIA CORPORATION and the HuggingFace Inc. team. All rights
# reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from collections.abc import Iterable, Mapping, Sequence
from typing import Annotated, Any, Literal, TypeAlias

import torch
import torch.nn as nn
from transformers import BatchFeature, PretrainedConfig
from transformers.models.audioflamingo3 import (
AudioFlamingo3Config,
AudioFlamingo3Processor,
)
from transformers.models.qwen2_audio import Qwen2AudioEncoder

from vllm.config import VllmConfig
from vllm.config.multimodal import BaseDummyOptions
from vllm.model_executor.layers.activation import get_act_fn
from vllm.model_executor.models.module_mapping import MultiModelKeys
from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.multimodal.inputs import (
MultiModalDataDict,
MultiModalFieldConfig,
MultiModalKwargsItems,
)
from vllm.multimodal.parse import (
DictEmbeddingItems,
ModalityData,
ModalityDataItems,
MultiModalDataItems,
MultiModalDataParser,
)
from vllm.multimodal.processing import (
BaseMultiModalProcessor,
BaseProcessingInfo,
PromptReplacement,
PromptUpdate,
PromptUpdateDetails,
)
from vllm.multimodal.profiling import BaseDummyInputsBuilder
from vllm.sequence import IntermediateTensors
from vllm.utils.tensor_schema import TensorSchema, TensorShape

from .interfaces import (
MultiModalEmbeddings,
SupportsLoRA,
SupportsMultiModal,
SupportsPP,
)
from .utils import (
AutoWeightsLoader,
init_vllm_registered_model,
maybe_prefix,
)

# Maximum supported audio duration in seconds (10 minutes). Mirrors the cap
# used by the HF AudioFlamingo3 processor when windowing long audio into
# chunks (see _call_hf_processor), and sizes the profiling dummy audio.
MAX_AUDIO_LEN = 10 * 60


# === Audio Inputs === #
class AudioFlamingo3FeatureInputs(TensorSchema):
    """Raw mel-spectrogram features for one or more audios, split into
    fixed-size chunks along a flattened chunk axis.

    Dimensions:
        - num_chunks: Number of audio chunks (flattened)
        - nmb: Number of mel bins
        - num_audios: Number of original audio files
    """

    type: Literal["audio_features"]

    # Per-chunk mel features; each chunk spans 3000 frames.
    input_features: Annotated[
        torch.Tensor | list[torch.Tensor],
        TensorShape("num_chunks", "nmb", 3000),
    ]

    # Per-chunk frame validity mask; summed to recover the true (unpadded)
    # frame count of each chunk.
    feature_attention_mask: Annotated[
        torch.Tensor,
        TensorShape("num_chunks", 3000),
    ]

    # Number of chunks contributed by each original audio; used to regroup
    # the flattened chunk axis back into per-audio slices.
    chunk_counts: Annotated[
        torch.Tensor,
        TensorShape("num_audios"),
    ]


class AudioFlamingo3EmbeddingInputs(TensorSchema):
    """Precomputed audio embeddings that bypass the audio encoder entirely.

    Dimensions:
        - bn: Batch size
        - naf: Number of audio features
        - hs: Hidden size (must match the hidden size of language model
          backbone)
    """

    type: Literal["audio_embeds"] = "audio_embeds"

    # One embedding tensor per audio item, already in LM embedding space.
    audio_embeds: Annotated[
        list[torch.Tensor],
        TensorShape("bn", "naf", "hs"),
    ]


# Union of the two accepted audio input forms (raw features or embeddings).
AudioFlamingo3Inputs: TypeAlias = (
    AudioFlamingo3FeatureInputs | AudioFlamingo3EmbeddingInputs
)


class AudioFlamingo3Encoder(Qwen2AudioEncoder):
    """Whisper-style audio encoder reusing the Qwen2-Audio stack, plus an
    extra stride-2 average pool over time before the final layer norm."""

    def __init__(
        self,
        config: PretrainedConfig,
    ):
        super().__init__(config)
        # Halves the temporal resolution of the transformer output.
        self.avg_pooler = nn.AvgPool1d(kernel_size=2, stride=2)
        # self.layer_norm is already initialized in super().__init__

    def forward(
        self,
        input_features: torch.Tensor | list[torch.Tensor],
        attention_mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        """Encode mel features into hidden states.

        Args:
            input_features: (batch, num_mel_bins, seq_len) mel features, or
                a list of such tensors stacked along the batch dimension.
            attention_mask: Optional mask forwarded to each encoder layer.

        Returns:
            Hidden states of shape (batch, pooled_seq_len, hidden_size),
            where the time axis is halved by the average pool.
        """
        # input_features: (batch, num_mel_bins, seq_len)
        if isinstance(input_features, list):
            input_features = torch.stack(input_features)

        hidden_states = nn.functional.gelu(self.conv1(input_features))
        hidden_states = nn.functional.gelu(self.conv2(hidden_states))
        hidden_states = hidden_states.transpose(-1, -2)
        # Add position embeddings truncated to the current sequence length;
        # cast back in case the embedding table is in a different dtype.
        hidden_states = (
            hidden_states + self.embed_positions.weight[: hidden_states.size(-2), :]
        ).to(hidden_states.dtype)

        for layer in self.layers:
            layer_outputs = layer(hidden_states, attention_mask)
            hidden_states = layer_outputs[0]

        # AvgPool (time/2) + LayerNorm
        # hidden_states: (batch, seq_len, hidden_size)
        hidden_states = hidden_states.permute(0, 2, 1)  # (batch, hidden_size, seq_len)
        hidden_states = self.avg_pooler(hidden_states)
        hidden_states = hidden_states.permute(
            0, 2, 1
        )  # (batch, seq_len/2, hidden_size)
        hidden_states = self.layer_norm(hidden_states)

        return hidden_states

    def _get_feat_extract_output_lengths(
        self, input_lengths: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Computes the output length of the convolutional layers and the output length
        of the audio encoder
        """
        # First step mirrors the conv frontend's stride-2 downsampling;
        # second mirrors AvgPool1d(kernel_size=2, stride=2).
        input_lengths = (input_lengths - 1) // 2 + 1
        output_lengths = (input_lengths - 2) // 2 + 1
        return input_lengths, output_lengths


class AudioFlamingo3MultiModalProjector(nn.Module):
    """Two-layer MLP mapping audio-encoder features into the language
    model's embedding space."""

    def __init__(self, config: PretrainedConfig):
        super().__init__()
        audio_dim = config.audio_config.hidden_size
        text_dim = config.text_config.hidden_size
        use_bias = config.projector_bias
        self.linear_1 = nn.Linear(audio_dim, text_dim, bias=use_bias)
        self.act = get_act_fn(config.projector_hidden_act)
        self.linear_2 = nn.Linear(text_dim, text_dim, bias=use_bias)

    def forward(self, audio_features):
        # linear -> activation -> linear
        return self.linear_2(self.act(self.linear_1(audio_features)))


class AudioFlamingo3ProcessingInfo(BaseProcessingInfo):
    """Processing metadata: typed access to the HF config, processor and
    feature extractor for AudioFlamingo3."""

    def get_hf_config(self):
        """Return the HF model config, validated as AudioFlamingo3Config."""
        return self.ctx.get_hf_config(AudioFlamingo3Config)

    def get_hf_processor(self, **kwargs: object):
        """Return the HF multimodal processor."""
        return self.ctx.get_hf_processor(AudioFlamingo3Processor, **kwargs)

    def get_feature_extractor(self, **kwargs: object):
        """Return the audio feature extractor owned by the HF processor."""
        return self.get_hf_processor(**kwargs).feature_extractor

    def get_supported_mm_limits(self) -> Mapping[str, int | None]:
        # No fixed cap on the number of audio items per prompt.
        return {"audio": None}


class AudioFlamingo3DummyInputsBuilder(
    BaseDummyInputsBuilder[AudioFlamingo3ProcessingInfo]
):
    """Builds worst-case dummy inputs (max-length audios and matching
    placeholder tokens) for memory profiling."""

    def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
        """One audio placeholder token per dummy audio item."""
        num_audios = mm_counts.get("audio", 0)
        token = self.info.get_hf_processor().audio_token
        return token * num_audios

    def get_dummy_mm_data(
        self,
        seq_len: int,
        mm_counts: Mapping[str, int],
        mm_options: Mapping[str, BaseDummyOptions] | None = None,
    ) -> MultiModalDataDict:
        """Generate dummy audios at the maximum supported duration."""
        extractor = self.info.get_feature_extractor()
        num_audios = mm_counts.get("audio", 0)
        overrides = None if mm_options is None else mm_options.get("audio")

        dummy_audios = self._get_dummy_audios(
            length=MAX_AUDIO_LEN * extractor.sampling_rate,
            num_audios=num_audios,
            overrides=overrides,
        )
        return {"audio": dummy_audios}


def _audioflamingo3_field_config(hf_inputs: Mapping[str, torch.Tensor]):
    """Describe how HF processor outputs map onto per-audio items.

    When per-audio chunk counts are present, the flattened chunk axis of
    ``input_features``/``feature_attention_mask`` is sliced back into
    per-audio groups; otherwise every field is treated as batched per audio.
    """
    chunk_counts = hf_inputs.get("chunk_counts")
    if chunk_counts is None:
        return dict(
            audio_embeds=MultiModalFieldConfig.batched("audio"),
            input_features=MultiModalFieldConfig.batched("audio"),
            feature_attention_mask=MultiModalFieldConfig.batched("audio"),
            chunk_counts=MultiModalFieldConfig.batched("audio"),
        )
    return dict(
        audio_embeds=MultiModalFieldConfig.batched("audio"),
        input_features=MultiModalFieldConfig.flat_from_sizes(
            "audio", chunk_counts, dim=0
        ),
        feature_attention_mask=MultiModalFieldConfig.flat_from_sizes(
            "audio", chunk_counts, dim=0
        ),
        chunk_counts=MultiModalFieldConfig.batched("audio"),
    )


class AudioFlamingo3MultiModalDataParser(MultiModalDataParser):
    """Data parser that additionally accepts precomputed ``audio_embeds``
    passed as a dict, alongside the standard raw-audio forms."""

    def _parse_audio_data(
        self,
        data: dict[str, torch.Tensor] | ModalityData[Any],
    ) -> ModalityDataItems[Any, Any] | None:
        if not isinstance(data, dict):
            # Raw waveform or other standard form: defer to the base parser.
            return super()._parse_audio_data(data)

        # A dict payload carries precomputed embeddings.
        return DictEmbeddingItems(
            data,
            modality="audio",
            required_fields={"audio_embeds"},
            fields_factory=_audioflamingo3_field_config,
        )


class AudioFlamingo3MultiModalProcessor(
    BaseMultiModalProcessor[AudioFlamingo3ProcessingInfo]
):
    """Multimodal processor for AudioFlamingo3.

    Wraps the HF processor to (a) track how many fixed-size chunks each
    original audio is windowed into, and (b) expand each audio placeholder
    token in the prompt to one token per encoder output frame.
    """

    def _get_data_parser(self) -> MultiModalDataParser:
        # Resample incoming audio to the feature extractor's sampling rate.
        feature_extractor = self.info.get_feature_extractor()
        return AudioFlamingo3MultiModalDataParser(
            target_sr=feature_extractor.sampling_rate
        )

    def _call_hf_processor(
        self,
        prompt: str,
        mm_data: dict[str, object],
        mm_kwargs: Mapping[str, Any],
        tok_kwargs: Mapping[str, object],
    ) -> BatchFeature:
        """Run the HF processor and attach per-audio chunk counts.

        When audio is present, the returned ``BatchFeature`` also carries
        ``feature_attention_mask`` (renamed from HF's
        ``input_features_mask``) and a ``chunk_counts`` tensor.
        """
        # Normalize the alternative "audios" key onto "audio".
        audios = mm_data.pop("audios", [])
        if audios:
            mm_data["audio"] = audios

        # Text-only path: tokenize locally, no feature extraction needed.
        if not mm_data.get("audio", []):
            prompt_ids = self.info.get_tokenizer().encode(prompt)
            prompt_ids = self._apply_hf_processor_tokens_only(prompt_ids)
            return BatchFeature(dict(input_ids=[prompt_ids]), tensor_type="pt")

        feature_extractor = self.info.get_feature_extractor(**mm_kwargs)
        mm_kwargs = dict(
            **mm_kwargs,
            sampling_rate=feature_extractor.sampling_rate,
        )

        # Calculate chunk counts
        audio_list = mm_data.get("audio")
        if not isinstance(audio_list, list):
            audio_list = [audio_list]

        chunk_counts = []
        sampling_rate = feature_extractor.sampling_rate
        chunk_length = feature_extractor.chunk_length
        # Samples per chunk window (chunk_length is in seconds).
        window_size = int(sampling_rate * chunk_length)
        # MAX_AUDIO_LEN is 10 * 60 in HF processor.
        max_windows = int(MAX_AUDIO_LEN // chunk_length)

        for audio in audio_list:
            # audio is numpy array or list
            n_samples = len(audio) if isinstance(audio, list) else audio.shape[0]

            # Ceil-divide into windows; at least one window, capped at the
            # maximum supported audio length.
            n_win = max(1, (n_samples + window_size - 1) // window_size)
            if n_win > max_windows:
                n_win = max_windows
            chunk_counts.append(n_win)

        outputs = super()._call_hf_processor(
            prompt=prompt,
            mm_data=mm_data,
            mm_kwargs=mm_kwargs,
            tok_kwargs=tok_kwargs,
        )

        # Align the HF output key with the name expected by the model.
        if "input_features_mask" in outputs:
            outputs["feature_attention_mask"] = outputs.pop("input_features_mask")

        outputs["chunk_counts"] = torch.tensor(chunk_counts, dtype=torch.long)

        return outputs

    def _get_mm_fields_config(
        self,
        hf_inputs: BatchFeature,
        hf_processor_mm_kwargs: Mapping[str, object],
    ) -> Mapping[str, MultiModalFieldConfig]:
        return _audioflamingo3_field_config(hf_inputs)

    def _get_prompt_updates(
        self,
        mm_items: MultiModalDataItems,
        hf_processor_mm_kwargs: Mapping[str, object],
        out_mm_kwargs: MultiModalKwargsItems,
    ) -> Sequence[PromptUpdate]:
        """Replace each audio placeholder with one token per output frame."""
        processor = self.info.get_hf_processor(**hf_processor_mm_kwargs)
        tokenizer = self.info.get_tokenizer()
        vocab = tokenizer.get_vocab()

        audio_token = getattr(processor, "audio_token", "<sound>")
        audio_token_id = vocab.get(audio_token)
        if audio_token_id is None:
            # Fallback if not found, though it should be there
            audio_token_id = processor.audio_token_id

        out_mm_data = out_mm_kwargs.get_data()
        feature_attention_mask = out_mm_data.get("feature_attention_mask")
        chunk_counts = out_mm_data.get("chunk_counts")

        def get_replacement_audioflamingo3(item_idx: int):
            # Compute the number of encoder output frames for this audio.
            if feature_attention_mask is not None:
                if chunk_counts is not None:
                    # Slice this audio's chunks out of the flattened chunk
                    # dimension using the cumulative chunk counts.
                    counts = (
                        chunk_counts.tolist()
                        if isinstance(chunk_counts, torch.Tensor)
                        else chunk_counts
                    )
                    start_idx = sum(counts[:item_idx])
                    count = counts[item_idx]
                    end_idx = start_idx + count

                    if isinstance(feature_attention_mask, list):
                        mask_list = feature_attention_mask[start_idx:end_idx]
                        if len(mask_list) > 0 and isinstance(
                            mask_list[0], torch.Tensor
                        ):
                            mask = torch.stack(mask_list)
                        else:
                            mask = torch.tensor(mask_list)
                    else:
                        mask = feature_attention_mask[start_idx:end_idx]
                else:
                    # feature_attention_mask is list[Tensor] or Tensor
                    if isinstance(feature_attention_mask, list):
                        mask = feature_attention_mask[item_idx]
                    else:
                        mask = feature_attention_mask[item_idx].unsqueeze(0)

                # mask shape: (num_chunks, 3000)
                # Mirrors AudioFlamingo3Encoder._get_feat_extract_output_lengths:
                # stride-2 conv downsampling, then avg-pool halving.
                input_lengths = mask.sum(-1)
                conv_lengths = (input_lengths - 1) // 2 + 1
                audio_output_lengths = (conv_lengths - 2) // 2 + 1
                num_features = audio_output_lengths.sum().item()
            else:
                # Precomputed embeddings: length is given directly.
                audio_embeds = out_mm_data["audio_embeds"][item_idx]
                num_features = audio_embeds.shape[0]

            if num_features == 0:
                raise ValueError("Audio is too short")

            audio_tokens = [audio_token_id] * int(num_features)
            return PromptUpdateDetails.select_token_id(
                audio_tokens,
                embed_token_id=audio_token_id,
            )

        return [
            PromptReplacement(
                modality="audio",
                target=audio_token,
                replacement=get_replacement_audioflamingo3,
            )
        ]


@MULTIMODAL_REGISTRY.register_processor(
    AudioFlamingo3MultiModalProcessor,
    info=AudioFlamingo3ProcessingInfo,
    dummy_inputs=AudioFlamingo3DummyInputsBuilder,
)
class AudioFlamingo3ForConditionalGeneration(
    nn.Module, SupportsMultiModal, SupportsPP, SupportsLoRA
):
    """
    AudioFlamingo3 model for conditional generation.

    This model integrates a Whisper-based audio encoder with a Qwen2 language model.
    It supports multi-chunk audio processing.
    """

    # Maps fused module names to the per-checkpoint weight names that are
    # stacked into them during weight loading (used by LoRA / weight loaders).
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    }

    def get_mm_mapping(self) -> MultiModelKeys:
        """
        Get the module prefix in multimodal models

        Returns:
            MultiModelKeys mapping the language model, connector
            (projector), and audio tower to their module prefixes.
        """
        return MultiModelKeys.from_string_field(
            language_model="language_model.",
            connector="multi_modal_projector.",
            tower_model="audio_tower.",
        )

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = "") -> None:
        """
        Build the audio tower, projector, and language model.

        Args:
            vllm_config: Global vLLM configuration; provides the HF config,
                quantization config, and multimodal config.
            prefix: Module-name prefix used when registering submodules
                (e.g. for weight-name matching during loading).
        """
        super().__init__()
        config = vllm_config.model_config.hf_config
        quant_config = vllm_config.quant_config
        multimodal_config = vllm_config.model_config.multimodal_config
        self.config = config
        self.multimodal_config = multimodal_config

        # Whisper-style audio encoder operating on mel features.
        self.audio_tower = AudioFlamingo3Encoder(
            config.audio_config,
        )
        # Projects audio encoder outputs into the LLM embedding space.
        self.multi_modal_projector = AudioFlamingo3MultiModalProjector(config)

        self.quant_config = quant_config

        # Decoder-only LM backbone (Qwen2 architecture).
        self.language_model = init_vllm_registered_model(
            vllm_config=vllm_config,
            hf_config=config.text_config,
            prefix=maybe_prefix(prefix, "language_model"),
            architectures=["Qwen2ForCausalLM"],
        )

        self.make_empty_intermediate_tensors = (
            self.language_model.make_empty_intermediate_tensors
        )

    def _parse_and_validate_audio_input(
        self, **kwargs: object
    ) -> AudioFlamingo3Inputs | None:
        """
        Extract audio inputs from multimodal kwargs.

        Returns:
            - None if no audio inputs are present.
            - AudioFlamingo3EmbeddingInputs if precomputed audio embeddings
              are provided (takes precedence over raw features).
            - AudioFlamingo3FeatureInputs for raw mel features plus their
              attention mask and optional per-item chunk counts.
        """
        input_features = kwargs.pop("input_features", None)
        audio_embeds = kwargs.pop("audio_embeds", None)
        feature_attention_mask = kwargs.pop("feature_attention_mask", None)
        chunk_counts = kwargs.pop("chunk_counts", None)

        if input_features is None and audio_embeds is None:
            return None

        if audio_embeds is not None:
            return AudioFlamingo3EmbeddingInputs(
                type="audio_embeds", audio_embeds=audio_embeds
            )

        if input_features is not None:
            return AudioFlamingo3FeatureInputs(
                type="audio_features",
                input_features=input_features,
                feature_attention_mask=feature_attention_mask,
                chunk_counts=chunk_counts,
            )

        raise AssertionError("This line should be unreachable.")

    def _process_audio_input(
        self, audio_input: AudioFlamingo3Inputs
    ) -> torch.Tensor | tuple[torch.Tensor, ...]:
        """
        Encode audio inputs into per-audio-item embedding tensors.

        For precomputed embeddings, simply returns them as a tuple. For raw
        features, runs the audio tower + projector, masks out padded
        positions, and regroups per-chunk embeddings into one tensor per
        original audio item (concatenating chunks of the same item).

        Args:
            audio_input: Parsed audio input (embeddings or features).

        Returns:
            Tuple of 2-D embedding tensors, one per audio item.
        """
        if audio_input["type"] == "audio_embeds":
            audio_embeds = audio_input["audio_embeds"]
            return tuple(audio_embeds)

        input_features = audio_input["input_features"]
        feature_attention_mask = audio_input["feature_attention_mask"]
        chunk_counts = audio_input.get("chunk_counts")

        # Flatten list-of-tensors batches into single stacked tensors.
        if isinstance(input_features, list):
            input_features = torch.cat(input_features, dim=0)
            feature_attention_mask = torch.cat(feature_attention_mask, dim=0)

        # Normalize chunk_counts to a plain list[int]; default is one chunk
        # per row of input_features.
        if chunk_counts is None:
            chunk_counts = [1] * input_features.shape[0]
        elif isinstance(chunk_counts, torch.Tensor):
            chunk_counts = chunk_counts.tolist()
        elif (
            isinstance(chunk_counts, list)
            and chunk_counts
            and isinstance(chunk_counts[0], torch.Tensor)
        ):
            chunk_counts = [c.item() for c in chunk_counts]

        # Calculate output lengths
        # Valid (non-padded) mel frames per chunk.
        input_lengths = feature_attention_mask.sum(-1)
        # Conv downsampling
        conv_lengths = (input_lengths - 1) // 2 + 1
        # AvgPool downsampling
        audio_output_lengths = (conv_lengths - 2) // 2 + 1

        batch_size, _, max_mel_seq_len = input_features.shape

        # Calculate max_seq_len after convs (before pooling) for attention mask
        max_seq_len = (max_mel_seq_len - 1) // 2 + 1

        # Create a sequence tensor of shape (batch_size, max_seq_len)
        seq_range = (
            torch.arange(
                0,
                max_seq_len,
                dtype=conv_lengths.dtype,
                device=conv_lengths.device,
            )
            .unsqueeze(0)
            .expand(batch_size, max_seq_len)
        )
        lengths_expand = conv_lengths.unsqueeze(-1).expand(batch_size, max_seq_len)
        # Create mask
        # True at padded positions (index >= valid length for that chunk).
        padding_mask = seq_range >= lengths_expand

        # Broadcast to a (batch, 1, q_len, kv_len) additive attention mask.
        audio_attention_mask_ = padding_mask.view(batch_size, 1, 1, max_seq_len).expand(
            batch_size, 1, max_seq_len, max_seq_len
        )
        audio_attention_mask = audio_attention_mask_.to(
            dtype=self.audio_tower.conv1.weight.dtype,
            device=self.audio_tower.conv1.weight.device,
        )
        # Padded positions get -inf so softmax attention ignores them.
        audio_attention_mask[audio_attention_mask_] = float("-inf")

        # Forward pass
        audio_features = self.audio_tower(
            input_features, attention_mask=audio_attention_mask
        )

        # Project
        audio_features = self.multi_modal_projector(audio_features)

        # Masking after pooling
        # Keep only the first audio_output_lengths[i] tokens of each chunk.
        num_audios, max_audio_tokens, embed_dim = audio_features.shape
        audio_output_lengths = audio_output_lengths.unsqueeze(1)
        audio_features_mask = (
            torch.arange(max_audio_tokens)
            .expand(num_audios, max_audio_tokens)
            .to(audio_output_lengths.device)
            < audio_output_lengths
        )
        masked_audio_features = audio_features[audio_features_mask].view(-1, embed_dim)

        # Split to tuple of embeddings for individual audio input.
        chunk_embeddings = torch.split(
            masked_audio_features, audio_output_lengths.flatten().tolist()
        )

        # Re-group consecutive chunks into one embedding tensor per
        # original audio item, following chunk_counts.
        grouped_embeddings = []
        current_idx = 0
        for count in chunk_counts:
            audio_chunks = chunk_embeddings[current_idx : current_idx + count]
            grouped_embeddings.append(torch.cat(audio_chunks, dim=0))
            current_idx += count
        return tuple(grouped_embeddings)

    def get_language_model(self) -> torch.nn.Module:
        """Return the underlying language model backbone."""
        return self.language_model

    def embed_multimodal(self, **kwargs: object) -> MultiModalEmbeddings:
        """
        Compute multimodal (audio) embeddings from processor kwargs.

        Returns:
            Per-audio-item embedding tensors, or an empty list if the
            request contains no audio input.
        """
        audio_input = self._parse_and_validate_audio_input(**kwargs)
        if audio_input is None:
            return []
        masked_audio_features = self._process_audio_input(audio_input)
        return masked_audio_features

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: IntermediateTensors | None = None,
        inputs_embeds: torch.Tensor | None = None,
        **kwargs: object,
    ) -> torch.Tensor | IntermediateTensors:
        """
        Run the language model.

        Args:
            input_ids: Token IDs for the current forward pass.
            positions: Position IDs matching input_ids.
            intermediate_tensors: Pipeline-parallel intermediates from the
                previous stage; when present, inputs_embeds is ignored.
            inputs_embeds: Optional precomputed input embeddings (with
                audio embeddings already merged in).

        Returns:
            Hidden states, or IntermediateTensors for non-final PP stages.
        """
        # Non-first pipeline stages consume intermediate tensors, not
        # embeddings, so drop inputs_embeds in that case.
        if intermediate_tensors is not None:
            inputs_embeds = None

        hidden_states = self.language_model.model(
            input_ids,
            positions,
            intermediate_tensors,
            inputs_embeds=inputs_embeds,
        )
        return hidden_states

    def compute_logits(
        self,
        hidden_states: torch.Tensor,
    ) -> torch.Tensor | None:
        """Compute vocabulary logits from hidden states via the LM head."""
        return self.language_model.compute_logits(hidden_states)

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        """Load checkpoint weights; returns the set of loaded param names."""
        loader = AutoWeightsLoader(self)
        return loader.load_weights(weights)

+ 3
- 3
vllm/model_executor/models/qwen2_5_vl.py View File

@@ -612,7 +612,7 @@ class Qwen2_5_VisionTransformer(nn.Module):
# DO NOT MOVE THIS IMPORT
from vllm.compilation.backends import set_model_tag

with set_model_tag("Qwen2_5_VisionPatchEmbed"):
with set_model_tag("Qwen2_5_VisionPatchEmbed", is_encoder=True):
self.patch_embed = Qwen2_5_VisionPatchEmbed(
patch_size=patch_size,
temporal_patch_size=temporal_patch_size,
@@ -651,7 +651,7 @@ class Qwen2_5_VisionTransformer(nn.Module):
f"Qwen2.5-VL does not support {self.attn_backend} backend now."
)

with set_model_tag("Qwen2_5_VisionBlock"):
with set_model_tag("Qwen2_5_VisionBlock", is_encoder=True):
self.blocks = nn.ModuleList(
[
Qwen2_5_VisionBlock(
@@ -670,7 +670,7 @@ class Qwen2_5_VisionTransformer(nn.Module):
]
)

with set_model_tag("Qwen2_5_VisionPatchMerger"):
with set_model_tag("Qwen2_5_VisionPatchMerger", is_encoder=True):
self.merger = Qwen2_5_VisionPatchMerger(
d_model=vision_config.out_hidden_size,
context_dim=self.hidden_size,


+ 12
- 1
vllm/model_executor/models/qwen2_vl.py View File

@@ -50,7 +50,7 @@ from vllm.attention.layer import (
)
from vllm.config import VllmConfig
from vllm.config.multimodal import BaseDummyOptions
from vllm.distributed import parallel_state
from vllm.distributed import parallel_state, tensor_model_parallel_all_gather
from vllm.distributed import utils as dist_utils
from vllm.logger import init_logger
from vllm.model_executor.layers.activation import QuickGELU
@@ -360,10 +360,21 @@ class Qwen2VisionAttention(nn.Module):
def split_qkv(self, qkv: torch.Tensor) -> tuple[torch.Tensor, ...]:
# [s, b, 3 * head * head_dim]
seq_len, bs, _ = qkv.shape
if self.tp_size > 1:
qkv = tensor_model_parallel_all_gather(qkv)

# [s, b, 3 * head * head_dim] -> 3 * [s, b, head * head_dim]
q, k, v = qkv.chunk(3, dim=2)

# 3 * [s, b, head * head_dim]
if self.tp_size > 1:
splitter = partial(
dist_utils.split_tensor_along_last_dim, num_partitions=self.tp_size
)
q = splitter(q)[self.tp_rank]
k = splitter(k)[self.tp_rank]
v = splitter(v)[self.tp_rank]

# 3 * [s, b, head * head_dim] -> 3 * [s, b, head, head_dim]
new_shape = (
seq_len,


+ 424
- 12
vllm/model_executor/models/qwen3_vl.py View File

@@ -67,12 +67,19 @@ from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
from vllm.model_executor.model_loader.weight_utils import default_weight_loader
from vllm.model_executor.models.module_mapping import MultiModelKeys
from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.multimodal.evs import (
compute_mrope_for_media,
compute_retained_tokens_count,
compute_retention_mask,
recompute_mrope_positions,
)
from vllm.multimodal.inputs import (
MultiModalDataDict,
MultiModalFeatureSpec,
MultiModalFieldConfig,
MultiModalKwargsItem,
MultiModalKwargsItems,
PlaceholderRange,
VideoItem,
)
from vllm.multimodal.parse import ImageSize, MultiModalDataItems, MultiModalDataParser
@@ -92,6 +99,7 @@ from .interfaces import (
SupportsLoRA,
SupportsMRoPE,
SupportsMultiModal,
SupportsMultiModalPruning,
SupportsPP,
_require_is_multimodal,
)
@@ -1043,13 +1051,39 @@ class Qwen3VLMultiModalProcessor(BaseMultiModalProcessor[Qwen3VLProcessingInfo])
tokenizer.encode(f"<{curr_time:.1f} seconds>", add_special_tokens=False)
for curr_time in timestamps
]
num_tokens_per_frame = int(grid_thw[1:].prod()) // merge_length
tokens_per_frame = int(grid_thw[1:].prod()) // merge_length
per_frame_token_counts = [tokens_per_frame for _ in frames_idx_token]

video_pruning_rate = self.info.ctx.get_mm_config().video_pruning_rate
if video_pruning_rate is not None and video_pruning_rate > 0.0:
total_retained = compute_retained_tokens_count(
tokens_per_frame,
len(frames_idx_token),
video_pruning_rate,
)
if len(frames_idx_token) == 0:
per_frame_token_counts = []
elif len(frames_idx_token) == 1:
per_frame_token_counts = [tokens_per_frame]
else:
first_frame_tokens = tokens_per_frame
remaining_tokens = max(total_retained - first_frame_tokens, 0)
base = remaining_tokens // (len(frames_idx_token) - 1)
remainder = remaining_tokens % (len(frames_idx_token) - 1)
per_frame_token_counts = [first_frame_tokens]
for frame_idx in range(1, len(frames_idx_token)):
extra = base + (1 if (frame_idx - 1) < remainder else 0)
per_frame_token_counts.append(extra)

placeholder = []
for frame_idx in frames_idx_token:
placeholder.extend(frame_idx)
for frame_idx, timestamp_tokens in enumerate(frames_idx_token):
placeholder.extend(timestamp_tokens)
tokens_this_frame = per_frame_token_counts[
frame_idx if frame_idx < len(per_frame_token_counts) else -1
]
placeholder.extend(
[vision_start_token_id]
+ [video_token_id] * num_tokens_per_frame
+ [video_token_id] * tokens_this_frame
+ [vision_end_token_id]
)
return PromptUpdateDetails.select_token_id(placeholder, video_token_id)
@@ -1190,6 +1224,7 @@ class Qwen3VLForConditionalGeneration(
SupportsPP,
SupportsMRoPE,
SupportsEagle3,
SupportsMultiModalPruning,
):
packed_modules_mapping = {
"qkv_proj": [
@@ -1232,6 +1267,11 @@ class Qwen3VLForConditionalGeneration(
self.config = config
self.multimodal_config = multimodal_config
self.use_data_parallel = multimodal_config.mm_encoder_tp_mode == "data"
self.video_pruning_rate = multimodal_config.video_pruning_rate
self.is_multimodal_pruning_enabled = (
multimodal_config.is_multimodal_pruning_enabled()
)

if not multimodal_config.get_limit_per_prompt(
"image"
) and not multimodal_config.get_limit_per_prompt("video"):
@@ -1418,6 +1458,109 @@ class Qwen3VLForConditionalGeneration(
sizes = (grid_thw.prod(-1) // merge_size // merge_size).tolist()
return video_embeds.split(sizes)

def _postprocess_image_embeds_evs(
self,
image_embeds_split: tuple[torch.Tensor, ...],
image_input: Qwen2_5_VLImageInputs,
) -> tuple[torch.Tensor, ...]:
"""
Append mrope positions for each for images.
This is necessary to recover correct mrope
positions after video pruning

Args:
image_embeds_split: Tuple of image embeddings for
each image item.
image_input: Image input data.

Returns:
Tuple of image embeddings for each image item.
Resulting embeddings will have extra 4 channels for
computed mrope positions.
"""
merge_size = self.visual.spatial_merge_size
grid_thw = image_input["image_grid_thw"]
grid_thw_list = grid_thw.tolist()
image_embeds_out = []
for emb, size in zip(image_embeds_split, grid_thw_list):
positions = compute_mrope_for_media(size, merge_size).to(emb.device)
emb = torch.cat([emb, positions], dim=1)
image_embeds_out.append(emb)
image_embeds_split = image_embeds_out
return tuple(image_embeds_split)

def _postprocess_video_embeds_evs(
self,
video_embeds_split: tuple[torch.Tensor, ...],
video_input: Qwen2_5_VLVideoInputs,
) -> tuple[torch.Tensor, ...]:
"""
Prunes video embeddings via Efficient Video Sampling (EVS)
and then appends mrope positions for each retained embeddings

Args:
video_embeds_split: Tuple of video embeddings for each video item.
video_input: Video input data.

Returns:
Tuple of video embeddings for each video item.
Resulting embeddings will have extra 4 channels for
computed mrope positions.
"""
grid_thw = video_input["video_grid_thw"]
assert grid_thw.ndim == 2
grid_thw_list = grid_thw.tolist()
merge_size = self.visual.spatial_merge_size

# Cast to long to match the original code
# https://github.com/huggingface/transformers/blob/41980ce93e775f6c88500c51c8db7946fc6a2add/src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py#L491 # noqa
second_per_grid_ts = video_input.get("second_per_grid_ts")
if second_per_grid_ts is None:
# For Qwen3-VL, second_per_grid_ts might not be available
# Use default value of 1.0 for each video
second_per_grid_ts = torch.ones(len(grid_thw_list), dtype=torch.long)
else:
second_per_grid_ts = second_per_grid_ts.long()
tokens_per_second = getattr(self.config.vision_config, "tokens_per_second", 1.0)

video_embeds_out = []
for emb, size, video_second_per_grid_t in zip(
video_embeds_split, grid_thw_list, second_per_grid_ts
):
# For each video, we compute retention mask using EVS
retention_mask = compute_retention_mask(
emb,
size,
spatial_merge_size=self.visual.spatial_merge_size,
q=self.video_pruning_rate,
)

# Debug logging for EVS pruning
logger.debug(
"EVS: Video tokens pruned from %d to %d (T=%d,H=%d,W=%d, "
"pruning_rate=%.2f, reduction=%.1f%%)",
emb.shape[0],
retention_mask.sum().item(),
size[0],
size[1],
size[2],
self.video_pruning_rate,
(1 - retention_mask.float().mean().item()) * 100,
)

positions = compute_mrope_for_media(
size,
merge_size,
tokens_per_second=tokens_per_second,
video_second_per_grid=video_second_per_grid_t.item(),
).to(emb.device)

emb = emb[retention_mask]
positions = positions[retention_mask]
emb = torch.cat([emb, positions], dim=1)
video_embeds_out.append(emb)
return tuple(video_embeds_out)

def _parse_and_validate_multimodal_inputs(self, **kwargs: object) -> dict:
mm_input_by_modality = {}
for input_key in kwargs:
@@ -1440,6 +1583,20 @@ class Qwen3VLForConditionalGeneration(
def iter_mm_grid_hw(
self, input_tokens: list[int], mm_features: list[MultiModalFeatureSpec]
) -> Iterator[tuple[int, int, int]]:
"""
Iterate over multimodal features and yield grid information.

For videos with EVS (Efficient Video Sampling) enabled, this function
computes the offset based on the pruned token count rather than relying
on input_tokens.index(), which would fail when tokens are pruned.

Args:
input_tokens: List of token IDs in the prompt
mm_features: List of multimodal feature specifications

Yields:
Tuple of (offset, grid_h, grid_w) for each frame/image
"""
video_token_id = self.config.video_token_id
spatial_merge_size = self.config.vision_config.spatial_merge_size
for mm_feature in sorted(mm_features, key=lambda f: f.mm_position.offset):
@@ -1452,42 +1609,289 @@ class Qwen3VLForConditionalGeneration(
t, h, w = mm_feature.data["video_grid_thw"].data.tolist()
llm_grid_h = h // spatial_merge_size
llm_grid_w = w // spatial_merge_size
for _ in range(t):
offset = input_tokens.index(video_token_id, offset)
yield offset, llm_grid_h, llm_grid_w
offset += llm_grid_h * llm_grid_w

# Check if EVS (Efficient Video Sampling) is enabled
is_evs_enabled = (
hasattr(self, "video_pruning_rate")
and self.video_pruning_rate is not None
and self.video_pruning_rate > 0.0
)

if is_evs_enabled:
frame_offsets = self._extract_frame_offsets_from_mask(
mm_feature.mm_position, t
)
if frame_offsets is not None:
for rel_offset in frame_offsets:
yield offset + rel_offset, llm_grid_h, llm_grid_w
continue

# If EVS is enabled but mask is missing, this indicates a bug
# in the prompt processing pipeline. The is_embed mask should
# always be present when video_pruning_rate > 0.
raise RuntimeError(
f"EVS is enabled (pruning_rate={self.video_pruning_rate}) "
"but is_embed mask is missing from mm_position. "
"This indicates a bug in prompt processing."
)
else:
# Non-EVS mode: Use original logic with input_tokens.index()
for _ in range(t):
offset = input_tokens.index(video_token_id, offset)
yield offset, llm_grid_h, llm_grid_w
offset += llm_grid_h * llm_grid_w
else:
raise ValueError(f"Unsupported modality: {mm_feature.modality}")

def _get_evs_mask_segments(
self, mm_position: PlaceholderRange, expected_frames: int
) -> list[torch.Tensor] | None:
"""Extract contiguous segments from EVS is_embed mask.

The EVS (Efficient Video Sampling) mask marks which placeholder
positions should be filled with video embeddings. This method splits
the mask into contiguous segments, where each segment represents one
retained frame.

This is a pure function - it does not modify any state and always
returns the same output for the same input (idempotent).

Args:
mm_position: MultiModal position containing the is_embed mask
expected_frames: Expected number of frame segments

Returns:
List of tensors, each containing indices for one frame segment,
or None if EVS is not enabled or validation fails.
"""
is_embed_mask = getattr(mm_position, "is_embed", None)
if is_embed_mask is None:
return None

# Find all True positions in the mask
mask_tensor = torch.as_tensor(is_embed_mask, dtype=torch.bool).view(-1)
true_indices = torch.nonzero(mask_tensor, as_tuple=False).flatten()
if true_indices.numel() == 0:
return None

# Split into contiguous segments (where diff > 1 indicates a gap)
if true_indices.numel() == 1:
segments = [true_indices]
else:
diffs = torch.diff(true_indices)
split_points = torch.nonzero(diffs != 1, as_tuple=False).flatten()
if split_points.numel() == 0:
segments = [true_indices]
else:
segments = torch.tensor_split(
true_indices, split_points.add(1).tolist()
)

# Validate segment count matches expected frames
if len(segments) < expected_frames:
logger.debug(
"EVS mask segments (%d) do not match expected frames (%d)",
len(segments),
expected_frames,
)
return None

return segments[:expected_frames]

def _extract_frame_offsets_from_mask(
self, mm_position: PlaceholderRange, expected_frames: int
) -> list[int] | None:
"""Return relative offsets for each EVS-retained frame.

The prompt processor stores a boolean mask inside ``mm_position`` that
marks which placeholder locations should be populated with video
embeddings. By splitting that mask into contiguous runs we can recover
the start of every retained frame without probing ``input_tokens``.

Args:
mm_position: MultiModal position containing the is_embed mask
expected_frames: Expected number of frames

Returns:
List of starting offsets (relative to mm_position) for each frame,
or None if EVS is not enabled.
"""
segments = self._get_evs_mask_segments(mm_position, expected_frames)
if segments is None:
return None

return [int(segment[0].item()) for segment in segments]

def _get_actual_frame_token_counts(
self, mm_position: PlaceholderRange, expected_frames: int
) -> list[int] | None:
"""Return actual token count for each EVS-retained frame.

This function calculates the actual number of tokens per frame by
analyzing the is_embed mask, accounting for EVS pruning. Each frame
may have a different token count due to content-aware pruning.

Args:
mm_position: MultiModal position containing the is_embed mask
expected_frames: Expected number of frames

Returns:
List of token counts for each frame, or None if EVS is not enabled.
"""
segments = self._get_evs_mask_segments(mm_position, expected_frames)
if segments is None:
return None

return [len(seg) for seg in segments]

def recompute_mrope_positions(
self,
input_ids: list[int],
multimodal_embeddings: tuple[torch.Tensor, ...],
mrope_positions: torch.LongTensor,
num_computed_tokens: int,
) -> tuple[tuple[torch.Tensor, ...], torch.Tensor, int]:
"""
Update part of input mrope positions (starting with
num_computed_tokens index). Original mrope_positions are computed
for unpruned sequence and becomes incorrect once pruning occurs,
so once we prune media tokens we should reflect this in the
mrope_positions before we feed it to LLM.

Args:
input_ids: (N,) All input tokens of the prompt (Containing
entire sequence).
multimodal_embeddings: Tuple of multimodal embeddings.
mrope_positions: Existing mrope positions (3, N) for entire
sequence
num_computed_tokens: A number of computed tokens so far.

Returns:
Tuple of (multimodal_embeddings, mrope_positions,
mrope_position_delta).
"""
image_token_id = self.config.image_token_id
video_token_id = self.config.video_token_id
vision_start_token_id = self.config.vision_start_token_id

# Device
device = (
multimodal_embeddings[0].device
if len(multimodal_embeddings)
else mrope_positions.device
)

# Tensors
input_ids_t = torch.as_tensor(input_ids, device=device, dtype=torch.long)

mm_embeddings_out = [mm[:, :-4] for mm in multimodal_embeddings]
mm_embeddings_pos = [
mm[:, -4:].permute(1, 0).long() for mm in multimodal_embeddings
]

positions, mrope_positions_delta = recompute_mrope_positions(
input_ids_t,
mm_embeddings_pos,
mrope_positions,
num_computed_tokens,
vision_start_token_id,
image_token_id,
video_token_id,
)

return tuple(mm_embeddings_out), positions, mrope_positions_delta

def get_mrope_input_positions(
self,
input_tokens: list[int],
mm_features: list[MultiModalFeatureSpec],
) -> tuple[torch.Tensor, int]:
# Pre-collect actual frame token counts for EVS mode
frame_token_counts_map = {}
for mm_feature in mm_features:
if mm_feature.modality == "video":
is_evs_enabled = (
hasattr(self, "video_pruning_rate")
and self.video_pruning_rate is not None
and self.video_pruning_rate > 0.0
)
if is_evs_enabled:
t = mm_feature.data["video_grid_thw"].data.tolist()[0]
token_counts = self._get_actual_frame_token_counts(
mm_feature.mm_position, t
)
assert token_counts is not None, (
"EVS enabled but failed to extract frame token counts "
"from is_embed mask"
)
frame_token_counts_map[mm_feature.mm_position.offset] = token_counts

llm_pos_ids_list = []
st = 0
frame_counts_idx = {}

for offset, llm_grid_h, llm_grid_w in self.iter_mm_grid_hw(
input_tokens, mm_features
):
text_len = offset - st
st_idx = llm_pos_ids_list[-1].max() + 1 if len(llm_pos_ids_list) > 0 else 0
llm_pos_ids_list.append(

# Determine actual token count for this frame
base_offset = None
for feat_offset in frame_token_counts_map:
if offset >= feat_offset:
base_offset = feat_offset

if base_offset is not None:
# EVS mode: use actual token count from is_embed mask
assert base_offset in frame_token_counts_map, (
f"Found base_offset {base_offset} but not in frame_token_counts_map"
)

if base_offset not in frame_counts_idx:
frame_counts_idx[base_offset] = 0

counts = frame_token_counts_map[base_offset]
idx = frame_counts_idx[base_offset]

assert idx < len(counts), (
f"EVS frame index {idx} out of range (total frames: {len(counts)})"
)

actual_frame_tokens = counts[idx]
frame_counts_idx[base_offset] += 1
else:
# Non-EVS mode (or image): use theoretical grid size
actual_frame_tokens = llm_grid_h * llm_grid_w

# Add text segment
text_positions = (
np.broadcast_to(np.arange(text_len), (3, text_len)) + st_idx
)
llm_pos_ids_list.append(text_positions)
st_idx += text_len

# Add frame segment with actual token count (not theoretical)
grid_indices = np.indices((1, llm_grid_h, llm_grid_w)).reshape(3, -1)
llm_pos_ids_list.append(grid_indices + text_len + st_idx)
st = offset + llm_grid_h * llm_grid_w
# Only take the first actual_frame_tokens positions
frame_positions = grid_indices[:, :actual_frame_tokens] + st_idx
llm_pos_ids_list.append(frame_positions)

# Update st using actual token count
st = offset + actual_frame_tokens

# Handle final text segment
if st < len(input_tokens):
st_idx = llm_pos_ids_list[-1].max() + 1 if len(llm_pos_ids_list) > 0 else 0
text_len = len(input_tokens) - st
llm_pos_ids_list.append(
final_text_positions = (
np.broadcast_to(np.arange(text_len), (3, text_len)) + st_idx
)
llm_pos_ids_list.append(final_text_positions)

llm_positions = np.concatenate(llm_pos_ids_list, axis=1).reshape(3, -1)
mrope_position_delta = (llm_positions.max() + 1 - len(input_tokens)).item()

return torch.from_numpy(llm_positions), mrope_position_delta

def get_language_model(self) -> torch.nn.Module:
@@ -1508,9 +1912,17 @@ class Qwen3VLForConditionalGeneration(
multimodal_input = mm_input_by_modality[modality]
if modality == "image":
image_embeddings = self._process_image_input(multimodal_input)
if self.is_multimodal_pruning_enabled:
image_embeddings = self._postprocess_image_embeds_evs(
image_embeddings, multimodal_input
)
multimodal_embeddings += tuple(image_embeddings)
if modality == "video":
video_embeddings = self._process_video_input(multimodal_input)
if self.is_multimodal_pruning_enabled:
video_embeddings = self._postprocess_video_embeds_evs(
video_embeddings, multimodal_input
)
multimodal_embeddings += tuple(video_embeddings)
return multimodal_embeddings



+ 4
- 0
vllm/model_executor/models/registry.py View File

@@ -264,6 +264,10 @@ _CROSS_ENCODER_MODELS = {
_MULTIMODAL_MODELS = {
# [Decoder-only]
"AriaForConditionalGeneration": ("aria", "AriaForConditionalGeneration"),
"AudioFlamingo3ForConditionalGeneration": (
"audioflamingo3",
"AudioFlamingo3ForConditionalGeneration",
),
"AyaVisionForConditionalGeneration": (
"aya_vision",
"AyaVisionForConditionalGeneration",


+ 1
- 1
vllm/multimodal/parse.py View File

@@ -120,7 +120,7 @@ class ProcessorBatchItems(ModalityDataItems[Sequence[_T], _T]):
return self.data[index]

def get_processor_data(self) -> Mapping[str, object]:
return {f"{self.modality}s": self.data}
return {f"{self.modality}s": self.get_all()}

def get_passthrough_data(self) -> Mapping[str, object]:
return {}


+ 22
- 0
vllm/transformers_utils/config.py View File

@@ -617,6 +617,28 @@ def get_config(
hf_overrides=hf_overrides_kw,
**kwargs,
)

# Patching defaults for GGUF models
if _is_gguf:
# Some models have different default values between GGUF and HF.
def apply_gguf_default(key: str, gguf_default: Any):
"""
Apply GGUF defaults unless explicitly configured.

This function reads/writes external `config` and `config_dict`.
If the specified `key` is not in `config_dict` (i.e. not explicitly
configured and the default HF value is used), it updates the
corresponding `config` value to `gguf_default`.
"""
if key not in config_dict:
config.update({key: gguf_default})

# Apply architecture-specific GGUF defaults.
if config.model_type in {"qwen3_moe"}:
# Qwen3 MoE: norm_topk_prob is always true.
# Note that, this parameter is always false (HF default) on Qwen2 MoE.
apply_gguf_default("norm_topk_prob", True)

# Special architecture mapping check for GGUF models
if _is_gguf:
if config.model_type not in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES:


+ 0
- 17
vllm/utils/deep_gemm.py View File

@@ -381,22 +381,6 @@ def should_use_deepgemm_for_fp8_linear(
)


def should_use_deepgemm_for_fp8_linear_for_nk(
output_dtype: torch.dtype,
shape0: int,
shape1: int,
supports_deep_gemm: bool | None = None,
):
if supports_deep_gemm is None:
supports_deep_gemm = is_deep_gemm_supported()
return (
supports_deep_gemm
and output_dtype == torch.bfloat16
and shape0 % 128 == 0
and shape1 % 128 == 0
)


__all__ = [
"calc_diff",
"DeepGemmQuantScaleFMT",
@@ -411,7 +395,6 @@ __all__ = [
"is_deep_gemm_supported",
"get_num_sms",
"should_use_deepgemm_for_fp8_linear",
"should_use_deepgemm_for_fp8_linear_for_nk",
"get_col_major_tma_aligned_tensor",
"get_mk_alignment_for_contiguous_layout",
]

+ 2
- 23
vllm/utils/torch_utils.py View File

@@ -194,33 +194,12 @@ def get_kv_cache_torch_dtype(
return torch_dtype


def get_kv_cache_quant_algo_dtype(quant_cfg: dict[str, Any]) -> torch.dtype | None:
quant_method = quant_cfg.get("quant_method", "")
if quant_method.startswith("modelopt"):
quantization_inner = quant_cfg.get("quantization", quant_cfg)
# Check if quant config is specified and use kv cache quant algo
kv_algo = quantization_inner.get("kv_cache_quant_algo") or quant_cfg.get(
"kv_cache_quant_algo"
)
if isinstance(kv_algo, str):
return STR_DTYPE_TO_TORCH_DTYPE[kv_algo.lower()]
return None


def kv_cache_dtype_str_to_dtype(
kv_cache_dtype: str, model_config: ModelConfig
) -> torch.dtype:
# Model config may not be specified for unit tests, default to float16
dtype = model_config.dtype if model_config else torch.half
if kv_cache_dtype == "auto":
hf_cfg = getattr(model_config, "hf_config", None)
if hf_cfg is not None:
quant_cfg = getattr(hf_cfg, "quantization_config", None)
if quant_cfg is not None:
kv_algo_dtype = get_kv_cache_quant_algo_dtype(quant_cfg)
return kv_algo_dtype if kv_algo_dtype is not None else dtype
return dtype

# Model config may not be specified for unit tests, default to float16
return model_config.dtype if model_config else torch.half
return STR_DTYPE_TO_TORCH_DTYPE[kv_cache_dtype]




+ 1
- 1
vllm/v1/attention/backends/gdn_attn.py View File

@@ -211,7 +211,7 @@ class GDNAttentionMetadataBuilder(AttentionMetadataBuilder[GDNAttentionMetadata]
spec_token_masks = torch.repeat_interleave(
spec_sequence_masks, query_lens
)
index = torch.argsort(spec_token_masks)
index = torch.argsort(spec_token_masks, stable=True)
num_non_spec_tokens = num_prefill_tokens + num_decode_tokens
non_spec_token_indx = index[:num_non_spec_tokens]
spec_token_indx = index[num_non_spec_tokens:]


+ 7
- 7
vllm/v1/kv_offload/cpu.py View File

@@ -13,7 +13,7 @@ from vllm.v1.kv_offload.backends.cpu import CPUBackend
from vllm.v1.kv_offload.lru_manager import LRUOffloadingManager
from vllm.v1.kv_offload.mediums import CPULoadStoreSpec, GPULoadStoreSpec
from vllm.v1.kv_offload.spec import OffloadingSpec
from vllm.v1.kv_offload.worker.cpu_gpu import CpuGpuOffloadingHandler
from vllm.v1.kv_offload.worker.cpu_gpu import CpuGpuOffloadingHandlers
from vllm.v1.kv_offload.worker.worker import OffloadingHandler


@@ -32,7 +32,7 @@ class CPUOffloadingSpec(OffloadingSpec):
self._manager: OffloadingManager | None = None

# worker-side
self._handler: OffloadingHandler | None = None
self._handlers: CpuGpuOffloadingHandlers | None = None

self.eviction_policy: str = self.extra_config.get("eviction_policy", "lru")

@@ -67,13 +67,13 @@ class CPUOffloadingSpec(OffloadingSpec):
kv_caches: dict[str, torch.Tensor],
attn_backends: dict[str, type[AttentionBackend]],
) -> Iterator[tuple[type[LoadStoreSpec], type[LoadStoreSpec], OffloadingHandler]]:
if not self._handler:
if not self._handlers:
if not current_platform.is_cuda_alike():
raise Exception(
"CPU Offloading is currently only supported on CUDA-alike GPUs"
)

self._handler = CpuGpuOffloadingHandler(
self._handlers = CpuGpuOffloadingHandlers(
attn_backends=attn_backends,
gpu_block_size=self.gpu_block_size,
cpu_block_size=self.offloaded_block_size,
@@ -81,6 +81,6 @@ class CPUOffloadingSpec(OffloadingSpec):
gpu_caches=kv_caches,
)

assert self._handler is not None
yield GPULoadStoreSpec, CPULoadStoreSpec, self._handler
yield CPULoadStoreSpec, GPULoadStoreSpec, self._handler
assert self._handlers is not None
yield GPULoadStoreSpec, CPULoadStoreSpec, self._handlers.gpu_to_cpu_handler
yield CPULoadStoreSpec, GPULoadStoreSpec, self._handlers.cpu_to_gpu_handler

+ 175
- 86
vllm/v1/kv_offload/worker/cpu_gpu.py View File

@@ -1,5 +1,6 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from collections import deque

import numpy as np
import torch
@@ -8,7 +9,7 @@ from vllm import _custom_ops as ops
from vllm.attention.backends.abstract import AttentionBackend
from vllm.logger import init_logger
from vllm.utils.platform_utils import is_pin_memory_available
from vllm.v1.kv_offload.mediums import CPULoadStoreSpec, GPULoadStoreSpec
from vllm.v1.kv_offload.mediums import BlockIDsLoadStoreSpec
from vllm.v1.kv_offload.worker.worker import (
OffloadingHandler,
TransferResult,
@@ -51,7 +52,123 @@ def expand_block_ids(
output_idx = output_end_idx


class CpuGpuOffloadingHandler(OffloadingHandler):
class SingleDirectionOffloadingHandler(OffloadingHandler):
    """
    SingleDirectionOffloadingHandler handles transfers for a single direction,
    either CPU->GPU or GPU->CPU.
    Transfers are guaranteed to be executed in order of their submission.
    Each transfer uses a unique CUDA stream, and its stream will start
    executing only after the streams of previous transfers have finished.
    """

    def __init__(
        self,
        src_tensors: list[torch.Tensor],
        dst_tensors: list[torch.Tensor],
        kv_dim_before_num_blocks: list[bool],
        src_block_size_factor: int,
        dst_block_size_factor: int,
        priority: int,
    ):
        """
        Initialize a SingleDirectionOffloadingHandler.

        Args:
            src_tensors: list of KV cache tensors to copy from.
            dst_tensors: list of KV cache tensors to copy to.
                Order should match src_tensors.
            kv_dim_before_num_blocks: list of bools, indicating
                whether the respective KV cache tensor has a KV
                dimension before its num_blocks dimension.
                e.g. (2, num_blocks, ...)
            src_block_size_factor: The number of kernel blocks
                per KV block in a source tensor.
            dst_block_size_factor: The number of kernel blocks
                per KV block in a destination tensor.
            priority: The priority of the backing CUDA streams.
                Lower numbers indicate higher priority.
        """
        # All three per-tensor lists must describe the same set of layers,
        # element-for-element.
        assert len(src_tensors) == len(dst_tensors) == len(kv_dim_before_num_blocks)

        self.src_tensors: list[torch.Tensor] = src_tensors
        self.dst_tensors: list[torch.Tensor] = dst_tensors
        self.kv_dim_before_num_blocks: list[bool] = kv_dim_before_num_blocks
        self.src_block_size_factor: int = src_block_size_factor
        self.dst_block_size_factor: int = dst_block_size_factor
        self.priority = priority

        # queue of transfers (job_id, stream, event), in submission order.
        # The head is always the oldest in-flight transfer, which is why
        # get_finished() can stop polling at the first incomplete event.
        self._transfers: deque[tuple[int, torch.cuda.Stream, torch.Event]] = deque()
        # list of CUDA streams available for re-use
        # (streams are recycled once their transfer completes, so the number
        # of live streams is bounded by the peak number of in-flight jobs)
        self._stream_pool: list[torch.cuda.Stream] = []
        # list of CUDA events available for re-use
        self._event_pool: list[torch.Event] = []

    def transfer_async(self, job_id: int, transfer_spec: TransferSpec) -> bool:
        """
        Asynchronously copy the KV blocks named by transfer_spec from the
        source tensors to the destination tensors.

        The copy is enqueued on a (pooled or fresh) CUDA stream; it is chained
        after the previously submitted transfer's event so transfers execute
        in submission order. Returns True to signal successful submission
        (completion is reported later via get_finished()).

        Args:
            job_id: identifier used to report completion in get_finished().
            transfer_spec: (src_spec, dst_spec) pair; both must be
                BlockIDsLoadStoreSpec instances carrying 1-D block-ID arrays.
        """
        src_spec, dst_spec = transfer_spec
        assert isinstance(src_spec, BlockIDsLoadStoreSpec)
        assert isinstance(dst_spec, BlockIDsLoadStoreSpec)

        src_blocks = src_spec.block_ids
        dst_blocks = dst_spec.block_ids
        assert src_blocks.ndim == 1
        assert dst_blocks.ndim == 1

        # Work in units of kernel ("sub") blocks: each side's KV block maps to
        # *_block_size_factor kernel blocks.
        src_sub_block_count = src_blocks.size * self.src_block_size_factor
        dst_sub_block_count = dst_blocks.size * self.dst_block_size_factor
        # When the source block size is a multiple of the destination's, the
        # leading source sub-blocks that have no destination counterpart are
        # skipped (negative-modulo yields the shortfall to full alignment).
        src_sub_blocks_to_skip = -dst_blocks.size % self.src_block_size_factor

        assert dst_sub_block_count == src_sub_block_count - src_sub_blocks_to_skip

        # Column 0 = source kernel-block ids, column 1 = destination ids;
        # this is the (src, dst) pair table consumed by ops.swap_blocks.
        src_to_dst = np.empty((dst_sub_block_count, 2), dtype=np.int64)
        expand_block_ids(
            src_blocks,
            self.src_block_size_factor,
            src_to_dst[:, 0],
            skip_count=src_sub_blocks_to_skip,
        )
        expand_block_ids(dst_blocks, self.dst_block_size_factor, src_to_dst[:, 1])
        src_to_dst_tensor = torch.from_numpy(src_to_dst)

        # Reuse a pooled stream/event when available; otherwise allocate.
        stream = (
            self._stream_pool.pop()
            if self._stream_pool
            else torch.cuda.Stream(priority=self.priority)
        )
        event = self._event_pool.pop() if self._event_pool else torch.Event()
        if self._transfers:
            _, _, last_event = self._transfers[-1]
            # assure job will start only after the previous one completes
            stream.wait_event(last_event)
        with torch.cuda.stream(stream):
            for src_tensor, dst_tensor, kv_dim in zip(
                self.src_tensors, self.dst_tensors, self.kv_dim_before_num_blocks
            ):
                if kv_dim:
                    # Tensor layout is (2, num_blocks, ...): copy the key and
                    # value planes separately.
                    src_key_cache, src_value_cache = src_tensor
                    dst_key_cache, dst_value_cache = dst_tensor
                    ops.swap_blocks(src_key_cache, dst_key_cache, src_to_dst_tensor)
                    ops.swap_blocks(src_value_cache, dst_value_cache, src_to_dst_tensor)
                else:
                    ops.swap_blocks(src_tensor, dst_tensor, src_to_dst_tensor)
        # Mark the point on the stream at which this transfer is complete.
        event.record(stream)

        self._transfers.append((job_id, stream, event))

        # success
        return True

    def get_finished(self) -> list[TransferResult]:
        """
        Poll for completed transfers.

        Returns (job_id, True) for each transfer whose event has fired,
        in submission order, and recycles its stream and event back into
        the pools. Stops at the first still-running transfer, since
        transfers complete in submission order.
        """
        results: list[TransferResult] = []
        while self._transfers and self._transfers[0][2].query():
            job_id, stream, event = self._transfers.popleft()
            results.append((job_id, True))
            self._stream_pool.append(stream)
            self._event_pool.append(event)
        return results


class CpuGpuOffloadingHandlers:
def __init__(
self,
gpu_block_size: int,
@@ -60,27 +177,20 @@ class CpuGpuOffloadingHandler(OffloadingHandler):
gpu_caches: dict[str, torch.Tensor],
attn_backends: dict[str, type[AttentionBackend]],
):
assert gpu_caches
assert cpu_block_size % gpu_block_size == 0
self.block_size_factor = cpu_block_size // gpu_block_size

# cuda streams for gpu->cpu and cpu->gpu
self.d2h_stream = torch.cuda.Stream()
self.h2d_stream = torch.cuda.Stream()

# job_id -> transfer cuda event
self.transfer_events: dict[int, torch.Event] = {}
# list of cuda events available for re-use
self.events_pool: list[torch.Event] = []
block_size_factor = cpu_block_size // gpu_block_size

pin_memory = is_pin_memory_available()

# allocate cpu tensors
logger.info("Allocating %d CPU tensors...", len(gpu_caches))
self.gpu_tensors: list[torch.Tensor] = []
self.cpu_tensors: list[torch.Tensor] = []
self.kv_dim_before_num_blocks: list[bool] = []
gpu_tensors: list[torch.Tensor] = []
cpu_tensors: list[torch.Tensor] = []
kv_dim_before_num_blocks: list[bool] = []
kernel_block_size: int | None = None
for layer_name, gpu_tensor in gpu_caches.items():
self.gpu_tensors.append(gpu_tensor)
gpu_tensors.append(gpu_tensor)

gpu_shape = gpu_tensor.shape
attn_backend = attn_backends[layer_name]
@@ -88,16 +198,21 @@ class CpuGpuOffloadingHandler(OffloadingHandler):
num_blocks=1234, block_size=16, num_kv_heads=8, head_size=256
)

has_layers_dim = False
if len(gpu_shape) != len(test_shape):
# cross-layers tensor
# shape is (num_blocks, ...)
assert len(gpu_shape) == len(test_shape) + 1
num_blocks_idx = 0
self.kv_dim_before_num_blocks.append(False)
has_layers_dim = True
kv_dim_before_num_blocks.append(False)

# prepend a dummy num_layers=80 to test_shape
test_shape = (80,) + test_shape
elif test_shape[0] == 1234:
# shape is (num_blocks, ...)
num_blocks_idx = 0
self.kv_dim_before_num_blocks.append(False)
kv_dim_before_num_blocks.append(False)
else:
# shape should be (2, num_blocks, ...)
assert test_shape[0] == 2
@@ -105,13 +220,32 @@ class CpuGpuOffloadingHandler(OffloadingHandler):
assert gpu_shape[0] == 2

num_blocks_idx = 1
self.kv_dim_before_num_blocks.append(True)
kv_dim_before_num_blocks.append(True)

try:
kv_cache_stride_order = attn_backend.get_kv_cache_stride_order(
include_num_layers_dimension=has_layers_dim
)
assert len(kv_cache_stride_order) == len(gpu_shape)
except (AttributeError, NotImplementedError):
kv_cache_stride_order = tuple(range(len(gpu_shape)))

# permute test_shape according to stride_order
test_shape = tuple(test_shape[i] for i in kv_cache_stride_order)

# find block_size (16) dimension index
block_size_idx = test_shape.index(16)
if kernel_block_size is not None:
assert kernel_block_size == gpu_shape[block_size_idx]
else:
kernel_block_size = gpu_shape[block_size_idx]
assert gpu_block_size % kernel_block_size == 0

cpu_shape = list(gpu_shape)
cpu_shape[num_blocks_idx] = num_cpu_blocks * self.block_size_factor
cpu_shape[num_blocks_idx] = num_cpu_blocks * block_size_factor

logger.debug("Allocating CPU tensor of shape %r", cpu_shape)
self.cpu_tensors.append(
cpu_tensors.append(
torch.zeros(
cpu_shape,
dtype=gpu_tensor.dtype,
@@ -120,72 +254,27 @@ class CpuGpuOffloadingHandler(OffloadingHandler):
)
)

def transfer_async(self, job_id: int, spec: TransferSpec) -> bool:
src_spec, dst_spec = spec
if isinstance(src_spec, CPULoadStoreSpec):
assert isinstance(dst_spec, GPULoadStoreSpec)
stream = self.h2d_stream
src_tensors = self.cpu_tensors
dst_tensors = self.gpu_tensors
src_block_size_factor = self.block_size_factor
dst_block_size_factor = 1
else:
assert isinstance(src_spec, GPULoadStoreSpec)
assert isinstance(dst_spec, CPULoadStoreSpec)
stream = self.d2h_stream
src_tensors = self.gpu_tensors
dst_tensors = self.cpu_tensors
src_block_size_factor = 1
dst_block_size_factor = self.block_size_factor

src_blocks = src_spec.block_ids
dst_blocks = dst_spec.block_ids
assert src_blocks.ndim == 1
assert dst_blocks.ndim == 1
assert kernel_block_size is not None
gpu_block_size_factor = gpu_block_size // kernel_block_size
cpu_block_size_factor = cpu_block_size // kernel_block_size

src_sub_block_count = src_blocks.size * src_block_size_factor
dst_sub_block_count = dst_blocks.size * dst_block_size_factor
src_sub_blocks_to_skip = -dst_blocks.size % src_block_size_factor
# TODO (orozery): adapt swap_blocks to support gpu_block_size_factor
assert gpu_block_size_factor == 1

assert dst_sub_block_count == src_sub_block_count - src_sub_blocks_to_skip

src_to_dst = np.empty((dst_sub_block_count, 2), dtype=np.int64)
expand_block_ids(
src_blocks,
src_block_size_factor,
src_to_dst[:, 0],
skip_count=src_sub_blocks_to_skip,
self.gpu_to_cpu_handler = SingleDirectionOffloadingHandler(
src_tensors=gpu_tensors,
dst_tensors=cpu_tensors,
kv_dim_before_num_blocks=kv_dim_before_num_blocks,
src_block_size_factor=gpu_block_size_factor,
dst_block_size_factor=cpu_block_size_factor,
priority=1,
)
expand_block_ids(dst_blocks, dst_block_size_factor, src_to_dst[:, 1])
src_to_dst_tensor = torch.from_numpy(src_to_dst)

event = self.events_pool.pop() if self.events_pool else torch.Event()
with torch.cuda.stream(stream):
for src_tensor, dst_tensor, kv_dim in zip(
src_tensors, dst_tensors, self.kv_dim_before_num_blocks
):
if kv_dim:
src_key_cache = src_tensor[0]
dst_key_cache = dst_tensor[0]
ops.swap_blocks(src_key_cache, dst_key_cache, src_to_dst_tensor)
src_value_cache = src_tensor[1]
dst_value_cache = dst_tensor[1]
ops.swap_blocks(src_value_cache, dst_value_cache, src_to_dst_tensor)
else:
ops.swap_blocks(src_tensor, dst_tensor, src_to_dst_tensor)
event.record(stream)

self.transfer_events[job_id] = event

# success
return True

def get_finished(self) -> list[TransferResult]:
results: list[TransferResult] = []
for job_id, event in self.transfer_events.items():
if event.query():
results.append((job_id, True))
self.events_pool.append(event)
for job_id, _ in results:
del self.transfer_events[job_id]
return results
self.cpu_to_gpu_handler = SingleDirectionOffloadingHandler(
src_tensors=cpu_tensors,
dst_tensors=gpu_tensors,
kv_dim_before_num_blocks=kv_dim_before_num_blocks,
src_block_size_factor=cpu_block_size_factor,
dst_block_size_factor=gpu_block_size_factor,
priority=-1,
)

+ 1
- 7
vllm/v1/structured_output/backend_xgrammar.py View File

@@ -268,13 +268,7 @@ def has_xgrammar_unsupported_json_features(schema: dict[str, Any]) -> bool:

# Unsupported keywords for objects
if obj.get("type") == "object" and any(
key in obj
for key in (
"minProperties",
"maxProperties",
"propertyNames",
"patternProperties",
)
key in obj for key in ("patternProperties", "propertyNames")
):
return True



Loading…
Cancel
Save
Baidu
map