5 Commits

Author | SHA1 | Message | Date
------ | ---- | ------- | ----
Alex Kim | 54ee820143 | [docs] Add NVIDIA Dynamo serving example (#7333) | 16 hours ago
Aylei | 8572b31924 | Fixed plugin load in metrics process (#8318) | 22 hours ago
Zhanghao Wu | 82a0ea1051 | [Template] Add `--address` for list nodes to avoid warning for multiple ray cluster and fix a race in ray template (#8306) | 1 day ago
Zhanghao Wu | 1aa2398db3 | [k8s] Update the instruction for dealing with exec-based kubeconfig (#8210) | 1 day ago
lloyd-brown | 3161f0d2b3 | [Docs] Add Tip to Restart API Server if Credential Setup Fails (#8314) | 1 day ago
13 changed files with 423 additions and 44 deletions
1. docs/source/examples/serving/index.rst (+1, -0)
2. docs/source/examples/serving/nvidia-dynamo.md (+1, -0)
3. docs/source/generate_examples.py (+3, -2)
4. docs/source/getting-started/installation.rst (+5, -0)
5. docs/source/reference/api-server/api-server-admin-deploy.rst (+9, -5)
6. examples/serve/nvidia-dynamo/README.md (+198, -0)
7. examples/serve/nvidia-dynamo/nvidia-dynamo-multinode.sky.yaml (+60, -0)
8. examples/serve/nvidia-dynamo/nvidia-dynamo.sky.yaml (+29, -0)
9. sky/server/plugins.py (+1, -0)
10. sky/server/uvicorn.py (+5, -0)
11. sky/utils/kubernetes/generate_kubeconfig.sh (+42, -33)
12. sky_templates/ray/start_cluster (+8, -4)
13. tests/unit_tests/test_sky/server/test_plugins.py (+61, -0)

docs/source/examples/serving/index.rst (+1, -0)

@@ -6,6 +6,7 @@ Serving

   vLLM <vllm>
   SGLang <sglang>
   Nvidia Dynamo <nvidia-dynamo>
   Ollama <ollama>
   Hugging Face TGI <tgi>
   LoRAX <lorax>


docs/source/examples/serving/nvidia-dynamo.md (+1, -0)

@@ -0,0 +1 @@
../../generated-examples/nvidia-dynamo.md

docs/source/generate_examples.py (+3, -2)

@@ -241,10 +241,11 @@ def _work(example_dir: pathlib.Path):
    globs = [example_dir.glob(pattern) for pattern in _GLOB_PATTERNS]
    for path in itertools.chain(*globs):
        examples.append(Example(path))
    # Find examples in subdirectories (search up to 2 levels deep)
    # Find examples in subdirectories (up to 3 levels deep)
    for path in example_dir.glob("*/*.md"):
        examples.append(Example(path.parent))
    # Also search 2 levels deep for nested examples like training/torchtitan
    for path in example_dir.glob("*/*/*.md"):
        examples.append(Example(path.parent))



docs/source/getting-started/installation.rst (+5, -0)

@@ -268,6 +268,11 @@ section :ref:`below <cloud-account-setup>`.

To check credentials only for specific clouds, pass the clouds as arguments: :code:`sky check aws gcp`

.. tip::

   If you are having trouble setting up credentials, it may be because the API server started before they were
   configured. Try restarting the API server by running :code:`sky api stop` and then :code:`sky api start`.
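
   A minimal restart sequence, for reference (:code:`sky check` re-verifies the credentials afterwards):

   .. code-block:: bash

      sky api stop
      sky api start
      sky check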

.. _cloud-account-setup:

Set up Kubernetes or clouds


docs/source/reference/api-server/api-server-admin-deploy.rst (+9, -5)

@@ -213,16 +213,20 @@ Following tabs describe how to configure credentials for different clouds on the

.. tip::

   If you are using a kubeconfig file that contains `exec-based authentication <https://kubernetes.io/docs/reference/access-authn-authz/authentication/#configuration>`_ (e.g., GKE's default ``gke-gcloud-auth-plugin`` based authentication), you will need to strip the path information from the ``command`` field in the exec configuration.
   You can use the ``exec_kubeconfig_converter.py`` script to do this.
   If you are using a kubeconfig file that contains `exec-based authentication <https://kubernetes.io/docs/reference/access-authn-authz/authentication/#configuration>`_ (e.g., GKE's default ``gke-gcloud-auth-plugin``, Nebius Managed Kubernetes, OCI, etc.), you will need to generate a kubeconfig with static authentication instead.
   You can use the ``generate_kubeconfig.sh`` script to do this.

   .. code-block:: bash

      python -m sky.utils.kubernetes.exec_kubeconfig_converter --input ~/.kube/config --output ~/.kube/config.converted
      # Download the script
      curl -O https://raw.githubusercontent.com/skypilot-org/skypilot/refs/heads/master/sky/utils/kubernetes/generate_kubeconfig.sh && chmod +x generate_kubeconfig.sh

   Then create the Kubernetes secret with the converted kubeconfig file ``~/.kube/config.converted``.

      # Generate the kubeconfig
      export KUBECONFIG=$HOME/.kube/config # or the path to your kubeconfig file
      ./generate_kubeconfig.sh

   Then create the Kubernetes secret with the generated kubeconfig file ``./kubeconfig``.
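
   For example, a minimal way to create that secret with ``kubectl`` (the secret name ``kubeconfig``, the key ``config``, and the namespace below are placeholders; use the values your API server deployment expects):

   .. code-block:: bash

      kubectl create secret generic kubeconfig \
        --from-file=config=./kubeconfig \
        --namespace skypilot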

   The specific cloud's credential for the exec-based authentication also needs to be configured. For example, to enable exec-based authentication for GKE, you also need to set up GCP credentials (see the GCP tab above).

.. dropdown:: Update Kubernetes credentials



examples/serve/nvidia-dynamo/README.md (+198, -0)

@@ -0,0 +1,198 @@
# Run NVIDIA Dynamo on any cloud or Kubernetes with SkyPilot

<p align="center">
<picture>
<img src="https://i.imgur.com/CBb1Yyi.png" width=75%>
</picture>
</p>


This recipe shows how to deploy and serve models using [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo) on any cloud provider or Kubernetes cluster with [SkyPilot](https://docs.skypilot.co/en/latest/docs/index.html). Run Dynamo seamlessly across AWS, GCP, Azure, Lambda Labs, Nebius, and more - or bring your own Kubernetes infrastructure.

Together, SkyPilot and Dynamo offer developers unparalleled flexibility: deploy any LLM, on any cloud, using any inference framework, all with minimal effort and operational overhead.

## What is NVIDIA Dynamo?

NVIDIA Dynamo is a high-performance inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. Built in Rust for performance and Python for extensibility, Dynamo solves the computational challenges of large language models that exceed single GPU capabilities.

### Core Features
- **Disaggregated Prefill & Decode**: Separates inference phases for optimal resource utilization
- **Dynamic GPU Scheduling**: Intelligent workload distribution across available GPUs
- **LLM-Aware Request Routing**: Smart routing based on model characteristics and cache states
- **Accelerated Data Transfer**: High-performance data movement between nodes
- **KV Cache Offloading**: Multi-tiered memory management for efficient cache utilization

## Launching NVIDIA Dynamo with SkyPilot

### Single-Node Example (`nvidia-dynamo.sky.yaml`)
- ✅ **SGLang Backend**: High-performance inference engine. Can be swapped with vLLM if required.
- ✅ **OpenAI-Compatible API**: Drop-in replacement for OpenAI endpoints
- ✅ **Basic Load Balancing**: Round-robin request distribution
- ✅ **Auto-Discovery**: Dynamic worker registration

### Multi-Node Example (`nvidia-dynamo-multinode.sky.yaml`)
- ✅ **KV-Aware Routing**: Intelligent cache-based request routing (`--router-mode kv`)
- ✅ **Multi-Node Distribution**: 2 nodes × 8 H100 GPUs (16 total GPUs)
- ✅ **Data Parallel Attention**: DP=2 across nodes (`--enable-dp-attention`)
- ✅ **Tensor Parallelism**: TP=8 per node for large model support
- ✅ **Disaggregated Transfer**: NIXL backend for KV cache transfers

**Model**: `Qwen/Qwen3-8B` (8B parameter reasoning model)

**Architecture**: 2 nodes, each with 8×H100 GPUs, TP=8, DP=2
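
How the multi-node recipe arrives at these numbers (a worked version of the arithmetic in the `run` section of `nvidia-dynamo-multinode.sky.yaml` below):

```bash
TOTAL_GPUS=$((2 * 8))        # SKYPILOT_NUM_NODES * SKYPILOT_NUM_GPUS_PER_NODE = 16
TP_SIZE=$((TOTAL_GPUS / 2))  # 8-way tensor parallelism per replica
DP_SIZE=2                    # 2 data-parallel attention groups; TP_SIZE * DP_SIZE = TOTAL_GPUS
```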

## Launch Cluster

Once SkyPilot is set up (see [Appendix: Preparation](#appendix-preparation)), launch the example with:

```bash
sky launch -c dynamo nvidia-dynamo.sky.yaml
```

## Test Endpoint

```bash
export ENDPOINT=$(sky status --endpoint 8080 dynamo)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "stream": false,
    "max_tokens": 300
  }' | jq
...
{
  "id": "chatcmpl-e2b5b2bd-59fb-4321-8afc-3b5bb4a717a7",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "<think>\nOkay, the user greeted me with \"Hello, how are you?\" I should respond in a friendly and natural way. Let me think about the appropriate response.\n\nFirst, I need to acknowledge their greeting. Maybe start with a cheerful \"Hello!\" to match their tone. Then, I should mention that I'm just a virtual assistant, so I don't have feelings, but I'm here to help. It's important to keep it conversational.\n\nI should make sure to invite them to ask questions or share what they need help with. That way, it's open-ended and encourages further interaction. Also, adding an emoji like 😊 can make the response more friendly and approachable.\n\nWait, should I mention my name again? Maybe not necessary since the user already knows. Just keep it simple and welcoming. Let me check the example response they provided. Yes, it's similar to that. I think that's all. Keep the tone positive and helpful.\n</think>\n\nHello! 😊 I'm just a virtual assistant, so I don't have feelings, but I'm here to help you with whatever you need! What can I assist you with today?",
        "role": "assistant",
        "reasoning_content": null
      },
      "finish_reason": "stop"
    }
  ],
  "created": 1758497220,
  "model": "Qwen/Qwen3-8B",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 235,
    "total_tokens": 249
  }
}
```
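
To print only the assistant's reply, the same request can be piped through `jq -r` (the payload is identical to the call above, just with a different `jq` filter):

```bash
curl -s http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-8B", "messages": [{"role": "user", "content": "Hello, how are you?"}], "max_tokens": 300}' \
  | jq -r '.choices[0].message.content'
```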

## Multi-Node Serving

### Launch Multi-Node Cluster

```bash
sky launch -c dynamo-multi nvidia-dynamo-multinode.sky.yaml
```

### Test Multi-Node Endpoint

```bash
export ENDPOINT=$(sky status --endpoint 8080 dynamo-multi)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "stream": false,
    "max_tokens": 300
  }' | jq
```

Example output:
```json
{
  "id": "chatcmpl-5524560e-aecd-4b63-a41b-23d0a787c9b0",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "<think>\nOkay, the user greeted me with \"Hello, how are you?\" I need to respond appropriately. Let me start by acknowledging their greeting. I should mention that I'm an AI assistant, so I don't have feelings, but I'm here to help.\n\nI should keep the response friendly and open-ended. Maybe ask them how they're doing to encourage a conversation. Let me check if there's anything specific they might need. Oh, maybe they have a question or need assistance with something. I should make sure to invite them to ask for help if needed. Also, keep the tone positive and approachable. Alright, putting it all together now.\n</think>\n\nHello! I'm just a virtual assistant, so I don't have feelings, but I'm here and ready to help! How are you today? 😊 If you have any questions or need assistance, feel free to ask!",
        "role": "assistant",
        "reasoning_content": null
      },
      "finish_reason": "stop"
    }
  ],
  "created": 1758501329,
  "model": "Qwen/Qwen3-8B",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 181,
    "total_tokens": 195
  }
}
```

## Verifying KV-Aware Routing

Check logs for these indicators:

```
INFO dynamo_llm::kv_router: KV Routing initialized
INFO dynamo_llm::kv_router::scheduler: Formula for 7587889683284143912 with 0 cached blocks: 0.875 = 1.0 * prefill_blocks + decode_blocks = 1.0 * 0.875 + 0.000
INFO dynamo_llm::kv_router::scheduler: Selected worker: 7587889683284143912, logit: 0.875, cached blocks: 0, total blocks: 109815
```

The routing formula shows worker selection based on KV cache hits and load balancing.
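
To verify this from your workstation, one option is to stream the cluster logs and filter for the router lines (`sky logs` streams the job output; the `grep` pattern here is just a convenience, not a Dynamo flag):

```bash
sky logs dynamo-multi | grep kv_router
```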

## Appendix: Preparation

1. Install SkyPilot for launching the serving cluster:
```bash
pip install "skypilot-nightly[aws,gcp,kubernetes]"
# or other clouds you have set up (17+ clouds and Kubernetes are supported)
# See: https://docs.skypilot.co/en/latest/getting-started/installation.html
```

2. Check your infra setup:
```bash
sky check

🎉 Enabled clouds 🎉
✔ AWS
✔ GCP
✔ Azure
...
✔ Kubernetes
```

3. Set `HF_TOKEN` if you're using a [gated model](https://huggingface.co/docs/hub/en/models-gated) and then pass it to the `sky launch` command:
```bash
export HF_TOKEN="xxxx"
sky launch -c dynamo nvidia-dynamo.sky.yaml --env MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct --env HF_TOKEN
```

## What's next

SkyServe support for NVIDIA Dynamo is coming soon.

More resources:

* [AI on Kubernetes Without the Pain](https://blog.skypilot.co/ai-on-kubernetes/)
* [SkyPilot AI Gallery](https://docs.skypilot.co/en/latest/gallery/index.html)
* [SkyPilot Docs](https://docs.skypilot.co)
* [SkyPilot GitHub](https://github.com/skypilot-org/skypilot)

examples/serve/nvidia-dynamo/nvidia-dynamo-multinode.sky.yaml (+60, -0)

@@ -0,0 +1,60 @@
# Multi-node serving with NVIDIA Dynamo and SGLang in disaggregation mode.
#
# Usage:
#
# sky launch -c dynamo-multi nvidia-dynamo-multinode.sky.yaml
#
# This config uses 2 nodes with 8x H100 GPUs each for disaggregated serving.
# Optionally override the model:
#
# sky launch -c dynamo-multi nvidia-dynamo-multinode.sky.yaml --env MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct --env HF_TOKEN

resources:
  accelerators: H100:8
  ports: 8080

num_nodes: 2

envs:
  MODEL_NAME: Qwen/Qwen3-8B
  DIST_INIT_PORT: 29500
  HF_TOKEN: "" # needed if a model is gated in HF Hub. Pass the value with `--env HF_TOKEN`

setup: |
  sudo usermod -aG docker $USER
  sudo chmod 666 /var/run/docker.sock
  uv pip install "ai-dynamo[sglang]==0.5.0" accelerate --system --prerelease=allow
  uv pip install "sglang[all]==0.5.2" --system --prerelease=allow
  curl -fsSL -o docker-compose.yml https://raw.githubusercontent.com/ai-dynamo/dynamo/v0.5.0/deploy/docker-compose.yml
  docker compose -f docker-compose.yml up -d

run: |
  export GLOO_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')
  HEAD_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  TOTAL_GPUS=$((SKYPILOT_NUM_NODES * SKYPILOT_NUM_GPUS_PER_NODE))

  # For disaggregation mode, we need dp-size > 1
  # Setting TP to half of total GPUs and DP to 2 for proper distribution
  TP_SIZE=$((TOTAL_GPUS / 2))
  DP_SIZE=2

  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    # Start frontend with KV-aware routing enabled
    python -m dynamo.frontend --router-mode kv --http-port 8080 &
  fi

  python -m dynamo.sglang \
    --model-path $MODEL_NAME \
    --tp $TP_SIZE \
    --dp-size $DP_SIZE \
    --dist-init-addr $HEAD_IP:$DIST_INIT_PORT \
    --nnodes ${SKYPILOT_NUM_NODES} \
    --node-rank ${SKYPILOT_NODE_RANK} \
    --host 0.0.0.0 \
    --port 8081 \
    --enable-dp-attention \
    --trust-remote-code \
    --mem-fraction-static 0.82 \
    --disaggregation-transfer-backend nixl \
    --disaggregation-bootstrap-port 30001 \
    --page-size 16

examples/serve/nvidia-dynamo/nvidia-dynamo.sky.yaml (+29, -0)

@@ -0,0 +1,29 @@
# Single-node serving with NVIDIA Dynamo and SGLang.
#
# Usage:
#
# sky launch -c dynamo nvidia-dynamo.sky.yaml
#
# Optionally override the model:
#
# sky launch -c dynamo nvidia-dynamo.sky.yaml --env MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct --env HF_TOKEN

resources:
  accelerators: H100:1
  ports: 8080

envs:
  MODEL_NAME: Qwen/Qwen3-8B
  HF_TOKEN: "" # needed if a model is gated in HF Hub. Pass the value with `--env HF_TOKEN`

setup: |
  sudo usermod -aG docker $USER
  sudo chmod 666 /var/run/docker.sock

  uv pip install "ai-dynamo[sglang]==0.4.1" accelerate --system --prerelease=allow
  curl -fsSL -o docker-compose.yml https://raw.githubusercontent.com/ai-dynamo/dynamo/release/0.4.1/deploy/docker-compose.yml
  docker compose -f docker-compose.yml up -d

run: |
  python -m dynamo.frontend &
  python -m dynamo.sglang --model $MODEL_NAME

sky/server/plugins.py (+1, -0)

@@ -164,6 +164,7 @@ def load_plugins(extension_context: ExtensionContext):

    for plugin_config in config.get('plugins', []):
        class_path = plugin_config['class']
        logger.debug(f'Loading plugins: {class_path}')
        module_path, class_name = class_path.rsplit('.', 1)
        try:
            module = importlib.import_module(module_path)


sky/server/uvicorn.py (+5, -0)

@@ -20,6 +20,7 @@ from uvicorn.supervisors import multiprocess
from sky import sky_logging
from sky.server import daemons
from sky.server import metrics as metrics_lib
from sky.server import plugins
from sky.server import state
from sky.server.requests import requests as requests_lib
from sky.skylet import constants
@@ -237,6 +238,10 @@ def run(config: uvicorn.Config, max_db_connections: Optional[int] = None):
    server = Server(config=config, max_db_connections=max_db_connections)
    try:
        if config.workers is not None and config.workers > 1:
            # When workers > 1, uvicorn does not run server app in the main
            # process. In this case, plugins are not loaded at this point, so
            # load plugins here without uvicorn app.
            plugins.load_plugins(plugins.ExtensionContext())
        sock = config.bind_socket()
        SlowStartMultiprocess(config, target=server.run,
                              sockets=[sock]).run()


sky/utils/kubernetes/generate_kubeconfig.sh (+42, -33)

@@ -12,20 +12,20 @@
# * Specify SKYPILOT_NAMESPACE env var to override the default namespace where the service account is created.
# * Specify SKYPILOT_SA_NAME env var to override the default service account name.
# * Specify SKIP_SA_CREATION=1 to skip creating the service account and use an existing one
# * Specify SUPER_USER=1 to create a service account with cluster-admin permissions
# * Specify SUPER_USER=0 to create a service account with minimal permissions
#
# Usage:
# # Create "sky-sa" service account with minimal permissions in "default" namespace and generate kubeconfig
# # Create "sky-sa" service account in "default" namespace and generate kubeconfig
# $ ./generate_kubeconfig.sh
#
# # Create "my-sa" service account with minimal permissions in "my-namespace" namespace and generate kubeconfig
# # Create "my-sa" service account in "my-namespace" namespace and generate kubeconfig
# $ SKYPILOT_SA_NAME=my-sa SKYPILOT_NAMESPACE=my-namespace ./generate_kubeconfig.sh
#
# # Use an existing service account "my-sa" in "my-namespace" namespace and generate kubeconfig
# $ SKIP_SA_CREATION=1 SKYPILOT_SA_NAME=my-sa SKYPILOT_NAMESPACE=my-namespace ./generate_kubeconfig.sh
#
# # Create "sky-sa" service account with cluster-admin permissions in "default" namespace
# $ SUPER_USER=1 ./generate_kubeconfig.sh
# # Create "sky-sa" service account with minimal permissions in "default" namespace (manual setup may be required)
# $ SUPER_USER=0 ./generate_kubeconfig.sh

set -eu -o pipefail

@@ -33,11 +33,18 @@ set -eu -o pipefail
# use default.
SKYPILOT_SA=${SKYPILOT_SA_NAME:-sky-sa}
NAMESPACE=${SKYPILOT_NAMESPACE:-default}
SUPER_USER=${SUPER_USER:-0}
SUPER_USER=${SUPER_USER:-1}

echo "Service account: ${SKYPILOT_SA}"
echo "Namespace: ${NAMESPACE}"
echo "Super user permissions: ${SUPER_USER}"
echo "=========================================="
echo "SkyPilot Kubeconfig Generation"
echo "=========================================="
echo "Service Account: ${SKYPILOT_SA}"
echo "Namespace: ${NAMESPACE}"
if [ "${SUPER_USER}" != "1" ]; then
echo "Permissions: Minimal (manual setup may be required)"
SUPER_USER=0
fi
echo ""

# Set OS specific values.
if [[ "$OSTYPE" == "linux-gnu" ]]; then
@@ -53,7 +60,7 @@ fi

# If the user has set SKIP_SA_CREATION=1, skip creating the service account.
if [ -z ${SKIP_SA_CREATION+x} ]; then
echo "Creating the Kubernetes Service Account with ${SUPER_USER:+super user}${SUPER_USER:-minimal} RBAC permissions."
echo "[1/3] Creating Kubernetes Service Account and RBAC permissions..."
if [ "${SUPER_USER}" = "1" ]; then
# Create service account with cluster-admin permissions
kubectl apply -f - <<EOF
@@ -219,7 +226,8 @@ roleRef:
EOF
fi
# Apply optional ingress-related roles, but don't make the script fail if it fails
kubectl apply -f - <<EOF || echo "Failed to apply optional ingress-related roles. Nginx ingress is likely not installed. This is not critical and the script will continue."
echo " → Applying optional ingress permissions (skipped if ingress-nginx not installed)..."
kubectl apply -f - 2>/dev/null <<EOF || true
# Optional: Role for accessing ingress resources
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
@@ -253,8 +261,13 @@ roleRef:
name: ${SKYPILOT_SA}-role-ingress-nginx # Use the same name as the role at line 119
apiGroup: rbac.authorization.k8s.io
EOF
else
  echo "[1/3] Skipping service account creation (using existing account)..."
fi

echo ""
echo "[2/3] Creating service account token..."

# Checks if a secret entry was defined for the Service account. If defined, it means the Kubernetes server has a
# version below 1.24; otherwise, one must manually create the secret and bind it to the Service account to have a non-expiring token.
# After Kubernetes v1.24 Service accounts no longer generate automatic tokens/secrets.
@@ -293,7 +306,9 @@ CURRENT_CONTEXT=$(kubectl config current-context)
CURRENT_CLUSTER=$(kubectl config view -o jsonpath="{.contexts[?(@.name == \"${CURRENT_CONTEXT}\"})].context.cluster}")
CURRENT_CLUSTER_ADDR=$(kubectl config view -o jsonpath="{.clusters[?(@.name == \"${CURRENT_CLUSTER}\"})].cluster.server}")

echo "Writing kubeconfig."
echo ""
echo "[3/3] Generating kubeconfig file..."

cat > kubeconfig <<EOF
apiVersion: v1
clusters:
@@ -316,24 +331,18 @@ users:
token: ${SA_TOKEN}
EOF

echo "---
Done!

Kubeconfig using service account '${SKYPILOT_SA}' in namespace '${NAMESPACE}' written at $(pwd)/kubeconfig

Copy the generated kubeconfig file to your ~/.kube/ directory to use it with
kubectl and skypilot:

# Backup your existing kubeconfig file
mv ~/.kube/config ~/.kube/config.bak
cp kubeconfig ~/.kube/config

# Verify that you can access the cluster
kubectl get pods

Also add this to your ~/.sky/config.yaml to use the new service account:

# ~/.sky/config.yaml
kubernetes:
  remote_identity: ${SKYPILOT_SA}
"
echo ""
echo "=========================================="
echo "✓ SUCCESS!"
echo "=========================================="
echo ""
echo "Kubeconfig file created successfully!"
echo ""
echo " Service Account: ${SKYPILOT_SA}"
echo " Namespace: ${NAMESPACE}"
echo " Location: $(pwd)/kubeconfig"
echo ""
echo "Next steps:"
echo " Refer to this page for setting up the credential for remote API server:"
echo " https://docs.skypilot.co/en/latest/reference/api-server/api-server-admin-deploy.html#optional-configure-cloud-accounts"
echo ""

sky_templates/ray/start_cluster (+8, -4)

@@ -77,14 +77,18 @@ if ! run_ray --version > /dev/null; then
fi
echo -e "${GREEN}Ray $(run_ray --version | cut -d' ' -f3) is installed.${NC}"

RAY_ADDRESS="127.0.0.1:${RAY_HEAD_PORT}"
LOCAL_RAY_ADDRESS="127.0.0.1:${RAY_HEAD_PORT}"
RAY_ADDRESS=${LOCAL_RAY_ADDRESS}
if [ "${SKYPILOT_NODE_RANK}" -ne 0 ]; then
HEAD_IP=$(echo "${SKYPILOT_NODE_IPS}" | head -n1)
RAY_ADDRESS="${HEAD_IP}:${RAY_HEAD_PORT}"
fi

# Check if user-space Ray is already running
if run_ray status --address="${RAY_ADDRESS}" &> /dev/null; then
# Check if user-space Ray is already running. Use local address to check, as
# if we use the head node address, the check will succeed even if the Ray
# cluster is started on the head node but not started on the current worker
# node.
if run_ray status --address="${LOCAL_RAY_ADDRESS}" &> /dev/null; then
echo -e "${YELLOW}Ray cluster is already running.${NC}"
run_ray status --address="${RAY_ADDRESS}"
exit 0
@@ -140,7 +144,7 @@ if [ "${SKYPILOT_NODE_RANK}" -eq 0 ]; then
echo -e "${RED}Error: Timeout waiting for nodes.${NC}" >&2
exit 1
fi
ready_nodes=$(run_ray list nodes --format=json | python3 -c "import sys, json; print(len(json.load(sys.stdin)))")
ready_nodes=$(run_ray list nodes --address="${RAY_ADDRESS}" --format=json | python3 -c "import sys, json; print(len(json.load(sys.stdin)))")
if [ "${ready_nodes}" -ge "${SKYPILOT_NUM_NODES}" ]; then
break
fi


tests/unit_tests/test_sky/server/test_plugins.py (+61, -0)

@@ -1,12 +1,15 @@
"""Unit tests for the SkyPilot API server plugins."""

import importlib
import sys
import types
from unittest import mock

from fastapi import FastAPI
import yaml

from sky.server import plugins
from sky.server import uvicorn as skyuvicorn


def test_load_plugins_registers_and_installs(monkeypatch, tmp_path):
@@ -50,3 +53,61 @@ def test_load_plugins_registers_and_installs(monkeypatch, tmp_path):
    assert isinstance(plugin, DummyPlugin)
    assert plugin.value == 42
    assert installed['ctx'] is ctx


def test_server_import_loads_plugins(monkeypatch):
    load_mock = mock.MagicMock()
    monkeypatch.setattr(plugins, 'load_plugins', load_mock)

    server_module = importlib.import_module('sky.server.server')
    load_mock.reset_mock()

    importlib.reload(server_module)

    load_mock.assert_called_once()
    ctx = load_mock.call_args.args[0]
    assert isinstance(ctx, plugins.ExtensionContext)
    assert ctx.app is server_module.app


def test_uvicorn_run_loads_plugins_for_multiple_workers(monkeypatch):
    load_mock = mock.MagicMock()
    monkeypatch.setattr(plugins, 'load_plugins', load_mock)

    class DummyServer:

        def __init__(self, config, max_db_connections=None):
            del config, max_db_connections

        def run(self, *args, **kwargs):
            del args, kwargs

    class DummyMultiprocess:

        def __init__(self, config, target, sockets):
            self.config = config
            self.target = target
            self.sockets = sockets
            self.run_called = False

        def run(self):
            self.run_called = True

    class DummyConfig:
        reload = False
        workers = 2
        uds = None

        def bind_socket(self):
            return object()

    monkeypatch.setattr(skyuvicorn, 'Server', DummyServer)
    monkeypatch.setattr(skyuvicorn, 'SlowStartMultiprocess', DummyMultiprocess)

    dummy_config = DummyConfig()
    skyuvicorn.run(dummy_config)

    load_mock.assert_called_once()
    ctx = load_mock.call_args.args[0]
    assert isinstance(ctx, plugins.ExtensionContext)
    assert ctx.app is None
