Use vLLM on GKE to run inference with Qwen3

This tutorial shows you how to deploy and serve a Qwen3 large language model (LLM) with the vLLM serving framework. You deploy the model on a single A4 virtual machine (VM) instance on Google Kubernetes Engine (GKE).

This tutorial is intended for machine learning (ML) engineers, platform administrators and operators, and for data and AI specialists who are interested in using Kubernetes container orchestration capabilities to handle inference workloads.

Access Qwen3 by using Hugging Face

To use Hugging Face to access Qwen3, follow these steps:

  1. Sign in to Hugging Face.
  2. Create a Hugging Face read access token. Click Your Profile > Settings > Access Tokens > +Create new token.
  3. Specify a name of your choice for the token and then select a role. The minimum role permission level that you can select for this tutorial is Read.
  4. Select Create token.
  5. Copy the generated token and save it. You use it later in this tutorial.
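
Optionally, before you continue, you can confirm that the token works. The following check is a minimal sketch that assumes the Hugging Face whoami-v2 API endpoint; if the token is valid, the response contains your account details:

curl -s -H "Authorization: Bearer HUGGING_FACE_TOKEN" \
    https://huggingface.co/api/whoami-v2

Replace HUGGING_FACE_TOKEN with the token that you created.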

Prepare your environment

To prepare your environment, set the default environment variables:

gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=RESERVATION_URL
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN
export NETWORK=NETWORK_NAME
export SUBNETWORK=SUBNETWORK_NAME

Replace the following:

  • PROJECT_ID: the ID of the Google Cloud project where you want to create the GKE cluster.

  • RESERVATION_URL: the URL of the reservation that you want to use to create your GKE cluster. To find the reservations that your project can use, see the example command after this list. Based on the project in which the reservation exists, specify one of the following values:

    • The reservation exists in your project: RESERVATION_NAME

    • The reservation exists in a different project, and your project can use the reservation: projects/RESERVATION_PROJECT_ID/reservations/RESERVATION_NAME

  • REGION: the region where you want to create your GKE cluster. You can only create the cluster in the region where your reservation exists.

  • CLUSTER_NAME: the name of the GKE cluster to create.

  • HUGGING_FACE_TOKEN: the Hugging Face access token that you created in the previous section.

  • NETWORK_NAME: the network that the GKE cluster uses. Specify one of the following values:

    • If you created a custom network, then specify the name of your network.

    • Otherwise, specify default.

  • SUBNETWORK_NAME: the subnetwork that the GKE cluster uses. Specify one of the following values:

    • If you created a custom subnetwork, then specify the name of your subnetwork. You can only specify a subnetwork that exists in the same region as the reservation.

    • Otherwise, specify default.
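
If you aren't sure which reservation to use, you can list the reservations that are available to your project. The zone in the output also tells you which region to specify for REGION:

gcloud compute reservations list --project=PROJECT_ID

Replace PROJECT_ID with the ID of the project that owns the reservation.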

Create a GKE cluster in Autopilot mode

To create a GKE cluster in Autopilot mode, run the following command:

gcloud container clusters create-auto $CLUSTER_NAME \
    --project=$PROJECT_ID \
    --region=$REGION \
    --release-channel=rapid \
    --network=$NETWORK \
    --subnetwork=$SUBNETWORK

Creating the GKE cluster might take some time to complete. To verify that Google Cloud has finished creating your cluster, go to the Kubernetes clusters page in the Google Cloud console.
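
Alternatively, you can check the cluster status from the command line. The following command prints RUNNING after the cluster is ready:

gcloud container clusters describe $CLUSTER_NAME \
    --region=$REGION \
    --format="value(status)"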

Create a Kubernetes secret for Hugging Face credentials

To create a Kubernetes secret for Hugging Face credentials, follow these steps:

  1. Configure kubectl to communicate with your GKE cluster:

    gcloud container clusters get-credentials $CLUSTER_NAME \
        --location=$REGION
    
  2. Create a Kubernetes secret to store your Hugging Face token:

    kubectl create secret generic hf-secret \
        --from-literal=hf_token=${HUGGING_FACE_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
    
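To confirm that the secret exists, you can list it:

kubectl get secret hf-secret

The output shows the secret with the type Opaque and one data entry.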

Deploy a vLLM container to your GKE cluster

To deploy the vLLM container to serve the Qwen3 model by using Kubernetes Deployments, do the following:

  1. Create a qwen3-235b-deploy.yaml file with the following vLLM Deployment, Service, and PodMonitoring manifests. In the nodeSelector field, replace RESERVATION_URL with the value that you specified for the RESERVATION_URL environment variable:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-qwen3-deployment
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: qwen3-server
      template:
        metadata:
          labels:
            app: qwen3-server
            ai.gke.io/model: Qwen3-235B-A22B-Instruct-2507
            ai.gke.io/inference-server: vllm
        spec:
          containers:
          - name: qwen-inference-server
            image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250801_0916_RC01
            resources:
              requests:
                cpu: "10"
                memory: "1000Gi"
                ephemeral-storage: "500Gi"
                nvidia.com/gpu: "8"
              limits:
                cpu: "10"
                memory: "1000Gi"
                ephemeral-storage: "500Gi"
                nvidia.com/gpu: "8"
            command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
            args:
            - --model=$(MODEL_ID)
            - --tensor-parallel-size=8
            - --host=0.0.0.0
            - --port=8000
            - --max-model-len=8192
            - --max-num-seqs=4
            - --dtype=bfloat16
            env:
            - name: MODEL_ID
              value: "Qwen/Qwen3-235B-A22B-Instruct-2507"
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_token
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 1320
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 1320
              periodSeconds: 5
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-b200
            cloud.google.com/reservation-name: RESERVATION_URL
            cloud.google.com/reservation-affinity: "specific"
            cloud.google.com/gke-gpu-driver-version: latest
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: qwen3-service
    spec:
      selector:
        app: qwen3-server
      type: ClusterIP
      ports:
        - protocol: TCP
          port: 8000
          targetPort: 8000
    ---
    apiVersion: monitoring.googleapis.com/v1
    kind: PodMonitoring
    metadata:
      name: vllm-qwen3-monitoring
    spec:
      selector:
        matchLabels:
          app: qwen3-server
      endpoints:
      - port: 8000
        path: /metrics
        interval: 30s
    
  2. Apply the qwen3-235b-deploy.yaml file to your GKE cluster:

    kubectl apply -f qwen3-235b-deploy.yaml
    

    During the deployment process, the container downloads the Qwen3-235B-A22B-Instruct-2507 model from Hugging Face, so the deployment might take up to 30 minutes to complete. You can follow the download and startup progress in the server logs, as shown in the example after these steps.

  3. To see the completion status, run the following command:

    kubectl wait \
        --for=condition=Available \
        --timeout=1500s deployment/vllm-qwen3-deployment
    

    The --timeout=1500s flag allows the command to monitor the deployment for up to 25 minutes.
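
While you wait, you can watch the Pod status and follow the server logs from another terminal window. The following commands use the app=qwen3-server label and the Deployment name from the manifest that you applied; the logs show the model download and server startup progress:

kubectl get pods -l app=qwen3-server

kubectl logs -f deployment/vllm-qwen3-deployment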

Interact with Qwen3 by using curl

To verify the Qwen3 model that you deployed, do the following:

  1. Set up port forwarding to Qwen3:

    kubectl port-forward service/qwen3-service 8000:8000
    
  2. Open a new terminal window. You can then chat with your model by using curl:

    curl http://127.0.0.1:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen/Qwen3-235B-A22B-Instruct-2507",
      "messages": [
        {
          "role": "user",
          "content": "Describe a GPU in one short sentence?"
        }
      ]
    }'
    

    The output is similar to the following:

    {
      "id": "chatcmpl-a926ddf7ef2745ca832bda096e867764",
      "object": "chat.completion",
      "created": 1755023619,
      "model": "Qwen/Qwen3-235B-A22B-Instruct-2507",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "A GPU is a specialized electronic circuit designed to rapidly process and render graphics and perform parallel computations.",
            "refusal": null,
            "annotations": null,
            "audio": null,
            "function_call": null,
            "tool_calls": [],
            "reasoning_content": null
          },
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 16,
        "total_tokens": 36,
        "completion_tokens": 20,
        "prompt_tokens_details": null
      },
      "prompt_logprobs": null,
      "kv_transfer_params": null
    }
    
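The vLLM server also supports streaming responses through the same OpenAI-compatible API. The following request is a minimal variation of the previous one: it sets "stream": true, and the server returns the answer incrementally as server-sent events:

curl http://127.0.0.1:8000/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "model": "Qwen/Qwen3-235B-A22B-Instruct-2507",
  "stream": true,
  "messages": [
    {
      "role": "user",
      "content": "Describe a GPU in one short sentence?"
    }
  ]
}'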

Observe model performance

If you want to observe your model's performance, then you can use the vLLM dashboard integration in Cloud Monitoring. This dashboard helps you view critical performance metrics for your model, such as token throughput, request latency, and error rates. For more information, see vLLM in the Cloud Monitoring documentation.
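
The PodMonitoring resource that you deployed earlier scrapes these metrics from the vLLM server's /metrics path. To inspect the raw Prometheus metrics yourself, you can reuse the port forwarding from the previous section. The following command is a minimal sketch that assumes the vLLM metric names begin with the vllm: prefix:

curl -s http://127.0.0.1:8000/metrics | grep "^vllm:"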