This tutorial shows you how to deploy and serve a Gemma 3 27B large language model (LLM) with the vLLM serving framework. You deploy Gemma 3 on a single A4 virtual machine (VM) instance on Google Kubernetes Engine (GKE).
This tutorial is intended for machine learning (ML) engineers, platform administrators and operators, and for data and AI specialists who are interested in using Kubernetes container orchestration capabilities to handle inference workloads.
Access Gemma 3 by using Hugging Face
To use Hugging Face to access Gemma 3, do the following:
- Sign in to Hugging Face.
- Create a Hugging Face read access token. Click Your Profile > Settings > Access tokens > +Create new token.
- Copy and save the read access token value. You use it later in this tutorial.
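Optionally, before you continue, you can confirm that the token works. The following check isn't part of the tutorial; it assumes you have curl available and calls the Hugging Face whoami endpoint with the token you just created:

# Optional check: verify the token against the Hugging Face API.
# Replace HUGGING_FACE_TOKEN with the token value you copied.
curl -s -H "Authorization: Bearer HUGGING_FACE_TOKEN" \
    https://huggingface.co/api/whoami-v2

If the token is valid, the response includes your Hugging Face username.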
Prepare your environment
To prepare your environment, set the default environment variables:
gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=RESERVATION_URL
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN
export NETWORK=NETWORK_NAME
export SUBNETWORK=SUBNETWORK_NAME
Replace the following:

- PROJECT_ID: the ID of the Google Cloud project where you want to create the GKE cluster.
- RESERVATION_URL: the URL of the reservation that you want to use to create your GKE cluster. Based on the project in which the reservation exists, specify one of the following values:
  - The reservation exists in your project: RESERVATION_NAME
  - The reservation exists in a different project, and your project can use the reservation: projects/RESERVATION_PROJECT_ID/reservations/RESERVATION_NAME
- REGION: the region where you want to create your GKE cluster. You can only create the cluster in the region where your reservation exists.
- CLUSTER_NAME: the name of the GKE cluster to create.
- HUGGING_FACE_TOKEN: the Hugging Face access token that you created in the previous section.
- NETWORK_NAME: the network that the GKE cluster uses. Specify one of the following values:
  - If you created a custom network, specify the name of your network.
  - Otherwise, specify default.
- SUBNETWORK_NAME: the subnetwork that the GKE cluster uses. Specify one of the following values:
  - If you created a custom subnetwork, specify the name of your subnetwork. You can only specify a subnetwork that exists in the same region as the reservation.
  - Otherwise, specify default.
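For reference, a fully substituted set of exports might look like the following. Every value here is hypothetical; use your own project, reservation, region, cluster name, token, and network details:

export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=projects/my-reservation-project/reservations/my-a4-reservation  # hypothetical
export REGION=us-central1                # hypothetical; use the region where your reservation exists
export CLUSTER_NAME=gemma-vllm-cluster   # hypothetical
export HUGGING_FACE_TOKEN=hf_xxxxxxxx    # the read token from the previous section
export NETWORK=default
export SUBNETWORK=default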
Create a GKE cluster in Autopilot mode
To create a GKE cluster in Autopilot mode, run the following command:
gcloud container clusters create-auto $CLUSTER_NAME \
--project=$PROJECT_ID \
--region=$REGION \
--release-channel=rapid \
--network=$NETWORK \
--subnetwork=$SUBNETWORK
Creating the GKE cluster might take some time to complete. To verify that Google Cloud has finished creating your cluster, go to the Kubernetes clusters page in the Google Cloud console.
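You can also check the cluster status from the command line. This optional check is not part of the original steps; the cluster is ready when the command prints RUNNING:

gcloud container clusters describe $CLUSTER_NAME \
    --region=$REGION \
    --format="value(status)"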
Create a Kubernetes secret for Hugging Face credentials
To create a Kubernetes secret for Hugging Face credentials, follow these steps:
Configure kubectl to communicate with your GKE cluster:

gcloud container clusters get-credentials $CLUSTER_NAME \
    --location=$REGION
Create a Kubernetes secret to store your Hugging Face token:
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HUGGING_FACE_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -
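To confirm that the secret exists and holds the expected value, you can run the following optional checks:

# List the secret to confirm it was created.
kubectl get secret hf-secret

# Optionally decode the stored token and compare it with the value you exported.
kubectl get secret hf-secret -o jsonpath='{.data.hf_api_token}' | base64 --decode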
Deploy a vLLM container to your GKE cluster
To deploy the vLLM container to serve the Gemma 3 27B model by using Kubernetes Deployments, follow these steps:
Create a vllm-3-27b-it.yaml file with your chosen vLLM deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-3-27b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250801_0916_RC01
        resources:
          requests:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "120Gi"
            nvidia.com/gpu: "8"
          limits:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "120Gi"
            nvidia.com/gpu: "8"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=8
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=4096
        - --max-num-seqs=4
        env:
        - name: MODEL_ID
          value: google/gemma-3-27b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 600
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 600
          periodSeconds: 5
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-b200
        cloud.google.com/reservation-name: RESERVATION_URL
        cloud.google.com/reservation-affinity: "specific"
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
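Optionally, you can validate the manifest client-side before you apply it. This check is an extra step, not part of the original instructions:

# Render the manifest without sending it to the cluster to catch YAML or schema errors early.
kubectl apply --dry-run=client -f vllm-3-27b-it.yaml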
Apply the vllm-3-27b-it.yaml file to your GKE cluster:

kubectl apply -f vllm-3-27b-it.yaml
During the deployment process, the container must download Gemma 3 from Hugging Face. For this reason, deployment of the container might take up to 30 minutes to complete.
Wait for the deployment to complete:

kubectl wait \
    --for=condition=Available \
    --timeout=1800s \
    deployment/vllm-gemma-deployment
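While you wait, you can optionally follow the model download and server startup with standard kubectl commands. These are extra checks, not part of the original steps:

# Watch the pod until it reports Running and Ready.
kubectl get pods -l app=gemma-server --watch

# Stream the vLLM server logs to follow the model download and startup.
kubectl logs -f -l app=gemma-server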
Interact with Gemma 3 by using curl
To verify your deployed Gemma 3 27B instruction-tuned model, follow these steps:
Set up port forwarding to Gemma 3:
kubectl port-forward service/llm-service 8000:8000
Open a new terminal window. You can then chat with your model by using curl:

curl http://127.0.0.1:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-3-27b-it",
      "messages": [
        {
          "role": "user",
          "content": "Why is the sky blue?"
        }
      ]
    }'
The output is similar to the following:
{
  "id": "chatcmpl-e4a2e624bea849d9b09f838a571c4d9e",
  "object": "chat.completion",
  "created": 1741763029,
  "model": "google/gemma-3-27b-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "Okay, let's break down why the sky appears blue! It's a fascinating phenomenon rooted in physics, specifically something called **Rayleigh scattering**. Here's the explanation: ...",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 106
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "total_tokens": 668,
    "completion_tokens": 653,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
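Because the vLLM server exposes an OpenAI-compatible API, you can also request a streamed response. The following variation is a sketch that assumes vLLM's standard streaming support; it isn't part of the original tutorial:

curl http://127.0.0.1:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-3-27b-it",
      "messages": [
        {"role": "user", "content": "Why is the sky blue?"}
      ],
      "stream": true
    }'

With "stream": true, the reply arrives as a sequence of server-sent events instead of a single JSON object.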
If you want to observe your model's performance, you can use the vLLM dashboard integration in Cloud Monitoring. This dashboard helps you view critical performance metrics for your model, such as token throughput, network latency, and error rates. For more information, see vLLM in the Cloud Monitoring documentation.
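If you only need a quick look at the raw numbers behind that dashboard, the vLLM server also serves Prometheus-style metrics on the same port. With the port forward from the previous section still active, the following optional command prints the raw counters and histograms; this assumes vLLM's standard /metrics endpoint:

curl http://127.0.0.1:8000/metrics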