This document describes how to deploy and serve open models on Vertex AI using prebuilt container images. Vertex AI provides prebuilt containers for popular serving frameworks such as vLLM, Hex-LLM, and SGLang, and also supports Hugging Face Text Generation Inference (TGI), Text Embeddings Inference (TEI), the Hugging Face Inference Toolkit (via Google Cloud Hugging Face PyTorch Inference Containers), and TensorRT-LLM containers for serving supported models on Vertex AI.
vLLM is an open-source library for fast inference and serving of large language models (LLMs). Vertex AI uses an optimized and customized version of vLLM that is designed for enhanced performance, reliability, and seamless integration within Google Cloud. You can use this customized vLLM container image to serve models on Vertex AI, and the prebuilt container can download models from Hugging Face or from Cloud Storage. For more information, see Model serving with Vertex AI prebuilt vLLM container images.
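As a brief sketch, the following shows what a deployment with the prebuilt vLLM container can look like using the Vertex AI SDK for Python. The project ID, region, container image URI, model ID, serving arguments, and machine configuration are illustrative placeholders, not authoritative values; check the Model Garden deployment notebooks for the settings that match your model and container version.

```python
# Sketch: upload and deploy a model with the prebuilt vLLM container
# using the Vertex AI SDK for Python. All IDs, the container URI, and
# the machine configuration are placeholders to adapt to your setup.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Placeholder URI for the prebuilt vLLM serving container; look up the
# current image in Model Garden before deploying.
VLLM_IMAGE_URI = (
    "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/"
    "pytorch-vllm-serve"
)

model = aiplatform.Model.upload(
    display_name="gemma-2-2b-it-vllm",
    serving_container_image_uri=VLLM_IMAGE_URI,
    serving_container_args=[
        "python", "-m", "vllm.entrypoints.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        # A Hugging Face model ID or a gs:// Cloud Storage path.
        "--model=google/gemma-2-2b-it",
        "--tensor-parallel-size=1",
    ],
    serving_container_ports=[8080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    # Gated Hugging Face models also need an access token, for example
    # via serving_container_environment_variables={"HF_TOKEN": "..."}.
)

endpoint = model.deploy(
    machine_type="g2-standard-12",  # machine with one NVIDIA L4 GPU
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
```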
Example Notebooks
The following notebooks demonstrate how to use Vertex AI prebuilt containers for model serving. You can find more sample notebooks in the GitHub repository for Vertex AI samples.
| Notebook Name | Description | Direct Link |
|---|---|---|
| Vertex AI Model Garden - Gemma 3 (deployment) | Demonstrates deploying Gemma 3 models on GPU using vLLM. | View on GitHub |
| Vertex AI Model Garden - Serve Multimodal Llama 3.2 with vLLM | Deploys multimodal Llama 3.2 models using the vLLM prebuilt container. | View on GitHub |
| Vertex AI Model Garden - Hugging Face Text Generation Inference Deployment | Demonstrates deploying the Gemma-2-2b-it model with Text Generation Inference (TGI) from Hugging Face. | View on GitHub |
| Vertex AI Model Garden - Hugging Face Text Embeddings Inference Deployment | Demonstrates deploying nomic-ai/nomic-embed-text-v1 with Text Embeddings Inference (TEI) from Hugging Face. | View on GitHub |
| Vertex AI Model Garden - Hugging Face PyTorch Inference Deployment | Demonstrates deploying distilbert/distilbert-base-uncased-finetuned-sst-2-english with Hugging Face PyTorch Inference. | View on GitHub |
| Vertex AI Model Garden - DeepSeek Deployment | Demonstrates serving DeepSeek models with vLLM, SGLang, or TensorRT-LLM. | View on GitHub |
| Vertex AI Model Garden - Qwen3 Deployment | Demonstrates serving Qwen3 models with SGLang. | View on GitHub |
| Vertex AI Model Garden - Gemma 3n Deployment | Demonstrates serving Gemma 3n models with SGLang. | View on GitHub |
| Vertex AI Model Garden - Deep dive: Deploy Llama 3.1 and 3.2 with Hex-LLM | Demonstrates deploying Llama 3.1 and 3.2 models with Hex-LLM on TPUs through Vertex AI Model Garden. | View on GitHub |
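After a model is deployed through any of these notebooks, you can send requests to the resulting endpoint with the Vertex AI SDK for Python. The sketch below assumes a vLLM deployment; the endpoint resource name is a placeholder, and the instance fields shown (prompt, max_tokens, temperature) follow the schema used in the Model Garden vLLM notebooks, so other serving containers may expect a different request format.

```python
# Sketch: query a deployed endpoint. The endpoint resource name and the
# instance schema (prompt/max_tokens/temperature, as in the Model Garden
# vLLM notebooks) are assumptions to adapt to your deployment.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/ENDPOINT_ID"
)

response = endpoint.predict(
    instances=[{
        "prompt": "What is Vertex AI?",
        "max_tokens": 128,
        "temperature": 0.7,
    }]
)
print(response.predictions)
```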
What's next
- Choose an open model serving option
- Use open models with Model as a Service (MaaS)
- Deploy open models from Model Garden
- Deploy open models with a custom vLLM container