Deploy open models with prebuilt containers

This document describes how to deploy and serve open models on Vertex AI using prebuilt container images. Vertex AI provides prebuilt containers for popular serving frameworks such as vLLM, Hex-LLM, and SGLang. It also supports Hugging Face Text Generation Inference (TGI), Text Embeddings Inference (TEI), the Hugging Face Inference Toolkit (through the Google Cloud Hugging Face PyTorch Inference containers), and TensorRT-LLM containers for serving supported models on Vertex AI.

vLLM is an open-source library for fast inference and serving of large language models (LLMs). Vertex AI uses an optimized and customized version of vLLM that is designed for enhanced performance, reliability, and seamless integration with Google Cloud. You can use Vertex AI's customized vLLM container image to serve models on Vertex AI. The prebuilt vLLM container can download models from Hugging Face or from Cloud Storage. For more information about model serving with Vertex AI prebuilt vLLM container images, see Model serving with Vertex AI prebuilt vLLM container images.
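As an illustration, the following sketch uses the Vertex AI SDK for Python to register a Hugging Face model with a prebuilt vLLM serving container and deploy it to a GPU-backed endpoint. The container image URI, the vLLM launch arguments, the serving routes, and the machine configuration shown here are assumptions that vary by container release and model; copy the current values from the Model Garden notebooks listed in the next section.

```python
from google.cloud import aiplatform

# Assumed placeholders -- replace with your own values.
PROJECT_ID = "my-project"
REGION = "us-central1"
# Prebuilt vLLM serving image; the exact URI and tag come from Model Garden
# and change between releases.
VLLM_IMAGE_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve"
MODEL_ID = "google/gemma-2-2b-it"  # Hugging Face model ID or a gs:// path

aiplatform.init(project=PROJECT_ID, location=REGION)

# Launch arguments for the vLLM server inside the container (assumed; see the
# Model Garden notebooks for arguments that match your image version).
vllm_args = [
    "python", "-m", "vllm.entrypoints.api_server",
    "--host=0.0.0.0",
    "--port=8080",
    f"--model={MODEL_ID}",
    "--tensor-parallel-size=1",
    "--gpu-memory-utilization=0.9",
]

# Register the model with the prebuilt vLLM serving container.
model = aiplatform.Model.upload(
    display_name="gemma-2-2b-it-vllm",
    serving_container_image_uri=VLLM_IMAGE_URI,
    serving_container_args=vllm_args,
    serving_container_ports=[8080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    # Needed only for gated Hugging Face models.
    serving_container_environment_variables={"HF_TOKEN": "your-hf-token"},
)

# Deploy to a GPU-backed endpoint (one NVIDIA L4 on a g2-standard-12 here).
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

# Send a text generation request. The instance fields follow the request
# schema used in the vLLM serving notebooks.
response = endpoint.predict(
    instances=[{"prompt": "What is Vertex AI?", "max_tokens": 128, "temperature": 0.7}]
)
print(response.predictions)
```

This sketch mirrors the flow used in the vLLM deployment notebooks below; the notebooks also cover multi-GPU configurations and loading model weights from Cloud Storage.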

Example notebooks

The following notebooks demonstrate how to use Vertex AI prebuilt containers for model serving. You can find more sample notebooks in the GitHub repository for Vertex AI samples.

| Notebook name | Description | Direct link (GitHub/Colab) |
| --- | --- | --- |
| Vertex AI Model Garden - Gemma 3 (deployment) | Demonstrates deploying Gemma 3 models on GPU using vLLM. | View on GitHub |
| Vertex AI Model Garden - Serve Multimodal Llama 3.2 with vLLM | Deploys multimodal Llama 3.2 models using the vLLM prebuilt container. | View on GitHub |
| Vertex AI Model Garden - Hugging Face Text Generation Inference Deployment | Demonstrates deploying the Gemma-2-2b-it model with Hugging Face Text Generation Inference (TGI). | View on GitHub |
| Vertex AI Model Garden - Hugging Face Text Embeddings Inference Deployment | Demonstrates deploying nomic-ai/nomic-embed-text-v1 with Hugging Face Text Embeddings Inference (TEI). | View on GitHub |
| Vertex AI Model Garden - Hugging Face PyTorch Inference Deployment | Demonstrates deploying distilbert/distilbert-base-uncased-finetuned-sst-2-english with Hugging Face PyTorch Inference. | View on GitHub |
| Vertex AI Model Garden - DeepSeek Deployment | Demonstrates serving DeepSeek models with vLLM, SGLang, or TensorRT-LLM. | View on GitHub |
| Vertex AI Model Garden - Qwen3 Deployment | Demonstrates serving Qwen3 models with SGLang. | View on GitHub |
| Vertex AI Model Garden - Gemma 3n Deployment | Demonstrates serving Gemma 3n models with SGLang. | View on GitHub |
| Vertex AI Model Garden - Deep dive: Deploy Llama 3.1 and 3.2 with Hex-LLM | Demonstrates deploying Llama 3.1 and 3.2 models with Hex-LLM on TPUs through Vertex AI Model Garden. | View on GitHub |

What's next