This document describes how to deploy and serve open models on Vertex AI using prebuilt container images. Vertex AI provides prebuilt containers for popular serving frameworks such as vLLM, Hex-LLM, and SGLang, and also supports Hugging Face Text Generation Inference (TGI), Text Embeddings Inference (TEI), the Hugging Face Inference Toolkit (via Google Cloud Hugging Face PyTorch Inference Containers), and TensorRT-LLM containers for serving supported models on Vertex AI.
vLLM is an open-source library for fast inference and serving of large language models (LLMs). Vertex AI uses an optimized and customized version of vLLM that is designed for enhanced performance, reliability, and seamless integration within Google Cloud. You can use this customized vLLM container image to serve models on Vertex AI, and the prebuilt container can download models from Hugging Face or from Cloud Storage. For more information, see Model serving with Vertex AI prebuilt vLLM container images.
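As a brief sketch, the following shows what a deployment with the prebuilt vLLM container can look like using the Vertex AI SDK for Python. The project ID, region, container image URI, model ID, serving arguments, and machine configuration are illustrative placeholders, not authoritative values; check the Model Garden deployment notebooks for the settings that match your model and container version.

```python
# Sketch: upload and deploy a model with the prebuilt vLLM container
# using the Vertex AI SDK for Python. All IDs, the container URI, and
# the machine configuration are placeholders to adapt to your setup.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Placeholder URI for the prebuilt vLLM serving container; look up the
# current image in Model Garden before deploying.
VLLM_IMAGE_URI = (
    "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/"
    "pytorch-vllm-serve"
)

model = aiplatform.Model.upload(
    display_name="gemma-2-2b-it-vllm",
    serving_container_image_uri=VLLM_IMAGE_URI,
    serving_container_args=[
        "python", "-m", "vllm.entrypoints.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        # A Hugging Face model ID or a gs:// Cloud Storage path.
        "--model=google/gemma-2-2b-it",
        "--tensor-parallel-size=1",
    ],
    serving_container_ports=[8080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    # Gated Hugging Face models also need an access token, for example
    # via serving_container_environment_variables={"HF_TOKEN": "..."}.
)

endpoint = model.deploy(
    machine_type="g2-standard-12",  # machine with one NVIDIA L4 GPU
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
```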
Example Notebooks
The following notebooks demonstrate how to use Vertex AI prebuilt containers for model serving. You can find more sample notebooks in the GitHub repository for Vertex AI samples.
| Notebook Name | Description | Direct Link |
|---|---|---|
| Vertex AI Model Garden - Gemma 3 (deployment) | Demonstrates deploying Gemma 3 models on GPU using vLLM. | View on GitHub |
| Vertex AI Model Garden - Serve Multimodal Llama 3.2 with vLLM | Deploys multimodal Llama 3.2 models using the vLLM prebuilt container. | View on GitHub |
| Vertex AI Model Garden - Hugging Face Text Generation Inference Deployment | Demonstrates deploying the Gemma-2-2b-it model with Text Generation Inference (TGI) from Hugging Face. | View on GitHub |
| Vertex AI Model Garden - Hugging Face Text Embeddings Inference Deployment | Demonstrates deploying nomic-ai/nomic-embed-text-v1 with Text Embeddings Inference (TEI) from Hugging Face. | View on GitHub |
| Vertex AI Model Garden - Hugging Face PyTorch Inference Deployment | Demonstrates deploying distilbert/distilbert-base-uncased-finetuned-sst-2-english with Hugging Face PyTorch Inference. | View on GitHub |
| Vertex AI Model Garden - DeepSeek Deployment | Demonstrates serving DeepSeek models with vLLM, SGLang, or TensorRT-LLM. | View on GitHub |
| Vertex AI Model Garden - Qwen3 Deployment | Demonstrates serving Qwen3 models with SGLang. | View on GitHub |
| Vertex AI Model Garden - Gemma 3n Deployment | Demonstrates serving Gemma 3n models with SGLang. | View on GitHub |
| Vertex AI Model Garden - Deep dive: Deploy Llama 3.1 and 3.2 with Hex-LLM | Demonstrates deploying Llama 3.1 and 3.2 models with Hex-LLM on TPUs through Vertex AI Model Garden. | View on GitHub |
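After a model is deployed through any of these notebooks, you can send requests to the resulting endpoint with the Vertex AI SDK for Python. The sketch below assumes a vLLM deployment; the endpoint resource name is a placeholder, and the instance fields shown (prompt, max_tokens, temperature) follow the schema used in the Model Garden vLLM notebooks, so other serving containers may expect a different request format.

```python
# Sketch: query a deployed endpoint. The endpoint resource name and the
# instance schema (prompt/max_tokens/temperature, as in the Model Garden
# vLLM notebooks) are assumptions to adapt to your deployment.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/ENDPOINT_ID"
)

response = endpoint.predict(
    instances=[{
        "prompt": "What is Vertex AI?",
        "max_tokens": 128,
        "temperature": 0.7,
    }]
)
print(response.predictions)
```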
What's next
- Choose an open model serving option
- Use open models with Model as a Service (MaaS)
- Deploy open models from Model Garden
- Deploy open models with a custom vLLM container