xAI Grok models are available for use as managed APIs on Vertex AI. You can stream responses to reduce perceived end-user latency. A streamed response uses server-sent events (SSE) to return the response incrementally.
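To show what consuming an SSE stream involves, here is a minimal sketch of a parser for OpenAI-style `data:` event lines. The sample payload and the `[DONE]` terminator are illustrative assumptions, not actual Grok output; see Call open model APIs for the real request flow.

```python
import json

def parse_sse_chunks(raw: str):
    """Extract JSON payloads from an SSE stream.

    Assumes the OpenAI-style convention: each event line starts with
    "data: " and the stream ends with "data: [DONE]".
    """
    events = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank separator lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        events.append(json.loads(payload))
    return events

# Illustrative sample stream (not actual API output):
sample = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n\n'
    'data: [DONE]\n\n'
)
chunks = parse_sse_chunks(sample)
text = "".join(c["choices"][0]["delta"]["content"] for c in chunks)
```

Concatenating the `delta` fragments as they arrive is what lets a UI render partial output before the full response completes.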
Managed xAI models
The following models are available from xAI to use in Vertex AI. To access an xAI model, go to its Model Garden model card.
Grok 4.1 Fast (Reasoning)
Grok 4.1 Fast (Reasoning) is xAI's most cost-effective model. It excels at tool calling for lightweight tasks, powers latency-sensitive applications, and shines in search-related tasks.
Go to the Grok 4.1 Fast (Reasoning) model card
Grok 4.1 Fast (Non-Reasoning)
Grok 4.1 Fast (Non-Reasoning) is xAI's most cost-effective model. It excels at tool calling for lightweight tasks, powers latency-sensitive applications, and shines in search-related tasks.
Go to the Grok 4.1 Fast (Non-Reasoning) model card
Grok 4.20 (Non-Reasoning)
Grok 4.20 (Non-Reasoning) is xAI's flagship model that offers industry-leading inference speed and reliable agentic tool calling for complex tasks.
Go to the Grok 4.20 (Non-Reasoning) model card
Use xAI models
For managed models, you can use curl commands to send requests to the Vertex AI endpoint. To learn how to make streaming and non-streaming calls to xAI models, see Call open model APIs. Use the following model names:
- For Grok 4.1 Fast (Reasoning), use grok-4.1-fast-reasoning
- For Grok 4.1 Fast (Non-Reasoning), use grok-4.1-fast-non-reasoning
- For Grok 4.20 (Non-Reasoning), use grok-4.20-non-reasoning
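As a sketch of how such a request might be assembled, the snippet below builds the URL and JSON body for a chat completions call. The endpoint path follows Vertex AI's OpenAI-compatible convention and the project ID, location, and prompt are placeholder assumptions; verify the exact path and authentication steps in Call open model APIs before use.

```python
import json

# Hypothetical placeholders -- substitute your own values.
PROJECT_ID = "my-project"
LOCATION = "global"
MODEL = "grok-4.1-fast-reasoning"  # one of the model names listed above

# Assumed OpenAI-compatible chat completions path for managed models.
url = (
    f"https://aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
    f"/locations/{LOCATION}/endpoints/openapi/chat/completions"
)

payload = {
    "model": MODEL,
    "stream": True,  # request an SSE streamed response
    "messages": [{"role": "user", "content": "Hello, Grok!"}],
}
body = json.dumps(payload)
# The request would be sent with an OAuth bearer token, e.g. via curl:
#   curl -X POST "$URL" -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#        -H "Content-Type: application/json" -d "$BODY"
```

Setting `"stream": true` is what switches the endpoint from a single JSON response to the SSE stream described above.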
Grok quotas
Grok models have a global quota. The quota is specified in queries per minute (QPM) and tokens per minute (TPM). TPM includes both input and output tokens.
To maintain overall service performance and acceptable use, the maximum quotas might vary by account and, in some cases, access might be restricted. View your project's quotas on the Quotas & System Limits page in the Google Cloud console. You must also have the following quotas available:
- global_generate_content_requests_per_minute_per_project_per_base_model defines your QPM quota.

For TPM, there are two quota values that apply to particular models:

- global_generate_content_input_tokens_per_minute_per_base_model defines the input TPM quota.
- global_generate_content_output_tokens_per_minute_per_base_model defines the output TPM quota.
To see which models count input and output tokens separately, see the specific model pages.
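Because TPM counts input and output tokens together, a client-side budget check must sum both sides of each request. The sketch below assumes a single combined TPM quota; the quota value and per-request token counts are illustrative only.

```python
TPM_QUOTA = 100_000  # illustrative combined tokens-per-minute quota

def tokens_used(requests):
    """Sum input + output tokens for requests made in the current minute."""
    return sum(r["input_tokens"] + r["output_tokens"] for r in requests)

# Hypothetical log of requests within one minute:
minute_log = [
    {"input_tokens": 1_200, "output_tokens": 800},
    {"input_tokens": 2_500, "output_tokens": 1_500},
]
used = tokens_used(minute_log)
remaining = TPM_QUOTA - used  # budget left before throttling
```

For models that count input and output tokens separately, you would track each sum against its own quota value instead.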
What's next
- Learn how to Call open model APIs.