xAI Grok models

xAI Grok models are available for use as managed APIs on Vertex AI. You can stream responses to reduce perceived latency for end users. A streamed response uses server-sent events (SSE) to return the response incrementally.
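
The following is a minimal sketch of consuming such a stream. It assumes the OpenAI-compatible SSE framing (lines of the form data: <json>, terminated by data: [DONE]) used by open model APIs on Vertex AI; the payload field names are assumptions, so check Call open model APIs for the authoritative format.

    # Sketch: read text deltas from a server-sent-events (SSE) chat completions stream.
    # Assumes OpenAI-compatible framing; field names such as "choices" and "delta"
    # are assumptions, not confirmed by this page.
    import json
    import requests

    def stream_chat(url, headers, body):
        """Yield text deltas from a streaming chat completions request."""
        with requests.post(url, headers=headers, json=dict(body, stream=True), stream=True) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines(decode_unicode=True):
                if not line or not line.startswith("data: "):
                    continue  # skip keep-alive blank lines and non-data fields
                payload = line[len("data: "):]
                if payload == "[DONE]":
                    break  # end-of-stream sentinel
                delta = json.loads(payload)["choices"][0].get("delta", {})
                if "content" in delta:
                    yield delta["content"]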

Managed xAI models

The following models are available from xAI for use on Vertex AI. To access an xAI model, go to its Model Garden model card.

Grok 4.1 Fast (Reasoning)

Grok 4.1 Fast (Reasoning) is xAI's most cost-effective model. It excels at tool calling for lightweight tasks, powers latency-sensitive applications, and shines in search-related tasks.

Go to the Grok 4.1 Fast (Reasoning) model card

Grok 4.1 Fast (Non-Reasoning)

Grok 4.1 Fast (Non-Reasoning) is xAI's most cost-effective model. It excels at tool calling for lightweight tasks, powers latency-sensitive applications, and shines in search-related tasks.

Go to the Grok 4.1 Fast (Non-Reasoning) model card

Grok 4.20 (Non-Reasoning)

Grok 4.20 (Non-Reasoning) is xAI's flagship model that offers industry-leading inference speed and reliable agentic tool calling for complex tasks.

Go to the Grok 4.20 (Non-Reasoning) model card

Use xAI models

For managed models, you can use curl commands to send requests to the Vertex AI endpoint. To learn how to make streaming and non-streaming calls to xAI models, see Call open model APIs.

Use the following model names in your requests (an example request sketch follows this list):

  • For Grok 4.1 Fast (Reasoning), use grok-4.1-fast-reasoning
  • For Grok 4.1 Fast (Non-Reasoning), use grok-4.1-fast-non-reasoning
  • For Grok 4.20 (Non-Reasoning), use grok-4.20-non-reasoning
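
As an illustration, the sketch below sends a non-streaming request using one of these model names. It assumes the OpenAI-compatible chat completions path exposed for open models on Vertex AI; the URL format, region, and project values are placeholders, so see Call open model APIs for the authoritative request format.

    # Sketch: call a managed Grok model through the Vertex AI OpenAI-compatible
    # chat completions endpoint. The URL path and placeholder values are assumptions.
    import subprocess
    import requests

    PROJECT_ID = "your-project-id"   # replace with your project ID
    LOCATION = "us-central1"         # replace with a region that serves the model

    # Fetch an access token from the gcloud CLI for the Authorization header.
    token = subprocess.check_output(
        ["gcloud", "auth", "print-access-token"], text=True
    ).strip()

    url = (
        f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
        f"/locations/{LOCATION}/endpoints/openapi/chat/completions"
    )
    body = {
        "model": "grok-4.1-fast-reasoning",  # one of the model names listed above
        "messages": [{"role": "user", "content": "Give a one-sentence status update."}],
        "stream": False,  # set to True to receive an SSE stream instead
    }

    resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json=body)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])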

Grok quotas

Grok models have a global quota. The quota is specified in queries per minute (QPM) and tokens per minute (TPM). TPM includes both input and output tokens.

To maintain overall service performance and acceptable use, the maximum quotas might vary by account and, in some cases, access might be restricted. View your project's quotas on the Quotas & System Limits page in the Google Cloud console. You must also have the following quotas available:

  • global_generate_content_requests_per_minute_per_project_per_base_model defines your QPM quota.

  • For TPM, there are two quota values that apply to particular models: global_generate_content_input_tokens_per_minute_per_base_model defines the input TPM quota, and global_generate_content_output_tokens_per_minute_per_base_model defines the output TPM quota.

To see which models count input and output tokens separately, see the specific model pages.
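
Because TPM includes both input and output tokens, and some models count them against separate quotas, it can help to track consumption on the client side. The following is a minimal sketch that sums tokens over a sliding one-minute window; the usage field names follow the OpenAI-compatible response format and are assumptions.

    # Sketch: track per-minute token consumption against the TPM quotas above.
    # Field names (usage.prompt_tokens / usage.completion_tokens) are assumptions
    # based on the OpenAI-compatible response format.
    import time
    from collections import deque

    class TokenRateTracker:
        """Sliding one-minute window of input and output token counts."""

        WINDOW_SECONDS = 60

        def __init__(self):
            self._events = deque()  # entries of (timestamp, input_tokens, output_tokens)

        def record(self, response_json):
            usage = response_json.get("usage", {})
            self._events.append(
                (time.time(), usage.get("prompt_tokens", 0), usage.get("completion_tokens", 0))
            )

        def tokens_last_minute(self):
            cutoff = time.time() - self.WINDOW_SECONDS
            while self._events and self._events[0][0] < cutoff:
                self._events.popleft()  # drop entries older than one minute
            input_tpm = sum(e[1] for e in self._events)
            output_tpm = sum(e[2] for e in self._events)
            return input_tpm, output_tpm  # compare against the input and output TPM quotas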

What's next