GLM models

GLM models on Vertex AI are fully managed, serverless models offered as APIs. To use a GLM model on Vertex AI, send a request directly to the Vertex AI API endpoint. Because GLM models use a managed API, there's no need to provision or manage infrastructure.

You can stream responses to reduce perceived latency for end users. A streamed response uses server-sent events (SSE) to return the response incrementally.
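
As an illustration, here is a minimal curl sketch of a streaming request. The endpoint path, the PROJECT_ID placeholder, and the request body shape are assumptions based on the OpenAI-compatible API that Vertex AI exposes for open models; see Call open model APIs, referenced later on this page, for the authoritative format. The model name is the one listed in the following sections.

  # Sketch: streaming request. "stream": true tells the server to return
  # incremental server-sent events ("data: {...}" lines) instead of a
  # single JSON body. The -N flag disables curl's output buffering so
  # each SSE chunk prints as it arrives.
  curl -N -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/endpoints/openapi/chat/completions" \
    -d '{
      "model": "glm-4.7-maas",
      "messages": [{"role": "user", "content": "Tell me a short story."}],
      "stream": true
    }'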

Available GLM models

The following models are available from GLM to use in Vertex AI. To access a GLM model, go to its Model Garden model card.

GLM 4.7

GLM 4.7 is a model from GLM designed for core coding and vibe coding, tool use, and complex reasoning.

Go to the GLM 4.7 model card

Use GLM models

You can use curl commands to send requests to the Vertex AI endpoint using the following model names:

  • For GLM 4.7, use glm-4.7-maas

To learn how to make streaming and non-streaming calls to GLM models, see Call open model APIs.
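
For example, a minimal non-streaming request might look like the following sketch. The endpoint path and request shape are assumptions based on the OpenAI-compatible open model API; Call open model APIs has the authoritative details, and PROJECT_ID is a placeholder for your project ID.

  # Sketch: non-streaming chat completion request using the model name
  # from the list above. The response arrives as a single JSON body.
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/endpoints/openapi/chat/completions" \
    -d '{
      "model": "glm-4.7-maas",
      "messages": [{"role": "user", "content": "Write a haiku about the ocean."}]
    }'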

GLM model region availability and quotas

For GLM models, a quota applies for each region where the model is available. The quota is specified in queries per minute (QPM).

Model      Region            Quotas    Context length
GLM 4.7    global endpoint             200,000 tokens

If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see the Cloud Quotas overview.
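
If requests exceed the QPM quota before an increase is granted, the API rejects them. The following sketch assumes the conventional HTTP 429 rate-limit status is returned in that case (an assumption, not confirmed by this page) and retries with exponential backoff; the endpoint details follow the earlier examples.

  # Sketch: retry with exponential backoff when the per-region QPM quota
  # is exhausted (assumes the API signals this with HTTP 429).
  for attempt in 1 2 3; do
    status=$(curl -s -o response.json -w "%{http_code}" -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/endpoints/openapi/chat/completions" \
      -d '{"model": "glm-4.7-maas", "messages": [{"role": "user", "content": "Hi"}]}')
    [ "$status" != "429" ] && break    # stop retrying unless rate-limited
    sleep $(( 2 ** attempt ))          # back off: 2s, 4s, 8s (bash arithmetic)
  done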

What's next