In a rolling deployment, a deployed model is replaced with a new version of the same model. The new model reuses the compute resources from the previous one.
In the rolling deployment request, the traffic split and dedicatedResources
values are the same as for the previous deployment. After the rolling deployment
completes, the traffic split is updated to show that all of the traffic from the
previous DeployedModel has migrated to the new deployment.
Other configurable fields in DeployedModel (such as serviceAccount,
disableContainerLogging, and enableAccessLogging) are set to the same values
as for the previous DeployedModel by default. However, you can optionally
specify new values for these fields.
When a model is deployed using a rolling deployment, a new DeployedModel is
created. The new DeployedModel receives a new ID that is different from that
of the previous one. It also receives a new revisionNumber value in the
rolloutOptions field.
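For illustration, a DeployedModel created by a rolling deployment might be returned with fields like the following. The IDs and revision number here are examples only:

{
  "id": "NEW_DEPLOYED_MODEL_ID",
  "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID@2",
  "rolloutOptions": {
    "previousDeployedModel": "PREVIOUS_DEPLOYED_MODEL_ID",
    "revisionNumber": 2
  }
}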
If there are multiple rolling deployments targeting the same backing resources,
the DeployedModel with the highest revisionNumber is treated as the
intended final state.
As the rolling deployment progresses, the existing replicas of the previous
DeployedModel are replaced with replicas of the new DeployedModel. This
happens as quickly as possible: replicas are replaced whenever the deployment
has enough available replicas or enough surge capacity to bring up additional
replicas.
Additionally, as the rolling deployment progresses, traffic for the old
DeployedModel is gradually migrated to the new DeployedModel. Traffic is
load-balanced in proportion to the number of ready-to-serve replicas of each
DeployedModel. For example, if the old DeployedModel has three ready replicas
and the new one has two, the new DeployedModel receives roughly 40% of the
traffic.
If the rolling deployment's new replicas never become ready because their health
route consistently returns a non-200 response code, traffic isn't sent
to those unready replicas. In this case, the rolling deployment eventually
fails, and the replicas are reverted to the previous DeployedModel.
Start a rolling deployment
To start a rolling deployment, include the rolloutOptions field in the model
deployment request as shown in the following example.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_ID: The ID for the endpoint.
- MODEL_ID: The ID for the model to be deployed.
- PREVIOUS_DEPLOYED_MODEL: The DeployedModel ID of a model on the same endpoint. This specifies the DeployedModel whose backing resources are to be reused. You can call GetEndpoint to get a list of the models deployed on an endpoint, along with their numeric IDs.
- MAX_UNAVAILABLE_REPLICAS: The number of model replicas that can be taken down during the rolling deployment.
- MAX_SURGE_REPLICAS: The number of additional model replicas that can be brought up during the rolling deployment. If this is set to zero, then only the existing capacity is used.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
Request JSON body:
{
"deployedModel": {
"model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",
"rolloutOptions": {
"previousDeployedModel": "PREVIOUS_DEPLOYED_MODEL",
"maxUnavailableReplicas": "MAX_UNAVAILABLE_REPLICAS",
"maxSurgeReplicas": "MAX_SURGE_REPLICAS"
}
}
}
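For example, you can send this request with curl by saving the request body to a file named request.json. This sketch assumes the Google Cloud CLI is installed and authenticated to supply the access token:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel"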
You should receive a successful status code (2xx) and an empty response.
Optionally, you can replace maxSurgeReplicas, maxUnavailableReplicas, or both
with percentage values, as shown in the following example.
REST
Before using any of the request data, make the following replacements:
- MAX_UNAVAILABLE_PERCENTAGE: The percentage of model replicas that can be taken down during the rolling deployment.
- MAX_SURGE_PERCENTAGE: The percentage of additional model replicas that can be brought up during the rolling deployment. If this is set to zero, then only the existing capacity is used.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
Request JSON body:
{
"deployedModel": {
"model": "projects/PROJECT/locations/LOCATION_ID/models/MODEL_ID",
"rolloutOptions": {
"previousDeployedModel": "PREVIOUS_DEPLOYED_MODEL",
"maxUnavailablePercentage": "MAX_UNAVAILABLE_PERCENTAGE",
"maxSurgePercentage": "MAX_SURGE_PERCENTAGE"
}
}
}
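As a sketch, this variant can also be sent with the request body inline. The 10% and 20% values below are illustrative, and the Google Cloud CLI is again assumed for authentication:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d '{
    "deployedModel": {
      "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",
      "rolloutOptions": {
        "previousDeployedModel": "PREVIOUS_DEPLOYED_MODEL",
        "maxUnavailablePercentage": 10,
        "maxSurgePercentage": 20
      }
    }
  }' \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel"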
You should receive a successful status code (2xx) and an empty response.
Roll back a rolling deployment
To roll back a rolling deployment, start a new rolling deployment of the
previous model, using the ongoing rolling deployment's DeployedModel ID as the
previousDeployedModel.
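For example, a rollback request body might look like the following sketch, where ONGOING_DEPLOYED_MODEL_ID is the ID of the in-progress rolling deployment and MODEL_ID@1 is the previous model version (both values are illustrative):

{
  "deployedModel": {
    "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID@1",
    "rolloutOptions": {
      "previousDeployedModel": "ONGOING_DEPLOYED_MODEL_ID"
    }
  }
}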
To get the DeployedModel ID for an ongoing deployment, set the parameter
allDeploymentStates=true in the call to GetEndpoint, as shown in the
following example.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_ID: The ID for the endpoint.
HTTP method and URL:
GET https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID?allDeploymentStates=true
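For example, with curl (again assuming the Google Cloud CLI supplies the access token):

curl -X GET \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID?allDeploymentStates=true"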
You should receive a JSON response similar to the following:
{
"name": "projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID",
"displayName": "rolling-deployments-endpoint",
"deployedModels": [
{
"id": "2718281828459045",
"model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID@1",
"displayName": "rd-test-model",
"createTime": "2024-09-11T21:37:48.522692Z",
"dedicatedResources": {
"machineSpec": {
"machineType": "e2-standard-2"
},
"minReplicaCount": 5,
"maxReplicaCount": 5
},
"modelVersionId": "1",
"state": "BEING_DEPLOYED"
}
],
"etag": "AMEw9yMs3TdZMn8CUg-3DY3wS74bkIaTDQhqJ7-Ld_Zp7wgT8gsEfJlrCOyg67lr9dwn",
"createTime": "2024-09-11T21:22:36.588538Z",
"updateTime": "2024-09-11T21:27:28.563579Z",
"dedicatedEndpointEnabled": true,
"dedicatedEndpointDns": "ENDPOINT_ID.LOCATION_ID-PROJECT_ID.prediction.vertexai.goog"
}
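In this sample response, the deployedModels entry in the BEING_DEPLOYED state has the ID 2718281828459045; this is the value to pass as previousDeployedModel when rolling back the in-progress deployment.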
Constraints and limitations
- The previous DeployedModel must be on the same endpoint as the new DeployedModel.
- You can't create multiple rolling deployments with the same previousDeployedModel.
- You can't create a rolling deployment on top of a DeployedModel that isn't fully deployed. Exception: if previousDeployedModel is itself an in-progress rolling deployment, then a new rolling deployment can be created on top of it. This allows for rolling back deployments that start to fail.
- Previous models don't automatically undeploy after a rolling deployment completes successfully. You can undeploy them manually.
- For rolling deployments on shared public endpoints, the predictRoute and healthRoute for the new model must be the same as for the previous model.
- Rolling deployments aren't compatible with model cohosting.
- Rolling deployments can't be used for models that require online explanations.