Use reservations with online inference

This document explains how to use Compute Engine reservations to gain a high level of assurance that your online inference jobs have the necessary virtual machine (VM) resources to run.

Reservations are a Compute Engine feature. They help ensure that you have the resources available to create VMs with the same hardware (memory and vCPUs) and optional resources (CPUs, GPUs, TPUs, and Local SSD disks) whenever you need them.

When you create a reservation, Compute Engine verifies that the requested capacity is available in the specified zone. If so, then Compute Engine reserves the resources, creates the reservation, and the following happens:

You can immediately consume the reserved resources, and they remain available until you delete the reservation.
You're charged for the reserved resources at the same on-demand rate as running VMs, including any applicable discounts, until the reservation is deleted. A VM consuming a reservation doesn't incur separate charges. You're charged only for the resources outside of the reservation, such as disks or IP addresses. To learn more, see pricing for reservations.

Limitations and requirements

When using Compute Engine reservations with Agent Platform, consider the following limitations and requirements:

Agent Platform can only use reservations for CPUs, GPU VMs, or TPUs (Preview).
Agent Platform can't consume reservations of VMs that have Local SSD disks manually attached.
Using Compute Engine reservations with Agent Platform is only supported for Gemini Enterprise Agent Platform serverless training, inference, and Gemini Enterprise Agent Platform Workbench (Preview).
A reservation's VM properties must match exactly with your Agent Platform workload to consume the reservation. For example, if a reservation specifies an a2-ultragpu-8g machine type, then the Agent Platform workload can only consume the reservation if it also uses an a2-ultragpu-8g machine type. See Requirements.
To consume a shared reservation of GPU VMs or TPUs, you must consume it using its owner project or a consumer project with which the reservation is shared. See How shared reservations work.
To support regular updates of your Agent Platform deployments, we recommend reserving more VMs than the total number of replicas, as follows, according to the reservation type used by your DeployedModel:
- SPECIFIC_RESERVATION: Must specify at least 1 additional VM; we recommend 10% (but at least 1). Deployed models using SPECIFIC_RESERVATION are guaranteed to only consume VMs from the reservation. Agent Platform cannot perform updates if there is no additional VM.
- ANY:
To consume a SPECIFIC_RESERVATION reservation, grant the Compute Viewer IAM role to the Agent Platform service account in the project that owns the reservations (service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com, where PROJECT_NUMBER is the project number of the project that consumes the reservation).
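The headroom guidance for SPECIFIC_RESERVATION can be sketched as a small calculation. This is an illustrative sketch, not an official formula: the function name recommended_reserved_vms is hypothetical, and it assumes that the 10% recommendation applies to the deployment's replica count, rounded up.

```python
def recommended_reserved_vms(replica_count: int) -> int:
    """Return a reservation VM count that leaves headroom for updates.

    Assumption: headroom is 10% of the replica count, rounded up,
    and never less than 1 additional VM.
    """
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    # Integer ceiling of 10%; avoids floating-point rounding surprises.
    ten_percent = (replica_count + 9) // 10
    headroom = max(1, ten_percent)
    return replica_count + headroom


if __name__ == "__main__":
    for replicas in (1, 8, 25):
        print(replicas, "replicas ->", recommended_reserved_vms(replicas), "reserved VMs")
```

Under these assumptions, a deployment with 8 replicas would target a reservation of 9 VMs, and one with 25 replicas would target 28.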

Billing

When using Compute Engine reservations, you're billed for the following:

Compute Engine pricing for the Compute Engine resources, including any applicable committed use discounts (CUDs). See Compute Engine pricing.
Agent Platform online inference management fees in addition to your infrastructure usage. See Prediction pricing.

Note: When consuming from a reservation or spot capacity, billing is spread across two SKUs: the Compute Engine SKU with the label goog-vertex-ai-product:vertex-ai-online-prediction and the Agent Platform Management Fee SKU. This enables you to use your Committed Use Discounts (CUDs) in Agent Platform.

Before you begin

Review the requirements and restrictions for reservations.
Review the quota requirements and restrictions for shared reservations.

Allow a reservation to be consumed

Before consuming a reservation of CPUs, GPU VMs, or TPUs, you must set its sharing policy to allow Agent Platform to consume the reservation. To do so, use one of the following methods:

Allow consumption while creating a reservation
Allow consumption of an existing reservation

Allow consumption while creating a reservation

When creating a single-project or shared reservation of GPU VMs, you can allow Agent Platform to consume the reservation as follows:

If you're using the Google Cloud console, then, in the Google Cloud services section, select Share reservation.
If you're using the Google Cloud CLI, then include the --reservation-sharing-policy flag set to ALLOW_ALL.
If you're using the REST API, then, in the request body, include the serviceShareType field set to ALLOW_ALL.
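For the gcloud CLI option, the following sketch assembles a reservation-creation command as an argument list and prints it for review. Only the --reservation-sharing-policy flag comes from this page; the other flags (--zone, --machine-type, --vm-count) reflect a typical gcloud compute reservations create invocation and should be checked against the gcloud reference. The helper name and the example values are hypothetical.

```python
import shlex


def build_create_reservation_command(
    name: str, zone: str, machine_type: str, vm_count: int
) -> list[str]:
    """Build a gcloud command that creates a reservation shareable with Google Cloud services."""
    if vm_count < 1:
        raise ValueError("vm_count must be at least 1")
    return [
        "gcloud", "compute", "reservations", "create", name,
        f"--zone={zone}",                  # assumed flag; verify in the gcloud reference
        f"--machine-type={machine_type}",  # assumed flag; verify in the gcloud reference
        f"--vm-count={vm_count}",          # assumed flag; verify in the gcloud reference
        "--reservation-sharing-policy=ALLOW_ALL",  # flag and value named on this page
    ]


if __name__ == "__main__":
    # Hypothetical example values.
    command = build_create_reservation_command(
        "example-reservation", "us-central1-a", "a2-ultragpu-8g", 2
    )
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(command))
```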

Allow consumption of an existing reservation

You can modify an auto-created reservation of GPU VMs or TPUs for a future reservation only after the reservation's start time.

To allow Agent Platform to consume an existing reservation, use one of the following methods:

Allow consumption of multiple specific reservations

You can allow consumption of multiple specific reservations by listing two or more reservation names, in priority order, in the values field of the reservation affinity specification.

Each reservation must be shared with Agent Platform, and the zone of each reservation must be in the region of the endpoint. Apart from these constraints, you can mix reservations across source projects and multiple zones.
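As a rough sketch of these rules, the following helper builds the full reservation names in priority order and rejects any reservation whose zone is outside the endpoint's region. The function name is hypothetical, and it assumes that a zone name is its region name plus a one-part suffix (for example, us-central1-a in us-central1).

```python
def build_reservation_values(
    endpoint_region: str, reservations: list[tuple[str, str, str]]
) -> list[str]:
    """Return full reservation names, preserving the given priority order.

    Each item in `reservations` is (project_id, zone, reservation_name).
    """
    values = []
    for project_id, zone, name in reservations:
        # Assumption: "us-central1-a" belongs to region "us-central1".
        if zone.rsplit("-", 1)[0] != endpoint_region:
            raise ValueError(f"zone {zone!r} is not in region {endpoint_region!r}")
        values.append(f"projects/{project_id}/zones/{zone}/reservations/{name}")
    return values


if __name__ == "__main__":
    # Hypothetical projects and reservations; the first entry has the highest priority.
    print(build_reservation_values(
        "us-central1",
        [
            ("project-a", "us-central1-a", "primary"),
            ("project-b", "us-central1-c", "fallback"),
        ],
    ))
```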

Verify that a reservation is consumed

To verify that the reservation is being consumed, see Verify reservations consumption in the Compute Engine documentation.

Get online inferences by using a reservation

To create a model deployment that consumes a Compute Engine reservation of GPU VMs, use the REST API or Agent Platform SDK for Python.

REST

Before using any of the request data, make the following replacements:

LOCATION_ID: The region where you are using Agent Platform.
PROJECT_ID: the project where the reservation was created. To consume a shared reservation from another project, you must share the reservation with that project. For more information, see Modify the consumer projects in a shared reservation.
ENDPOINT_ID: The ID for the endpoint.
MODEL_ID: The ID for the model to be deployed.
DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
MACHINE_TYPE: the machine type to use for each node in this deployment. Its default setting is n1-standard-2. For more information about the supported machine types, see Configure compute resources for prediction.
ACCELERATOR_TYPE: the type of accelerator to attach to the machine. For more information about the type of GPU that each machine type supports, see GPUs for compute workloads.
ACCELERATOR_COUNT: the number of accelerators to attach to the machine.
RESERVATION_AFFINITY_TYPE: Must be ANY, SPECIFIC_RESERVATION, or NONE.
- ANY means that the VMs of your deployment can automatically consume any reservation with matching properties.
- SPECIFIC_RESERVATION means that the VMs of your deployment can consume only reservations that the VMs specifically target by name.
- NONE means that the VMs of your deployment can't consume any reservation. Specifying NONE has the same effect as omitting a reservation affinity specification.
ZONE: the zone where the reservation was created.
RESERVATION_NAME_N: the names of your reservations, in priority order. Each must be the full resource name of the reservation or reservation block.
MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to this number of nodes and never fewer than the minimum number of nodes.
TRAFFIC_SPLIT_THIS_MODEL: the percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
DEPLOYED_MODEL_ID_N: Optional. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
TRAFFIC_SPLIT_MODEL_N: the traffic split percentage value for the deployed model ID key.
PROJECT_NUMBER: Your project's automatically generated project number.

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel

Request JSON body:

{
  "deployedModel": {
    "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",
    "displayName": "DEPLOYED_MODEL_NAME",
    "dedicatedResources": {
      "machineSpec": {
        "machineType": "MACHINE_TYPE",
        "acceleratorType": "ACCELERATOR_TYPE",
        "acceleratorCount": ACCELERATOR_COUNT,
        "reservationAffinity": {
          "reservationAffinityType": "RESERVATION_AFFINITY_TYPE",
          "key": "compute.googleapis.com/reservation-name",
          "values": [
            "projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME_1",
            "projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME_2"
          ]
        }
      },
      "minReplicaCount": MIN_REPLICA_COUNT,
      "maxReplicaCount": MAX_REPLICA_COUNT
    }
  },
  "trafficSplit": {
    "0": TRAFFIC_SPLIT_THIS_MODEL,
    "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1,
    "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2
  }
}
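If you prefer to generate request.json programmatically, the following sketch builds the same body in Python and checks the constraints described in the replacement list: a minimum replica count of at least 1, a maximum no smaller than the minimum, and traffic percentages that add up to 100. The function name and all example values are hypothetical; the field names are those of the request body shown above.

```python
import json


def build_deploy_request(
    model: str,
    display_name: str,
    machine_spec: dict,
    min_replicas: int,
    max_replicas: int,
    traffic_split: dict,
) -> dict:
    """Assemble a deployModel request body after basic validation."""
    if min_replicas < 1:
        raise ValueError("minReplicaCount must be greater than or equal to 1")
    if max_replicas < min_replicas:
        raise ValueError("maxReplicaCount must not be less than minReplicaCount")
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic split percentages must add up to 100")
    return {
        "deployedModel": {
            "model": model,
            "displayName": display_name,
            "dedicatedResources": {
                "machineSpec": machine_spec,
                "minReplicaCount": min_replicas,
                "maxReplicaCount": max_replicas,
            },
        },
        "trafficSplit": traffic_split,
    }


if __name__ == "__main__":
    # Hypothetical example values.
    body = build_deploy_request(
        model="projects/example-project/locations/us-central1/models/1234567890",
        display_name="example-deployment",
        machine_spec={
            "machineType": "a2-ultragpu-8g",
            "reservationAffinity": {
                "reservationAffinityType": "SPECIFIC_RESERVATION",
                "key": "compute.googleapis.com/reservation-name",
                "values": [
                    "projects/example-project/zones/us-central1-a/reservations/primary"
                ],
            },
        },
        min_replicas=1,
        max_replicas=2,
        traffic_split={"0": 100},
    )
    print(json.dumps(body, indent=2))
```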

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel"

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.DeployModelOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-10-19T17:53:16.502088Z",
      "updateTime": "2020-10-19T17:53:16.502088Z"
    }
  }
}
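When scripting this call, it can help to build the request URL from its parts and to pull the operation ID out of the response. The sketch below does only string handling and sends no request; both function names are hypothetical, and the response shape is assumed to match the example above.

```python
def deploy_model_url(location_id: str, project_id: str, endpoint_id: str) -> str:
    """Build the deployModel URL shown in this section."""
    return (
        f"https://{location_id}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{location_id}/endpoints/{endpoint_id}:deployModel"
    )


def operation_id_from_response(response: dict) -> str:
    """Extract OPERATION_ID from the 'name' field of the response."""
    name = response["name"]
    if "/operations/" not in name:
        raise ValueError(f"unexpected operation name: {name!r}")
    return name.rsplit("/operations/", 1)[1]


if __name__ == "__main__":
    # Hypothetical example values.
    print(deploy_model_url("us-central1", "example-project", "1234"))
    print(operation_id_from_response(
        {"name": "projects/example-project/locations/us-central1/endpoints/1234/operations/5678"}
    ))
```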

Python

To learn how to install or update the Agent Platform SDK for Python, see Install the Agent Platform SDK for Python. For more information, see the Agent Platform SDK for Python API reference documentation.

Before running any of the following scripts, make the following replacements:

DEPLOYED_NAME: a name for the deployed model.
TRAFFIC_SPLIT: the traffic split percentage value for the deployed model ID key.
MACHINE_TYPE: the machine type used for each node of this deployment. Its default setting is n1-standard-2. Learn more about machine types.
ACCELERATOR_TYPE: the type of accelerator to attach to the machine. For more information about the type of GPU that each machine type supports, see GPUs for compute workloads.
ACCELERATOR_COUNT: the number of accelerators to attach to the machine.
PROJECT_ID: the project where the reservation was created. To consume a shared reservation from another project, you must share the reservation with that project. For more information, see Modify the consumer projects in a shared reservation.
ZONE: the zone where the reservation is located.
RESERVATION_NAME_N: the names of your reservations, in priority order. Each must be the full resource name of the reservation or reservation block.
MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to this number of nodes and never fewer than the minimum number of nodes.

Depending on the type of reservation that you want to consume, do one of the following:

To consume one or more specific reservations:

# Assumes endpoint5 is an existing Endpoint object and model is an existing Model object.
endpoint5.deploy(
    model=model,
    deployed_model_display_name=DEPLOYED_NAME,
    traffic_split=TRAFFIC_SPLIT,
    machine_type="MACHINE_TYPE",
    accelerator_type="ACCELERATOR_TYPE",
    accelerator_count=ACCELERATOR_COUNT,
    reservation_affinity_type="SPECIFIC_RESERVATION",
    reservation_affinity_key="compute.googleapis.com/reservation-name",
    reservation_affinity_values=[
        "projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME_1",
        "projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME_2"
        ],
    min_replica_count=MIN_REPLICA_COUNT,
    max_replica_count=MAX_REPLICA_COUNT,
    sync=True
)

To automatically consume any matching reservation:

# Assumes endpoint5 is an existing Endpoint object and model is an existing Model object.
endpoint5.deploy(
    model=model,
    deployed_model_display_name=DEPLOYED_NAME,
    traffic_split=TRAFFIC_SPLIT,
    machine_type="MACHINE_TYPE",
    accelerator_type="ACCELERATOR_TYPE",
    accelerator_count=ACCELERATOR_COUNT,
    reservation_affinity_type="ANY_RESERVATION",
    min_replica_count=MIN_REPLICA_COUNT,
    max_replica_count=MAX_REPLICA_COUNT,
    sync=True
)
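The two calls above differ only in their reservation arguments. One way to keep a single deploy call is to compute those arguments separately, as in the sketch below. The helper name is hypothetical; the keyword names and values are the ones used in the examples above.

```python
def reservation_kwargs(reservation_names=None) -> dict:
    """Return the reservation-related keyword arguments for a deploy call.

    Pass a list of full reservation names (in priority order) to target
    specific reservations, or nothing to consume any matching reservation.
    """
    if reservation_names:
        return {
            "reservation_affinity_type": "SPECIFIC_RESERVATION",
            "reservation_affinity_key": "compute.googleapis.com/reservation-name",
            "reservation_affinity_values": list(reservation_names),
        }
    return {"reservation_affinity_type": "ANY_RESERVATION"}


if __name__ == "__main__":
    # Hypothetical reservation name.
    print(reservation_kwargs(
        ["projects/example-project/zones/us-central1-a/reservations/primary"]
    ))
    print(reservation_kwargs())
```

The result can then be unpacked into the call, for example endpoint5.deploy(model=model, ..., **reservation_kwargs(names)).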

What's next

Learn more about reservations of Compute Engine zonal resources.
Learn how to use reservations with Agent Platform batch inference.
Learn how to use reservations with Agent Platform training.
Learn how to view reservations.
Learn how to monitor reservations consumption.
