This guide describes the benefits and limitations of using Flex-start VMs with Vertex AI inference. This guide also describes how to deploy a model that uses Flex-start VMs.
Overview
You can reduce the cost of running your inference jobs by using Flex-start VMs, which are powered by Dynamic Workload Scheduler. Flex-start VMs offer significant discounts and are well-suited for short-duration workloads.
You can specify how long you need a Flex-start VM, for any duration up to seven days. After the requested time expires, your deployed model is automatically undeployed. You can also manually undeploy the model before the time expires.
Automatic undeployment
If you request a Flex-start VM for a specific duration, your model is automatically undeployed after that time period. For example, if you request a Flex-start VM for five hours, the model is automatically undeployed five hours after submission. You are only charged for the amount of time your workload is running.
Limitations and requirements
Consider the following limitations and requirements when you use Flex-start VMs:
- Maximum duration: Flex-start VMs have a maximum usage duration of seven days. Any deployment request for a longer duration will be rejected.
- TPU support: Using Flex-start VMs with TPU Pods isn't supported.
- Quota: Make sure you have sufficient Vertex AI preemptible quota before launching your job. To learn more, see Rate quotas.
- Queued provisioning: Using Flex-start VMs with queued provisioning isn't supported.
- Node recycling: Node recycling isn't supported.
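The seven-day limit can be checked on the client before a request is sent. The following is a minimal sketch, not part of any Vertex AI client library: the helper name `format_max_runtime` is hypothetical, and the output format (whole seconds followed by `s`) follows the description of MAX_RUNTIME_DURATION later in this guide.

```python
from datetime import timedelta

# Seven-day limit stated in this guide.
MAX_FLEX_START_DURATION = timedelta(days=7)


def format_max_runtime(duration: timedelta) -> str:
    """Return the duration as whole seconds with an "s" suffix, e.g. "3600s".

    Hypothetical helper for illustration; raises ValueError when the
    duration is not positive or exceeds the seven-day limit.
    """
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    if duration > MAX_FLEX_START_DURATION:
        raise ValueError("Flex-start VMs support at most seven days")
    return f"{int(duration.total_seconds())}s"


print(format_max_runtime(timedelta(hours=5)))  # 18000s
print(format_max_runtime(timedelta(days=7)))   # 604800s
```

Validating locally only avoids a round trip; the service still enforces the limit and rejects longer requests.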
Billing
If your workload runs for less than seven days, using Flex-start VMs can reduce your costs.
When you use Flex-start VMs, you're billed based on the duration of your job and the machine type that you select. You are only charged for the time your workload is actively running. You don't pay for the time that the job is in a queue or for any time after the requested duration has expired.
Billing is distributed across two SKUs:
- The Compute Engine SKU, with the label vertex-ai-online-prediction. See Dynamic Workload Scheduler pricing.
- The Vertex AI management fee SKU. See Vertex AI pricing.
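As a rough illustration of the billing rule described in this section, the sketch below computes the billed time for one deployment. It assumes only what this section states: running time is billed, queue time is not, and nothing is billed past the requested duration. The function name is hypothetical and no prices are modeled.

```python
def billable_seconds(running_seconds: int, requested_seconds: int) -> int:
    """Billed time: running time, capped at the requested duration.

    Queue time is not a parameter because it is never billed.
    """
    return max(0, min(running_seconds, requested_seconds))


# Requested five hours, manually undeployed after two hours of running.
print(billable_seconds(running_seconds=7200, requested_seconds=18000))   # 7200
# Ran for the full requested five hours, then automatically undeployed.
print(billable_seconds(running_seconds=18000, requested_seconds=18000))  # 18000
```

The actual charge depends on the machine type and the two SKUs listed above.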
Get inferences by using Flex-start VMs
To use Flex-start VMs when you deploy a model to get inferences, you can use the REST API.
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_ID: The ID for the endpoint.
- MODEL_ID: The ID for the model to be deployed.
- DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
- MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. Its default setting is n1-standard-2. Learn more about machine types.
- ACCELERATOR_TYPE: Optional. The type of accelerator to attach to the machine. Learn more.
- ACCELERATOR_COUNT: Optional. The number of accelerators for each replica to use.
- MAX_RUNTIME_DURATION: The maximum duration for the flex-start deployment. The deployed model is automatically undeployed after this duration. Specify the duration in seconds, ending with an s. For example, 3600s for one hour. The maximum value is 604800s (7 days).
- PROJECT_NUMBER: Your project's automatically generated project number.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
Request JSON body:
{
  "deployedModel": {
    "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",
    "displayName": "DEPLOYED_MODEL_NAME",
    "enableContainerLogging": true,
    "dedicatedResources": {
      "machineSpec": {
        "machineType": "MACHINE_TYPE",
        "acceleratorType": "ACCELERATOR_TYPE",
        "acceleratorCount": ACCELERATOR_COUNT
      },
      "flexStart": {
        "maxRuntimeDuration": "MAX_RUNTIME_DURATION"
      },
      "minReplicaCount": 2,
      "maxReplicaCount": 2
    }
  }
}
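The request above can also be assembled programmatically. The sketch below builds the URL and JSON body with the Python standard library only. The function name `build_deploy_request` and every sample value are made up for illustration. Sending the request is not shown, because it requires an authenticated HTTP client.

```python
import json


def build_deploy_request(project_id, location_id, endpoint_id, model_id,
                         display_name, max_runtime_duration,
                         machine_type="n1-standard-2",
                         accelerator_type=None, accelerator_count=None,
                         replica_count=2):
    """Return the (url, body) pair for the deployModel request shown above."""
    url = (
        f"https://{location_id}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project_id}/locations/{location_id}/"
        f"endpoints/{endpoint_id}:deployModel"
    )
    machine_spec = {"machineType": machine_type}
    # Accelerator fields are optional, so include them only when both are set.
    if accelerator_type and accelerator_count:
        machine_spec["acceleratorType"] = accelerator_type
        machine_spec["acceleratorCount"] = accelerator_count
    body = {
        "deployedModel": {
            "model": f"projects/{project_id}/locations/{location_id}/models/{model_id}",
            "displayName": display_name,
            "enableContainerLogging": True,
            "dedicatedResources": {
                "machineSpec": machine_spec,
                "flexStart": {"maxRuntimeDuration": max_runtime_duration},
                "minReplicaCount": replica_count,
                "maxReplicaCount": replica_count,
            },
        }
    }
    return url, body


# All values below are made-up placeholders for illustration.
url, body = build_deploy_request(
    project_id="my-project", location_id="us-central1",
    endpoint_id="1234567890", model_id="9876543210",
    display_name="my-flex-start-model", max_runtime_duration="3600s",
)
print("POST", url)
print(json.dumps(body, indent=2))
```

Serializing with `json.dumps` guarantees syntactically valid JSON, which avoids mistakes such as trailing commas in hand-edited request bodies.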
You should receive a JSON response similar to the following:
{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.DeployModelOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-10-19T17:53:16.502088Z",
      "updateTime": "2020-10-19T17:53:16.502088Z"
    }
  }
}
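The response describes a long-running operation identified by its `name` field. A client would typically keep that name so it can refer to the operation later. The sketch below parses a response of this shape and extracts the operation ID. The helper name `operation_id` and the sample IDs are made up for illustration.

```python
import json

# Sample response with made-up IDs in place of the placeholders.
sample = """
{
  "name": "projects/my-project/locations/us-central1/endpoints/123/operations/456",
  "metadata": {
    "genericMetadata": {"createTime": "2020-10-19T17:53:16.502088Z"}
  }
}
"""


def operation_id(response: dict) -> str:
    """Return the last path segment of the operation's resource name."""
    name = response["name"]
    _prefix, sep, op_id = name.rpartition("/operations/")
    if not sep or not op_id:
        raise ValueError(f"not an operation name: {name!r}")
    return op_id


response = json.loads(sample)
print(response["name"])
print(operation_id(response))  # 456
```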