Overview
Before you serve online inference with private endpoints, you must configure private services access to create peering connections between your network and Vertex AI. If you have already set this up, you can use your existing peering connections.
This guide covers the following tasks:
- Verifying the status of your existing peering connections.
- Verifying the necessary APIs are enabled.
- Creating a private endpoint.
- Deploying a model to a private endpoint.
  - Each private endpoint supports only one model. This differs from a public Vertex AI endpoint, where you can split traffic across multiple models deployed to one endpoint.
  - Private endpoints support AutoML tabular and custom-trained models.
- Sending an inference to a private endpoint.
- Cleaning up resources.
Check the status of existing peering connections
If you already have peering connections that you use with Vertex AI, you can list them to check their status:
gcloud compute networks peerings list --network NETWORK_NAME
You should see that the state of your peering connections is ACTIVE.
Learn more about active peering connections.
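If you want to script this check, the following is a minimal sketch in Python. It assumes you ran the command above with `--format=json` and that the output is a list of networks, each with a `peerings` list whose entries carry `name` and `state` fields. The function name and the sample data are hypothetical.

```python
import json
from typing import List


def inactive_peerings(gcloud_json: str) -> List[str]:
    """Return the names of peerings whose state is not ACTIVE.

    Assumes the JSON shape described above (a list of networks, each with
    a "peerings" list); verify it against your own gcloud output.
    """
    names = []
    for network in json.loads(gcloud_json):
        for peering in network.get("peerings", []):
            if peering.get("state") != "ACTIVE":
                names.append(peering.get("name", "<unnamed>"))
    return names


# Hypothetical sample output, for illustration only.
sample = json.dumps([
    {
        "name": "my-vpc",
        "peerings": [
            {"name": "servicenetworking-googleapis-com", "state": "ACTIVE"},
            {"name": "old-peering", "state": "INACTIVE"},
        ],
    }
])
print(inactive_peerings(sample))
```

An empty result means every listed peering reported ACTIVE.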
Enable the necessary APIs
gcloud services enable aiplatform.googleapis.com
gcloud services enable dns.googleapis.com
Create a private endpoint
To create a private endpoint, add the --network flag when you create an
endpoint using the Google Cloud CLI:
gcloud beta ai endpoints create \
  --display-name=ENDPOINT_DISPLAY_NAME \
  --network=FULLY_QUALIFIED_NETWORK_NAME \
  --region=REGION
Replace FULLY_QUALIFIED_NETWORK_NAME with the fully qualified network name:
projects/PROJECT_NUMBER/global/networks/NETWORK_NAME
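A common mistake is to pass a project ID where the project number is expected. The sketch below builds the fully qualified name in the format shown above and rejects non-numeric project values. The helper name is hypothetical.

```python
import re


def fully_qualified_network(project_number: str, network_name: str) -> str:
    """Build projects/PROJECT_NUMBER/global/networks/NETWORK_NAME."""
    # The format calls for the numeric project number, not the project ID.
    if not re.fullmatch(r"\d+", project_number):
        raise ValueError("expected a numeric project number, not a project ID")
    return f"projects/{project_number}/global/networks/{network_name}"


print(fully_qualified_network("123456789", "my-vpc"))
```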
If you create the endpoint without specifying a network, then you create a public endpoint.
Limitations of private endpoints
Note the following limitations for private endpoints:
- Private endpoints don't support traffic splitting. As a workaround, you can split traffic manually by deploying your model to multiple private endpoints and dividing requests among the inference URLs of those endpoints.
- Private endpoints don't support SSL/TLS.
- To enable access logging on a private endpoint, contact vertex-ai-feedback@google.com.
- You can use only one network for all private endpoints in a Google Cloud
project. If you want to change to another network,
contact vertex-ai-feedback@google.com.
- Client-side retries on recoverable errors are highly recommended. These can
include the following errors:
  - Empty response (HTTP error code 0), possibly due to a transient broken connection.
  - HTTP error codes 5xx that indicate the service might be temporarily unavailable.
- For the HTTP error code 429 that indicates the system is overloaded, consider slowing down traffic to mitigate this issue instead of retrying.
- Inference requests from PredictionServiceClient in the Vertex AI Python client library are not supported.
- Private services access endpoints don't support tuned foundation models. For a tuned foundation model, deploy it using a Private Service Connect endpoint.
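The retry guidance in the list above can be sketched as a small client-side policy. This is one possible reading, not a prescribed implementation: the function names, the four outcome labels, and the backoff parameters (0.5 second base, 30 second cap, full jitter) are all assumptions chosen for illustration.

```python
import random


def classify_status(status: int) -> str:
    """Map an HTTP status code to 'ok', 'retry', 'slow_down', or 'fail'."""
    # Code 0 (empty response) and 5xx are treated as recoverable.
    if status == 0 or 500 <= status <= 599:
        return "retry"
    # 429 signals overload: reduce the request rate instead of retrying.
    if status == 429:
        return "slow_down"
    if 200 <= status <= 299:
        return "ok"
    return "fail"


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter; the parameter values are assumptions."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


for code in (0, 503, 429, 200, 400):
    print(code, classify_status(code))
```

A caller would sleep for `backoff_delay(attempt)` seconds before each retry, and lower its sending rate when it sees `slow_down`.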
Monitor private endpoints
You can use the metrics dashboard to inspect the availability and latency of the traffic sent to a private endpoint.
To customize monitoring, query the following metrics in Cloud Monitoring:
- aiplatform.googleapis.com/prediction/online/private/response_count: The number of inference responses. You can filter this metric by deployed_model_id or HTTP response code.
- aiplatform.googleapis.com/prediction/online/private/prediction_latencies: The latency of the inference request in milliseconds. You can filter this metric by deployed_model_id, only for successful requests.
Learn how to select, query, and display these metrics in Metrics Explorer.
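As a rough illustration, a Cloud Monitoring filter string for one of these metrics could be assembled as follows. This sketch assumes that `deployed_model_id` is exposed as a metric label (`metric.labels.deployed_model_id`); if it turns out to be a resource label in your project, adjust the prefix accordingly. The helper name is hypothetical.

```python
from typing import Optional

RESPONSE_COUNT = "aiplatform.googleapis.com/prediction/online/private/response_count"


def metric_filter(metric_type: str, deployed_model_id: Optional[str] = None) -> str:
    """Build a Monitoring filter string for a metric type and optional model ID."""
    parts = [f'metric.type = "{metric_type}"']
    if deployed_model_id is not None:
        # Assumption: deployed_model_id is a metric label, not a resource label.
        parts.append(f'metric.labels.deployed_model_id = "{deployed_model_id}"')
    return " AND ".join(parts)


print(metric_filter(RESPONSE_COUNT, "456"))
```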
Deploy a model
You can import a new model, or deploy an existing model that you have already
uploaded. To upload a new model, use
gcloud ai models upload.
For more information, see
Import models to Vertex AI.
- To deploy a model to a private endpoint, see the guide to deploy models. Except for traffic splitting and manually enabling access logging, you can use any of the other options available for deploying custom-trained models. Refer to the limitations of private endpoints to learn more about how they are different from public endpoints.
- After you deploy the model, you can get the inference URI from the metadata of your private endpoint.
  - If you have the display name of your private endpoint, run this command to get the endpoint ID:
ENDPOINT_ID=$(gcloud ai endpoints list \
  --region=REGION \
  --filter=displayName:ENDPOINT_DISPLAY_NAME \
  --format="value(ENDPOINT_ID.scope())")
  - Otherwise, to view the endpoint ID and display name for all of your endpoints, run the following command:
gcloud ai endpoints list --region=REGION
  - Finally, to get the inference URI, run the following command:
gcloud beta ai endpoints describe ENDPOINT_ID \
  --region=REGION \
  --format="value(deployedModels.privateEndpoints.predictHttpUri)"
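If you prefer to read the URI from structured output, the sketch below pulls `predictHttpUri` out of the endpoint description. It assumes you ran the describe command with `--format=json` and that the JSON mirrors the field path used in the `--format` expression above (`deployedModels`, then `privateEndpoints`, then `predictHttpUri`). The function name and the sample values are hypothetical.

```python
import json
from typing import List


def predict_http_uris(describe_json: str) -> List[str]:
    """Collect privateEndpoints.predictHttpUri from each deployed model."""
    endpoint = json.loads(describe_json)
    uris = []
    for deployed in endpoint.get("deployedModels", []):
        uri = deployed.get("privateEndpoints", {}).get("predictHttpUri")
        if uri:
            uris.append(uri)
    return uris


# Hypothetical describe output, for illustration only.
sample_describe = json.dumps({
    "deployedModels": [
        {
            "id": "456",
            "privateEndpoints": {
                "predictHttpUri": "http://123.aiplatform.googleapis.com/v1/models/456:predict"
            },
        }
    ]
})
print(predict_http_uris(sample_describe))
```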
 
Private inference URI format
The inference URI looks different for private endpoints compared to Vertex AI public endpoints:
http://ENDPOINT_ID.aiplatform.googleapis.com/v1/models/DEPLOYED_MODEL_ID:predict
If you choose to undeploy the current model and redeploy with a new one, the domain name is reused but the path includes a different deployed model ID.
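The URI format above, combined with the manual traffic splitting workaround described in the limitations, can be sketched as follows. The helper names, the IDs, and the 90/10 weights are hypothetical, and the weighted random choice is just one simple way to divide requests among endpoints.

```python
import random
from typing import Dict


def private_predict_url(endpoint_id: str, deployed_model_id: str) -> str:
    """Build the private inference URI in the format shown above."""
    return (
        f"http://{endpoint_id}.aiplatform.googleapis.com"
        f"/v1/models/{deployed_model_id}:predict"
    )


def pick_url(weighted_urls: Dict[str, float]) -> str:
    """Pick one inference URL at random, in proportion to its weight."""
    urls = list(weighted_urls)
    weights = [weighted_urls[u] for u in urls]
    return random.choices(urls, weights=weights, k=1)[0]


# Hypothetical IDs: two private endpoints, each serving one deployed model.
url_a = private_predict_url("111", "aaa")
url_b = private_predict_url("222", "bbb")
print(pick_url({url_a: 0.9, url_b: 0.1}))
```

Because the deployed model ID is part of the path, rebuild the URL list whenever you redeploy a model.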
Send an inference to a private endpoint
- Create a Compute Engine instance in your VPC network. Make sure to create the instance in the same VPC network that you have peered with Vertex AI. 
- SSH into your Compute Engine instance, and install your inference client, if applicable. Otherwise, you can use curl. 
- When predicting, use the inference URL obtained from model deployment. In this example, you're sending the request from your inference client in your Compute Engine instance in the same VPC network:
curl -X POST -d@PATH_TO_JSON_FILE http://ENDPOINT_ID.aiplatform.googleapis.com/v1/models/DEPLOYED_MODEL_ID:predict
In this sample request, PATH_TO_JSON_FILE is the path to your inference request, saved as a JSON file. For example, example-request.json.
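If you would rather send the request from Python than from curl, the following is a minimal sketch that uses only the standard library. It assumes the request body is a JSON object with an `instances` list and sets a JSON content type; both depend on your model, so treat them as assumptions. The function name, the IDs, and the payload are hypothetical.

```python
import json
import urllib.request


def build_predict_request(url: str, instances: list) -> urllib.request.Request:
    """Build a POST request carrying the instances as a JSON body."""
    # Assumption: the model accepts a body of the form {"instances": [...]}.
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical IDs and payload, for illustration only.
req = build_predict_request(
    "http://123.aiplatform.googleapis.com/v1/models/456:predict",
    [[1.0, 2.0]],
)
print(req.get_method(), req.full_url)
# To send it, run this from the instance in the peered VPC network:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.read().decode("utf-8"))
```

The send step is left commented out because the URL resolves only from inside the peered VPC network.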
Clean up resources
You can undeploy models and delete private endpoints the same way as for public models and endpoints.
Example: Test private endpoints in Shared VPC
This example uses two Google Cloud projects with a Shared VPC network:
- The host project hosts the Shared VPC network.
- The client project hosts a Compute Engine instance where you run an inference client, such as curl or your own REST client, to send inference requests.
When you create the Compute Engine instance in the client project, it must be within the custom subnet in the host project's Shared VPC network, and in the same region where the model gets deployed.
- Create the peering connections for private services access in the host project. Run gcloud services vpc-peerings connect:
gcloud services vpc-peerings connect \
  --service=servicenetworking.googleapis.com \
  --network=HOST_SHARED_VPC_NAME \
  --ranges=PREDICTION_RESERVED_RANGE_NAME \
  --project=HOST_PROJECT_ID
- Create the endpoint in the client project, using the host project's network name. Run gcloud beta ai endpoints create:
gcloud beta ai endpoints create \
  --display-name=ENDPOINT_DISPLAY_NAME \
  --network=HOST_SHARED_VPC_NAME \
  --region=REGION \
  --project=CLIENT_PROJECT_ID
- Send inference requests, using the inference client within the client project.