This document in the Google Cloud Well-Architected Framework: AI and ML perspective provides principles and recommendations to help you optimize the performance of your AI and ML workloads on Google Cloud. The recommendations in this document align with the performance optimization pillar of the Well-Architected Framework.
AI and ML systems enable advanced automation and decision-making capabilities for your organization. The performance of these systems can directly affect important business drivers like revenue, costs, and customer satisfaction. To realize the full potential of AI and ML systems, you must optimize their performance based on your business goals and technical requirements. The performance optimization process often involves trade-offs. For example, a design choice that provides the required performance might lead to higher costs. The recommendations in this document prioritize performance over other considerations.
To optimize AI and ML performance, you need to make decisions regarding factors like the model architecture, parameters, and training strategy. When you make these decisions, consider the entire lifecycle of the AI and ML systems and their deployment environment. For example, very large LLMs can be highly performant on massive training infrastructure, but might not perform well in capacity-constrained environments like mobile devices.
The recommendations in this document are mapped to the following core principles:
- Establish performance objectives and evaluation methods
- Run and track frequent experiments
- Build and automate training and serving infrastructure
- Match design choices to performance requirements
- Link performance metrics to design and configuration choices
Establish performance objectives and evaluation methods
Your business strategy and goals are the foundation for leveraging AI and ML technologies. Translate your business goals into measurable key performance indicators (KPIs). Examples of KPIs include total revenue, costs, conversion rate, retention or churn rate, customer satisfaction, and employee satisfaction.
Define realistic objectives
According to site reliability engineering (SRE) best practices, the objectives of a service must reflect a performance level that satisfies the requirements of typical customers. This means that service objectives must be realistic in terms of scale and feature performance.
Unrealistic objectives can lead to wasted resources for minimal performance gains. Models that provide the highest performance might not lead to optimal business outcomes. Such models might require more time and cost to train and run.
When you define objectives, distinguish between and prioritize quality and performance objectives:
- Quality refers to inherent characteristics that determine the value of an entity. It helps you assess whether the entity meets your expectations and standards.
- Performance refers to how efficiently and effectively an entity functions or carries out its intended purpose.
ML engineers can improve the performance metrics of a model during the training process. Vertex AI provides an evaluation service that ML engineers can use to implement standardized and repeatable tracking of quality metrics. The prediction efficiency of a model indicates how well a model performs in production or at inference time. To monitor performance, use Cloud Monitoring and Vertex AI Model Monitoring. To select appropriate models and decide how to train them, you must translate business goals into technical requirements that determine quality and performance metrics.
To understand how to set realistic objectives and identify appropriate performance metrics, consider the following example for an AI-powered fraud detection system:
- Business objective: For a fraud detection system, an unrealistic business objective is to detect 100% of fraudulent transactions accurately within one nanosecond at a peak traffic of 100 billion transactions per second. A more realistic objective is to detect fraudulent transactions with 95% accuracy in 100 milliseconds for 90% of online predictions during US working hours at a peak volume of one million transactions per second.
- Performance metrics: Detecting fraud is a classification problem. You can measure the quality of a fraud detection system by using metrics like recall, F1 score, and accuracy. To track system performance or speed, you can measure inference latency. Catching potentially fraudulent transactions (high recall) might be more valuable than overall accuracy. Therefore, a realistic goal might be a high recall with a p90 latency that's less than 100 milliseconds.
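The following sketch shows one way to compute these quality and latency metrics with scikit-learn and NumPy. The labels, predictions, and latency values are synthetic placeholders for illustration.

```python
# A minimal sketch of evaluating the example fraud-detection objectives:
# classification quality (recall, precision, F1) plus a p90 latency check.
# The data is synthetic; the 100 ms threshold mirrors the example target above.
import numpy as np
from sklearn.metrics import recall_score, precision_score, f1_score

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])   # 1 = fraudulent
y_pred = np.array([0, 1, 1, 0, 0, 0, 0, 1, 1, 0])   # model predictions
latencies_ms = np.array([42, 55, 61, 48, 90, 70, 38, 85, 66, 52])

print("recall:", recall_score(y_true, y_pred))        # coverage of fraud cases
print("precision:", precision_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))

# The p90 latency must stay under 100 ms for 90% of online predictions.
p90 = np.percentile(latencies_ms, 90)
print("p90 latency (ms):", p90, "target met:", p90 < 100)
```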
Monitor performance at all stages of the model lifecycle
During experimentation and training and after model deployment, monitor your KPIs and observe any deviations from the business objectives. A comprehensive monitoring strategy helps you make critical decisions about model quality and resource utilization, such as the following:
- Decide when to stop a training job.
- Determine whether a model's performance is degrading in production.
- Improve the cost and time-to-market for new models.
Monitoring during experimentation and training
The objective of the experimentation stage is to find the optimal overall approach, model architecture, and hyperparameters for a specific task. Experimentation helps you iteratively determine the configuration that provides optimal performance and how to train the model. Monitoring helps you efficiently identify potential areas of improvement.
To monitor a model's quality and training efficiency, ML engineers must do the following:
- Visualize model quality and performance metrics for each trial.
- Visualize model graphs and metrics, such as histograms of weights and biases.
- Visually represent training data.
- Profile training algorithms on different hardware.
To monitor experimentation and training, consider the following recommendations:
| Monitoring aspect | Recommendation |
|---|---|
| Model quality | To visualize and track experiment metrics like accuracy and to visualize model architecture or training data, use TensorBoard. TensorBoard is an open-source suite of tools that's compatible with common ML frameworks. |
| Experiment tracking | Vertex AI Experiments integrates with managed enterprise-grade Vertex AI TensorBoard instances to support experiment tracking. This integration enables reliable storage and sharing of logs and metrics. To let multiple teams and individuals track experiments, we recommend that you use the principle of least privilege. |
| Training and experimentation efficiency | Vertex AI exports metrics to Monitoring and collects telemetry data and logs by using an observability agent. You can visualize the metrics in the Google Cloud console. Alternatively, create dashboards or alerts based on these metrics by using Monitoring. For more information, see Monitoring metrics for Vertex AI. |
| NVIDIA GPUs | The Ops Agent enables GPU monitoring for Compute Engine and for other products that Ops Agent supports. You can also use the NVIDIA Data Center GPU Manager (DCGM), which is a suite of tools for managing and monitoring NVIDIA GPUs in cluster environments. Monitoring NVIDIA GPUs is particularly useful for training and serving deep learning models. |
| Deep debugging | To debug problems with the training code or the configuration of a Vertex AI Training job, you can inspect the training container by using an interactive shell session. |
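As an illustration of the experiment-tracking recommendation in the preceding table, the following sketch logs hyperparameters and quality metrics to Vertex AI Experiments by using the Vertex AI SDK for Python. The project ID, region, experiment name, and metric values are placeholder assumptions.

```python
# A minimal sketch of experiment tracking with the Vertex AI SDK for Python.
# Each trial logs its hyperparameters and metrics so that runs can be compared.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",            # assumption: replace with your project ID
    location="us-central1",
    experiment="fraud-detection-exp",  # created on first use
)

aiplatform.start_run("trial-001")
aiplatform.log_params({"learning_rate": 1e-3, "batch_size": 256})
# ... training loop would go here ...
aiplatform.log_metrics({"recall": 0.95, "p90_latency_ms": 87.0})
aiplatform.end_run()
```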
Monitoring during serving: Streaming prediction
After you train a model and export it to Vertex AI Model Registry, you can create a Vertex AI endpoint. The endpoint provides an HTTP interface that serves online predictions from the model.
Model Monitoring helps you identify large changes in the distribution of input or output features. You can also monitor feature attributions in production when compared to a baseline distribution. The baseline distribution can be the training set or it can be based on past distributions of production traffic. A change in the serving distribution might imply a reduction in predictive performance compared to training.
- Choose a monitoring objective: Depending on the sensitivity of a use case to changes in the data that's provided to the model, you can monitor different types of objectives: input feature drift, output drift, and feature attribution. Model Monitoring v2 lets you monitor models that you deploy on a managed serving platform like Vertex AI and also on self-hosted services like Google Kubernetes Engine (GKE). In addition, for granular performance tracking, you can monitor parameters at the model level rather than for an endpoint.
- Monitor generative AI model serving: To ensure stability and minimize latency, particularly for LLM endpoints, set up a robust monitoring stack. Gemini models provide built-in metrics, like time to first token (TTFT), which you can access directly in Metrics Explorer. To monitor throughput, latency, and error rates across all Google Cloud models, use the model observability dashboard.
Monitoring during serving: Batch prediction
To monitor batch prediction, you can run standard evaluation jobs in the Vertex AI evaluation service. Model Monitoring supports monitoring of batch inferences. If you use Batch to run your serving workload, you can monitor resource consumption by using the metrics in Metrics Explorer.
Automate evaluation for reproducibility and standardization
To transition models from prototypes to reliable production systems, you need a standardized evaluation process. This process helps you track progress across iterations, compare different models, detect and mitigate bias, and ensure that you meet regulatory requirements. To ensure reproducibility and scalability, you must automate the evaluation process.
To standardize and automate the evaluation process for ML performance, complete the following steps:
- Define quantitative and qualitative indicators.
- Choose appropriate data sources and techniques.
- Standardize the evaluation pipeline.
These steps are described in the following sections.
1. Define quantitative and qualitative indicators
Computation-based metrics are calculated by using numeric formulas. Remember that training loss metrics might differ from the evaluation metrics that are relevant to business goals. For example, a model that's used for supervised fraud detection might use cross-entropy loss for training. However, to evaluate inference performance, a more relevant metric might be recall, which indicates the coverage of fraudulent transactions. Vertex AI provides an evaluation service for metrics like recall, precision, and area under the precision-recall curve (AuPRC). For more information, see Model evaluation in Vertex AI.
Qualitative indicators, such as the fluency or entertainment value of generated content, can't be objectively computed. To evaluate these indicators, you can use the LLM-as-a-judge strategy or human labeling services like Labelbox.
2. Choose appropriate data sources and techniques
An evaluation is statistically significant when it runs on a certain minimum volume of varied examples. Choose the datasets and techniques that you use for evaluations by using approaches such as the following:
- Golden dataset: Use trusted, consistent, and accurate data samples that reflect the probability distribution of a model in production.
- LLM-as-a-judge: Evaluate the output of a generative model by using an LLM. This approach is relevant only to tasks where an LLM can evaluate a model.
- User feedback: To guide future improvements, capture direct user feedback as a part of production traffic.
Depending on the evaluation technique, size and type of evaluation data, and frequency of evaluation, you can use BigQuery or Cloud Storage as data sources, including for the Vertex AI evaluation service.
- BigQuery lets you use SQL commands to run inference tasks like natural language processing and machine translation. For more information, see Task-specific solutions overview.
- Cloud Storage provides a cost-efficient storage solution for large datasets.
3. Standardize the evaluation pipeline
To automate the evaluation process, consider the following services and tools:
- Vertex AI evaluation service: Provides ready-to-use primitives to track model performance as a part of the ML lifecycle on Vertex AI.
- Gen AI evaluation service: Lets you evaluate any generative model or application and benchmark the evaluation results against your own judgment and evaluation criteria. This service also helps you perform specialized tasks like prompt engineering, retrieval-augmented generation (RAG), and AI agent optimization.
- Vertex AI automatic side-by-side (AutoSxS) tool: Supports pairwise model-based evaluation.
- Kubeflow: Provides special components to run model evaluations.
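The following sketch shows a custom evaluation step in a Kubeflow Pipelines (KFP v2) pipeline. It isn't one of the prebuilt Google Cloud evaluation components; the dataset path, column names, and metric are placeholder assumptions.

```python
# A minimal sketch of a standardized, automated evaluation step as a KFP v2
# component. The test data is expected to contain 'label' and 'prediction'
# columns; gcsfs lets pandas read gs:// paths directly.
from kfp import dsl


@dsl.component(
    base_image="python:3.11",
    packages_to_install=["pandas", "scikit-learn", "gcsfs"],
)
def evaluate_model(test_data_uri: str, metrics: dsl.Output[dsl.Metrics]):
    import pandas as pd
    from sklearn.metrics import recall_score

    df = pd.read_csv(test_data_uri)  # placeholder: labeled predictions table
    recall = recall_score(df["label"], df["prediction"])
    metrics.log_metric("recall", float(recall))  # surfaced in the pipeline UI


@dsl.pipeline(name="standardized-evaluation")
def evaluation_pipeline(test_data_uri: str):
    evaluate_model(test_data_uri=test_data_uri)
```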
Run and track frequent experiments
To effectively optimize ML performance, you need a dedicated, powerful, and interactive platform for experimentation. The platform must have the following capabilities:
- Facilitate iterative development, so that teams can move from idea to validated results with speed, reliability, and scale.
- Let teams efficiently discover optimal configurations that they can use to trigger training jobs.
- Provide controlled access to relevant data, features, and tools to perform and track experiments.
- Support reproducibility and data lineage tracking.
Treat data as a service
Isolate experimental workloads from production systems and set up appropriate security controls for your data assets by using the following techniques:
| Technique | Description | Benefits |
|---|---|---|
| Resource isolation | Isolate the resources for different environments in separate Google Cloud projects. For example, provision the resources for development, staging, and production environments in separate projects like ml-dev, ml-staging, and ml-prod. | Resource isolation helps to prevent experimental workloads from consuming resources that production systems need. For example, if you use a single project for experiments and production, an experiment might consume all of the available NVIDIA A100 GPUs for Vertex AI Training. This might cause interruptions in the retraining of a critical production model. |
| Identity and access control | Apply the principles of zero trust and least privilege and use workload-specific service accounts. Grant access by using predefined Identity and Access Management (IAM) roles like Vertex AI User (roles/aiplatform.user). | This approach helps to prevent accidental or malicious actions that might corrupt experiments. |
| Network security | Isolate network traffic by using Virtual Private Cloud (VPC) networks and enforce security perimeters by using VPC Service Controls. | This approach helps to protect sensitive data and prevents experimental traffic from affecting production services. |
| Data isolation | Store experimental data in separate Cloud Storage buckets and BigQuery datasets. | Data isolation prevents accidental modification of production data. For example, without data isolation, an experiment might inadvertently alter the feature values in a shared BigQuery table, which might lead to a significant degradation in model accuracy in the production environment. |
Equip teams with appropriate tools
To establish a curated set of tools to accelerate the entire experimentation lifecycle—from data exploration to model training and analysis—use the following techniques:
- Interactive prototyping: For rapid data exploration, hypothesis testing, and code prototyping, use Colab Enterprise or managed JupyterLab instances on Vertex AI Workbench. For more information, see Choose a notebook solution.
- Scalable model training: Run training jobs on a managed service that supports distributed training and scalable compute resources like GPUs and TPUs. This approach helps to reduce training time from days to hours and enables more parallel experimentation. For more information, see Use specialized components for training.
- In-database ML: Train models directly in BigQuery ML by using SQL, as shown in the sketch after this list. This technique helps to eliminate data movement and accelerates experimentation for analysts and SQL-centric users.
- Tracking experiments: Create a searchable and comparable history of experiment data by logging parameters, metrics, and artifacts for every experiment run. For more information, see Build a data and model lineage system.
- Optimizing generative AI: To optimize the performance of generative AI applications, you must experiment with prompts, model selection, and fine-tuning. For rapid prompt engineering, use Vertex AI Studio. To experiment with foundation models (like Gemini) and find a model that's appropriate for your use case and business goals, use Model Garden.
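The following sketch shows the in-database ML approach: it trains a BigQuery ML model from Python by running a CREATE MODEL statement. The project, dataset, table, and column names are placeholder assumptions.

```python
# A minimal sketch of in-database training with BigQuery ML from Python.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumption: replace project ID

query = """
CREATE OR REPLACE MODEL `my_dataset.fraud_classifier`
OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['is_fraud']) AS
SELECT amount, merchant_category, is_fraud
FROM `my_dataset.transactions`
WHERE split = 'train'
"""
client.query(query).result()  # blocks until the training job completes
```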
Standardize for reproducibility and efficiency
To ensure that experiments consume resources efficiently and produce consistent and trustworthy results, standardize and automate experiments by using the following approaches:
- Ensure consistent environments by using containers: Package your training code and dependencies as Docker containers. Manage and serve the containers by using Artifact Registry. This approach lets you reproduce issues on different machines by repeating experiments in identical environments. Vertex AI provides prebuilt containers for serverless training.
- Automate ML workflows as pipelines: Orchestrate the end-to-end ML workflow as a codified pipeline by using Vertex AI Pipelines. This approach helps to enforce consistency, ensure reproducibility, and automatically track all of the artifacts and metadata in Vertex ML Metadata.
- Automate provisioning with infrastructure-as-code (IaC): Define and deploy standardized experimentation environments by using IaC tools like Terraform. To ensure that every project adheres to a standardized set of configurations for security, networking, and governance, use Terraform modules.
Build and automate training and serving infrastructure
To train and serve AI models, set up a robust platform that supports efficient and reliable development, deployment, and serving. This platform lets your teams efficiently improve the quality and performance of training and serving in the long run.
Use specialized components for training
A reliable training platform helps to accelerate performance and provides a standardized approach to automate repeatable tasks in the ML lifecycle, from data preparation to model validation.
Data collection and preparation: For effective model training, you need to collect and prepare the data that's necessary for training, testing, and validation. The data might come from different sources and be of different data types. You also need to reuse relevant data across training runs and share features across teams. To improve the repeatability of the data collection and preparation phase, consider the following recommendations:
- Improve data discoverability with Dataplex Universal Catalog.
- Centralize feature engineering in Vertex AI Feature Store.
- Preprocess data by using Dataflow.
Training execution: When you train a model, you use data to create a model object. To do this, you need to set up the necessary infrastructure and dependencies of the training code. You also need to decide how to persist the training models, track training progress, evaluate the model, and present the results. To improve training repeatability, consider the following recommendations:
- Follow the guidelines in Automate evaluation for reproducibility and standardization.
- To submit training jobs on one or more worker nodes without managing the underlying infrastructure provisioning or dependencies, use the CustomJob resource in Vertex AI Training. You can also use the CustomJob resource for hyperparameter tuning jobs. For a basic example, see the sketch after this list.
- Optimize scheduling goodput by leveraging Dynamic Workload Scheduler or reservations for Vertex AI Training.
- Register your models to Vertex AI Model Registry.
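As a basic illustration of the CustomJob and Model Registry recommendations in the preceding list, the following sketch submits a training job and then registers the trained model. The project, bucket, container images, and artifact paths are placeholder assumptions.

```python
# A minimal sketch of submitting a Vertex AI Training CustomJob and registering
# the resulting model in Vertex AI Model Registry.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",  # assumption: replace with your bucket
)

job = aiplatform.CustomJob(
    display_name="fraud-model-training",
    worker_pool_specs=[{
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        # Assumption: a training image that you built and pushed to Artifact Registry.
        "container_spec": {"image_uri": "us-docker.pkg.dev/my-project/ml/trainer:latest"},
    }],
)
job.run()  # provisions workers, runs the container, then tears down the infrastructure

model = aiplatform.Model.upload(
    display_name="fraud-model",
    artifact_uri="gs://my-bucket/models/fraud/",  # where the job wrote the model
    # Example prebuilt serving image; choose one that matches your framework.
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest",
)
```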
Training orchestration: Deploy training workloads as stages of a pipeline by using Vertex AI Pipelines, a managed service that runs pipelines built with Kubeflow Pipelines or TFX. Define each step of the pipeline as a component that runs in a container. Each component works like a function, with input parameters and output artifacts that become inputs to the subsequent components in the pipeline. To optimize the efficiency of your pipeline, consider the recommendations in the following table:
| Goal | Recommendations |
|---|---|
| Implement core automation. | Use Kubeflow Pipelines components. Use Control Flows for advanced pipeline designs, such as conditional gates. Integrate the pipeline compilation as a part of your CI/CD flow. Execute the pipeline with Cloud Scheduler. |
| Increase speed and cost efficiency. | To increase the speed of iterations and reduce cost, use the execution cache of Vertex AI Pipelines. Specify the machine configuration for each component based on the resource requirements of each step. |
| Increase robustness. | To increase robustness against temporary issues without requiring manual intervention, configure retries. To fail fast and iterate efficiently, configure a failure policy. To handle failures, configure email notifications. |
| Implement governance and tracking. | To experiment with different model configurations or training configurations, add pipeline runs to experiments. Monitor logs and metrics for Vertex AI Pipelines. Follow the recommendations in Monitor performance at all stages of the model lifecycle. |
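The following sketch illustrates several of the preceding recommendations: it defines a minimal KFP v2 pipeline with a retry policy and per-step machine configuration, compiles it, and submits it to Vertex AI Pipelines with the execution cache enabled. The component logic, project, and bucket are placeholder assumptions.

```python
# A minimal sketch of pipeline orchestration with KFP v2 and Vertex AI Pipelines.
from kfp import compiler, dsl
from google.cloud import aiplatform


@dsl.component(base_image="python:3.11")
def train_step(epochs: int) -> str:
    # Placeholder training logic.
    return f"trained for {epochs} epochs"


@dsl.pipeline(name="training-pipeline")
def training_pipeline(epochs: int = 10):
    train_task = train_step(epochs=epochs)
    train_task.set_retry(num_retries=2)   # robustness against transient failures
    train_task.set_cpu_limit("8")         # per-step machine configuration
    train_task.set_memory_limit("32G")


compiler.Compiler().compile(
    pipeline_func=training_pipeline, package_path="training_pipeline.json"
)

aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="training-pipeline",
    template_path="training_pipeline.json",
    pipeline_root="gs://my-staging-bucket/pipelines",  # assumption: your bucket
    enable_caching=True,  # reuse results of unchanged steps to cut cost and time
)
job.submit()
```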
Use specialized infrastructure for prediction
To reduce the toil of managing infrastructure and model deployments, automate the repeatable task flows. A service-oriented approach lets you focus on speed and faster time to value (TTV). Consider the following recommendations:
- Implement automatic deployment.
- Take advantage of managed scaling features.
- Optimize latency and throughput on Vertex AI endpoints.
- Optimize resource utilization.
- Optimize model deployment.
- Monitor performance.
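As a sketch of automatic deployment and managed scaling, the following example deploys a registered model to a Vertex AI endpoint with autoscaling and then sends an online prediction request. The model resource name, machine type, and instance format are placeholder assumptions.

```python
# A minimal sketch of deploying a registered model to a Vertex AI endpoint with
# managed autoscaling, then requesting an online prediction.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Assumption: a model that already exists in Model Registry.
model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

endpoint = model.deploy(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,   # keep one replica warm to control latency
    max_replica_count=4,   # let Vertex AI scale out under load
)

# The instance format depends on your model; this payload is illustrative only.
prediction = endpoint.predict(instances=[{"amount": 120.5, "merchant_category": "retail"}])
print(prediction.predictions)
```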
Match design choices to performance requirements
When you make design choices to improve performance, assess whether the choices support your business requirements or are wasteful and counterproductive. To choose appropriate infrastructure, models, and configurations, identify performance bottlenecks and assess how they're linked to performance metrics. For example, even on very powerful GPU accelerators, training tasks can experience performance bottlenecks. These bottlenecks can be caused by data I/O issues in the storage layer or by performance limitations of the model.
Focus on holistic performance of the ML flow
As training requirements grow in terms of model size and cluster size, the failure rate and infrastructure costs might increase. Therefore, the cost of failure might increase quadratically. You can't rely solely on conventional resource-efficiency metrics like model FLOPs utilization (MFU). To understand why MFU might not be a sufficient indicator of overall training performance, examine the lifecycle of a typical training job. The lifecycle consists of the following cyclical flow:
- Cluster creation: Worker nodes are provisioned.
- Initialization: Training is initialized on the worker nodes.
- Training execution: Resources are used for forward and backward propagation.
- Interruption: The training process is interrupted during model checkpointing or due to worker-node preemptions.
After each interruption, the preceding flow is repeated.
The training execution step constitutes a fraction of the lifecycle of an ML job. Therefore, the utilization of worker nodes for the training execution step doesn't indicate the overall efficiency of the job. For example, even if the training execution step runs at 100% efficiency, the overall efficiency might be low if interruptions occur frequently or if it takes a long time to resume training after interruptions.
Adopt and track goodput metrics
To ensure holistic performance measurement and optimization, shift your focus from conventional resource-efficiency metrics like MFU to goodput. Goodput considers the availability and utilization of your clusters and compute resources and it helps to measure resource efficiency across multiple layers.
The focus of goodput metrics is the overall progress of a job rather than whether the job appears to be busy. Goodput metrics help you optimize training jobs for tangible overall gains in productivity and performance.
Goodput gives you a granular understanding of potential losses in efficiency through the following metrics:
- Scheduling goodput is the fraction of time when all of the resources that are required for training or serving are available for use.
- Runtime goodput represents the proportion of useful training steps that are completed during a given period.
- Program goodput is the peak hardware performance or MFU that a training job can extract from the accelerator. It depends on efficient utilization of the underlying compute resources during training.
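The following worked example, with made-up numbers, shows how the three goodput layers compound. Multiplying them is one convenient way to see where efficiency is lost; treat it as an illustration rather than an official formula.

```python
# A worked example (with illustrative numbers) of how the goodput layers compound.
scheduling_goodput = 0.92   # resources were available 92% of the time
runtime_goodput = 0.85      # 85% of the time went to useful training steps
program_goodput = 0.45      # MFU-style utilization of the accelerators

overall = scheduling_goodput * runtime_goodput * program_goodput
print(f"overall goodput ~= {overall:.2f}")   # ~= 0.35

# Even with high MFU, frequent interruptions (low runtime goodput) or waiting
# for capacity (low scheduling goodput) can dominate overall efficiency.
```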
Optimize scheduling goodput
To optimize scheduling goodput for a workload, you must identify the specific infrastructure requirements of the workload. For example, batch inference, streaming inference, and training have different requirements:
- Batch inference workloads might accommodate some interruptions and delays in resource availability.
- Streaming inference workloads require stateless infrastructure.
- Training workloads require longer term infrastructure commitments.
Choose appropriate obtainability modes
In cloud computing, obtainability is the ability to provision resources when they're required. Google Cloud provides the following obtainability modes:
- On-demand VMs: You provision Compute Engine VMs when they're needed and run your workloads on the VMs. The provisioning request is subject to the availability of resources, such as GPUs. If a sufficient quantity of a requested resource type isn't available, the request fails.
- Spot VMs: You create VMs by using unused compute capacity. Spot VMs are billed at a discounted price when compared to on-demand VMs, but Google Cloud might preempt Spot VMs at any time. We recommend that you use Spot VMs for stateless workloads that can fail gracefully when the host VMs are preempted.
- Reservations: You reserve capacity as a pool of VMs. Reservations are ideal for workloads that need capacity assurance. Use reservations to maximize scheduling goodput by ensuring that resources are available when required.
- Dynamic Workload Scheduler: This provisioning mechanism queues requests for GPU-powered VMs in a dedicated pool. Dynamic Workload Scheduler helps you avoid the constraints of the other obtainability modes:
  - Stock-out situations in the on-demand mode.
  - Statelessness constraint and preemption risk of Spot VMs.
  - Cost and availability implications of reservations.
The following table summarizes the obtainability modes for Google Cloud services and provides links to relevant documentation:
| Product | On-demand VMs | Spot VMs | Reservations | Dynamic Workload Scheduler |
|---|---|---|---|---|
| Compute Engine | Create and start a Compute Engine instance | About Spot VMs | About reservations | Create a MIG with GPU VMs |
| GKE | Add and manage node pools | About Spot VMs in GKE | Consuming reserved zonal resources | GPU, TPU, and H4D consumption with flex-start provisioning |
| Batch | Create and run a job | Batch job with Spot VMs | Ensure resource availability using VM reservations | Use GPUs and flex-start VMs |
| Vertex AI Training | Create a serverless training job | Use Spot VMs for training jobs | Use reservations for training jobs | Schedule training jobs based on resource availability |
| Vertex AI | Get batch inferences and online inferences from custom-trained models. | Use Spot VMs for inference | Use reservations for online inference | Use flex-start VMs for inference |
Plan for maintenance events
You can improve scheduling goodput by anticipating and planning for infrastructure maintenance and upgrades.
GKE lets you control when automatic cluster maintenance can be performed on your clusters. For more information, see Maintenance windows and exclusions.
Compute Engine provides the following capabilities:
- To keep an instance running during a host event, such as planned maintenance for the underlying hardware, Compute Engine performs a live migration of the instance to another host in the same zone. For more information, see Live migration process during maintenance events.
- To control how an instance responds when the underlying host requires maintenance or has an error, you can set a host maintenance policy for the instance.
For information about planning for host events that are related to large training clusters in AI Hypercomputer, see Manage host events across compute instances.
Optimize runtime goodput
The model training process is frequently interrupted by events like model checkpointing and resource preemption. To optimize runtime goodput, you must ensure that the system resumes training and inference efficiently after the required infrastructure is ready and after any interruption.
During model training, AI researchers use checkpointing to track progress and to minimize the training progress that's lost due to resource preemptions. Larger model sizes make checkpointing interruptions longer, which further affects overall efficiency. After interruptions, the training application must be restarted on every node in the cluster. These restarts can take some time because the necessary artifacts must be reloaded.
To optimize runtime goodput, use the following techniques:
| Technique | Description |
|---|---|
| Implement automatic checkpointing. | Frequent checkpointing lets you track the progress of training at a granular level. However, the training process is interrupted for each checkpoint, which reduces runtime goodput. To minimize interruptions, you can set up automatic checkpointing, where the host's SIGTERM signal triggers the creation of a checkpoint (see the sketch after this table). This approach limits checkpointing interruptions to when the host needs maintenance. Remember that some hardware failures might not trigger SIGTERM signals; therefore, you must find a suitable balance between periodic checkpoints and SIGTERM-triggered checkpoints. |
| Use appropriate container-loading strategies. | In a GKE cluster, before nodes can resume training jobs, it might take some time to complete loading the required artifacts like data or model checkpoints. Use container-loading and data-loading strategies that reduce this reload time. For more information, see Tips and tricks to reduce cold-start latency on GKE. |
| Use the compilation cache. | If training requires a compilation-based stack, check whether you can use a compilation cache. When you use a compilation cache, the computation graph isn't recompiled after each training interruption. The resulting reductions in time and cost are particularly beneficial when you use TPUs. JAX lets you store the compilation cache in a Cloud Storage bucket and then use the cached data in the case of interruptions. |
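The following sketch shows the SIGTERM-triggered checkpointing pattern from the preceding table in a PyTorch-style training loop. The checkpoint path, model, and data loader are placeholder assumptions.

```python
# A minimal sketch of SIGTERM-triggered checkpointing in a PyTorch-style loop.
import signal
import torch

CHECKPOINT_PATH = "/mnt/checkpoints/latest.pt"   # e.g., a mounted Cloud Storage path
stop_requested = False


def handle_sigterm(signum, frame):
    # The host signals imminent maintenance or preemption; request a checkpoint.
    global stop_requested
    stop_requested = True


signal.signal(signal.SIGTERM, handle_sigterm)


def train(model, optimizer, data_loader, start_step=0):
    for step, batch in enumerate(data_loader, start=start_step):
        # ... forward pass, loss computation, backward pass, optimizer step ...
        if stop_requested:
            torch.save(
                {"step": step, "model": model.state_dict(), "optim": optimizer.state_dict()},
                CHECKPOINT_PATH,
            )
            break  # exit cleanly so that the job can resume from the checkpoint later
```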
Optimize program goodput
Program goodput represents peak resource utilization during training, which is the conventional way to measure training and serving efficiency. To improve program goodput, you need an optimized distribution strategy, efficient compute-communication overlap, optimized memory access, and efficient pipelines.
To optimize program goodput, use the following strategies:
| Strategy | Description |
|---|---|
| Use framework-level customization options. | Frameworks or compilers like Accelerated Linear Algebra (XLA) provide many key components of program goodput. To further optimize performance, you can customize fundamental components of the computation graph. For example, Pallas supports custom kernels for TPUs and GPUs. |
| Offload memory to the host DRAM. | For large-scale training, which requires significantly high memory from accelerators, you can offload some memory usage to the host DRAM. For example, XLA lets you offload model activations from the forward pass to the host memory instead of using the accelerator's memory. With this strategy, you can improve training performance by increasing the model capacity or the batch size. |
| Leverage quantization during training. | You can improve training efficiency and program goodput by leveraging model quantization during training. This strategy reduces the precision of the gradients or weights during certain steps of the training, which improves program goodput. However, this strategy might require additional engineering effort during model development. |
| Implement parallelism. | To increase the utilization of the available compute resources, you can use parallelism strategies at the model level during training and when loading data. Model parallelism splits a model across multiple accelerators, and data parallelism processes different shards of the training data on replicated copies of the model (see the data-parallelism sketch after this table). |
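As one concrete example of data parallelism, the following sketch wraps a placeholder PyTorch model in DistributedDataParallel. It assumes one process per GPU, launched with torchrun, and isn't specific to any single framework recommended above.

```python
# A minimal sketch of data parallelism with PyTorch DistributedDataParallel.
# Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")          # torchrun sets rank and world size
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 2).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])       # replicate and sync gradients

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for _ in range(100):                              # placeholder training loop
        x = torch.randn(256, 128, device=local_rank)
        y = torch.randint(0, 2, (256,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                               # gradients all-reduced across GPUs
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```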
Focus on workload-specific requirements
To ensure that your performance optimization efforts are effective and holistic, you must match optimization decisions to the specific requirements of your training and inference workloads. Choose appropriate AI models and use relevant prompt optimization strategies. Select appropriate frameworks and tools based on the requirements of your workloads.
Identify workload-specific requirements
Evaluate the requirements and constraints of your workloads across the following areas:
| Area | Description |
|---|---|
| Task and quality requirements | Define the core task of the workload and establish the performance baseline that the workload must meet. |
| Serving context | Analyze the operational environment where you plan to deploy the model, including any capacity and latency constraints. The serving context often has a significant impact on design decisions. |
| Team skills and economics | Assess the business value of buying the solution against the cost and complexity of building and maintaining it. Determine whether your team has the specialized skills that are required for custom model development or whether a managed service might provide faster time to value. |
Choose an appropriate model
If an API or an open model can deliver the required performance and quality, use that API or model.
For modality-specific tasks like optical character recognition (OCR), labeling, and content moderation, choose purpose-built Google Cloud ML APIs.
For generative AI applications, consider Google models like Gemini, Imagen, and Veo.
Explore Model Garden and choose from a curated collection of foundation and task-specific Google models. Model Garden also provides open models like Gemma and third-party models, which you can run in Vertex AI or deploy on runtimes like GKE.
If a task can be completed by using either an ML API or a generative AI model, consider the complexity of the task. For complex tasks, large models like Gemini might provide higher performance than smaller models.
Improve quality through better prompting
To improve the quality of your prompts at scale, use the Vertex AI prompt optimizer. You don't need to manually rewrite system instructions and prompts. The prompt optimizer supports the following approaches:
- Zero-shot optimization: A low-latency approach that improves a single prompt or system instruction in real time.
- Data-driven optimization: An advanced approach that improves prompts by evaluating a model's responses to sample prompts against specific evaluation metrics.
For more prompt-optimization guidelines, see Overview of prompting strategies.
Improve performance for ML and generative AI endpoints
To improve the latency or throughput (tokens per second) for ML and generative AI endpoints, consider the following recommendations:
- Cache results by using Memorystore, as shown in the caching sketch after this list.
- For generative AI APIs, use the following techniques:
- Context caching: Achieve lower latency for requests that contain repeated content.
- Provisioned Throughput: Improve the throughput for Gemini endpoints by reserving throughput capacity.
- For self-hosted models, consider the following optimized inference frameworks:

| Framework | Notebooks and guides |
|---|---|
| Optimized vLLM containers in Model Garden | Deploy open models with pre-built containers |
| GPU-based inference on GKE | |
| TPU-based inference | |
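The following sketch illustrates the Memorystore caching recommendation from the preceding list by caching endpoint responses in Redis. The Redis host, key scheme, TTL, and the endpoint object (a Vertex AI endpoint like the one deployed earlier) are placeholder assumptions.

```python
# A minimal sketch of caching inference results in Memorystore for Redis.
import hashlib
import json

import redis

cache = redis.Redis(host="10.0.0.3", port=6379)  # assumption: Memorystore instance IP


def cached_predict(endpoint, instance: dict, ttl_seconds: int = 300):
    # Key repeated requests by a hash of the (sorted) request payload.
    key = "pred:" + hashlib.sha256(json.dumps(instance, sort_keys=True).encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)                    # serve repeated requests from the cache

    response = endpoint.predict(instances=[instance])
    prediction = response.predictions[0]
    cache.setex(key, ttl_seconds, json.dumps(prediction))
    return prediction
```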
Use low-code solutions and tuning
If pre-trained models don't meet your requirements, you can improve their performance for specific domains by using the following solutions:
- AutoML is a low-code solution to improve inference results with minimal technical effort for a wide range of tasks. AutoML lets you create models that are optimized on several dimensions: architecture, performance, and training stage (through checkpointing).
- Tuning helps you achieve higher quality, more stable generation and lower latency with shorter prompts and without a lot of data. We recommend that you start tuning by using the default values for hyperparameters. For more information, see Supervised Fine Tuning for Gemini: A best practices guide.
Optimize self-managed training
In some cases, you might decide to retrain a model or fully manage a fine-tuning job. This approach requires advanced skills and additional time depending on the model, framework, and resources that you use.
Take advantage of performance-optimized framework options, such as the following:
- Use deep learning images or containers, which include the latest software dependencies and Google Cloud-specific libraries.
- Run model training with Ray on Google Cloud:
  - Ray on Vertex AI lets you bring Ray's distributed training framework to Compute Engine or GKE and it simplifies the overhead of managing the framework.
  - You can self-manage Ray on GKE with KubeRay by deploying the Ray operator on an existing cluster.
- Deploy training workloads on a compute cluster that you provision by using the open-source Cluster Toolkit. To efficiently provision performance-optimized clusters, use YAML-based blueprints. Manage the clusters by using schedulers like Slurm and GKE.
- Train standard model architectures by using GPU-optimized recipes.
Build training architectures and strategies that optimize performance by using the following techniques:
- Implement distributed training on Vertex AI or on the frameworks described earlier. Distributed training enables model parallelism and data parallelism, which can help to increase the training dataset size and model size, and help to reduce training time.
- For efficient model training and to explore different performance configurations, run checkpointing at appropriate intervals. For more information, see Optimize runtime goodput.
Optimize self-managed serving
For self-managed serving, you need efficient inference operations and a high throughput (number of inferences per unit of time).
To optimize your model for inference, consider the following approaches:
- Quantization: Reduce the model size by representing its parameters in a lower precision format. This approach helps to reduce memory consumption and latency. However, quantization after training might change model quality. For example, quantization after training might cause a reduction in accuracy. For a basic example, see the sketch after this list.
  - Post-training quantization (PTQ) is a repeatable task. Major ML frameworks like PyTorch and TensorFlow support PTQ.
  - You can orchestrate PTQ by using a pipeline on Vertex AI Pipelines.
  - To stabilize model performance and benefit from reductions in model size, you can use Qwix.
- Tensor parallelism: Improve inference throughput by distributing the computational load across multiple GPUs.
- Memory optimization: Increase throughput by optimizing attention caching, batch sizes, and input sizes.
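The following sketch shows post-training dynamic quantization in PyTorch on a placeholder model. It's a minimal example of the PTQ approach described above, not a production recipe; always re-evaluate model quality after quantization.

```python
# A minimal sketch of post-training dynamic quantization (PTQ) in PyTorch.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},      # quantize the weights of linear layers
    dtype=torch.qint8,      # 8-bit integer weights reduce size and memory use
)

with torch.no_grad():
    output = quantized_model(torch.randn(1, 512))
print(output.shape)
```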
Use inference-optimized frameworks, such as the following:
- For generative models, use an open framework like MaxText, MaxDiffusion, or vLLM.
- Run prebuilt container images on Vertex AI for predictions and explanations. If you choose TensorFlow, then use the optimized TensorFlow runtime. This runtime enables more efficient and lower cost inference than prebuilt containers that use open-source TensorFlow.
- Run multi-host inferencing with large models on GKE by using the LeaderWorkerSet (LWS) API.
- Leverage the NVIDIA Triton inference server for Vertex AI.
- Simplify the deployment of inference workloads on GKE by using LLM-optimized configurations. For more information, see Analyze model serving performance and costs with GKE Inference Quickstart.
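As a sketch of serving a self-hosted open model with an inference-optimized framework, the following example runs offline batch generation with vLLM. The model ID is a placeholder and assumes that you have GPU capacity and access to the model weights; online serving (for example, on GKE) would typically use vLLM's OpenAI-compatible server instead.

```python
# A minimal sketch of offline batch generation with vLLM for a self-hosted open model.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-2b-it")              # placeholder model ID
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(["Summarize the benefits of model quantization."], params)
for output in outputs:
    print(output.outputs[0].text)
```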
Optimize resource consumption based on performance goals
Resource optimization helps to accelerate training, iterate efficiently, improve model quality, and increase the serving capacity.
Choose appropriate processor types
Your choice of compute platform can have a significant impact on the training efficiency of a model.
- Deep learning models perform well on GPUs and TPUs because such models require large amounts of memory and parallel matrix computation. For more information about workloads that are well suited to CPUs, GPUs, and TPUs, see When to use TPUs.
- Compute-optimized VMs are ideal for HPC workloads.
Optimize training and serving on GPUs
To optimize the performance of training and inference workloads that are deployed on GPUs, consider the following recommendations:
| Recommendation | Description |
|---|---|
| Select appropriate memory specifications. | When you choose GPU machine types, select memory specifications that match the memory requirements of your models and workloads. |
| Assess core and memory bandwidth requirements. | In addition to memory size, consider other requirements like the number of Tensor cores and memory bandwidth. These factors influence the speed of data access and computations on the chip. |
| Choose appropriate GPU machine types. | Training and serving might need different GPU machine types. We recommend that you use large machine types for training and smaller, cost-effective machine types for inference. To detect resource utilization issues, use monitoring tools like the NVIDIA DCGM agent and adjust resources appropriately. |
| Leverage GPU sharing on GKE. | Dedicating a full GPU to a single container might be an inefficient approach in some cases. To help you overcome this inefficiency, GKE supports GPU-sharing strategies such as GPU time-sharing and multi-instance GPUs. To maximize resource utilization, we recommend that you use an appropriate combination of these strategies. For example, when you virtualize a large H100 GPU by using the GPU time-sharing and multi-instance GPU strategies, the serving platform can scale up and down based on traffic. GPU resources are repurposed in real time based on the load on the model containers. |
| Optimize routing and load balancing. | When you deploy multiple models on a cluster, you can use GKE Inference Gateway for optimized routing and load balancing. Inference Gateway extends the routing mechanisms of the Kubernetes Gateway API with capabilities that are optimized for serving inference workloads. |
| Share resources for Vertex AI endpoints. | You can configure multiple Vertex AI endpoints to use a common pool of resources. For more information about this feature and its limitations, see Share resources across deployments. |
Optimize training and serving on TPUs
TPUs are Google chips that help solve massive scale challenges for ML algorithms. These chips provide optimal performance for AI training and inference workloads. When compared to GPUs, TPUs provide higher efficiency for deep learning training and serving. For information about the use cases that are suitable for TPUs, see When to use TPUs. TPUs are compatible with ML frameworks like TensorFlow, PyTorch, and JAX.
To optimize TPU performance, use the following techniques, which are described in the Cloud TPU performance guide:
- Maximize the batch size for each TPU memory unit.
- Ensure that TPUs aren't idle. For example, implement parallel data reads.
- Optimize for the XLA compiler. Adjust tensor dimensions as required and avoid unnecessary padding. XLA automatically optimizes graph execution performance by using techniques like fusion and broadcasting.
Optimize training on TPUs and serving on GPUs
TPUs support efficient training. GPUs provide versatility and wider availability for inference workloads. To combine the strengths of TPUs and GPUs, you can train models on TPUs and serve them on GPUs. This approach can help to reduce overall costs and accelerate development, particularly for large models. For information about the locations where TPU and GPU machine types are available, see TPU regions and zones and GPU locations.
Optimize the storage layer
The storage layer of your training and serving infrastructure is critical to performance. Training jobs and inferencing workloads involve the following storage-related activities:
- Loading and processing data.
- Checkpointing the model during training.
- Reloading binaries to resume training after node preemptions.
- Loading the model efficiently to handle inferencing at scale.
The following factors determine your requirements for storage capacity, bandwidth, and latency:
- Model size
- Volume of the training dataset
- Checkpointing frequency
- Scaling patterns
If your training data is in Cloud Storage, you can reduce the data loading latency by using file caching in Cloud Storage FUSE. Cloud Storage FUSE lets you mount a Cloud Storage bucket on compute nodes that have Local SSD disks. For information about improving the performance of Cloud Storage FUSE, see Performance tuning best practices.
A PyTorch connector to Cloud Storage provides high performance for data reads and writes. This connector is particularly beneficial for training with large datasets and for checkpointing large models.
Compute Engine supports various Persistent Disk types. With Google Cloud Hyperdisk ML, you can provision the required throughput and IOPS based on training needs. To optimize disk performance, start by resizing the disks and then consider changing the machine type. For more information, see Optimize Persistent Disk performance. To load test the read-write performance and latency at the storage layer, you can use tools like Flexible I/O tester (FIO).
For more information about choosing and optimizing storage services for your AI and ML workloads, see the following documentation:
Optimize the network layer
To optimize the performance of AI and ML workloads, configure your VPC networks to provide adequate bandwidth and maximum throughput with minimum latency. Consider the following recommendations:
- Optimize VPC networks.
- Place VMs closer to each other.
- Configure VMs to support higher network speeds.
Link performance metrics to design and configuration choices
To innovate, troubleshoot, and investigate performance issues, you must establish a clear link between design choices and performance outcomes. You need a reliable record of the lineage of ML assets, deployments, model outputs, and the corresponding configurations and inputs that produced the outputs.
Build a data and model lineage system
To reliably improve performance, you need the ability to trace every model version back to the exact data, code, and configurations that were used to produce the model. As you scale a model, such tracing becomes difficult. You need a lineage system that automates the tracing process and creates a record that's clear and can be queried for every experiment. This system lets your teams efficiently identify and reproduce the choices that lead to the optimally performing models.
To view and analyze the lineage of pipeline artifacts for workloads in Vertex AI, you can use Vertex ML Metadata or Dataplex Universal Catalog. Both options let you register events or artifacts to meet governance requirements and to query the metadata and retrieve information when needed. This section provides an overview of the two options. For detailed information about the differences between Vertex ML Metadata and Dataplex Universal Catalog, see Track the lineage of pipeline artifacts.
Default implementation: Vertex ML Metadata
Your first pipeline run or experiment in Vertex AI creates a default Vertex ML Metadata service. The parameters and artifact metadata that the pipeline consumes and generates are automatically registered to a Vertex ML Metadata store. The data model that's used to organize and connect the stored metadata contains the following elements:
- Context: A group of artifacts and executions that represents an experimentation run.
- Execution: A step in a workflow, like data validation or model training.
- Artifact: An input or output entity, object, or piece of data that a workflow produces and consumes.
- Event: A relationship between an artifact and an execution.
By default, Vertex ML Metadata captures and tracks all input and output artifacts of a pipeline run. It integrates these artifacts with Vertex AI Experiments, Model Registry, and Vertex AI managed datasets.
Autologging is a built-in feature in Vertex AI Training to automatically log data to Vertex AI Experiments. To efficiently track experiments for optimizing performance, use the built-in integrations between Vertex AI Experiments and the associated Vertex ML Metadata service.
Vertex ML Metadata provides a filtering syntax and operators to run queries about artifacts, executions, and contexts. When required, your teams can efficiently retrieve information about a model's registry link and its dataset or evaluation for a specific experiment run. This metadata can help to accelerate the discovery of choices that optimize performance. For example, you can compare pipeline runs, compare models, and compare experiment runs. For more information, including example queries, see Analyze Vertex ML Metadata.
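The following sketch shows one way to query tracked runs and surface the best-performing configuration, assuming that runs were logged to Vertex AI Experiments as shown earlier. The experiment name and the param and metric column names are assumptions based on how the SDK prefixes logged values; verify them against your own runs.

```python
# A minimal sketch of comparing tracked experiment runs with the Vertex AI SDK.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Returns one row per run, with logged parameters and metrics as columns
# (prefixed with "param." and "metric." by the SDK).
runs_df = aiplatform.get_experiment_df("fraud-detection-exp")
best = runs_df.sort_values("metric.recall", ascending=False).head(5)
print(best[["run_name", "param.learning_rate", "metric.recall"]])
```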
Alternative implementation: Dataplex Universal Catalog
Dataplex Universal Catalog discovers metadata from Google Cloud resources, including Vertex AI artifacts. You can also integrate a custom data source.
Dataplex Universal Catalog can read metadata across multiple regions and organization-wide stores, whereas Vertex ML Metadata is a project-specific resource. When compared to Vertex ML Metadata, Dataplex Universal Catalog involves more setup effort. However, Dataplex Universal Catalog might be appropriate when you need integration with your wider data portfolio in Google Cloud and with organization-wide stores.
Dataplex Universal Catalog discovers and harvests metadata for projects where the Data Lineage API is enabled. The metadata in the catalog is organized by using a data model that consists of projects, entry groups, entries, and aspects. Dataplex Universal Catalog provides a specific syntax that you can use to discover artifacts. If required, you can map Vertex ML Metadata artifacts to Dataplex Universal Catalog.
Use explainability tools
The behavior of an AI model is based on data that was used to train the model. This behavior is encoded as parameters in mathematical functions. Understanding exactly why a model performs in a certain way can be difficult. However, this knowledge is critical for performance optimization.
For example, consider an image classification model where the training data contains images of only red cars. The model might learn to identify the "car" label based on the color of the object rather than the object's spatial and shape attributes. When the model is tested with images that show cars of different colors, the performance of the model might degrade. The following sections describe tools that you can use to identify and diagnose such problems.
Detect data biases
In the exploratory data analysis (EDA) phase of an ML project, you identify issues with the data, such as class-imbalanced datasets and biases.
In production systems, you often retrain models and run experiments with different datasets. To standardize data and compare across experiments, we recommend a systematic approach to EDA that includes the following characteristics:
- Automation: As a training set grows in size, the EDA process must run automatically in the background.
- Wide coverage: When you add new features, the EDA must reveal insights about the new features.
Many EDA tasks are specific to the data type and the business context. To automate the EDA process, use BigQuery or a managed data processing service like Dataflow. For more information, see Classification on imbalanced data and Data bias metrics for Vertex AI.
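The following sketch shows one automated EDA check: measuring class imbalance in a BigQuery training table from Python. The project, dataset, table, column names, and the 5% threshold are placeholder assumptions; in practice, you would schedule checks like this to run as the dataset grows.

```python
# A minimal sketch of an automated class-imbalance check in BigQuery.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
SELECT is_fraud AS label, COUNT(*) AS row_count
FROM `my_dataset.transactions`
GROUP BY label
"""
rows = list(client.query(query).result())
total = sum(row.row_count for row in rows)
for row in rows:
    share = row.row_count / total
    print(f"label={row.label} share={share:.2%}")
    if share < 0.05:  # illustrative threshold
        print("  -> potential class imbalance; consider resampling or reweighting")
```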
Understand model characteristics and behavior
In addition to understanding the distribution of data in the training and validation sets and their biases, you need to understand a model's characteristics and behavior at prediction time. To understand model behavior, use the following tools:
| Tool | Description |
|---|---|
| Example-based explanations | You can use example-based explanations in Vertex Explainable AI to understand a prediction by finding the most similar examples from the training data. This approach is based on the principle that similar inputs yield similar outputs. |
| Feature-based explanations | For predictions that are based on tabular data or images, feature-based explanations show how much each feature affects a prediction when it's compared to a baseline. Vertex AI provides different feature attribution methods depending on the model type and task. The methods typically rely on sampling and sensitivity analysis to measure how much the output changes in response to changes in an input feature. |
| What-If Tool | The What-If Tool was developed by Google's People + AI Research (PAIR) initiative to help you understand and visualize the behavior of image and tabular models. For examples of using the tool, see What-If Tool Web Demos. |
Contributors
Authors:
- Benjamin Sadik | AI and ML Specialist Customer Engineer
- Filipe Gracio, PhD | Customer Engineer, AI/ML Specialist
Other contributors:
- Daniel Lees | Cloud Security Architect
- Kumar Dhanagopal | Cross-Product Solution Developer