ML Diagnostics platform

Google Cloud ML Diagnostics is an end-to-end managed platform for optimizing and diagnosing AI and ML workloads on Google Cloud. Use ML Diagnostics to collect and visualize all workload metrics, configurations, and profiles within a single platform. ML Diagnostics is applicable to both training and inference workloads, and is compatible with all orchestrators on Cloud TPU, including Google Kubernetes Engine (GKE) and custom orchestrators. ML Diagnostics includes the following features:

  • Machine learning runs: Create and register your machine learning runs through the Google Cloud CLI, or integrate the ML Diagnostics SDK with your workload. You can deploy managed XProf instances with your machine learning runs, and collect and manage workload metrics, configurations, and profile sessions.
  • gcloud CLI experience: Use the ML Diagnostics APIs through the gcloud CLI to register and manage runs, deploy managed XProf resources, visualize profile sessions in storage buckets, and trigger profile captures.
  • Python SDK: Integrate the open-source ML Diagnostics SDK into your ML workloads for a complete diagnostics experience. Collect and manage workload metrics, configurations, and profiles on Google Cloud.
  • Managed profiling: ML Diagnostics deploys a managed instance of XProf with a scalable backend into associated accounts, which enables fast loading of large profiles. It supports multiple users accessing profiles simultaneously, and includes built-in features like multi-host profiling and on-demand profiling.
  • Workload metrics: Track workload metrics, including model quality, model performance, and system metrics.
  • Workload configuration management: Track workload configurations, including software configurations, system configurations, and user-defined configurations.
  • Visualizations in Cluster Director and GKE: Visualize metrics, configurations, and profiles in Cluster Director and Google Kubernetes Engine in the Google Cloud console.
  • Link sharing: Collaborate with shareable links for profiles and machine learning run information.

User paths

You can use the ML Diagnostics platform through the SDK or the CLI. With the CLI, you use the ML Diagnostics gcloud CLI commands to create a machine learning run and deploy managed XProf resources. With the SDK, you integrate the ML Diagnostics SDK into your ML workload to collect and manage workload metrics and configurations, and to deploy managed XProf resources.

To get started, use one of the following guides:

Managed profiling with XProf

You can get a managed profiling experience with XProf when you use either the CLI or SDK. XProf is an open-source profiling and performance analysis tool for machine learning workloads and part of the OpenXLA ecosystem.

The benefits of a managed profiling experience compared to a self-hosted profiling experience include:

  • No required setup of XProf or other dependencies.
  • Better security and protection from vulnerabilities.
  • Shareable links for collaboration.
  • Faster loading of large profiles.
  • Support for multiple users simultaneously accessing profiles with automatic scaling of resources based on link access load.
  • Built-in features like multi-host profiling and on-demand profiling.
  • Loading of multiple profile sessions across multiple runs with the same managed XProf instance.
  • No charge for the managed XProf resources deployed by the ML Diagnostics platform, which makes managed XProf more cost effective than self-hosting XProf.

Prerequisites

Before using ML Diagnostics, enable the Cluster Director API and add the required IAM permissions. If you are using GKE, you also need to configure your GKE cluster and label the GKE workload. For more information, see Set up GKE.

Enable Cluster Director API

You don't need to deploy or manage your clusters with Cluster Director to use ML Diagnostics. ML Diagnostics works with clusters managed by GKE, Cluster Director, or custom orchestrators. It is part of the Cluster Director family of APIs, which is why the API must be enabled, but it doesn't depend on the Cluster Director product itself.

For more information about enabling the Cluster Director API, see Enabling an API in your Google Cloud project.
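
As a minimal sketch, the following Python helper builds the argument list for `gcloud services enable`, the general-purpose command for enabling an API in a project. The function name `enable_api_command` and the example values are hypothetical. This page doesn't state the service name of the Cluster Director API, so the sketch takes it as a parameter that you look up yourself.

```python
import shlex


def enable_api_command(project_id: str, service_name: str) -> list[str]:
    """Build the argv for `gcloud services enable` (hypothetical helper).

    service_name is the API's service name, for example
    "example.googleapis.com". The Cluster Director service name is
    NOT given here; look it up in the API library before running.
    """
    if not service_name.endswith(".googleapis.com"):
        raise ValueError("expected a service name ending in .googleapis.com")
    return ["gcloud", "services", "enable", service_name, f"--project={project_id}"]


if __name__ == "__main__":
    # Placeholder values; this prints the command and does not run it.
    print(shlex.join(enable_api_command("my-project", "example.googleapis.com")))
```

The helper only constructs the command, so you can review it before passing it to `subprocess.run` or pasting it into a shell.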

IAM permissions

The Google Cloud service account used by your workload must have the following IAM roles on the project.

If using the ML Diagnostics SDK:

  • roles/clusterdirector.editor: For full access to create and manage MLRun resources and view the user interface.
  • roles/logging.logWriter: To write logs and metrics to Cloud Logging.
  • roles/storage.objectUser: To save profiles to the Cloud Storage bucket specified in machinelearning_run.

If using the ML Diagnostics gcloud CLI:

  • roles/storage.objectUser: To save profiles to the Cloud Storage bucket specified in machinelearning_run.
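
The role lists above can be turned into project-level bindings with `gcloud projects add-iam-policy-binding`. The following is a sketch under stated assumptions, not a prescribed procedure: the helper name `role_binding_commands`, the project ID, and the service account email are placeholders, and the role names are copied from the lists above.

```python
import shlex

# Roles taken from the lists above.
SDK_ROLES = (
    "roles/clusterdirector.editor",
    "roles/logging.logWriter",
    "roles/storage.objectUser",
)
CLI_ROLES = ("roles/storage.objectUser",)


def role_binding_commands(project_id, service_account_email, roles):
    """Return one `gcloud projects add-iam-policy-binding` argv per role."""
    member = f"serviceAccount:{service_account_email}"
    return [
        [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}",
            f"--role={role}",
        ]
        for role in roles
    ]


if __name__ == "__main__":
    # Placeholder project and service account; prints the commands only.
    commands = role_binding_commands(
        "my-project", "trainer@my-project.iam.gserviceaccount.com", SDK_ROLES
    )
    for argv in commands:
        print(shlex.join(argv))
```

Pass `CLI_ROLES` instead of `SDK_ROLES` if the workload only uses the gcloud CLI path.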

For workloads on Google Kubernetes Engine, use Workload Identity Federation to associate a Kubernetes Service Account with a Google Cloud service account that has been granted the required roles.
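
One common way to make this association, described in the GKE documentation rather than on this page, is to grant the Kubernetes Service Account the `roles/iam.workloadIdentityUser` role on the Google Cloud service account and then annotate the Kubernetes Service Account. The sketch below builds both commands; the helper name and all argument values are hypothetical placeholders.

```python
import shlex


def workload_identity_commands(project_id, gsa_email, namespace, ksa_name):
    """Build the two commands that link a KSA to a Google service account."""
    # Workload Identity member format: PROJECT_ID.svc.id.goog[NAMESPACE/KSA]
    member = f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"
    bind = [
        "gcloud", "iam", "service-accounts", "add-iam-policy-binding", gsa_email,
        "--role=roles/iam.workloadIdentityUser",
        f"--member={member}",
    ]
    annotate = [
        "kubectl", "annotate", "serviceaccount", ksa_name,
        f"--namespace={namespace}",
        f"iam.gke.io/gcp-service-account={gsa_email}",
    ]
    return [bind, annotate]


if __name__ == "__main__":
    # Placeholder names; prints the commands without running them.
    for argv in workload_identity_commands(
        "my-project",
        "trainer@my-project.iam.gserviceaccount.com",
        "default",
        "ml-workload",
    ):
        print(shlex.join(argv))
```

The Google Cloud service account named here is the one that holds the roles listed earlier in this section.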

Pricing

You are charged for the storage of metrics through Cloud Logging and the storage of profiles through Cloud Storage. You don't need to enable any extra billing for these services when using the ML Diagnostics platform. There is no charge for the managed XProf resources deployed by the ML Diagnostics platform.