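Monitor goodput with the ML Goodput Measurement library

The ML Goodput Measurement library (ml-goodput-measurement) is a Python package that helps you measure the efficiency of machine learning training workloads running on Cloud TPU VMs. The library provides metrics for measuring workload goodput, which is the fraction of TPU time spent making meaningful training progress. Conversely, badput is the fraction of total time spent on nonproductive activities such as startup overhead, I/O stalls, and recovery from disruptions.

You can visualize goodput metrics in real time with Cloud Monitoring dashboards and TensorBoard, which lets you find bottlenecks, optimize resource utilization, and ultimately reduce training cost.

For more information, see the ML Goodput Measurement GitHub repository.

Goodput metrics

The ML Goodput Measurement library provides the following metrics, which you can also view in Cloud Monitoring and TensorBoard. The metrics in the following list must be prefixed with compute.googleapis.com/workload/. For example, the full metric name for goodput_time is compute.googleapis.com/workload/goodput_time.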

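goodput_time: Total productive training time, in seconds. This can be interpreted as cumulative goodput.
badput_time: Total unproductive training time (startup, stalls, recovery), in seconds. This can be interpreted as cumulative badput.
total_elapsed_time: Total running time of the workload, in seconds. Elapsed time is measured from application start until the current time or the job completion time.
interval_goodput: Goodput rate over a specified time window (for example, the last 24 hours). This metric provides a rolling window of goodput.
interval_badput: Badput rate over a specified time window. This metric provides a rolling window of badput, which helps identify transient issues such as spikes in I/O operations.
disruptions: Cumulative number of job disruptions. A disruption is an event that causes training to stop unexpectedly and require a restart, for example a hardware failure or a maintenance event.
step_time_deviation: Unproductive time caused by variation in training step time, also called "jitter". This metric measures burst-sensitive stability: the deviation, in seconds, between the recent "worst" step time and the historical baseline. A step is a single iteration of the training loop.
performance: Estimated fastest stable step time (the baseline), in seconds. The ideal step time is the minimum time a single training step can take under optimal conditions, without transient noise or jitter.
max_productive_steps: The highest step count that was successfully preserved.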
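
The relationship between the cumulative metrics can be sketched in a few lines. This is a minimal illustration, not library code: the helper names full_metric_name and goodput_percentage are hypothetical, the sample values are made up, and the sketch assumes that goodput is reported as goodput_time divided by total_elapsed_time.

```python
# Prefix stated in this document for all workload metrics.
PREFIX = "compute.googleapis.com/workload/"


def full_metric_name(short_name: str) -> str:
    """Return the fully qualified Cloud Monitoring metric name."""
    return PREFIX + short_name


def goodput_percentage(goodput_time_s: float, total_elapsed_time_s: float) -> float:
    """Share of elapsed time spent on productive training, in percent."""
    if total_elapsed_time_s <= 0:
        return 0.0
    return 100.0 * goodput_time_s / total_elapsed_time_s


# Hypothetical sample values, in seconds.
sample = {"goodput_time": 7200.0, "badput_time": 1800.0, "total_elapsed_time": 9000.0}

print(full_metric_name("goodput_time"))
print(goodput_percentage(sample["goodput_time"], sample["total_elapsed_time"]))  # 80.0
```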

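Installation

Follow these steps to set up the ML Goodput Measurement library for TPU workloads:

Enable the Cloud Logging API and the Cloud Monitoring API.
If you deploy on Google Kubernetes Engine (GKE), configure the cloud-platform access scope for all node pools.

Install the ml-goodput-measurement package on the training hosts and on the analysis machine: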

pip install ml-goodput-measurement
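
To confirm that the package is visible to the Python interpreter you plan to use, a quick check with the standard library can help. The helper name installed_version is hypothetical; the sketch only queries package metadata and does not import the library.

```python
from importlib import metadata


def installed_version(package: str = "ml-goodput-measurement"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version else "ml-goodput-measurement is not installed")
```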

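Monitor with the Goodput library

To use the ML Goodput Measurement library, initialize a GoodputRecorder instance, instrument your training code by wrapping key sections in the record_event context manager, and use GoodputMonitor to monitor metrics in real time. The monitor runs a background process that periodically computes goodput metrics from the recorded events and uploads them to Cloud Monitoring and TensorBoard for real-time analysis and visualization.

Initialize the Goodput recorder

Initialize GoodputRecorder, the core component of the ML Goodput Measurement library.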

import datetime
import jax
from ml_goodput_measurement import measurement

# Define a unique logger name for this specific run
logger_name = f'goodput_{config.run_name}'

# Instantiate the recorder
goodput_recorder = measurement.GoodputRecorder(
    job_name=config.run_name,
    logger_name=logger_name,
    logging_enabled=(jax.process_index() == 0)
)

Record events

Wrap your training code with the record_event context manager.

def train_loop(config):

  # 1. Wrap the entire Job (Start/End)
  with goodput_recorder.record_event(measurement.Event.JOB):

    # 2. Record Hardware Initialization
    with goodput_recorder.record_event(measurement.Event.ACCELERATOR_INIT):
      # ... perform device mesh setup ...
      initialize_tpu(config)

    # 3. Record Training Prep
    with goodput_recorder.record_event(measurement.Event.TRAINING_PREP):
      # ... create checkpoint managers, setup model ...
      model = training_prep(config)

    # 4. Main Training Loop
    for step in range(config.steps):

      # Record Data Loading
      with goodput_recorder.record_event(measurement.Event.DATA_LOADING):
        batch = get_next_batch()

      # Record Step Start (CRITICAL: Pass the step number!)
      with goodput_recorder.record_event(measurement.Event.STEP, step):
        output = train_step(model, batch)

      # 5. Record Custom Events (e.g., Evaluation)
      if step % eval_interval == 0:
        with goodput_recorder.record_event(measurement.Event.CUSTOM, "eval_step"):
          run_evaluation()

For an example of how MaxText integrates training code with the Goodput library, see goodput.py.
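
To build intuition for what wrapping sections in a context manager buys you, the following toy timer accumulates wall-clock time per event name. ToyRecorder is a hypothetical stand-in written for this illustration; it is not how GoodputRecorder is implemented, and it writes nothing to Cloud Logging.

```python
import contextlib
import time
from collections import defaultdict


class ToyRecorder:
    """Stand-in for illustration only; not the GoodputRecorder implementation."""

    def __init__(self):
        self.seconds_by_event = defaultdict(float)

    @contextlib.contextmanager
    def record_event(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            # Runs even if the wrapped block raises, so every start has an end.
            self.seconds_by_event[name] += time.monotonic() - start


recorder = ToyRecorder()
with recorder.record_event("JOB"):
    for step in range(3):
        with recorder.record_event("DATA_LOADING"):
            time.sleep(0.001)  # placeholder for fetching a batch
        with recorder.record_event("STEP"):
            time.sleep(0.002)  # placeholder for a training step

print(dict(recorder.seconds_by_event))
```

Because the end time is recorded in a finally clause, a section that raises still closes its interval, which is the main reason to prefer a context manager over paired start and end calls.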

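Monitor events

Use GoodputMonitor to monitor metrics. It starts a background process that computes and uploads metrics while the job runs. Wrap the logic in a context manager so that the uploader process starts and stops as expected.

Define a helper context manager to handle configuration and lifecycle management.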

import contextlib
from ml_goodput_measurement import monitoring

@contextlib.contextmanager
def maybe_monitor_goodput(config):
  """Monitor goodput if enabled and on the main process."""
  if not config.monitor_goodput or jax.process_index() != 0:
    yield
    return

  goodput_monitor = None
  try:
    # Configure GCPOptions for Cloud Monitoring
    gcp_options = monitoring.GCPOptions(
      enable_gcp_goodput_metrics=config.enable_gcp_goodput_metrics
    )

    # Instantiate the monitor
    goodput_monitor = monitoring.GoodputMonitor(
      job_name=config.run_name,
      logger_name=f"goodput_{config.run_name}",
      tensorboard_dir=config.tensorboard_dir,
      upload_interval=config.goodput_upload_interval_seconds,
      monitoring_enabled=True,
      pathway_enabled=config.enable_pathways_goodput,
      include_badput_breakdown=True,
      gcp_options=gcp_options,
    )

    # Start the background upload process
    goodput_monitor.start_goodput_uploader()
    print("Started Goodput upload to Tensorboard & GCM in the background!")
    yield

  finally:
    # Ensure clean shutdown of the background process
    if goodput_monitor:
      goodput_monitor.stop_goodput_uploader()
      print("Flushed final metrics and safely exited Goodput monitoring.")
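
The start, yield, stop lifecycle above can be illustrated with a toy uploader built on the standard library. ToyUploader and monitor are hypothetical names, the sketch uses a thread rather than a separate process, and it appends to a list instead of uploading anything. It only shows why stopping in a finally clause yields a final flush.

```python
import contextlib
import threading


class ToyUploader:
    """Illustrative stand-in; not how GoodputMonitor is implemented."""

    def __init__(self, interval_seconds, compute_metrics):
        self._interval = interval_seconds
        self._compute = compute_metrics
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.uploads = []

    def _run(self):
        # wait() returns False on timeout and True once stop() sets the event.
        while not self._stop.wait(self._interval):
            self.uploads.append(self._compute())

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        # Final flush after the background thread has exited.
        self.uploads.append(self._compute())


@contextlib.contextmanager
def monitor(uploader):
    uploader.start()
    try:
        yield uploader
    finally:
        uploader.stop()


steps_done = 0
uploader = ToyUploader(0.01, lambda: steps_done)
with monitor(uploader):
    for _ in range(5):
        steps_done += 1  # placeholder for a training step

print(uploader.uploads[-1])
```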

To measure rolling-window performance, configure the monitor to use start_rolling_window_goodput_uploader and stop_rolling_window_goodput_uploader.

try:
    self._rolling_window_monitor.start_rolling_window_goodput_uploader(
        self.config.rolling_window_size
    )
    # ... run the training loop here while the uploader is active ...
finally:
    if self._rolling_window_monitor:
        self._rolling_window_monitor.stop_rolling_window_goodput_uploader()
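
As a rough picture of what a rolling-window goodput value represents, the sketch below computes the share of a time window that is covered by productive spans. The function name interval_goodput, the span representation, and the sample numbers are assumptions made for this illustration. The library derives its values from recorded events, and its exact calculation may differ.

```python
def interval_goodput(productive_spans, window_start, window_end):
    """Percent of [window_start, window_end] covered by productive spans.

    Spans are (start, end) pairs in seconds and are assumed not to overlap.
    """
    window = window_end - window_start
    if window <= 0:
        return 0.0
    productive = 0.0
    for start, end in productive_spans:
        # Clip each span to the window before counting it.
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            productive += overlap
    return 100.0 * productive / window


# Hypothetical productive spans for a job that ran for 200 seconds.
spans = [(0, 50), (60, 100), (130, 200)]

print(interval_goodput(spans, 100, 200))  # last 100 seconds: 70.0
print(interval_goodput(spans, 0, 200))    # whole job: 80.0
```

A window that is short relative to the job length reacts quickly to a recent stall, while the whole-job value smooths it out.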

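Wrap the main training entry point in the maybe_monitor_goodput context manager. This ensures that monitoring starts before training begins and that the final metrics are flushed when training ends.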

def main():
  # ... Load configuration ...

  # Wrap the entire execution
  with maybe_monitor_goodput(config):
    # Run the training loop (which contains the GoodputRecorder events)
    train_loop(config)

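Post-processing and analysis

You can compute goodput metrics for a completed job on any machine, such as a standard CPU VM or a laptop. You don't need a TPU for post-processing and analysis.

The following code prints the total job goodput: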

from ml_goodput_measurement import goodput

calculator = goodput.GoodputCalculator(
    job_name="my-run-name",
    logger_name="goodput_my-run-name"
)

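# Use a name other than "goodput" so the imported module is not shadowed.
goodput_pct, badput_breakdown, last_step = calculator.get_job_goodput(
    include_badput_breakdown=True
)

infra_recovery_pct = badput_breakdown[
    goodput.BadputType.INFRASTRUCTURE_RECOVERY_FROM_DISRUPTION
]
print(f"Goodput: {goodput_pct}%")
print(f"Badput (Infra Recovery): {infra_recovery_pct}%")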

For the badput types (BadputType), see the badput breakdown details.

To specify a time range, use get_job_goodput_interval:

goodput_pct, badput, _, _, _ = calculator.get_job_goodput_interval(
    start_time_utc,
    end_time_utc
)

The start time (start_time_utc) and end time (end_time_utc) are datetime objects. For more information, see goodput.py.
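
One way to build such a pair for "the last N hours" is shown below. The helper name last_hours_utc is hypothetical, and the use of timezone-aware UTC datetimes is an assumption; check goodput.py for the form the calculator expects.

```python
import datetime


def last_hours_utc(hours, now=None):
    """Return (start, end) as timezone-aware UTC datetimes."""
    end = now or datetime.datetime.now(datetime.timezone.utc)
    start = end - datetime.timedelta(hours=hours)
    return start, end


start_time_utc, end_time_utc = last_hours_utc(24)
print(start_time_utc.isoformat(), end_time_utc.isoformat())
```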

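Monitor with goodput dashboards

To help you monitor and visualize machine learning training workloads, Google Cloud provides two sets of goodput dashboards: the GKE JobSet goodput dashboard and the Cloud ML Goodput dashboard. Use the GKE JobSet dashboard to diagnose infrastructure or scheduling issues, and use the Cloud ML Goodput dashboard to pinpoint bottlenecks in your training code.

Google Cloud ML Goodput dashboard

The ML Goodput dashboard measures the application-level efficiency of your training script. It provides insight into the time spent on productive training and into badput sources such as data loading, initialization, and recovery from disruptions.

To view the ML Goodput dashboard in Cloud Monitoring, do the following:

In the Google Cloud console, go to the Cloud Monitoring page.
Go to Monitoring
In the navigation pane, click Dashboards.
In the filter search field, enter "Cloud ML Goodput Dashboard".

The ML Goodput dashboard includes the following metrics (prefixed with workload/):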

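goodput_time: Cumulative time spent on productive training steps.
badput_time: Cumulative time spent on nonproductive activities, broken down by badput source:
- ACCELERATOR_INITIALIZATION: Time spent setting up the TPU.
- TRAINING_PREP: Checkpoint loading, and model and optimizer creation.
- PROGRAM_STARTUP: JIT compilation and graph tracing.
- DATA_LOADING_SYNC: Time spent blocked on data input.
- CHECKPOINT_SAVE: Time spent saving model state.
- CHECKPOINT_RESTORE: Time spent restoring model state.
- WASTED_PROGRESS: Productive time lost because a disruption occurred before a checkpoint was saved.
- INFRASTRUCTURE_RECOVERY: Downtime during job restarts.
- CUSTOM_BADPUT_EVENTS: User-defined synchronous events, such as evaluation.
step_time_deviation: Measures jitter and instability in training step time.
interval_goodput: Goodput computed over a rolling window (for example, the last hour).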
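
When reading the breakdown, it often helps to express each badput source as a percentage of elapsed time and look at the largest one first. The sketch below does that on made-up numbers. The function name badput_breakdown_percent and the sample dictionary are hypothetical and are not output of the library.

```python
def badput_breakdown_percent(seconds_by_source, total_elapsed_time_s):
    """Convert per-source badput seconds into percentages of elapsed time."""
    if total_elapsed_time_s <= 0:
        raise ValueError("total_elapsed_time_s must be positive")
    return {
        source: 100.0 * seconds / total_elapsed_time_s
        for source, seconds in seconds_by_source.items()
    }


# Hypothetical badput seconds for a job with 6000 seconds of elapsed time.
sample = {
    "PROGRAM_STARTUP": 120.0,
    "DATA_LOADING_SYNC": 300.0,
    "CHECKPOINT_SAVE": 180.0,
}

breakdown = badput_breakdown_percent(sample, 6000.0)
worst = max(breakdown, key=breakdown.get)
print(worst, breakdown[worst])  # DATA_LOADING_SYNC 5.0
```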

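GKE JobSet dashboard

The GKE JobSet Goodput dashboard focuses on efficiency at the orchestration layer and helps you understand whether GKE schedules Pods quickly and keeps them running.

For comprehensive information about JobSet health and performance, go to the JobSet monitoring dashboard in the Google Cloud console:

Go to the JobSet monitoring dashboard

The JobSet monitoring dashboard includes the following metrics:

kubernetes.io/jobset/scheduling_goodput: The fraction of total time, measured from the JobSet's creation time, during which all of the JobSet's required resources (Pods) are ready and available to do work. A low value indicates delays in Pod scheduling, image pulls, or resource allocation in GKE.
kubernetes.io/jobset/proxy_runtime_goodput: The fraction of time the TPU is actively in use, estimated from system-level signals such as accelerator duty cycle. This gives an approximate view of runtime productivity without application instrumentation.
Node pool metrics: Information about the health, availability, and disruptions of the node pools that host the JobSet. This helps correlate drops in goodput with underlying node issues.

To learn more about the GKE JobSet dashboard, see Monitor JobSet goodput.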

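Troubleshooting guide

This section provides troubleshooting information to help you identify and resolve common issues that you might encounter when using the ML Goodput Measurement library.

Missing Cloud Monitoring metrics

If Cloud Monitoring metrics are missing, do the following:

Verify IAM permissions: The service account attached to the workload must have the monitoring.timeSeries.create and monitoring.metricDescriptors.create permissions.
Check the configuration: Goodput metrics must be explicitly enabled in the configuration. For example, verify that enable_gcp_goodput_metrics=True is passed in GCPOptions.

Missing general metrics

If general metrics are missing, do the following:

Verify the monitor state: Monitoring must be enabled when the monitor is instantiated. Pass monitoring_enabled=True to the monitor.

Corrupted metrics

If your metrics are corrupted or show anomalous values, do the following:

Make sure run names are unique: Every experiment must use its own run_name value. Reusing a run name mixes the logs of the old run and the new run together.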
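
One simple way to avoid accidental reuse is to derive the run name from a prefix, a UTC timestamp, and a short random suffix. This is only a sketch: the helper name unique_run_name and the naming scheme are assumptions, and any scheme that guarantees uniqueness per experiment works equally well.

```python
import datetime
import uuid


def unique_run_name(prefix):
    """Build a run name from a prefix, a UTC timestamp, and a random suffix."""
    timestamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d-%H%M%S")
    suffix = uuid.uuid4().hex[:8]
    return f"{prefix}-{timestamp}-{suffix}"


run_name = unique_run_name("my-experiment")
# Mirrors the logger naming used in the recorder example in this document.
logger_name = f"goodput_{run_name}"
print(run_name, logger_name)
```

Keep in mind that a restarted run that should continue the same goodput history needs to reuse its original name, so generate the name once per experiment and not once per process start.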

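Missing logs

If logs are missing from Cloud Logging, do the following:

Enable the Cloud Logging API: The Cloud Logging API must be enabled in the Google Cloud project.
Check the recorder settings: The logging_enabled=True flag must be passed to the Goodput recorder.
Verify the main process: The main process (the process where jax.process_index() == 0) must actively report application logs.

Missing checkpoint badput

If checkpoint badput is missing, do the following:

Verify your checkpointing library: The Goodput library automatically tracks checkpoint time for Orbax. However, if you use a different library, such as PyTorch checkpointing, you must manually wrap save and restore calls with record_custom_badput_event_start_time and record_custom_badput_event_end_time.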
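
The manual wrapping can be kept in one place with a small context manager, sketched below. StubRecorder only logs calls and stands in for a real recorder. The custom_badput_event_type keyword, the custom_badput helper, and save_checkpoint are assumptions made for this illustration, so check the library source for the exact signatures before adapting it.

```python
import contextlib


class StubRecorder:
    """Stand-in that only logs calls; replace with your GoodputRecorder."""

    def __init__(self):
        self.calls = []

    # The keyword argument below is an assumption, not a confirmed signature.
    def record_custom_badput_event_start_time(self, custom_badput_event_type):
        self.calls.append(("start", custom_badput_event_type))

    def record_custom_badput_event_end_time(self, custom_badput_event_type):
        self.calls.append(("end", custom_badput_event_type))


@contextlib.contextmanager
def custom_badput(recorder, event_type):
    """Record a start time, run the wrapped block, then always record an end time."""
    recorder.record_custom_badput_event_start_time(custom_badput_event_type=event_type)
    try:
        yield
    finally:
        recorder.record_custom_badput_event_end_time(custom_badput_event_type=event_type)


def save_checkpoint(state, path):
    """Hypothetical placeholder for a non-Orbax save call."""
    return f"saved {len(state)} entries to {path}"


recorder = StubRecorder()
with custom_badput(recorder, "checkpoint_save"):
    save_checkpoint({"w": 1.0}, "/tmp/ckpt")

print(recorder.calls)
```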