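Monitor goodput with the ML Goodput Measurement library

The ML Goodput Measurement library (ml-goodput-measurement) is a Python package that helps you evaluate the efficiency of ML training workloads running on Cloud TPU VMs. The library provides metrics that measure a workload's goodput: the proportion of TPU usage time spent making productive training progress. Conversely, badput is the proportion of total time spent on non-productive activities, such as startup overhead, I/O stalls, and recovery from disruptions.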
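
As a rough illustration of these definitions, the following sketch turns a productive time and a total elapsed time into goodput and badput percentages. It assumes that all elapsed time is either productive or non-productive, so that the two percentages sum to 100. The function name and the sample numbers are hypothetical and are not part of the library.

```python
def goodput_percentages(productive_seconds, total_elapsed_seconds):
    """Return (goodput %, badput %) under the assumption that they sum to 100."""
    if total_elapsed_seconds <= 0:
        raise ValueError("total_elapsed_seconds must be positive")
    if not 0 <= productive_seconds <= total_elapsed_seconds:
        raise ValueError("productive_seconds must lie in [0, total_elapsed_seconds]")
    goodput = 100.0 * productive_seconds / total_elapsed_seconds
    return goodput, 100.0 - goodput


# Hypothetical job: 90 productive minutes out of 120 minutes of wall-clock time.
goodput_pct, badput_pct = goodput_percentages(5400, 7200)
print(f"Goodput: {goodput_pct:.1f}%, badput: {badput_pct:.1f}%")
```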

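You can view goodput metrics in real time by using Cloud Monitoring dashboards and TensorBoard. These metrics help you identify performance bottlenecks, optimize resource usage, and ultimately reduce training costs.

For more information, see the ML Goodput Measurement GitHub repository.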

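Goodput metrics

The ML Goodput Measurement library provides the following metrics, which you can also view in Cloud Monitoring and TensorBoard. Each metric name listed here is prefixed with compute.googleapis.com/workload/. For example, the full metric name for goodput_time is compute.googleapis.com/workload/goodput_time.

goodput_time: Total productive training time, in seconds. This can be interpreted as the cumulative goodput.
badput_time: Total non-productive time (startup, stalls, recovery), in seconds. This can be interpreted as the cumulative badput.
total_elapsed_time: Total elapsed (wall-clock) time of the workload, in seconds. Elapsed time is measured from application startup to the current time or to the time the job completes.
interval_goodput: The goodput rate over a specified time window (for example, the last 24 hours). This metric provides a rolling-window view of goodput.
interval_badput: The badput rate over a specified time window. This metric provides a rolling-window view of badput, which helps you identify transient problems such as spikes in I/O operations.
disruptions: Cumulative number of job disruptions. A disruption is an event that stops the training process unexpectedly and requires a restart, such as a hardware failure or a maintenance event.
step_time_deviation: The amount of non-productive time caused by variation in training step time, also known as jitter. This metric is a spike-sensitive measure of stability: the deviation, in seconds, of recent worst-case step times from the historical baseline. A step is a single iteration of the training loop.
performance: The estimated fastest stable step time (the baseline), in seconds. The ideal step time is the fastest that a single training step can complete under optimal conditions, unaffected by transient noise or jitter.
max_productive_steps: The highest step count that was successfully preserved.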
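
The prefix rule can be sketched as a small helper that expands a short metric name into its full name and wraps it in a Cloud Monitoring filter expression. The helper names are hypothetical; only the prefix and the metric names come from the list above.

```python
METRIC_PREFIX = "compute.googleapis.com/workload/"

# Short metric names as listed in this section.
GOODPUT_METRICS = (
    "goodput_time",
    "badput_time",
    "total_elapsed_time",
    "interval_goodput",
    "interval_badput",
    "disruptions",
    "step_time_deviation",
    "performance",
    "max_productive_steps",
)


def full_metric_name(short_name):
    """Expand a short metric name into its full Cloud Monitoring metric type."""
    if short_name not in GOODPUT_METRICS:
        raise ValueError(f"unknown goodput metric: {short_name!r}")
    return METRIC_PREFIX + short_name


def monitoring_filter(short_name):
    """Build a Cloud Monitoring filter string that selects one metric type."""
    return f'metric.type = "{full_metric_name(short_name)}"'


print(monitoring_filter("goodput_time"))
```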

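Installation

To set up the ML Goodput Measurement library for a TPU workload, follow these steps:

Enable the Cloud Logging API and the Cloud Monitoring API.
If you deploy on Google Kubernetes Engine (GKE), configure the cloud-platform access scope for all node pools.

Install the ml-goodput-measurement package on the training hosts and on the machine that you use for analysis: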

pip install ml-goodput-measurement
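
To confirm that the package is visible to the Python interpreter on a given machine, a check along the following lines may help. It uses only the standard library; the helper name is hypothetical.

```python
from importlib import metadata


def installed_version(distribution="ml-goodput-measurement"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("ml-goodput-measurement is not installed in this environment")
else:
    print(f"ml-goodput-measurement {version} is installed")
```

Run the check on both the training hosts and the analysis machine, because each one needs its own installation.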

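Monitor with the Goodput library

To use the ML Goodput Measurement library, initialize a GoodputRecorder instance, instrument your training code by wrapping key sections with the record_event context manager, and monitor real-time metrics with GoodputMonitor. The monitor runs a background process that periodically computes goodput metrics from the recorded events and uploads them to Cloud Monitoring and TensorBoard for real-time analysis and visualization.

Initialize the Goodput recorder

Initialize GoodputRecorder, the core component of the ML Goodput Measurement library.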

import datetime
import jax
from ml_goodput_measurement import measurement

# Define a unique logger name for this specific run
logger_name = f'goodput_{config.run_name}'

# Instantiate the recorder
goodput_recorder = measurement.GoodputRecorder(
    job_name=config.run_name,
    logger_name=logger_name,
    logging_enabled=(jax.process_index() == 0)
)

Record events

Wrap your training code with the record_event context manager.

def train_loop(config):

  # 1. Wrap the entire Job (Start/End)
  with goodput_recorder.record_event(measurement.Event.JOB):

    # 2. Record Hardware Initialization
    with goodput_recorder.record_event(measurement.Event.ACCELERATOR_INIT):
      # ... perform device mesh setup ...
      initialize_tpu(config)

    # 3. Record Training Prep
    with goodput_recorder.record_event(measurement.Event.TRAINING_PREP):
      # ... create checkpoint managers, setup model ...
      model = training_prep(config)

    # 4. Main Training Loop
    for step in range(config.steps):

      # Record Data Loading
      with goodput_recorder.record_event(measurement.Event.DATA_LOADING):
        batch = get_next_batch()

      # Record Step Start (CRITICAL: Pass the step number!)
      with goodput_recorder.record_event(measurement.Event.STEP, step):
        output = train_step(model, batch)

      # 5. Record Custom Events (e.g., Evaluation)
      if step % config.eval_interval == 0:  # assumes config defines eval_interval
        with goodput_recorder.record_event(measurement.Event.CUSTOM, "eval_step"):
          run_evaluation()

To learn how to integrate training code with the Goodput library in MaxText, see goodput.py.
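
To make the nesting pattern concrete without the library or a TPU, the following sketch uses a toy stand-in recorder. ToyRecorder is hypothetical and is not how the ML Goodput Measurement library is implemented; it only shows how a context manager can capture start and end times for each wrapped section, including when the section raises an exception.

```python
import contextlib
import time


class ToyRecorder:
    """Illustrative stand-in for a recorder; keeps events in memory only."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.events = []  # (name, step, start, end) tuples, in completion order

    @contextlib.contextmanager
    def record_event(self, name, step=None):
        start = self._clock()
        try:
            yield
        finally:
            # The end time is recorded even if the wrapped code raises.
            self.events.append((name, step, start, self._clock()))

    def total_seconds(self, name):
        """Sum the duration of every recorded event with the given name."""
        return sum(end - start for n, _, start, end in self.events if n == name)


recorder = ToyRecorder()
with recorder.record_event("JOB"):
    for step in range(3):
        with recorder.record_event("DATA_LOADING"):
            time.sleep(0.001)  # placeholder for fetching a batch
        with recorder.record_event("STEP", step):
            time.sleep(0.002)  # placeholder for one training step

print(f"Recorded {len(recorder.events)} events")
print(f"Time in steps: {recorder.total_seconds('STEP'):.4f} s")
```

Passing the step number with each step event lets later analysis relate time spent to training progress.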

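Monitor events

Use GoodputMonitor to monitor metrics. It starts a background process that computes and uploads metrics while the job runs. Wrap this logic in a context manager so that the upload process starts and stops as expected.

Define a helper context manager to handle setup and lifecycle management.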

import contextlib
import jax
from ml_goodput_measurement import monitoring

@contextlib.contextmanager
def maybe_monitor_goodput(config):
  """Monitor goodput if enabled and on the main process."""
  if not config.monitor_goodput or jax.process_index() != 0:
    yield
    return

  goodput_monitor = None
  try:
    # Configure GCPOptions for Cloud Monitoring
    gcp_options = monitoring.GCPOptions(
      enable_gcp_goodput_metrics=config.enable_gcp_goodput_metrics
    )

    # Instantiate the monitor
    goodput_monitor = monitoring.GoodputMonitor(
      job_name=config.run_name,
      logger_name=f"goodput_{config.run_name}",
      tensorboard_dir=config.tensorboard_dir,
      upload_interval=config.goodput_upload_interval_seconds,
      monitoring_enabled=True,
      pathway_enabled=config.enable_pathways_goodput,
      include_badput_breakdown=True,
      gcp_options=gcp_options,
    )

    # Start the background upload process
    goodput_monitor.start_goodput_uploader()
    print("Started Goodput upload to TensorBoard & Cloud Monitoring in the background!")
    yield

  finally:
    # Ensure clean shutdown of the background process
    if goodput_monitor:
      goodput_monitor.stop_goodput_uploader()
      print("Flushed final metrics and safely exited from Goodput monitoring.")

To measure rolling-window performance, configure the monitor to use start_rolling_window_goodput_uploader and stop_rolling_window_goodput_uploader.

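# Excerpt from a method on a class that owns the monitor and the config.
try:
    # Start uploading rolling-window goodput in the background.
    self._rolling_window_monitor.start_rolling_window_goodput_uploader(
        self.config.rolling_window_size
    )
    # ... run the training loop here ...
finally:
    # Stop the uploader even if training raises an exception.
    if self._rolling_window_monitor:
        self._rolling_window_monitor.stop_rolling_window_goodput_uploader()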
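
To illustrate what a rolling-window measurement means, the following sketch computes the share of a time window that is covered by productive spans. It is a simplified model, not the library's algorithm: the function name is hypothetical, timestamps are plain numbers in seconds, and the productive spans are assumed not to overlap one another.

```python
def interval_goodput_percent(productive_spans, window_start, window_end):
    """Percent of [window_start, window_end] covered by productive spans.

    productive_spans is an iterable of (start, end) pairs that are assumed
    to be non-overlapping.
    """
    if window_end <= window_start:
        raise ValueError("window_end must be later than window_start")
    productive = 0.0
    for start, end in productive_spans:
        # Clip each span to the window and count only the positive overlap.
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            productive += overlap
    return 100.0 * productive / (window_end - window_start)


# Two productive spans with a 10-second stall between them.
spans = [(0, 10), (20, 30)]
print(interval_goodput_percent(spans, window_start=5, window_end=25))
```

A window that slides forward with the current time gives more weight to recent stalls than a cumulative figure does, which is why transient problems show up more clearly in interval metrics.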

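Wrap the main training entry point with the context manager. This ensures that monitoring starts before training begins and that the final metrics are flushed when training ends.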

def main():
  # ... Load configuration ...

  # Wrap the entire execution
  with maybe_monitor_goodput(config):
    # Run the training loop (which contains the GoodputRecorder events)
    train_loop(config)
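
The snippets in this section read several attributes from a config object that is never defined. One possible shape for it is sketched below. The class name, the field types, and all default values are assumptions chosen for illustration; in particular, the expected type of rolling_window_size is not stated in this document.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class GoodputConfig:
    """Hypothetical container for the settings used by the snippets above."""

    run_name: str
    steps: int
    eval_interval: int = 100                    # assumed evaluation cadence
    tensorboard_dir: str = "/tmp/tensorboard"   # placeholder path
    monitor_goodput: bool = True
    enable_gcp_goodput_metrics: bool = True
    enable_pathways_goodput: bool = False
    goodput_upload_interval_seconds: int = 30   # illustrative value
    rolling_window_size: tuple = (3600,)        # assumed: window lengths in seconds

    @property
    def logger_name(self):
        # Matches the f'goodput_{config.run_name}' convention used earlier.
        return f"goodput_{self.run_name}"


config = GoodputConfig(run_name="demo-run", steps=1000)
print(config.logger_name)
```

Deriving the logger name in one place keeps the recorder, the monitor, and any later analysis pointed at the same log.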

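Post-processing and analysis

You can compute goodput metrics for a completed job from any machine, such as a standard CPU VM or a laptop. You don't need a TPU for post-processing and analysis.

The following code prints the overall job goodput: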

from ml_goodput_measurement import goodput

calculator = goodput.GoodputCalculator(
    job_name="my-run-name",
    logger_name="goodput_my-run-name"
)

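# The result names differ from the imported goodput module, which is used below.
total_goodput, badput_breakdown, last_step = calculator.get_job_goodput(
    include_badput_breakdown=True
)

infra_recovery = badput_breakdown[
    goodput.BadputType.INFRASTRUCTURE_RECOVERY_FROM_DISRUPTION
]
print(f"Goodput: {total_goodput}%")
print(f"Badput (Infra Recovery): {infra_recovery}%")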

To learn about badput types (BadputType), see "Badput breakdown details".
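
When a breakdown contains many badput sources, sorting them by size shows where the most time is lost. The following sketch uses a plain dictionary with string keys and made-up percentages as a stand-in for the breakdown. It assumes that every value is a single number, which might not hold for every badput type in the real breakdown.

```python
def rank_badput(breakdown):
    """Return (source, percent) pairs sorted from largest to smallest."""
    return sorted(breakdown.items(), key=lambda item: item[1], reverse=True)


# Hypothetical breakdown values, in percent of total job time.
example_breakdown = {
    "TRAINING_PREP": 1.5,
    "DATA_LOADING_SYNC": 4.0,
    "PROGRAM_STARTUP": 2.5,
}

for source, percent in rank_badput(example_breakdown):
    print(f"{source:<20} {percent:5.1f}%")
```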

To target a specific time window, use get_job_goodput_interval:

goodput_pct, badput, _, _, _ = calculator.get_job_goodput_interval(
    start_time_utc,
    end_time_utc
)

The start time (start_time_utc) and end time (end_time_utc) are datetime objects. For details, see goodput.py.
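
One way to build the two datetime objects is sketched below for a window that covers the most recent hours. The helper name is hypothetical, and the use of timezone-aware UTC values is an assumption based on the variable names; confirm what the library expects before relying on it.

```python
import datetime


def last_hours_window(hours, now=None):
    """Return (start, end) as timezone-aware UTC datetimes for the last N hours."""
    end = now if now is not None else datetime.datetime.now(datetime.timezone.utc)
    if end.tzinfo is None:
        raise ValueError("now must be a timezone-aware datetime")
    end = end.astimezone(datetime.timezone.utc)
    return end - datetime.timedelta(hours=hours), end


start_time_utc, end_time_utc = last_hours_window(24)
print(start_time_utc.isoformat(), end_time_utc.isoformat())
```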

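Monitor with goodput dashboards

To help you monitor and visualize ML training workloads, Google Cloud provides two goodput dashboards: the GKE JobSet goodput dashboard and the Cloud ML Goodput dashboard. Use the GKE JobSet dashboard to diagnose infrastructure or scheduling problems, and use the Cloud ML Goodput dashboard to find bottlenecks in your training code.

Cloud ML Goodput dashboard

The Cloud ML Goodput dashboard evaluates the application-level efficiency of your training script. It shows how much time is spent on productive training and how much is lost to badput sources such as data loading, initialization, or recovery from disruptions.

To view the Cloud ML Goodput dashboard in Cloud Monitoring, follow these steps:

In the Google Cloud console, go to the Cloud Monitoring page.
Go to Monitoring
In the navigation pane, click Dashboards.
In the filter search field, enter "Cloud ML Goodput Dashboard".

The Cloud ML Goodput dashboard includes the following metrics (prefixed with workload/):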

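goodput_time: Cumulative time spent on productive training steps.
badput_time: Cumulative time spent on non-productive activities, broken down by badput source:
- ACCELERATOR_INITIALIZATION: Time spent setting up the TPUs.
- TRAINING_PREP: Checkpoint loading and creation of the model and optimizer.
- PROGRAM_STARTUP: JIT compilation and graph tracing.
- DATA_LOADING_SYNC: Time spent blocked on data input.
- CHECKPOINT_SAVE: Time spent saving model state.
- CHECKPOINT_RESTORE: Time spent restoring model state.
- WASTED_PROGRESS: Productive time lost because a disruption occurred before a checkpoint was saved.
- INFRASTRUCTURE_RECOVERY: Downtime while the job restarts.
- CUSTOM_BADPUT_EVENTS: User-defined synchronous events, such as evaluation.
step_time_deviation: Measures jitter and instability in training step time.
interval_goodput: Goodput computed over a rolling window (for example, the last hour).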

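GKE JobSet dashboard

The GKE JobSet goodput dashboard focuses on the efficiency of the orchestration layer. It helps you understand whether GKE can schedule Pods quickly and keep them running.

For a complete view of JobSet health and performance, go to the JobSet monitoring dashboard in the Google Cloud console:

Go to the JobSet monitoring dashboard

The JobSet monitoring dashboard includes the following metrics:

kubernetes.io/jobset/scheduling_goodput: The fraction of time, measured from the JobSet's creation time, during which all of the JobSet's required resources (Pods) are ready and able to do work. A low value indicates delays in Pod scheduling, image pulls, or resource allocation within GKE.
kubernetes.io/jobset/proxy_runtime_goodput: An estimate of the fraction of time that the TPUs are actually in use, based on system-level signals such as accelerator duty cycle. It gives an approximate view of runtime productivity without requiring application instrumentation.
Node pool metrics: Information about the health, availability, and disruptions of the node pools that host the JobSet. This helps you correlate drops in goodput with underlying node problems.

To learn more about the GKE JobSet dashboard, see "Monitor JobSet goodput".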

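Troubleshooting guide

This section provides troubleshooting information to help you identify and resolve common problems that you might encounter when using the ML Goodput Measurement library.

Missing Cloud Monitoring metrics

If Cloud Monitoring metrics are missing, do the following:

Verify IAM permissions: The service account attached to the workload must have the monitoring.timeSeries.create and monitoring.metricDescriptors.create permissions.
Check the configuration: You must explicitly enable goodput metrics in the configuration. For example, confirm that enable_gcp_goodput_metrics=True is passed in GCPOptions.

Missing general metrics

If general metrics are missing:

Confirm the monitor state: The monitor instance must have monitoring enabled. Pass monitoring_enabled=True to the monitor.

Corrupted metrics

If metrics are corrupted or show abnormal values, do the following:

Ensure that run names are unique: Each experiment must use a distinct run_name value. Reusing a run name mixes the logs of old and new runs.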
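
One way to avoid accidental reuse is to derive each run name from a base name plus a timestamp and a short random suffix, as in the following sketch. The helper name and the naming scheme are hypothetical; any scheme that produces a distinct value for each experiment works.

```python
import datetime
import uuid


def unique_run_name(base, now=None):
    """Return base-YYYYMMDD-HHMMSS-xxxxxxxx with a random 8-character suffix."""
    now = now if now is not None else datetime.datetime.now(datetime.timezone.utc)
    stamp = now.strftime("%Y%m%d-%H%M%S")
    suffix = uuid.uuid4().hex[:8]
    return f"{base}-{stamp}-{suffix}"


run_name = unique_run_name("llm-pretrain")
print(run_name)
```

Generate the name once per experiment and reuse it across restarts of the same job, so that events recorded before and after a disruption stay in the same log.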

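Missing logs

If logs are missing from Cloud Logging:

Enable the Cloud Logging API: You must enable the Cloud Logging API in your Google Cloud project.
Check the recorder configuration: You must pass the logging_enabled=True flag to the Goodput recorder.
Verify the main process: The main process (the one where jax.process_index() == 0) must be actively reporting application logs.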
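
To check whether any entries were written, it can help to search Cloud Logging for the run's log. The following sketch builds a Logging query string, assuming that the logger_name passed to the recorder is used as the log ID in your project. That mapping is an assumption, and the helper name and project ID are placeholders.

```python
from urllib.parse import quote


def goodput_log_filter(project_id, logger_name):
    """Build a Cloud Logging filter that selects one log by name."""
    # Log IDs inside logName are URL-encoded, so special characters are escaped.
    log_id = quote(logger_name, safe="")
    return f'logName="projects/{project_id}/logs/{log_id}"'


print(goodput_log_filter("my-project", "goodput_my-run-name"))
```

If the query returns no entries, work through the checks above before investigating the metrics themselves.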

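Missing checkpoint badput

If checkpoint badput is missing:

Verify the checkpointing library: The Goodput library automatically tracks checkpoint time for Orbax. If you use a different library (for example, PyTorch checkpointing), you must manually wrap the save and restore calls with record_custom_badput_event_start_time and record_custom_badput_event_end_time.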
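
The manual wrapping can be packaged as a small context manager so that the end call is made even when a save fails. In the following sketch, the two method names come from the text above, but the custom_badput_event_type keyword is an assumption about their signature, so check it against the library before use. FakeRecorder is a hypothetical stand-in that lets the sketch run without the library.

```python
import contextlib


@contextlib.contextmanager
def custom_badput_event(recorder, event_type):
    """Bracket a block with custom badput start and end calls."""
    # Assumed keyword name; verify against the installed library version.
    recorder.record_custom_badput_event_start_time(custom_badput_event_type=event_type)
    try:
        yield
    finally:
        recorder.record_custom_badput_event_end_time(custom_badput_event_type=event_type)


class FakeRecorder:
    """Stand-in that only logs which calls were made, for demonstration."""

    def __init__(self):
        self.calls = []

    def record_custom_badput_event_start_time(self, custom_badput_event_type="unknown"):
        self.calls.append(("start", custom_badput_event_type))

    def record_custom_badput_event_end_time(self, custom_badput_event_type="unknown"):
        self.calls.append(("end", custom_badput_event_type))


recorder = FakeRecorder()
with custom_badput_event(recorder, "checkpoint_save"):
    pass  # placeholder for a checkpoint save call from another library
print(recorder.calls)
```

With a real recorder, the same helper would wrap both the save and the restore calls, each with its own event type.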