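Troubleshoot Pathways on Cloud

Before you begin

This document describes how to troubleshoot Pathways workloads.

View logs

In the Cloud Logging Logs Explorer, use the following query, adjusting it for your project, region, cluster, and workload.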

resource.type="k8s_container"
resource.labels.project_id="PROJECT"
resource.labels.location="LOCATION"
resource.labels.cluster_name="CLUSTER"
resource.labels.namespace_name="default"
resource.labels.pod_name:"WORKLOAD_NAME"

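Replace the following:

PROJECT: your Google Cloud project ID
LOCATION: the region or zone where you created your GKE cluster
CLUSTER: the name of your GKE cluster
WORKLOAD_NAME: the name of your workload when using XPK, or the name of your JobSet when using kubectl

This query matches multiple Pathways Kubernetes containers with names such as pathways-rm, pathways-proxy, and pathways-worker. To narrow down which container is causing the problem, add a filter on the container name, for example resource.labels.container_name:"<container_name>".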
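
The query above can also be assembled programmatically, for example when you switch between workloads often. The following is a minimal sketch; the function name build_log_filter and the example values are hypothetical and used only for illustration. It produces the filter text to paste into Logs Explorer and does not call any Google Cloud API.

```python
def build_log_filter(project, location, cluster, workload,
                     container=None, namespace="default"):
    """Return a Logs Explorer filter for Pathways containers.

    build_log_filter is a hypothetical helper; the filter fields
    themselves are the ones shown in the query above.
    """
    clauses = [
        'resource.type="k8s_container"',
        f'resource.labels.project_id="{project}"',
        f'resource.labels.location="{location}"',
        f'resource.labels.cluster_name="{cluster}"',
        f'resource.labels.namespace_name="{namespace}"',
        # ":" is a substring match, so one workload name matches all of its Pods.
        f'resource.labels.pod_name:"{workload}"',
    ]
    if container:
        # Optional: narrow to one container, such as pathways-rm.
        clauses.append(f'resource.labels.container_name:"{container}"')
    return "\n".join(clauses)


if __name__ == "__main__":
    # Example values are placeholders, not real resources.
    print(build_log_filter("my-project", "us-central1", "my-cluster",
                           "my-workload", container="pathways-rm"))
```

Leaving container unset reproduces the six-line query shown earlier; passing a container name appends the extra filter line.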

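Monitoring

Health monitoring

You can monitor the health of the various Pathways components by looking for entries in the container logs. For example:

pathways-proxy is ready to handle new connection requests when the following log entry is written: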

kubectl logs ${HEAD_POD_NAME} --container pathways-proxy
...
I1101 04:51:41.967764       1 proxy_server.cc:125] IFRT proxy server started with status OK

pathways-rm is ready to handle new connection requests when the following log entry is written:

kubectl logs $HEAD_POD_NAME --container pathways-rm
...
I1101 04:50:41.967764       1 server_lib.cc:1473] Pathways Server serving on [::]:29001

To verify that all TPUs registered with the Pathways Resource Manager are ready, look for *** <num slices>/<num slices> Pathways Slices Now Ready *** in the pathways-rm container logs:

kubectl logs $HEAD_POD_NAME --container pathways-rm
...
I1101 04:52:41.967764       1 multi_job_allocator.cc:1063] *** 2/2 Pathways Slices Now Ready ***

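Pathways clients can connect to the IFRT proxy server as soon as it is ready, even if not all slices are ready. Your job uses virtual slices until the real slices become ready. Virtual slices let your code run while TPUs are unavailable.

pathways-worker is ready to handle new connection requests when the following log entry is written: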

kubectl logs $WORKER_POD_NAME --container pathways-worker
...
I1101 04:50:41.967764       1 server_lib.cc:1473] Pathways Server serving on [::]:29001
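
The readiness checks above can be automated by scanning saved log text for the same messages. The following is a minimal sketch that assumes the log wording matches the samples in this section exactly; the helper names container_ready and slices_ready are hypothetical.

```python
import re

# Readiness messages copied from the sample log lines above.
READY_PATTERNS = {
    "pathways-proxy": re.compile(r"IFRT proxy server started with status OK"),
    "pathways-rm": re.compile(r"Pathways Server serving on"),
    "pathways-worker": re.compile(r"Pathways Server serving on"),
}
SLICES_READY = re.compile(
    r"\*\*\*\s*(\d+)/(\d+) Pathways Slices Now Ready")


def container_ready(container, log_text):
    """Return True if the readiness message for `container` appears."""
    return bool(READY_PATTERNS[container].search(log_text))


def slices_ready(log_text):
    """Return (ready, total) from the last slice-status line, or None."""
    matches = SLICES_READY.findall(log_text)
    if not matches:
        return None
    ready, total = matches[-1]
    return int(ready), int(total)


if __name__ == "__main__":
    sample = ("I1101 04:52:41.967764       1 multi_job_allocator.cc:1063] "
              "*** 2/2 Pathways Slices Now Ready ***")
    print(slices_ready(sample))
```

To use it, pass in the text returned by the kubectl logs commands shown above, for example after saving the output to a file.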

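Metrics collection

Pathways can write low-level system metrics to Cloud Monitoring for debugging. These metrics include:

DCN transfer latency
Collective latency
Host-to-device transfer latency
Device-to-host transfer latency

You can find these metrics in the Cloud Monitoring dashboard of the Google Cloud project where your GKE cluster runs.

To monitor these metrics in Metrics Explorer, do the following:

Go to Metrics Explorer.
Use the Select a metric field to filter by metric name.
Select Add filter and filter by pod_name to narrow the results to your workload, then set the time range.
Choose an appropriate aggregation type based on the metric and the workload you are monitoring.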

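DCN transfer latency

This Megascale XLA (MXLA) metric measures the cumulative distribution of network transfer latencies for multi-slice traffic. The latency measurement starts when a request to transfer data over DCN is issued and ends when an acknowledgement is received that the data transfer is complete. To monitor this metric, filter by the metric name dcn_transfer_latencies.

Collective latency

This MXLA metric measures the cumulative distribution of end-to-end collective latency for multi-slice traffic. The latency measurement starts when a collective request is issued and ends when an acknowledgement is received that the data transfer is complete. To monitor this metric, filter by the metric name collective_e2e_latency.

Host-to-device transfer latency

This MXLA metric measures the cumulative distribution of host-to-device transfer latencies for multi-slice traffic. The latency measurement starts when a request to transfer data over DCN is issued and ends when an acknowledgement is received that the data transfer is complete. To monitor this metric, filter by the metric name host_to_device_transfer_latencies.

Device-to-host transfer latency

This MXLA metric measures the cumulative distribution of device-to-host transfer latencies for multi-slice traffic. The latency measurement starts when a request to transfer data over DCN is issued and ends when an acknowledgement is received that the data transfer is complete. To monitor this metric, filter by the metric name device_to_host_transfer_latencies.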

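Debug common errors

Unable to hash accelerator config warning

The following message is only a warning and does not affect the performance of your JAX code.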

INFO:jax._src.cache_key:get (_hash_accelerator_config): unable to hash accelerator config, falling back to hashing devices + platform: UNIMPLEMENTED: GetTopologyForDevices is not supported for the IFRT proxy client. (type <class 'jaxlib.xla_extension.XlaRuntimeError'>)

Permission logging.logEntries.create denied on resource

If you see the following error, make sure that the Compute Engine service account used by Vertex AI Workbench has permission to write log entries to Cloud Logging.

INFO:absl:Created 'ArrayHandler' with primary_host=0, replica-id=0
WARNING:absl:pathwaysutils: Detected Pathways-on-Cloud backend. Applying changes.
Failed to submit 1 logs.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 65, in error_remapped_callable
    return callable_(**args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/grpc/_channel.py", line 1181, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/opt/conda/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state) # pytype: disable-not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
  status = StatusCode.PERMISSION_DENIED
  details = "Permission 'logging.logEntries.create' denied on resource (or it may not exist)."
  debug_error_string = "UNKNOWN:Error received from peer ipv4:216.239.34.174:443 {created_time:"2024-10-03T20:30:44.820425276+00.00", grpc_status:7, grpc_message:"Permission \'logging.logEntries.create\' denied on resource (or it may not exist)."}"
>

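To fix this, add the Logs Writer role to the Compute Engine service account.

Unable to connect to the IFRT proxy server

If you see this error, your IFRT proxy client cannot connect to the IFRT proxy server.

Make sure your VPC network is configured correctly.
Make sure your firewall is configured to allow the connection.
Make sure your Pathways cluster has been provisioned.
Check for error messages during startup.

When your first JAX command that sends an IFRT command to the Pathways cluster runs, it hangs for about one minute and then shows a RuntimeError similar to the following: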

RuntimeError: Unable to initialize backend 'proxy': UNAVAILABLE: Unable to establish connection to ifrt_proxy server, please check provided address example-workload-proxy-0-0.example-workload.default.svc.example-cluster-domain.:38676'; detailed error: DNS resolution failed (set JAX_PLATFORMS='' to automatically choose an available backend)
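
Because the sample error above ends in "DNS resolution failed", one quick check is whether the proxy address resolves and accepts a TCP connection from the machine where the client runs. The following is a minimal sketch that uses only the Python standard library. The helper names are hypothetical, and the regular expression assumes the error text has the same shape as the sample above.

```python
import re
import socket

# Assumes the address appears as "... provided address [']host:port'" as in
# the sample error above; other message shapes will not match.
ADDRESS_RE = re.compile(r"please check provided address '?([^\s';]+?):(\d+)'")


def parse_proxy_address(error_message):
    """Return (host, port) extracted from the error text, or None."""
    match = ADDRESS_RE.search(error_message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def can_resolve(host):
    """Return True if DNS resolution for `host` succeeds."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    message = ("UNAVAILABLE: Unable to establish connection to ifrt_proxy "
               "server, please check provided address example-workload-"
               "proxy-0-0.example-workload.default.svc.example-cluster-"
               "domain.:38676'; detailed error: DNS resolution failed")
    print(parse_proxy_address(message))
```

If can_resolve returns False, the VPC or DNS configuration is the likely place to look; if resolution works but can_connect returns False, check the firewall rules and whether the Pathways cluster is provisioned.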

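Connection already established with the Pathways cluster

A Pathways cluster can maintain a session with only one client at a time. If two different notebooks try to connect to the same Pathways cluster, one connects successfully and the other shows the following error.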

INFO:absl:Created ArrayHandler with primary_host=0, replica_id=0
WARNING:absl:pathwaysutils: Detected Pathways-on-Cloud backend. Applying changes.
E0927 21:19:52.919607   37624 rpc_helper.cc:56] Connection to IFRT proxy server was terminated: CANCELLED: Cancelled
E0927 21:20:03.467547   37719 rpc_helper.cc:56] Connection to IFRT proxy server was terminated: CANCELLED: Cancelled
E0927 21:20:14.011645   37807 rpc_helper.cc:56] Connection to IFRT proxy server was terminated: CANCELLED: Cancelled
E0927 21:20:24.557955   37924 rpc_helper.cc:56] Connection to IFRT proxy server was terminated: CANCELLED: Cancelled
---------------------------------------------------------------------------
XlaRuntimeError                           Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/jax/_src/xla_bridge.py:887, in backends()
    885   continue
--> 887 backend = _init_backend(platform)
    888 _backends[platform] = backend

File /opt/conda/lib/python3.10/site-packages/jax/_src/xla_bridge.py:973, in _init_backend(platform)
    972 logger.debug("Initializing backend '%s'", platform)
--> 973 backend = registration.factory()
    974 # TODO: consider raising more descriptive errors directly from backend
    975 # factories instead of returning None.

File /opt/conda/lib/python3.10/site-packages/pathwaysutils/proxy_backend.py:24, in register_backend_factory.<locals>.<lambda>()
     21 def register_backend_factory():
     22   xla_bridge.register_backend_factory(
     23       "proxy",
---> 24       lambda: ifrt_proxy.get_client(
     25           jax.config.read("jax_backend_target"),
     26           ifrt_proxy.ClientConnectionOptions(),
     27       ),
     28       priority=-1,
     29   )

XlaRuntimeError: UNAVAILABLE: Unable to establish connection to ifrt_proxy server, please check provided address example-workload-proxy-0-0.example-workload.default.svc.example-cluster-domain.:38676'; detailed error: Connection to IFRT proxy server was terminated: CANCELLED: Cancelled

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[2], line 4
      1 import pathwaysutils
      3 import jax
----> 4 print(jax.devices())

File /opt/conda/lib/python3.10/site-packages/jax/_src/xla_bridge.py:1085, in devices(backend)
   1060 def devices(
   1061     backend: str | xla_client.Client | None = None
   1062 ) -> list[xla_client.Device]:
   1063   """Returns a list of all devices for a given backend.
   1064
   1065   .. currentmodule:: jaxlib.xla_extension
   (...)
   1083     List of Device subclasses.
   1084   """
-> 1085   return get_backend(backend).devices()

File /opt/conda/lib/python3.10/site-packages/jax/_src/xla_bridge.py:1019, in get_backend(platform)
   1015 @lru_cache(maxsize=None)  # don't use util.memoize because there is no X64 dependence.
   1016 def get_backend(
   1017     platform: None | str | xla_client.Client = None
   1018 ) -> xla_client.Client:
-> 1019   return _get_backend_uncached(platform)

File /opt/conda/lib/python3.10/site-packages/jax/_src/xla_bridge.py:998, in _get_backend_uncached(platform)
    994   return platform
    996 platform = (platform or _XLA_BACKEND.value or _PLATFORM_NAME.value or None)
--> 998 bs = backends()
    999 if platform is not None:
   1000   platform = canonicalize_platform(platform)

File /opt/conda/lib/python3.10/site-packages/jax/_src/xla_bridge.py:903, in backends()
    901       else:
    902         err_msg += " (you may need to uninstall the failing plugin package, or set JAX_PLATFORMS=cpu to skip this backend.)"
--> 903       raise RuntimeError(err_msg)
    905 assert _default_backend is not None
    906 if not config.jax_platforms.value:

RuntimeError: Unable to initialize backend 'proxy': UNAVAILABLE: Unable to establish connection to ifrt_proxy server, please check provided address 'example-workload-proxy-0-0.example-workload.default.svc.example-cluster-domain.:38676'; detailed error: Connection to IFRT proxy server was terminated: CANCELLED: Cancelled (set JAX_PLATFORMS='' to automatically choose an available backend)

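After the original client disconnects, the second client can connect. After an unexpected disconnect, you might need to restart the Pathways cluster before another client can connect.

LocalProxy.init() got an unexpected keyword argument 'unbound_message'

If you see this error after importing pathwaysutils, check whether your environment has an outdated version of Flask or Werkzeug installed (replace pip with pip3 as needed):

 pip list --outdated

If Flask or Werkzeug is listed, consider upgrading them, but be aware that doing so might break other packages or dependencies in your project:

 pip install flask Werkzeug --upgrade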
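
As an alternative to reading the pip output, you can inspect the installed versions from inside the same Python environment (for example, a notebook cell). This is a minimal sketch using the standard library importlib.metadata module; the helper name installed_versions is hypothetical, and the sketch only reports versions without deciding which ones are too old.

```python
from importlib import metadata


def installed_versions(packages=("flask", "werkzeug")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    print(installed_versions())
```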

Internal error from proxy server

"Internal error from proxy server during Array::IsDeleted(): UNAVAILABLE:Connection to IFRT proxy server was terminated: FAILED_PRECONDITION:
GrpcClientSession: writes no longer allowed."

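This error means that the IFRT proxy server has disconnected from the client. Restarting the client fixes the problem. For a notebook, restart the notebook kernel and rerun the notebook.

SIGTERMs and HBM OOM

If a SIGTERM appears in the logs alongside a RESOURCE_EXHAUSTED error, it might indicate an HBM out-of-memory (OOM) condition. In that case, reduce the amount of HBM that your JAX code uses.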
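
When deciding where to cut HBM usage, a rough per-array estimate can help identify the largest contributors. The following is a back-of-the-envelope sketch, not a Pathways or JAX API: it multiplies the element count by the element size, and the shapes and the array_bytes helper are illustrative assumptions. Actual HBM usage also includes compiler temporaries and padding, which this estimate ignores.

```python
import math

# Element sizes in bytes for a few common dtypes.
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

GIB = 1024 ** 3


def array_bytes(shape, dtype):
    """Estimate the bytes needed to store one dense array."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype]


if __name__ == "__main__":
    shape = (8192, 8192, 16)  # hypothetical array shape
    print(array_bytes(shape, "float32") / GIB)   # 4.0
    print(array_bytes(shape, "bfloat16") / GIB)  # 2.0
```

In this example, switching the array from float32 to bfloat16 halves its footprint; reducing a dimension such as the batch size scales the estimate in the same way.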

INVALID_ARGUMENT

"INVALID_ARGUMENT : Permanent error, with a last message of Lifecycle
matches_prefix cannot specify more than 50 prefixes per config.; Error while
initializing persistent cache storage Cloud Storage"

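This error occurs when the Cloud Storage bucket passed to the --pathways_gcs_location flag has reached its lifecycle policy limit. If this happens, clean up Cloud Storage lifecycle policies that are no longer in use.

Permanent error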
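
To see how close a bucket is to the limit, you can count the prefixes in its lifecycle configuration. The following is a minimal sketch that works on a lifecycle configuration already exported as JSON. It assumes the JSON follows the Cloud Storage JSON API shape, with a top-level "rule" list and a "matchesPrefix" list under each rule's "condition". The limit of 50 is taken from the error message above, and the helper name is hypothetical.

```python
import json

# Limit quoted in the error message above.
MAX_PREFIXES_PER_CONFIG = 50


def count_lifecycle_prefixes(lifecycle):
    """Count matchesPrefix entries across all rules in a lifecycle config.

    Assumes the shape {"rule": [{"condition": {"matchesPrefix": [...]}}]}.
    """
    total = 0
    for rule in lifecycle.get("rule", []):
        total += len(rule.get("condition", {}).get("matchesPrefix", []))
    return total


if __name__ == "__main__":
    # Illustrative configuration, not taken from a real bucket.
    example = json.loads("""
    {"rule": [
      {"action": {"type": "Delete"},
       "condition": {"age": 30, "matchesPrefix": ["run-a/", "run-b/"]}},
      {"action": {"type": "Delete"},
       "condition": {"age": 7}}
    ]}
    """)
    used = count_lifecycle_prefixes(example)
    print(f"{used} of {MAX_PREFIXES_PER_CONFIG} prefixes in use")
```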

Permanent error, with a last message of The specified bucket does not exist.; Error while initializing persistent cache storage gcs

This error is printed on the Resource Manager or the Pathways workers when you provide an invalid Cloud Storage location to the Pathways containers.

Error response from daemon

Error response from daemon: dockerfile parse error line 16: Unknown flag: exclude

This error is caused by an outdated version of Docker. Upgrade your Docker version.

IFRT Proxy client and server failed to agree

IFRT Proxy client and server failed to agree on the protocol version; supported versions: client = [1, 1], server = [3, 14]

This means that you are using an older version of MaxText. Make sure that you rebuild with the latest MaxText image.
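
The version ranges in the message show why the handshake fails: the client and server ranges do not overlap. The following is a minimal sketch that extracts both ranges from the error text and checks for overlap. It assumes each bracketed pair is an inclusive [min, max] range, which the message suggests but does not state; the helper names are hypothetical.

```python
import re

VERSIONS_RE = re.compile(
    r"client = \[(\d+), (\d+)\], server = \[(\d+), (\d+)\]")


def parse_version_ranges(message):
    """Return ((client_min, client_max), (server_min, server_max)) or None."""
    match = VERSIONS_RE.search(message)
    if match is None:
        return None
    c_min, c_max, s_min, s_max = map(int, match.groups())
    return (c_min, c_max), (s_min, s_max)


def ranges_overlap(client, server):
    """Return True if two inclusive ranges share at least one version."""
    return max(client[0], server[0]) <= min(client[1], server[1])


if __name__ == "__main__":
    message = ("IFRT Proxy client and server failed to agree on the protocol "
               "version; supported versions: client = [1, 1], "
               "server = [3, 14]")
    client, server = parse_version_ranges(message)
    print(client, server, ranges_overlap(client, server))
```

For the message above, the client supports only version 1 while the server supports versions 3 through 14, so no common version exists until the client side is rebuilt.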
