Profile-based configurations for AI/ML workloads

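This document describes how to use profile-based configurations to simplify Cloud Storage FUSE adoption and improve its performance for artificial intelligence and machine learning (AI/ML) workloads.

To simplify Cloud Storage FUSE configuration for serving, checkpointing, or training workloads, you can apply a preconfigured profile that matches your workload type, either with the profile field in a Cloud Storage FUSE configuration file or with the --profile option on the command line. A profile selects a predefined, optimized set of Cloud Storage FUSE settings for caching, threading, and buffer space, so that you get high performance with minimal effort. The profile values are aiml-training for training workloads, aiml-checkpointing for checkpointing workloads, and aiml-serving for serving workloads.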
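
As a minimal sketch, a Cloud Storage FUSE configuration file that selects a profile might look like the following. The file name config.yaml and the placeholders BUCKET_NAME and MOUNT_POINT are assumptions for illustration; the mount commands are shown as comments.

```yaml
# config.yaml (hypothetical file name)
# Select one of: aiml-training, aiml-checkpointing, aiml-serving.
profile: aiml-training

# Mount with the configuration file (BUCKET_NAME and MOUNT_POINT are placeholders):
#   gcsfuse --config-file=config.yaml BUCKET_NAME MOUNT_POINT
#
# Or skip the configuration file and pass the profile on the command line:
#   gcsfuse --profile=aiml-training BUCKET_NAME MOUNT_POINT
```

Either form selects the same profile; which one to use is mostly a question of whether you already keep other mount settings in a configuration file.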

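Considerations

You can set the --profile option or the profile field only at mount time. To change the --profile option or the profile field, you must remount the Cloud Storage FUSE bucket.
When you use a profile-based configuration, Cloud Storage FUSE sets the metadata cache capacity and time to live (TTL) to unlimited, which means that entries are never evicted from the metadata cache. If your virtual machine doesn't have enough memory, you might encounter out-of-memory (OOM) errors. We therefore recommend that you review your memory capacity before you apply a profile-based configuration. OOM errors are more likely on machines with less than 1 TiB of memory.
If a Cloud Storage FUSE parameter is configured in more than one way, the following order of precedence applies, from highest to lowest:
1. Values set directly in the gcsfuse command or in the Cloud Storage FUSE configuration file.
2. Values set by a profile, where the profile is specified with the --profile option in the gcsfuse command or with the profile field in the Cloud Storage FUSE configuration file.
3. Default values that are applied automatically when Cloud Storage FUSE detects a high-performance machine type. For details, see the automated configuration values for high-performance machine types.
The profile field and the --profile option aren't supported for Cloud Storage FUSE CSI volumes in Google Kubernetes Engine Pods.
You can't enable file caching through a profile-based configuration, because file caching requires Cloud Storage FUSE configuration fields and CLI options that can't be generalized. To enable file caching for serving, training, or checkpointing workloads, you must configure the file cache fields or options explicitly.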
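
The precedence order and the file cache limitation above can be illustrated with a configuration file that layers explicit settings on top of a profile. This is a sketch under assumptions: the numeric limits and the cache directory path are hypothetical values chosen for illustration, not recommendations, and the field names should be checked against the Cloud Storage FUSE configuration file reference for your version.

```yaml
profile: aiml-serving

# Explicit values take precedence over the values set by the profile,
# so these bound the metadata cache instead of leaving it unlimited.
metadata-cache:
  ttl-secs: 3600                # hypothetical TTL; the profile would use -1
  stat-cache-max-size-mb: 1024  # hypothetical cap; the profile would use -1
  type-cache-max-size-mb: 64    # hypothetical cap; the profile would use -1

# The profile doesn't enable the file cache; it has to be configured explicitly.
cache-dir: /mnt/disks/gcsfuse-cache  # hypothetical local path
file-cache:
  max-size-mb: 51200                 # hypothetical cap in MiB
```

Bounding the metadata cache in this way is one option for machines where unlimited caching risks OOM errors, at the likely cost of more metadata requests to Cloud Storage.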

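Apply profile-based configuration for training workloads

The training-specific profile optimizes performance for high-throughput reads of large datasets and prevents Cloud GPU and Cloud TPU hardware from waiting on data.

To apply the training-specific profile, specify profile: aiml-training in a Cloud Storage FUSE configuration file, or specify --profile=aiml-training on the Cloud Storage FUSE CLI. The following configurations are then applied: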

  # Create implicit directories locally when accessed:
  - implicit-dirs
  # Disable caching for lookups of files or directories that don't exist:
  - metadata-cache:negative-ttl-secs:0
  # Keep cached metadata (file attributes, types) indefinitely time-wise:
  - metadata-cache:ttl-secs:-1
  # Allow unlimited size for the file attribute (stat) cache:
  - metadata-cache:stat-cache-max-size-mb:-1
  # Allow unlimited size for the file/directory type cache:
  - metadata-cache:type-cache-max-size-mb:-1
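
For orientation, the flattened entries above map to nested fields in a Cloud Storage FUSE configuration file roughly as follows. This sketch assumes the nested YAML layout of the configuration file. Setting profile: aiml-training alone applies these values, so writing them out is only useful when you want to inspect or override individual fields.

```yaml
# Assumed nested equivalent of the aiml-training profile entries.
implicit-dirs: true
metadata-cache:
  negative-ttl-secs: 0        # don't cache lookups of nonexistent entries
  ttl-secs: -1                # cached metadata never expires
  stat-cache-max-size-mb: -1  # unlimited stat cache
  type-cache-max-size-mb: -1  # unlimited type cache
```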

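Apply profile-based configuration for checkpointing workloads

The checkpointing-specific profile optimizes performance for high-throughput writes of large files, which significantly reduces the time needed to save multi-gigabyte checkpoints and minimizes training pauses.

To apply the checkpointing-specific profile, specify profile: aiml-checkpointing in a Cloud Storage FUSE configuration file, or specify --profile=aiml-checkpointing on the Cloud Storage FUSE CLI. The following configurations are then applied: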

  # Create implicit directories locally when accessed:
  - implicit-dirs
  # Disable caching for lookups of files/dirs that don't exist:
  - metadata-cache:negative-ttl-secs:0
  # Keep cached metadata (file attributes, types) indefinitely time-wise:
  - metadata-cache:ttl-secs:-1
  # Allow unlimited size for the file attribute (stat) cache:
  - metadata-cache:stat-cache-max-size-mb:-1
  # Allow unlimited size for the file/directory type cache:
  - metadata-cache:type-cache-max-size-mb:-1
  # Cache the entire file when any part is read sequentially:
  - file-cache:cache-file-for-range-read:true
  # Allow renaming directories with a lot of files in non-HNS buckets.
  - file-system:rename-dir-limit:200000
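
Compared with the training profile, the checkpointing profile adds two entries. Written as nested configuration file fields (a sketch, assuming the nested YAML layout of the configuration file), they would look like this:

```yaml
# Assumed nested form of the entries that aiml-checkpointing adds
# on top of the metadata cache settings shared with aiml-training.
file-cache:
  cache-file-for-range-read: true
file-system:
  rename-dir-limit: 200000
```

Because the profile doesn't enable the file cache itself, the cache-file-for-range-read value presumably matters only after you configure the file cache explicitly.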

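Apply profile-based configuration for serving workloads

The serving-specific profile optimizes the performance of serving workloads by improving data access and caching.

To apply the serving-specific profile, specify profile: aiml-serving in a Cloud Storage FUSE configuration file, or specify --profile=aiml-serving on the Cloud Storage FUSE CLI. The following configurations are then applied: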

  # Create implicit directories locally when accessed:
  - implicit-dirs
  # Disable caching for lookups of files/dirs that don't exist:
  - metadata-cache:negative-ttl-secs:0
  # Keep cached metadata (file attributes, types) indefinitely time-wise:
  - metadata-cache:ttl-secs:-1
  # Allow unlimited size for the file attribute (stat) cache:
  - metadata-cache:stat-cache-max-size-mb:-1
  # Allow unlimited size for the file/directory type cache:
  - metadata-cache:type-cache-max-size-mb:-1
  # Cache the entire file when any part is read sequentially:
  - file-cache:cache-file-for-range-read:true
  # Enable kernel-list-cache to make listing faster as this is a readonly file system hierarchy.
  - file-system:kernel-list-cache-ttl-secs:-1
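
The serving profile differs from the checkpointing profile only in its file-system entry. As a sketch that assumes the nested YAML layout of the configuration file, the serving-specific part would read:

```yaml
# Assumed nested form of the entries that distinguish aiml-serving.
file-cache:
  cache-file-for-range-read: true
file-system:
  # -1 keeps directory listings in the kernel list cache indefinitely,
  # which suits a file system hierarchy that is treated as read-only.
  kernel-list-cache-ttl-secs: -1
```

If the bucket contents can change while the bucket is mounted, an explicit finite value for this field would take precedence over the profile, following the precedence order described in the considerations.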

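What's next

Learn about the automated configuration values for high-performance machine types.
Learn how to optimize performance by using preconfigured GKE YAML files.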
