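Train Llama2 with Megatron-LM on A3 Mega virtual machines

Standard

Overview

In this quickstart, you learn how to run a container-based Megatron-LM PyTorch workload on A3 Mega. The code is available in the megatron-gke GitHub repository.

Before you begin

Take the following steps to enable the Google Kubernetes Engine (GKE) API:

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.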

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the GKE API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API
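
If you prefer the command line to the console, the GKE API can likely be enabled with the Google Cloud CLI as well. The following is a minimal sketch under stated assumptions: the `enable_gke_api` function name and the `GCLOUD` override variable are hypothetical names introduced here for illustration and are not part of the quickstart.

```shell
# Sketch: enable the GKE API (container.googleapis.com) for one project.
# GCLOUD is a hypothetical override; set GCLOUD=echo to preview the command.
enable_gke_api() {
  local project_id="$1"
  if [ -z "${project_id}" ]; then
    echo "usage: enable_gke_api PROJECT_ID" >&2
    return 2
  fi
  "${GCLOUD:-gcloud}" services enable container.googleapis.com \
    --project="${project_id}"
}
```

For example, `enable_gke_api my-project` runs the command, and `GCLOUD=echo enable_gke_api my-project` only prints it. The call still requires the Service Usage Admin role described above.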

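Make sure that you have the following role or roles on the project: roles/container.admin, roles/compute.networkAdmin, roles/iam.serviceAccountUser
Check for the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. Click Select a role, then search for the role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.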
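
The console steps for granting roles can also be scripted with the Google Cloud CLI. The sketch below assumes that you want all three roles on a single user account. The `grant_quickstart_roles` function name and the `GCLOUD` override variable are hypothetical and introduced only for illustration.

```shell
# Sketch: grant the three roles named in this section to one user.
# grant_quickstart_roles and GCLOUD are hypothetical names, not part of the quickstart.
grant_quickstart_roles() {
  local project_id="$1" user_email="$2" role
  if [ -z "${project_id}" ] || [ -z "${user_email}" ]; then
    echo "usage: grant_quickstart_roles PROJECT_ID USER_EMAIL" >&2
    return 2
  fi
  for role in roles/container.admin roles/compute.networkAdmin roles/iam.serviceAccountUser; do
    # One policy binding per role; stop at the first failure.
    "${GCLOUD:-gcloud}" projects add-iam-policy-binding "${project_id}" \
      --member="user:${user_email}" \
      --role="${role}" || return 1
  done
}
```

Setting `GCLOUD=echo` before the call prints the three commands instead of running them, which is a cheap way to review them first.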

Create an A3 Mega cluster

Create an A3 Mega GKE cluster with GPUDirect-TCPXO and multi-networking. For more information, see Maximize GPU network bandwidth with GPUDirect and multi-networking.

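Set up your environment

Create environment variables for some common parameters:
```
export CLUSTER_NAME=CLUSTER_NAME
export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION
export PROJECT_ID=PROJECT_ID
```
Replace the following:
- CLUSTER_NAME: the name of your A3 Mega GKE cluster that has GPUDirect-TCPXO and multi-networking enabled.
- CONTROL_PLANE_LOCATION: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.
- PROJECT_ID: your Google Cloud project ID.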
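
Because later commands expand these variables, a variable that is unset or still holds its placeholder text tends to surface as a confusing error further on. As a minimal sketch, a small guard can catch that early. The `require_vars` helper is a hypothetical name and is not part of the quickstart scripts.

```shell
# Sketch: fail early if a variable is empty or still equals its own name
# (that is, the placeholder was never replaced).
# require_vars is a hypothetical helper introduced for illustration.
require_vars() {
  local name value status
  status=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ] || [ "${value}" = "${name}" ]; then
      echo "error: ${name} is unset or still a placeholder" >&2
      status=1
    fi
  done
  return "${status}"
}
```

A possible use is `require_vars CLUSTER_NAME CONTROL_PLANE_LOCATION PROJECT_ID || exit 1` at the top of any script that wraps the remaining steps.
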
Configure the Google Cloud CLI to use your Google Cloud credentials for authentication:
```
gcloud auth login
```
For more information, see Authenticate for using the Google Cloud CLI.

Install kubectl and the GKE gcloud CLI plugin:
```
sudo apt-get install kubectl
sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
```
Fetch credentials for your GKE cluster:
```
gcloud container clusters get-credentials ${CLUSTER_NAME} \
  --location=${CONTROL_PLANE_LOCATION} \
  --project=${PROJECT_ID}
```
Install Helm, if it isn't already installed:
```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh && rm get_helm.sh
sudo chmod +x /usr/local/bin/helm
```

Use the topology-aware scheduler to deploy your Pods

You can use the topology-aware scheduler to deploy your GKE Pods to nodes that have a specified GPU topology.

In the following kubectl commands, you use the files directly from a repository. Alternatively, you can clone the repository locally and have the kubectl commands reference the local files instead.

For more information, see Topology scheduler.

Set up the service account:
```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/service-account.yaml
```
Install the topology scheduler scripts in a configmap:
```
curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.py
curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.py

kubectl -n kube-system create configmap topology-scheduler-scripts \
    --from-file=schedule-daemon.py=schedule-daemon.py \
    --from-file=label-nodes-daemon.py=label-nodes-daemon.py
```
Install the topology label daemonset and the topology scheduler Pod:
```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.yaml
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.yaml
```
Observe the actions of the topology scheduler:
```
kubectl -n kube-system logs topology-scheduler-pod
```
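
The logs command returns little of use until the scheduler Pod is running, so it can help to wait for readiness first. The following is a sketch rather than a required step: the `wait_for_topology_scheduler` function name, the `KUBECTL` override variable, and the 300-second default timeout are assumptions introduced for illustration. The Pod name is the one used in the command above.

```shell
# Sketch: block until the scheduler Pod is Ready, then print its recent log lines.
# wait_for_topology_scheduler, KUBECTL, and the 300s default are hypothetical choices.
wait_for_topology_scheduler() {
  local timeout="${1:-300s}"
  "${KUBECTL:-kubectl}" wait --namespace=kube-system \
    --for=condition=Ready pod/topology-scheduler-pod \
    --timeout="${timeout}" || return 1
  "${KUBECTL:-kubectl}" logs --namespace=kube-system \
    topology-scheduler-pod --tail=50
}
```

Calling `wait_for_topology_scheduler 120s` shortens the wait. A non-zero return means the Pod did not become Ready in time, and `kubectl -n kube-system describe pod topology-scheduler-pod` is usually the next thing to check.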

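Run the workload

Build the Dockerfile and push to the Google Cloud Artifact Registry

Create a Cloud Storage bucket and a Docker repository. In the scripts/setup-and-configure-resources.sh script, replace the bucket and repository names with the ones that you created, and then run the script:
```
bash scripts/setup-and-configure-resources.sh
```
Build the pytorch-megatron:23.11-py3 image and push it to your repository. Ensure that the Docker repository name in the scripts/build-and-push-docker-image.sh file matches the repository name that you used in the scripts/setup-and-configure-resources.sh script. You can also edit the Docker image tag name before pushing.
```
bash scripts/build-and-push-docker-image.sh
```
Note: This image is based on nvcr.io/nvidia/pytorch:23.11-py3, with minor changes.

Launch the Megatron-LM Llama2 benchmark

Edit the helm/values.yaml file to specify the Cloud Storage bucket and Docker image that you created in the previous sections. For some example configurations, see sample-configurations.
Optional: You can also edit the selected-configuration.sh file to specify any changes that you made to the default Helm configuration.
Launch the experiment:
```
helm install HELM_EXPERIMENT_NAME helm/ --values helm/values.yaml
```
Replace HELM_EXPERIMENT_NAME with an arbitrary name for your experiment.

Note: To run a Helm experiment more than once, either remove the previous release with the helm uninstall command or create a new experiment with a different name.

The experiment writes metrics from the Nsight Systems profiling tool to the specified Cloud Storage bucket, in the megatron-experiments directory.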
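
When you repeat an experiment under the same name, the uninstall-then-install sequence from the note can be wrapped in one helper. This is a sketch under stated assumptions: the `rerun_experiment` function name and the `HELM` override variable are hypothetical, the `--wait` flag on `helm uninstall` assumes Helm 3.7 or later, and the function expects to run from the repository root so that `helm/` resolves.

```shell
# Sketch: remove a previous release with the same name (if any), then install again.
# rerun_experiment and HELM are hypothetical names; run from the repository root.
rerun_experiment() {
  local name="$1"
  if [ -z "${name}" ]; then
    echo "usage: rerun_experiment HELM_EXPERIMENT_NAME" >&2
    return 2
  fi
  # helm status exits non-zero when no release with this name exists.
  if "${HELM:-helm}" status "${name}" >/dev/null 2>&1; then
    # --wait (assumed Helm 3.7+) returns only after the old resources are gone.
    "${HELM:-helm}" uninstall "${name}" --wait || return 1
  fi
  "${HELM:-helm}" install "${name}" helm/ --values helm/values.yaml
}
```

Reusing a name keeps the release list short. Using a new name for each run, as the note also allows, keeps earlier releases around for comparison.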

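Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete the GKE cluster

Go to the Clusters page:

Go to Clusters

Select the checkbox for CLUSTER_NAME.
Click Delete.
To confirm deletion, type CLUSTER_NAME and click Delete.

Delete the Cloud Storage bucket

Go to the Buckets page:

Go to Buckets

Select the checkbox for the Cloud Storage bucket that you created for this quickstart.
Click Delete.
To confirm deletion, type DELETE and click Delete.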
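
The same cleanup can be approximated from the command line. The sketch below reuses the environment variables set earlier in this quickstart. The `cleanup_quickstart` function name, the `BUCKET_NAME` argument, and the `GCLOUD` override variable are hypothetical names introduced for illustration. The cluster deletion is left interactive so that gcloud still asks for confirmation.

```shell
# Sketch: delete the cluster, then the bucket and everything in it.
# cleanup_quickstart, BUCKET_NAME, and GCLOUD are hypothetical names.
cleanup_quickstart() {
  local bucket_name="$1"
  if [ -z "${bucket_name}" ] || [ -z "${CLUSTER_NAME:-}" ] || \
     [ -z "${CONTROL_PLANE_LOCATION:-}" ] || [ -z "${PROJECT_ID:-}" ]; then
    echo "usage: cleanup_quickstart BUCKET_NAME" >&2
    echo "requires CLUSTER_NAME, CONTROL_PLANE_LOCATION, and PROJECT_ID" >&2
    return 2
  fi
  "${GCLOUD:-gcloud}" container clusters delete "${CLUSTER_NAME}" \
    --location="${CONTROL_PLANE_LOCATION}" \
    --project="${PROJECT_ID}" || return 1
  # Removes every object in the bucket and then the bucket itself.
  "${GCLOUD:-gcloud}" storage rm --recursive "gs://${bucket_name}"
}
```

Both deletions are irreversible, so previewing with `GCLOUD=echo cleanup_quickstart BUCKET_NAME` before the real run is a reasonable precaution.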

What's next

Learn more about using GPUs in GKE.
