Train Llama2 with Megatron-LM on A3 Mega VMs
Overview
In this quickstart, you learn how to run a container-based Megatron-LM PyTorch workload on A3 Mega. The code is available in the megatron-gke GitHub repository.
Before you begin
Follow these steps to enable the Google Kubernetes Engine (GKE) API:
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the GKE API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
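If you prefer the gcloud CLI over the console, the GKE API can also be enabled with a single command (a minimal sketch; container.googleapis.com is the service name of the GKE API):

```shell
# Enable the GKE API for the currently configured project.
# Requires the Service Usage Admin role described above.
gcloud services enable container.googleapis.com
```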
- Make sure that you have the following role or roles on the project: roles/container.admin, roles/compute.networkAdmin, roles/iam.serviceAccountUser
Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
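The console check above can also be approximated from the command line. The following is a sketch, assuming gcloud is already authenticated; PROJECT_ID and the email address are placeholders for your own values:

```shell
# List the roles granted directly to a given user on the project.
# Note: this does not expand group memberships.
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:user:you@example.com" \
  --format="value(bindings.role)"
```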
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
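Equivalently, the three required roles can be granted with the gcloud CLI. A sketch, assuming you have permission to modify the project's IAM policy; PROJECT_ID and the email address are placeholders:

```shell
# Grant each required role to your user account in turn.
for ROLE in roles/container.admin \
            roles/compute.networkAdmin \
            roles/iam.serviceAccountUser; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:you@example.com" \
    --role="${ROLE}"
done
```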
Create an A3 Mega cluster

Create an A3 Mega GKE cluster with GPUDirect-TCPXO and multi-networking. For more information, see Maximize GPU network bandwidth with GPUDirect and multi-networking.

Set up your environment

- Create environment variables for some common parameters:

  export CLUSTER_NAME=CLUSTER_NAME
  export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION
  export PROJECT_ID=PROJECT_ID

  Replace the following:
  - CLUSTER_NAME: the name of your A3 Mega GKE cluster that has GPUDirect-TCPXO and multi-networking enabled.
  - CONTROL_PLANE_LOCATION: the Compute Engine location of your cluster's control plane. Provide a region for regional clusters, or a zone for zonal clusters.
  - PROJECT_ID: your Google Cloud project ID.

- Configure the Google Cloud CLI to authenticate with your Google Cloud credentials:

  gcloud auth login

  For more information, see Authenticate for using the Google Cloud CLI.

- Install kubectl and the GKE gcloud CLI plugin:

  sudo apt-get install kubectl
  sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin

- Fetch credentials for your GKE cluster:

  gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${CONTROL_PLANE_LOCATION} \
    --project=${PROJECT_ID}

- If it isn't already installed, install Helm:

  curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
  chmod 700 get_helm.sh
  ./get_helm.sh && rm get_helm.sh
  sudo chmod +x /usr/local/bin/helm

Use the topology-aware scheduler to deploy Pods

You can use the topology-aware scheduler to deploy GKE Pods to nodes that have a specified GPU topology.

In the following kubectl commands, you use the files directly from a repository. Alternatively, you can clone the repository locally, and the kubectl commands can reference the local files instead. For more information, see Topology scheduler.

- Set up the service account:

  kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/service-account.yaml

- Install the topology scheduler scripts in a configmap:

  curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.py
  curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.py
  kubectl -n kube-system create configmap topology-scheduler-scripts \
    --from-file=schedule-daemon.py=schedule-daemon.py \
    --from-file=label-nodes-daemon.py=label-nodes-daemon.py

- Install the topology label daemonset and the topology scheduler Pod:

  kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.yaml
  kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.yaml

- Observe the actions of the topology scheduler:

  kubectl -n kube-system logs topology-scheduler-pod

Run the workload

Build the Dockerfile and push to Google Cloud Artifact Registry

- Create a Cloud Storage bucket and a Docker repository. In the scripts/setup-and-configure-resources.sh script, replace the bucket and repository names with the ones that you created, and then run the script:

  bash scripts/setup-and-configure-resources.sh

- Build the pytorch-megatron:23.11-py3 image and push it to your repository. Ensure that the Docker repository name in the scripts/build-and-push-docker-image.sh file matches the repository name that you used in the scripts/setup-and-configure-resources.sh script. You can also modify the Docker image tag name before pushing:

  bash scripts/build-and-push-docker-image.sh

Launch the Megatron-LM Llama2 benchmark

- Modify the helm/values.yaml file to specify the Cloud Storage bucket and Docker image that you created in the previous sections. For some example configurations, see sample-configurations.

- Optional: You can also modify the selected-configuration.sh file to specify any changes that you made to the default Helm configuration.

- Launch the benchmark with Helm:

  helm install HELM_EXPERIMENT_NAME helm/ --values helm/values.yaml

  Replace HELM_EXPERIMENT_NAME with an arbitrary name for your experiment.

The experiment writes metrics from the Nsight Systems profiling tool to the specified Cloud Storage bucket, in the megatron-experiments directory.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete the GKE cluster:

- Go to the Clusters page.
- Select the checkbox for CLUSTER_NAME.
- Click Delete.
- To confirm deletion, type CLUSTER_NAME, and then click Delete.

Delete the Cloud Storage bucket:

- Go to the Buckets page.
- Select the checkbox for the Cloud Storage bucket that you created for this quickstart.
- Click Delete.
- To confirm deletion, type DELETE, and then click Delete.
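The console-based cleanup can also be scripted with gcloud. A minimal sketch, reusing the environment variables exported during setup; BUCKET_NAME is a placeholder for the bucket you created:

```shell
# Delete the GKE cluster without an interactive prompt.
gcloud container clusters delete "${CLUSTER_NAME}" \
  --location="${CONTROL_PLANE_LOCATION}" \
  --project="${PROJECT_ID}" \
  --quiet

# Delete the Cloud Storage bucket and everything in it.
gcloud storage rm --recursive gs://BUCKET_NAME
```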
What's next