Train Llama2 with Megatron-LM on A3 Mega VMs
Overview
In this quickstart, you learn how to run a container-based Megatron-LM PyTorch workload on A3 Mega. The code is available in the megatron-gke GitHub repository.
Before you begin
Follow these steps to enable the Google Kubernetes Engine (GKE) API:
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the GKE API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
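If you prefer the gcloud CLI over the console, the GKE API can also be enabled with a single command (a minimal sketch; container.googleapis.com is the service name of the GKE API):

```shell
# Enable the GKE API for the currently configured project.
# Requires the Service Usage Admin role described above.
gcloud services enable container.googleapis.com
```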
- Make sure that you have the following role or roles on the project: roles/container.admin, roles/compute.networkAdmin, roles/iam.serviceAccountUser
Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
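The console check above can also be approximated from the command line. The following is a sketch, assuming gcloud is already authenticated; PROJECT_ID and the email address are placeholders for your own values:

```shell
# List the roles granted directly to a given user on the project.
# Note: this does not expand group memberships.
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:user:you@example.com" \
  --format="value(bindings.role)"
```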
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
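Equivalently, the three required roles can be granted with the gcloud CLI. A sketch, assuming you have permission to modify the project's IAM policy; PROJECT_ID and the email address are placeholders:

```shell
# Grant each required role to your user account in turn.
for ROLE in roles/container.admin \
            roles/compute.networkAdmin \
            roles/iam.serviceAccountUser; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:you@example.com" \
    --role="${ROLE}"
done
```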
Create an A3 Mega cluster

Create an A3 Mega GKE cluster with GPUDirect-TCPXO and multi-networking. For more information, see Maximize GPU network bandwidth with GPUDirect and multi-networking.

Set up your environment

- Create environment variables for some common parameters:

  export CLUSTER_NAME=CLUSTER_NAME
  export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION
  export PROJECT_ID=PROJECT_ID

  Replace the following:
  - CLUSTER_NAME: the name of your A3 Mega GKE cluster that has GPUDirect-TCPXO and multi-networking enabled.
  - CONTROL_PLANE_LOCATION: the Compute Engine location of your cluster's control plane. Provide a region for regional clusters, or a zone for zonal clusters.
  - PROJECT_ID: your Google Cloud project ID.

- Configure the Google Cloud CLI to authenticate with your Google Cloud credentials:

  gcloud auth login

  For more information, see Authenticate for using the Google Cloud CLI.

- Install kubectl and the GKE gcloud CLI plugin:

  sudo apt-get install kubectl
  sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin

- Fetch credentials for your GKE cluster:

  gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${CONTROL_PLANE_LOCATION} \
    --project=${PROJECT_ID}

- If it isn't already installed, install Helm:

  curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
  chmod 700 get_helm.sh
  ./get_helm.sh && rm get_helm.sh
  sudo chmod +x /usr/local/bin/helm

Use the topology-aware scheduler to deploy Pods

You can use the topology-aware scheduler to deploy GKE Pods to nodes that have a specified GPU topology.

In the following kubectl commands, you use the files directly from a repository. Alternatively, you can clone the repository locally, and the kubectl commands can reference the local files instead. For more information, see Topology scheduler.

- Set up the service account:

  kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/service-account.yaml

- Install the topology scheduler scripts in a configmap:

  curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.py
  curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.py
  kubectl -n kube-system create configmap topology-scheduler-scripts \
    --from-file=schedule-daemon.py=schedule-daemon.py \
    --from-file=label-nodes-daemon.py=label-nodes-daemon.py

- Install the topology label daemonset and the topology scheduler Pod:

  kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.yaml
  kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.yaml

- Observe the actions of the topology scheduler:

  kubectl -n kube-system logs topology-scheduler-pod

Run the workload

Build the Dockerfile and push to Google Cloud Artifact Registry

- Create a Cloud Storage bucket and a Docker repository. In the scripts/setup-and-configure-resources.sh script, replace the bucket and repository names with the ones that you created, and then run the script:

  bash scripts/setup-and-configure-resources.sh

- Build the pytorch-megatron:23.11-py3 image and push it to your repository. Ensure that the Docker repository name in the scripts/build-and-push-docker-image.sh file matches the repository name that you used in the scripts/setup-and-configure-resources.sh script. You can also modify the Docker image tag name before pushing:

  bash scripts/build-and-push-docker-image.sh

Launch the Megatron-LM Llama2 benchmark

- Modify the helm/values.yaml file to specify the Cloud Storage bucket and Docker image that you created in the previous sections. For some example configurations, see sample-configurations.

- Optional: You can also modify the selected-configuration.sh file to specify any changes that you made to the default Helm configuration.

- Launch the benchmark with Helm:

  helm install HELM_EXPERIMENT_NAME helm/ --values helm/values.yaml

  Replace HELM_EXPERIMENT_NAME with an arbitrary name for your experiment.

The experiment writes metrics from the Nsight Systems profiling tool to the specified Cloud Storage bucket, in the megatron-experiments directory.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete the GKE cluster:

- Go to the Clusters page.
- Select the checkbox for CLUSTER_NAME.
- Click Delete.
- To confirm deletion, type CLUSTER_NAME, and then click Delete.

Delete the Cloud Storage bucket:

- Go to the Buckets page.
- Select the checkbox for the Cloud Storage bucket that you created for this quickstart.
- Click Delete.
- To confirm deletion, type DELETE, and then click Delete.
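The console-based cleanup can also be scripted with gcloud. A minimal sketch, reusing the environment variables exported during setup; BUCKET_NAME is a placeholder for the bucket you created:

```shell
# Delete the GKE cluster without an interactive prompt.
gcloud container clusters delete "${CLUSTER_NAME}" \
  --location="${CONTROL_PLANE_LOCATION}" \
  --project="${PROJECT_ID}" \
  --quiet

# Delete the Cloud Storage bucket and everything in it.
gcloud storage rm --recursive gs://BUCKET_NAME
```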
What's next