Train Llama2 with Megatron-LM on A3 Mega VMs
Overview
This quickstart shows you how to run a container-based Megatron-LM PyTorch workload on A3 Mega. The code is available in this GitHub repository: megatron-gke.
Before you begin
Follow these steps to enable the Google Kubernetes Engine (GKE) API:
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the GKE API.
  Roles required to enable APIs
  To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Make sure that you have the following role or roles on the project: roles/container.admin, roles/compute.networkAdmin, roles/iam.serviceAccountUser
Check for the roles
- In the Google Cloud console, go to the IAM page.
  Go to IAM
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
  Go to IAM
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role, and then add each additional role.
- Click Save.
Create environment variables for some common parameters
export CLUSTER_NAME=CLUSTER_NAME
export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION
export PROJECT_ID=PROJECT_ID
Replace the following:
- CLUSTER_NAME: the name of your A3 Mega GKE cluster that has GPUDirect-TCPXO and multi-networking enabled.
- CONTROL_PLANE_LOCATION: the Compute Engine location of your cluster's control plane. Provide a region for regional clusters, or a zone for zonal clusters.
- PROJECT_ID: your Google Cloud project ID.
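For example, with hypothetical placeholder values filled in (these names are illustrative, not real resources), the variables could be set like this:

```shell
# Hypothetical example values -- substitute your own cluster, location, and project.
export CLUSTER_NAME=a3mega-demo
export CONTROL_PLANE_LOCATION=us-east4
export PROJECT_ID=my-sample-project

# Later commands in this guide expand these variables, for example:
echo "gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CONTROL_PLANE_LOCATION} --project=${PROJECT_ID}"
```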
Configure the Google Cloud CLI to use your Google Cloud credentials for authentication:
gcloud auth login
For more information, see Authenticate for using the Google Cloud CLI.
Install kubectl and the GKE gcloud CLI plugin:
sudo apt-get install kubectl
sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
Fetch credentials for your GKE cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${CONTROL_PLANE_LOCATION} \
    --project=${PROJECT_ID}
If it isn't already installed, install Helm:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh && rm get_helm.sh
sudo chmod +x /usr/local/bin/helm
Set up the service account:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/service-account.yaml
Install the topology scheduler scripts in a configmap:
curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.py
curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.py
kubectl -n kube-system create configmap topology-scheduler-scripts \
    --from-file=schedule-daemon.py=schedule-daemon.py \
    --from-file=label-nodes-daemon.py=label-nodes-daemon.py
Install the topology label DaemonSet and the topology scheduler Pod:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.yaml
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.yaml
Observe the actions of the topology scheduler:
kubectl -n kube-system logs topology-scheduler-pod
Create a Cloud Storage bucket and a Docker repository. In the scripts/setup-and-configure-resources.sh script, replace the bucket and repository names with the ones you created, and then run the script:
bash scripts/setup-and-configure-resources.sh
Build the pytorch-megatron:23.11-py3 image and push it to your repository. Make sure that the Docker repository name in the scripts/build-and-push-docker-image.sh file matches the repository name that you used in the scripts/setup-and-configure-resources.sh script. You can also edit the Docker image tag name before pushing.
bash scripts/build-and-push-docker-image.sh
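The repository's actual Dockerfile isn't reproduced here. As a rough sketch only, the 23.11-py3 tag suggests an image derived from the correspondingly versioned NGC PyTorch container; the base image and clone step below are assumptions, not the repository's real contents:

```dockerfile
# Assumption: base image inferred from the 23.11-py3 tag; the real Dockerfile may differ.
FROM nvcr.io/nvidia/pytorch:23.11-py3

# Hypothetical step: add Megatron-LM sources to the image.
RUN git clone https://github.com/NVIDIA/Megatron-LM.git /workspace/megatron-lm
WORKDIR /workspace/megatron-lm
```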
Edit the helm/values.yaml file to specify the Cloud Storage bucket and Docker image that you created in the previous sections. For some sample configurations, see sample-configurations.
Optional: You can also edit the selected-configuration.sh file to specify any changes that you made to the default Helm configuration.
helm install HELM_EXPERIMENT_NAME helm/ --values helm/values.yaml
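As an illustration only, the values you set in helm/values.yaml might look roughly like the following; the field names here are hypothetical, so consult the repository's sample-configurations for the real schema:

```yaml
# Hypothetical field names -- see the repository's sample-configurations for the real schema.
workload:
  image: us-docker.pkg.dev/PROJECT_ID/REPOSITORY/pytorch-megatron:23.11-py3
  gcsBucket: gs://BUCKET_NAME
```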
Replace HELM_EXPERIMENT_NAME with a name of your choice for the experiment.
- Select the checkbox for CLUSTER_NAME.
- Click Delete.
- To confirm the deletion, type CLUSTER_NAME, and then click Delete.
Select the checkbox for the Cloud Storage bucket that you created for this quickstart.
Click Delete.
To confirm the deletion, type DELETE, and then click Delete.
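If you prefer the command line to the console, the same cleanup can be sketched with standard gcloud commands. This sketch assumes the environment variables defined earlier are still set, and BUCKET_NAME is a placeholder for your bucket:

```shell
# Delete the GKE cluster.
gcloud container clusters delete ${CLUSTER_NAME} \
    --location=${CONTROL_PLANE_LOCATION} \
    --project=${PROJECT_ID} \
    --quiet

# Delete the Cloud Storage bucket and everything in it.
gcloud storage rm --recursive gs://BUCKET_NAME
```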
Create an A3 Mega cluster
Create an A3 Mega GKE cluster with GPUDirect-TCPXO and multi-networking. For more information, see Maximize GPU network bandwidth with GPUDirect and multi-networking.
Set up your environment
Deploy Pods by using the topology-aware scheduler
You can use the topology-aware scheduler to deploy GKE Pods to nodes that have a specified GPU topology.
In the following kubectl commands, you use the files directly from a repository. Alternatively, you can clone the repository locally, and the kubectl commands can reference the local files instead. For more information, see Topology scheduler.
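For example, cloning the repository locally and applying the same manifests from the checkout might look like the following sketch; kubectl must already be configured for your cluster:

```shell
# Clone the repository used throughout this section.
git clone https://github.com/GoogleCloudPlatform/container-engine-accelerators.git
cd container-engine-accelerators/gpudirect-tcpxo/topology-scheduler

# Apply the same manifests from local files instead of raw URLs.
kubectl apply -f service-account.yaml
kubectl apply -f label-nodes-daemon.yaml
kubectl apply -f schedule-daemon.yaml
```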
Run workloads
Build the Dockerfile and push it to Google Cloud Artifact Registry
Launch the Megatron-LM Llama2 benchmark
The experiment writes metrics from the Nsight Systems profiling tool to the specified Cloud Storage bucket, under the megatron-experiments directory.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
Delete the GKE cluster:
Go to the Clusters page:
Delete the Cloud Storage bucket
Go to the Buckets page:
What's next