
Train Llama2 with Megatron-LM on A3 Mega virtual machines

Standard

Overview

This quickstart shows you how to run a container-based Megatron-LM PyTorch workload on A3 Mega. The code is available in the megatron-gke GitHub repository.

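Before you begin

Follow these steps to enable the Google Kubernetes Engine (GKE) API.

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how Google products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.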

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the GKE API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

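Make sure that you have the following roles on the project: roles/container.admin, roles/compute.networkAdmin, roles/iam.serviceAccountUser
Check for the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. Click Select a role, then search for the role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.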
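
The console steps above also have a command-line equivalent. The following is a minimal sketch, assuming that the gcloud CLI is installed and that the roles go to a user account. The helper name `print_role_bindings` and the `user:` member prefix are illustrative assumptions, not part of this quickstart. The helper only prints the `gcloud projects add-iam-policy-binding` commands so that you can review them before running anything.

```shell
# Hypothetical helper: prints one gcloud command per role that this
# quickstart requires. It does not change any IAM policy by itself.
# $1 = project ID, $2 = email address of the user who needs the roles.
print_role_bindings() {
  for role in roles/container.admin roles/compute.networkAdmin roles/iam.serviceAccountUser; do
    echo "gcloud projects add-iam-policy-binding $1 --member=user:$2 --role=${role}"
  done
}

# Example (USER_EMAIL is a placeholder): review the printed commands first.
# print_role_bindings "${PROJECT_ID}" USER_EMAIL
```

Printing the commands instead of running them keeps the sketch safe to try; granting roles still requires permission to change the project's IAM policy.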

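Create an A3 Mega cluster

Create an A3 Mega GKE cluster with GPUDirect-TCPXO and multi-networking. For more information, see Maximize GPU network bandwidth with GPUDirect and multi-networking.

Set up your environment

Create environment variables for some common parameters: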
```
export CLUSTER_NAME=CLUSTER_NAME
export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION
export PROJECT_ID=PROJECT_ID
```
Replace the following:
- CLUSTER_NAME: the name of your A3 Mega GKE cluster that has GPUDirect-TCPXO and multi-networking enabled.
- CONTROL_PLANE_LOCATION: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.
- PROJECT_ID: your Google Cloud project ID.
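
Because later commands depend on these variables, it can help to confirm that they are set before continuing. The following is a minimal sketch; `check_env` is a hypothetical helper, and treating a value that equals its own variable name as an unreplaced placeholder is an assumption based on the export template above.

```shell
# Hypothetical helper: succeeds only if every named variable is set,
# non-empty, and no longer equal to its own name (the placeholder value).
check_env() {
  _missing=0
  for _name in "$@"; do
    # Read the variable whose name is stored in $_name.
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ] || [ "$_value" = "$_name" ]; then
      echo "Set $_name to a real value before continuing" >&2
      _missing=1
    fi
  done
  return "$_missing"
}

# Example: report any variable that still needs a value.
# check_env CLUSTER_NAME CONTROL_PLANE_LOCATION PROJECT_ID
```
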
Configure the Google Cloud CLI to use your Google Cloud credentials for authentication:
```
gcloud auth login
```
For more information, see Authenticate for using the Google Cloud CLI.

Install kubectl and the GKE gcloud CLI plugin:
```
sudo apt-get install kubectl
sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
```
Fetch credentials for your GKE cluster:
```
gcloud container clusters get-credentials ${CLUSTER_NAME} \
  --location=${CONTROL_PLANE_LOCATION} \
  --project=${PROJECT_ID}
```
If Helm isn't already installed, install it:
```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh && rm get_helm.sh
sudo chmod +x /usr/local/bin/helm
```
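
The Helm step applies only when Helm is missing, so a quick check of which tools are already on the PATH can save a reinstall. The following is a minimal sketch; `is_installed` is a hypothetical helper, and the loop assumes the binary names `kubectl`, `gke-gcloud-auth-plugin`, and `helm`.

```shell
# Hypothetical helper: succeeds if the named command is on the PATH.
is_installed() {
  command -v "$1" >/dev/null 2>&1
}

# Report the status of each tool that this section installs.
for tool in kubectl gke-gcloud-auth-plugin helm; do
  if is_installed "$tool"; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```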

Deploy Pods with the topology-aware scheduler

You can use the topology-aware scheduler to deploy your GKE Pods to nodes that have a specified GPU topology.

The following kubectl commands use files directly from a repository. Alternatively, you can clone the repository locally and have the kubectl commands reference the local files instead.

For more information, see Topology scheduler.

Set up the service account:
```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/service-account.yaml
```
Install the topology scheduler scripts in a configmap:
```
curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.py
curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.py

kubectl -n kube-system create configmap topology-scheduler-scripts \
    --from-file=schedule-daemon.py=schedule-daemon.py \
    --from-file=label-nodes-daemon.py=label-nodes-daemon.py
```
Install the topology label daemonset and the topology scheduler Pod:
```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.yaml
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.yaml
```
Observe the actions of the topology scheduler:
```
kubectl -n kube-system logs topology-scheduler-pod
```

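Run the workload

Build the Dockerfile and push it to Google Cloud Artifact Registry

Create a Cloud Storage bucket and a Docker repository. In the scripts/setup-and-configure-resources.sh script, replace the bucket and repository names with the names that you want to use, and then run the script: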
```
bash scripts/setup-and-configure-resources.sh
```
Build the pytorch-megatron:23.11-py3 image and push it to your repository. Make sure that the Docker repository name in the scripts/build-and-push-docker-image.sh file matches the repository name that you used in the scripts/setup-and-configure-resources.sh script. You can also edit the Docker image tag name before pushing:
```
bash scripts/build-and-push-docker-image.sh
```
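Note: This image is based on nvcr.io/nvidia/pytorch:23.11-py3 and contains minimal changes.

Run the Megatron-LM Llama2 benchmark

Edit the helm/values.yaml file to specify the Cloud Storage bucket and Docker image that you created in the previous sections. For example configurations, see sample-configurations.
Optional: You can also edit the selected-configuration.sh file to specify any changes that you made to the default Helm configuration.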
```
helm install HELM_EXPERIMENT_NAME helm/ --values helm/values.yaml
```
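Replace HELM_EXPERIMENT_NAME with an arbitrary name for your experiment.

Note: To run the Helm experiment multiple times, either delete the existing experiment with the helm uninstall command or create a new experiment with a different name.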
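
One way to avoid name collisions between runs is to derive the experiment name from a timestamp. The following is a minimal sketch; `experiment_name` and the `llama2-a3mega` prefix are hypothetical, and the 53-character cut reflects Helm's limit on release name length. The script prints the install and uninstall commands instead of running them.

```shell
# Hypothetical helper: builds a release name from a prefix and the current time.
# $1 = prefix made of lowercase letters, digits, and hyphens.
experiment_name() {
  # Helm release names are limited to 53 characters, so trim longer results.
  printf '%s-%s\n' "$1" "$(date +%Y%m%d-%H%M%S)" | cut -c1-53
}

HELM_EXPERIMENT_NAME="$(experiment_name llama2-a3mega)"

# Print the commands for this run so that they can be reviewed first.
echo "helm install ${HELM_EXPERIMENT_NAME} helm/ --values helm/values.yaml"
echo "helm uninstall ${HELM_EXPERIMENT_NAME}"
```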

The experiment writes metrics from the Nsight Systems profiling tool to the specified Cloud Storage bucket, in the megatron-experiments directory.

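Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete the GKE cluster

Go to the Clusters page:

Go to Clusters

Select the checkbox for CLUSTER_NAME.
Click Delete.
To confirm deletion, type CLUSTER_NAME and click Delete.

Delete the Cloud Storage bucket

Go to the Buckets page:

Go to Buckets

Select the checkbox for the Cloud Storage bucket that you created for this quickstart.
Click Delete.
To confirm deletion, type DELETE and click Delete.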
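
The same cleanup can be sketched from the command line. This example assumes the environment variables from the setup section are still exported; `print_cleanup_commands` is a hypothetical helper and BUCKET_NAME is a placeholder for the bucket that you created. The helper only prints the commands, because both deletions are irreversible.

```shell
# Hypothetical helper: prints the cleanup commands without running them.
# $1 = cluster name, $2 = control plane location, $3 = project ID, $4 = bucket name.
print_cleanup_commands() {
  echo "gcloud container clusters delete $1 --location=$2 --project=$3"
  # Removes the bucket together with every object stored in it.
  echo "gcloud storage rm --recursive gs://$4"
}

# Example: review the output, then run each command yourself.
# print_cleanup_commands "${CLUSTER_NAME}" "${CONTROL_PLANE_LOCATION}" "${PROJECT_ID}" BUCKET_NAME
```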

What's next

Learn more about using GPUs in GKE
