Train a model with GPUs on GKE Standard mode

This quickstart tutorial shows you how to deploy a training model with GPUs in Google Kubernetes Engine (GKE) and store the predictions in Cloud Storage. The tutorial uses a TensorFlow model and a GKE Standard cluster. You can also run these workloads on Autopilot clusters with fewer setup steps. For instructions, see Train a model with GPUs on GKE Autopilot mode.

This document is intended for GKE administrators who already have Standard clusters and want to run GPU workloads for the first time.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

If you're using an existing project for this guide, verify that you have the permissions required to complete this guide. If you created a new project, then you already have the required permissions.

Verify that billing is enabled for your Google Cloud project.

Enable the Kubernetes Engine and Cloud Storage APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Required roles

To get the permissions that you need to train a model on GPUs, ask your administrator to grant you the following IAM roles on your project:

Manage GKE clusters: Kubernetes Engine Admin (roles/container.admin)
Manage Cloud Storage buckets: Storage Admin (roles/storage.admin)
Grant IAM roles on the project: Project IAM Admin (roles/resourcemanager.projectIamAdmin)
Create and grant roles on IAM service accounts: Service Account Admin (roles/iam.serviceAccountAdmin)

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Clone the sample repository

Run the following commands in Cloud Shell:

```
git clone https://github.com/GoogleCloudPlatform/ai-on-gke/ ai-on-gke
cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu
```

Create a Standard mode cluster and a GPU node pool

Use Cloud Shell to do the following:

Create a Standard cluster that uses Workload Identity Federation for GKE and installs the Cloud Storage FUSE driver:
```
gcloud container clusters create gke-gpu-cluster \
    --addons GcsFuseCsiDriver \
    --location=us-central1 \
    --num-nodes=1 \
    --workload-pool=PROJECT_ID.svc.id.goog
```
Replace PROJECT_ID with your Google Cloud project ID.

Cluster creation might take several minutes.

Create a GPU node pool:

```
gcloud container node-pools create gke-gpu-pool-1 \
    --accelerator=type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
    --machine-type=n1-standard-16 --num-nodes=1 \
    --location=us-central1 \
    --cluster=gke-gpu-cluster
```
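
The PROJECT_ID placeholder recurs throughout this tutorial: in the workload pool, the bucket name, and the service account email. The following is a minimal sketch, not part of the sample repository, that derives those values once from a project ID. The function name set_tutorial_names and the variables WORKLOAD_POOL and GSA_EMAIL are hypothetical names chosen for illustration; BUCKET_NAME matches the variable that later steps export.

```shell
# Hypothetical helper (not from the sample repository): derive the values
# that this tutorial builds from a project ID.
set_tutorial_names() {
  PROJECT_ID="$1"
  # Workload Identity Federation for GKE pool, as passed to --workload-pool.
  WORKLOAD_POOL="${PROJECT_ID}.svc.id.goog"
  # Bucket name used in the Cloud Storage steps.
  BUCKET_NAME="${PROJECT_ID}-gke-gpu-bucket"
  # Email of the gke-ai-sa Google Cloud service account.
  GSA_EMAIL="gke-ai-sa@${PROJECT_ID}.iam.gserviceaccount.com"
  export PROJECT_ID WORKLOAD_POOL BUCKET_NAME GSA_EMAIL
}
```

In Cloud Shell, where the active project is already set, you could call it as set_tutorial_names "$(gcloud config get-value project)" and then use the variables in place of the typed-out values.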

Create a Cloud Storage bucket

In the Google Cloud console, go to the Create a bucket page:

Go to Create a bucket
In the Name your bucket field, enter the following name:
```
PROJECT_ID-gke-gpu-bucket
```
Click Continue.
For Location type, select Region.
In the Region list, select us-central1 (Iowa), and then click Continue.
In the Choose a storage class for your data section, click Continue.
In the Choose how to control access to objects section, for Access control, select Uniform.
Click Create.
In the Public access will be prevented dialog, make sure that the Enforce public access prevention on this bucket checkbox is selected, and then click Confirm.
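
If you prefer the command line, the console steps above correspond roughly to a single gcloud storage buckets create command. The following sketch is an assumption about an equivalent invocation, not a step from the original tutorial. The run and create_bucket helpers are hypothetical, and DRY_RUN defaults to printing the command so that you can review it before anything is created.

```shell
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumed CLI equivalent of the console steps: a regional bucket in
# us-central1 with uniform access control and public access prevention.
# Argument: the Google Cloud project ID.
create_bucket() {
  run gcloud storage buckets create "gs://$1-gke-gpu-bucket" \
    --location=us-central1 \
    --uniform-bucket-level-access \
    --public-access-prevention
}
```

Calling create_bucket PROJECT_ID prints the command. To run it for real, set DRY_RUN=0 first.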

Configure your cluster to access the bucket using Workload Identity Federation for GKE

To let your cluster access the Cloud Storage bucket, do the following:

Create a Google Cloud service account.
Create a Kubernetes ServiceAccount in your cluster.
Bind the Kubernetes ServiceAccount to the Google Cloud service account.

Create a Google Cloud service account

In the Google Cloud console, go to the Create service account page:

Go to Create service account
In the Service account ID field, enter gke-ai-sa.
Click Create and continue.
In the Role list, select the Cloud Storage > Storage Insights Collector Service role.
Click Add another role.
In the Select a role list, select the Cloud Storage > Storage Object Admin role.
Click Continue, and then click Done.

Create a Kubernetes ServiceAccount in your cluster

In Cloud Shell, do the following:

Create a Kubernetes namespace:

```
kubectl create namespace gke-ai-namespace
```

Create a Kubernetes ServiceAccount in the namespace:

```
kubectl create serviceaccount gpu-k8s-sa --namespace=gke-ai-namespace
```

Bind the Kubernetes ServiceAccount to the Google Cloud service account

In Cloud Shell, run the following commands:

Add an IAM binding to the Google Cloud service account:

```
gcloud iam service-accounts add-iam-policy-binding gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[gke-ai-namespace/gpu-k8s-sa]"
```

The --member flag provides the full identity of the Kubernetes ServiceAccount in Google Cloud.

Annotate the Kubernetes ServiceAccount:

```
kubectl annotate serviceaccount gpu-k8s-sa \
    --namespace gke-ai-namespace \
    iam.gke.io/gcp-service-account=gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com
```
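
The value of the --member flag follows a fixed pattern: the project's workload pool, followed by the namespace and the Kubernetes ServiceAccount name in brackets. As a small illustration, the following sketch assembles that string from its three parts. The function name wi_member is hypothetical.

```shell
# Hypothetical helper: build the IAM member string for a Kubernetes
# ServiceAccount under Workload Identity Federation for GKE.
# Arguments: project ID, Kubernetes namespace, ServiceAccount name.
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}
```

For the names in this tutorial, wi_member PROJECT_ID gke-ai-namespace gpu-k8s-sa prints the same value that the add-iam-policy-binding command passes to --member.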

Verify that Pods can access the Cloud Storage bucket

In Cloud Shell, create the following environment variables:
```
export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket
```
Replace PROJECT_ID with your Google Cloud project ID.

Create a Pod that has a TensorFlow container:
```
envsubst < src/gke-config/standard-tensorflow-bash.yaml | kubectl --namespace=gke-ai-namespace apply -f -
```
This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.
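
As a scripted alternative to editing the manifest by hand, the following sketch replaces only the two placeholders with sed. It is an illustration under assumptions, not part of the sample repository: the function name render_manifest is hypothetical, and it assumes that the values contain no |, &, or backslash characters, which holds for the names in this tutorial.

```shell
# Hypothetical helper: read a manifest on stdin and replace the literal
# $K8S_SA_NAME and $BUCKET_NAME placeholders with the exported values.
# Every other "$" in the manifest is left untouched.
render_manifest() {
  sed -e 's|\$K8S_SA_NAME|'"${K8S_SA_NAME}"'|g' \
      -e 's|\$BUCKET_NAME|'"${BUCKET_NAME}"'|g'
}
```

For example, render_manifest < src/gke-config/standard-tensorflow-bash.yaml prints the substituted manifest, which you can inspect before you pipe it to kubectl apply -f -.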

Create a sample file in the bucket:

```
touch sample-file
gcloud storage cp sample-file gs://PROJECT_ID-gke-gpu-bucket
```

Wait for your Pod to become ready:

```
kubectl wait --for=condition=Ready pod/test-tensorflow-pod -n=gke-ai-namespace --timeout=180s
```

When the Pod is ready, the output is the following:

```
pod/test-tensorflow-pod condition met
```

Open a shell in the TensorFlow container:

```
kubectl -n gke-ai-namespace exec --stdin --tty test-tensorflow-pod --container tensorflow -- /bin/bash
```

Try to read the sample file that you created:
```
ls /data
```
The output shows the sample file.

Identify the GPU that is attached to the Pod:

```
python3 -m pip install 'tensorflow[and-cuda]'
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

The output shows the GPU attached to the Pod, similar to the following:

```
...
PhysicalDevice(name='/physical_device:GPU:0',device_type='GPU')
```

Exit the container:
```
exit
```

Delete the sample Pod:

```
kubectl delete -f src/gke-config/standard-tensorflow-bash.yaml \
    --namespace=gke-ai-namespace
```

Train and predict using the MNIST dataset

In this section, you run a training workload on the MNIST example dataset.

Copy the example data to the Cloud Storage bucket:

```
gcloud storage cp src/tensorflow-mnist-example gs://PROJECT_ID-gke-gpu-bucket/ --recursive
```

Create the following environment variables:

```
export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket
```

Review the training Job:

```
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-training-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_train_distributed.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"
```

Deploy the training Job:
```
envsubst < src/gke-config/standard-tf-mnist-train.yaml | kubectl -n gke-ai-namespace apply -f -
```
This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status:

```
kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-training-job --timeout=180s
```

The output is similar to the following:

```
job.batch/mnist-training-job condition met
```

Check the logs from the TensorFlow container:

```
kubectl logs -f jobs/mnist-training-job -c tensorflow -n gke-ai-namespace
```

The output shows that the following events occur:

Install the required Python packages
Download the MNIST dataset
Train the model using a GPU
Save the model
Evaluate the model

```
...
Epoch 12/12
927/938 [============================>.] - ETA: 0s - loss: 0.0188 - accuracy: 0.9954
Learning rate for epoch 12 is 9.999999747378752e-06
938/938 [==============================] - 5s 6ms/step - loss: 0.0187 - accuracy: 0.9954 - lr: 1.0000e-05
157/157 [==============================] - 1s 4ms/step - loss: 0.0424 - accuracy: 0.9861
Eval loss: 0.04236088693141937, Eval accuracy: 0.9861000180244446
Training finished. Model saved
```
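
If you want to use the result in a script, for example to stop a pipeline when accuracy drops, you can pull the final evaluation accuracy out of the log text. The following sketch assumes the log line format shown in the sample output above; the function name eval_accuracy is hypothetical.

```shell
# Hypothetical helper: read training logs on stdin and print the number
# that follows "Eval accuracy: ". Prints nothing if no such line exists.
eval_accuracy() {
  sed -n 's/.*Eval accuracy: \([0-9.]*\).*/\1/p'
}
```

You could feed it the Job logs, for example kubectl logs jobs/mnist-training-job -c tensorflow -n gke-ai-namespace | eval_accuracy.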

Delete the training workload:

```
kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-train.yaml
```

Deploy an inference workload

In this section, you deploy an inference workload that takes a sample dataset as input and returns predictions.

Copy the images for prediction to the bucket:

```
gcloud storage cp data/mnist_predict gs://PROJECT_ID-gke-gpu-bucket/ --recursive
```

Review the inference workload:

```
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-batch-prediction-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt; python tensorflow_mnist_batch_predict.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"
```

Deploy the inference workload:
```
envsubst < src/gke-config/standard-tf-mnist-batch-predict.yaml | kubectl -n gke-ai-namespace apply -f -
```
This command substitutes the environment variables that you created into the corresponding references in the manifest. You can also open the manifest in a text editor and replace $K8S_SA_NAME and $BUCKET_NAME with the corresponding values.

Wait until the Job has the Completed status:

```
kubectl wait -n gke-ai-namespace --for=condition=Complete job/mnist-batch-prediction-job --timeout=180s
```

The output is similar to the following:

```
job.batch/mnist-batch-prediction-job condition met
```

Check the logs from the TensorFlow container:

```
kubectl logs -f jobs/mnist-batch-prediction-job -c tensorflow -n gke-ai-namespace
```

The output is the prediction for each image and the model's confidence in that prediction, similar to the following:

```
Found 10 files belonging to 1 classes.
1/1 [==============================] - 2s 2s/step
The image /data/mnist_predict/0.png is the number 0 with a 100.00 percent confidence.
The image /data/mnist_predict/1.png is the number 1 with a 99.99 percent confidence.
The image /data/mnist_predict/2.png is the number 2 with a 100.00 percent confidence.
The image /data/mnist_predict/3.png is the number 3 with a 99.95 percent confidence.
The image /data/mnist_predict/4.png is the number 4 with a 100.00 percent confidence.
The image /data/mnist_predict/5.png is the number 5 with a 100.00 percent confidence.
The image /data/mnist_predict/6.png is the number 6 with a 99.97 percent confidence.
The image /data/mnist_predict/7.png is the number 7 with a 100.00 percent confidence.
The image /data/mnist_predict/8.png is the number 8 with a 100.00 percent confidence.
The image /data/mnist_predict/9.png is the number 9 with a 99.65 percent confidence.
```
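
To spot predictions that the model is less sure about, you can filter the log lines by confidence. The following sketch assumes the exact sentence format shown in the sample output above and image paths without spaces; the function name low_confidence is hypothetical.

```shell
# Hypothetical helper: read prediction logs on stdin and print the image
# path and confidence for every prediction below the given threshold.
# Argument: threshold in percent, for example 99.9.
low_confidence() {
  awk -v t="$1" '/percent confidence/ {
    # Field 3 is the image path; the third field from the end is the confidence.
    if ($(NF-2) + 0 < t + 0) print $3, $(NF-2)
  }'
}
```

For example, kubectl logs jobs/mnist-batch-prediction-job -c tensorflow -n gke-ai-namespace | low_confidence 99.9 lists only the images whose confidence is below 99.9 percent.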

Clean up

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, do one of the following:

Keep the GKE cluster: Delete the Kubernetes resources in the cluster and the Google Cloud resources
Keep the Google Cloud project: Delete the GKE cluster and the Google Cloud resources
Delete the project

Delete the Kubernetes resources in the cluster and the Google Cloud resources

Delete the Kubernetes namespace and the workloads that you deployed:

```
kubectl -n gke-ai-namespace delete -f src/gke-config/standard-tf-mnist-batch-predict.yaml
kubectl delete namespace gke-ai-namespace
```

Delete the Cloud Storage bucket:
1. Go to the Buckets page:
  
  Go to Buckets
2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.
3. Click Delete.
4. To confirm deletion, type DELETE, and then click Delete.

Delete the Google Cloud service account:
1. Go to the Service accounts page:
  
  Go to Service accounts
2. Select your project.
3. Select the checkbox for gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.
4. Click Delete.
5. To confirm deletion, click Delete.

Delete the GKE cluster and the Google Cloud resources

Delete the GKE cluster:
1. Go to the Clusters page:
  
  Go to Clusters
2. Select the checkbox for gke-gpu-cluster.
3. Click Delete.
4. To confirm deletion, type gke-gpu-cluster, and then click Delete.

Delete the Cloud Storage bucket:
1. Go to the Buckets page:
  
  Go to Buckets
2. Select the checkbox for PROJECT_ID-gke-gpu-bucket.
3. Click Delete.
4. To confirm deletion, type DELETE, and then click Delete.

Delete the Google Cloud service account:
1. Go to the Service accounts page:
  
  Go to Service accounts
2. Select your project.
3. Select the checkbox for gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.
4. Click Delete.
5. To confirm deletion, click Delete.
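
The same cleanup can likely be done from Cloud Shell instead of the console. The following sketch is an assumption about equivalent gcloud commands, not part of the original tutorial. The run and cleanup_cloud_resources helpers are hypothetical, and DRY_RUN defaults to printing the commands so that nothing is deleted until you have reviewed them.

```shell
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumed CLI equivalents of the console steps above.
# Argument: the Google Cloud project ID.
cleanup_cloud_resources() {
  project_id="$1"
  # Delete the cluster, which also removes its GPU node pool.
  run gcloud container clusters delete gke-gpu-cluster --location=us-central1
  # Delete the bucket together with every object in it.
  run gcloud storage rm --recursive "gs://${project_id}-gke-gpu-bucket"
  # Delete the Google Cloud service account.
  run gcloud iam service-accounts delete "gke-ai-sa@${project_id}.iam.gserviceaccount.com"
}
```

Calling cleanup_cloud_resources PROJECT_ID prints the three commands. With DRY_RUN=0, each gcloud command runs and asks for confirmation before it deletes anything.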

Delete the project

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work that you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Learn more about using GPUs in GKE
