Serve open source models using TPUs on GKE with Optimum TPU

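This tutorial shows you how to serve open source large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Optimum TPU serving framework from Hugging Face. You download an open source model from Hugging Face and deploy it on a GKE Standard cluster using a container that runs Optimum TPU.

This guide is a good starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when you deploy and serve AI/ML workloads.

This tutorial is intended for generative AI customers in the Hugging Face ecosystem, new or existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration to serve LLMs.

Google Cloud products such as GKE, Vertex AI, and Compute Engine support a range of serving libraries, including JetStream, vLLM, and other partner offerings. For example, you can use JetStream to get the latest optimizations from the project. If you prefer Hugging Face options, you can use Optimum TPU.

Optimum TPU supports the following features:

- Continuous batching
- Token streaming
- Greedy search and multinomial sampling using transformers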

Objectives

- Prepare a GKE Standard cluster with the recommended TPU topology based on the model characteristics.
- Deploy Optimum TPU on GKE.
- Use Optimum TPU to serve the supported models through curl.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how Google products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

If you're using an existing project for this guide, verify that you have the permissions required to complete this guide. If you created a new project, then you already have the required permissions.

Verify that billing is enabled for your Google Cloud project.

Enable the required API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

- Create a Hugging Face account, if you don't already have one.
- Ensure that your project has sufficient quota for Cloud TPU in GKE.

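Required roles

To get the permissions that you need to configure the cluster and workloads, ask your administrator to grant you the following IAM roles on the project:

- Service Account Admin (roles/iam.serviceAccountAdmin)
- Manage GKE clusters: Kubernetes Engine Admin (roles/container.admin)
- Build and push images to Artifact Registry: Artifact Registry Administrator (roles/artifactregistry.admin)

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.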

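Prepare the environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Set the default environment variables: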
```
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export REGION=REGION_NAME
export ZONE=ZONE
export HF_TOKEN=HF_TOKEN
```
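Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- REGION_NAME: the region where your GKE cluster, Cloud Storage buckets, and TPU nodes are located. The region must contain zones where TPU v5e machine types are available (for example, us-west1, us-west4, us-central1, us-east1, us-east5, or europe-west4).
- (Standard clusters only) ZONE: the zone where the TPU resources are available (for example, us-west4-a). For Autopilot clusters, you don't need to specify a zone, only the region.
- HF_TOKEN: your Hugging Face token.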
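
Before continuing, it can help to confirm that none of these variables is empty or still holds its placeholder text. The following is a minimal sketch; `check_tutorial_env` is a hypothetical helper written for illustration and is not part of the gcloud CLI.

```shell
# Hypothetical helper: report any tutorial variable that is unset, empty,
# or still equal to one of the placeholder strings from the export block.
check_tutorial_env() {
  missing=0
  for name in PROJECT_ID CLUSTER_NAME REGION ZONE HF_TOKEN; do
    # Read the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    case "$value" in
      ""|PROJECT_ID|CLUSTER_NAME|REGION_NAME|ZONE|HF_TOKEN)
        echo "not set: $name" >&2
        missing=1
        ;;
    esac
  done
  return "$missing"
}

if check_tutorial_env; then
  echo "All tutorial variables are set."
fi
```

The function returns a non-zero status when anything is missing, so it can also guard later commands, for example `check_tutorial_env && git clone ...`.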

Clone the Optimum TPU repository:

git clone https://github.com/huggingface/optimum-tpu.git

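Get access to the model

You can use the Gemma 2B or Llama3 8B models. This tutorial focuses on these two models, but Optimum TPU supports more models.

Gemma 2B

To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement and then generate a Hugging Face access token.

Sign the license consent agreement

You must sign the consent agreement to use Gemma. Follow these instructions:

- Access the model consent page.
- Verify consent using your Hugging Face account.
- Accept the model terms.

Generate an access token

Generate a new Hugging Face token if you don't already have one:

- Click Your Profile > Settings > Access Tokens.
- Click New Token.
- Specify a name of your choice and a role of at least Read.
- Click Generate a token.
- Copy the generated token to your clipboard.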

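Llama3 8B

You must sign the consent agreement to use Llama3 8B in the Hugging Face repository.

Generate an access token

Generate a new Hugging Face token if you don't already have one:

- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a name of your choice and a role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard.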

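Create a GKE cluster

Create a GKE Standard cluster with one CPU node per zone (because the location is a region, `--num-nodes` applies to each zone of that region):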

gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --num-nodes=1 \
    --location=REGION_NAME

Create a TPU node pool

Create a v5e TPU node pool with 1 node and 8 chips:

gcloud container node-pools create tpunodepool \
    --location=REGION_NAME \
    --num-nodes=1 \
    --machine-type=ct5lp-hightpu-8t \
    --node-locations=ZONE \
    --cluster=CLUSTER_NAME

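If TPU resources are available, GKE provisions the node pool. If TPU resources are temporarily unavailable, the output shows a GCE_STOCKOUT error message. To troubleshoot resource availability errors, see Insufficient TPU resources to satisfy the TPU request.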
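
The node pool provides a single host with 8 TPU chips, which is what the manifests later in this tutorial select with the `2x4` topology and request with `google.com/tpu: 8`. As a rough illustration of how those numbers relate, the sketch below multiplies the dimensions of a topology string. `topology_chips` is a hypothetical helper, not a gcloud or kubectl feature.

```shell
# Hypothetical helper: multiply the dimensions of a TPU topology string
# such as "2x4" to get the total chip count. The function body runs in a
# subshell, so changing IFS does not leak into the calling shell.
topology_chips() (
  chips=1
  IFS=x
  for dim in $1; do
    chips=$((chips * dim))
  done
  echo "$chips"
)

topology_chips 2x4   # prints 8
```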

Build the container

Run the make command to build the image:

cd optimum-tpu && make tpu-tgi

Push the image to Artifact Registry

gcloud artifacts repositories create optimum-tpu --repository-format=docker --location=REGION_NAME && \
gcloud auth configure-docker REGION_NAME-docker.pkg.dev && \
docker image tag huggingface/optimum-tpu REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest && \
docker push REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
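
The same image path appears in the tag and push commands and again in the Deployment manifests. A small sketch, assuming the REGION and PROJECT_ID variables exported earlier, builds it once. `image_uri` is a hypothetical helper.

```shell
# Hypothetical helper: assemble an Artifact Registry Docker image path.
image_uri() {
  # $1 = region, $2 = project ID, $3 = repository, $4 = image[:tag]
  echo "$1-docker.pkg.dev/$2/$3/$4"
}

# Fall back to the placeholder text when the variables are not exported.
IMAGE="$(image_uri "${REGION:-REGION_NAME}" "${PROJECT_ID:-PROJECT_ID}" optimum-tpu tgi-tpu:latest)"
echo "$IMAGE"
```

With `IMAGE` set, the tag and push steps become `docker image tag huggingface/optimum-tpu "$IMAGE"` and `docker push "$IMAGE"`.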

Create a Kubernetes Secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token:

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
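
The Deployment manifests read the token from the `hf_api_token` key of `hf-secret`. One way to confirm that the key exists and is non-empty, without printing the token itself, is sketched below. `secret_token_length` is a hypothetical helper; it assumes kubectl points at the tutorial cluster and that `base64 -d` is available, as it is on most Linux systems.

```shell
# Hypothetical helper: print the length in bytes of the stored token.
# Secret data is base64-encoded, so it is decoded before counting.
secret_token_length() {
  kubectl get secret hf-secret -o jsonpath='{.data.hf_api_token}' \
    | base64 -d \
    | wc -c \
    | tr -d ' '
}
```

Running `secret_token_length` after the secret is created prints a byte count. A result of 0 suggests that HF_TOKEN was empty when the secret was created.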

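Deploy Optimum TPU

In this tutorial, you use a Kubernetes Deployment to deploy Optimum TPU. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Gemma 2B

Save the following Deployment manifest as optimum-tpu-gemma-2b-2x4.yaml: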

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=google/gemma-2b
        - --max-concurrent-requests=4
        - --max-input-length=8191
        - --max-total-tokens=8192
        - --max-batch-prefill-tokens=32768
        - --max-batch-size=16
        securityContext:
            privileged: true
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120

---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80

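This manifest describes an Optimum TPU Deployment and a Service that exposes it on TCP port 8080. Because the Service sets no type, Kubernetes uses the default type, ClusterIP, so the endpoint is reachable only from inside the cluster.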
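
The `image` field in the manifest still contains the REGION_NAME and PROJECT_ID placeholders, which must be replaced before the manifest is applied. The following is a minimal sketch using sed; `render_manifest` is a hypothetical helper, and the region and project values shown are examples only.

```shell
# Hypothetical helper: substitute the two placeholders in a manifest
# read from stdin and write the result to stdout.
render_manifest() {
  # $1 = region, $2 = project ID
  sed -e "s|REGION_NAME|$1|g" -e "s|PROJECT_ID|$2|g"
}

# Example on a single line of the manifest:
echo 'image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest' \
  | render_manifest us-west4 my-project
# prints: image: us-west4-docker.pkg.dev/my-project/optimum-tpu/tgi-tpu:latest
```

With the variables exported earlier, the whole file can be rendered and applied in one step: `render_manifest "$REGION" "$PROJECT_ID" < optimum-tpu-gemma-2b-2x4.yaml | kubectl apply -f -`.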

Apply the manifest:

kubectl apply -f optimum-tpu-gemma-2b-2x4.yaml

Llama3 8B

Save the following manifest as optimum-tpu-llama3-8b-2x4.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=meta-llama/Meta-Llama-3-8B
        - --max-concurrent-requests=4
        - --max-input-length=8191
        - --max-total-tokens=8192
        - --max-batch-prefill-tokens=32768
        - --max-batch-size=16
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80

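This manifest describes an Optimum TPU Deployment and a Service that exposes it on TCP port 8080. Because the Service sets no type, Kubernetes uses the default type, ClusterIP, so the endpoint is reachable only from inside the cluster.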
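
The manifest caps prompts at 8191 tokens (`--max-input-length`) and caps prompt plus generated tokens at 8192 (`--max-total-tokens`), so the `max_new_tokens` value in a request limits how long the prompt can be. The sketch below checks a request against those two limits, assuming the flags carry the meaning they have in Text Generation Inference. `fits_token_budget` is a hypothetical helper.

```shell
# Limits copied from the manifest arguments.
MAX_INPUT_LENGTH=8191
MAX_TOTAL_TOKENS=8192

# Hypothetical helper: succeed when a request fits both limits.
fits_token_budget() {
  # $1 = prompt length in tokens, $2 = requested max_new_tokens
  [ "$1" -le "$MAX_INPUT_LENGTH" ] && [ $(( $1 + $2 )) -le "$MAX_TOTAL_TOKENS" ]
}

if fits_token_budget 8152 40; then
  echo "fits"
else
  echo "too long"
fi
# prints: fits
```

With `max_new_tokens` set to 40, as in the curl request later in this tutorial, the longest prompt that fits is 8192 - 40 = 8152 tokens.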

Apply the manifest:

kubectl apply -f optimum-tpu-llama3-8b-2x4.yaml

View the logs from the running Deployment:

kubectl logs -f -l app=tgi-tpu

The output is similar to the following:

2024-07-09T22:39:34.365472Z  WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-07-09T22:40:47.851405Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-07-09T22:40:54.559269Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-07-09T22:40:54.559291Z  INFO text_generation_router: router/src/main.rs:352: Connected
2024-07-09T22:40:54.559295Z  WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0

Make sure that the model is fully downloaded before proceeding to the next section.
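
Rather than watching the log by eye, the check can be scripted. The sketch below assumes that the router's `Connected` line, as in the sample output, indicates that the model has loaded and the server is ready. `router_connected` is a hypothetical helper.

```shell
# Hypothetical helper: succeed when the log text on stdin contains the
# router's "Connected" line.
router_connected() {
  grep -q 'text_generation_router.*Connected'
}

# Example with one line from the sample output:
if printf '%s\n' '2024-07-09T22:40:54.559291Z  INFO text_generation_router: router/src/main.rs:352: Connected' \
    | router_connected; then
  echo "server is ready"
fi
```

Against the cluster, a polling loop might look like `until kubectl logs -l app=tgi-tpu --tail=-1 | router_connected; do sleep 30; done`. The `--tail=-1` flag is needed because kubectl returns only the last 10 lines by default when a label selector is used.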

Serve the model

Set up port forwarding to the model:

kubectl port-forward svc/service 8080:8080

Interact with the model server using curl

Verify your deployed model:

In a new terminal session, use curl to chat with the model:

curl 127.0.0.1:8080/generate     -X POST     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}'     -H 'Content-Type: application/json'

The output is similar to the following:

{"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to learn from data.\n\nArtificial neural networks are inspired by the way the human brain works. They are made up of multiple layers"}
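
To try other prompts or output lengths, the request body can be built from parameters. The following is a minimal sketch; `generate_payload` is a hypothetical helper and assumes that the prompt contains no double quotes or backslashes, because it does no JSON escaping.

```shell
# Hypothetical helper: build the JSON body for the /generate endpoint.
generate_payload() {
  # $1 = prompt (no double quotes or backslashes), $2 = max_new_tokens
  printf '{"inputs":"%s","parameters":{"max_new_tokens":%d}}' "$1" "$2"
}

generate_payload "What is Deep Learning?" 40; echo
# prints: {"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}
```

The result can be passed to the same curl command: `curl 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' -d "$(generate_payload 'What is Deep Learning?' 40)"`.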

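Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command: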

gcloud container clusters delete CLUSTER_NAME \
  --location=REGION_NAME

What's next

- Explore the Optimum TPU documentation.
- Learn how to run Gemma models on GKE and how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
- Learn more about TPUs in GKE.
