Serve open source models using TPUs on GKE with Optimum TPU

Standard

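This tutorial shows you how to serve open source large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Optimum TPU serving framework from Hugging Face. In this tutorial, you download open source models from Hugging Face and deploy them on a GKE Standard cluster using a container that runs Optimum TPU.

This guide is a good starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when you deploy and serve AI/ML workloads.

This tutorial is intended for generative AI customers in the Hugging Face ecosystem, new or existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve LLMs.

Google Cloud products (such as GKE, Vertex AI, and Compute Engine) support a variety of serving libraries, such as JetStream and vLLM, as well as other partner offerings. For example, you can use JetStream to get the latest optimizations from that project. If you prefer Hugging Face options, you can use Optimum TPU.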

Optimum TPU supports the following features:

Continuous batching
Token streaming
Greedy search and multinomial sampling using transformers

Objectives

Prepare a GKE Standard cluster with the recommended TPU topology based on the model characteristics.
Deploy Optimum TPU on GKE.
Use Optimum TPU to serve the supported models through curl.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how Google products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

If you're using an existing project for this guide, verify that you have the permissions required to complete this guide. If you created a new project, then you already have the required permissions.

Verify that billing is enabled for your Google Cloud project.

Enable the required API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

Create a Hugging Face account, if you don't already have one.
Ensure that your project has sufficient quota for Cloud TPU in GKE.

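Required roles

To get the permissions that you need to configure clusters and workloads, ask your administrator to grant you the following IAM roles on your project:

Service Account Admin (roles/iam.serviceAccountAdmin)
Manage GKE clusters: Kubernetes Engine Admin (roles/container.admin)
Build images and push them to Artifact Registry: Artifact Registry Administrator (roles/artifactregistry.admin)

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.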

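Prepare the environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

A Cloud Shell session starts at the bottom of the Google Cloud console and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Set the default environment variables: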
```
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export REGION=REGION_NAME
export ZONE=ZONE
export HF_TOKEN=HF_TOKEN
```
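Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- REGION_NAME: the region where your GKE cluster, Cloud Storage buckets, and TPU nodes are located. Choose a region that contains zones where TPU v5e machine types are available, such as us-west1, us-west4, us-central1, us-east1, us-east5, or europe-west4.
- (Standard clusters only) ZONE: the zone where the TPU resources are available, such as us-west4-a. For Autopilot clusters, you specify only the region and don't need to specify a zone.
- HF_TOKEN: your Hugging Face token.
Clone the Optimum TPU repository: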
```
git clone https://github.com/huggingface/optimum-tpu.git
```
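Later commands depend on the variables exported above, so it can help to confirm that none of them is empty before you continue. The following is a minimal sketch; `require_vars` is a hypothetical helper introduced here for illustration and is not part of the tutorial or of any Google Cloud tool.

```shell
# Hypothetical helper (not part of the tutorial): fail if any named variable
# is unset or empty, and print the missing names to stderr.
require_vars() {
  _missing=0
  for _name in "$@"; do
    # Look up the value of the variable whose name is stored in _name.
    eval "_value=\${${_name}:-}"
    if [ -z "${_value}" ]; then
      echo "missing: ${_name}" >&2
      _missing=1
    fi
  done
  return "${_missing}"
}

# The variable names below are the ones exported in the previous step.
if require_vars PROJECT_ID CLUSTER_NAME REGION ZONE HF_TOKEN; then
  echo "environment looks complete"
fi
```

The helper only checks that each variable is non-empty. It does not check that the project, region, or token is valid.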

Get access to the model

You can use the Gemma 2B or Llama3 8B model. This tutorial focuses on these two models, but Optimum TPU supports other models as well.

Gemma 2B

To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement and then generate a Hugging Face access token.

Sign the license consent agreement

You must sign the consent agreement to use Gemma. Follow these steps:

Go to the model consent page.
Verify consent using your Hugging Face account.
Accept the model terms.

Generate an access token

If you don't already have a Hugging Face token, generate a new one:

Click Your Profile > Settings > Access Tokens.
Click New Token.
Specify a name of your choice and a role of at least Read.
Click Generate a token.
Copy the token to your clipboard.

Llama3 8B

You must sign the consent agreement to use Llama3 8B from the Hugging Face repository.

Generate an access token

If you don't already have a Hugging Face token, generate a new one:

Click Your Profile > Settings > Access Tokens.
Select New Token.
Specify a name of your choice and a role of at least Read.
Select Generate a token.
Copy the token to your clipboard.

Create a GKE cluster

Create a GKE Standard cluster with 1 CPU node:

gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --num-nodes=1 \
    --location=REGION_NAME

Create a TPU node pool

Create a v5e TPU node pool with 1 node and 8 chips:

gcloud container node-pools create tpunodepool \
    --location=REGION_NAME \
    --num-nodes=1 \
    --machine-type=ct5lp-hightpu-8t \
    --node-locations=ZONE \
    --cluster=CLUSTER_NAME

If TPU resources are available, GKE provisions the node pool. If TPU resources are temporarily unavailable, the output shows a GCE_STOCKOUT error message. To troubleshoot resource availability errors, see Insufficient TPU resources to satisfy the TPU request.
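The node pool above has one node with 8 chips, and the Deployment manifests later in this tutorial select the 2x4 topology and request `google.com/tpu: 8`. These numbers agree because the chip count of a topology is the product of its dimensions. The following sketch shows that arithmetic; `topology_chips` is a hypothetical helper name, and it assumes the input is a well-formed topology string such as 2x4.

```shell
# Hypothetical helper: multiply the dimensions of a TPU topology string.
topology_chips() {
  # Replace each "x" with "*" and let shell arithmetic evaluate the product.
  echo $(( $(printf '%s' "$1" | tr 'x' '*') ))
}

topology_chips 2x4   # prints 8
```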

Build the container

Run the make command to build the image:

cd optimum-tpu && make tpu-tgi

Push the image to Artifact Registry

gcloud artifacts repositories create optimum-tpu --repository-format=docker --location=REGION_NAME && \
gcloud auth configure-docker REGION_NAME-docker.pkg.dev && \
docker image tag huggingface/optimum-tpu REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest && \
docker push REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
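The same image path appears in the tag command, the push command, and both Deployment manifests, so a typo in one place can cause an image pull failure later. As a minimal sketch, you could build the path once from your region and project ID. `image_uri` is a hypothetical helper, and the repository name `optimum-tpu` and image name `tgi-tpu:latest` are taken from the commands above.

```shell
# Hypothetical helper: build the Artifact Registry image path used in this tutorial.
image_uri() {
  # $1: Artifact Registry location (region), $2: project ID.
  printf '%s-docker.pkg.dev/%s/optimum-tpu/tgi-tpu:latest\n' "$1" "$2"
}

# Example values only; substitute your own region and project ID.
image_uri us-west4 my-project
```

For example, you might store the result with `IMAGE="$(image_uri "${REGION}" "${PROJECT_ID}")"` and pass `"${IMAGE}"` to `docker image tag` and `docker push`.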

Create a Kubernetes Secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token:

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -

Deploy Optimum TPU

In this tutorial, you use a Kubernetes Deployment to deploy Optimum TPU. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods distributed among the nodes in a cluster.

Gemma 2B

Save the following Deployment manifest as optimum-tpu-gemma-2b-2x4.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=google/gemma-2b
        - --max-concurrent-requests=4
        - --max-input-length=8191
        - --max-total-tokens=8192
        - --max-batch-prefill-tokens=32768
        - --max-batch-size=16
        securityContext:
          privileged: true
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120

---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80

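This manifest describes an Optimum TPU Deployment and a Service that load-balances traffic inside the cluster on TCP port 8080 and forwards it to container port 80.

Apply the manifest: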

kubectl apply -f optimum-tpu-gemma-2b-2x4.yaml

Llama3 8B

Save the following manifest as optimum-tpu-llama3-8b-2x4.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=meta-llama/Meta-Llama-3-8B
        - --max-concurrent-requests=4
        - --max-input-length=8191
        - --max-total-tokens=8192
        - --max-batch-prefill-tokens=32768
        - --max-batch-size=16
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80

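This manifest describes an Optimum TPU Deployment and a Service that load-balances traffic inside the cluster on TCP port 8080 and forwards it to container port 80.

Apply the manifest: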

kubectl apply -f optimum-tpu-llama3-8b-2x4.yaml
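Both manifests reference the image as `REGION_NAME-docker.pkg.dev/PROJECT_ID/...`, and those placeholders must be replaced with real values for the Pod to pull the image. One possible approach, sketched below, is to substitute them with sed when you apply the manifest. `render_manifest` is a hypothetical helper, and it assumes that neither value contains the `|` character.

```shell
# Hypothetical helper: replace the REGION_NAME and PROJECT_ID placeholders.
# Reads a manifest on stdin and writes the rendered manifest to stdout.
render_manifest() {
  # $1: region, $2: project ID.
  sed -e "s|REGION_NAME|$1|g" -e "s|PROJECT_ID|$2|g"
}

# Demo on the image line from the manifests; the values are examples only.
printf 'image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest\n' \
  | render_manifest us-west4 my-project
```

With the variables from the environment setup, the rendered output could be piped straight to kubectl, for example `render_manifest "${REGION}" "${PROJECT_ID}" < optimum-tpu-gemma-2b-2x4.yaml | kubectl apply -f -`.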

View the logs from the running Deployment:

kubectl logs -f -l app=tgi-tpu

The output should be similar to the following:

2024-07-09T22:39:34.365472Z  WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-07-09T22:40:47.851405Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-07-09T22:40:54.559269Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-07-09T22:40:54.559291Z  INFO text_generation_router: router/src/main.rs:352: Connected
2024-07-09T22:40:54.559295Z  WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0
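In the sample output, the router logs `Connected` after it warms up the model. As a rough sketch, you could scan the logs for that line instead of reading them by eye. `router_ready` is a hypothetical helper, and it assumes your version of the server prints the same `Connected` message as the sample above.

```shell
# Hypothetical helper: succeed if the log text on stdin contains the router's
# "Connected" line shown in the sample output.
router_ready() {
  grep -q 'text_generation_router: .*Connected'
}

# Demo with the message portion of one line from the sample output.
if printf '%s\n' 'INFO text_generation_router: router/src/main.rs:352: Connected' | router_ready; then
  echo "router is connected"
fi
```

Against a live cluster, the input could come from `kubectl logs -l app=tgi-tpu --tail=-1 | router_ready`. The `--tail=-1` flag matters here because kubectl returns only the most recent lines by default when you use a label selector.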

Make sure that the model is fully downloaded before you proceed to the next section.

Serve the model

Set up port forwarding to the model:

kubectl port-forward svc/service 8080:8080

Interact with the model server using curl

Verify your deployed model.

In a new terminal session, use curl to chat with your model:

curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}' -H 'Content-Type: application/json'

The output should be similar to the following:

{"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to learn from data.\n\nArtificial neural networks are inspired by the way the human brain works. They are made up of multiple layers"}
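The request body above is a JSON object with an `inputs` string and a `parameters.max_new_tokens` integer. If you want to try other prompts, a small function can build that body for you. The sketch below uses a hypothetical helper, `generate_payload`, which escapes only backslashes and double quotes. It assumes the prompt contains no newlines or other control characters.

```shell
# Hypothetical helper: build the JSON body for the /generate endpoint.
generate_payload() {
  # $1: prompt text, $2: max_new_tokens (integer).
  # Escape backslashes first, then double quotes, so the prompt is a valid JSON string.
  _escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"inputs":"%s","parameters":{"max_new_tokens":%d}}\n' "${_escaped}" "$2"
}

# Reproduces the request body used in the curl command above.
generate_payload 'What is Deep Learning?' 40
```

You could then pass the result to curl with `-d "$(generate_payload 'What is Deep Learning?' 40)"` in place of the literal JSON string.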

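Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command: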

gcloud container clusters delete CLUSTER_NAME \
  --location=REGION_NAME

What's next

Explore the Optimum TPU documentation.
Learn how to run Gemma models on GKE and how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
Learn more about TPUs in GKE.
