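Deploy models with custom weights

Deploying models with custom weights is a Preview offering. You can fine-tune models based on a predefined set of base models and deploy your customized models on Vertex AI Model Garden. To deploy a custom model, use the custom weights import and upload your model artifacts to a Cloud Storage bucket in your project. In Vertex AI, this is a one-click operation.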

Supported models

The public preview of deploying models with custom weights supports the following base models:

Model name	Versions
Llama	Llama-2: 7B, 13B; Llama-3.1: 8B, 70B; Llama-3.2: 1B, 3B; Llama-4: Scout-17B, Maverick-17B; CodeLlama-13B
Gemma	Gemma-2: 9B, 27B; Gemma-3: 1B, 4B, 12B, 27B; Medgemma: 4B, 27B-text
Qwen	Qwen2: 1.5B; Qwen2.5: 0.5B, 1.5B, 7B, 32B; Qwen3: 0.6B, 1.7B, 8B, 32B, Qwen3-Coder-480B-A35B-Instruct, Qwen3-Next-80B-A3B-Instruct, Qwen3-Next-80B-A3B-Thinking
Deepseek	Deepseek-R1; Deepseek-V3; DeepSeek-V3.1
Mistral and Mixtral	Mistral-7B-v0.1; Mixtral-8x7B-v0.1; Mistral-Nemo-Base-2407
Phi-4	Phi-4-reasoning
OpenAI OSS	gpt-oss: 20B, 120B

Limitations

Custom weights don't support importing quantized models.
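As a rough pre-upload check for this limitation, the following sketch inspects the text of a `config.json` file for a `quantization_config` entry, which Hugging Face Transformers typically writes when a model is saved in quantized form. The function name `looks_quantized` and the sample config values are hypothetical, and the absence of the key does not guarantee that the weights are unquantized.

```python
import json

def looks_quantized(config_text: str) -> bool:
    """Return True if a config.json body declares a quantization_config entry."""
    config = json.loads(config_text)
    # Treat the presence of this key as a hint only; it is a heuristic,
    # not an official validation step of the import service.
    return bool(config.get("quantization_config"))

# Illustrative config snippets (hypothetical values, not from a real model).
full_precision = '{"model_type": "llama", "torch_dtype": "bfloat16"}'
quantized = '{"model_type": "llama", "quantization_config": {"quant_method": "gptq", "bits": 4}}'

print(looks_quantized(full_precision))  # False
print(looks_quantized(quantized))       # True
```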

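Model files

You must supply the model files in the Hugging Face weights format. For more information about the Hugging Face weights format, see Use Hugging Face models.

If the required files aren't provided, the model deployment might fail.

The following table lists the types of model files, which depend on the model's architecture: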

Model file content	File type
Model configuration	`config.json`
Model weights	`*.safetensors` `*.bin`
Weights index	`*.index.json`
Tokenizer files	`tokenizer.model` `tokenizer.json` `tokenizer_config.json`
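Because a missing file can make the deployment fail, it can help to check the file list before uploading. The following is a minimal sketch that matches file names against the patterns in the table. The grouping, the `missing_groups` helper, and the decision to treat the weights index as optional (it usually appears only with sharded weights) are assumptions, not rules stated by the service.

```python
from fnmatch import fnmatchcase

# Patterns taken from the table above; at least one match per group is expected.
# Which groups are strictly required depends on the model architecture.
EXPECTED_GROUPS = {
    "model configuration": ["config.json"],
    "model weights": ["*.safetensors", "*.bin"],
    "tokenizer files": ["tokenizer.model", "tokenizer.json", "tokenizer_config.json"],
}

def missing_groups(filenames):
    """Return the names of file groups that have no matching file."""
    missing = []
    for group, patterns in EXPECTED_GROUPS.items():
        found = any(fnmatchcase(name, p) for name in filenames for p in patterns)
        if not found:
            missing.append(group)
    return missing

# Hypothetical file listing for a sharded model.
complete = [
    "config.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
    "model.safetensors.index.json",
    "tokenizer.json",
    "tokenizer_config.json",
]
print(missing_groups(complete))                           # []
print(missing_groups(["config.json", "tokenizer.json"]))  # ['model weights']
```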

Locations

You can deploy custom models in all regions from Model Garden services.

Prerequisites

This section describes what to set up before you deploy your custom model.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

This tutorial assumes that you use Cloud Shell to interact with Google Cloud. If you want to use a different shell instead of Cloud Shell, perform the following additional configuration:

Install the Google Cloud CLI.
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
To initialize the gcloud CLI, run the following command:
```
gcloud init
```

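Deploy a custom model

This section demonstrates how to deploy your custom model.

If you're using the command-line interface (CLI), Python, or JavaScript, replace the following variables with values that make the code samples work: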

REGION: Your region. For example, us-central1.
MODEL_GCS: The Cloud Storage URI of your model artifacts. For example, gs://custom-weights-fishfooding/meta-llama/Llama-3.2-1B-Instruct
PROJECT_ID: Your project ID.
MODEL_ID: Your model ID.
MACHINE_TYPE: Your machine type. For example, g2-standard-12.
ACCELERATOR_TYPE: Your accelerator type. For example, NVIDIA_L4.
ACCELERATOR_COUNT: Your accelerator count.
PROMPT: Your text prompt.
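Typos in these values (for example, a region written without its hyphen) tend to surface only as deployment errors. The following sketch runs a few rough sanity checks before the values are used. The `check_deploy_values` helper and its rules are assumptions for illustration; the region pattern only approximates names such as us-central1 and is not an authoritative list of regions.

```python
import re

def check_deploy_values(region: str, model_gcs: str, accelerator_count: str) -> list:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    # Rough shape of a region name: lowercase words joined by hyphens, then digits.
    if not re.fullmatch(r"[a-z]+(-[a-z]+)+\d+", region):
        problems.append(f"REGION does not look like a region name: {region!r}")
    # The model location is a Cloud Storage URI, so it should start with gs://.
    if not (model_gcs.startswith("gs://") and len(model_gcs) > len("gs://")):
        problems.append(f"MODEL_GCS is not a gs:// URI: {model_gcs!r}")
    # The accelerator count should be a positive whole number.
    if not (accelerator_count.isdigit() and int(accelerator_count) >= 1):
        problems.append(f"ACCELERATOR_COUNT is not a positive integer: {accelerator_count!r}")
    return problems

print(check_deploy_values("us-central1", "gs://my-bucket/my-model", "1"))     # []
print(len(check_deploy_values("uscentral1", "my-bucket/my-model", "0")))      # 3
```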

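Console

The following steps show how to use the Google Cloud console to deploy your model with custom weights.

In the Google Cloud console, go to the [Model Garden] page.

Go to Model Garden
Click [Deploy model with custom weights]. The [Deploy a model with custom weights on Vertex AI] pane appears.
In the [Model source] section, do the following:
1. Click [Browse], choose the bucket where your model is stored, and click [Select].
2. Optional: Enter a name for your model in the [Model name] field.
In the [Deployment settings] section, do the following:
1. In the [Region] field, select your region, and click [OK].
2. In the [Machine spec] field, select the machine specification used to deploy your model.
3. Optional: The [Endpoint name] field shows your model's endpoint by default. You can enter a different endpoint name in the field.
Click [Deploy model with custom weights].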

gcloud CLI

This command demonstrates how to deploy the model to a specific region.

gcloud ai model-garden models deploy --model=${MODEL_GCS} --region ${REGION}

This command demonstrates how to deploy the model to a specific region with a machine type, accelerator type, and accelerator count. If you want to select a specific machine configuration, you must set all three fields.

gcloud ai model-garden models deploy --model=${MODEL_GCS} --machine-type=${MACHINE_TYPE} --accelerator-type=${ACCELERATOR_TYPE} --accelerator-count=${ACCELERATOR_COUNT} --region ${REGION}
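The all-three-fields rule can be enforced before the command is run. The following sketch assembles the argument list for the commands above and rejects a partial machine configuration. The `build_deploy_command` helper is hypothetical; it uses only the flags shown in this section and does not execute anything.

```python
def build_deploy_command(model_gcs, region, machine_type=None,
                         accelerator_type=None, accelerator_count=None):
    """Build the gcloud argument list, requiring the machine fields to be set together."""
    machine_fields = [machine_type, accelerator_type, accelerator_count]
    provided = [f for f in machine_fields if f is not None]
    if provided and len(provided) != 3:
        raise ValueError(
            "Set machine type, accelerator type, and accelerator count together")
    cmd = ["gcloud", "ai", "model-garden", "models", "deploy",
           f"--model={model_gcs}", f"--region={region}"]
    if provided:
        cmd += [f"--machine-type={machine_type}",
                f"--accelerator-type={accelerator_type}",
                f"--accelerator-count={accelerator_count}"]
    return cmd

# Default machine configuration.
print(" ".join(build_deploy_command("gs://my-bucket/my-model", "us-central1")))
# Explicit machine configuration (example values from the variable list).
print(" ".join(build_deploy_command("gs://my-bucket/my-model", "us-central1",
                                    "g2-standard-12", "NVIDIA_L4", 1)))
```

The returned list can be passed to `subprocess.run` if you want to invoke the CLI from a script.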

Python

import vertexai
from vertexai.preview import model_garden

# Replace each ${...} placeholder with your own value before running.
vertexai.init(project="${PROJECT_ID}", location="${REGION}")
custom_model = model_garden.CustomModel(
  gcs_uri="${MODEL_GCS}",
)
endpoint = custom_model.deploy(
  machine_type="${MACHINE_TYPE}",
  accelerator_type="${ACCELERATOR_TYPE}",
  accelerator_count=${ACCELERATOR_COUNT},  # an integer, not a string
  model_display_name="custom-model",
  endpoint_display_name="custom-model-endpoint")

endpoint.predict(instances=[{"prompt": "${PROMPT}"}], use_dedicated_endpoint=True)

Alternatively, you can call the custom_model.deploy() method without passing any parameters.

import vertexai
from vertexai.preview import model_garden

# Replace each ${...} placeholder with your own value before running.
vertexai.init(project="${PROJECT_ID}", location="${REGION}")
custom_model = model_garden.CustomModel(
  gcs_uri="${MODEL_GCS}",
)
# Deploy with the default machine configuration.
endpoint = custom_model.deploy()

endpoint.predict(instances=[{"prompt": "${PROMPT}"}], use_dedicated_endpoint=True)

curl


curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:deploy" \
  -d '{
    "custom_model": {
      "gcs_uri": "'"${MODEL_GCS}"'"
    },
    "destination": "projects/'"${PROJECT_ID}"'/locations/'"${REGION}"'",
    "model_config": {
      "model_user_id": "'"${MODEL_ID}"'"
    }
  }'

Alternatively, you can use the API to explicitly set the machine type.


curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:deploy" \
  -d '{
    "custom_model": {
      "gcs_uri": "'"${MODEL_GCS}"'"
    },
    "destination": "projects/'"${PROJECT_ID}"'/locations/'"${REGION}"'",
    "model_config": {
      "model_user_id": "'"${MODEL_ID}"'"
    },
    "deploy_config": {
      "dedicated_resources": {
        "machine_spec": {
          "machine_type": "'"${MACHINE_TYPE}"'",
          "accelerator_type": "'"${ACCELERATOR_TYPE}"'",
          "accelerator_count": '"${ACCELERATOR_COUNT}"'
        },
        "min_replica_count": 1
      }
    }
  }'
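Hand-quoting JSON inside a shell string is easy to get wrong (a stray trailing comma is enough to make the body invalid). As an alternative, the following sketch builds the same request body in Python and serializes it with `json.dumps`. The `build_deploy_body` helper and the sample values are hypothetical; the field names mirror the curl samples above.

```python
import json

def build_deploy_body(project_id, region, model_gcs, model_id, machine_spec=None):
    """Return the deploy request body as a JSON string."""
    body = {
        "custom_model": {"gcs_uri": model_gcs},
        "destination": f"projects/{project_id}/locations/{region}",
        "model_config": {"model_user_id": model_id},
    }
    # Add the optional machine configuration only when one is supplied.
    if machine_spec is not None:
        body["deploy_config"] = {
            "dedicated_resources": {
                "machine_spec": machine_spec,
                "min_replica_count": 1,
            }
        }
    return json.dumps(body, indent=2)

# Hypothetical project, bucket, and model ID; machine values from the variable list.
spec = {"machine_type": "g2-standard-12",
        "accelerator_type": "NVIDIA_L4",
        "accelerator_count": 1}
print(build_deploy_body("my-project", "us-central1",
                        "gs://my-bucket/my-model", "my-model", spec))
```

You can write the output to a file and pass it to curl with `-d @body.json`.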

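Make a query

After your model is deployed, custom weights support the public dedicated endpoint. You can send queries by using the API or the SDK.

Before you send queries, you must get your endpoint URL, endpoint ID, and model ID from the Google Cloud console.

Follow these steps to get the information: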

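In the Google Cloud console, go to the [Model Garden] page.

Go to Model Garden
Click [View my endpoints & models].
Select your region from the [Region] list.
To get the endpoint ID and endpoint URL, click your endpoint in the [My endpoints] section.

Your endpoint ID is displayed in the [Endpoint ID] field.

Your public endpoint URL is displayed in the [Dedicated endpoint] field.
To get the model ID, find your model in the [Deployed models] section, and do the following:
1. Click your deployed model's name in the [Model] field.
2. Click [Version details]. Your model ID is displayed in the [Model ID] field.

After you get your endpoint and deployed model information, see the following code samples for how to send an inference request, or see Send an online inference request to a dedicated public endpoint.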
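The values collected from the console combine into the request URLs used by the curl samples in this section. The following sketch composes those URLs in one place. The `endpoint_urls` helper is hypothetical, and the sample host name and IDs are made-up placeholders, not real endpoints.

```python
def endpoint_urls(endpoint_url, project_id, location, endpoint_id):
    """Return the request URLs for a dedicated endpoint, keyed by request type."""
    # Accept the host with or without a scheme, as copied from the console.
    host = endpoint_url.removeprefix("https://").rstrip("/")
    base = (f"https://{host}/v1beta1/projects/{project_id}"
            f"/locations/{location}/endpoints/{endpoint_id}")
    return {
        "chat": f"{base}/chat/completions",
        "predict": f"{base}:predict",
        "stream_raw_predict": f"{base}:streamRawPredict",
    }

# Made-up placeholder values for illustration only.
urls = endpoint_urls("example-endpoint.example.com", "my-project",
                     "us-central1", "1234567890")
for name, url in urls.items():
    print(name, url)
```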

API

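The following code samples demonstrate different ways to use the API based on your use case.

Chat completion (unary)

This sample request sends a complete chat message to the model and gets the response in a single chunk after the entire response has been generated. This is similar to sending a text message and getting a single full reply.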

  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://${ENDPOINT_URL}/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}/chat/completions" \
    -d '{
    "model": "'"${MODEL_ID}"'",
    "temperature": 0,
    "top_p": 1,
    "max_tokens": 154,
    "ignore_eos": true,
    "messages": [
      {
        "role": "user",
        "content": "How to tell the time by looking at the sky?"
      }
    ]
  }'

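Chat completion (streaming)

This request is the streaming version of the unary chat completion request. When you add "stream": true to the request, the model sends its response piece by piece as it's generated. This is useful for creating a real-time, typewriter-like effect in a chat application.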

  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://${ENDPOINT_URL}/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}/chat/completions" \
    -d '{
    "model": "'"${MODEL_ID}"'",
    "stream": true,
    "temperature": 0,
    "top_p": 1,
    "max_tokens": 154,
    "ignore_eos": true,
    "messages": [
      {
        "role": "user",
        "content": "How to tell the time by looking at the sky?"
      }
    ]
  }'
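On the client side, a streamed reply has to be reassembled from its pieces. The following sketch assumes that the stream arrives as OpenAI-style server-sent events, that is, `data: {json}` lines carrying `choices[].delta.content` and a final `data: [DONE]` line. That wire format is an assumption here, not something this page states, so check the actual output of your endpoint before relying on it. The `collect_stream_text` helper and the sample lines are hypothetical.

```python
import json

def collect_stream_text(lines):
    """Join the content deltas from an iterable of server-sent event lines."""
    parts = []
    for line in lines:
        line = line.strip()
        # Skip blank keep-alive lines and anything that is not a data event.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            parts.append(delta.get("content") or "")
    return "".join(parts)

# Hand-written sample events for illustration, not captured output.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Look at "}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "the sun."}}]}',
    'data: [DONE]',
]
print(collect_stream_text(sample))  # Look at the sun.
```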

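Predict

This request sends a direct prompt to get an inference from the model. It's often used for tasks that aren't necessarily conversational, such as text summarization or classification.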

  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://${ENDPOINT_URL}/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}:predict" \
    -d '{
    "instances": [
      {
        "prompt": "How to tell the time by looking at the sky?",
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 154,
        "ignore_eos": true
      }
    ]
  }'

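Raw predict

This request is the streaming version of the Predict request. By using the :streamRawPredict endpoint and including "stream": true, this request sends a direct prompt and receives the model's output as a continuous stream of data as it's generated, similar to the streaming chat completion request.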

  curl -X POST \
    -N \
    --output - \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://${ENDPOINT_URL}/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}:streamRawPredict" \
    -d '{
    "instances": [
      {
        "prompt": "How to tell the time by looking at the sky?",
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 154,
        "ignore_eos": true,
        "stream": true
      }
    ]
  }'

SDK

This code sample uses the SDK to send a query to a model and get a response back from that model.

  from google.cloud import aiplatform

  project_id = ""
  location = ""
  endpoint_id = "" # Use the short ID here

  aiplatform.init(project=project_id, location=location)

  endpoint = aiplatform.Endpoint(endpoint_id)

  prompt = "How to tell the time by looking at the sky?"
  instances=[{"text": prompt}]
  response = endpoint.predict(instances=instances, use_dedicated_endpoint=True)
  print(response.predictions)

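For another example of how to use the API, see the Import Custom Weights notebook.

Learn more about self-deployed models in Vertex AI

For more information about self-deployed models, see Overview of self-deployed models.
For more information about Model Garden, see Overview of Model Garden.
For more information about deploying models, see Use models in Model Garden.
Use Gemma open models
Use Llama open models
Use Hugging Face open models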