Deploy and run inference on Gemma using Model Garden and Vertex AI GPU-backed endpoints

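In this tutorial, you use Model Garden to deploy the Gemma 1B open model to a GPU-backed Vertex AI endpoint. You must deploy a model to an endpoint before the model can serve online predictions. Deploying a model associates physical resources with it so that it can serve online predictions with low latency.

After you deploy the Gemma 1B model, you run inference on the trained model by using the PredictionServiceClient to get online predictions. Online predictions are synchronous requests made to a model that is deployed to an endpoint.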

Objectives

This tutorial shows you how to perform the following tasks:

Deploy the Gemma 1B open model to a GPU-backed endpoint by using Model Garden
Use the PredictionServiceClient to get online predictions

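Costs

In this document, you use billable components of Google Cloud.

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.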

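Before you begin

To complete this tutorial, you need to do the following:

Set up a Google Cloud project and enable the Vertex AI API.
On your local machine:
- Install, initialize, and authenticate with the Google Cloud CLI
- Install the SDK for your language

Set up a Google Cloud project

Set up your Google Cloud project and enable the Vertex AI API.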

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

Set up the Google Cloud CLI

On your local machine, set up the Google Cloud CLI.

Install and initialize the Google Cloud CLI.
If you previously installed the gcloud CLI, make sure that your gcloud components are up to date by running this command.
```
gcloud components update
```
To authenticate with the gcloud CLI, generate a local Application Default Credentials (ADC) file by running this command. The web flow that the command launches is used to provide your user credentials.
```
gcloud auth application-default login
```
For more information, see gcloud CLI authentication configuration and ADC configuration.
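To confirm that the login step produced a credentials file, a short check like the following can help. It is a minimal sketch, not part of the tutorial's samples: the helper name `adc_candidate_path` is hypothetical, and it assumes the well-known default ADC locations (an explicit `GOOGLE_APPLICATION_CREDENTIALS` path, otherwise the gcloud configuration directory). It does not account for a relocated gcloud configuration directory.

```python
import os
from pathlib import Path

ADC_FILENAME = "application_default_credentials.json"


def adc_candidate_path(env=None, home=None, platform=os.name):
    """Return the path where Application Default Credentials are expected.

    Hypothetical helper: it only computes a path, it does not validate the file.
    """
    env = os.environ if env is None else env
    # An explicit GOOGLE_APPLICATION_CREDENTIALS setting takes precedence.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        return Path(explicit)
    if platform == "nt":
        # Windows: %APPDATA%\gcloud\application_default_credentials.json
        return Path(env.get("APPDATA", "")) / "gcloud" / ADC_FILENAME
    # Linux and macOS: ~/.config/gcloud/application_default_credentials.json
    base = Path(home) if home is not None else Path.home()
    return base / ".config" / "gcloud" / ADC_FILENAME


path = adc_candidate_path()
print(f"{path} exists: {path.exists()}")
```

If the file is missing, rerunning the login command is usually the simplest fix.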

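Set up the SDK for your programming language

To set up the environment used in this tutorial, install the Vertex AI SDK for your language and the Protocol Buffers library. The code samples use functions from the Protocol Buffers library to convert the input dictionary to the JSON format that the API expects.

On your local machine, click one of the following tabs to install the SDK for your programming language.

Python

Install and update the Vertex AI SDK for Python by running this command.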
```
pip3 install --upgrade "google-cloud-aiplatform>=1.64"
```
Install the Protocol Buffers library for Python by running this command.
```
pip3 install --upgrade "protobuf>=5.28"
```

Node.js

Install or update the aiplatform SDK for Node.js by running the following command.

npm install @google-cloud/aiplatform

Java

To add google-cloud-aiplatform as a dependency, add the appropriate code for your environment.

Maven with BOM

Add the following XML to your pom.xml:

<dependencyManagement>
<dependencies>
  <dependency>
    <artifactId>libraries-bom</artifactId>
    <groupId>com.google.cloud</groupId>
    <scope>import</scope>
    <type>pom</type>
    <version>26.34.0</version>
  </dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-aiplatform</artifactId>
</dependency>
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java-util</artifactId>
</dependency>
<dependency>
  <groupId>com.google.code.gson</groupId>
  <artifactId>gson</artifactId>
</dependency>
</dependencies>

Maven without BOM

Add the following to your pom.xml:

<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-aiplatform</artifactId>
  <version>1.1.0</version>
</dependency>
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java-util</artifactId>
  <version>4.28.0</version>
</dependency>
<dependency>
  <groupId>com.google.code.gson</groupId>
  <artifactId>gson</artifactId>
  <version>2.11.0</version>
</dependency>

Gradle without BOM

Add the following to your build.gradle:

implementation 'com.google.cloud:google-cloud-aiplatform:1.1.0'

Go

Install these Go packages by running the following commands.

go get cloud.google.com/go/aiplatform
go get google.golang.org/protobuf
go get github.com/googleapis/gax-go/v2

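Deploy Gemma using Model Garden

You can deploy Gemma 1B by using its model card in the Google Cloud console or programmatically.

For more information about setting up the Google Gen AI SDK or the Google Cloud CLI, see the Google Gen AI SDK overview or Install the Google Cloud CLI.

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

List the models that you can deploy and record the ID of the model to deploy. You can optionally list the supported Hugging Face models in Model Garden and filter them by model name. The output doesn't include any tuned models.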


import vertexai
from vertexai import model_garden

# TODO(developer): Update and un-comment below lines
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

# List deployable models, optionally list Hugging Face models only or filter by model name.
deployable_models = model_garden.list_deployable_models(list_hf_models=False, model_filter="gemma")
print(deployable_models)
# Example response:
# ['google/gemma2@gemma-2-27b','google/gemma2@gemma-2-27b-it', ...]
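The model IDs returned by the listing follow a `publisher/model@version` pattern, and the REST and Terraform paths in this tutorial use the longer `publishers/.../models/...@...` form. The following is a minimal sketch of converting between the two, assuming every ID has exactly that shape. The names `ModelId`, `parse_model_id`, and `publisher_model_name` are hypothetical helpers and are not part of the Vertex AI SDK.

```python
from typing import NamedTuple


class ModelId(NamedTuple):
    publisher: str
    model: str
    version: str


def parse_model_id(model_id: str) -> ModelId:
    """Split 'publisher/model@version' into its three parts."""
    name, at, version = model_id.partition("@")
    publisher, slash, model = name.partition("/")
    if not (at and slash and publisher and model and version):
        raise ValueError(f"expected publisher/model@version, got {model_id!r}")
    return ModelId(publisher, model, version)


def publisher_model_name(model_id: str) -> str:
    """Build the publishers/PUBLISHER/models/MODEL@VERSION form."""
    parts = parse_model_id(model_id)
    return f"publishers/{parts.publisher}/models/{parts.model}@{parts.version}"


print(publisher_model_name("google/gemma3@gemma-3-1b-it"))
# prints: publishers/google/models/gemma3@gemma-3-1b-it
```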

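View the deployment specifications for a model by using the model ID from the previous step. You can view the machine type, accelerator type, and container image URI that Model Garden has verified for a particular model.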


import vertexai
from vertexai import model_garden

# TODO(developer): Update and un-comment below lines
# PROJECT_ID = "your-project-id"
# model = "google/gemma3@gemma-3-1b-it"
vertexai.init(project=PROJECT_ID, location="us-central1")

# For Hugging Face models, the format is the Hugging Face model name, as in
# "meta-llama/Llama-3.3-70B-Instruct".
# Go to https://console.cloud.google.com/vertex-ai/model-garden to find all deployable
# model names.

model = model_garden.OpenModel(model)
deploy_options = model.list_deploy_options()
print(deploy_options)
# Example response:
# [
#   dedicated_resources {
#     machine_spec {
#       machine_type: "g2-standard-12"
#       accelerator_type: NVIDIA_L4
#       accelerator_count: 1
#     }
#   }
#   container_spec {
#     ...
#   }
#   ...
# ]

Deploy a model to an endpoint. Model Garden uses the default deployment configuration unless you specify additional arguments and values.


import vertexai
from vertexai import model_garden

# TODO(developer): Update and un-comment below lines
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

open_model = model_garden.OpenModel("google/gemma3@gemma-3-12b-it")
endpoint = open_model.deploy(
    machine_type="g2-standard-48",
    accelerator_type="NVIDIA_L4",
    accelerator_count=4,
    accept_eula=True,
)

# Optional. Run predictions on the deployed endpoint.
# endpoint.predict(instances=[{"prompt": "What is Generative AI?"}])

gcloud

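Before you begin, specify a quota project to run the following commands. The commands that you run count against the quotas for that project. For more information, see Set the quota project.

List the models that you can deploy by running the gcloud ai model-garden models list command. This command lists all model IDs and shows which ones you can deploy yourself.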

gcloud ai model-garden models list --model-filter=gemma

In the output, find the model ID to deploy. The following example shows an abbreviated output.

MODEL_ID                                      CAN_DEPLOY  CAN_PREDICT
google/gemma2@gemma-2-27b                     Yes         No
google/gemma2@gemma-2-27b-it                  Yes         No
google/gemma2@gemma-2-2b                      Yes         No
google/gemma2@gemma-2-2b-it                   Yes         No
google/gemma2@gemma-2-9b                      Yes         No
google/gemma2@gemma-2-9b-it                   Yes         No
google/gemma3@gemma-3-12b-it                  Yes         No
google/gemma3@gemma-3-12b-pt                  Yes         No
google/gemma3@gemma-3-1b-it                   Yes         No
google/gemma3@gemma-3-1b-pt                   Yes         No
google/gemma3@gemma-3-27b-it                  Yes         No
google/gemma3@gemma-3-27b-pt                  Yes         No
google/gemma3@gemma-3-4b-it                   Yes         No
google/gemma3@gemma-3-4b-pt                   Yes         No
google/gemma3n@gemma-3n-e2b                   Yes         No
google/gemma3n@gemma-3n-e2b-it                Yes         No
google/gemma3n@gemma-3n-e4b                   Yes         No
google/gemma3n@gemma-3n-e4b-it                Yes         No
google/gemma@gemma-1.1-2b-it                  Yes         No
google/gemma@gemma-1.1-2b-it-gg-hf            Yes         No
google/gemma@gemma-1.1-7b-it                  Yes         No
google/gemma@gemma-1.1-7b-it-gg-hf            Yes         No
google/gemma@gemma-2b                         Yes         No
google/gemma@gemma-2b-gg-hf                   Yes         No
google/gemma@gemma-2b-it                      Yes         No
google/gemma@gemma-2b-it-gg-hf                Yes         No
google/gemma@gemma-7b                         Yes         No
google/gemma@gemma-7b-gg-hf                   Yes         No
google/gemma@gemma-7b-it                      Yes         No
google/gemma@gemma-7b-it-gg-hf                Yes         No

The output doesn't include any tuned models or Hugging Face models. To view the supported Hugging Face models, add the --can-deploy-hugging-face-models flag.
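When the listing is long, it can help to filter the table in a script. The following is a rough sketch that assumes the column layout shown in the abbreviated output above (whitespace-separated cells, a header row with MODEL_ID and CAN_DEPLOY). The function name `deployable_model_ids` and the `contains` parameter are hypothetical.

```python
SAMPLE_OUTPUT = """\
MODEL_ID                                      CAN_DEPLOY  CAN_PREDICT
google/gemma3@gemma-3-1b-it                   Yes         No
google/gemma3@gemma-3-1b-pt                   Yes         No
google/gemma3@gemma-3-4b-it                   Yes         No
"""


def deployable_model_ids(table_text: str, contains: str = "") -> list[str]:
    """Return MODEL_ID values whose CAN_DEPLOY column is 'Yes'."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    header = lines[0].split()
    id_col = header.index("MODEL_ID")
    deploy_col = header.index("CAN_DEPLOY")
    result = []
    for line in lines[1:]:
        cells = line.split()
        # Keep rows that are deployable and match the optional substring.
        if cells[deploy_col] == "Yes" and contains in cells[id_col]:
            result.append(cells[id_col])
    return result


print(deployable_model_ids(SAMPLE_OUTPUT, contains="1b"))
# prints: ['google/gemma3@gemma-3-1b-it', 'google/gemma3@gemma-3-1b-pt']
```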

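To view the deployment specifications for a model, run the gcloud ai model-garden models list-deployment-config command. You can view the machine type, accelerator type, and container image URI that Model Garden supports for a particular model.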
```
gcloud ai model-garden models list-deployment-config \
    --model=MODEL_ID
```
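Replace MODEL_ID with the model ID from the previous list command, such as google/gemma@gemma-2b or stabilityai/stable-diffusion-xl-base-1.0.
Deploy a model to an endpoint by running the gcloud ai model-garden models deploy command. Model Garden generates a display name for your endpoint and uses the default deployment configuration unless you specify additional arguments and values.

To run the command asynchronously, include the --asynchronous flag.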
```
gcloud ai model-garden models deploy \
    --model=MODEL_ID \
    [--machine-type=MACHINE_TYPE] \
    [--accelerator-type=ACCELERATOR_TYPE] \
    [--endpoint-display-name=ENDPOINT_NAME] \
    [--hugging-face-access-token=HF_ACCESS_TOKEN] \
    [--reservation-affinity reservation-affinity-type=any-reservation] \
    [--reservation-affinity reservation-affinity-type=specific-reservation, key="compute.googleapis.com/reservation-name", values=RESERVATION_RESOURCE_NAME] \
    [--asynchronous]
```
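Replace the following placeholders:
- MODEL_ID: The model ID from the previous list command. For Hugging Face models, use the Hugging Face model URL format, such as stabilityai/stable-diffusion-xl-base-1.0.
- MACHINE_TYPE: Defines the set of resources to deploy for your model, such as g2-standard-4.
- ACCELERATOR_TYPE: Specifies the accelerators to add to your deployment to help improve performance for intensive workloads, such as NVIDIA_L4.
- ENDPOINT_NAME: A name for the deployed Vertex AI endpoint.
- HF_ACCESS_TOKEN: For Hugging Face models, if the model is private, provide an access token.
- RESERVATION_RESOURCE_NAME: To use a specific Compute Engine reservation, specify the name of your reservation. If you specify a specific reservation, you can't specify any-reservation.
The output includes the deployment configuration that Model Garden used, the endpoint ID, and the deployment operation ID, which you can use to check the deployment status.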
```
Using the default deployment configuration:
 Machine type: g2-standard-12
 Accelerator type: NVIDIA_L4
 Accelerator count: 1

The project has enough quota. The current usage of quota for accelerator type NVIDIA_L4 in region us-central1 is 0 out of 28.

Deploying the model to the endpoint. To check the deployment status, you can try one of the following methods:
1) Look for endpoint `ENDPOINT_DISPLAY_NAME` at the [Vertex AI] -> [Online prediction] tab in Cloud Console
2) Use `gcloud ai operations describe OPERATION_ID --region=LOCATION` to find the status of the deployment long-running operation
```
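When the deploy command is scripted rather than typed by hand, assembling the argument list in code avoids quoting mistakes around the optional flags. The following is a minimal sketch that uses only flags from the command synopsis above. The function name `build_deploy_command` is hypothetical, and the example values (model ID, machine type, accelerator) are illustrative.

```python
import shlex


def build_deploy_command(model_id, machine_type=None, accelerator_type=None,
                         endpoint_display_name=None, asynchronous=False):
    """Assemble the argv list for `gcloud ai model-garden models deploy`."""
    argv = ["gcloud", "ai", "model-garden", "models", "deploy",
            f"--model={model_id}"]
    optional = {
        "--machine-type": machine_type,
        "--accelerator-type": accelerator_type,
        "--endpoint-display-name": endpoint_display_name,
    }
    # Only include the optional flags that the caller actually set.
    for flag, value in optional.items():
        if value is not None:
            argv.append(f"{flag}={value}")
    if asynchronous:
        argv.append("--asynchronous")
    return argv


argv = build_deploy_command(
    "google/gemma3@gemma-3-1b-it",
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    asynchronous=True,
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where the gcloud CLI is installed and authenticated.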
To see details about your deployment, run the gcloud ai endpoints list --list-model-garden-endpoints-only command.
```
gcloud ai endpoints list --list-model-garden-endpoints-only \
    --region=LOCATION_ID
```
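Replace LOCATION_ID with the region where you deployed the model.

The output includes all endpoints that were created from Model Garden, with information such as the endpoint ID, the endpoint name, and whether the endpoint is associated with a deployed model. To find your deployment, look for the endpoint name that the previous command returned.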

REST

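List all deployable models and then get the ID of the model to deploy. You can then deploy the model with its default configuration and endpoint. Alternatively, you can customize your deployment, for example by setting a specific machine type or using a dedicated endpoint.

List models that you can deploy

Before using any of the request data, make the following replacements:

PROJECT_ID: Your Google Cloud project ID.
QUERY_PARAMETERS: To list Model Garden models, add the query parameters listAllVersions=True&filter=can_deploy(true). To list Hugging Face models, set the filter to alt=json&is_hf_wildcard(true)+AND+labels.VERIFIED_DEPLOYMENT_CONFIG%3DVERIFIED_DEPLOYMENT_SUCCEED&listAllVersions=True.

HTTP method and URL: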

GET https://us-central1-aiplatform.googleapis.com/v1/publishers/*/models?QUERY_PARAMETERS
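As an illustration of how QUERY_PARAMETERS slots into the URL, the following sketch builds the Model Garden listing URL with the standard library. It covers only the `listAllVersions=True&filter=can_deploy(true)` case described above. The function name `list_deployable_models_url` and its `region` parameter are assumptions made for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def list_deployable_models_url(region: str = "us-central1") -> str:
    """Build the GET URL that lists deployable Model Garden models."""
    base = f"https://{region}-aiplatform.googleapis.com/v1/publishers/*/models"
    # safe="()" keeps the parentheses of can_deploy(true) readable in the URL.
    query = urlencode(
        {"listAllVersions": "True", "filter": "can_deploy(true)"},
        safe="()",
    )
    return f"{base}?{query}"


url = list_deployable_models_url()
print(url)
```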

To send your request, choose one of these options:

curl

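Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you in to the gcloud CLI. You can check the currently active account by running gcloud auth list.

Run the following command: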

curl -X GET \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "x-goog-user-project: PROJECT_ID" \
     "https://us-central1-aiplatform.googleapis.com/v1/publishers/*/models?QUERY_PARAMETERS"

PowerShell

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Run the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "PROJECT_ID" }

Invoke-WebRequest `
    -Method GET `
    -Headers $headers `
    -Uri "https://us-central1-aiplatform.googleapis.com/v1/publishers/*/models?QUERY_PARAMETERS" | Select-Object -Expand Content

You receive a JSON response similar to the following.

{
  "publisherModels": [
    {
      "name": "publishers/google/models/gemma3",
      "versionId": "gemma-3-1b-it",
      "openSourceCategory": "GOOGLE_OWNED_OSS_WITH_GOOGLE_CHECKPOINT",
      "supportedActions": {
        "openNotebook": {
          "references": {
            "us-central1": {
              "uri": "https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_gradio_streaming_chat_completions.ipynb"
            }
          },
          "resourceTitle": "Notebook",
          "resourceUseCase": "Chat Completion Playground",
          "resourceDescription": "Chat with deployed Gemma 2 endpoints via Gradio UI."
        },
        "deploy": {
          "modelDisplayName": "gemma-3-1b-it",
          "containerSpec": {
            "imageUri": "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250312_0916_RC01",
            "args": [
              "python",
              "-m",
              "vllm.entrypoints.api_server",
              "--host=0.0.0.0",
              "--port=8080",
              "--model=gs://vertex-model-garden-restricted-us/gemma3/gemma-3-1b-it",
              "--tensor-parallel-size=1",
              "--swap-space=16",
              "--gpu-memory-utilization=0.95",
              "--disable-log-stats"
            ],
            "env": [
              {
                "name": "MODEL_ID",
                "value": "google/gemma-3-1b-it"
              },
              {
                "name": "DEPLOY_SOURCE",
                "value": "UI_NATIVE_MODEL"
              }
            ],
            "ports": [
              {
                "containerPort": 8080
              }
            ],
            "predictRoute": "/generate",
            "healthRoute": "/ping"
          },
          "dedicatedResources": {
            "machineSpec": {
              "machineType": "g2-standard-12",
              "acceleratorType": "NVIDIA_L4",
              "acceleratorCount": 1
            }
          },
          "publicArtifactUri": "gs://vertex-model-garden-restricted-us/gemma3/gemma3.tar.gz",
          "deployTaskName": "vLLM 128K context",
          "deployMetadata": {
            "sampleRequest": "{\n    \"instances\": [\n        {\n          \"@requestFormat\": \"chatCompletions\",\n          \"messages\": [\n              {\n                  \"role\": \"user\",\n                  \"content\": \"What is machine learning?\"\n              }\n          ],\n          \"max_tokens\": 100\n        }\n    ]\n}\n"
          }
        },
        ...

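Deploy a model

Deploy a model from Model Garden or a model from Hugging Face. You can also customize the deployment by specifying additional JSON fields.

Deploy a model with its default configuration.

Before using any of the request data, make the following replacements:

LOCATION: The region where the model is deployed.
PROJECT_ID: Your Google Cloud project ID.
MODEL_ID: The ID of the model to deploy, which you can get by listing all the deployable models. The ID uses the format publishers/PUBLISHER_NAME/models/MODEL_NAME@MODEL_VERSION.

HTTP method and URL: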

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy

JSON request body:

{
  "publisher_model_name": "MODEL_ID",
  "model_config": {
    "accept_eula": "true"
  }
}
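The same request can be assembled in a script before it is sent. The following sketch builds the URL and the default request body shown above. The function names `deploy_url` and `default_deploy_body` are hypothetical, and `my-project` is a placeholder project ID.

```python
import json


def deploy_url(project_id: str, location: str) -> str:
    """Build the POST URL for the Model Garden deploy method."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}:deploy"
    )


def default_deploy_body(model_id: str) -> dict:
    """Build the request body for a deployment with default configuration."""
    return {
        "publisher_model_name": model_id,
        "model_config": {"accept_eula": "true"},
    }


body = default_deploy_body("publishers/google/models/gemma3@gemma-3-1b-it")
print(deploy_url("my-project", "us-central1"))
print(json.dumps(body, indent=2))
```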

To send your request, choose one of these options:

curl

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

cat > request.json << 'EOF'
{
  "publisher_model_name": "MODEL_ID",
  "model_config": {
    "accept_eula": "true"
  }
}
EOF

Then run the following command to send your REST request:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy"

PowerShell

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

@'
{
  "publisher_model_name": "MODEL_ID",
  "model_config": {
    "accept_eula": "true"
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then run the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy" | Select-Object -Expand Content

You receive a JSON response similar to the following.

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployOperationMetadata",
    "genericMetadata": {
      "createTime": "2025-03-13T21:44:44.538780Z",
      "updateTime": "2025-03-13T21:44:44.538780Z"
    },
    "publisherModel": "publishers/google/models/gemma3@gemma-3-1b-it",
    "destination": "projects/PROJECT_ID/locations/LOCATION",
    "projectNumber": "PROJECT_ID"
  }
}
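The `name` field in the response identifies a long-running operation. The following sketch splits that name into its parts and formats the `gcloud ai operations describe` command that the gcloud deploy output suggests for checking deployment status. The helper names and the sample operation ID are hypothetical.

```python
import re

_OPERATION_RE = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/operations/(?P<operation>[^/]+)$"
)


def parse_operation_name(name: str) -> dict:
    """Split an operation resource name into project, location, and ID."""
    match = _OPERATION_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected operation name: {name!r}")
    return match.groupdict()


def describe_command(name: str) -> str:
    """Format the gcloud command that reports the operation's status."""
    parts = parse_operation_name(name)
    return (
        f"gcloud ai operations describe {parts['operation']} "
        f"--region={parts['location']}"
    )


sample = "projects/my-project/locations/us-central1/operations/1234567890"
print(describe_command(sample))
# prints: gcloud ai operations describe 1234567890 --region=us-central1
```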

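Deploy a Hugging Face model

Before using any of the request data, make the following replacements:

LOCATION: The region where the model is deployed.
PROJECT_ID: Your Google Cloud project ID.
MODEL_ID: The ID of the Hugging Face model to deploy, which you can get by listing all the deployable models. The ID uses the format PUBLISHER_NAME/MODEL_NAME.
ACCESS_TOKEN: If the model is private, provide an access token.

HTTP method and URL: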

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy

JSON request body:

{
  "hugging_face_model_id": "MODEL_ID",
  "hugging_face_access_token": "ACCESS_TOKEN",
  "model_config": {
    "accept_eula": "true"
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

cat > request.json << 'EOF'
{
  "hugging_face_model_id": "MODEL_ID",
  "hugging_face_access_token": "ACCESS_TOKEN",
  "model_config": {
    "accept_eula": "true"
  }
}
EOF

Then run the following command to send your REST request:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy"

PowerShell

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

@'
{
  "hugging_face_model_id": "MODEL_ID",
  "hugging_face_access_token": "ACCESS_TOKEN",
  "model_config": {
    "accept_eula": "true"
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then run the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy" | Select-Object -Expand Content

You receive a JSON response similar to the following.

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployOperationMetadata",
    "genericMetadata": {
      "createTime": "2025-03-13T21:44:44.538780Z",
      "updateTime": "2025-03-13T21:44:44.538780Z"
    },
    "publisherModel": "publishers/PUBLISHER_NAME/models/MODEL_NAME",
    "destination": "projects/PROJECT_ID/locations/LOCATION",
    "projectNumber": "PROJECT_ID"
  }
}

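Deploy a model with customizations

Before using any of the request data, make the following replacements:

LOCATION: The region where the model is deployed.
PROJECT_ID: Your Google Cloud project ID.
MODEL_ID: The ID of the model to deploy, which you can get by listing all the deployable models. The ID uses the format publishers/PUBLISHER_NAME/models/MODEL_NAME@MODEL_VERSION, such as google/gemma@gemma-2b or stabilityai/stable-diffusion-xl-base-1.0.
MACHINE_TYPE: Defines the set of resources to deploy for your model, such as g2-standard-4.
ACCELERATOR_TYPE: Specifies the accelerators to add to your deployment to help improve performance for intensive workloads, such as NVIDIA_L4.
ACCELERATOR_COUNT: The number of accelerators to use in your deployment.
reservation_affinity_type: To use an existing Compute Engine reservation for your deployment, specify any reservation or a specific reservation. If you specify this value, don't specify spot.
spot: Whether to use Spot VMs for your deployment.
IMAGE_URI: The location of the container image to use, such as us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20241016_0916_RC00_maas.
CONTAINER_ARGS: The arguments to pass to the container during the deployment.
CONTAINER_PORT: The port number for your container.
fast_tryout_enabled: When you test a model, you can choose a faster deployment. This option is available only for highly used models with certain machine types. If you enable it, you can't specify model or deployment configurations.

HTTP method and URL: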

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy

JSON request body:

{
  "publisher_model_name": "MODEL_ID",
  "deploy_config": {
    "dedicated_resources": {
      "machine_spec": {
        "machine_type": "MACHINE_TYPE",
        "accelerator_type": "ACCELERATOR_TYPE",
        "accelerator_count": ACCELERATOR_COUNT,
        "reservation_affinity": {
          "reservation_affinity_type": "ANY_RESERVATION"
        }
      },
      "spot": "false"
    },
    "fast_tryout_enabled": false
  },
  "model_config": {
    "accept_eula": "true",
    "container_spec": {
      "image_uri": "IMAGE_URI",
      "args": [CONTAINER_ARGS],
      "ports": [
        {
          "container_port": CONTAINER_PORT
        }
      ]
    }
  }
}
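Building the customized body in code makes it easier to keep the JSON valid and to enforce the rule that a reservation affinity and spot are not specified together. The following is a sketch under stated assumptions: the function name `custom_deploy_body` is hypothetical, the boolean fields are emitted as JSON booleans rather than the quoted strings used in the template, and `dedicated_resources` and `fast_tryout_enabled` are placed under a single `deploy_config` object because a JSON object should not repeat a key.

```python
import json


def custom_deploy_body(model_id, machine_type, accelerator_type,
                       accelerator_count, image_uri, container_args,
                       container_port, reservation_affinity_type=None,
                       spot=False):
    """Build a customized deploy request body as a plain dict."""
    # A reservation affinity and spot are documented as mutually exclusive.
    if reservation_affinity_type is not None and spot:
        raise ValueError("specify a reservation affinity or spot, not both")
    machine_spec = {
        "machine_type": machine_type,
        "accelerator_type": accelerator_type,
        "accelerator_count": accelerator_count,
    }
    if reservation_affinity_type is not None:
        machine_spec["reservation_affinity"] = {
            "reservation_affinity_type": reservation_affinity_type
        }
    return {
        "publisher_model_name": model_id,
        "deploy_config": {
            "dedicated_resources": {"machine_spec": machine_spec, "spot": spot},
            "fast_tryout_enabled": False,
        },
        "model_config": {
            "accept_eula": True,
            "container_spec": {
                "image_uri": image_uri,
                "args": list(container_args),
                "ports": [{"container_port": container_port}],
            },
        },
    }


body = custom_deploy_body(
    model_id="publishers/google/models/gemma3@gemma-3-1b-it",
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    image_uri="IMAGE_URI",  # Placeholder: replace with a real container image.
    container_args=["--host=0.0.0.0", "--port=8080"],
    container_port=8080,
    reservation_affinity_type="ANY_RESERVATION",
)
print(json.dumps(body, indent=2))
```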

To send your request, choose one of these options:

curl

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

cat > request.json << 'EOF'
{
  "publisher_model_name": "MODEL_ID",
  "deploy_config": {
    "dedicated_resources": {
      "machine_spec": {
        "machine_type": "MACHINE_TYPE",
        "accelerator_type": "ACCELERATOR_TYPE",
        "accelerator_count": ACCELERATOR_COUNT,
        "reservation_affinity": {
          "reservation_affinity_type": "ANY_RESERVATION"
        }
      },
      "spot": "false"
    },
    "fast_tryout_enabled": false
  },
  "model_config": {
    "accept_eula": "true",
    "container_spec": {
      "image_uri": "IMAGE_URI",
      "args": [CONTAINER_ARGS],
      "ports": [
        {
          "container_port": CONTAINER_PORT
        }
      ]
    }
  }
}
EOF

Then run the following command to send your REST request:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy"

PowerShell

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

@'
{
  "publisher_model_name": "MODEL_ID",
  "deploy_config": {
    "dedicated_resources": {
      "machine_spec": {
        "machine_type": "MACHINE_TYPE",
        "accelerator_type": "ACCELERATOR_TYPE",
        "accelerator_count": ACCELERATOR_COUNT,
        "reservation_affinity": {
          "reservation_affinity_type": "ANY_RESERVATION"
        }
      },
      "spot": "false"
    },
    "fast_tryout_enabled": false
  },
  "model_config": {
    "accept_eula": "true",
    "container_spec": {
      "image_uri": "IMAGE_URI",
      "args": [CONTAINER_ARGS],
      "ports": [
        {
          "container_port": CONTAINER_PORT
        }
      ]
    }
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then run the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy" | Select-Object -Expand Content

You receive a JSON response similar to the following.

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployOperationMetadata",
    "genericMetadata": {
      "createTime": "2025-03-13T21:44:44.538780Z",
      "updateTime": "2025-03-13T21:44:44.538780Z"
    },
    "publisherModel": "publishers/google/models/gemma3@gemma-3-1b-it",
    "destination": "projects/PROJECT_ID/locations/LOCATION",
    "projectNumber": "PROJECT_ID"
  }
}

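Console

In the Google Cloud console, go to the Model Garden page.

Go to Model Garden
Find a supported model that you want to deploy, and click its model card.
Click Deploy to open the Deploy model pane.
In the Deploy model pane, specify the details for your deployment.
1. Use or modify the generated model and endpoint names.
2. Select a location in which to create your model endpoint.
3. Select a machine type to use for each node of your deployment.
4. To use a Compute Engine reservation, under the Deployment settings section, select Advanced.
  
  In the Reservation type field, select a reservation type. The reservation must match the machine specs that you specified.
  - Automatically use created reservation: Vertex AI automatically selects an allowed reservation with matching properties. If the automatically selected reservation has no capacity, Vertex AI uses the general Google Cloud resource pool.
  - Select specific reservations: Vertex AI uses a specific reservation. If the selected reservation has no capacity, an error occurs.
  - Don't use (default): Vertex AI uses the general Google Cloud resource pool. This value has the same effect as not specifying a reservation.
Click Deploy.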

Terraform

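To learn how to apply or remove a Terraform configuration, see Basic Terraform commands. For more information, see the Terraform provider reference documentation.

Deploy a model

The following example deploys the gemma-3-1b-it model to a new Vertex AI endpoint in us-central1 by using the default configuration.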

terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
      version = "6.45.0"
    }
  }
}

provider "google" {
  region  = "us-central1"
}

resource "google_vertex_ai_endpoint_with_model_garden_deployment" "gemma_deployment" {
  publisher_model_name = "publishers/google/models/gemma3@gemma-3-1b-it"
  location = "us-central1"
  model_config {
    accept_eula = true
  }
}

For more information about deploying a model with customizations, see Vertex AI endpoint with Model Garden deployment.

Apply the configuration

terraform init
terraform plan
terraform apply

After you apply the configuration, Terraform provisions a new Vertex AI endpoint and deploys the specified open model.

Clean up

To delete the endpoint and the model deployment, run the following command:

terraform destroy

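Run inference on Gemma 1B with the PredictionServiceClient

After you deploy Gemma 1B, you use the PredictionServiceClient to get online predictions for the prompt "Why is the sky blue?"

Code parameters

To use the PredictionServiceClient code samples, you must update the following values.

PROJECT_ID: To find your project ID, follow these steps.
1. In the Google Cloud console, go to the Welcome page.
  Go to Welcome
2. From the project picker at the top of the page, select your project.
  
  The project name, project number, and project ID appear after the Welcome heading.
ENDPOINT_REGION: The region where you deployed the endpoint.
ENDPOINT_ID: To find your endpoint ID, view it in the console or run the gcloud ai endpoints list command. You need the endpoint name and region from the Deploy model pane.
Console
To view the endpoint details, click Online prediction > Endpoints and select your region. Note the number that appears in the ID column.

Go to Endpoints
gcloud
To view the endpoint details, run the gcloud ai endpoints list command.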
```
gcloud ai endpoints list \
  --region=ENDPOINT_REGION \
  --filter=display_name=ENDPOINT_NAME
```
The output looks like this.
```
Using endpoint [https://us-central1-aiplatform.googleapis.com/]
ENDPOINT_ID: 1234567891234567891
DISPLAY_NAME: gemma2-2b-it-mg-one-click-deploy
```
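To pick up ENDPOINT_ID in a script, the `KEY: value` lines in this output can be parsed directly. The following is a rough sketch that assumes the layout shown above. The function name `parse_endpoint_list` is hypothetical. For scripting, the gcloud CLI's global `--format=json` flag is usually a more robust option than parsing the human-readable output.

```python
import re

SAMPLE_OUTPUT = """\
Using endpoint [https://us-central1-aiplatform.googleapis.com/]
ENDPOINT_ID: 1234567891234567891
DISPLAY_NAME: gemma2-2b-it-mg-one-click-deploy
"""

_FIELD = re.compile(r"^([A-Z_]+): (.*)$")


def parse_endpoint_list(output: str) -> list[dict]:
    """Parse 'KEY: value' lines into one dict per endpoint."""
    endpoints = []
    for line in output.splitlines():
        match = _FIELD.match(line.strip())
        if match is None:
            continue  # Skip banner lines such as "Using endpoint [...]".
        key, value = match.groups()
        if key == "ENDPOINT_ID":
            endpoints.append({})  # ENDPOINT_ID starts a new record.
        if endpoints:
            endpoints[-1][key] = value
    return endpoints


print(parse_endpoint_list(SAMPLE_OUTPUT))
```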

Sample code

In the sample code for your language, update PROJECT_ID, ENDPOINT_REGION, and ENDPOINT_ID. Then run the code.

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

"""
Sample to run inference on a Gemma2 model deployed to a Vertex AI endpoint with GPU accelerators.
"""

from google.cloud import aiplatform
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

# TODO(developer): Update & uncomment lines below
# PROJECT_ID = "your-project-id"
# ENDPOINT_REGION = "your-vertex-endpoint-region"
# ENDPOINT_ID = "your-vertex-endpoint-id"

# Default configuration
config = {"max_tokens": 1024, "temperature": 0.9, "top_p": 1.0, "top_k": 1}

# Prompt used in the prediction
prompt = "Why is the sky blue?"

# Encapsulate the prompt in a correct format for GPUs
# Example format: [{'inputs': 'Why is the sky blue?', 'parameters': {'temperature': 0.9}}]
# Named input_data so that it doesn't shadow the built-in input() function.
input_data = {"inputs": prompt, "parameters": config}

# Convert input message to a list of GAPIC instances for model input
instances = [json_format.ParseDict(input_data, Value())]

# Create a client
api_endpoint = f"{ENDPOINT_REGION}-aiplatform.googleapis.com"
client = aiplatform.gapic.PredictionServiceClient(
    client_options={"api_endpoint": api_endpoint}
)

# Call the Gemma2 endpoint
gemma2_end_point = (
    f"projects/{PROJECT_ID}/locations/{ENDPOINT_REGION}/endpoints/{ENDPOINT_ID}"
)
response = client.predict(
    endpoint=gemma2_end_point,
    instances=instances,
)
text_responses = response.predictions
print(text_responses[0])
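For comparison with the client-library call, the following sketch builds the equivalent REST `:predict` URL and JSON body without sending anything. It assumes the regional `aiplatform.googleapis.com` hostname that the sample uses and the same `inputs` and `parameters` instance format. The function names and the placeholder project and endpoint IDs are hypothetical.

```python
import json


def endpoint_path(project_id: str, region: str, endpoint_id: str) -> str:
    """Build the endpoint resource name used by the predict call."""
    return f"projects/{project_id}/locations/{region}/endpoints/{endpoint_id}"


def predict_request(project_id, region, endpoint_id, prompt, **parameters):
    """Return the (url, body) pair for an online prediction request."""
    path = endpoint_path(project_id, region, endpoint_id)
    url = f"https://{region}-aiplatform.googleapis.com/v1/{path}:predict"
    # One instance per prompt; the parameters travel with the instance.
    body = {"instances": [{"inputs": prompt, "parameters": parameters}]}
    return url, body


url, body = predict_request(
    "my-project", "us-central1", "1234567891234567891",
    "Why is the sky blue?",
    max_tokens=1024, temperature=0.9,
)
print(url)
print(json.dumps(body, indent=2))
```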

Node.js

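Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.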

async function gemma2PredictGpu(predictionServiceClient) {
  // Imports the Google Cloud Prediction Service Client library
  const {
    // TODO(developer): Uncomment PredictionServiceClient before running the sample.
    // PredictionServiceClient,
    helpers,
  } = require('@google-cloud/aiplatform');
  /**
   * TODO(developer): Update these variables before running the sample.
   */
  const projectId = 'your-project-id';
  const endpointRegion = 'your-vertex-endpoint-region';
  const endpointId = 'your-vertex-endpoint-id';

  // Default configuration
  const config = {maxOutputTokens: 1024, temperature: 0.9, topP: 1.0, topK: 1};
  // Prompt used in the prediction
  const prompt = 'Why is the sky blue?';

  // Encapsulate the prompt in a correct format for GPUs
  // Example format: [{inputs: 'Why is the sky blue?', parameters: {temperature: 0.9}}]
  const input = {
    inputs: prompt,
    parameters: config,
  };

  // Convert input message to a list of GAPIC instances for model input
  const instances = [helpers.toValue(input)];

  // TODO(developer): Uncomment apiEndpoint and predictionServiceClient before running the sample.
  // const apiEndpoint = `${endpointRegion}-aiplatform.googleapis.com`;

  // Create a client
  // predictionServiceClient = new PredictionServiceClient({apiEndpoint});

  // Call the Gemma2 endpoint
  const gemma2Endpoint = `projects/${projectId}/locations/${endpointRegion}/endpoints/${endpointId}`;

  const [response] = await predictionServiceClient.predict({
    endpoint: gemma2Endpoint,
    instances,
  });

  const predictions = response.predictions;
  const text = predictions[0].stringValue;

  console.log('Predictions:', text);
  return text;
}

module.exports = gemma2PredictGpu;

// TODO(developer): Uncomment below lines before running the sample.
// gemma2PredictGpu(...process.argv.slice(2)).catch(err => {
//   console.error(err.message);
//   process.exitCode = 1;
// });

Java

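Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.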


import com.google.cloud.aiplatform.v1.EndpointName;
import com.google.cloud.aiplatform.v1.PredictResponse;
import com.google.cloud.aiplatform.v1.PredictionServiceClient;
import com.google.cloud.aiplatform.v1.PredictionServiceSettings;
import com.google.gson.Gson;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Gemma2PredictGpu {

  private final PredictionServiceClient predictionServiceClient;

  // Constructor to inject the PredictionServiceClient
  public Gemma2PredictGpu(PredictionServiceClient predictionServiceClient) {
    this.predictionServiceClient = predictionServiceClient;
  }

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String endpointRegion = "us-east4";
    String endpointId = "YOUR_ENDPOINT_ID";

    PredictionServiceSettings predictionServiceSettings =
        PredictionServiceSettings.newBuilder()
            .setEndpoint(String.format("%s-aiplatform.googleapis.com:443", endpointRegion))
            .build();
    PredictionServiceClient predictionServiceClient =
        PredictionServiceClient.create(predictionServiceSettings);
    Gemma2PredictGpu creator = new Gemma2PredictGpu(predictionServiceClient);

    creator.gemma2PredictGpu(projectId, endpointRegion, endpointId);
  }

  // Demonstrates how to run inference on a Gemma2 model
  // deployed to a Vertex AI endpoint with GPU accelerators.
  public String gemma2PredictGpu(String projectId, String region,
               String endpointId) throws IOException {
    Map<String, Object> paramsMap = new HashMap<>();
    paramsMap.put("temperature", 0.9);
    paramsMap.put("maxOutputTokens", 1024);
    paramsMap.put("topP", 1.0);
    paramsMap.put("topK", 1);
    Value parameters = mapToValue(paramsMap);

    // Prompt used in the prediction
    String instance = "{ \"inputs\": \"Why is the sky blue?\"}";
    Value.Builder instanceValue = Value.newBuilder();
    JsonFormat.parser().merge(instance, instanceValue);
    // Encapsulate the prompt in the instance format expected by GPU endpoints.
    // Here the instance holds only the prompt, e.g. [{'inputs': 'Why is the sky blue?'}];
    // the generation parameters are passed separately in the predict call below.
    List<Value> instances = new ArrayList<>();
    instances.add(instanceValue.build());

    EndpointName endpointName = EndpointName.of(projectId, region, endpointId);

    PredictResponse predictResponse = this.predictionServiceClient
        .predict(endpointName, instances, parameters);
    String textResponse = predictResponse.getPredictions(0).getStringValue();
    System.out.println(textResponse);
    return textResponse;
  }

  private static Value mapToValue(Map<String, Object> map) throws InvalidProtocolBufferException {
    Gson gson = new Gson();
    String json = gson.toJson(map);
    Value.Builder builder = Value.newBuilder();
    JsonFormat.parser().merge(json, builder);
    return builder.build();
  }
}

Go

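Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.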

import (
	"context"
	"fmt"
	"io"

	"cloud.google.com/go/aiplatform/apiv1/aiplatformpb"

	"google.golang.org/protobuf/types/known/structpb"
)

// predictGPU demonstrates how to run inference on a Gemma2 model deployed to a Vertex AI endpoint with GPU accelerators.
func predictGPU(w io.Writer, client PredictionsClient, projectID, location, endpointID string) error {
	ctx := context.Background()

	// Note: client can be initialized in the following way:
	// apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	// client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
	// if err != nil {
	// 	return fmt.Errorf("unable to create prediction client: %v", err)
	// }
	// defer client.Close()

	gemma2Endpoint := fmt.Sprintf("projects/%s/locations/%s/endpoints/%s", projectID, location, endpointID)
	prompt := "Why is the sky blue?"
	parameters := map[string]interface{}{
		"temperature":     0.9,
		"maxOutputTokens": 1024,
		"topP":            1.0,
		"topK":            1,
	}

	// Encapsulate the prompt in a correct format for GPUs.
	// Pay attention that prompt should be set in "inputs" field.
	// Example format: [{'inputs': 'Why is the sky blue?', 'parameters': {'temperature': 0.9}}]
	promptValue, err := structpb.NewValue(map[string]interface{}{
		"inputs":     prompt,
		"parameters": parameters,
	})
	if err != nil {
		fmt.Fprintf(w, "unable to convert prompt to Value: %v", err)
		return err
	}

	req := &aiplatformpb.PredictRequest{
		Endpoint:  gemma2Endpoint,
		Instances: []*structpb.Value{promptValue},
	}

	resp, err := client.Predict(ctx, req)
	if err != nil {
		return err
	}

	prediction := resp.GetPredictions()
	value := prediction[0].GetStringValue()
	fmt.Fprintf(w, "%v", value)

	return nil
}
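
The Node.js and Go samples above send the same request shape: a fully qualified endpoint resource name plus a list of instances, where each instance carries the prompt in "inputs" and the generation settings in "parameters". The following is a minimal Python sketch of that shape using only the standard library. It makes no network call, and the helper names (endpoint_path, build_instance, build_predict_body) are hypothetical names chosen for illustration, not part of any Vertex AI SDK.

```python
import json


def endpoint_path(project_id: str, region: str, endpoint_id: str) -> str:
    """Return the endpoint resource name in the form used by the samples."""
    return f"projects/{project_id}/locations/{region}/endpoints/{endpoint_id}"


def build_instance(prompt: str, **parameters) -> dict:
    """Wrap a prompt and its generation parameters in a single instance."""
    return {"inputs": prompt, "parameters": parameters}


def build_predict_body(prompt: str, **parameters) -> str:
    """Serialize a one-instance request body as JSON (illustrative only)."""
    return json.dumps({"instances": [build_instance(prompt, **parameters)]})


if __name__ == "__main__":
    # Placeholder values; replace them with your own project and endpoint.
    print(endpoint_path("your-project-id", "us-east4", "your-endpoint-id"))
    print(
        build_predict_body(
            "Why is the sky blue?",
            temperature=0.9,
            maxOutputTokens=1024,
            topP=1.0,
            topK=1,
        )
    )
```

Keeping the payload construction in small pure functions like these can make it easier to unit test the request format separately from the client call.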

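Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.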

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

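Delete individual resources

If you're keeping your project, delete the resources used in this tutorial:

Undeploy the model and delete the endpoint
Delete the model from Model Registry

Undeploy the model and delete the endpoint

Use one of the following methods to undeploy the model and delete the endpoint.

Console

In the Google Cloud console, click Online prediction, and then click Endpoints.

Go to the Endpoints page
In the Region drop-down list, choose the region where you deployed your endpoint.
Click the endpoint name to open the details page. For example: gemma2-2b-it-mg-one-click-deploy.
On the row for the Gemma 2 (Version 1) model, click Actions, and then click Undeploy model from endpoint.
In the Undeploy model from endpoint dialog, click Undeploy.
Click the Back button to return to the Endpoints page.

Go to the Endpoints page
At the end of the gemma2-2b-it-mg-one-click-deploy row, click Actions, and then select Delete endpoint.
In the confirmation prompt, click Confirm.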

gcloud

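To undeploy the model and delete the endpoint using the Google Cloud CLI, follow these steps.

In these commands, replace the following:

PROJECT_ID with your project ID.
LOCATION_ID with the region where you deployed the model and endpoint.
ENDPOINT_ID with the endpoint ID.
DEPLOYED_MODEL_NAME with the model's display name.
DEPLOYED_MODEL_ID with the deployed model ID.

Run the gcloud ai endpoints list command to get the endpoint ID. This command lists the endpoint IDs for all endpoints in your project. Make a note of the ID of the endpoint used in this tutorial.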
```
gcloud ai endpoints list \
    --project=PROJECT_ID \
    --region=LOCATION_ID
```
The output looks like the following. In the output, the ID is shown as ENDPOINT_ID.
```
Using endpoint [https://us-central1-aiplatform.googleapis.com/]
ENDPOINT_ID: 1234567891234567891
DISPLAY_NAME: gemma2-2b-it-mg-one-click-deploy
```
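
If you script the cleanup, the endpoint ID can be picked out of this listing by display name. The following is a minimal sketch that parses text laid out exactly like the sample output above. The function name parse_endpoint_list is hypothetical, and the layout of real gcloud output can vary, so a machine-readable format such as the one produced by the global --format=json flag is likely a sturdier choice for real scripts.

```python
SAMPLE_OUTPUT = """\
Using endpoint [https://us-central1-aiplatform.googleapis.com/]
ENDPOINT_ID: 1234567891234567891
DISPLAY_NAME: gemma2-2b-it-mg-one-click-deploy
"""


def parse_endpoint_list(output: str) -> dict:
    """Map each DISPLAY_NAME to the ENDPOINT_ID that precedes it."""
    endpoints = {}
    current_id = None
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # Skip lines without a "KEY: value" shape.
        key, value = key.strip(), value.strip()
        if key == "ENDPOINT_ID":
            current_id = value
        elif key == "DISPLAY_NAME" and current_id is not None:
            endpoints[value] = current_id
            current_id = None
    return endpoints


if __name__ == "__main__":
    ids = parse_endpoint_list(SAMPLE_OUTPUT)
    print(ids["gemma2-2b-it-mg-one-click-deploy"])
```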

Run the gcloud ai models describe command to get the model ID. Make a note of the ID of the model you deployed in this tutorial.

gcloud ai models describe DEPLOYED_MODEL_NAME \
    --project=PROJECT_ID \
    --region=LOCATION_ID

The abbreviated output looks like the following. In the output, the ID is shown as deployedModelId.

Using endpoint [https://us-central1-aiplatform.googleapis.com/]
artifactUri: [URI removed]
baseModelSource:
  modelGardenSource:
    publicModelName: publishers/google/models/gemma2
...
deployedModels:
- deployedModelId: '1234567891234567891'
  endpoint: projects/12345678912/locations/us-central1/endpoints/12345678912345
displayName: gemma2-2b-it-12345678912345
etag: [ETag removed]
modelSourceInfo:
  sourceType: MODEL_GARDEN
name: projects/123456789123/locations/us-central1/models/gemma2-2b-it-12345678912345
...
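
The value needed for the undeploy step is the deployedModelId entry under deployedModels. The following is a minimal sketch that extracts it from text shaped like the abbreviated output above using a regular expression. The function name deployed_model_ids is hypothetical, and the sketch assumes the ID is a run of digits, optionally wrapped in single quotes, as in the sample.

```python
import re

SAMPLE_DESCRIBE = """\
deployedModels:
- deployedModelId: '1234567891234567891'
  endpoint: projects/12345678912/locations/us-central1/endpoints/12345678912345
displayName: gemma2-2b-it-12345678912345
"""

# Matches "deployedModelId: 123" with or without surrounding single quotes.
_DEPLOYED_MODEL_RE = re.compile(r"deployedModelId:\s*'?(\d+)'?")


def deployed_model_ids(describe_output: str) -> list:
    """Return every deployedModelId found in the describe output, in order."""
    return _DEPLOYED_MODEL_RE.findall(describe_output)


if __name__ == "__main__":
    print(deployed_model_ids(SAMPLE_DESCRIBE))
```

A model deployed to more than one endpoint would yield several IDs, so the sketch returns a list rather than a single value.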

Undeploy the model from the endpoint. You need the endpoint ID and the deployed model ID from the previous commands.
```
gcloud ai endpoints undeploy-model ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION_ID \
    --deployed-model-id=DEPLOYED_MODEL_ID
```
This command produces no output.
Run the gcloud ai endpoints delete command to delete the endpoint.
```
gcloud ai endpoints delete ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION_ID
```
When prompted, type y to confirm. This command produces no output.
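
The two cleanup commands above run in a fixed order: undeploy the model first, then delete the endpoint. The following is a minimal sketch that assembles both commands as argument lists from the same placeholders. The function name cleanup_commands is hypothetical, and the sketch only prints the commands; it does not run them.

```python
import shlex


def cleanup_commands(project_id, location_id, endpoint_id, deployed_model_id):
    """Return the undeploy and delete commands, in the order the tutorial runs them."""
    common = [f"--project={project_id}", f"--region={location_id}"]
    undeploy = [
        "gcloud", "ai", "endpoints", "undeploy-model", endpoint_id,
        *common,
        f"--deployed-model-id={deployed_model_id}",
    ]
    delete = ["gcloud", "ai", "endpoints", "delete", endpoint_id, *common]
    return [undeploy, delete]


if __name__ == "__main__":
    # Placeholder values taken from the sample output in this tutorial.
    for argv in cleanup_commands(
        "PROJECT_ID", "us-central1", "1234567891234567891", "1234567891234567891"
    ):
        print(shlex.join(argv))
```

To execute the commands instead of printing them, each argument list could be passed to subprocess.run(argv, check=True). Note that the delete command asks for confirmation, as described above.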

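Delete the model

Console

Go to the Model Registry page from the Vertex AI section in the Google Cloud console.

Go to the Model Registry page
In the Region drop-down list, choose the region where you deployed your model.
At the end of the gemma2-2b-it-1234567891234 row, click Actions.
Select Delete model.

When you delete the model, all associated model versions and evaluations are deleted from your Google Cloud project.
In the confirmation prompt, click Delete.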

gcloud

To delete the model using the Google Cloud CLI, provide the model's display name and region to the gcloud ai models delete command.

gcloud ai models delete DEPLOYED_MODEL_NAME \
    --project=PROJECT_ID \
    --region=LOCATION_ID

Replace DEPLOYED_MODEL_NAME with the model's display name. Replace PROJECT_ID with your project ID. Replace LOCATION_ID with the region where you deployed the model.

What's next

Learn more about Gemma open models
Read the Gemma Terms of Use
Learn more about open models
Learn how to deploy a tuned model
Learn how to deploy Gemma 2 to Google Kubernetes Engine using Hugging Face Text Generation Inference (TGI)
Learn more about the PredictionServiceClient in your preferred language: Python, Node.js, Java, or Go
