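Serve the DeepSeek-V3 model using multi-host GPU deployment

Overview

Gemini Enterprise Agent Platform supports multi-host GPU deployment for serving models that exceed the memory capacity of a single GPU node, such as DeepSeek-V3, DeepSeek-R1, and Meta Llama 3.1 405B (non-quantized version).

This guide describes how to serve the DeepSeek-V3 model on Gemini Enterprise Agent Platform by using vLLM across multi-host graphics processing units (GPUs). The setup for other models is similar. For more information, see vLLM serving for text and multimodal language models.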

Before you begin, note the following:

Use the pricing calculator to estimate costs based on your projected usage.

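Containers

To support multi-host deployments, this guide uses a prebuilt vLLM container image with Ray integration from Model Garden. Ray provides the distributed processing required to run a model across multiple GPU nodes.

If you prefer, you can build your own vLLM multi-node image. The custom container image must be compatible with Gemini Enterprise Agent Platform.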

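Before you begin

Before you start deploying the model, complete the prerequisites listed in this section.

Set up a Google Cloud project

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how Google products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.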

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Agent Platform API.

Roles required to enable APIs

To enable APIs, you need the serviceusage.services.enable permission. If you created the project, then you likely already have this permission through the Owner role (roles/owner). Otherwise, you can get this permission through the Service Usage Admin role (roles/serviceusage.serviceUsageAdmin). Learn how to grant roles.

Enable the API

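In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

A Cloud Shell session starts at the bottom of the Google Cloud console and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. The session can take a few seconds to initialize.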

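Request GPU quota

To deploy DeepSeek-V3, you need two a3-highgpu-8g VMs with eight H100 GPUs each, for a total of 16 H100 GPUs. Because the default quota value is less than 16, you will likely need to request an H100 GPU quota increase.

To view your H100 GPU quota, go to the Quotas & System Limits page in the Google Cloud console.

Go to Quotas & System Limits
Request a quota adjustment.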
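
The GPU arithmetic above can be checked with a few lines of shell before you file the quota request. This is only a sketch: `NODE_COUNT` and `GPUS_PER_NODE` come from this guide, while `CURRENT_QUOTA` is a hypothetical placeholder that you would replace with the value shown on the Quotas & System Limits page.

```shell
# Values from this guide: two a3-highgpu-8g VMs with eight H100 GPUs each.
NODE_COUNT=2
GPUS_PER_NODE=8
REQUIRED_GPUS=$((NODE_COUNT * GPUS_PER_NODE))

# Hypothetical placeholder; replace with the quota shown in the console.
CURRENT_QUOTA=8

if [ "$CURRENT_QUOTA" -lt "$REQUIRED_GPUS" ]; then
  SHORTFALL=$((REQUIRED_GPUS - CURRENT_QUOTA))
  echo "Request at least ${SHORTFALL} more H100 GPUs (need ${REQUIRED_GPUS}, have ${CURRENT_QUOTA})."
else
  SHORTFALL=0
  echo "Quota is sufficient for ${REQUIRED_GPUS} H100 GPUs."
fi
```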

Upload the model

To upload your model as a Model resource to Gemini Enterprise Agent Platform, run the gcloud ai models upload command as follows:

gcloud ai models upload \
    --region=LOCATION \
    --project=PROJECT_ID \
    --display-name=MODEL_DISPLAY_NAME \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250312_0916_RC01 \
    --container-args='^;^/vllm-workspace/ray_launcher.sh;python;-m;vllm.entrypoints.api_server;--host=0.0.0.0;--port=7080;--model=deepseek-ai/DeepSeek-V3;--tensor-parallel-size=8;--pipeline-parallel-size=2;--gpu-memory-utilization=0.82;--max-model-len=163840;--max-num-seqs=64;--enable-chunked-prefill;--kv-cache-dtype=auto;--trust-remote-code;--disable-log-requests' \
    --container-deployment-timeout-seconds=7200 \
    --container-ports=7080 \
    --container-env-vars=MODEL_ID=deepseek-ai/DeepSeek-V3

Replace the following:

LOCATION: the region where you are using Gemini Enterprise Agent Platform
PROJECT_ID: the ID of your Google Cloud project
MODEL_DISPLAY_NAME: the display name you want for your model
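
The `--container-args` value in the upload command begins with `^;^`, which is the gcloud CLI's alternate-delimiter syntax: it makes `;` the list separator instead of `,`. As a sketch of how that long string could be assembled from one argument per line (the `CONTAINER_ARGS` variable is an illustrative name, not something the command requires), the following joins a shortened argument list with semicolons. Note that the tensor-parallel size (8) multiplied by the pipeline-parallel size (2) matches the 16 GPUs across the two nodes.

```shell
# Put one container argument per positional parameter.
# This is a shortened list; add the remaining flags from the upload command.
set -- \
  /vllm-workspace/ray_launcher.sh \
  python -m vllm.entrypoints.api_server \
  --host=0.0.0.0 \
  --port=7080 \
  --model=deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size=8 \
  --pipeline-parallel-size=2 \
  --trust-remote-code

# "$*" joins the positional parameters with the first character of IFS.
OLD_IFS=$IFS
IFS=';'
CONTAINER_ARGS="^;^$*"
IFS=$OLD_IFS

echo "$CONTAINER_ARGS"
```

You could then pass the result as `--container-args="$CONTAINER_ARGS"` in place of the inline string.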

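Create a dedicated online inference endpoint

To support chat completion requests, the Model Garden container requires a dedicated endpoint. Dedicated endpoints are in Preview and don't support the Google Cloud CLI, so you need to use the REST API to create the endpoint.

To create the dedicated endpoint, run the following command: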

PROJECT_ID=PROJECT_ID
REGION=LOCATION
ENDPOINT="${REGION}-aiplatform.googleapis.com"

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints \
  -d '{
    "displayName": "ENDPOINT_DISPLAY_NAME",
    "dedicatedEndpointEnabled": true
    }'

Replace the following:

ENDPOINT_DISPLAY_NAME: the display name for your endpoint
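
Before sending the request, it can help to print the URL and JSON body that the curl command will use, so that a wrong region or project ID is caught early. The following dry-run sketch uses hypothetical example values (`my-project`, `us-central1`, `deepseek-v3-endpoint`) and introduces the illustrative variables `CREATE_URL` and `REQUEST_BODY`; it makes no network call.

```shell
# Hypothetical example values; substitute your own.
PROJECT_ID="my-project"
REGION="us-central1"
ENDPOINT_DISPLAY_NAME="deepseek-v3-endpoint"

# Same URL pattern as the curl command above.
ENDPOINT="${REGION}-aiplatform.googleapis.com"
CREATE_URL="https://${ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints"

# Same request body as above, with the display name filled in.
REQUEST_BODY=$(printf '{"displayName": "%s", "dedicatedEndpointEnabled": true}' "$ENDPOINT_DISPLAY_NAME")

echo "POST ${CREATE_URL}"
echo "${REQUEST_BODY}"
```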

Deploy the model

Get the endpoint ID for the online inference endpoint by running the gcloud ai endpoints list command:

ENDPOINT_ID=$(gcloud ai endpoints list \
 --project=PROJECT_ID \
 --region=LOCATION \
 --filter=display_name~'ENDPOINT_DISPLAY_NAME' \
 --format="value(name)")

Get the model ID for your model by running the gcloud ai models list command:

MODEL_ID=$(gcloud ai models list \
 --project=PROJECT_ID \
 --region=LOCATION \
 --filter=display_name~'MODEL_DISPLAY_NAME' \
 --format="value(name)")

Deploy the model to the endpoint by running the gcloud alpha ai endpoints deploy-model command:
```
gcloud alpha ai endpoints deploy-model $ENDPOINT_ID \
 --project=PROJECT_ID \
 --region=LOCATION \
 --model=$MODEL_ID \
 --display-name="DEPLOYED_MODEL_NAME" \
 --machine-type=a3-highgpu-8g \
 --traffic-split=0=100 \
 --accelerator=type=nvidia-h100-80gb,count=8 \
 --multihost-gpu-node-count=2
```
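Replace DEPLOYED_MODEL_NAME with a name for the deployed model. This can be the same as the model display name (MODEL_DISPLAY_NAME).

Deploying large models such as DeepSeek-V3 can take longer than the default deployment timeout. If the deploy-model command times out, the deployment process continues to run in the background.

The deploy-model command returns an operation ID that you can use to check when the operation finishes. You can poll the status of the operation until the response includes "done": true. Use the following command to poll the status: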
```
gcloud ai operations describe \
--region=LOCATION \
OPERATION_ID
```
Replace OPERATION_ID with the operation ID that was returned by the previous command.
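
Polling by hand gets tedious for a deployment that can run for a long time. The following is a minimal sketch of a retry loop, not an official tool: `poll_until_done` is a hypothetical helper that reruns whatever status command you pass it and stops once the output contains `"done": true`. It assumes the status command prints JSON, for example by adding the global `--format=json` flag to gcloud ai operations describe.

```shell
poll_until_done() {
  # $1: maximum number of attempts
  # $2: seconds to sleep between attempts
  # remaining arguments: the command that prints the operation status
  max_attempts=$1
  delay=$2
  shift 2
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    # Succeed as soon as the status output reports completion.
    if "$@" | grep -Eq '"done":[[:space:]]*true'; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  # Gave up before the operation reported completion.
  return 1
}

# Example usage (not run here): check once a minute for up to two hours.
# poll_until_done 120 60 gcloud ai operations describe \
#   --region=LOCATION --format=json OPERATION_ID
```

The helper returns a nonzero status if the attempts run out, so it can gate later steps in a script.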

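Get online inferences from the deployed model

This section describes how to send an online inference request to the dedicated public endpoint where the DeepSeek-V3 model is deployed.

Get the project number by running the gcloud projects describe command: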

PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")

Send a raw prediction request:

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}:rawPredict \
-d '{
   "prompt": "Write a short story about a robot.",
   "stream": false,
   "max_tokens": 50,
   "temperature": 0.7
   }'
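
The dedicated endpoint URL embeds the endpoint ID twice, once in the host name and once in the path, so the ID must be the bare numeric ID. If your `ENDPOINT_ID` variable holds a full resource name of the form `projects/.../endpoints/ID` (this depends on how the ID was retrieved, so treat it as an assumption to verify), the following sketch strips it down before building the URLs. All values are hypothetical examples, and `ENDPOINT_SHORT_ID`, `BASE_URL`, `RAW_PREDICT_URL`, and `CHAT_URL` are illustrative names.

```shell
# Hypothetical example values.
ENDPOINT_ID="projects/123456789/locations/us-central1/endpoints/9876543210"
REGION="us-central1"
PROJECT_NUMBER="123456789"

# Remove everything up to the last '/'. A bare ID is left unchanged.
ENDPOINT_SHORT_ID="${ENDPOINT_ID##*/}"

# Same URL pattern as the curl commands in this section.
BASE_URL="https://${ENDPOINT_SHORT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_SHORT_ID}"
RAW_PREDICT_URL="${BASE_URL}:rawPredict"
CHAT_URL="${BASE_URL}/chat/completions"

echo "$RAW_PREDICT_URL"
echo "$CHAT_URL"
```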

Send a chat completion request:

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}/chat/completions \
-d '{"stream":false, "messages":[{"role": "user", "content": "Summer travel plan to Paris"}], "max_tokens": 40,"temperature":0.4,"top_k":10,"top_p":0.95, "n":1}'

To enable streaming, change the value of "stream" from false to true.
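
One way to switch between the two modes without editing JSON by hand is to generate the request body from a small function. In this sketch, `chat_payload` is a hypothetical helper. It does no JSON escaping, so it assumes the prompt contains no double quotes or backslashes.

```shell
chat_payload() {
  # $1: "true" or "false" for the stream flag
  # $2: prompt text (assumed to need no JSON escaping)
  printf '{"stream":%s,"messages":[{"role":"user","content":"%s"}],"max_tokens":40,"temperature":0.4}' "$1" "$2"
}

STREAM_BODY=$(chat_payload true "Summer travel plan to Paris")
PLAIN_BODY=$(chat_payload false "Summer travel plan to Paris")

echo "$STREAM_BODY"
```

You could pass the result to the chat completion request with `-d "$STREAM_BODY"`. When streaming, adding curl's `-N` (`--no-buffer`) option prints chunks as they arrive.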

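Clean up

To avoid incurring further Gemini Enterprise Agent Platform charges, delete the Google Cloud resources that you created during this tutorial.

To undeploy the model from the endpoint and delete the endpoint, run the following commands: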

ENDPOINT_ID=$(gcloud ai endpoints list \
   --region=LOCATION \
   --filter=display_name=ENDPOINT_DISPLAY_NAME \
   --format="value(name)")

DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
   --region=LOCATION \
   --format="value(deployedModels.id)")

gcloud ai endpoints undeploy-model $ENDPOINT_ID \
  --region=LOCATION \
  --deployed-model-id=$DEPLOYED_MODEL_ID

gcloud ai endpoints delete $ENDPOINT_ID \
   --region=LOCATION \
   --quiet
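
The commands above assume the endpoint has exactly one deployed model. If more than one model is deployed, `DEPLOYED_MODEL_ID` holds several IDs in a single string. The following sketch assumes they are separated by semicolons (verify this against your own output) and prints one undeploy command per ID as a dry run. `undeploy_commands` is a hypothetical helper.

```shell
undeploy_commands() {
  # $1: endpoint ID, $2: region, $3: deployed model IDs separated by ';'
  old_ifs=$IFS
  IFS=';'
  # Unquoted $3 is split on ';' because of the IFS setting above.
  for deployed_id in $3; do
    echo "gcloud ai endpoints undeploy-model $1 --region=$2 --deployed-model-id=$deployed_id"
  done
  IFS=$old_ifs
}

# Example with hypothetical IDs; prints two commands without running them.
undeploy_commands 9876543210 us-central1 "1111;2222"
```

Review the printed commands, then run them before deleting the endpoint.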

To delete the model, run the following commands:

MODEL_ID=$(gcloud ai models list \
   --region=LOCATION \
   --filter=display_name=MODEL_DISPLAY_NAME \
   --format="value(name)")

gcloud ai models delete $MODEL_ID \
   --region=LOCATION \
   --quiet

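What's next

For comprehensive reference information about multi-host GPU deployment on Gemini Enterprise Agent Platform with vLLM, see vLLM serving for text and multimodal language models.
Learn how to create your own vLLM multi-node image. The custom container image must be compatible with Gemini Enterprise Agent Platform.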