Deploy open models with a custom vLLM container

To see an example of deploying Llama 3.2 3B with a custom vLLM container, run one of the following notebooks in the environment of your choice.

'Deploy Llama 3.2 3B on CPU with vLLM':
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
'Deploy Llama 3.2 3B on GPU with vLLM':
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
'Deploy Llama 3.2 3B on TPU with vLLM using GCS weights':
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
'Deploy Llama 3.2 3B on TPU with vLLM':
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub

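The built-in Vertex AI model serving options are sufficient for many use cases, but you might need to use your own container image to serve models on Vertex AI. This document describes how to use a vLLM custom container image to serve models on Vertex AI on CPUs, GPUs, or TPUs. For more information about the models that vLLM supports, see the vLLM documentation.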

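The vLLM API server implements the OpenAI API protocol, but it doesn't support the Vertex AI request and response format. Therefore, to get inferences from a model deployed to a Vertex AI prediction endpoint, you must use Vertex AI raw inference requests. For more information about the raw prediction methods in the Vertex AI Python SDK, see the Python SDK documentation.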
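To make the mismatch concrete, the following sketch contrasts the two request shapes. The OpenAI-style body is what the vLLM server reads at a route such as /v1/completions, while the standard Vertex AI prediction body wraps inputs in an instances list that the vLLM server doesn't parse. The helper names and the max_tokens value are illustrative assumptions, not part of either API.

```python
import json


def openai_completion_body(prompt, max_tokens=32, temperature=0.0):
    """Build an OpenAI-style completions body (sent unchanged by a raw request)."""
    return json.dumps(
        {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    )


def vertex_predict_body(prompt, max_tokens=32, temperature=0.0):
    """Build the standard Vertex AI envelope (instances plus parameters).

    Shown only for contrast: the vLLM server does not accept this envelope.
    """
    return json.dumps(
        {
            "instances": [{"prompt": prompt}],
            "parameters": {"max_tokens": max_tokens, "temperature": temperature},
        }
    )


raw_body = json.loads(openai_completion_body("Hello"))
wrapped_body = json.loads(vertex_predict_body("Hello"))
print(sorted(raw_body))      # top-level keys the vLLM server reads
print(sorted(wrapped_body))  # top-level keys of the Vertex AI envelope
```

Because a raw request forwards the body without wrapping it, the first shape reaches the vLLM server exactly as written.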

You can load models from either Hugging Face or Cloud Storage. This gives you flexibility: you can use the community-driven model hub (Hugging Face), or use the optimized data transfer and security features of Cloud Storage for internally managed or fine-tuned models.

vLLM downloads the model from Hugging Face when a Hugging Face access token is provided. Otherwise, vLLM assumes that the model is available on local disk. The custom container image lets Vertex AI download the model from Google Cloud in addition to Hugging Face.
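The decision described above can be sketched as a small classifier. This is a minimal illustration that assumes a Cloud Storage model is identified by a gs:// URI; resolve_model_source is a hypothetical helper, and the entrypoint script in the samples repository may implement this step differently.

```python
def resolve_model_source(model_id, hf_token=None):
    """Classify where a model would be loaded from (hypothetical helper)."""
    if model_id.startswith("gs://"):
        # Split "gs://bucket/path/to/model" into bucket and object prefix.
        bucket, _, prefix = model_id[len("gs://"):].partition("/")
        return {"source": "cloud_storage", "bucket": bucket, "prefix": prefix}
    if hf_token:
        # A token is available, so the model is fetched from Hugging Face.
        return {"source": "hugging_face", "repo": model_id}
    # No token and no gs:// URI: treat the identifier as a local path.
    return {"source": "local_disk", "path": model_id}


print(resolve_model_source("gs://my-bucket/my-models/llama_3_2_3B"))
print(resolve_model_source("meta-llama/Llama-3.2-3B", hf_token="example-token"))
```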

Before you begin

In your Google Cloud project, enable the Vertex AI and Artifact Registry APIs.

gcloud services enable aiplatform.googleapis.com \
    artifactregistry.googleapis.com

Configure the Google Cloud CLI with your project ID and initialize the Vertex AI SDK.

import vertexai

PROJECT_ID = "PROJECT_ID"
LOCATION = "LOCATION"
vertexai.init(project=PROJECT_ID, location=LOCATION)

gcloud config set project PROJECT_ID

Create a Docker repository in Artifact Registry.

gcloud artifacts repositories create DOCKER_REPOSITORY \
    --repository-format=docker \
    --location=LOCATION \
    --description="Vertex AI Docker repository"

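Optional: If you download models from Hugging Face, get a Hugging Face token.
1. If you don't have a Hugging Face account, create one.
2. For gated models such as Llama 3.2, request access on Hugging Face and wait for approval before you continue.
3. Generate an access token: go to Your Profile > Settings > Access Tokens.
4. Select New Token.
5. Specify a name and a role with at least Read permission.
6. Select Generate a token.
7. Save this token for the deployment steps.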

Prepare the container build files

The following Dockerfile builds the vLLM custom container image for GPUs, TPUs, and CPUs. This custom container downloads models from Hugging Face or Cloud Storage.

ARG BASE_IMAGE
FROM ${BASE_IMAGE}

ENV DEBIAN_FRONTEND=noninteractive
# Install gcloud SDK
RUN apt-get update && \
    apt-get install -y apt-utils git apt-transport-https gnupg ca-certificates curl \
    && echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list \
    && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg \
    && apt-get update -y && apt-get install google-cloud-cli -y \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace/vllm

# Copy entrypoint.sh to the container
COPY ./entrypoint.sh /workspace/vllm/vertexai/entrypoint.sh
RUN chmod +x /workspace/vllm/vertexai/entrypoint.sh

ENTRYPOINT ["/workspace/vllm/vertexai/entrypoint.sh"]

Use Cloud Build to build the custom container image. The following cloudbuild.yaml configuration file shows how to build the image for multiple platforms from the same Dockerfile.

steps:
- name: 'gcr.io/cloud-builders/docker'
  automapSubstitutions: true
  script: |
      #!/usr/bin/env bash
      set -euo pipefail
      device_type_param=${_DEVICE_TYPE}
      device_type=${device_type_param,,}
      base_image=${_BASE_IMAGE}
      image_name="vllm-${device_type}"
      if [[ $device_type == "cpu" ]]; then
        echo "Quietly building open source vLLM CPU container image"
        git clone https://github.com/vllm-project/vllm.git
        cd vllm && DOCKER_BUILDKIT=1 docker build -t $base_image -f docker/Dockerfile.cpu . -q
        cd ..
      fi
      echo "Quietly building container image for: $device_type"
      docker build -t $LOCATION-docker.pkg.dev/$PROJECT_ID/${_REPOSITORY}/$image_name --build-arg BASE_IMAGE=$base_image . -q
      docker push $LOCATION-docker.pkg.dev/$PROJECT_ID/${_REPOSITORY}/$image_name
substitutions:
    _DEVICE_TYPE: gpu
    _BASE_IMAGE: vllm/vllm-openai
    _REPOSITORY: my-docker-repo

These files are available in the googlecloudplatform/vertex-ai-samples GitHub repository. To use them, clone the repository.

git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git

Build and push the container image

Submit the cloudbuild.yaml file to build the custom container image with Cloud Build. Use substitutions to specify the target device type (GPU, TPU, or CPU) and the corresponding base image.

GPU

DEVICE_TYPE="gpu"
BASE_IMAGE="vllm/vllm-openai"
cd vertex-ai-samples/notebooks/official/prediction/vertexai_serving_vllm/cloud-build && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --region=LOCATION \
    --timeout="2h" \
    --machine-type=e2-highcpu-32 \
    --substitutions=_REPOSITORY=DOCKER_REPOSITORY,_DEVICE_TYPE=$DEVICE_TYPE,_BASE_IMAGE=$BASE_IMAGE

TPU

DEVICE_TYPE="tpu"
BASE_IMAGE="vllm/vllm-tpu:nightly"
cd vertex-ai-samples/notebooks/official/prediction/vertexai_serving_vllm/cloud-build && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --region=LOCATION \
    --timeout="2h" \
    --machine-type=e2-highcpu-32 \
    --substitutions=_REPOSITORY=DOCKER_REPOSITORY,_DEVICE_TYPE=$DEVICE_TYPE,_BASE_IMAGE=$BASE_IMAGE

CPU

DEVICE_TYPE="cpu"
BASE_IMAGE="vllm-cpu-base"
cd vertex-ai-samples/notebooks/official/prediction/vertexai_serving_vllm/cloud-build && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --region=LOCATION \
    --timeout="2h" \
    --machine-type=e2-highcpu-32 \
    --substitutions=_REPOSITORY=DOCKER_REPOSITORY,_DEVICE_TYPE=$DEVICE_TYPE,_BASE_IMAGE=$BASE_IMAGE
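The three commands differ only in the device type and base image. As a minimal sketch, the following hypothetical helper derives the --substitutions value and the resulting image URI, assuming the vllm-<device type> naming used in cloudbuild.yaml. The base images come from the commands above; the function name and the sample project values are illustrative.

```python
# Base images taken from the build commands above.
BASE_IMAGES = {
    "gpu": "vllm/vllm-openai",
    "tpu": "vllm/vllm-tpu:nightly",
    "cpu": "vllm-cpu-base",
}


def build_plan(device_type, location, project_id, repository):
    """Return (image_uri, substitutions) for one device type (hypothetical helper)."""
    device = device_type.lower()  # Docker image names must be lowercase
    if device not in BASE_IMAGES:
        raise ValueError(f"Unsupported device type: {device_type}")
    image_uri = f"{location}-docker.pkg.dev/{project_id}/{repository}/vllm-{device}"
    substitutions = (
        f"_REPOSITORY={repository},"
        f"_DEVICE_TYPE={device},"
        f"_BASE_IMAGE={BASE_IMAGES[device]}"
    )
    return image_uri, substitutions


image_uri, substitutions = build_plan(
    "GPU", "us-central1", "my-project", "my-docker-repo"
)
print(image_uri)
print(substitutions)
```

The returned image URI is the value to use later when you set DOCKER_URI.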

After the build finishes, configure Docker to authenticate with Artifact Registry.

gcloud auth configure-docker LOCATION-docker.pkg.dev --quiet

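Upload the model to Model Registry and deploy it

Complete the following steps to upload the model to Vertex AI Model Registry, create an endpoint, and deploy the model. This example uses Llama 3.2 3B, but you can adapt it to other models.

Define the model and deployment variables. Set the DOCKER_URI variable to the image that you built in the previous step. For example, for GPU: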

DOCKER_URI = "LOCATION-docker.pkg.dev/PROJECT_ID/DOCKER_REPOSITORY/vllm-gpu"

Define variables for the Hugging Face token and the model properties. For example, for a GPU deployment:

hf_token = "your-hugging-face-auth-token"
model_name = "gpu-llama3_2_3B-serve-vllm"
model_id = "meta-llama/Llama-3.2-3B"
machine_type = "g2-standard-8"
accelerator_type = "NVIDIA_L4"
accelerator_count = 1

Upload the model to Model Registry. The upload_model function differs slightly by device type because the vLLM arguments and environment variables differ.

from google.cloud import aiplatform

def upload_model_gpu(model_name, model_id, hf_token, accelerator_count, docker_uri):
    vllm_args = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0", "--port=8080", f"--model={model_id}",
        "--max-model-len=2048", "--gpu-memory-utilization=0.9",
        "--enable-prefix-caching", f"--tensor-parallel-size={accelerator_count}",
    ]
    env_vars = {
        "HF_TOKEN": hf_token,
        "LD_LIBRARY_PATH": "$LD_LIBRARY_PATH:/usr/local/nvidia/lib64",
    }
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=docker_uri,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/v1/completions",
        serving_container_health_route="/health",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=1800,
    )
    return model

def upload_model_tpu(model_name, model_id, hf_token, tpu_count, docker_uri):
    vllm_args = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0", "--port=8080", f"--model={model_id}",
        "--max-model-len=2048", "--enable-prefix-caching",
        f"--tensor-parallel-size={tpu_count}",
    ]
    env_vars = {"HF_TOKEN": hf_token}
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=docker_uri,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/v1/completions",
        serving_container_health_route="/health",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=1800,
    )
    return model

def upload_model_cpu(model_name, model_id, hf_token, docker_uri):
    vllm_args = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0", "--port=8080", f"--model={model_id}",
        "--max-model-len=2048",
    ]
    env_vars = {"HF_TOKEN": hf_token}
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=docker_uri,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/v1/completions",
        serving_container_health_route="/health",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=1800,
    )
    return model

# Example for GPU:
vertexai_model = upload_model_gpu(model_name, model_id, hf_token, accelerator_count, DOCKER_URI)
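The three upload functions share most of their vLLM arguments. As a sketch of how those differences could be isolated, the following hypothetical helper builds the argument list for each device type using only the flags shown above. It doesn't call Vertex AI; its result would be passed as serving_container_args.

```python
def build_vllm_args(device_type, model_id, accelerator_count=1):
    """Build the vLLM server arguments for a device type (hypothetical helper)."""
    if device_type not in ("gpu", "tpu", "cpu"):
        raise ValueError(f"Unsupported device type: {device_type}")
    # Arguments common to every device type.
    args = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0", "--port=8080", f"--model={model_id}",
        "--max-model-len=2048",
    ]
    if device_type == "gpu":
        # Only the GPU variant caps GPU memory usage.
        args.append("--gpu-memory-utilization=0.9")
    if device_type in ("gpu", "tpu"):
        # Accelerator variants enable prefix caching and tensor parallelism.
        args += [
            "--enable-prefix-caching",
            f"--tensor-parallel-size={accelerator_count}",
        ]
    return args


for device in ("gpu", "tpu", "cpu"):
    print(device, build_vllm_args(device, "meta-llama/Llama-3.2-3B")[6:])
```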

Create an endpoint.

endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")

Deploy the model to the endpoint. Model deployment can take 20 to 30 minutes.

# Example for GPU:
vertexai_model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=model_name,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    traffic_percentage=100,
    deploy_request_timeout=1800,
    min_replica_count=1,
    max_replica_count=4,
    autoscaling_target_accelerator_duty_cycle=60,
)

For TPUs, omit the accelerator_type and accelerator_count parameters and use autoscaling_target_request_count_per_minute=60. For CPUs, omit the accelerator_type and accelerator_count parameters and use autoscaling_target_cpu_utilization=60.
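The per-device differences described above can be collected in one place. The following sketch uses a hypothetical build_deploy_kwargs helper that returns keyword arguments for the deploy method. The parameter names and values come from this document, and the machine type strings in the usage lines are placeholders.

```python
def build_deploy_kwargs(device_type, machine_type,
                        accelerator_type=None, accelerator_count=None):
    """Return deploy() keyword arguments for a device type (hypothetical helper)."""
    kwargs = {
        "machine_type": machine_type,
        "traffic_percentage": 100,
        "deploy_request_timeout": 1800,
        "min_replica_count": 1,
        "max_replica_count": 4,
    }
    if device_type == "gpu":
        # GPUs name the accelerator and scale on accelerator duty cycle.
        kwargs["accelerator_type"] = accelerator_type
        kwargs["accelerator_count"] = accelerator_count
        kwargs["autoscaling_target_accelerator_duty_cycle"] = 60
    elif device_type == "tpu":
        # TPUs omit the accelerator parameters and scale on request count.
        kwargs["autoscaling_target_request_count_per_minute"] = 60
    elif device_type == "cpu":
        # CPUs omit the accelerator parameters and scale on CPU utilization.
        kwargs["autoscaling_target_cpu_utilization"] = 60
    else:
        raise ValueError(f"Unsupported device type: {device_type}")
    return kwargs


gpu_kwargs = build_deploy_kwargs("gpu", "g2-standard-8", "NVIDIA_L4", 1)
tpu_kwargs = build_deploy_kwargs("tpu", "TPU_MACHINE_TYPE")
cpu_kwargs = build_deploy_kwargs("cpu", "CPU_MACHINE_TYPE")
print(sorted(tpu_kwargs))
```

You could then call vertexai_model.deploy(endpoint=endpoint, deployed_model_display_name=model_name, **gpu_kwargs).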

Load the model from Cloud Storage

The custom container can download the model from a Cloud Storage location instead of from Hugging Face. When you use Cloud Storage:

Set the model_id parameter of the upload_model function to a Cloud Storage URI, for example gs://my-bucket/my-models/llama_3_2_3B.
Omit the HF_TOKEN variable from env_vars when you call upload_model.
When you call model.deploy, specify a service_account that has read permission on the Cloud Storage bucket.
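As a sketch of the second point, the following hypothetical helper builds the container environment variables and leaves out HF_TOKEN whenever the model ID is a Cloud Storage URI. The extra parameter is an illustrative way to carry device-specific variables such as LD_LIBRARY_PATH.

```python
def build_env_vars(model_id, hf_token=None, extra=None):
    """Build serving container environment variables (hypothetical helper)."""
    env_vars = dict(extra or {})
    # A gs:// model is read from Cloud Storage, so no Hugging Face token is set.
    if not model_id.startswith("gs://") and hf_token:
        env_vars["HF_TOKEN"] = hf_token
    return env_vars


print(build_env_vars("gs://my-bucket/my-models/llama_3_2_3B", hf_token="unused"))
print(build_env_vars("meta-llama/Llama-3.2-3B", hf_token="example-token"))
```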

Create an IAM service account for Cloud Storage access

If your model is in Cloud Storage, create a service account that the Vertex AI prediction endpoint can use to access the model artifacts.

SERVICE_ACCOUNT_NAME = "vertexai-endpoint-sa"
SERVICE_ACCOUNT_EMAIL = f"{SERVICE_ACCOUNT_NAME}@{PROJECT_ID}.iam.gserviceaccount.com"

gcloud iam service-accounts create SERVICE_ACCOUNT_NAME \
    --display-name="Vertex AI Endpoint Service Account"

# Grant storage read permission
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/storage.objectViewer"

When you deploy, pass the service account email to the deploy method: service_account=SERVICE_ACCOUNT_EMAIL

Get predictions using the endpoint

After you successfully deploy the model to the endpoint, use raw_predict to verify the model response.

import json

PROMPT = "Distance of moon from earth is "
request_body = json.dumps(
    {
        "prompt": PROMPT,
        "temperature": 0.0,
    },
)

raw_response = endpoint.raw_predict(
    body=request_body, headers={"Content-Type": "application/json"}
)
assert raw_response.status_code == 200
result = json.loads(raw_response.text)

for choice in result["choices"]:
    print(choice)

Example output:

{
  "index": 0,
  "text": "384,400 km. The moon is 1/4 of the earth's",
  "logprobs": null,
  "finish_reason": "length",
  "stop_reason": null,
  "prompt_logprobs": null
}
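The assert in the request example stops without context when the status is not 200. As a minimal sketch of a more descriptive alternative, the following hypothetical extract_texts helper checks the status code, reports the response body on failure, and returns only the generated text. The sample body mirrors the example output above and is not a live response.

```python
import json


def extract_texts(status_code, body_text):
    """Return the generated text of each choice, or raise with context."""
    if status_code != 200:
        # Include the start of the body so the failure is easier to diagnose.
        raise RuntimeError(
            f"Prediction failed with HTTP {status_code}: {body_text[:200]}"
        )
    result = json.loads(body_text)
    return [choice["text"] for choice in result.get("choices", [])]


sample_body = json.dumps(
    {
        "choices": [
            {
                "index": 0,
                "text": "384,400 km. The moon is 1/4 of the earth's",
                "finish_reason": "length",
            }
        ]
    }
)
texts = extract_texts(200, sample_body)
print(texts)
```

With a live endpoint, you would call extract_texts(raw_response.status_code, raw_response.text).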
