
Deploy open-source models using a custom vLLM container

To see an example of deploying Llama 3.2 3B with a custom vLLM container, run one of the following notebooks in the environment of your choice:

"Deploy Llama 3.2 3B on CPU using vLLM":
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
"Deploy Llama 3.2 3B on GPU using vLLM":
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
"Deploy Llama 3.2 3B on TPU using vLLM with GCS weights":
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
"Deploy Llama 3.2 3B on TPU using vLLM":
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub

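The Vertex AI model serving options are sufficient for many use cases, but you might need to serve models on Vertex AI with your own container image. This document describes how to use a custom vLLM container image to serve models on Vertex AI on CPUs, GPUs, or TPUs. For more information about the models that vLLM supports, see the vLLM documentation.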

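The vLLM API server implements the OpenAI API protocol, but it doesn't support the Vertex AI request and response requirements. Therefore, to get inferences from a model deployed to Vertex AI, you must send a Vertex AI raw prediction request to the prediction endpoint. For more information about the Raw Prediction method in the Vertex AI Python SDK, see the Python SDK documentation.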

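You can source models from both Hugging Face and Cloud Storage. This flexibility lets you use the community-driven model hub (Hugging Face) as well as the optimized data transfer and security capabilities of Cloud Storage to manage internal models or fine-tuned versions.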

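If a Hugging Face access token is provided, vLLM downloads the model from Hugging Face. Otherwise, vLLM assumes that the model is available on local disk. The custom container image lets Vertex AI download the model from Google Cloud in addition to Hugging Face.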
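The source selection described above can be sketched as a small function. This is only an illustration, not the container's actual entrypoint logic: the name `resolve_model_source` and the returned labels are hypothetical, and the sketch assumes that a `gs://` prefix is what identifies a Cloud Storage model.

```python
from typing import Optional


def resolve_model_source(model_id: str, hf_token: Optional[str] = None) -> str:
    """Return where the model weights are expected to come from.

    Hypothetical helper: the labels and rules are assumptions for illustration.
    """
    if model_id.startswith("gs://"):
        # Cloud Storage URI: the custom container downloads the weights first.
        return "cloud_storage"
    if hf_token:
        # A token is available, so vLLM can download from Hugging Face.
        return "hugging_face"
    # No gs:// URI and no token: vLLM treats the ID as a path on local disk.
    return "local_disk"


print(resolve_model_source("meta-llama/Llama-3.2-3B", hf_token="hf_example"))
print(resolve_model_source("gs://my-bucket/my-models/llama_3_2_3B"))
```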

Before you begin

In your Google Cloud project, enable the Vertex AI and Artifact Registry APIs.

gcloud services enable aiplatform.googleapis.com \
    artifactregistry.googleapis.com

Configure the Google Cloud CLI with your project ID, and initialize the Vertex AI SDK.

import vertexai

PROJECT_ID = "PROJECT_ID"
LOCATION = "LOCATION"
vertexai.init(project=PROJECT_ID, location=LOCATION)

gcloud config set project PROJECT_ID

Create a Docker repository in Artifact Registry.

gcloud artifacts repositories create DOCKER_REPOSITORY \
    --repository-format=docker \
    --location=LOCATION \
    --description="Vertex AI Docker repository"

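Optional: To download models from Hugging Face, get a Hugging Face token.
1. If you don't have a Hugging Face account, create one.
2. To use gated models such as Llama 3.2, request and receive access on Hugging Face before you continue.
3. Generate an access token: go to Your Profile > Settings > Access Tokens.
4. Select New Token.
5. Specify a name and a role of at least Read.
6. Select Generate a token.
7. Save this token for the deployment steps.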
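One way to keep the saved token out of notebooks and source files is to read it from an environment variable. The following is a minimal sketch under that assumption; the helper name `load_hf_token` is hypothetical, and `HF_TOKEN` is used as the variable name only because the serving container reads a variable of that name later in this document.

```python
import os
from typing import Mapping, Optional


def load_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the Hugging Face token from an environment mapping.

    Hypothetical helper: pass a mapping explicitly for testing, or omit it
    to read from the current process environment.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise ValueError("HF_TOKEN is not set; export it before deploying.")
    return token


# Example with an explicit mapping, so the sketch runs without a real token.
hf_token = load_hf_token({"HF_TOKEN": "hf_example"})
```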

Prepare the container build files

The following Dockerfile builds the custom vLLM container image for GPUs, TPUs, and CPUs. This custom container downloads models from Hugging Face or Cloud Storage.

ARG BASE_IMAGE
FROM ${BASE_IMAGE}

ENV DEBIAN_FRONTEND=noninteractive
# Install gcloud SDK
RUN apt-get update && \
    apt-get install -y apt-utils git apt-transport-https gnupg ca-certificates curl \
    && echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list \
    && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg \
    && apt-get update -y && apt-get install google-cloud-cli -y \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace/vllm

# Copy entrypoint.sh to the container
COPY ./entrypoint.sh /workspace/vllm/vertexai/entrypoint.sh
RUN chmod +x /workspace/vllm/vertexai/entrypoint.sh

ENTRYPOINT ["/workspace/vllm/vertexai/entrypoint.sh"]

Build the custom container image with Cloud Build. The following cloudbuild.yaml configuration file shows how to build the image for multiple platforms from the same Dockerfile.

steps:
-   name: 'gcr.io/cloud-builders/docker'
    automapSubstitutions: true
    script: |
      #!/usr/bin/env bash
      set -euo pipefail
      device_type_param=${_DEVICE_TYPE}
      device_type=${device_type_param,,}
      base_image=${_BASE_IMAGE}
      image_name="vllm-${_DEVICE_TYPE}"
      if [[ $device_type == "cpu" ]]; then
        echo "Quietly building open source vLLM CPU container image"
        git clone https://github.com/vllm-project/vllm.git
        cd vllm && DOCKER_BUILDKIT=1 docker build -t $base_image -f docker/Dockerfile.cpu . -q
        cd ..
      fi
      echo "Quietly building container image for: $device_type"
      docker build -t $LOCATION-docker.pkg.dev/$PROJECT_ID/${_REPOSITORY}/$image_name --build-arg BASE_IMAGE=$base_image . -q
      docker push $LOCATION-docker.pkg.dev/$PROJECT_ID/${_REPOSITORY}/$image_name
substitutions:
    _DEVICE_TYPE: gpu
    _BASE_IMAGE: vllm/vllm-openai
    _REPOSITORY: my-docker-repo

These files are available in the googlecloudplatform/vertex-ai-samples GitHub repository. Clone the repository to use them:

git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git

Build and push the container image

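Build the custom container image by submitting the cloudbuild.yaml file to Cloud Build. Use substitutions to specify the target device type (GPU, TPU, or CPU) and the corresponding base image.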

GPU

DEVICE_TYPE="gpu"
BASE_IMAGE="vllm/vllm-openai"
cd vertex-ai-samples/notebooks/official/prediction/vertexai_serving_vllm/cloud-build && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --region=LOCATION \
    --timeout="2h" \
    --machine-type=e2-highcpu-32 \
    --substitutions=_REPOSITORY=DOCKER_REPOSITORY,_DEVICE_TYPE=$DEVICE_TYPE,_BASE_IMAGE=$BASE_IMAGE

TPU

DEVICE_TYPE="tpu"
BASE_IMAGE="vllm/vllm-tpu:nightly"
cd vertex-ai-samples/notebooks/official/prediction/vertexai_serving_vllm/cloud-build && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --region=LOCATION \
    --timeout="2h" \
    --machine-type=e2-highcpu-32 \
    --substitutions=_REPOSITORY=DOCKER_REPOSITORY,_DEVICE_TYPE=$DEVICE_TYPE,_BASE_IMAGE=$BASE_IMAGE

CPU

DEVICE_TYPE="cpu"
BASE_IMAGE="vllm-cpu-base"
cd vertex-ai-samples/notebooks/official/prediction/vertexai_serving_vllm/cloud-build && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --region=LOCATION \
    --timeout="2h" \
    --machine-type=e2-highcpu-32 \
    --substitutions=_REPOSITORY=DOCKER_REPOSITORY,_DEVICE_TYPE=$DEVICE_TYPE,_BASE_IMAGE=$BASE_IMAGE
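The three commands above differ only in the device type and base image, so they can be generated from one mapping. The following is a rough sketch, not part of the sample repository: `BASE_IMAGES` and `build_submit_command` are hypothetical names, and `us-central1` is only an example region.

```python
import shlex
from typing import List

# Base image per device type, taken from the commands above.
BASE_IMAGES = {
    "gpu": "vllm/vllm-openai",
    "tpu": "vllm/vllm-tpu:nightly",
    "cpu": "vllm-cpu-base",
}


def build_submit_command(device_type: str, location: str, repository: str) -> List[str]:
    """Assemble the `gcloud builds submit` arguments for one device type."""
    device_type = device_type.lower()
    if device_type not in BASE_IMAGES:
        raise ValueError(f"Unsupported device type: {device_type}")
    substitutions = ",".join(
        [
            f"_REPOSITORY={repository}",
            f"_DEVICE_TYPE={device_type}",
            f"_BASE_IMAGE={BASE_IMAGES[device_type]}",
        ]
    )
    return [
        "gcloud", "builds", "submit",
        "--config=cloudbuild.yaml",
        f"--region={location}",
        "--timeout=2h",
        "--machine-type=e2-highcpu-32",
        f"--substitutions={substitutions}",
    ]


# Print a shell-ready command line for the TPU build.
print(shlex.join(build_submit_command("tpu", "us-central1", "my-docker-repo")))
```

The returned list can be passed to `subprocess.run` from the cloud-build directory if you prefer to start builds from Python.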

After the build finishes, configure Docker to authenticate with Artifact Registry:

gcloud auth configure-docker LOCATION-docker.pkg.dev --quiet

Upload the model to Model Registry and deploy it

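Complete the following steps to upload the model to Vertex AI Model Registry, create an endpoint, and deploy the model. This example uses Llama 3.2 3B, but you can adapt it to other models.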

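Define the model and deployment variables. Set the DOCKER_URI variable to the image that you built in the previous step (for example, the GPU image):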

DOCKER_URI = "LOCATION-docker.pkg.dev/PROJECT_ID/DOCKER_REPOSITORY/vllm-gpu"
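The image URI follows the pattern that the Cloud Build script pushes to, so it can also be composed from its parts. A minimal sketch, assuming the lowercase device type names used in the build commands; the function name `image_uri` and the example values are hypothetical.

```python
def image_uri(location: str, project_id: str, repository: str, device_type: str) -> str:
    """Compose the Artifact Registry URI of a vLLM image built earlier."""
    if device_type not in ("gpu", "tpu", "cpu"):
        raise ValueError(f"Unsupported device type: {device_type}")
    # Pattern: LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/vllm-DEVICE_TYPE
    return f"{location}-docker.pkg.dev/{project_id}/{repository}/vllm-{device_type}"


# Example values only; replace them with your own project settings.
print(image_uri("us-central1", "my-project", "my-docker-repo", "gpu"))
```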

Define variables for the Hugging Face token and the model properties. For example, for a GPU deployment:

hf_token = "your-hugging-face-auth-token"
model_name = "gpu-llama3_2_3B-serve-vllm"
model_id = "meta-llama/Llama-3.2-3B"
machine_type = "g2-standard-8"
accelerator_type = "NVIDIA_L4"
accelerator_count = 1

Upload the model to Model Registry. The upload_model function differs slightly by device type because the vLLM arguments and environment variables differ.

from google.cloud import aiplatform

def upload_model_gpu(model_name, model_id, hf_token, accelerator_count, docker_uri):
    vllm_args = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0", "--port=8080", f"--model={model_id}",
        "--max-model-len=2048", "--gpu-memory-utilization=0.9",
        "--enable-prefix-caching", f"--tensor-parallel-size={accelerator_count}",
    ]
    env_vars = {
        "HF_TOKEN": hf_token,
        "LD_LIBRARY_PATH": "$LD_LIBRARY_PATH:/usr/local/nvidia/lib64",
    }
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=docker_uri,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/v1/completions",
        serving_container_health_route="/health",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=1800,
    )
    return model

def upload_model_tpu(model_name, model_id, hf_token, tpu_count, docker_uri):
    vllm_args = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0", "--port=8080", f"--model={model_id}",
        "--max-model-len=2048", "--enable-prefix-caching",
        f"--tensor-parallel-size={tpu_count}",
    ]
    env_vars = {"HF_TOKEN": hf_token}
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=docker_uri,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/v1/completions",
        serving_container_health_route="/health",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=1800,
    )
    return model

def upload_model_cpu(model_name, model_id, hf_token, docker_uri):
    vllm_args = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0", "--port=8080", f"--model={model_id}",
        "--max-model-len=2048",
    ]
    env_vars = {"HF_TOKEN": hf_token}
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=docker_uri,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/v1/completions",
        serving_container_health_route="/health",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=1800,
    )
    return model

# Example for GPU:
vertexai_model = upload_model_gpu(model_name, model_id, hf_token, accelerator_count, DOCKER_URI)
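The three upload functions share most of their vLLM arguments and differ only in a few device-specific flags. If you prefer a single code path, the argument lists could be produced by one builder. This is a sketch of that refactoring, not part of the sample notebooks; `build_vllm_args` is a hypothetical name, and the flags are the ones used in the functions above.

```python
from typing import List


def build_vllm_args(device_type: str, model_id: str, accelerator_count: int = 1) -> List[str]:
    """Return the serving container arguments for the given device type."""
    if device_type not in ("gpu", "tpu", "cpu"):
        raise ValueError(f"Unsupported device type: {device_type}")
    # Arguments common to every device type.
    args = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0", "--port=8080", f"--model={model_id}",
        "--max-model-len=2048",
    ]
    if device_type == "gpu":
        # Only the GPU variant caps the GPU memory fraction.
        args.append("--gpu-memory-utilization=0.9")
    if device_type in ("gpu", "tpu"):
        # Accelerators enable prefix caching and tensor parallelism.
        args += [
            "--enable-prefix-caching",
            f"--tensor-parallel-size={accelerator_count}",
        ]
    return args


print(build_vllm_args("cpu", "meta-llama/Llama-3.2-3B"))
```

The result can be passed as `serving_container_args` in place of the hard-coded `vllm_args` lists.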

Create an endpoint.

endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")

Deploy the model to the endpoint. Model deployment can take 20 to 30 minutes.

# Example for GPU:
vertexai_model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=model_name,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    traffic_percentage=100,
    deploy_request_timeout=1800,
    min_replica_count=1,
    max_replica_count=4,
    autoscaling_target_accelerator_duty_cycle=60,
)

For TPUs, omit the accelerator_type and accelerator_count parameters and use autoscaling_target_request_count_per_minute=60. For CPUs, omit the accelerator_type and accelerator_count parameters and use autoscaling_target_cpu_utilization=60.
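The per-device differences in the deploy call can be captured in a helper that returns keyword arguments. The following is a sketch under the rules stated above; `build_deploy_kwargs` is a hypothetical name, and the replica counts and timeout simply mirror the GPU example.

```python
from typing import Any, Dict, Optional


def build_deploy_kwargs(
    device_type: str,
    machine_type: str,
    accelerator_type: Optional[str] = None,
    accelerator_count: Optional[int] = None,
) -> Dict[str, Any]:
    """Return device-specific keyword arguments for the deploy call."""
    kwargs: Dict[str, Any] = {
        "machine_type": machine_type,
        "traffic_percentage": 100,
        "deploy_request_timeout": 1800,
        "min_replica_count": 1,
        "max_replica_count": 4,
    }
    if device_type == "gpu":
        # GPUs pass the accelerator settings and scale on duty cycle.
        kwargs["accelerator_type"] = accelerator_type
        kwargs["accelerator_count"] = accelerator_count
        kwargs["autoscaling_target_accelerator_duty_cycle"] = 60
    elif device_type == "tpu":
        # TPUs omit the accelerator settings and scale on request count.
        kwargs["autoscaling_target_request_count_per_minute"] = 60
    elif device_type == "cpu":
        # CPUs omit the accelerator settings and scale on CPU utilization.
        kwargs["autoscaling_target_cpu_utilization"] = 60
    else:
        raise ValueError(f"Unsupported device type: {device_type}")
    return kwargs


print(build_deploy_kwargs("gpu", "g2-standard-8", "NVIDIA_L4", 1))
```

You could then call `vertexai_model.deploy(endpoint=endpoint, deployed_model_display_name=model_name, **kwargs)` with the returned dictionary.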

Load the model from Cloud Storage

The custom container can download the model from a Cloud Storage location instead of from Hugging Face. When you use Cloud Storage:

Set the model_id parameter in the upload_model function to the Cloud Storage URI, for example gs://my-bucket/my-models/llama_3_2_3B.
Omit the HF_TOKEN variable from env_vars when you call upload_model.
Specify a service_account that has permission to read from the Cloud Storage bucket when you call model.deploy.
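The three adjustments above can be checked in one place before you upload and deploy. A minimal sketch, assuming a `gs://` URI and a service account email are already available; `cloud_storage_settings` is a hypothetical helper, and the project in the example email is a placeholder.

```python
from typing import Dict, Tuple


def cloud_storage_settings(
    model_uri: str, service_account: str
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (env_vars, deploy_kwargs) for a model stored in Cloud Storage."""
    prefix = "gs://"
    if not model_uri.startswith(prefix) or len(model_uri) == len(prefix):
        raise ValueError("model_uri must look like gs://bucket/path")
    if not service_account:
        raise ValueError("A service account with read access is required.")
    # HF_TOKEN is intentionally omitted for Cloud Storage models.
    env_vars: Dict[str, str] = {}
    # The service account is passed to the deploy call.
    deploy_kwargs = {"service_account": service_account}
    return env_vars, deploy_kwargs


env_vars, deploy_kwargs = cloud_storage_settings(
    "gs://my-bucket/my-models/llama_3_2_3B",
    "vertexai-endpoint-sa@my-project.iam.gserviceaccount.com",
)
print(env_vars, deploy_kwargs)
```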

Create an IAM service account to access Cloud Storage

If your model is in Cloud Storage, create a service account that the Vertex AI prediction endpoint can use to access the model artifacts.

SERVICE_ACCOUNT_NAME = "vertexai-endpoint-sa"
SERVICE_ACCOUNT_EMAIL = f"{SERVICE_ACCOUNT_NAME}@{PROJECT_ID}.iam.gserviceaccount.com"

gcloud iam service-accounts create SERVICE_ACCOUNT_NAME \
    --display-name="Vertex AI Endpoint Service Account"

# Grant storage read permission
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/storage.objectViewer"

When you deploy, pass the service account email to the deploy method: service_account=SERVICE_ACCOUNT_EMAIL.

Get predictions using the endpoint

After the model is successfully deployed to the endpoint, use raw_predict to verify the model response.

import json

PROMPT = "Distance of moon from earth is "
request_body = json.dumps(
    {
        "prompt": PROMPT,
        "temperature": 0.0,
    },
)

raw_response = endpoint.raw_predict(
    body=request_body, headers={"Content-Type": "application/json"}
)
assert raw_response.status_code == 200
result = json.loads(raw_response.text)

for choice in result["choices"]:
    print(choice)

Example output:

{
  "index": 0,
  "text": "384,400 km. The moon is 1/4 of the earth's",
  "logprobs": null,
  "finish_reason": "length",
  "stop_reason": null,
  "prompt_logprobs": null
}
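In the example output, a finish_reason of "length" means that generation stopped at the token limit rather than at a natural end. A small post-processing helper can surface that. This is a sketch only: `extract_completions` is a hypothetical name, and the sample dictionary reuses the output shown above.

```python
from typing import Any, Dict, List


def extract_completions(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pull the generated text out of an OpenAI-style completions response."""
    completions = []
    for choice in result.get("choices", []):
        completions.append(
            {
                "text": choice["text"],
                # "length" means the output was cut off at the token limit.
                "truncated": choice.get("finish_reason") == "length",
            }
        )
    return completions


# Sample response built from the example output above.
sample = {
    "choices": [
        {
            "index": 0,
            "text": "384,400 km. The moon is 1/4 of the earth's",
            "logprobs": None,
            "finish_reason": "length",
            "stop_reason": None,
            "prompt_logprobs": None,
        }
    ]
}
for completion in extract_completions(sample):
    print(completion["truncated"], completion["text"])
```

If a completion is truncated, you can add a larger "max_tokens" value to the request body.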
