Google uses AI technology to translate content into your preferred language. AI translations can contain errors.

우선순위 사용한 만큼만 지불

우선순위 사용한 만큼만 지불 (우선순위 PayGo)은 프로비저닝된 처리량의 선불 약정 없이 표준 PayGo보다 일관된 성능을 제공하는 소비 옵션입니다.

우선순위 PayGo를 사용하면 표준 PayGo보다 높은 요율로 토큰 사용량당 요금이 청구됩니다. 가격에 대한 자세한 내용은 Gemini Enterprise Agent Platform 가격 책정 페이지를 참고하세요.

우선순위 PayGo를 사용해야 하는 경우

Priority PayGo는 트래픽 패턴이 변동하거나 예측할 수 없는 비즈니스에 중요한 워크로드에 적합합니다. 다음은 사용 사례의 예입니다.

고객 대면 가상 어시스턴트
에이전트형 워크플로 및 교차 에이전트 상호작용
연구 시뮬레이션

지원되는 모델 및 위치

다음 모델은 global 엔드포인트에서만 우선순위 PayGo를 지원합니다. 우선순위 PayGo는 리전 또는 멀티 리전 엔드포인트를 지원하지 않습니다.

우선 PayGo 사용

우선순위 종량제를 사용하여 Gemini Enterprise Agent Platform의 Gemini API에 요청을 전송하려면 요청에 X-Vertex-AI-LLM-Shared-Request-Type 헤더를 포함해야 합니다. 우선순위 PayGo는 다음 두 가지 방법으로 사용할 수 있습니다.

프로비저닝된 처리량 할당량 (사용 가능한 경우)을 사용하고 우선순위 PayGo로 오버플로합니다.
우선 PayGo만 사용합니다.

프로비저닝된 처리량을 기본값으로 사용하는 동안 우선순위 PayGo 사용

우선순위 종량제를 사용하기 전에 사용 가능한 프로비저닝된 처리량 할당량을 활용하려면 다음 샘플에 표시된 대로 요청에 X-Vertex-AI-LLM-Shared-Request-Type: priority 헤더를 포함하세요.

Python

설치

pip install --upgrade google-genai

자세한 내용은 SDK 참고 문서를 참고하세요.

Vertex AI에서 Gen AI SDK를 사용하도록 환경 변수를 설정합니다.

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_ENTERPRISE=True

우선순위 PayGo를 사용하도록 생성형 AI 클라이언트를 초기화합니다. 이 단계를 수행한 후에는 동일한 클라이언트에서 우선순위 PayGo를 사용하여 Gemini API와 상호작용하기 위해 코드를 추가로 조정할 필요가 없습니다.

from google import genai
from google.genai.types import HttpOptions
client = genai.Client(
  vertexai=True, project='your_project_id', location='global',
  http_options=HttpOptions(
    api_version="v1",
      headers={
        "X-Vertex-AI-LLM-Shared-Request-Type": "priority"
      },
  )
)

REST

환경을 설정한 후 REST를 사용하여 텍스트 프롬프트를 테스트할 수 있습니다. 다음 샘플은 요청을 게시자 모델 엔드포인트에 전송합니다.

요청 데이터를 사용하기 전에 다음을 바꿉니다.

PROJECT_ID: [프로젝트 ID](/resource-manager/docs/creating-managing-projects#identifiers)입니다. .
MODEL_ID: 우선순위 PayGo를 초기화할 모델의 모델 ID입니다. 우선순위 PayGo를 지원하는 모델 목록은 모델 버전을 참고하세요.
PROMPT_TEXT: 프롬프트에 포함할 텍스트 안내입니다. JSON.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -H "X-Vertex-AI-LLM-Shared-Request-Type: priority" \
  "https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/MODEL_ID:generateContent" -d \
  $'{
      "contents": {
        "role": "model",
        "parts": { "text": "PROMPT_TEXT" }
    }
  }'

다음과 비슷한 JSON 응답이 수신됩니다.

{
  "candidates": [
    {
      "content": {
        "role": "model",
        "parts": [
          {
            "text": "Response to sample request."
          }
        ]
      },
      "finishReason": "STOP"
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 3,
    "candidatesTokenCount": 900,
    "totalTokenCount": 1957,
    "trafficType": "ON_DEMAND_PRIORITY",
    "thoughtsTokenCount": 1054
  }
}

응답이 완전히 생성된 후 반환되도록 요청하려면 generateContent 메서드를 사용합니다. 시청자가 지연 시간에 대해 갖는 느낌을 줄이려면 streamGenerateContent 메서드를 사용하여 생성되는 응답을 스트리밍합니다.
멀티모달 모델 ID는 메서드 앞의 URL 끝 부분에 있습니다(예: gemini-2.0-flash). 이 샘플은 다른 모델도 지원할 수 있습니다.

우선순위 PayGo만 사용

우선순위 PayGo만 사용하려면 다음 샘플과 같이 요청에 X-Vertex-AI-LLM-Request-Type: shared 및 X-Vertex-AI-LLM-Shared-Request-Type: priority 헤더를 포함하세요.

Python

설치

pip install --upgrade google-genai

자세한 내용은 SDK 참고 문서를 참고하세요.

Vertex AI에서 Gen AI SDK를 사용하도록 환경 변수를 설정합니다.

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_ENTERPRISE=True

from google import genai
from google.genai.types import HttpOptions
client = genai.Client(
  vertexai=True, project='your_project_id', location='global',
  http_options=HttpOptions(
    api_version="v1",
      headers={
        "X-Vertex-AI-LLM-Request-Type": "shared",
        "X-Vertex-AI-LLM-Shared-Request-Type": "priority"
      },
  )
)

REST

요청 데이터를 사용하기 전에 다음을 바꿉니다.

PROJECT_ID: [프로젝트 ID](/resource-manager/docs/creating-managing-projects#identifiers)입니다. .
MODEL_ID: 우선순위 PayGo를 초기화할 모델의 모델 ID입니다. 우선순위 PayGo를 지원하는 모델 목록은 모델 버전을 참고하세요.
PROMPT_TEXT: 프롬프트에 포함할 텍스트 안내입니다. JSON.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -H "X-Vertex-AI-LLM-Request-Type: shared" \
  -H "X-Vertex-AI-LLM-Shared-Request-Type: priority" \
  "https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/MODEL_ID:generateContent" -d \
  $'{
      "contents": {
        "role": "model",
        "parts": { "text": "PROMPT_TEXT" }
    }
  }'

다음과 비슷한 JSON 응답이 수신됩니다.

{
  "candidates": [
    {
      "content": {
        "role": "model",
        "parts": [
          {
            "text": "Response to sample request."
          }
        ]
      },
      "finishReason": "STOP"
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 3,
    "candidatesTokenCount": 900,
    "totalTokenCount": 1957,
    "trafficType": "ON_DEMAND_PRIORITY",
    "thoughtsTokenCount": 1054
  }
}

응답이 완전히 생성된 후 반환되도록 요청하려면 generateContent 메서드를 사용합니다. 시청자가 지연 시간에 대해 갖는 느낌을 줄이려면 streamGenerateContent 메서드를 사용하여 생성되는 응답을 스트리밍합니다.
멀티모달 모델 ID는 메서드 앞의 URL 끝 부분에 있습니다(예: gemini-2.0-flash). 이 샘플은 다른 모델도 지원할 수 있습니다.

우선 PayGo 사용량 확인

다음 예와 같이 응답의 트래픽 유형에서 요청이 우선순위 PayGo를 사용했는지 확인할 수 있습니다.

Python

응답의 traffic_type 필드에서 요청에 Priority PayGo가 사용되었는지 확인할 수 있습니다. 요청이 우선순위 PayGo를 사용하여 처리된 경우 traffic_type 필드는 ON_DEMAND_PRIORITY로 설정됩니다.

sdk_http_response=HttpResponse(
  headers=<dict len=9>
) candidates=[Candidate(
  avg_logprobs=-0.539712212302468,
  content=Content(
    parts=[
      Part(
        text="""Response to sample request.
        """
      ),
    ],
    role='model'
  ),
  finish_reason=<FinishReason.STOP: 'STOP'>
)] create_time=datetime.datetime(2025, 12, 3, 20, 32, 55, 916498, tzinfo=TzInfo(0)) model_version='gemini-2.5-flash' prompt_feedback=None response_id='response_id' usage_metadata=GenerateContentResponseUsageMetadata(
  candidates_token_count=1408,
  candidates_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=1408
    ),
  ],
  prompt_token_count=5,
  prompt_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=5
    ),
  ],
  thoughts_token_count=1356,
  total_token_count=2769,
  traffic_type=<TrafficType.ON_DEMAND_PRIORITY: 'ON_DEMAND_PRIORITY'>
) automatic_function_calling_history=[] parsed=None

REST

응답의 trafficType 필드에서 요청에 Priority PayGo가 사용되었는지 확인할 수 있습니다. 요청이 우선순위 PayGo를 사용하여 처리된 경우 trafficType 필드는 ON_DEMAND_PRIORITY로 설정됩니다.

{
  "candidates": [
    {
      "content": {
        "role": "model",
        "parts": [
          {
            "text": "Response to sample request."
          }
        ]
      },
      "finishReason": "STOP"
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 3,
    "candidatesTokenCount": 900,
    "totalTokenCount": 1957,
    "trafficType": "ON_DEMAND_PRIORITY",
    "thoughtsTokenCount": 1054
  }
}

램프 제한

우선순위 PayGo는 조직 수준에서 램프 한도를 설정합니다. 램프 제한은 예측 가능하고 일관된 성능을 제공하는 데 도움이 됩니다. 시작 한도는 다음과 같이 모델에 따라 다릅니다.

Gemini Flash 및 Flash-Lite 모델: 4백만 토큰/분
Gemini Pro 모델: 100만 토큰/분

지속적으로 사용한 10분마다 램프 한도가 50% 씩 증가합니다.

요청이 램프 한도를 초과하거나 트래픽 부하가 높아 시스템이 일시적으로 용량을 초과하는 경우 요청이 표준 PayGo로 다운그레이드되고 표준 PayGo 요금으로 청구될 수 있습니다.

다운그레이드를 최소화하려면 한도 내에서 사용량을 점진적으로 확장하세요. 그래도 성능이 더 필요한 경우 프로비저닝된 처리량 할당량을 추가로 구매하는 것이 좋습니다.

응답에서 요청이 다운그레이드되었는지 확인할 수 있습니다. 표준 종량제로 다운그레이드된 요청의 경우 트래픽 유형이 ON_DEMAND로 설정됩니다. 자세한 내용은 우선순위 PayGo 사용량 확인을 참고하세요.

다음 단계

프로비저닝된 처리량에 대한 자세한 내용은 프로비저닝된 처리량 참고하기
Agent Platform의 할당량 및 한도에 대한 자세한 내용은 Gemini Enterprise Agent Platform 할당량 및 한도를 참고하세요.
Google Cloud 할당량 및 시스템 한도에 대해 자세히 알아보려면 Cloud 할당량 문서를 참조하세요.

우선순위 사용한 만큼만 지불 컬렉션을 사용해 정리하기 내 환경설정을 기준으로 콘텐츠를 저장하고 분류하세요.

우선순위 PayGo를 사용해야 하는 경우

지원되는 모델 및 위치

우선 PayGo 사용

프로비저닝된 처리량을 기본값으로 사용하는 동안 우선순위 PayGo 사용

Python

설치

REST

우선순위 PayGo만 사용

Python

설치

REST

우선 PayGo 사용량 확인

Python

REST

램프 제한

다음 단계

우선순위 사용한 만큼만 지불