gcloud container ai profiles manifests create

NAME: gcloud container ai profiles manifests create - generate ready-to-deploy Kubernetes manifests with compute, load balancing, and autoscaling capabilities
SYNOPSIS: gcloud container ai profiles manifests create --accelerator-type=ACCELERATOR_TYPE --model=MODEL --model-server=MODEL_SERVER [--model-bucket-uri=MODEL_BUCKET_URI] [--model-server-version=MODEL_SERVER_VERSION] [--namespace=NAMESPACE] [--output=OUTPUT; default="all"] [--output-path=OUTPUT_PATH] [--serving-stack=SERVING_STACK] [--serving-stack-version=SERVING_STACK_VERSION] [--target-itl-milliseconds=TARGET_ITL_MILLISECONDS] [--target-ntpot-milliseconds=TARGET_NTPOT_MILLISECONDS] [--target-ttft-milliseconds=TARGET_TTFT_MILLISECONDS] [--use-case=USE_CASE] [GCLOUD_WIDE_FLAG …]
DESCRIPTION: To get supported models, model servers, and model server versions, run gcloud alpha container ai profiles model-and-server-combinations list. To get supported accelerators with their performance metrics, run gcloud alpha container ai profiles accelerators list.
REQUIRED FLAGS:

--accelerator-type=ACCELERATOR_TYPE

The accelerator type.

--model=MODEL

The model.

--model-server=MODEL_SERVER

The model server.
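
As a minimal sketch of how the three required flags fit together, the snippet below assembles the command and prints it for review. The values nvidia-l4, google/gemma-2-2b-it, and vllm are hypothetical placeholders, not confirmed supported values; substitute a combination reported by the list commands mentioned in the DESCRIPTION. The RUN variable is likewise just a convention of this sketch.

```shell
#!/bin/sh
# Hypothetical placeholder values; replace them with a supported combination.
ACCELERATOR_TYPE="nvidia-l4"
MODEL="google/gemma-2-2b-it"
MODEL_SERVER="vllm"

# Assemble the subcommand and the three required flags as positional arguments.
set -- container ai profiles manifests create \
  --accelerator-type="$ACCELERATOR_TYPE" \
  --model="$MODEL" \
  --model-server="$MODEL_SERVER"

# Print the full command for review; execute it only when RUN=1 is set.
CMD="gcloud $*"
echo "$CMD"
if [ "${RUN:-0}" = "1" ]; then
  gcloud "$@"
fi
```

Printing first makes it easy to check the flag values before any manifests are generated.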
OPTIONAL FLAGS:

--model-bucket-uri=MODEL_BUCKET_URI

The Google Cloud Storage bucket URI to load the model from. This URI must point to the directory containing the model's config file (config.json) and model weights. If unspecified, defaults to loading the model from Hugging Face.

--model-server-version=MODEL_SERVER_VERSION

The model server version. If not specified, this defaults to the latest version.

--namespace=NAMESPACE

The namespace to deploy the manifests in. Default namespace is 'default'.

--output=OUTPUT; default="all"

The output to display. Default is all. OUTPUT must be one of: manifest, comments, all.

--output-path=OUTPUT_PATH

The path to save the output to. If not specified, the output is printed to the terminal.

--serving-stack=SERVING_STACK

The serving stack to filter manifests by. If not provided, manifests for all serving stacks that support the given model and model server will be considered.

--serving-stack-version=SERVING_STACK_VERSION

The serving stack version to filter manifests by. If not provided, manifests for all versions that support the given model and model server will be considered.

--target-itl-milliseconds=TARGET_ITL_MILLISECONDS

The target inter-token latency (ITL) in milliseconds. If this is set, the manifest will include Horizontal Pod Autoscaler (HPA) resources which automatically adjust the model server replica count in response to changes in model server load to keep p50 ITL below the specified threshold. If the provided target-itl-milliseconds is too low to achieve, the HPA manifest will not be generated.

--target-ntpot-milliseconds=TARGET_NTPOT_MILLISECONDS

The maximum normalized time per output token (NTPOT) in milliseconds. NTPOT is measured as request_latency / output_tokens. If this is set, the manifests will include Horizontal Pod Autoscaler (HPA) resources which automatically adjust the model server replica count in response to changes in model server load to keep p50 NTPOT below the specified threshold. If the provided target-ntpot-milliseconds is too low to achieve, the HPA manifest will not be generated.
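
To make the NTPOT formula concrete, here is a small worked calculation. The latency and token count are made-up numbers for a single hypothetical request, not measurements from any model server.

```shell
#!/bin/sh
# Hypothetical measurements for one request (illustrative values only).
REQUEST_LATENCY_MS=2400   # end-to-end request latency in milliseconds
OUTPUT_TOKENS=120         # number of tokens generated in the response

# NTPOT = request_latency / output_tokens (integer division in POSIX sh).
NTPOT_MS=$((REQUEST_LATENCY_MS / OUTPUT_TOKENS))
echo "NTPOT: ${NTPOT_MS} ms per output token"
```

With these numbers the request averages 20 ms per output token, so it would sit below a threshold of, say, 25 ms and above one of 15 ms.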

--target-ttft-milliseconds=TARGET_TTFT_MILLISECONDS

The target time to first token (TTFT) in milliseconds. If specified, only accelerators that can meet this latency target are considered. If the provided target-ttft-milliseconds is too low to achieve, the HPA manifest will not be generated.

--use-case=USE_CASE

The manifest will be optimized for this use case. Options are: Advanced Customer Support, Code Completion, Text Summarization, Chatbot (ShareGPT), Code Generation, Deep Research. Defaults to Chatbot if not specified.
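
The optional flags above can be combined with the required ones. The sketch below builds an invocation that writes only the manifest to a file, targets a non-default namespace, and sets an NTPOT target. All values (nvidia-l4, google/gemma-2-2b-it, vllm, inference, manifests.yaml, 20) are hypothetical placeholders, and the RUN variable is only a convention of this sketch.

```shell
#!/bin/sh
# Hypothetical placeholder values; none are confirmed supported values.
ACCELERATOR_TYPE="nvidia-l4"
MODEL="google/gemma-2-2b-it"
MODEL_SERVER="vllm"
NAMESPACE="inference"          # namespace to deploy the manifests in
OUTPUT_PATH="manifests.yaml"   # where the generated output is saved
TARGET_NTPOT_MS=20             # p50 NTPOT threshold in milliseconds

# Required flags first, then the optional ones.
set -- container ai profiles manifests create \
  --accelerator-type="$ACCELERATOR_TYPE" \
  --model="$MODEL" \
  --model-server="$MODEL_SERVER" \
  --namespace="$NAMESPACE" \
  --output=manifest \
  --output-path="$OUTPUT_PATH" \
  --target-ntpot-milliseconds="$TARGET_NTPOT_MS"

# Print the full command for review; execute it only when RUN=1 is set.
CMD="gcloud $*"
echo "$CMD"
if [ "${RUN:-0}" = "1" ]; then
  gcloud "$@"
fi
```

Assuming the output path ends up holding a single YAML file, it could then be reviewed and applied to a cluster with kubectl apply -f manifests.yaml.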
GCLOUD WIDE FLAGS: These flags are available to all commands: --access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.
Run $ gcloud help for details.
NOTES: This variant is also available:
gcloud alpha container ai profiles manifests create
