PromptTokenLimit policy

This page applies to Apigee, but not to Apigee hybrid.

Overview

The PromptTokenLimit policy protects Large Language Model (LLM) backends from traffic surges by throttling the number of tokens in user prompts.

The PromptTokenLimit policy is similar to the SpikeArrest policy; however, the SpikeArrest policy limits the number of requests, while the PromptTokenLimit policy limits the number of tokens within those requests. This policy is specifically tailored for LLM applications, where cost and performance are directly related to the number of tokens processed.

This policy is an Extensible policy and use of this policy might have cost or utilization implications, depending on your Apigee license. For information on policy types and usage implications, see Policy types.

The difference between PromptTokenLimit and LLMTokenQuota

The PromptTokenLimit policy is used for operational traffic management to prevent sudden spikes in token usage. In contrast, the LLMTokenQuota policy is used to enforce consumption limits on client apps over longer periods (such as hours, days, or months) to manage costs and enforce business agreements.

PromptTokenLimit element

Defines the PromptTokenLimit policy.

Default Value See the Default Policy section, below
Required? Optional
Type Complex object
Parent Element N/A
Child Elements <Identifier>
<Rate> (Required)
<UseEffectiveCount>
<UserPromptSource>
<IgnoreUnresolvedVariables>

Syntax

The PromptTokenLimit element uses the following syntax:

<PromptTokenLimit continueOnError="false" enabled="true" name="POLICY_NAME">
  <DisplayName></DisplayName>
  <Properties/>
  <UserPromptSource>{jsonPath('$.contents[-1].parts[-1].text',request.content,true)}</UserPromptSource>
  <Identifier ref=""/>
  <Rate ref="">[pm|ps]</Rate>
  <UseEffectiveCount>[false|true]</UseEffectiveCount>
  <IgnoreUnresolvedVariables>[false|true]</IgnoreUnresolvedVariables>
</PromptTokenLimit>
      

Default Policy

The following example shows the default settings when you add a PromptTokenLimit policy to your flow in the UI:

<PromptTokenLimit continueOnError="false" enabled="true" name="PTL-limitTokens-1">
  <DisplayName></DisplayName>
  <Properties/>
  <UserPromptSource>{jsonPath('$.contents[-1].parts[-1].text',request.content,true)}</UserPromptSource>
  <Identifier ref=""/>
  <Rate ref="">[pm|ps]</Rate>
  <UseEffectiveCount>[false|true]</UseEffectiveCount>
  <IgnoreUnresolvedVariables>[false|true]</IgnoreUnresolvedVariables>
</PromptTokenLimit>
      

This element has the following attributes that are common to all policies:

Attribute Default Required? Description
name N/A Required

The internal name of the policy. The value of the name attribute can contain letters, numbers, spaces, hyphens, underscores, and periods. This value cannot exceed 255 characters.

Optionally, use the <DisplayName> element to label the policy in the management UI proxy editor with a different, natural-language name.

continueOnError false Optional Set to false to return an error when a policy fails. This is expected behavior for most policies. Set to true to have flow execution continue even after a policy fails. See also: What you need to know about policy errors and Handling faults.
enabled true Optional Set to true to enforce the policy. Set to false to turn off the policy. The policy will not be enforced even if it remains attached to a flow.
async false Deprecated This attribute is deprecated.

Examples

The following examples show some of the ways in which you can use the PromptTokenLimit policy:

Example 1

Prompt token limiting within a single replica.

In this example, prompt token limiting occurs within a single replica and is not distributed across multiple message processors in a region.

<PromptTokenLimit continueOnError="false" enabled="true" name="PTL-limitTokens-1">
  <DisplayName></DisplayName>
  <Properties/>
  <UserPromptSource>{jsonPath('$.contents[-1].parts[-1].text',request.content,true)}</UserPromptSource>
  <Identifier ref="request.url"/>
  <Rate>1pm</Rate>
  <UseEffectiveCount>false</UseEffectiveCount>
</PromptTokenLimit>

Example 2

Distributed token limiting.

In this example, prompt token limiting is distributed across multiple replicas in a region, and a "sliding window" rate limiting algorithm is employed.

<PromptTokenLimit continueOnError="false" enabled="true" name="PTL-limitTokens-1">
  <DisplayName></DisplayName>
  <Properties/>
  <UserPromptSource>{jsonPath('$.contents[-1].parts[-1].text',request.content,true)}</UserPromptSource>
  <Identifier ref="request.url"/>
  <Rate>1pm</Rate>
  <UseEffectiveCount>true</UseEffectiveCount>
</PromptTokenLimit>

Example 3

Context window size token limiting per request.

In this example, prompt token limiting occurs within a single replica and is not distributed across multiple message processors in a region. This specific configuration is used for context window size token limiting per request.

<PromptTokenLimit continueOnError="false" enabled="true" name="PTL-limitTokens-1">
  <DisplayName></DisplayName>
  <Properties/>
  <UserPromptSource>{jsonPath('$.messages',request.content,true)}</UserPromptSource>
  <Identifier ref="messageid"/>
  <Rate>1pm</Rate>
  <UseEffectiveCount>false</UseEffectiveCount>
</PromptTokenLimit>

Example 4

Token limiting with default values.

In this example, prompt token limiting occurs within a single replica and is not distributed across multiple message processors in a region. The user prompt source default value is used: {jsonPath('$.messages',request.content,true)}

<PromptTokenLimit continueOnError="false" enabled="true" name="PTL-limitTokens-1">
  <DisplayName></DisplayName>
  <Properties/>
  <Identifier ref="messageid"/>
  <Rate>1pm</Rate>
  <UseEffectiveCount>false</UseEffectiveCount>
</PromptTokenLimit>

Child element reference

This section describes the child elements of <PromptTokenLimit>.

<DisplayName>

Use in addition to the name attribute to label the policy in the management UI proxy editor with a different, more natural-sounding name.

The <DisplayName> element is common to all policies.

Default Value N/A
Required? Optional. If you omit <DisplayName>, the value of the policy's name attribute is used.
Type String
Parent Element <PolicyElement>
Child Elements None

The <DisplayName> element uses the following syntax:

Syntax

<PolicyElement>
  <DisplayName>POLICY_DISPLAY_NAME</DisplayName>
  ...
</PolicyElement>

Example

<PolicyElement>
  <DisplayName>My Validation Policy</DisplayName>
</PolicyElement>

The <DisplayName> element has no attributes or child elements.

<Identifier>

Lets you choose how to group requests so that the PromptTokenLimit policy can be applied per client. For example, you can group requests by developer ID, in which case each developer's requests count against that developer's own rate limit rather than against a single limit for all requests to the proxy.

If you leave the <Identifier> element empty, one rate limit is enforced for all requests into that API proxy.

Default Value N/A
Required? Optional
Type String
Parent Element <PromptTokenLimit>
Child Elements None

Syntax

<PromptTokenLimit
  continueOnError="[false|true]"
  enabled="[true|false]"
  name="POLICY_NAME"
>
  <Identifier ref="FLOW_VARIABLE"/>
</PromptTokenLimit>

Example 1

The following example applies the PromptTokenLimit policy per developer ID:

<PromptTokenLimit name="PTL-limitTokens-1">
  <Identifier ref="developer.id"/>
  <Rate>42pm</Rate>
  <UseEffectiveCount>true</UseEffectiveCount>
</PromptTokenLimit>
      

The following table describes the attributes of <Identifier>:

Attribute Description Default Presence
ref Identifies the variable by which PromptTokenLimit groups incoming requests. You can use any flow variable to indicate a unique client, such as those available with the VerifyAPIKey policy. You can also set custom variables using the JavaScript policy or the AssignMessage policy. N/A Required

<Rate>

Specifies the rate at which to limit token spikes (or bursts) by setting the number of tokens that are allowed in per minute or per second intervals. You can use this element in conjunction with <Identifier> to smoothly throttle traffic at runtime by accepting values from the client. Use the <UseEffectiveCount> element to set the rate limiting algorithm used by the policy.

Default Value N/A
Required? Required
Type Integer
Parent Element <PromptTokenLimit>
Child Elements None

Syntax

You can specify rates in one of the following ways:

  • A static rate that you specify as the body of the <Rate> element
  • A variable value, which can be passed by the client; identify the name of the flow variable using the ref attribute
<PromptTokenLimit
    continueOnError="[false|true]"
    enabled="[true|false]"
    name="POLICY_NAME"
>
  <Rate ref="FLOW_VARIABLE">RATE[pm|ps]</Rate>
</PromptTokenLimit>

Valid rate values (either defined as a variable value or in the body of the element) must conform to the following format:

  • intps (number of tokens per second, smoothed into intervals of milliseconds)
  • intpm (number of tokens per minute, smoothed into intervals of seconds)

The value of int must be a positive, non-zero integer.

Example 1

The following example sets the rate to five tokens per second:

<PromptTokenLimit name="PTL-Static-5ps">
  <Rate>5ps</Rate>
  <UseEffectiveCount>false</UseEffectiveCount>
</PromptTokenLimit>
        

The policy smooths the rate to one token allowed every 200 milliseconds (1000/5).

Example 2

The following example sets the rate to 12 tokens per minute:

<PromptTokenLimit name="PTL-Static-12pm">
  <Rate>12pm</Rate>
  <UseEffectiveCount>false</UseEffectiveCount>
</PromptTokenLimit>

This example policy smooths the rate to one token allowed every five seconds (60/12).

The following table describes the attributes of <Rate>:

Attribute Description Presence Default
ref Identifies a flow variable that specifies the rate. This can be any flow variable, such as an HTTP query parameter, header, or message body content, or a value retrieved from a key value map (KVM). For more information, see Flow variables reference.

You can also use custom variables using the JavaScript policy or the AssignMessage policy.

If you define both ref and the body of this element, the value of ref is applied and takes precedence when the flow variable is set in the request. (The reverse is true when the variable identified in ref is not set in the request.)

For example:

<Rate ref="request.header.custom_rate">1pm</Rate>

In this example, if the client does not pass a custom_rate header, then the rate for the API proxy is 1 token per minute for all clients. If the client passes a custom_rate header whose value is 10ps, then the rate becomes 10 tokens per second for all clients on the proxy until a request without the custom_rate header is sent.

You can use <Identifier> to group requests to enforce custom rates for different types of clients, as shown in the example following this table.

If you specify a value for ref but do not set the rate in the body of the <Rate> element and the client does not pass a value, then the PromptTokenLimit policy throws an error.

Presence: Optional. Default: N/A.
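
Putting the ref attribute and <Identifier> grouping together, a configuration like the following lets each client supply its own rate while token counts are tracked per client ID. This is an illustrative sketch; the variable names client_id and custom_rate reuse the examples above:

<PromptTokenLimit name="PTL-per-client-rate">
  <UserPromptSource>{jsonPath('$.contents[-1].parts[-1].text',request.content,true)}</UserPromptSource>
  <Identifier ref="client_id"/>
  <Rate ref="request.header.custom_rate">10pm</Rate>
  <UseEffectiveCount>true</UseEffectiveCount>
</PromptTokenLimit>

Here, 10pm is the fallback rate used when a request does not carry a custom_rate header.
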
The following table describes the Rate attributes that define the traffic throttling behavior:
Attribute Description
tokensPerPeriod Specifies the number of tokens allowed within a defined period. For example, if a policy is configured for '10ps' (10 tokens per second), the tokensPerPeriod value would be 10.
periodInMicroseconds Defines the time period, in microseconds, over which the tokensPerPeriod is calculated. For a '10ps' configuration, this value would be 1,000,000, which is equivalent to one second.
maxBurstTokenCount Represents the maximum number of tokens that can be allowed instantly or in a short burst at the beginning of a new interval.

<UseEffectiveCount>

This element lets you choose between distinct PromptTokenLimit algorithms by setting the value to true or false, as explained below:

true

If set to true, PromptTokenLimit is distributed in a region. That means token counts are synchronized across message processors (MPs) in a region. In addition, a "sliding window" rate limiting algorithm is employed. This algorithm provides consistent rate limit behavior and does not "smooth" the number of incoming requests that can be sent to the backend. If a burst of requests is sent in a short time interval, they are allowed as long as they do not exceed the configured rate limit, as set in the <Rate> element. For example:

<PromptTokenLimit name="Prompt-Token-Limit-1">
  <Rate>12pm</Rate>
  <Identifier ref="client_id" />
  <UseEffectiveCount>true</UseEffectiveCount>
</PromptTokenLimit>

false (default)

If set to false (the default), the PromptTokenLimit policy uses a "token bucket" algorithm that smooths token spikes by dividing the rate limit that you specify into smaller intervals. A drawback of this approach is that multiple legitimate tokens coming in over a short time interval can potentially be denied.

For example, say you enter a rate of 30pm (30 tokens per minute). In testing, you might think you could send 30 tokens in 1 second, as long as they came within a minute. But that's not how the policy enforces the setting. If you think about it, 30 tokens inside a 1-second period could be considered a mini spike in some environments.

  • Per-minute rates get smoothed into tokens allowed in intervals of seconds.

    For example, 30pm gets smoothed like this:
    60 seconds (1 minute) / 30pm = 2-second intervals, or 1 token allowed every 2 seconds. A second token inside of 2 seconds will fail. Also, a 31st token within a minute will fail.

  • Per-second rates get smoothed into tokens allowed in intervals of milliseconds.

    For example, 10ps gets smoothed like this:
    1000 milliseconds (1 second) / 10ps = 100-millisecond intervals, or 1 token allowed every 100 milliseconds. A second token inside of 100ms will fail. Also, an 11th token within a second will fail.

Default Value False
Required? Optional
Type Boolean
Parent Element <PromptTokenLimit>
Child Elements None

The following table describes the attributes of the <UseEffectiveCount> element:

Attribute Description Default Presence
ref Identifies the variable that contains the value of <UseEffectiveCount>. This can be any flow variable, such as an HTTP query param, header, or message body content. For more information, see Flow variables reference. You can also set custom variables using the JavaScript policy or the AssignMessage policy. N/A Optional

<UserPromptSource>

Provides the source for retrieving the user prompt text, expressed as a message template.

The message template should resolve to a single value containing the user prompt text.

For example, {jsonPath('$.contents[-1].parts[-1].text',request.content,true)}.

Default Value N/A
Required? Optional
Type String
Parent Element <PromptTokenLimit>
Child Elements None

Syntax

<PromptTokenLimit
    continueOnError="[false|true]"
    enabled="[true|false]"
    name="POLICY_NAME">
  <UserPromptSource>{jsonPath('$.contents[-1].parts[-1].text',request.content,true)}</UserPromptSource>
</PromptTokenLimit>

Example 1

<PromptTokenLimit name="Prompt-Token-Limit-1">
  <UserPromptSource>{jsonPath('$.contents[-1].parts[-1].text',request.content,true)}</UserPromptSource>
</PromptTokenLimit>
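
For illustration, assuming a Gemini-style generateContent request body, the template above resolves to the text of the last part of the last entry in the contents array. A minimal, hypothetical request payload might look like this:

{
  "contents": [
    {
      "role": "user",
      "parts": [
        { "text": "Summarize this contract in three bullet points." }
      ]
    }
  ]
}

With this payload, the extracted user prompt is "Summarize this contract in three bullet points.", and its token count is evaluated against the configured <Rate>.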

<IgnoreUnresolvedVariables>

Determines whether processing stops when an unresolved variable is encountered.

Set to true to ignore unresolved variables and continue processing; otherwise false. The default value is false.

Default Value False
Required? Optional
Type Boolean
Parent Element <PromptTokenLimit>
Child Elements None

Syntax

<PromptTokenLimit
    continueOnError="[false|true]"
    enabled="[true|false]"
    name="POLICY_NAME">
  <IgnoreUnresolvedVariables>[true|false]</IgnoreUnresolvedVariables>
</PromptTokenLimit>

Example

<PromptTokenLimit name="Prompt-Token-Limit-1">
  <Rate>10ps</Rate>
  <IgnoreUnresolvedVariables>true</IgnoreUnresolvedVariables>
</PromptTokenLimit>

Flow variables

When a PromptTokenLimit policy executes, the following flow variables are populated:

Variable Type Permission Description
ratelimit.POLICY_NAME.failed Boolean Read-Only Indicates whether or not the policy failed (true or false).
ratelimit.POLICY_NAME.resolvedUserPrompt String Read-Only Returns the extracted user prompt.
ratelimit.POLICY_NAME.userPromptSource String Read-Only Returns the message template for the user prompt, as specified in the policy.
ratelimit.POLICY_NAME.userPromptTokenCount String Read-Only Returns the token count of the extracted user prompt.

For more information, see Flow variables reference.
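
For example, an AssignMessage policy attached to the response flow could surface the computed token count in a response header. This is a minimal sketch, not part of the PromptTokenLimit policy itself; the policy name PTL-limitTokens-1 and the header name X-Prompt-Token-Count are illustrative:

<AssignMessage continueOnError="false" enabled="true" name="AM-expose-token-count">
  <!-- Copies the token count computed by the PromptTokenLimit policy into a response header. -->
  <Set>
    <Headers>
      <Header name="X-Prompt-Token-Count">{ratelimit.PTL-limitTokens-1.userPromptTokenCount}</Header>
    </Headers>
  </Set>
  <AssignTo createNew="false" transport="http" type="response"/>
</AssignMessage>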

Error reference

This section describes the fault codes and error messages that are returned and fault variables that are set by Apigee when this policy triggers an error. You may also see SpikeArrest policy errors. This information is important to know if you are developing fault rules to handle faults. To learn more, see What you need to know about policy errors and Handling faults.

Runtime errors

These errors can occur when the policy executes.

Fault code HTTP status Apigee fault Cause Fix
policies.prompttokenlimit.FailedToExtractUserPrompt 400 FALSE Unable to extract the user prompt from the API request.
policies.prompttokenlimit.PromptTokenLimitViolation 429 FALSE PromptTokenLimit violation.
policies.prompttokenlimit.FailedToCalculateUserPromptTokens 500 TRUE Tokens cannot be calculated for the user prompt.

Deployment errors

These errors can occur when you deploy a proxy containing this policy.

Fault code HTTP status Apigee fault Cause Fix
policies.prompttokenlimit.MessageWeightNotSupported 500 FALSE MessageWeight is not supported for the PromptTokenLimit policy.

Fault variables

These variables are set when a runtime error occurs. For more information, see What you need to know about policy errors.

Variables Where Example
fault.name="fault_name" fault_name is the name of the fault, as listed in the Runtime errors table above. The fault name is the last part of the fault code. fault.name Matches "PromptTokenLimitViolation"
ratelimit.policy_name.failed policy_name is the user-specified name of the policy that threw the fault. ratelimit.PTL-PromptTokenLimitPolicy.failed = true

Example error response

Shown below is an example error response:

{  
   "fault":{  
      "detail":{  
         "errorcode":"policies.prompttokenlimit.PromptTokenLimitViolation"
      },
      "faultstring":"Prompt Token Limit Violation. Allowed rate : MessageRate{capacity=10, period=Minutes}"
   }
}

Example fault rule

Shown below is an example fault rule to handle a PromptTokenLimitViolation fault:

<FaultRules>
    <FaultRule name="Prompt Token Limit Errors">
        <Step>
            <Name>JavaScript-1</Name>
            <Condition>(fault.name Matches "PromptTokenLimitViolation") </Condition>
        </Step>
        <Condition>ratelimit.PTL-1.failed=true</Condition>
    </FaultRule>
</FaultRules>
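
The JavaScript-1 step above is not defined in this example. As one possible way to customize the error returned to the client, a RaiseFault policy could be used as the fault rule step instead; the following is a minimal, illustrative sketch (the policy name and payload are hypothetical):

<RaiseFault continueOnError="false" enabled="true" name="RF-prompt-token-limit">
  <!-- Returns a custom 429 response when the prompt token limit is exceeded. -->
  <FaultResponse>
    <Set>
      <StatusCode>429</StatusCode>
      <Payload contentType="application/json">{"error": "Prompt token limit exceeded. Try again later."}</Payload>
    </Set>
  </FaultResponse>
  <IgnoreUnresolvedVariables>true</IgnoreUnresolvedVariables>
</RaiseFault>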

The current HTTP status code for exceeding a rate limit set by an LLMTokenQuota or PromptTokenLimit policy is 429 (Too Many Requests).

Schemas

Each policy type is defined by an XML schema (.xsd). For reference, policy schemas are available on GitHub.

Related topics