Comparing rate-limiting policies

This page applies to Apigee and Apigee hybrid.

Use the comparison chart below to help you decide which policy to use for your rate-limiting use case:

Use it to:
  • Quota: Limit the number of API proxy calls a developer app or developer can make over a specific period of time. It's best for rate limiting over longer time intervals, such as days, weeks, or months, especially when accurate counting is a requirement.
  • SpikeArrest: Limit the number of API calls that can be made against an API proxy across all consumers over a short period of time, such as seconds or minutes.
  • LLMTokenQuota: Manage and limit the total token consumption for LLM API calls over a specified period (minute, hour, day, week, or month). This lets you control LLM expenditures and apply granular quota management based on API products.
  • PromptTokenLimit: Protect your API proxy's target backend against token abuse, oversized prompts, and potential denial-of-service attempts by throttling requests based on the number of tokens in the user's prompt message. It is the token-based counterpart of SpikeArrest, which provides the same protection for API traffic.

Don't use it to:
  • Quota: Protect your API proxy's target backend against traffic spikes. Use SpikeArrest or PromptTokenLimit for that.
  • SpikeArrest: Count and limit the number of connections apps can make to your API proxy's target backend over a specific period of time, especially when accurate counting is required. Use Quota for that.
  • LLMTokenQuota: Protect your API proxy's target backend against token abuse. Use PromptTokenLimit for that.
  • PromptTokenLimit: Accurately count and limit the total number of tokens consumed for billing or long-term quota management. Use the LLMTokenQuota policy for that.

Stores a count?
  • Quota: Yes.
  • SpikeArrest: No.
  • LLMTokenQuota: Yes. It maintains counters that track the number of tokens consumed by LLM responses.
  • PromptTokenLimit: No. It counts tokens to enforce a rate limit but does not store a persistent, long-term count like the LLMTokenQuota policy.

Best practices for attaching the policy:
  • Quota: Attach it to the ProxyEndpoint Request PreFlow, generally after user authentication. This enables the policy to check the quota counter at the entry point of your API proxy (see the sketch after this list).
  • SpikeArrest: Attach it to the ProxyEndpoint Request PreFlow, generally at the very beginning of the flow. This provides spike protection at the entry point of your API proxy.
  • LLMTokenQuota: Apply the enforcement policy (EnforceOnly) in the request flow and the counting policy (CountOnly) in the response flow. For streaming responses, attach the counting policy to an EventFlow.
  • PromptTokenLimit: Attach it to the ProxyEndpoint Request PreFlow, at the beginning of the flow, to protect your backend from oversized prompts.
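
Attachment works the same way for all four policies: a Step in the ProxyEndpoint Request PreFlow. Here is a minimal sketch of the Quota and SpikeArrest placement described above; the policy names (Spike-Arrest, Verify-API-Key, Check-Quota) are hypothetical:

```xml
<!-- Illustrative ProxyEndpoint fragment; policy names are placeholders. -->
<ProxyEndpoint name="default">
  <PreFlow name="PreFlow">
    <Request>
      <!-- Spike protection first, at the very beginning of the flow. -->
      <Step><Name>Spike-Arrest</Name></Step>
      <!-- Authenticate the caller before checking its quota. -->
      <Step><Name>Verify-API-Key</Name></Step>
      <Step><Name>Check-Quota</Name></Step>
    </Request>
  </PreFlow>
  <HTTPProxyConnection>
    <BasePath>/v1/example</BasePath>
  </HTTPProxyConnection>
  <RouteRule name="default">
    <TargetEndpoint>default</TargetEndpoint>
  </RouteRule>
</ProxyEndpoint>
```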

HTTP status code when the limit has been reached: 429 (Too Many Requests) for all four policies.
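
When a limit is reached, the policy raises a fault that you can intercept to customize the 429 response. A hedged sketch using the Quota and SpikeArrest fault names; AM-Limit-Exceeded is a hypothetical AssignMessage policy that builds the response body:

```xml
<!-- Goes in the ProxyEndpoint; AM-Limit-Exceeded is a placeholder name. -->
<FaultRules>
  <FaultRule name="rate-limit-exceeded">
    <Condition>(fault.name = "QuotaViolation") or (fault.name = "SpikeArrestViolation")</Condition>
    <Step><Name>AM-Limit-Exceeded</Name></Step>
  </FaultRule>
</FaultRules>
```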

Good to know:

Quota:
  • The Quota counter is stored in Cassandra.
  • You can configure the policy to synchronize the counter asynchronously to save resources, but this may allow calls slightly in excess of the limit.
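
As a concrete illustration of the asynchronous counter behavior, here is a minimal Quota configuration; the count, interval, and identifier reference are placeholder values:

```xml
<Quota name="Check-Quota">
  <!-- Allow 10,000 calls per calendar month per identifier. -->
  <Allow count="10000"/>
  <Interval>1</Interval>
  <TimeUnit>month</TimeUnit>
  <!-- Count per developer app (assumes an API key was verified upstream). -->
  <Identifier ref="client_id"/>
  <!-- Distributed counter; asynchronous synchronization saves resources
       but may allow calls slightly in excess of the limit. -->
  <Distributed>true</Distributed>
  <Synchronous>false</Synchronous>
  <AsynchronousConfiguration>
    <SyncIntervalInSeconds>20</SyncIntervalInSeconds>
  </AsynchronousConfiguration>
</Quota>
```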

SpikeArrest:
  • Lets you choose between a smoothing algorithm and an effective count algorithm. The former smooths the number of requests that can occur in a specified period of time; the latter limits the total number of requests that can occur within a specified time period, no matter how rapidly they are sent in succession.
  • Smoothing is not coordinated across Message Processors.
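
A minimal SpikeArrest sketch showing the choice between the two algorithms; the rate and identifier are placeholder values:

```xml
<SpikeArrest name="Spike-Arrest">
  <!-- Allow at most 30 requests per second. -->
  <Rate>30ps</Rate>
  <!-- Optional: enforce the limit per client rather than globally. -->
  <Identifier ref="client_id"/>
  <!-- true = effective count algorithm (hard cap per window);
       false = smoothing algorithm (spreads requests evenly, but is
       not coordinated across Message Processors). -->
  <UseEffectiveCount>true</UseEffectiveCount>
</SpikeArrest>
```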

LLMTokenQuota:
  • Can be configured as CountOnly to track token usage or EnforceOnly to reject requests that exceed the quota.
  • Works with API products to allow for granular quota configurations based on the app, developer, model, or a specific LLM operation set.
  • Uses <LLMTokenUsageSource> to extract the token count from the LLM response and <LLMModelSource> to identify the model used.
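
A hedged sketch of the CountOnly/EnforceOnly pairing described above. Apart from <LLMTokenUsageSource> and <LLMModelSource>, which this chart mentions, the element names and values below are assumptions; consult the LLMTokenQuota policy reference for the exact schema:

```xml
<!-- Request flow: enforcement only; rejects the call once the quota
     for the matching API product is exhausted. -->
<LLMTokenQuota name="LTQ-Enforce">
  <Mode>EnforceOnly</Mode>  <!-- assumed element name -->
</LLMTokenQuota>

<!-- Response flow (or an EventFlow for streaming responses): counts
     the tokens the LLM reported consuming. -->
<LLMTokenQuota name="LTQ-Count">
  <Mode>CountOnly</Mode>  <!-- assumed element name -->
  <!-- Placeholder values: where to read token usage and the model name. -->
  <LLMTokenUsageSource>response.content</LLMTokenUsageSource>
  <LLMModelSource>request.header.x-llm-model</LLMModelSource>
</LLMTokenQuota>
```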

PromptTokenLimit:
  • The token calculation might differ slightly from the one used by the LLM.
  • The <UserPromptSource> element specifies the location of the user prompt in the request message.
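
A hedged sketch of a prompt-size guard. <UserPromptSource> comes from the chart above; the other element names and values are assumptions, modeled on SpikeArrest's <Rate>; consult the PromptTokenLimit policy reference for the exact schema:

```xml
<PromptTokenLimit name="PTL-Protect-Backend">
  <!-- Location of the user prompt in the request message (placeholder value). -->
  <UserPromptSource>request.content</UserPromptSource>
  <!-- Assumed element, analogous to SpikeArrest's <Rate>: throttle
       requests once prompts exceed this token rate. -->
  <Rate>10000pm</Rate>
</PromptTokenLimit>
```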

Get more details:
  • Quota policy
  • SpikeArrest policy
  • LLMTokenQuota policy
  • PromptTokenLimit policy