Use the comparison chart below to help you decide which policy to use
for your rate-limiting use case:
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2026-01-06 UTC."],[],[]]
The chart compares the Quota, SpikeArrest, LLMTokenQuota, and PromptTokenLimit policies.
Use it to:

Quota: Limit the number of API proxy calls a developer app or developer can make over a specific period of time. It's best for rate limiting over longer time intervals like days, weeks, or months, especially when accurate counting is a requirement.

SpikeArrest: Limit the number of API calls that can be made against an API proxy across all consumers over a short period of time, such as seconds or minutes.

LLMTokenQuota: Manage and limit the total token consumption for LLM API calls over a specified period (minute, hour, day, week, or month). This lets you control LLM expenditures and apply granular quota management based on API products.

PromptTokenLimit: Protect your API proxy's target backend against token abuse, massive prompts, and potential denial-of-service attempts by limiting the rate of input tokens: requests are throttled based on the number of tokens in the user's prompt message. It is the token-level counterpart of SpikeArrest, which plays the same role for API traffic.
Don't use it to:

Quota: Protect your API proxy's target backend against traffic spikes. Use SpikeArrest or PromptTokenLimit for that.

SpikeArrest: Count and limit the number of connections apps can make to your API proxy's target backend over a specific period of time, especially when accurate counting is required.

LLMTokenQuota: Protect your API proxy's target backend against token abuse. Use PromptTokenLimit for that.

PromptTokenLimit: Accurately count and limit the total number of tokens consumed for billing or long-term quota management. Use the LLMTokenQuota policy for that.
Stores a count?

Quota: Yes.

SpikeArrest: No.

LLMTokenQuota: Yes, it maintains counters that track the number of tokens consumed by LLM responses.

PromptTokenLimit: It counts tokens to enforce a rate limit but does not store a persistent, long-term count like the LLMTokenQuota policy does.
Best practices for attaching the policy:

Quota: Attach it to the ProxyEndpoint Request PreFlow, generally after the authentication of the user. This enables the policy to check the quota counter at the entry point of your API proxy (see the attachment sketch after this list).

SpikeArrest: Attach it to the ProxyEndpoint Request PreFlow, generally at the very beginning of the flow. This provides spike protection at the entry point of your API proxy.

LLMTokenQuota: Apply the enforcement policy (EnforceOnly) in the request flow and the counting policy (CountOnly) in the response flow. For streaming responses, attach the counting policy to an EventFlow.

PromptTokenLimit: Attach it to the ProxyEndpoint Request PreFlow, at the beginning of the flow, to protect your backend from oversized prompts.
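As an illustration, a ProxyEndpoint Request PreFlow that authenticates the caller and then checks a quota could look like the following sketch. The step names (Verify-API-Key, Quota-1) are placeholders for whatever your policies are called, and the other ProxyEndpoint elements are omitted:

<ProxyEndpoint name="default">
  <PreFlow name="PreFlow">
    <Request>
      <!-- Authenticate the caller first... -->
      <Step>
        <Name>Verify-API-Key</Name>
      </Step>
      <!-- ...then check the quota counter at the entry point of the proxy. -->
      <Step>
        <Name>Quota-1</Name>
      </Step>
    </Request>
  </PreFlow>
</ProxyEndpoint>

A SpikeArrest or PromptTokenLimit step would instead go at the very beginning of the same Request PreFlow, as described above.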
HTTP status code when the limit has been reached:

All four policies return 429 (Too Many Requests).
Good to know:

Quota: The Quota counter is stored in Cassandra. You can configure the policy to synchronize the counter asynchronously to save resources, but this may allow calls slightly in excess of the limit.
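A minimal sketch of such a configuration, with illustrative limits and the asynchronous synchronization just described (check the Quota policy reference for exact defaults and minimum values):

<Quota name="Quota-1">
  <!-- Allow 10,000 calls per month per identifier. -->
  <Allow count="10000"/>
  <Interval>1</Interval>
  <TimeUnit>month</TimeUnit>
  <!-- Count per developer app, using the client_id set during authentication. -->
  <Identifier ref="client_id"/>
  <!-- Share the counter across message processors, but synchronize it
       asynchronously; this saves resources but may let a few calls
       exceed the limit. -->
  <Distributed>true</Distributed>
  <Synchronous>false</Synchronous>
  <AsynchronousConfiguration>
    <SyncIntervalInSeconds>20</SyncIntervalInSeconds>
  </AsynchronousConfiguration>
</Quota>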
SpikeArrest: Lets you choose between a smoothing algorithm or an effective count algorithm. The former smooths the number of requests that can occur in a specified period of time, and the latter limits the total number of requests that can occur within a specified time period, no matter how rapidly they are sent in succession. Smoothing is not coordinated across Message Processors.
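For example, a sketch that limits traffic to roughly 30 requests per second per client; the values are illustrative, and <UseEffectiveCount> is what switches between the two algorithms:

<SpikeArrest name="SpikeArrest-1">
  <!-- About 30 requests per second; a "pm" suffix expresses a per-minute rate. -->
  <Rate>30ps</Rate>
  <!-- Optional: apply the rate per client rather than across all traffic. -->
  <Identifier ref="client_id"/>
  <!-- true = effective count (total requests per interval);
       false = smoothing, which spaces requests evenly but is not
       coordinated across Message Processors. -->
  <UseEffectiveCount>true</UseEffectiveCount>
</SpikeArrest>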
LLMTokenQuota: Can be configured as CountOnly to track token usage or EnforceOnly to reject requests that exceed the quota. It works with API products to allow for granular quota configurations based on the app, developer, model, or a specific LLM operation set. Uses <LLMTokenUsageSource> to extract the token count from the LLM response and <LLMModelSource> to identify the model used.
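A rough sketch of a counting configuration is shown below. Only <LLMTokenUsageSource>, <LLMModelSource>, and the CountOnly/EnforceOnly modes come from the description above; the remaining element names and the values are assumptions, so consult the LLMTokenQuota policy reference for the actual schema:

<LLMTokenQuota name="LLMTokenQuota-Count">
  <!-- Assumed element: switches between CountOnly and EnforceOnly. -->
  <Mode>CountOnly</Mode>
  <!-- Where to read the number of tokens consumed, from the LLM response
       (placeholder value). -->
  <LLMTokenUsageSource>response.content</LLMTokenUsageSource>
  <!-- Where to read the model that served the request (placeholder value). -->
  <LLMModelSource>request.content</LLMModelSource>
</LLMTokenQuota>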
PromptTokenLimit: The token calculation might differ slightly from the one used by the LLM. The <UserPromptSource> element specifies the location of the user prompt in the request message.
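A similarly hedged sketch for PromptTokenLimit: apart from <UserPromptSource>, which the chart describes, the element names and values here are assumptions rather than the documented schema, so check the policy reference before using them:

<PromptTokenLimit name="PromptTokenLimit-1">
  <!-- Where to find the user prompt in the request message (placeholder value). -->
  <UserPromptSource>request.content</UserPromptSource>
  <!-- Assumed element: the maximum number of prompt tokens accepted per request. -->
  <TokenLimit>1000</TokenLimit>
</PromptTokenLimit>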