June | 2024 | Journey Of The Geek

This is part of my series on GenAI Services in Azure:

Updates:

10/29/2024 – Microsoft has announced a deployment option referred to as a data zone (https://azure.microsoft.com/en-us/blog/accelerate-scale-with-azure-openai-service-provisioned-offering/). Data zones can be thought of as data sovereignty boundaries incorporated into the existing global deployment option. This will significantly ease load balancing so you will no longer need to deploy individual regional instances and can instead deploy a single instance with a data zone deployment within a single subscription. As you hit the cap for TPM/RPM within that subscription, you can then repeat the process with a new subscription and load balance across the two. This will result in fewer backends and a more simple load balancing setup.

Another week, another AOAI (Azure OpenAI Service) post. Today, I’m going to continue to discuss the new “Generative AI Gateway”-type features released to APIM (Azure API Management). In my last post I covered the new built-in load balancing and circuit breaker feature. For this post I’m going to talk about the new token-based rate limiting feature and rate limiting in general. Put on your nerd cap and caffeinate because we’re going to be analyzing some Fiddler captures.

The Basics

When talking rate limiting for AOAI it’s helpful to understand how an instance natively handles subscription service limits. There are a number of limits to be aware, but the most relevant to this conversation are the regional quota limits. Each Azure Subscription gets a certain quota of tokens per minute and request per minute for each model in a given region. That regional quota is shared among all the AOAI instances you provision with the model within that subscription in that given region. When you exhaust your quota for a region, you scan scale by requesting quota (good luck with that), create a new instance in another region in the same subscription, create a new instance in the same region in a different subscription, or going the provisioned throughput option.

In October 2024, Microsoft introduced the concept of a data zone deployment. Data zones address the compliance issues that came with global deployments. In a global deployment the prompt can be sent and serviced by the AOAI service in any region across the globe. For customers in regulated industries, this was largely a no go due to data sovereignty requirements. The new data zone deployment type allows you to pool AOAI capacity within a subscription across all regions within a given geopolitical boundary. As of October 2024, this supports two data zones including the US and EU.

With each AOAI instance you provision in a subscription, you’ll be able to adjust the quota of the deployment of a particular model for that instance. Each AOAI instance you create will share the total quota available. If you have a use case where you need multiple AOAI instances, like for example making each its own authorization boundary with Azure RBAC for the purposes of separating different fine-tuned models and training data, each instance will draw from that total subscription-wide regional quota. Note that the more TPM (tokens per minute) you give the instance the higher RPM (requests per minute, 1K TPM = 6 RPM).

**Adjusting quota for a specific instance of AOAI**

Alright, so you get the basics of quota so now let’s talk about what rate limiting looks like from the application’s point of view. I’ll first walk through how things work when contacting the AOAI instance directly and then I’ll cover how things work when APIM sits in the middle (YMMV on this one if you’re using another type of “Generative AI Gateway”).

Direct Connectivity to Azure OpenAI Instance

Here I’ve set the model deployment to a rate limit of 50K TPM which gives me a limit of 300 RPM. I’ll be contacting the AOAI instance directly without any “Generative AI Gateway” component between my code and the AOAI instance. I’m using the Python openai SDK version 1.14.3.

I’ll be using this simple function to make Chat Completion calls to GPT3.5 Turbo.

Let’s dig into the response from the AOAI service.

**Response headers from direct connectivity to AOAI**

The headers relevant to the topic at hand are x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens.

The x-rate-limit-remaining-requests header tells you have many responses you have left before you’ll be rate limited for requests. There’s a few interesting things about this header. First, it always starts decrementing from 1/1000 of whatever the TPM. In my testing 50K TPM starts with 50 requests, 10K TPM with 10 requests. The Portal says 300 RPM at 50K TPM, so it’s odd that the response header shows something different and far less than what I’d expect. I also noticed that each corresponding request will decrease the x-ratelimit-remaining-tokens but will not necessarily reduce the x-rate-limit-remaining-requests header. Good example is at 1K TPM (which gives you 6 requests according to the Portal but gives me 1 RPM according to this header) would tell me I had zero requests left after my first request but wouldn’t always throttle me. Either there’s additional logic being executed to determine when to rate limit based on request or it’s simply inaccurate. My guess is the former, but I’m not sure.

The next header is the x-ratelimit-remaining-tokens which does match the TPM you set for the deployment. The functionality is pretty straightforward, but it’s important to understand how the max_tokens parameter Chat Completions and the like can affect it. In my example above, I ask the model to say hello which uses around 20 total tokens across the prompt and completion. When I set the max_tokens parameter to 100 the x-ratelimit-remaining-tokens is reduced by 100 even though I’ve only used 20 tokens. What you want to take from that is be careful with what you set in your max_tokens parameter because you can very easily exhaust your quota on an specific AOAI instance. I believe consideration holds true in both pay-as-you-go and PTU SKUs.

When you hit a limit and begin to get rate limited, you’ll get a message similar to what you see below with the policy-id header telling you which limit you hit (token or requests) and the Retry-After header telling you how long you’re rate limited. If you’re using the openai SDK (I can only speak for Python) the retry logic within the library will kick off.

Let me dig into that a bit.

The retry logic for the openai SDK for Python is in the openai/lib/_base_client.py file. It’s handled by a few different functions including _parse_retry_after_header, _calculate_retry_timeout, and _should_retry. I’ll save the you the gooey details and give you the highlights. In each response the SDK looks for the retry-after-ms and retry-after headers. If either is found it looks to see if the value is less than 60 seconds. If it’s greater than 60 seconds, it ignores the value and executes its own logic which starts at around 1 to 2 seconds and increases up to 8 seconds for a maximum of 2 retries by default (constants used for much of the calculations are located in openai/lib/_constants.py). The defaults should be good for most instances but you can certainly tweak the max retries if it’s not sufficient. While the retry logic is very straightforward in the instance of hitting the AOAI instance directly, you will see some interesting behavior when APIM is added.

Throttling and APIM

I’ve talked ad-nauseam about why you’d want to place APIM in between your applications and the AOAI instance. To save myself some typing these are some of the key reasons:

Load balancing across multiple AOAI instances spread across regions spread across subscriptions to maximize model quota.
Capturing operational and security information such as metrics for response times, token usage for chargebacks, and prompts and responses for security review or caching to reduce costs.

In the olden days (two months ago) customers struggled to limit specific applications to a certain amount of token usage. Using APIM’s request limiting wasn’t very helpful because the metric we care most about with GenAI is tokens, not requests. Customers came up with creative solutions to distribute applications to different sets of AOAI instances, but it was difficult to manage at scale. I can’t count the number of times I heard “How do I throttle based upon token usage in APIM?” and I was stuck giving the customer the bad news it wasn’t possible without extremely convoluted PeeWee Herman Breakfast Machine-type solutions.

Microsoft heard the customer pain and introduced a new APIM policy for rate limiting based on token usage. This new policy allows you to rate limit an application based on a counter key you specify. APIM will then limit the application if it pushes beyond the TPM you specify. This allows you to move away from the dedicated AOAI instance pattern you may have been trying to use to solve this problem and into a design where you position a whole bunch of AOAI instances behind APIM and load balance across them using the new load balancer and circuit breaker capabilities of APIM relying upon this new policy to control consumption.

Now that you get the sales pitch, take a look at the options available for the policy snippet.

Below you’ll see a section from my APIM policy. In this section I’m setting up the token rate limiting feature. The counter key I’m using in this scenario is the appid property I’ve extracted from the Entra ID access token. I’m a huge proponent of blocking API key access to AOAI instances and instead using Entra ID based authentication under the context of the application’s identity for the obvious reasons.


        <!-- Enforce token usage limits -->
        <azure-openai-token-limit counter-key="@(context.Variables.GetValueOrDefault<string>("appId","00000000-0000-0000-0000-000000000000"))" estimate-prompt-tokens="true" tokens-per-minute="1000" remaining-tokens-header-name="x-apim-remaining-token" tokens-consumed-header-name="x-apim-tokens-consumed" />
        <set-backend-service backend-id="backend_pool_aoai" />

I’ve also set the estimate-prompt-tokens property to true. The docs state this could cause some performance impact, so you’ll want to test that on and off in your own environment. It’s worth noting that APIM will always estimate prompt tokens if a streaming completion is being used whether or not you’ve set this option to true. Next, I’m setting a custom header name for both the remaining-tokens header and tokens-consumed headers. This will ensure these headers are returned to the client and they’re uniquely identifiable such that they couldn’t be confused with the headers natively returned by AOAI instance behind the scenes.

Notice I didn’t modify the name of the retry header. Recall that the openai SDK looks for retry-after so if you modify this header you won’t get the benefit of the SDK’s retry logic. My advice is keep this as the default.

When I send the Chat Completion request to APIM, I get back the response headers below which includes the two new headers x-apim-remaining-tokens and x-apim-tokens-consumed which show my request consumed 22 of the 1K TPM I’ve been allotted. Notice how this is keeping track of exact number of tokens being used vs how the service natively will feed of the max_tokens parameter which is a nice improvement.

Once I exhaust my 1K TPM, I’m hit with a 429 and a retry-after header. The SDK will execute its retry logic and wait the amount of time in the retry logic. This is why you shouldn’t muck with the header name.

Very cool right? You are now saved from a convoluted solution of dedicated AOAI instances or an insanely complex APIM policy snippet.

Before I close this out I want to show one more interesting “feature” I ran into when I was testing. In my environment I’m using a load balanced pool backend where I have 4 AOAI instances stretched across multiple regions and a circuit breaker to bounce to temporarily remove pool members if they 429. When I was doing testing for this post I noticed an interesting behavior of APIM when one of the pool members begins to 429.

In the image below I purposely went over the AOAI instance backend quota to trigger the pool member’s rate limiting. Notice how I receive a 429 with the Retry-After is set to 86,400 seconds which is the number of seconds in a day. It seems like the load balanced pool will shoot this value back when a pool member 429s. Recall again the behavior or the openai SDK which ignores retry-after greater than 60 second. This means the SDK will execute its own shorter timer making for a quick retry. Whether the PG designed this with the openai SDK behavior’s in mind, I don’t know, but it worked out well either way.

**Rate limited by AOAI instance behind a load balancing APIM backend**

That about completes this post. Your key takeaways today are:

If you’re using request rate limiting or something more convoluted like dedicating AOAI instances to handle rate limiting across applications, plan to move to using the token-based rate limiting policy in APIM.
Be careful with what you’re setting the max_tokens parameter to when you call the models because setting too high can trigger the AOAI instance rate limiting even though you haven’t exhausted the TPM set in the token rate limiting APIM policy.
Don’t mess with the retry-after header in your token rate limiting policy if you’re using the openai SDK. If you do you’ll have to come up with your own retry logic.
Ensure you set the remaining-tokens-header-name and tokens-consumed-header-name so it’s easily identified which rate limit is affecting an application.
Be aware that in my testing the tokens-consumed returned by the token rate limiting policy didn’t account for completion tokens when it was a streaming Chat Completion. You’ll still need to be creative to calculating streaming token usage for chargeback.

Journey Of The Geek

The chronicles of a Bostonian tech geek navigating through life and technology

Monthly Archives: June 2024