Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Hello again!

Today I’m back with another post focusing on AOAI (Azure OpenAI Service). My focus falls into two buckets: operations and security. For this post I’m going to cover a topic that falls into the operations bucket.

Last year I covered some of the challenges that arise when tracking token usage when the need arises to use streaming-based ChatCompletions. The challenges center around logging the prompt, response, and token usage. The guidance I provide in that prior post is unchanged for logging prompts and completions, but capturing token usage has gotten much easier. Before I dig into the details, I want to very briefly cover why you should care about and track token usage.

The whole “AI is the new electricity” statement isn’t all hype. Your business units are going to want to experiment with it, especially generative AI, to for optimizing business processes such as shaving time off how long it takes a call center rep to resolve a customer’s problem or automating a portion of what is now a manual limited value-added activity of highly paid employees to free them up to focus on activities that drive more business value. As an organization, you’re going to be charged with providing these services to the developers, data scientists, an AI engineers. The demand will be significant and you gotta figure out a scalable way to provide these services while satisfying security, performance, and availability requirements.

This will typically drive an architecture where capacity for generative AI is pooled and distributed to your business units a core service. Acting as a control point to ensure security, availability, and performance requirements can be met, the architectural concept of a Generative AI Gateway is introduced. This component usually translates to Azure API Management, 3rd party API Gateway, or custom developed solution with “generative ai-specific” functionality layered on top (load balancing, rate limiting based on token usage, token usage tracking, prompt and response logging, caching of prompts and responses to reduce costs and latency, etc).

In Azure you might see a design like the image below where you’re distributing the requests across multiple AOAI instances spread across regions, geo-political boundaries, and subscriptions in order to maximize your quota (number of requests and tokens per model). When you have this type of architecture it’s important to get visibility into the token usage of each application for charge backs and to ensure everyone is getting their fair share of the capacity (i.e. rate limiting).

Example high-level architecture using AOAI

Now let’s align the token usage back to streaming ChatCompletions. With a non-streaming ChatCompletion the API automatically returns the number of prompt tokens, completion tokens, and total tokens that were consumed with the request. This information is easy to intercept at the Generative AI Gateway to use as an input for rate limiting or to pass on to some reporting system for charge backs on token usage.

Non-streaming ChatCompletion returning usage

When performing a streaming ChatCompletion the completion is returned in a series of server events (or chunks). Usage statistics were historically not provided in the response from the AOAI service to my understanding and experience. This forced the application developer or the owner of the Generative AI Gateway to incorporate some custom code using a Tokenizer like tiktoken to manually calculate the total number of tokens. An example of such a solution developed by one of my wonderful peers Shaun Callighan can be found here. This was one of the only (maybe the only?) to approach the problem at the time but sometimes resulted in slightly skewed results from what was estimated by the tokenizer to what the actual numbers were when processed by the AOAI service and billed to the customer.

Streaming ChatCompletion chunks of responses



Microsoft has made this easier with the introduction of the azure-openai-emit-token-metrics policy snippet for APIM (Azure API Management) which can emit token usage for both streaming and non-streaming completions (among other operations) to an App Insights instance. I talk through this at length in this post. However, at this time, it’s supported for a limited set of models and not every customer uses APIM. These customers have had to address the problem using a custom solution like I mentioned earlier.

Earlier this week I was mucking around with a simplistic ChatBot I’m building (FYI, Streamlit is an amazing framework to help build GUIs if you’re terrible at frontend design like I am) and I came across an additional parameter that can be passed when making a streaming ChatCompletion. You can pass an additional parameter called stream_options which will provide the token usage of the ChatCompletion in the the second to last chunk delivered back to the client. I’m not sure when this was introduced or how I missed it, but it removes the need to calculate this yourself with a tokenizer.

 response = client.chat.completions.create(
    model=deployment_name,
    messages= [
        {"role":"user",
         "content": message}
    ],
    max_tokens=200,
    stream=True,
    stream_options={
         "include_usage": True
        }
 )

Below you’ll see a sample response from a streaming ChatCompletion when including the stream_options property. In the chunk before the final chunk (there is a final check not visible in this image), the usage statistics are provided and can be extracted.

This provides a much better option than trying to calculate this yourself. I tested this with 3.5-turbo and 4o (both with text and images) and it gave me back the token usage as expected (I’m using API version 2024-02-01). I threw together some very simple (and if it’s coming from me it’s likely gonna be simple because my coding skills leave a lot to be desired) to capture these metrics and return them as part of the completion.

# Class to support completion and token usage
class ChatMessage:
    def __init__(self, full_response, prompt_tokens, completion_tokens, total_tokens):
        self.full_response = full_response
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens
        self.total_tokens = total_tokens

# Streaming chat completions
async def get_streaming_chat_completion(client, deployment_name, messages, max_tokens):
    response = client.chat.completions.create(
        model=deployment_name,
        messages=messages,
        max_tokens=max_tokens,
        stream=True,
        stream_options={
            "include_usage": True
        }
    )
    assistant_message = st.chat_message("assistant")
    full_response = ""

    with assistant_message:
        message_placeholder = st.empty()

    # Intialize token counts
    t_tokens = 0
    c_tokens = 0
    p_tokens = 0
    usage_dict = None


    for chunk in response:
        if chunk.usage:
            usage_dict = chunk.usage
            if p_tokens == 0:
                p_tokens = usage_dict.prompt_tokens
                c_tokens = usage_dict.completion_tokens
                t_tokens = usage_dict.total_tokens

        if hasattr(chunk, 'choices') and chunk.choices:
            content = chunk.choices[0].delta.content
            if content is not None:
                full_response += content
                message_placeholder.markdown(full_response)

    if full_response == "":
        full_response = "Sorry, I was unable to generate a response."

    return ChatMessage(full_response, p_tokens, c_tokens, t_tokens)

For those of using APIM as a Generative AI Gateway, you won’t have to worry about this for most of the OpenAI models offered through AOAI because the policy snippet I mentioned earlier will be improved to support additional models beyond what it supports today. For those of you using third-party gateways, this is likely relevant and may help to simplify your code and eliminate the discrepancies you see from calculating token usage yourself vs what you’re seeing displayed within the AOAI instance.

Well folks, this post was short and sweet. Hopefully this small tidbit of information helps a few folks out there who were going the tokenizer route. Any simplification these days is welcome!

Azure AI Studio – Chat Playground and API Management

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Hello again folks!

Today, I’m going to be posting my first post in a series on Azure AI Studio. I’ll let the true AI professionals give you the gory details and features of the service. The way my small brain thinks of the service is a platform built on top of AML (Azure Machine Learning) to make building applications that use Generative AI more developer-friendly. You can build and test applications, deploy third-party models, and organize applications into “projects” which can be secured to a specific project team but share resources across an organization via the concept of a hub. I’ll cover more on those pieces in a future blog post, but for today I want to focus on a pattern I was messing around that I think would be appealing to most folks.

One of the neat features of AI Studio is the Chat Playground. The Chat Playground is a web interface for interacting with models you have deployed to Azure AI Studio. You can send prompts and receive completions, adjust parameters such as temperature, and even get a code sample of the code being run by the web interface. The models that can e deployed include OpenAI models deployed to an AOAI (Azure OpenAI Service) instance or third-party models like Meta’s Llama deployed to a serverless endpoint or self-managed compute (in AML called managed online endpoint). For the purposes of this post I’m going to be focusing on OpenAI models deployed to an AOAI instance.

Azure AI Studio Chat Playground

You’re probably looking at this and thinking, “Yeah that is cool… a similar functionality exists in Azure OpenAI Studio and it does the same thing.” That’s correct it does, but for many organizations using the Azure OpenAI Studio’s Chat Playground isn’t an option for a number of different reasons both operational and security-related.

From an operational perspective, the Azure OpenAI Studio’s Chat Playground is designed to communicate directly with the endpoint for an AOAI instance. As I’ve covered in previous posts, this can be problematic. One reason is you’re limited to the quota within the instance which could cause you to hit limits quickly if you direct a whole ton of users to it. Typically, you will load balance across multiple instances deployed to multiple regions across multiple subscriptions as I discuss in my post on load balancing AOAI. The other problem is dealing with internal chargebacks. If I have multiple BUs (business units) hammering away at an instance, I don’t have any easy to determine who which folks in what BU consumed what. While metrics are token usage are captured in the metrics streamed from an instance, there is no way to associate that usage with an individual.

On the security side, communicating directly with the AOAI instance means I can’t review the prompts and responses being sent and received by the service. Many regulated organizations have requirements for these to be captured for review to ensure the service is being used appropriately and sensitive data isn’t being sent that hasn’t been approved to be sent. Additionally, availability of the AOAI instance could be affected by one user going nuts and consuming the full quota.

The challenges outlined above have driven many customers to insert a control point. The industry seems determined to coin this architectural component a Gen AI Gateway so I’ll play along. For you fellow old folks, all a Gen AI Gateway really is an API Gateway with some Gen AI-related features slapped on top of it. It sits between the front-facing user application and the models processing the prompts and responses. The GenAI-specific features available within the gateway help to address the operational and security challenges I’ve outlined above. If you’re curious about the specifics on this, you can check out my post on load balancing, logging, tracking token usage, rate limiting, and extracting useful information from the conversation such as prompts and responses.

Example design and process flow of a Gen AI Gateway

In the image above I’ve included an example of how APIM (Azure API Management) could be used to provide such functionality. Within the customer base I work with at Microsoft, many customers have built something that functions similar to what you see above. A design like this helps to address the operational and security challenges I’ve outlined above.

Wonderful right? Now what the **** does this have to do with AI Studio’s Chat Playground? Well, unlike the Azure OpenAI Studio’s Chat Playground, AI Studio’s offering does support modifying the endpoint to point to your generative AI gateway. How you do this isn’t super intuitive, but it does work. Whether you go this route is totally up to you. Ok, disclaimer is done, let’s talk about how you do this.

One thing to understand about using AI Studio’s Chat Playground is it works the same way that Azure OpenAI Studio’s version works in regards to where the TCP connections are sourced from when making calls to the model. As can be seen in the Fiddler capture below, the TCP connections made when you submit a prompt from the Chat Playground are sourced from the user’s endpoint.

Fiddler capture showing Chat Completion coming from user endpoint

This makes our life much easier because we likely control the path that user’s packet takes and the DNS the user uses which means we can direct that user’s packet to a Gen AI Gateway. For the purposes of this post, my goal is to funnel these prompts and completions through an APIM instance I have in place which has some APIM policy snippets that do some checks and balances and a call a small app (based off an awesome solution assembled by my buddy Shaun Callighan) which logs prompts and responses and calculates token metrics. The data processed by the app are then sent to an Event Hub, processed by Stream Analytics, and dumped into CosmosDB.

APIM between Chat Playground and AOAI

When you want to connect to an AOAI instance from AI Studio’s Chat Playground you add it as a connection. These connections can created at the hub level (think of this as a logical container for the projects) and then shared across projects. When adding the connection you can browse for the instance you want to connect to or enter manually.

Adding a connection to an AOAI instance

If you were to do that you won’t be able to create a deployment of a model or access a deployment of a model deployed in the instances behind it. This is because AI Studio is making calls to the Azure management plane to enumerate the deployments within the instance. Since there isn’t an AOAI with the hostname of your AOAI instance, you’ll be unable to add deployments or pick a deployment from the Chat Playground.

To work around this, you need to add a connection to one of your AOAI instances. This will be your “stub” instance that we’ll modify the endpoint of to point to API Management. If you’re load balancing across multiple AOAI instances behind APIM, you need to ensure that you’ve already created your model deployments and you’ve named them consistently across all of the AOAI instances you’re load balancing to. In the image below, I modify the endpoint to point to my APIM instance. The azure-openai-log-helper path is added to send it to a specific API I have setup on APIM that handles logging. For your environment, you’ll likely just need the hostname.

Modifying the endpoint name

Now before you go running and trying to use the Chat Playground, you’ll have to make a change to the APIM policy. Since the user’s browser is being told to make the call to this endpoint from a different domain (AI Studio’s domain) we need to ensure there is a CORS policy in place on the APIM instance to allow for this, otherwise it will be blocked by APIM. If you forget about this policy you’ll get a back a 200 from the APIM instance but nothing will be in the response.

Your CORS policy could look like the below:

        <cors>
            <allowed-origins>
                <origin>https://ai.azure.com/</origin>
                <origin>https://ai.azure.com</origin>
            </allowed-origins>
            <allowed-methods preflight-result-max-age="300">
                <method>POST</method>
                <method>OPTIONS</method>
            </allowed-methods>
            <allowed-headers>
                <header>authorization</header>
                <header>content-type</header>
                <header>request-id</header>
                <header>traceparent</header>
                <header>x-ms-client-request-id</header>
                <header>x-ms-useragent</header>
            </allowed-headers>
        </cors>

Once you’ve modified your APIM policy with the CORS update, you’ll be good to go! Your requests will now flow through APIM for all the GenAI Gateway goodness.

Chat Completion from AI Studio Chat Playground flowing through APIM

When messing with this I ran into a few things I want to call out:

  1. Do not forget the CORS policy. If you run into a 200 response from APIM with no content, it’s probably the CORS snippet.
  2. If you have a validate-jwt snippet in your APIM policy that includes validating the claim includes cognitivesservices, remove that. The claim passed by AI Studio includes a trailing forward slash which won’t likely match what you get back if you’re using the MSAL library in code. You could certainly include some logic to handle it, but honestly the security benefit is so little from checking the claim just make it easy on yourself and remove the check for the claim. Keep the check that validate-jwt snippet but restrict it to checking the tenant ID in the token.
  3. Chat Playground will pass the content property as the prompt as an array (this is the more modern approach to allow for multi-modal models like GPT-4o which can handle images and audio). If you have an APIM policy in place to parse the request body and extract information you’ll need to update it to also handle when content is passed as an array.
  4. Chat Playground allows for the user to submit an image along with text in the prompt. Ensure your APIM policy is capable of handling prompts like that. Dealing with human users being able to submit images to an LLM and ensuring you’re reviewing that image for DLP and calculating token consumption for streaming Chat Completion is a whole other blog topic that I’m not going to do today. Key thing is you want to account for that. Block images or ensure your policy is capable of handling it if you’re deploying 4o or 4 Vision.

Well folks that sums up this post. I realize this solution is a bit funky, and I’m not gonna tell you to use it. I’m simply putting it out there as an option if you have a business need strong enough to provide a ChatGPT-style solution but don’t have the bandwidth or time to whip up your own application.

Enjoy!

Azure OpenAI Service – Tracking Token Usage with APIM

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Yeah, yeah, yeah, I missed posting in July. I have been appropriately shamed on a daily basis by WordPress reminders.

I’m going to make up for it today by covering another of the “Generative AI Gateway” features of APIM (Azure API Management) that were announced a few months back. I’ve already covered the circuit breaker and load balancing and the token-based rate limiting features. These two features have made it far easier to distribute and control the usage of the AOAI (Azure OpenAI Service) that is being offered as a core enterprise service. One of the challenges that isn’t addressed by those features is charge backs.

As I’ve covered in prior posts, you can get away with an instance or two of AOAI dedicated to an app when you have one or two applications at the POC (proof-of-concept) stage. Capacity and charge back isn’t an issue in that model. However, your volume of applications will grow as well as the capacity of tokens and requests those applications require as they move to production. This necessitates AOAI being offered as a core foundational service as basic as DNS or networking. The patterns for doing this involve centrally distributing requests across several instances of AOAI spread across different regions and subscriptions using a feature like the circuit breaker and load balancing features of APIM. Once you have several applications drawing from a common pool, you then need to control how much each of those applications can consume using a feature like the token-based rate limiting feature of APIM.

Common way to scale AOAI service

Wonderful! You’ve built a service that has significant capacity and can service your BUs from a central endpoint. Very cool, but how are you gonna determine who is consuming what volume?

You may think, “That information is returned in the response. I can have the developers use a common code snippet to send that information for each response to a central database where I can track it.” Yeah nah, that ain’t gonna work. First, you ain’t ever gonna get that level of consistency across your enterprise (if you do have this, drop me an email because I want to work there). Second, as of today, the APIs do not return the number of tokens used for streaming based chat completions which will be a large majority of what is being sent to the models.

I know you, and you’re determined. You follow-up with, “Well Matt, I’m simply going to pull the native metrics from each of the AOAI instances I’m load balancing to.” Well yeah, you could do that but guess what? Those only show you the total consumed across the instance and do not provide a dimension for you to determine how much of that total was related to a specific application.

Native metrics and its dimensions for an instance of AOAI

“Well Matt, I’m going to configure diagnostic logging for each of my AOAI instances and check off the Request and Response Logs. Surely that information will be in there!”. You don’t quit do you? Let me shatter your hopes yet again, no that will not work. As I’ve covered in a prior post while the logs do contain the Entra ID object ID (assuming you used Entra ID-based authentication) you won’t find any token counts in those logs either.

AOAI Request and Response Logs

Well fine then, you’re going to use a custom logging solution to capture token usage when it’s returned by the API and calculate it when it isn’t. While yes this does work and does provide a number of additional benefits beyond information for charge backs (and I’m a fan of this pattern) it takes some custom code development and some APIM policy snippet expertise. What if there was an easier way?

That is where the token metrics feature of APIM really shines. This feature allows you to configure APIM to emit a custom metric for the tokens consumed by a Completion, Chat Completion (EVEN STREAMING!!), or Embeddings API call to an AOAI backend with a very basic APIM Policy snippet. You can even add custom dimensions and that is where this feature gets really powerful.

The first step in setting this up is to spin up an instance of Application Insights (if your APIM isn’t already hooked into one) and a Log Analytics Workspace the Application Insights instance will be associated with. Once your App Insights instance is created, you need to modify the settings API in APIM you’ve defined for AOAI and turn on the App Insights integration and enable custom metrics as seen below.

Enable custom metrics in APIM

Next up, you need to modify your APIM policy. In the APIM Policy snippet below I extra a few pieces of data from the request and add them as dimensions to the custom metric. Here I’m extracting the Entra ID app id of security principal accessing the AOAI service (this would be the application’s identity if you’re using Entra ID authentication to the AOAI service) and the model deployment name being called from AOAI which I’ve standardized to be the same as the model name.

         <!-- Extract the application id from the Entra ID access token -->

        <set-variable name="appId" value="@(context.Request.Headers.GetValueOrDefault("Authorization",string.Empty).Split(' ').Last().AsJwt().Claims.GetValueOrDefault("appid", string.Empty))" />

        <!-- Extract the model name from the URL -->

        <set-variable name="uriPath" value="@(context.Request.OriginalUrl.Path)" />
        <set-variable name="deploymentName" value="@(System.Text.RegularExpressions.Regex.Match((string)context.Variables["uriPath"], "/deployments/([^/]+)").Groups[1].Value)" />

        <!-- Emit token metrics to Application Insights -->

        <azure-openai-emit-token-metric namespace="openai-metrics">
            <dimension name="model" value="@(context.Variables.GetValueOrDefault<string>("deploymentName","None"))" />
            <dimension name="client_ip" value="@(context.Request.IpAddress)" />
            <dimension name="appId" value="@(context.Variables.GetValueOrDefault<string>("appId","00000000-0000-0000-0000-000000000000"))" />
        </azure-openai-emit-token-metric>

After making a few calls from my code to APIM, the metrics begin to populate in the App Insights instance. To view those metrics you’ll want to go into the App Insights blade and go to the Monitoring -> Metrics section. Under the Metrics Namespace drop down you’ll see the namespace you’ve created in the policy snippet. I named mine openai-metrics.

Accessing custom metrics in App Insights for token metrics

I can now select metrics based on prompt tokens, completion tokens, and total tokens consumed. Here I select the completion tokens and split the data by the appId, client IP address, and model to give me a view of how many tokens each app is consuming and of which model at any given time span.

Metrics split by dimensions

Very cool right?

As of today, there are some key limitations to be aware of:

  1. Only Chat Completions, Completions, and Embedding API operations are supported today.
  2. Each API operation is further limited by which models it supports. For example, as of August 2024, Chat Completions only supports gpt-3.5 and gpt-4. No 4o support yet unfortunately.
  3. If you’re using a load balanced pool backend, you can’t yet use the actual backend the pool send the request to as a dimension.

Well folks, hopefully this helps you better understand why this functionality was added and the value it provides. While you could do this with another API Gateway (pick your favorite), it likely won’t be as simple as it it with APIM’s policy snippet. Another win for cloud native I guess!

Thanks!

Azure OpenAI Service – How To Handle Rate Limiting

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Updates:

  • 10/29/2024 – Microsoft has announced a deployment option referred to as a data zone (https://azure.microsoft.com/en-us/blog/accelerate-scale-with-azure-openai-service-provisioned-offering/). Data zones can be thought of as data sovereignty boundaries incorporated into the existing global deployment option. This will significantly ease load balancing so you will no longer need to deploy individual regional instances and can instead deploy a single instance with a data zone deployment within a single subscription. As you hit the cap for TPM/RPM within that subscription, you can then repeat the process with a new subscription and load balance across the two. This will result in fewer backends and a more simple load balancing setup.

Another week, another AOAI (Azure OpenAI Service) post. Today, I’m going to continue to discuss the new “Generative AI Gateway”-type features released to APIM (Azure API Management). In my last post I covered the new built-in load balancing and circuit breaker feature. For this post I’m going to talk about the new token-based rate limiting feature and rate limiting in general. Put on your nerd cap and caffeinate because we’re going to be analyzing some Fiddler captures.

The Basics

When talking rate limiting for AOAI it’s helpful to understand how an instance natively handles subscription service limits. There are a number of limits to be aware, but the most relevant to this conversation are the regional quota limits. Each Azure Subscription gets a certain quota of tokens per minute and request per minute for each model in a given region. That regional quota is shared among all the AOAI instances you provision with the model within that subscription in that given region. When you exhaust your quota for a region, you scan scale by requesting quota (good luck with that), create a new instance in another region in the same subscription, create a new instance in the same region in a different subscription, or going the provisioned throughput option.

Representation of regional quotas

In October 2024, Microsoft introduced the concept of a data zone deployment. Data zones address the compliance issues that came with global deployments. In a global deployment the prompt can be sent and serviced by the AOAI service in any region across the globe. For customers in regulated industries, this was largely a no go due to data sovereignty requirements. The new data zone deployment type allows you to pool AOAI capacity within a subscription across all regions within a given geopolitical boundary. As of October 2024, this supports two data zones including the US and EU.

Global and Data Zone Deployments

With each AOAI instance you provision in a subscription, you’ll be able to adjust the quota of the deployment of a particular model for that instance. Each AOAI instance you create will share the total quota available. If you have a use case where you need multiple AOAI instances, like for example making each its own authorization boundary with Azure RBAC for the purposes of separating different fine-tuned models and training data, each instance will draw from that total subscription-wide regional quota. Note that the more TPM (tokens per minute) you give the instance the higher RPM (requests per minute, 1K TPM = 6 RPM).

Adjusting quota for a specific instance of AOAI

Alright, so you get the basics of quota so now let’s talk about what rate limiting looks like from the application’s point of view. I’ll first walk through how things work when contacting the AOAI instance directly and then I’ll cover how things work when APIM sits in the middle (YMMV on this one if you’re using another type of “Generative AI Gateway”).

Direct Connectivity to Azure OpenAI Instance

Here I’ve set the model deployment to a rate limit of 50K TPM which gives me a limit of 300 RPM. I’ll be contacting the AOAI instance directly without any “Generative AI Gateway” component between my code and the AOAI instance. I’m using the Python openai SDK version 1.14.3.

I’ll be using this simple function to make Chat Completion calls to GPT3.5 Turbo.

Code being used in this post

Let’s dig into the response from the AOAI service.

Response headers from direct connectivity to AOAI

The headers relevant to the topic at hand are x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens.

The x-rate-limit-remaining-requests header tells you have many responses you have left before you’ll be rate limited for requests. There’s a few interesting things about this header. First, it always starts decrementing from 1/1000 of whatever the TPM. In my testing 50K TPM starts with 50 requests, 10K TPM with 10 requests. The Portal says 300 RPM at 50K TPM, so it’s odd that the response header shows something different and far less than what I’d expect. I also noticed that each corresponding request will decrease the x-ratelimit-remaining-tokens but will not necessarily reduce the x-rate-limit-remaining-requests header. Good example is at 1K TPM (which gives you 6 requests according to the Portal but gives me 1 RPM according to this header) would tell me I had zero requests left after my first request but wouldn’t always throttle me. Either there’s additional logic being executed to determine when to rate limit based on request or it’s simply inaccurate. My guess is the former, but I’m not sure.

The next header is the x-ratelimit-remaining-tokens which does match the TPM you set for the deployment. The functionality is pretty straightforward, but it’s important to understand how the max_tokens parameter Chat Completions and the like can affect it. In my example above, I ask the model to say hello which uses around 20 total tokens across the prompt and completion. When I set the max_tokens parameter to 100 the x-ratelimit-remaining-tokens is reduced by 100 even though I’ve only used 20 tokens. What you want to take from that is be careful with what you set in your max_tokens parameter because you can very easily exhaust your quota on an specific AOAI instance. I believe consideration holds true in both pay-as-you-go and PTU SKUs.

When you hit a limit and begin to get rate limited, you’ll get a message similar to what you see below with the policy-id header telling you which limit you hit (token or requests) and the Retry-After header telling you how long you’re rate limited. If you’re using the openai SDK (I can only speak for Python) the retry logic within the library will kick off.

Let me dig into that a bit.

Rate limited by AOAI instance

The retry logic for the openai SDK for Python is in the openai/lib/_base_client.py file. It’s handled by a few different functions including _parse_retry_after_header, _calculate_retry_timeout, and _should_retry. I’ll save the you the gooey details and give you the highlights. In each response the SDK looks for the retry-after-ms and retry-after headers. If either is found it looks to see if the value is less than 60 seconds. If it’s greater than 60 seconds, it ignores the value and executes its own logic which starts at around 1 to 2 seconds and increases up to 8 seconds for a maximum of 2 retries by default (constants used for much of the calculations are located in openai/lib/_constants.py). The defaults should be good for most instances but you can certainly tweak the max retries if it’s not sufficient. While the retry logic is very straightforward in the instance of hitting the AOAI instance directly, you will see some interesting behavior when APIM is added.

Throttling and APIM

I’ve talked ad-nauseam about why you’d want to place APIM in between your applications and the AOAI instance. To save myself some typing these are some of the key reasons:

  1. Load balancing across multiple AOAI instances spread across regions spread across subscriptions to maximize model quota.
  2. Capturing operational and security information such as metrics for response times, token usage for chargebacks, and prompts and responses for security review or caching to reduce costs.

In the olden days (two months ago) customers struggled to limit specific applications to a certain amount of token usage. Using APIM’s request limiting wasn’t very helpful because the metric we care most about with GenAI is tokens, not requests. Customers came up with creative solutions to distribute applications to different sets of AOAI instances, but it was difficult to manage at scale. I can’t count the number of times I heard “How do I throttle based upon token usage in APIM?” and I was stuck giving the customer the bad news it wasn’t possible without extremely convoluted PeeWee Herman Breakfast Machine-type solutions.

Microsoft heard the customer pain and introduced a new APIM policy for rate limiting based on token usage. This new policy allows you to rate limit an application based on a counter key you specify. APIM will then limit the application if it pushes beyond the TPM you specify. This allows you to move away from the dedicated AOAI instance pattern you may have been trying to use to solve this problem and into a design where you position a whole bunch of AOAI instances behind APIM and load balance across them using the new load balancer and circuit breaker capabilities of APIM relying upon this new policy to control consumption.

Now that you get the sales pitch, take a look at the options available for the policy snippet.

Below you’ll see a section from my APIM policy. In this section I’m setting up the token rate limiting feature. The counter key I’m using in this scenario is the appid property I’ve extracted from the Entra ID access token. I’m a huge proponent of blocking API key access to AOAI instances and instead using Entra ID based authentication under the context of the application’s identity for the obvious reasons.


        <!-- Enforce token usage limits -->
        <azure-openai-token-limit counter-key="@(context.Variables.GetValueOrDefault<string>("appId","00000000-0000-0000-0000-000000000000"))" estimate-prompt-tokens="true" tokens-per-minute="1000" remaining-tokens-header-name="x-apim-remaining-token" tokens-consumed-header-name="x-apim-tokens-consumed" />
        <set-backend-service backend-id="backend_pool_aoai" />
        

I’ve also set the estimate-prompt-tokens property to true. The docs state this could cause some performance impact, so you’ll want to test that on and off in your own environment. It’s worth noting that APIM will always estimate prompt tokens if a streaming completion is being used whether or not you’ve set this option to true. Next, I’m setting a custom header name for both the remaining-tokens header and tokens-consumed headers. This will ensure these headers are returned to the client and they’re uniquely identifiable such that they couldn’t be confused with the headers natively returned by AOAI instance behind the scenes.

Notice I didn’t modify the name of the retry header. Recall that the openai SDK looks for retry-after so if you modify this header you won’t get the benefit of the SDK’s retry logic. My advice is keep this as the default.

When I send the Chat Completion request to APIM, I get back the response headers below which includes the two new headers x-apim-remaining-tokens and x-apim-tokens-consumed which show my request consumed 22 of the 1K TPM I’ve been allotted. Notice how this is keeping track of exact number of tokens being used vs how the service natively will feed of the max_tokens parameter which is a nice improvement.

Once I exhaust my 1K TPM, I’m hit with a 429 and a retry-after header. The SDK will execute its retry logic and wait the amount of time in the retry logic. This is why you shouldn’t muck with the header name.

Very cool right? You are now saved from a convoluted solution of dedicated AOAI instances or an insanely complex APIM policy snippet.

Before I close this out I want to show one more interesting “feature” I ran into when I was testing. In my environment I’m using a load balanced pool backend where I have 4 AOAI instances stretched across multiple regions and a circuit breaker to bounce to temporarily remove pool members if they 429. When I was doing testing for this post I noticed an interesting behavior of APIM when one of the pool members begins to 429.

In the image below I purposely went over the AOAI instance backend quota to trigger the pool member’s rate limiting. Notice how I receive a 429 with the Retry-After is set to 86,400 seconds which is the number of seconds in a day. It seems like the load balanced pool will shoot this value back when a pool member 429s. Recall again the behavior or the openai SDK which ignores retry-after greater than 60 second. This means the SDK will execute its own shorter timer making for a quick retry. Whether the PG designed this with the openai SDK behavior’s in mind, I don’t know, but it worked out well either way.

Rate limited by AOAI instance behind a load balancing APIM backend

That about completes this post. Your key takeaways today are:

  • If you’re using request rate limiting or something more convoluted like dedicating AOAI instances to handle rate limiting across applications, plan to move to using the token-based rate limiting policy in APIM.
  • Be careful with what you’re setting the max_tokens parameter to when you call the models because setting too high can trigger the AOAI instance rate limiting even though you haven’t exhausted the TPM set in the token rate limiting APIM policy.
  • Don’t mess with the retry-after header in your token rate limiting policy if you’re using the openai SDK. If you do you’ll have to come up with your own retry logic.
  • Ensure you set the remaining-tokens-header-name and tokens-consumed-header-name so it’s easily identified which rate limit is affecting an application.
  • Be aware that in my testing the tokens-consumed returned by the token rate limiting policy didn’t account for completion tokens when it was a streaming Chat Completion. You’ll still need to be creative to calculating streaming token usage for chargeback.