Azure OpenAI Service – How To Handle Rate Limiting

Posted on June 19, 2024 by mattfeltonma

This is part of my series on GenAI Services in Azure:

Updates:

10/29/2024 – Microsoft has announced a deployment option referred to as a data zone (https://azure.microsoft.com/en-us/blog/accelerate-scale-with-azure-openai-service-provisioned-offering/). Data zones can be thought of as data sovereignty boundaries incorporated into the existing global deployment option. This will significantly ease load balancing: you will no longer need to deploy individual regional instances and can instead deploy a single instance with a data zone deployment within a single subscription. As you hit the cap for TPM/RPM within that subscription, you can then repeat the process with a new subscription and load balance across the two. This will result in fewer backends and a simpler load balancing setup.

Another week, another AOAI (Azure OpenAI Service) post. Today, I’m going to continue to discuss the new “Generative AI Gateway”-type features released to APIM (Azure API Management). In my last post I covered the new built-in load balancing and circuit breaker feature. For this post I’m going to talk about the new token-based rate limiting feature and rate limiting in general. Put on your nerd cap and caffeinate because we’re going to be analyzing some Fiddler captures.

The Basics

When talking rate limiting for AOAI it’s helpful to understand how an instance natively handles subscription service limits. There are a number of limits to be aware of, but the most relevant to this conversation are the regional quota limits. Each Azure Subscription gets a certain quota of tokens per minute and requests per minute for each model in a given region. That regional quota is shared among all the AOAI instances you provision with the model within that subscription in that given region. When you exhaust your quota for a region, you can scale by requesting more quota (good luck with that), creating a new instance in another region in the same subscription, creating a new instance in the same region in a different subscription, or going with the provisioned throughput option.

In October 2024, Microsoft introduced the concept of a data zone deployment. Data zones address the compliance issues that came with global deployments. In a global deployment the prompt can be sent to and serviced by the AOAI service in any region across the globe. For customers in regulated industries, this was largely a no go due to data sovereignty requirements. The new data zone deployment type allows you to pool AOAI capacity within a subscription across all regions within a given geopolitical boundary. As of October 2024, this supports two data zones: the US and the EU.

With each AOAI instance you provision in a subscription, you’ll be able to adjust the quota of the deployment of a particular model for that instance. Each AOAI instance you create shares the total quota available. If you have a use case where you need multiple AOAI instances, for example making each its own authorization boundary with Azure RBAC for the purposes of separating different fine-tuned models and training data, each instance will draw from that total subscription-wide regional quota. Note that the more TPM (tokens per minute) you give the instance, the higher its RPM (requests per minute) limit, at a ratio of 6 RPM per 1K TPM.
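To make the TPM-to-RPM relationship concrete, here is a minimal sketch of the arithmetic, assuming the 6 RPM per 1K TPM ratio mentioned above holds for your model. The function name `rpm_from_tpm` is hypothetical and only for illustration.

```python
def rpm_from_tpm(tokens_per_minute: int, rpm_per_1k_tpm: int = 6) -> int:
    """Return the requests-per-minute limit implied by a TPM allocation.

    Assumes the ratio quoted in the post (6 RPM for every 1K TPM).
    """
    return (tokens_per_minute // 1000) * rpm_per_1k_tpm


for tpm in (1_000, 10_000, 50_000):
    print(f"{tpm:>6} TPM -> {rpm_from_tpm(tpm)} RPM")
```

For a 50K TPM deployment this gives 300 RPM, which matches what the Portal reports for that quota.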

**Adjusting quota for a specific instance of AOAI**

Alright, so you get the basics of quota so now let’s talk about what rate limiting looks like from the application’s point of view. I’ll first walk through how things work when contacting the AOAI instance directly and then I’ll cover how things work when APIM sits in the middle (YMMV on this one if you’re using another type of “Generative AI Gateway”).

Direct Connectivity to Azure OpenAI Instance

Here I’ve set the model deployment to a rate limit of 50K TPM which gives me a limit of 300 RPM. I’ll be contacting the AOAI instance directly without any “Generative AI Gateway” component between my code and the AOAI instance. I’m using the Python openai SDK version 1.14.3.

I’ll be using this simple function to make Chat Completion calls to GPT3.5 Turbo.

Let’s dig into the response from the AOAI service.

**Response headers from direct connectivity to AOAI**

The headers relevant to the topic at hand are x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens.

The x-ratelimit-remaining-requests header tells you how many requests you have left before you’ll be rate limited for requests. There are a few interesting things about this header. First, it always starts decrementing from 1/1000 of whatever the TPM is set to. In my testing 50K TPM starts with 50 requests, 10K TPM with 10 requests. The Portal says 300 RPM at 50K TPM, so it’s odd that the response header shows something different and far less than what I’d expect. I also noticed that each request will decrease the x-ratelimit-remaining-tokens header but will not necessarily reduce the x-ratelimit-remaining-requests header. A good example is at 1K TPM (which gives you 6 RPM according to the Portal but 1 request according to this header), where the header would tell me I had zero requests left after my first request but the service wouldn’t always throttle me. Either there’s additional logic being executed to determine when to rate limit based on requests or the header is simply inaccurate. My guess is the former, but I’m not sure.

The next header is x-ratelimit-remaining-tokens, which does match the TPM you set for the deployment. The functionality is pretty straightforward, but it’s important to understand how the max_tokens parameter of Chat Completions and the like can affect it. In my example above, I ask the model to say hello, which uses around 20 total tokens across the prompt and completion. When I set the max_tokens parameter to 100, the x-ratelimit-remaining-tokens value is reduced by 100 even though I’ve only used 20 tokens. What you want to take from that is to be careful with what you set in your max_tokens parameter because you can very easily exhaust your quota on a specific AOAI instance. I believe this consideration holds true in both pay-as-you-go and PTU SKUs.
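A rough back-of-the-envelope sketch shows why this matters. It assumes, per my observation above, that each call is charged its max_tokens value rather than what it actually used; the real accounting in the service may differ, and the RPM limit is ignored here. The function name is hypothetical.

```python
def calls_per_minute(tpm: int, tokens_charged_per_call: int) -> int:
    """How many calls fit into one minute of token quota at a given per-call charge."""
    return tpm // tokens_charged_per_call


TPM = 50_000
ACTUAL_TOKENS = 20  # roughly what the "say hello" call really used

print("charged on actual usage:", calls_per_minute(TPM, ACTUAL_TOKENS))
for max_tokens in (100, 1_000, 4_000):
    # Assumption: the quota is decremented by max_tokens, not by actual usage.
    print(f"charged on max_tokens={max_tokens}:", calls_per_minute(TPM, max_tokens))
```

Under that assumption, the same 20-token call fits 2,500 times into the quota when charged on actual usage but only 500 times with max_tokens set to 100.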

When you hit a limit and begin to get rate limited, you’ll get a message similar to what you see below with the policy-id header telling you which limit you hit (token or requests) and the Retry-After header telling you how long you’re rate limited. If you’re using the openai SDK (I can only speak for Python) the retry logic within the library will kick off.

Let me dig into that a bit.

The retry logic for the openai SDK for Python is in the openai/_base_client.py file. It’s handled by a few different functions including _parse_retry_after_header, _calculate_retry_timeout, and _should_retry. I’ll spare you the gooey details and give you the highlights. In each response the SDK looks for the retry-after-ms and retry-after headers. If either is found, it checks whether the value is 60 seconds or less and, if so, waits that long. If the value is greater than 60 seconds, it ignores the value and executes its own backoff logic, which in my testing starts at around 1 to 2 seconds and increases up to 8 seconds, for a maximum of 2 retries by default (constants used for much of the calculations are located in openai/_constants.py). The defaults should be good for most instances, but you can certainly tweak the max retries if they’re not sufficient. While the retry logic is very straightforward when hitting the AOAI instance directly, you will see some interesting behavior when APIM is added.
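The decision described above can be sketched in a few lines. This is a simplified illustration of the behavior, not the SDK’s actual code: the function names are hypothetical, the starting delay is a placeholder taken from what I observed rather than read from the SDK’s constants, header keys are assumed to be lowercase, and the jitter the SDK applies is left out to keep the example deterministic.

```python
from typing import Mapping, Optional

MAX_HONORED_RETRY_AFTER = 60.0  # seconds; longer server hints are ignored
INITIAL_DELAY = 1.0             # placeholder, check the SDK constants for the real value
MAX_DELAY = 8.0                 # backoff ceiling described in the post


def parse_retry_after(headers: Mapping[str, str]) -> Optional[float]:
    """Return the server-requested wait in seconds, or None if absent or unparseable."""
    for name, divisor in (("retry-after-ms", 1000.0), ("retry-after", 1.0)):
        raw = headers.get(name)
        if raw is None:
            continue
        try:
            return float(raw) / divisor
        except ValueError:
            continue  # e.g. an HTTP date, which this sketch does not handle
    return None


def retry_delay(headers: Mapping[str, str], attempt: int) -> float:
    """Pick the wait before retry number `attempt` (0-based)."""
    requested = parse_retry_after(headers)
    if requested is not None and 0 < requested <= MAX_HONORED_RETRY_AFTER:
        return requested  # reasonable hint from the server: honor it
    # Otherwise fall back to capped exponential backoff.
    return min(INITIAL_DELAY * 2 ** attempt, MAX_DELAY)


print(retry_delay({"retry-after": "12"}, attempt=0))     # honors the header
print(retry_delay({"retry-after": "86400"}, attempt=0))  # ignores it, uses backoff
```

The second call is the interesting one: a day-long retry-after is treated as unreasonable, so the client falls back to its own short backoff.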

Throttling and APIM

I’ve talked ad nauseam about why you’d want to place APIM in between your applications and the AOAI instance. To save myself some typing, these are some of the key reasons:

Load balancing across multiple AOAI instances spread across regions spread across subscriptions to maximize model quota.
Capturing operational and security information such as metrics for response times, token usage for chargebacks, and prompts and responses for security review or caching to reduce costs.

In the olden days (two months ago) customers struggled to limit specific applications to a certain amount of token usage. Using APIM’s request limiting wasn’t very helpful because the metric we care most about with GenAI is tokens, not requests. Customers came up with creative solutions to distribute applications to different sets of AOAI instances, but it was difficult to manage at scale. I can’t count the number of times I heard “How do I throttle based upon token usage in APIM?” and I was stuck giving the customer the bad news it wasn’t possible without extremely convoluted PeeWee Herman Breakfast Machine-type solutions.

Microsoft heard the customer pain and introduced a new APIM policy for rate limiting based on token usage. This new policy allows you to rate limit an application based on a counter key you specify. APIM will then limit the application if it pushes beyond the TPM you specify. This allows you to move away from the dedicated AOAI instance pattern you may have been trying to use to solve this problem. Instead, you position a whole bunch of AOAI instances behind APIM, load balance across them using the new load balancer and circuit breaker capabilities of APIM, and rely upon this new policy to control consumption.

Now that you get the sales pitch, take a look at the options available for the policy snippet.

Below you’ll see a section from my APIM policy. In this section I’m setting up the token rate limiting feature. The counter key I’m using in this scenario is the appid property I’ve extracted from the Entra ID access token. I’m a huge proponent of blocking API key access to AOAI instances and instead using Entra ID based authentication under the context of the application’s identity for the obvious reasons.


        <!-- Enforce token usage limits -->
        <azure-openai-token-limit counter-key="@(context.Variables.GetValueOrDefault<string>("appId","00000000-0000-0000-0000-000000000000"))" estimate-prompt-tokens="true" tokens-per-minute="1000" remaining-tokens-header-name="x-apim-remaining-token" tokens-consumed-header-name="x-apim-tokens-consumed" />
        <set-backend-service backend-id="backend_pool_aoai" />

I’ve also set the estimate-prompt-tokens property to true. The docs state this could cause some performance impact, so you’ll want to test that on and off in your own environment. It’s worth noting that APIM will always estimate prompt tokens if a streaming completion is being used whether or not you’ve set this option to true. Next, I’m setting a custom header name for both the remaining-tokens and tokens-consumed headers. This will ensure these headers are returned to the client and that they’re uniquely identifiable so they can’t be confused with the headers natively returned by the AOAI instance behind the scenes.

Notice I didn’t modify the name of the retry header. Recall that the openai SDK looks for retry-after so if you modify this header you won’t get the benefit of the SDK’s retry logic. My advice is keep this as the default.

When I send the Chat Completion request to APIM, I get back the response headers below, which include the two new headers x-apim-remaining-tokens and x-apim-tokens-consumed. They show my request consumed 22 of the 1K TPM I’ve been allotted. Notice how this keeps track of the exact number of tokens being used, versus how the service natively feeds off the max_tokens parameter, which is a nice improvement.

Once I exhaust my 1K TPM, I’m hit with a 429 and a retry-after header. The SDK will execute its retry logic and wait the amount of time given in the retry-after header. This is why you shouldn’t muck with the header name.
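If it helps to picture what a per-key token limit does, here is a conceptual sketch of a fixed one-minute window limiter keyed by a counter key. It is an assumption-laden toy, not how APIM implements the policy: the class name is hypothetical, the window algorithm is my own choice, and the header names simply mirror the ones used in this post.

```python
import math
import time
from typing import Callable, Dict, Tuple


class TokenWindowLimiter:
    """Toy fixed-window token limiter, tracked separately per counter key."""

    def __init__(self, tokens_per_minute: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.tokens_per_minute = tokens_per_minute
        self.clock = clock
        # counter key -> (window start time, tokens used in that window)
        self._windows: Dict[str, Tuple[float, int]] = {}

    def check(self, counter_key: str, tokens: int) -> Tuple[int, Dict[str, str]]:
        """Return (status code, response headers) for a request costing `tokens`."""
        now = self.clock()
        start, used = self._windows.get(counter_key, (now, 0))
        if now - start >= 60:
            start, used = now, 0  # a new one-minute window begins
        if used + tokens > self.tokens_per_minute:
            self._windows[counter_key] = (start, used)
            wait = max(1, math.ceil(60 - (now - start)))
            return 429, {"retry-after": str(wait)}
        used += tokens
        self._windows[counter_key] = (start, used)
        return 200, {
            "x-apim-tokens-consumed": str(tokens),
            "x-apim-remaining-tokens": str(self.tokens_per_minute - used),
        }


limiter = TokenWindowLimiter(tokens_per_minute=1000)
print(limiter.check("app-a", 22))    # 200, 978 tokens remaining for app-a
print(limiter.check("app-a", 1000))  # 429 with a retry-after header
print(limiter.check("app-b", 1000))  # 200, app-b has its own counter
```

The point of the counter key is visible in the last call: one application exhausting its budget has no effect on another application with a different key.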

Very cool right? You are now saved from a convoluted solution of dedicated AOAI instances or an insanely complex APIM policy snippet.

Before I close this out I want to show one more interesting “feature” I ran into when I was testing. In my environment I’m using a load balanced pool backend where I have 4 AOAI instances stretched across multiple regions and a circuit breaker to temporarily remove pool members if they 429. When I was doing testing for this post I noticed an interesting behavior of APIM when one of the pool members begins to 429.

In the image below I purposely went over the AOAI instance backend quota to trigger the pool member’s rate limiting. Notice how I receive a 429 with Retry-After set to 86,400 seconds, which is the number of seconds in a day. It seems like the load balanced pool will shoot this value back when a pool member 429s. Recall again the behavior of the openai SDK, which ignores retry-after values greater than 60 seconds. This means the SDK will execute its own shorter timer, making for a quick retry. Whether the PG designed this with the openai SDK’s behavior in mind, I don’t know, but it worked out well either way.

**Rate limited by AOAI instance behind a load balancing APIM backend**

That about completes this post. Your key takeaways today are:

If you’re using request rate limiting or something more convoluted like dedicating AOAI instances to handle rate limiting across applications, plan to move to using the token-based rate limiting policy in APIM.
Be careful with what you’re setting the max_tokens parameter to when you call the models because setting too high can trigger the AOAI instance rate limiting even though you haven’t exhausted the TPM set in the token rate limiting APIM policy.
Don’t mess with the retry-after header in your token rate limiting policy if you’re using the openai SDK. If you do you’ll have to come up with your own retry logic.
Ensure you set the remaining-tokens-header-name and tokens-consumed-header-name so it’s easily identified which rate limit is affecting an application.
Be aware that in my testing the tokens-consumed returned by the token rate limiting policy didn’t account for completion tokens when it was a streaming Chat Completion. You’ll still need to be creative in calculating streaming token usage for chargeback.
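As a small illustration of why distinct header names help, the sketch below pulls the remaining budgets for both layers out of a set of response headers. The function name is hypothetical, and the gateway header name is an assumption that must match whatever you set in remaining-tokens-header-name on your own policy.

```python
from typing import Dict, Mapping, Optional

# Assumed header names: the gateway one comes from the policy configuration,
# the other two are the headers the AOAI instance returned in my captures.
GATEWAY_REMAINING_TOKENS = "x-apim-remaining-tokens"
BACKEND_REMAINING_TOKENS = "x-ratelimit-remaining-tokens"
BACKEND_REMAINING_REQUESTS = "x-ratelimit-remaining-requests"


def remaining_budgets(headers: Mapping[str, str]) -> Dict[str, Optional[int]]:
    """Report what is left at the gateway and at the backend, if the headers are present."""
    lowered = {name.lower(): value.strip() for name, value in headers.items()}

    def as_int(name: str) -> Optional[int]:
        value = lowered.get(name)
        return int(value) if value is not None and value.isdigit() else None

    return {
        "gateway_tokens": as_int(GATEWAY_REMAINING_TOKENS),
        "backend_tokens": as_int(BACKEND_REMAINING_TOKENS),
        "backend_requests": as_int(BACKEND_REMAINING_REQUESTS),
    }


sample = {
    "X-APIM-Remaining-Tokens": "978",
    "x-ratelimit-remaining-tokens": "49900",
}
print(remaining_budgets(sample))
```

Logging this alongside each response makes it much easier to tell whether an application is bumping into the APIM policy or the AOAI instance itself.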

Azure OpenAI Service – Load Balancing

Posted on May 23, 2024 by mattfeltonma

Welcome back folks!

Today I’m back again talking load balancing in AOAI (Azure OpenAI Service). This is an area which has seen a ton of innovation over the past year. What began as a very basic APIM (API Management) policy snippet providing randomized load balancing was matured to add more intelligence by a great crew out of Microsoft via the “Smart” Load Balancing Policy. Innovative Microsoft folk threw together a solution called PowerProxy which provides load balancing and other functionality without the need for APIM. Simon Kurtz even put together a new Python library to provide load balancing at the SDK-level without the need for additional infrastructure. Lots of great ideas put into action.

The Product Group for APIM over at Microsoft was obviously paying attention to the focus in this area and has introduced native functionality which makes addressing this need a piece of cake. With the introduction of the load balancer and circuit breaker feature in APIM, you can now perform complex load balancing without needing a complex APIM policy. This dropped with a bunch of other Generative AI Gateway (told you this would become an industry term!) features for APIM that were announced this week. These other features include throttling based on tokens consumed (highly sought after feature!), emitting token counts to App Insights, caching completions for optimization of token usage, and a simpler way to onboard AOAI into APIM. Very cool stuff, which I’ll be covering over the next few weeks. For this post I’m going to focus on the new load balancing and circuit breaker feature.

Before I dive into the new feature I want to do a quick review of why scaling across AOAI instances is so important. For each model you have a limited number of requests and tokens you can pass to the service within a given subscription within a region. These limits vary on a per model basis. If you’re consuming a lot of prompts or making a lot of requests it’s fairly easy to hit these limits. I’ve seen a customer hit the limits within a region with one document processing application. I had another customer who deployed a single Chat Bot in a simple RAG (retrieval augmented generation) pattern that was being used by a large swath of their help desk staff, and limits were quickly a problem. The point I’m making here is you will hit these limits and you will need to figure out how to solve it. Solving it is going to require additional instances in different Azure regions, likely spread across multiple subscriptions. This means you’ll need to figure out a way to spread applications across these instances to mitigate the amount of throttling your applications have to deal with.

As I covered earlier, there are a lot of ways you can load balance this service. You could do it at the local application using Simon’s Python library if you need to get something up and running quickly for an application or two. If you have an existing deployed API Gateway like an Apigee or Mulesoft, you could do it there if you can get the logic right to support it. If you want to custom build something from scratch or customize a community offering like PowerProxy, you could do that as well if you’re comfortable owning support for the solution. Finally, you have the native option of using Azure APIM. I’m a fan of the APIM option over the Python library because it’s scalable to support hundreds of applications with a GenAI (generative AI) need. I also like it more than custom building something because the reality is most customers don’t have the people with the necessary skill sets to build something and are even less likely to have the bodies to support yet another custom tool. Another benefit of using APIM is that the backend infrastructure powering the solution (load balancers, virtual machines, and the like) is Microsoft’s responsibility to run and maintain. Beyond load balancing, it’s clear that Microsoft is investing in other “Generative AI Gateway” types of functionality that make it a strategic choice to move forward with. These other features are very important from a security and operations perspective as I’ve covered in past posts. No, there was not someone from Microsoft holding me hostage forcing me to recommend APIM. It is a good solution for this use case for most customers today.

Ok, back to the new load balancing and circuit breaker feature. This new feature allows you to use new native APIM functionality to create a load balancing and circuit breaker policy around your APIM backends. Historically to do this you’d need a complex policy like the “smart” load balancing policy seen below to accomplish this feature set.

<policies>
    <inbound>
        <base />
        <!-- Getting the main variable where we keep the list of backends -->
        <cache-lookup-value key="listBackends" variable-name="listBackends" />
        <!-- If we can't find the variable, initialize it -->
        <choose>
            <when condition="@(context.Variables.ContainsKey("listBackends") == false)">
                <set-variable name="listBackends" value="@{
                    // -------------------------------------------------
                    // ------- Explanation of backend properties -------
                    // -------------------------------------------------
                    // "url":          Your backend url
                    // "priority":     Lower value means higher priority over other backends. 
                    //                 If you have one or more Priority 1 backends, they will always be used instead
                    //                 of Priority 2 or higher. Higher values backends will only be used if your lower values (top priority) are all throttling.
                    // "isThrottling": Indicates if this endpoint is returning 429 (Too many requests) currently
                    // "retryAfter":   We use it to know when to mark this endpoint as healthy again after we received a 429 response

                    JArray backends = new JArray();
                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-eastus.openai.azure.com/" },
                        { "priority", 1},
                        { "isThrottling", false }, 
                        { "retryAfter", DateTime.MinValue } 
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-eastus-2.openai.azure.com/" },
                        { "priority", 1},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-northcentralus.openai.azure.com/" },
                        { "priority", 1},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-canadaeast.openai.azure.com/" },
                        { "priority", 2},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-francecentral.openai.azure.com/" },
                        { "priority", 3},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-uksouth.openai.azure.com/" },
                        { "priority", 3},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-westeurope.openai.azure.com/" },
                        { "priority", 3},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-australia.openai.azure.com/" },
                        { "priority", 4},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    return backends;   
                }" />
                <!-- And store the variable into cache again -->
                <cache-store-value key="listBackends" value="@((JArray)context.Variables["listBackends"])" duration="60" />
            </when>
        </choose>
        <authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="msi-access-token" ignore-error="false" />
        <set-header name="Authorization" exists-action="override">
            <value>@("Bearer " + (string)context.Variables["msi-access-token"])</value>
        </set-header>
        <set-variable name="backendIndex" value="-1" />
        <set-variable name="remainingBackends" value="1" />
    </inbound>
    <backend>
        <retry condition="@(context.Response != null && (context.Response.StatusCode == 429 || context.Response.StatusCode >= 500) && ((Int32)context.Variables["remainingBackends"]) > 0)" count="50" interval="0">
            <!-- Before picking the backend, let's verify if there is any that should be set to not throttling anymore -->
            <set-variable name="listBackends" value="@{
                JArray backends = (JArray)context.Variables["listBackends"];

                for (int i = 0; i < backends.Count; i++)
                {
                    JObject backend = (JObject)backends[i];

                    if (backend.Value<bool>("isThrottling") && DateTime.Now >= backend.Value<DateTime>("retryAfter"))
                    {
                        backend["isThrottling"] = false;
                        backend["retryAfter"] = DateTime.MinValue;
                    }
                }

                return backends; 
            }" />
            <cache-store-value key="listBackends" value="@((JArray)context.Variables["listBackends"])" duration="60" />
            <!-- This is the main logic to pick the backend to be used -->
            <set-variable name="backendIndex" value="@{
                JArray backends = (JArray)context.Variables["listBackends"];

                int selectedPriority = Int32.MaxValue;
                List<int> availableBackends = new List<int>();

                for (int i = 0; i < backends.Count; i++)
                {
                    JObject backend = (JObject)backends[i];

                    if (!backend.Value<bool>("isThrottling"))
                    {
                        int backendPriority = backend.Value<int>("priority");

                        if (backendPriority < selectedPriority)
                        {
                            selectedPriority = backendPriority;
                            availableBackends.Clear();
                            availableBackends.Add(i);
                        } 
                        else if (backendPriority == selectedPriority)
                        {
                            availableBackends.Add(i);
                        }
                    }
                }

                if (availableBackends.Count == 1)
                {
                    return availableBackends[0];
                }
            
                if (availableBackends.Count > 0)
                {
                    //Returns a random backend from the list if we have more than one available with the same priority
                    return availableBackends[new Random().Next(0, availableBackends.Count)];
                }
                else
                {
                    //If there are no available backends, the request will be sent to the first one
                    return 0;    
                }
                }" />
            <set-variable name="backendUrl" value="@(((JObject)((JArray)context.Variables["listBackends"])[(Int32)context.Variables["backendIndex"]]).Value<string>("url") + "/openai")" />
            <set-backend-service base-url="@((string)context.Variables["backendUrl"])" />
            <forward-request buffer-request-body="true" />
            <choose>
                <!-- In case we got 429 or 5xx from a backend, update the list with its status -->
                <when condition="@(context.Response != null && (context.Response.StatusCode == 429 || context.Response.StatusCode >= 500) )">
                    <cache-lookup-value key="listBackends" variable-name="listBackends" />
                    <set-variable name="listBackends" value="@{
                        JArray backends = (JArray)context.Variables["listBackends"];
                        int currentBackendIndex = context.Variables.GetValueOrDefault<int>("backendIndex");
                        int retryAfter = Convert.ToInt32(context.Response.Headers.GetValueOrDefault("Retry-After", "-1"));

                        if (retryAfter == -1)
                        {
                            retryAfter = Convert.ToInt32(context.Response.Headers.GetValueOrDefault("x-ratelimit-reset-requests", "-1"));
                        }

                        if (retryAfter == -1)
                        {
                            retryAfter = Convert.ToInt32(context.Response.Headers.GetValueOrDefault("x-ratelimit-reset-tokens", "10"));
                        }

                        JObject backend = (JObject)backends[currentBackendIndex];
                        backend["isThrottling"] = true;
                        backend["retryAfter"] = DateTime.Now.AddSeconds(retryAfter);

                        return backends;      
                    }" />
                    <cache-store-value key="listBackends" value="@((JArray)context.Variables["listBackends"])" duration="60" />
                    <set-variable name="remainingBackends" value="@{
                        JArray backends = (JArray)context.Variables["listBackends"];

                        int remainingBackends = 0;

                        for (int i = 0; i < backends.Count; i++)
                        {
                            JObject backend = (JObject)backends[i];

                            if (!backend.Value<bool>("isThrottling"))
                            {
                                remainingBackends++;
                            }
                        }

                        return remainingBackends;
                    }" />
                </when>
            </choose>
        </retry>
    </backend>
    <outbound>
        <base />
        <!-- This will return the used backend URL in the HTTP header response. Remove it if you don't want to expose this data -->
        <set-header name="x-openai-backendurl" exists-action="override">
            <value>@(context.Variables.GetValueOrDefault<string>("backendUrl", "none"))</value>
        </set-header>
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>

Complex policies like the above are difficult to maintain and easy to break (I know, I break my policies all of the time). Compare that with a policy that does something very similar with the new load balancing and circuit breaker feature.

<policies>
    <!-- Throttle, authorize, validate, cache, or transform the requests -->
    <inbound>
        <set-backend-service backend-id="backend_pool_aoai" />
        <base />
    </inbound>
    <!-- Control if and how the requests are forwarded to services  -->
    <backend>
        <base />
    </backend>
    <!-- Customize the responses -->
    <outbound>
        <base />
    </outbound>
    <!-- Handle exceptions and customize error responses  -->
    <on-error>
        <base />
    </on-error>
</policies>

A bit simpler eh? With the new feature you establish a new APIM backend of a “pool” type. In this backend you configure your load balancing and circuit breaker logic. In the Terraform template below, I’ve created a load balanced pool that includes three existing APIM backends which are each an individual AOAI instance. I’ve divided the three backends into two priority groups so that APIM will concentrate the requests on the first priority group until a circuit breaker rule is triggered. I configured a circuit breaker rule that will hold off sending additional requests to a backend for 1 minute (tripDuration) if that backend returns a single (count) 429 over the course of 1 minute (interval). You’ll likely want to play with the tripDuration and interval to figure out what works for you.

Priority group 2 will only be used if all the backends in priority group 1 have circuit breaker rules tripped. The use case here might be that your priority group 1 instance is an AOAI instance set up for PTU (provisioned throughput units) and you want overflow to spill down into instances deployed at the standard tier (basically consumption based).

resource "azapi_resource" "symbolicname" {
  type = "Microsoft.ApiManagement/service/backends@2023-05-01-preview"
  name = "string"
  parent_id = "string"
  body = jsonencode({
    properties = {
      circuitBreaker = {
        rules = [
          {
            failureCondition = {
              count = 1
              errorReasons = [
                "Backend service is throttling"
              ]
              interval = "PT1M"
              statusCodeRanges = [
                {
                  max = 429
                  min = 429
                }
              ]
            }
            name = "breakThrottling"
            tripDuration = "PT1M",
            acceptRetryAfter = true
          }
        ]
      }
      description = "This is the load balanced backend"
      pool = {
        services = [
          {
            id = "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rg-demo-aoai/providers/Microsoft.ApiManagement/service/apim-demo-aoai-jog/backends/openai-3",
            priority = 1
          },
          {
            id = "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rg-demo-aoai/providers/Microsoft.ApiManagement/service/apim-demo-aoai-jog/backends/openai-1",
            priority = 2
          },
          {
            id = "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rg-demo-aoai/providers/Microsoft.ApiManagement/service/apim-demo-aoai-jog/backends/openai-2",
            priority = 2
          }
        ]
      }
    }
  })
}

Very cool, right? This makes for a way simpler APIM policy, which makes troubleshooting APIM policy that much easier. You could also establish different pools for different categories of applications. Maybe you have a pool with a PTU and standard tier instances for mission-critical production apps and another pool of only standard instances for non-production applications. You could then direct specific applications (based on their Entra ID service principal id) to different pools. This feature gives you a ton of flexibility in how you handle load balancing without a ton of APIM policy overhead.
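If it helps to see the priority group and circuit breaker behavior spelled out, here is a rough Python simulation of the logic described above. It is only a sketch of the concept, not how APIM implements it. The Backend and Pool classes, the round-robin choice within a group, and the way a Retry-After value overrides the trip duration are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    priority: int
    tripped_until: float = 0.0  # time (in seconds) until which the circuit stays open

    def available(self, now: float) -> bool:
        return now >= self.tripped_until


@dataclass
class Pool:
    backends: list
    trip_duration: float = 60.0  # mirrors tripDuration = "PT1M"
    _counter: int = field(default=0, repr=False)

    def pick(self, now: float):
        """Return a backend from the lowest-numbered priority group with a closed circuit."""
        candidates = [b for b in self.backends if b.available(now)]
        if not candidates:
            return None  # every circuit is open
        best = min(b.priority for b in candidates)
        group = [b for b in candidates if b.priority == best]
        self._counter += 1
        return group[self._counter % len(group)]  # simple round-robin within the group

    def report(self, backend, status_code: int, now: float, retry_after=None):
        """Trip the circuit on a single 429 (count = 1 in the template)."""
        if status_code == 429:
            # Assumption: a Retry-After value, when present, replaces the trip duration.
            hold = retry_after if retry_after is not None else self.trip_duration
            backend.tripped_until = now + hold


pool = Pool([Backend("openai-3", 1), Backend("openai-1", 2), Backend("openai-2", 2)])
first = pool.pick(now=0.0)        # the priority 1 backend
pool.report(first, 429, now=0.0)  # it throttles, so its circuit opens for 60 seconds
overflow = pool.pick(now=1.0)     # traffic spills into priority group 2
recovered = pool.pick(now=61.0)   # the priority 1 backend is back in rotation
```

The point of the simulation is the ordering: priority group 2 only ever sees traffic while every backend in priority group 1 is tripped.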

With the introduction of this feature, APIM becomes that much more appealing a solution for this use case. No longer do you need a complex policy and in-depth APIM policy troubleshooting skills to make this work. Tack on the additional GenAI features Microsoft introduced that I mentioned earlier, as well as the existing features and capabilities available in APIM policy, and you have a damn fine tool for your Generative AI Gateway use case.

Well folks, that wraps up this post. I hope this overview gave you some insight into why load balancing is important with AOAI, what the historical challenges have been doing it within APIM, and how those challenges have been largely removed. The added bonus of the new GenAI-based features makes this a tool worth checking out.

The Challenge of Logging Azure OpenAI Stream Completions

Posted on November 10, 2023 by mattfeltonma

This is part of my series on GenAI Services in Azure:

Hello again fellow geeks. Today I’m back with another Azure OpenAI Service (AOAI) post. I’ve talked in the past about the gaps in the native logging for the AOAI service and how the logs lack traceability and details on token usage to be used for chargebacks. I was lucky enough to work with Jake Wang and others on a reference architecture that could address these gaps using Azure API Management (APIM). I also wrote some custom APIM policies to provide examples of how this information could be captured within APIM. I’ve observed customers coming up with creative solutions such as capturing the data within the application sitting in front of AOAI as a tactical means to get this data while more strategically using third-party API Gateway products such as Apigee, or even building custom highly functional and complex gateways. However, there was a use case that some of these solutions (such as the custom policies I wrote) didn’t account for, and that was streaming completions.

Like OpenAI’s API, the AOAI service API offers support for streaming chat completions. Streaming completions return the model’s completion as a series of events as the tokens are processed, versus a non-streaming completion, which returns the entire completion once the model is finished processing. The benefit of a streaming completion is a better user experience. There have been studies that show that any delay longer than 10 seconds won’t hold user attention. By streaming the completion as it’s generated, the user gets feedback that the website is responding.

The OpenAI documentation points out a few challenges when using streaming completions. One of those challenges is that the response from the API no longer includes token usage, which means you need to calculate token usage by some other means, such as OpenAI’s open source tokenizer tiktoken. It also makes it difficult to moderate content because only partial completions are received in each event. Outside of those challenges, there is also a challenge when using APIM. As my peer Shaun Callighan points out, Microsoft does not recommend logging the request/response body when dealing with a stream of server-sent events, such as the one the API returns with streaming chat completions, because it can cause unexpected buffering (which it does with streaming chat completions). This means the application user will not get the behavior the application owner intended them to get. In my testing, nothing was returned until the model finished the completion.

If you’re using the Python SDK, you can make a chat completion stream by passing stream=True to the ChatCompletion.create call as seen below.

        response = openai.ChatCompletion.create(
            engine=DEPLOYMENT_NAME,
            messages=[
                {
                   "role": "user",
                   "content": "Write me a bedtime story"
                }
            ],
            max_tokens=300,
            stream=True
        )

The body of the response includes a series of server-sent events such as the below.

...
data: {"id":"chatcmpl-8JNDagQPDWjNWOgbUm9u5lRxcmzIw","object":"chat.completion.chunk","created":1699628174,"model":"gpt-35-turbo","choices":[{"index":0,"finish_reason":null,"delta":{"content":"Once"}}],"usage":null}
data: {"id":"chatcmpl-8JNDagQPDWjNWOgbUm9u5lRxcmzIw","object":"chat.completion.chunk","created":1699628174,"model":"gpt-35-turbo","choices":[{"index":0,"finish_reason":null,"delta":{"content":" upon"}}],"usage":null}
data: {"id":"chatcmpl-8JNDagQPDWjNWOgbUm9u5lRxcmzIw","object":"chat.completion.chunk","created":1699628174,"model":"gpt-35-turbo","choices":[{"index":0,"finish_reason":null,"delta":{"content":" a"}}],"usage":null}
...
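Since the usage field is null in every event, anything that wants token counts has to reassemble the completion itself. Here is a minimal sketch of that step, assuming the body is available as an iterable of text lines in the format shown above and that the stream ends with a data: [DONE] line. The function names are made up for this example, and count_tokens is a stand-in for whatever tokenizer you use (for instance a small wrapper around tiktoken).

```python
import json


def collect_stream_content(lines):
    """Join the delta content of every chat.completion.chunk event in the body."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # sentinel marking the end of the stream
        event = json.loads(payload)
        for choice in event.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


def estimate_completion_tokens(lines, count_tokens):
    """count_tokens is a hypothetical callable that maps text to a token count."""
    return count_tokens(collect_stream_content(lines))
```

Run against the three events above, collect_stream_content returns the text "Once upon a". The prompt tokens still have to be counted separately from the request body.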

So how do you deal with this if you are or were planning to use APIM for logging, load balancing, authorization, and throttling? You have a few options.

1. You can move logging into the application and use APIM only for load balancing, authorization, and throttling.
2. You can insert a proxy logging solution behind APIM to handle logging of both streaming and non-streaming completions and use APIM only for load balancing, authorization, and throttling.
3. You can block streaming completions at APIM.

Option 1

Option 1 is workable at a small scale and is a good tactical solution if you need to get something out to production quickly. The challenge with this option is enforcing it at scale. If you have amazing governance within your organization and excellent SDLC maybe you can enforce this. In my experience, few organizations have the level of maturity needed for this. The other problem with this is ideally logging for the purposes of compliance should be implemented and enforced by another entity to ensure separation of duties.

Benefits

Quick and easy to put in place.

Considerations

Difficult to enforce at scale.
Puts the developers in charge of enforcing logging on themselves. Could be an issue with separation of duties.

Option 2

Option 2 is an interesting solution that my peer Shaun Callighan came up with. In Shaun’s architecture a proxy-type solution is placed between APIM and AOAI and that solution handles parsing the requests and responses, calculating token usage, and logging the information to an Event Hub. Shaun has even been kind enough to provide a sample solution demonstrating how this could be done with an Azure Function.

Benefits

Allows you to continue using APIM for the benefits around load balancing, authorization, and throttling.
Supports streaming chat completions.
Provides the logging necessary for compliance and chargebacks for both streaming and non-streaming chat completions.
Centralized enforcement of logging.

Considerations

You will need to develop your own code to parse the requests/responses, calculate chargebacks, and deliver the logs to Event Hub. (You could use Shaun’s code as a starting point)
You’ll need to ensure this proxy does not become a bottleneck. It will need to scale as requests to the AOAI instance scale along with APIM and whatever else you have in the path of the user’s request.
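To give a feel for what such a proxy has to do, here is a small Python sketch of the pass-through idea: forward each event to the caller as soon as it arrives and only log once the stream is finished. This is not Shaun’s code. The relay_and_log name, the log_sink callable (a stand-in for an Event Hub client), and the count_tokens callable are all hypothetical.

```python
import json


def relay_and_log(upstream_lines, log_sink, count_tokens):
    """Yield each server-sent event line untouched, then log the reassembled completion."""
    parts = []
    try:
        for line in upstream_lines:
            yield line  # forward first so the client never waits on logging
            stripped = line.strip()
            if not stripped.startswith("data:"):
                continue
            payload = stripped[len("data:"):].strip()
            if payload == "[DONE]":
                continue
            event = json.loads(payload)
            for choice in event.get("choices", []):
                content = (choice.get("delta") or {}).get("content")
                if content:
                    parts.append(content)
    finally:
        # Runs when the stream ends or the client stops reading early.
        text = "".join(parts)
        log_sink({"completion_text": text, "completion_tokens": count_tokens(text)})
```

Because the function is a generator, nothing is buffered: the caller sees every line as it is produced, which is exactly the behavior that logging the body in APIM breaks.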

Option 3

Option 3 is another valid option (and honestly a simple fix IMO) and may be where some customers end up in the near term. With this option you block the use of streaming completions at APIM with a custom policy snippet like the one below. If the developers are worried about the user experience, there is always the option to flash a “processing”-like message in the text window while the model processes the completion.

Benefits

Allows you to continue using APIM for logging, load balancing, throttling, and authorization.
No new code introduced.
Centralized enforcement of logging.
No additional bottlenecks.

Considerations

Your developers may hate you for this.
There may be a legitimate use case where streaming chat completions are required.

Since Shaun has a proof-of-concept example for option 2, I figured I’d showcase a sample APIM policy snippet for option 3. In the APIM policy snippet below, I determine if the stream property is included in the request body and store the value in a variable (it will be true or false). I then check the variable to see if the value is true, and if so I return a 404 status code with the message that streaming chat completions are not allowed.

        <!-- Capture the value of the streaming property if it is included -->
        <choose>
            <when condition="@(context.Request.Body.As<JObject>(true)["stream"] != null && context.Request.Body.As<JObject>(true)["stream"].Type != JTokenType.Null)">
                <set-variable name="isStream" value="@{
                    var content = (context.Request.Body?.As<JObject>(true));
                    string streamValue = content["stream"].ToString();
                    return streamValue;
                }" />
            </when>
        </choose>
        <!-- Blocks streaming completions and returns 404 -->
        <choose>
            <when condition="@(context.Variables.GetValueOrDefault<string>("isStream","false").Equals("true", StringComparison.OrdinalIgnoreCase))">
                <return-response>
                    <set-status code="404" reason="BlockStreaming" />
                    <set-header name="Microsoft-Azure-Api-Management-Correlation-Id" exists-action="override">
                        <value>@{return Guid.NewGuid().ToString();}</value>
                    </set-header>
                    <set-body>Streaming chat completions are not allowed by this organization.</set-body>
                </return-response>
            </when>
        </choose>
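If you want to sanity-check which request bodies this policy would reject before deploying it, the decision logic is easy to mirror outside of APIM. Here is a rough Python equivalent, assuming the body is a JSON object. The should_block name is made up for this example and this is only an approximation of the policy expressions, not something APIM runs.

```python
import json


def should_block(request_body: str) -> bool:
    """Approximate the policy: block when the JSON body carries stream set to true."""
    body = json.loads(request_body)
    value = body.get("stream")
    if value is None:
        return False  # property missing or null, so isStream falls back to "false"
    # The policy compares the string form of the value to "true", ignoring case.
    return str(value).lower() == "true"


print(should_block('{"messages": [], "stream": true}'))
print(should_block('{"messages": []}'))
```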

If you ignore streaming chat completions and try to use a policy such as this one, the model will finish the completion but APIM will throw a 500 status code back at the developer because the structure of a streaming response doesn’t look like the structure of a non-streaming response and it can’t be parsed using that policy’s logic. This means you’ll be throwing money out of the window and potentially struggling to troubleshoot the root cause. TLDR, pick an option above to deal with streaming and get it in place if you’re using APIM for logging today or plan to.

Last but not least, I want to link to a wonderful policy snippet by Shaun Callighan. This policy snippet dumps the trace logs from APIM into the headers returned in the response from APIM. This is incredibly helpful when troubleshooting a 500 status code returned by APIM.

Well folks, that wraps up this short blog post on this Friday afternoon. Have a great weekend and happy holidays!

Load Balancing in Azure OpenAI Service

Posted on May 31, 2023 by mattfeltonma

This is part of my series on GenAI Services in Azure:

Updates:

5/23/2024 – Check out my new post on this topic here!

Another Azure OpenAI Service post? Why not? I gotta justify the costs of maintaining the blog somehow!

The demand for the AOAI (Azure OpenAI Service) is absolutely insane. I don’t think I can compare the customer excitement over the service to any other service I’ve seen launch during my time working at cloud providers. With that demand comes the challenge to the cloud service provider of ensuring there is availability of the service for all the customers that want it. In order to do that, Microsoft has placed limits on the number of tokens/minute and requests/minute that can be made to a specific AOAI instance. Many customers are hitting these limits when moving into production. While there is a path for the customer to get the limits raised by putting in a request to their Microsoft account team, this process can take time and there is no guarantee the request can or will be fulfilled.

What can customers do to work around this problem? You need to spin up more AOAI instances. At the time of this writing you can create 3 instances per region per subscription. Creating more instances introduces the new problem of distributing traffic across those AOAI instances. There are a few ways you could do this, including having the developer code the logic into their application (yuck) or providing the developer a singular endpoint which does the load balancing behind the scenes. The latter solution is where you want to live. Thankfully, this can be done really easily with a piece of Azure infrastructure you are likely already using with AOAI. That piece of infrastructure is APIM (Azure API Management).

As I’ve covered in my posts on AOAI and APIM and my granular chargebacks in AOAI, APIM provides a ton of value to the AOAI pattern by providing a gate between the application and the AOAI instance to inspect and act on the request and response. It can be used to enforce Azure AD authentication, provide enhanced security logging, and capture information needed for internal chargebacks. Each of these enhancements is provided through APIM’s custom policy language.

By placing APIM into the mix and using a simple APIM policy we can introduce simple randomized load balancing. Let’s take a deeper look at this policy.

<!-- This policy randomly routes (load balances) to one of the two backends -->
<!-- Backend URLs are assumed to be stored in backend-url-1 and backend-url-2 named values (fka properties), but can be provided inline as well -->
<policies>
    <inbound>
        <base />
        <set-variable name="urlId" value="@(new Random(context.RequestId.GetHashCode()).Next(1, 3))" />
        <choose>
            <when condition="@(context.Variables.GetValueOrDefault<int>("urlId") == 1)">
                <set-backend-service base-url="{{backend-url-1}}" />
            </when>
            <when condition="@(context.Variables.GetValueOrDefault<int>("urlId") == 2)">
                <set-backend-service base-url="{{backend-url-2}}" />
            </when>
            <otherwise>
                <!-- Should never happen, but you never know ;) -->
                <return-response>
                    <set-status code="500" reason="InternalServerError" />
                    <set-header name="Microsoft-Azure-Api-Management-Correlation-Id" exists-action="override">
                        <value>@{return Guid.NewGuid().ToString();}</value>
                    </set-header>
                    <set-body>A gateway-related error occurred while processing the request.</set-body>
                </return-response>
            </otherwise>
        </choose>
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>

In the policy above a random number is generated that is greater than or equal to 1 and less than 3 using the Next method. The application’s request is sent along to one of the two backends based upon that number. You could add additional backends by upping the max value in the Next method and adding another when condition. Pretty awesome right?
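The one thing that trips people up here is that the upper bound of Next is exclusive. Here is the same selection sketched in Python for an arbitrary number of backends, where randrange has the same inclusive lower and exclusive upper bound semantics. The backend URLs are placeholders and pick_backend is a name made up for this example.

```python
import random

# Placeholder URLs standing in for the backend-url-1 and backend-url-2 named values.
BACKENDS = [
    "https://aoai-1.example.invalid",
    "https://aoai-2.example.invalid",
]


def pick_backend(backends, rng=random):
    """Pick one backend uniformly at random."""
    # randrange(1, n + 1) behaves like Next(1, n + 1): 1 is included, n + 1 is not.
    url_id = rng.randrange(1, len(backends) + 1)
    return backends[url_id - 1]


print(pick_backend(BACKENDS))
```

Adding a third backend to the list is the equivalent of bumping the policy to Next(1, 4) and adding another when condition.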

Before you ask, no, you do not need a health probe to monitor a cloud service provider managed service. Please don’t make your life difficult by introducing an Application Gateway behind the APIM instance in front of the AOAI instance just because Application Gateway allows for health probes and more complex load balancing. All you’re doing is paying Microsoft more money, making your operations team’s life miserable, and adding more latency. Ensuring the service is available and healthy is on the cloud service provider, not you.

But Matt, what about taking an AOAI instance out of the pool if it begins throttling traffic? Again, no, you do not need to do this. Eventually this APIM as a simple load balancer pattern will not be necessary when the AOAI service is more mature. When that happens, your applications consuming the service will need to be built to handle throttling. Developers are familiar with handling throttling in their application code. Make that their responsibility.
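For reference, handling throttling in application code usually means retrying on a 429 with a delay. Here is a minimal Python sketch, assuming a hypothetical send callable that returns a (status_code, headers, body) tuple and a Retry-After header expressed in seconds under exactly that key. When the header is missing it falls back to exponential backoff with a little jitter.

```python
import random
import time


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops returning 429 or the attempts run out."""
    status, body = None, None
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts, hand the 429 back to the caller
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)  # assumes seconds, not an HTTP date
        else:
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        sleep(delay)
    return status, body
```

The sleep parameter is only there so the delay can be swapped out in tests. In a real application you would wrap your SDK or HTTP call in send, or lean on the retry support that your SDK already provides.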

Well folks, that’s this short and sweet post. Let’s summarize what we learned:

This pattern requires Azure AD authentication to AOAI. API keys will not work because each AOAI instance has different API keys.
You may hit the requests/minute and tokens/minute limits of an AOAI instance.
You can request higher limits but the request takes time to be approved.
You can create multiple instances of AOAI to get around the limits within a specific instance.
APIM can provide simple randomized load balancing across multiple instances of AOAI.
You DO NOT need anything more complicated than simple randomized load balancing. This is a temporary solution that you will eventually phase out. Don’t make it more complicated than it needs to be.
DO NOT introduce an Application Gateway behind APIM unless you like paying Microsoft more money.

Have a great week!

Journey Of The Geek

The chronicles of a Bostonian tech geek navigating through life and technology

Tag Archives: apim
