Azure OpenAI Service – Load Balancing

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Updates:

  • 10/29/2024 – Microsoft has announced a deployment option referred to as a data zone (https://azure.microsoft.com/en-us/blog/accelerate-scale-with-azure-openai-service-provisioned-offering/). Data zones can be thought of as data sovereignty boundaries incorporated into the existing global deployment option. This will significantly ease load balancing so you will no longer need to deploy individual regional instances and can instead deploy a single instance with a data zone deployment within a single subscription. As you hit the cap for TPM/RPM within that subscription, you can then repeat the process with a new subscription and load balance across the two. This will result in fewer backends and a more simple load balancing setup.

Welcome back folks!

Today I’m back again talking load balancing in AOAI (Azure OpenAI Service). This is an area which has seen a ton of innovation over the past year. From what began as a very basic APIM (API Management) policy snippet providing randomized load balancing was matured to add more intelligence by a great crew out of Microsoft via the “Smart” Load Balancing Policy. Innovative Microsoft folk threw together a solution called PowerProxy which provides load balancing and other functionality without the need for APIM. Simon Kurtz even put together a new Python library to provide load balancing at the SDK-level without the need for additional infrastructure. Lots of great ideas put into action.

The Product Group for APIM over at Microsoft was obviously paying attention to the focus in this area and have introduced native functionality which makes addressing this need cake. With the introduction of the load balancer and circuit breaker feature in APIM, you can now perform complex load balancing without needing a complex APIM policy. This dropped with a bunch of other Generative AI Gateway (told you this would become an industry term!) features for APIM that were announced this week. These other features include throttling based on tokens consumed (highly sought after feature!), emitting token counts to App Insights, caching completions for optimization of token usage, and a simpler way to onboard AOAI into APIM. Very cool stuff of which I’ll be covering over the next few weeks. For this post I’m going to focus on the new load balancing and circuit breaker feature.

Before I dive into the new feature I want to do a quick review of why scaling across AOAI instances is so important. For each model you have a limited amount of requests and tokens you can pass to the service within a given subscription within a region. These limits vary on a per model basis. If you’re consuming a lot of prompts or making a lot of requests it’s fairly easy to hit these limits. I’ve seen a customer hit the limits within a region with one document processing application. I had another customer who deployed a single Chat Bot in a simple RAG (retrieval augmented generation) that was being used by large swath of their help desk staff and limits were quickly a problem. The point I’m making here is you will hit these limits and you will need to add figure out how to solve it. Solving it is going to require additional instances in different Azure regions likely spread across multiple subscriptions. This means you’ll need to figure out a way to spread applications across these instances to mitigate the amount of throttling your applications have to deal with.

Load Balancing Azure OpenAI Service

As I covered earlier, there are a lot of ways you can load balancing this service. You could do it at the local application using Simon’s Python library if you need to get something up and running quickly for an application or two. If you have an existing deployed API Gateway like an Apigee or Mulesoft, you could do it there if you can get the logic right to support it. If you want to custom build something from scratch or customize a community offering like PowerProxy you could do that as well if you’re comfortable owning support for the solution. Finally, you have the native option of using Azure APIM. I’m a fan of the APIM option over the Python library because it’s scalable to support hundreds of applications with a GenAI (generative AI) need. I also like it more than custom building something because the reality is most customers don’t have the people with the necessary skill sets to build something and are even less likely to have the bodies to support yet another custom tool. Another benefit of using APIM include the backend infrastructure powering the solution (load balancers, virtual machines, and the like) are Microsoft’s responsibility to run and maintain. Beyond load balancing, it’s clear that Microsoft is investing in other “Generative AI Gateway” types of functionality that make it a strategic choice to move forward with. These other features are very important from a security and operations perspective as I’ve covered in past posts. No, there was not someone from Microsoft holding me hostage forcing me to recommend APIM. It is a good solution for this use case for most customers today.

Ok, back to the new load balancing and circuit breaker feature. This new feature allows you to use new native APIM functionality to create a load balancing and circuit breaker policy around your APIM backends. Historically to do this you’d need a complex policy like the “smart” load balancing policy seen below to accomplish this feature set.

<policies>
    <inbound>
        <base />
        <!-- Getting the main variable where we keep the list of backends -->
        <cache-lookup-value key="listBackends" variable-name="listBackends" />
        <!-- If we can't find the variable, initialize it -->
        <choose>
            <when condition="@(context.Variables.ContainsKey("listBackends") == false)">
                <set-variable name="listBackends" value="@{
                    // -------------------------------------------------
                    // ------- Explanation of backend properties -------
                    // -------------------------------------------------
                    // "url":          Your backend url
                    // "priority":     Lower value means higher priority over other backends. 
                    //                 If you have more one or more Priority 1 backends, they will always be used instead
                    //                 of Priority 2 or higher. Higher values backends will only be used if your lower values (top priority) are all throttling.
                    // "isThrottling": Indicates if this endpoint is returning 429 (Too many requests) currently
                    // "retryAfter":   We use it to know when to mark this endpoint as healthy again after we received a 429 response

                    JArray backends = new JArray();
                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-eastus.openai.azure.com/" },
                        { "priority", 1},
                        { "isThrottling", false }, 
                        { "retryAfter", DateTime.MinValue } 
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-eastus-2.openai.azure.com/" },
                        { "priority", 1},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-northcentralus.openai.azure.com/" },
                        { "priority", 1},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-canadaeast.openai.azure.com/" },
                        { "priority", 2},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-francecentral.openai.azure.com/" },
                        { "priority", 3},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-uksouth.openai.azure.com/" },
                        { "priority", 3},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-westeurope.openai.azure.com/" },
                        { "priority", 3},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    backends.Add(new JObject()
                    {
                        { "url", "https://andre-openai-australia.openai.azure.com/" },
                        { "priority", 4},
                        { "isThrottling", false },
                        { "retryAfter", DateTime.MinValue }
                    });

                    return backends;   
                }" />
                <!-- And store the variable into cache again -->
                <cache-store-value key="listBackends" value="@((JArray)context.Variables["listBackends"])" duration="60" />
            </when>
        </choose>
        <authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="msi-access-token" ignore-error="false" />
        <set-header name="Authorization" exists-action="override">
            <value>@("Bearer " + (string)context.Variables["msi-access-token"])</value>
        </set-header>
        <set-variable name="backendIndex" value="-1" />
        <set-variable name="remainingBackends" value="1" />
    </inbound>
    <backend>
        <retry condition="@(context.Response != null && (context.Response.StatusCode == 429 || context.Response.StatusCode >= 500) && ((Int32)context.Variables["remainingBackends"]) > 0)" count="50" interval="0">
            <!-- Before picking the backend, let's verify if there is any that should be set to not throttling anymore -->
            <set-variable name="listBackends" value="@{
                JArray backends = (JArray)context.Variables["listBackends"];

                for (int i = 0; i < backends.Count; i++)
                {
                    JObject backend = (JObject)backends[i];

                    if (backend.Value<bool>("isThrottling") && DateTime.Now >= backend.Value<DateTime>("retryAfter"))
                    {
                        backend["isThrottling"] = false;
                        backend["retryAfter"] = DateTime.MinValue;
                    }
                }

                return backends; 
            }" />
            <cache-store-value key="listBackends" value="@((JArray)context.Variables["listBackends"])" duration="60" />
            <!-- This is the main logic to pick the backend to be used -->
            <set-variable name="backendIndex" value="@{
                JArray backends = (JArray)context.Variables["listBackends"];

                int selectedPriority = Int32.MaxValue;
                List<int> availableBackends = new List<int>();

                for (int i = 0; i < backends.Count; i++)
                {
                    JObject backend = (JObject)backends[i];

                    if (!backend.Value<bool>("isThrottling"))
                    {
                        int backendPriority = backend.Value<int>("priority");

                        if (backendPriority < selectedPriority)
                        {
                            selectedPriority = backendPriority;
                            availableBackends.Clear();
                            availableBackends.Add(i);
                        } 
                        else if (backendPriority == selectedPriority)
                        {
                            availableBackends.Add(i);
                        }
                    }
                }

                if (availableBackends.Count == 1)
                {
                    return availableBackends[0];
                }
            
                if (availableBackends.Count > 0)
                {
                    //Returns a random backend from the list if we have more than one available with the same priority
                    return availableBackends[new Random().Next(0, availableBackends.Count)];
                }
                else
                {
                    //If there are no available backends, the request will be sent to the first one
                    return 0;    
                }
                }" />
            <set-variable name="backendUrl" value="@(((JObject)((JArray)context.Variables["listBackends"])[(Int32)context.Variables["backendIndex"]]).Value<string>("url") + "/openai")" />
            <set-backend-service base-url="@((string)context.Variables["backendUrl"])" />
            <forward-request buffer-request-body="true" />
            <choose>
                <!-- In case we got 429 or 5xx from a backend, update the list with its status -->
                <when condition="@(context.Response != null && (context.Response.StatusCode == 429 || context.Response.StatusCode >= 500) )">
                    <cache-lookup-value key="listBackends" variable-name="listBackends" />
                    <set-variable name="listBackends" value="@{
                        JArray backends = (JArray)context.Variables["listBackends"];
                        int currentBackendIndex = context.Variables.GetValueOrDefault<int>("backendIndex");
                        int retryAfter = Convert.ToInt32(context.Response.Headers.GetValueOrDefault("Retry-After", "-1"));

                        if (retryAfter == -1)
                        {
                            retryAfter = Convert.ToInt32(context.Response.Headers.GetValueOrDefault("x-ratelimit-reset-requests", "-1"));
                        }

                        if (retryAfter == -1)
                        {
                            retryAfter = Convert.ToInt32(context.Response.Headers.GetValueOrDefault("x-ratelimit-reset-tokens", "10"));
                        }

                        JObject backend = (JObject)backends[currentBackendIndex];
                        backend["isThrottling"] = true;
                        backend["retryAfter"] = DateTime.Now.AddSeconds(retryAfter);

                        return backends;      
                    }" />
                    <cache-store-value key="listBackends" value="@((JArray)context.Variables["listBackends"])" duration="60" />
                    <set-variable name="remainingBackends" value="@{
                        JArray backends = (JArray)context.Variables["listBackends"];

                        int remainingBackends = 0;

                        for (int i = 0; i < backends.Count; i++)
                        {
                            JObject backend = (JObject)backends[i];

                            if (!backend.Value<bool>("isThrottling"))
                            {
                                remainingBackends++;
                            }
                        }

                        return remainingBackends;
                    }" />
                </when>
            </choose>
        </retry>
    </backend>
    <outbound>
        <base />
        <!-- This will return the used backend URL in the HTTP header response. Remove it if you don't want to expose this data -->
        <set-header name="x-openai-backendurl" exists-action="override">
            <value>@(context.Variables.GetValueOrDefault<string>("backendUrl", "none"))</value>
        </set-header>
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>

Complex policies like the above are difficult to maintain and easy to break (I know, I break my policies all of time). Compare that with a policy that does something very similar with the new load balancing and circuit breaker feature.

<policies>
    <!-- Throttle, authorize, validate, cache, or transform the requests -->
    <inbound>
        <set-backend-service backend-id="backend_pool_aoai" />
        <base />
    </inbound>
    <!-- Control if and how the requests are forwarded to services  -->
    <backend>
        <base />
    </backend>
    <!-- Customize the responses -->
    <outbound>
        <base />
    </outbound>
    <!-- Handle exceptions and customize error responses  -->
    <on-error>
        <base />
    </on-error>
</policies>

A bit simpler eh? With the new feature you establish a new APIM backend of a “pool” type. In this backend you configure your load balancing and circuit breaker logic. In the Terraform template below, I’ve created a load balanced pool that includes three existing APIM backends which are each an individual AOAI instance. I’ve divided the three backends into two priority groups such that the APIM so that APIM will concentrate the requests to the first priority group until a circuit break rule is triggered. I configured a circuit breaker rule that will hold sending additional requests for 1 minute (tripDuration) to a backend if that backend returns a single (count) 429 over the course of 1 minute (interval). You’ll likely want to play with the tripDuration and interval to figure out what works for you.

Priority group 2 will only be used if all the backends in priority group 1 have circuit breaker rules tripped. The use case here might be that your priority group 1 instance is a AOAI instance setup for PTU (provisioned throughput units) and you want overflow to dump down into instances deployed at the standard tier (basically consumption based).

resource "azapi_resource" "symbolicname" {
  type = "Microsoft.ApiManagement/service/backends@2023-05-01-preview"
  name = "string"
  parent_id = "string"
  body = jsonencode({
    properties = {
      circuitBreaker = {
        rules = [
          {
            failureCondition = {
              count = 1
              errorReasons = [
                "Backend service is throttling"
              ]
              interval = "PT1M"
              statusCodeRanges = [
                {
                  max = 429
                  min = 429
                }
              ]
            }
            name = "breakThrottling "
            tripDuration = "PT1M",
            acceptRetryAfter = true
          }
        ]
      }
      description = "This is the load balanced backend"
      pool = {
        services = [
          {
            id = "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rg-demo-aoai/providers/Microsoft.ApiManagement/service/apim-demo-aoai-jog/backends/openai-3",
            priority = 1
          },
          {
            id = "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rg-demo-aoai/providers/Microsoft.ApiManagement/service/apim-demo-aoai-jog/backends/openai-1",
            priority = 2
          },
          {
            id = "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rg-demo-aoai/providers/Microsoft.ApiManagement/service/apim-demo-aoai-jog/backends/openai-2",
            priority = 2
          }
        ]
      }
    }
  })
}

Very cool right? This makes for way simpler APIM policy which means troubleshooting APIM policy that much easier. You could also establish different pools for different categories of applications. Maybe you have a pool with a PTU and standard tier instances for mission-critical production apps and another pool of only standard instances for non-production applications. You could then direct specific applications (based on their Entra ID service principal id) to different pools. This feature gives you a ton of flexibility in how you handle load balancing without a to of APIM policy overhead.

With the introduction of this feature into APIM, it makes APIM that much more of an appealing solution for this use case. No longer do you need a complex policy and in-depth APIM policy troubleshooting skills to make this work. Tack on the additional GenAI features Microsoft introduced that I mentioned earlier, as well as its existing features and capabilities available in APIM policy, you have a damn fine tool for your Generative AI Gateway use case.

Well folks that wraps up this post. I hope this overview gave you some insight into why load balancing is important with AOAI, what the historical challenges have been doing it within APIM, and how those challenges have been largely removed with the added bonus of additional new GenAI-based features make this a tool worth checking out.

Azure OpenAI Service – How To Get Insights By Collecting Logging Data

Azure OpenAI Service – How To Get Insights By Collecting Logging Data

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Hello geeks! Yes, I’m back with yet another post on the Azure OpenAI Service. There always seems to be more cool stuff to talk about with this service that isn’t specific to the models themselves. If you follow this blog, you know I’ve spent the past year examining the operational and security aspects of the service. Through trial and error and a ton of discussions with S500 customers across all industries, I’ve learned a ton and my goal has to be share back those lessons learned with the wider community. Today I bring you more nuggets of useful information.

Like any good technology nerd, I’m really nosey. Over the years I’ve learned about all the interesting information web-based services return the response headers and how useful this information can be to centrally capture and correlate to other pieces of logging information. These headers could include things like latency, throttling information, or even usage information that can be used to correlate the costs of your usage of the service. While I had glanced at the response headers from the Azure OpenAI Service when I was doing my work on the granular chargeback and streaming ChatCompletions posts, I hadn’t gone through the headers meticulously. Recently, I was beefing up Shaun Callighan’s excellent logging helper solution with some additional functionality I looked more deeply at the headers and found some cool stuff that was worth sharing.

How to look at the headers (skip if you don’t want to nerd out a bit)

My first go to whenever examining a web service is to power up Fiddler and drop it in between my session and the web service. While this works great on a Windows or MacOS box when you can lazily drop the Fiddler-generated root CA (certificate authority) into whatever certificate store your browser is using to draw its trusted CAs from, it’s a bit more work when conversing with a web service through something like Python. Most SDKs in my experience use the requests module under the hood. In that case it’s a simple matter of passing a kwarg some variant of the option to disable certificate verification in the requests module (usually something like verify=false) like seen below in the azure.identity SDK.

from azure.identity import DefaultAzureCredential, get_bearer_token_provider

try:
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(
            connection_verify=False
        ),
        "https://cognitiveservices.azure.com/.default",
    )
except:
    logging.error('Failed to obtain access token: ', exc_info=True)

Interestingly, the Python openai SDK does not allow for this. Certificate verification cannot be disabled with an override. Great security control from the SDK developers, but no thought of us lazy folks. The openai SDK uses httpx under the hood, so I took the nuclear option and disabled verification of certificates in the module itself. Obviously a dumb way of doing it, but hey lazy people gotta lazy. If you want to use Fiddler, be smarter than me and use one of the methods outlined in this post to trust the root CA generated by Fiddler.

All this to get the headers? Well, because I like you, I’m going to show you a far easier way to look at these headers using the native openai SDK.

The openai SDK doesn’t give you back the headers by default. Instead the response body is parsed neatly for you and a new object is returned. Thankfully, the developers of the library put in a way to get the raw response object back which includes the headers. Instead of using the method chat.completions.create you can use chat.completions.with_raw_response.create. Glancing at the SDK, it seems like all methods supported by both the native client and AzureOpenAI client support the with_raw_response method.

def get_raw_chat_completion(client, deployment_name, message):
    response = client.chat.completions.with_raw_response.create(
    model=deployment_name,
    messages= [
        {"role":"user",
         "content": message}
    ],
    max_tokens=1000,
    )

    return response

Using this alternative method will save you from having to mess with the trusted certificates as long as you’re good with working with a text-based output like the below.

Headers({'date': 'Fri, 17 May 2024 13:18:21 GMT', 'content-type': 'application/json', 'content-length': '2775', 'connection': 'keep-alive', 'cache-control': 'no
-cache, must-revalidate', 'access-control-allow-origin': '*', 'apim-request-id': '01e06cdc-0418-47c9-9864-c914979e9766', 'strict-transport-security': 'max-age=3
1536000; includeSubDomains; preload', 'x-content-type-options': 'nosniff', 'x-ms-region': 'East US', 'x-ratelimit-remaining-requests': '1', 'x-ratelimit-remaini
ng-tokens': '1000', 'x-ms-rai-invoked': 'true', 'x-request-id': '6939d17e-14b2-44b7-82f4-e751f7bb9f8d', 'x-ms-client-request-id': 'Not-Set', 'azureml-model-sess
ion': 'turbo-0301-57d7036d'})

This can be incredibly useful if you’re dropped some type of gateway, such as an APIM (API Management) instance in front of the OpenAI instance for load balancing, authorization, logging, throttling etc. If you’re using APIM, you can my buddy Shaun’s excellent APIM Policy Snippet to troubleshoot a failing APIM policy. Now that I’ve given you a workaround to using Fiddler, I’m going to use Fiddler to explore these headers for the rest of the post because I’m lazy and I like a pretty GUI sometimes.

Examining the response headers and correlating data to diagnostic logs

Here we can see the response headers returned from a direct call to the Azure OpenAI Service.

The headers which should be of interest to you are the x-ms-region, x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, and x-request-id. The x-ms-region is the region where the Azure OpenAI instance you called is located (I’ll explain why this can be useful in a bit). The x-ratelimit headers tell you how close you are to hitting rate limits on a specific instance of a model in an AOAI instance. This is where load balancing and provisioned throughput units can help mitigate the risk of throttling. The load balancing headers are still important to your application devs to pay attention to and account for even if you’re load balancing across multiple instances because load balancing mitigates but doesn’t eliminate the risk of throttling. The final interesting header is the apim-request-id which is the unique identifier of this specific request to the AOAI service. If you’re wondering, yes it looks like the product group has placed the compute running the models behind an instance of Azure API Management.

Let’s first start with the apim-request-id response header. This header is useful because it can be used to correlate a specific request it’s relevant entry in the native diagnostic logging for the Azure OpenAI Service. While I’ve covered the limited use of the diagnostic logging within the service, there are some good nuggets in there which I’ll cover now.

Using the apim-request-id, I can make a query to wherever I’m storing the diagnostic logs for the AOAI instance to pull the record for the specific request. In my example I’m using a Log Analytics Workspace. Below you can see my Kusto query which pulls the relevant record from the RequestResponse category of logs.

Correlating a request to the Azure OpenAI Service to the diagnostic logs

There are a few useful pieces of information in this log entry.

  • DurationMs – This field tells us how long the response took from the Azure OpenAI Service. My favorite use of this field comes when considering non-PTU-based Azure OpenAI instances. Lots of people want to use the service and the underlining models in a standard pay-as-you-go tier can get busy in certain regions at certain times. If you combine this information with the x-ms-region response header you can begin to build a picture of average response times per region at specific times of the day. If you’re load balancing, you can tweak your logic to direct your organization’s prompts to the region that has the lowest response time. Cool right?
  • properties_s.streamType – This field tells you whether or not the request was a streaming-type completion. This can be helpful to give you an idea of how heavily used streaming is in your org. As I’ve covered previously, capturing streaming prompts and completions and calculating token usage can a challenge. This property can help give you an idea how heavily used it is across your org which may drive you to get a solution in place to do that calculation sooner rather than later.
  • properties_s.modelName, modelVersion – More useful information to enrich the full picture of the service usage while being able to trace that information back to specific prompts and responses.
  • objectId – If your developers are using Entra ID-based identities to authenticate to the AOAI service (which you should be doing and avoiding use of API keys where possible), you’ll have the objectid of the specific service principal that made the request.

Awesome things you can do with this information

You are likely beginning to see the value of collecting the response headers, prompt and completions from the request and respond body, and enriching that information from logging data collected from diagnostics logs. With that information you can begin getting a full picture of how the service is being used across your organization.

Examples include:

  • Calculating token usage for organizational chargebacks
  • Optimizing the way you load balance to take advantage of less-used regions for faster response times
  • Making troubleshooting easier by being able to trace a specific response back to which instance it, the latency, and the prompt and completion returned by the API.

There are a ton of amazing things you can do with this data.

How the hell do you centrally collect and visualize this data?

Your first step should be to centrally capturing this data. You can use the APIM pattern that is quite popular or you can build your own solution (I like to refer to this middle tier component as a “Generative AI Gateway”. $50 says that’s the new buzzwords soon enough). Either way, you want this data captured and delivered somewhere. In my demo environment I deliver the data to an Event Hub, do a bit of transformation and dump it into a CosmosDB with Stream Analytics, and the visualize it with PowerBI. An example of the flow I use in my environment is below.

Example flow of how to capture and monetize operational and security data from your Azure OpenAI Usage

The possibilities for the architecture are plentiful, but the value of this data to operations, security, and finance is worth the effort to assemble something in your environment. I hope this post helped to get your more curious about what your usage looks like and how could use this data to optimize operationally, financially, and even throw in a bit more security with more insight into what your users are doing with this GenAI models by reviewing the captured prompts and responses. While there isn’t a lot of regulation around the use of GenAI yet, it’s coming and by capturing this information you’ll be ready to tackle it.

Thanks for reading!

Azure Authorization – Azure RBAC Delegation

This is part of my series on Azure Authorization.

  1. Azure Authorization – The Basics
  2. Azure Authorization – Azure RBAC Basics
  3. Azure Authorization – actions and notActions
  4. Azure Authorization – Resource Locks and Azure Policy denyActions
  5. Azure Authorization – Azure RBAC Delegation
  6. Azure Authorization – Azure ABAC (Attribute-based Access Control)

Hello again fellow geeks!

I typically avoid doing multiple posts in a month for sanity purposes, but quite possibly one of the most awesome Azure RBAC features has gone generally available under the radar. Azure RBAC Delegation is going to be one of those features that after reading this post, you will immediately go and implement. For those of you coming from AWS, this is going to be your IAM Permissions Boundary-like feature. It addresses one of the major risks to Azure’s authorization model and fills a big gap the platform has had in the authorization space.

Alright, enough hype let’s get to it.

Before I dive into the new feature, I encourage you to read through the prior posts in my Azure Authorization series. These posts will help you better understand how this feature fits into the bigger picture.

As I covered in my Azure RBAC Basics post, only security principals with sufficient permissions in the Microsoft.Authorization resourced provider can create new Azure RBAC Role Assignments. By default, once a security principal is granted that permission it can then assign any Azure RBAC Role to itself or any other security principal within the Entra ID tenant to it’s immediate scope of access and all child scopes due to Azure RBAC’s inheritance model (tenant root -> management group -> subscription -> resource group -> resource). This means a human assigned a role with sufficient permissions (such as an IAM support team) could accidentally / maliciously assign another privileged role to themselves or someone else and wreak havoc. Makes for a not good late night for anyone on-call.

While the human risk exists, the greater risk is with non-human identities. When an organization passes beyond the click-ops and imperative (az cli, PowerShell, etc) stage and moves on to the declarative stage with IaC (infrastructure-as-code), delivery of those IaC templates (ARM, Bicep, Terraform) are put through a CI/CD (continuous integration / continuous delivery) pipeline. To deploy the code to the cloud platform, these pipeline’s compute need an identity to authenticate to the cloud management plane. In Azure, this is accomplished through a service principal or managed identity. That identity must be granted the specific permissions it needs which is done through Azure RBAC Role assignments.

In the ideal world, as much as can be is put through the pipeline, including role assignments. This means the pipeline needs to be able to create Azure RBAC Role assignments which means it needs permissions for the Microsoft.Authorization resource provider (or relevant built-in roles with Owner being common).

To mitigate the risk of one giant blast radius with one pipeline, organizations will often create multiple pipelines with separate pipelines for production and non-production, pipelines for platform components (networking, logging, etc) and others for the workload (possibly one for workload components such as an Event Hub and another separate pipeline for code). Pipelines will be given separate security principals with permissions at different scopes with Central IT typically owning pipelines at higher scopes (management groups) and business units owning pipelines at lower scopes (subscription or resource group).

Example of multiple pipelines and identities

At the end of the day you end up with lots of pipelines and lots of non-humans that hold the Owner role at a given scope. This multiplies the risk of any one of those pipeline identities being misused to grant someone or something permissions beyond what it needs. Organizations typically mitigate through this automated and manual gates which can get incredibly complex at scale.

This is where Azure RBAC Delegation really shines. It allows you to wrap restrictions around how a security principal can exercise its Microsoft.Authorization permissions. These restrictions can include:

  • Restricting to a specific set of Azure RBAC Roles
  • Restricting to a specific security principal type (user, service principal, group)
  • Restricting whether it can create new assignments, update existing assignments, or delete assignments
  • Restricting it to a specific set of security principals (specific set of groups, users, etc)

So how does it do it? Well if you read my prior post, you’ll remember I mentioned the new property included in Azure RBAC Role assignments called conditions. RBAC Delegation uses this new property to wrap those restrictions around the role. Let’s look at an example role using one of the new built-in role Microsoft has introduced called the Role Based Access Control Administrator.

Let’s take a look at the role definition first.

{
    "id": "/subscriptions/b3b7aae7-c6c1-4b3d-bf0f-5cd4ca6b190b/providers/Microsoft.Authorization/roleDefinitions/f58310d9-a9f6-439a-9e8d-f62e7b41a168",
    "properties": {
        "roleName": "Role Based Access Control Administrator",
        "description": "Manage access to Azure resources by assigning roles using Azure RBAC. This role does not allow you to manage access using other ways, such as Azure Policy.",
        "assignableScopes": [
            "/"
        ],
        "permissions": [
            {
                "actions": [
                    "Microsoft.Authorization/roleAssignments/write",
                    "Microsoft.Authorization/roleAssignments/delete",
                    "*/read",
                    "Microsoft.Support/*"
                ],
                "notActions": [],
                "dataActions": [],
                "notDataActions": []
            }
        ]
    }
}

In the above role definition, you can see that the role has been granted only permissions necessary to create, update, and delete role assignments. This is more restrictive than an Owner or User Access Administrator which have a broader set of permissions in the Microsoft.Authorization resource provider. It makes for a good candidate for a business unit pipeline role versus Owner since business units shouldn’t need to be managing Azure Policy or Azure RBAC Role Definitions. That responsibility should within Central IT’s scope.

You do not have to use this built-in role or can certainly design your own. For example, the role above does not include the permissions to manage a resource lock and this might be something you want the business unit to be able to manage. This feature is also supported for custom role. In the example below, I’ve cloned the Role Based Access Control Administrator but added additional permissions to manage resource locks.

{
    "id": "/subscriptions/b3b7aae7-c6c1-4b3d-bf0f-5cd4ca6b190b/providers/Microsoft.Authorization/roleDefinitions/dd681d1a-8358-4080-8a37-9ea46c90295c",
    "properties": {
        "roleName": "Privileged Test Role",
        "description": "This is a test role to demonstrate RBAC delegation",
        "assignableScopes": [
            "/subscriptions/b3b7aae7-c6c1-4b3d-bf0f-5cd4ca6b190b"
        ],
        "permissions": [
            {
                "actions": [
                    "Microsoft.Authorization/roleAssignments/write",
                    "Microsoft.Authorization/roleAssignments/delete",
                    "*/read",
                    "Microsoft.Support/*",
                    "Microsoft.Authorization/locks/read",
                    "Microsoft.Authorization/locks/write",
                    "Microsoft.Authorization/locks/delete"
                ],
                "notActions": [],
                "dataActions": [],
                "notDataActions": []
            }
        ]
    }
}

If I were to attempt to create a new role assignment for this custom role, I’ve given the ability to associate the role with a set of conditions.

Adding conditions to a custom role

Let’s take an example where I want a security principal to be able to create new role assignments but only for the built-in role of Virtual Machine Contributor. The condition in my role would like the below:

    (
        (
            !(ActionMatches{'Microsoft.Authorization/roleAssignments/write'})
        )
        OR 
        (
            @Request[Microsoft.Authorization/roleAssignments:RoleDefinitionId] ForAnyOfAnyValues:GuidEquals {9980e02c-c2be-4d73-94e8-173b1dc7cf3c}
            AND
            @Request[Microsoft.Authorization/roleAssignments:PrincipalType] StringEqualsIgnoreCase 'User'
        )
    )
    AND
    (
        (
            !(ActionMatches{'Microsoft.Authorization/roleAssignments/delete'})
        )
        OR 
        (
            @Request[Microsoft.Authorization/roleAssignments:PrincipalType] StringEqualsIgnoreCase 'User'
        )
    )

Yes, I know the conditional language is ugly as hell. Thankfully, you won’t have to write this yourself which I will demonstrate in a few. First, I want to walk you through the conditional language.

Azure RBAC conditional language

When using RBAC Delegation, you can associate one or more conditions to an Azure RBAC role assignment. Like most conditional logic in programming, you can combine conditions with AND and OR. Within each condition you have an action and expression. When the condition is evaluated the action is first checked for a match. In the example above I have !(ActionMatches{‘Microsoft.Authorization/roleAssignments/write’}) which will match any permission that isn’t write which will cause the condition to result in true and will allow the access without evaluating the expression below. If the action is write, then this evaluated to false and the expression is then evaluated. In the example above I have two expressions. The first expression checks whether the request I’m making to create a role assignment is for the role definition id for the Virtual Machine Contributor. The second expression checks to see if the principal type in the role assignment is of type user. If either of these evaluate to false, then access is denied. If it evaluates to true, it moves on to the next condition which limits the security principal to deleting role assignments assigned to users.

No, you do not need to learn this syntax. Microsoft has been kind enough to provide a GUI-based conditions tool which you can use to build your conditions and view as code you can include in your templates.

GUI-based condition builder

Pretty cool right? The public documentation walks through a number of different scenarios where you can use this, so I’d encourage you to read it to spur ideas beyond the pipeline example I have given in this post. However, the real value out of this feature is stricter control of what how those pipeline identities can affect authorization.

So what are you takeaways for this post?

  • Get this feature implemented YESTERDAY. This is minimal overhead with massive security return.
  • Use the GUI-based condition builder to build your conditions and then copy the JSON into your code.
  • Take some time to learn the conditional syntax. It’s used in other places in Azure RBAC and will likely continue to grow in usage.
  • Start off using the built-in Role Based Access Control Administrator role. If your business units need more than what is in there (such as managing resource locks) clone it and add those permissions.

Well folks, I hope you got some value out of this post. Add a to-do to get this in place in your Azure estate as soon as possible!

Azure Authorization – Resource Locks and Azure Policy denyActions

This is part of my series on Azure Authorization.

  1. Azure Authorization – The Basics
  2. Azure Authorization – Azure RBAC Basics
  3. Azure Authorization – actions and notActions
  4. Azure Authorization – Resource Locks and Azure Policy denyActions
  5. Azure Authorization – Azure RBAC Delegation
  6. Azure Authorization – Azure ABAC (Attribute-based Access Control)

Welcome back! Today I have another post in my series on Azure Authorization. In my last post I covered how permissions listed in notActions and notDataActions in an Azure RBAC Role Definition is not an explicit deny but rather a subtraction from the permissions listed in the definition in the action and nonActions section. In this post I’m going two features which help to address that gap: Resource Locks and the Azure Policy denyActions feature.

Resource Locks

Let’s start with Resource Locks. Resource Locks can be used to protect important resources from actions that could delete the resource or actions that could modify the resource. They are an Azure Resource that are administered through the Microsoft.Authorization resource provider (specifically Microsoft.Authorization/locks) and come in two forms which include delete (CanNotDelete) and modification locks (ReadOnly). Resource locks can be applied at the subscription scope, resource group scope, and resource scope. Resource locks applied at a higher scope are inherited down.

Resource locks are a wonderful defense-in-depth control because they are another authorization control in addition to Azure RBAC. A common use case might be to place a read only resource lock on an Azure Subscription which is used to house ExpressRoutes resources since once setup, ExpressRoute Circuits are relatively static from a configuration perspective. This could help mitigate the risk of an authorized user or CI/CD pipeline identity, who may have the Azure RBAC Owner or Contributor over the subscription, from mucking up the configuration purposefully or by accident and causing broad issues for your Azure landscape.

Example of a resource lock on a resource

Resource locks do have a number of considerations that you should be aware of before you go throwing them everywhere. One of the main considerations is that they can be removed by anyone with access to Microsoft.Authorization/*, which will include built-in Azure RBAC roles such as Owner and User Access Administrator. It’s common practice for organizations to grant the identity (service principal or managed identity) used by a CI/CD pipeline the Owner role in order for that role to create role assignments required for the services it is deploying (you can work around this with delegation which I will cover in my next post!). This means that anyone with sufficient permissions on the pipeline could theoretically pass some code that removes the lock.

Another consideration is resource locks only affect management plane operations (if you’re unfamiliar with this concept check out my first post on authorization). Using a Storage Account as an example, a ReadOnly lock would prevent someone from modifying the CMK used to encrypt a storage account, but it wouldn’t stop a user with sufficient permissions at the data plane from deleting a container or blob. Before applying a ReadOnly resource lock, make sure you understand some of the ramifications of blocking management plane operations. Using an App Service as an example, a ReadOnly lock would prevent you from scaling that App Service up or down.

I’m a big fan of using resource locks for mission critical pieces of infrastructure such as ExpressRoute. Outside of that you’ll want to evaluate the considerations I’ve listed above in the public documentation. If your primary goal is to prevent delete operations, then you may be better off with using the next feature, Azure Policy deny Actions.

Azure Policy denyActions

Azure Policy is Azure’s primary governance tool. For those of you coming from AWS, think of Azure Policy as if AWS IAM Policy conditions had a baby with AWS Config. I like to think of it as a way to enforce the way a resource needs to look, and if the resource doesn’t look that way, you can block the action altogether, log that it’s not compliance, or remediate it.

It’s important to understand that Azure Policy sits in line with the ARM (Azure Resource Manager) API allowing it to prevent or remediate the resource creation or modification before it ever gets processed by the API. This is bonus vs having to remediate it after the fact with something like AWS Config.

Azure Policy Architecture

In addition to the functionality above, Microsoft has added a bit of authorization logic in to Azure Policy with effect of denyAction (yes it was a bit confusing to me why authorization to do something was introduced, but you know what, it can come in handy!). As of the date of this blog, the only action that can be denied is the DELETE action. While this may seem limited, it’s an awesome improvement and addresses some of the gaps in resource locks.

First, you can use the power of the Azure Policy language to filter to a specific resource type type with a specific set of tags. This allows you to apply these rules at scale. Use case here might be I want to deny the delete action on all Log Analytics Workspaces across my entire Azure estate. Second, the policy could be assigned a management group scope. By assigning the policy at the management group scope I can prevent deletions of these resources even by a user or service principal that might hold the Owner role of the subscription. This helps me mitigate the risk present with resource locks when a CI/CD pipeline has been given the Owner role over a subscription.

An example policy could look like something below. This policy would prevent any Log Analytics Workspaces tagged with a tag named rbac with a value equal to prod from being deleted.

$policy_id=(New-AzPolicyDefinition -Name $policy_name `
    -Description "This is a policy used for RBAC demonstration that blocks the deletion of Log Analytics Workspaces tagged with a key of rbac and a value of prod" `
    -ManagementGroupName $management_group_name `
    -Mode Indexed `
    -Policy '{
            "if": {
                "allOf": [
                    {
                        "field": "type",
                        "equals": "Microsoft.OperationalInsights/workspaces"
                    },
                    {
                        "field": "tags.rbac",
                        "equals": "prod"
                    }
                ]
            },
            "then": {
                "effect": "denyAction",
                "details": {
                    "actionNames": [
                        "delete"
                    ],
                    "cascadeBehaviors": {
                        "resourceGroup": "deny"
                    }
                }
            }
        }')
    ```

Just like resource locks, Azure Policy denyActions have some considerations you should be aware of. These include limitations such as it not preventing deletion of the resources when the subscription is deleted. Full limitations can be found here.

Conclusion

What I want you to take away from this post is that there are other tools beyond Azure RBAC that can help you secure your Azure resources. It’s important to practice a defense-in-depth approach and utilize each tool where it makes sense. Some general guidance would be the following:

  • Use resource locks when you want to prevent modification and Azure Policy denyActions when you want to prevent deletion. This will allow you to do it at scale and mitigate the Owner risk.
  • Be careful where you use ReadOnly resource locks. Blocking management plane actions even when the user has the appropriate permission can be super helpful, but it can also bite you in weird ways. Pay attention to the limitations in the documentation.
  • If you know the defined state of a resource needs to look like, and you’re going to deploy it that way, look at using Azure Policy with the deny effect instead of a ReadOnly resource lock. This way can enforce that configuration in code and mitigate the Owner risk.

So this is all well and good, but neither of these features allow you to pick the SPECIFIC actions you want to deny. In an upcoming post I’ll cover the feature that is kind there, but not really, to address that ask.