Azure OpenAI Service – Load Testing

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Hello again geeks. Yes, yet another Azure OpenAI Service (AOAI) post. I promise this one will be worth your time and you’ll be glad you didn’t have to bash your head against the keyboard like I did putting this one together.

Last week I was chatting with a customer who has started down the journey of providing an enterprise-scale production-ready (Fancy words right? Practicing here so I can fake like I’m a real Microsoft employee) AOAI offering to their business units (BUs). What does a typical “enterprise-ready production-scale” deployment of AOAI look like? Well, it looks similar to what you see below. The goal of this type of deployment is to:

As this customer got ready to open it up to the world, they were interested in doing some load testing on it to see how their Generative AI Gateway (Azure API Management in this case) and their backend AOAI instances would hold up to what they believed would be a production load. Some of my peers had done a similar exercise in the past with the Azure Load Testing service and Apache JMeter for a proof-of-concept. I was curious as to what this would like and how it would so I decided to throw something together, hence the post today.

So yeah, I’ve never touched the Azure Load Testing service nor have I touched JMeter more than once many many moons ago. The first step in the process was to read up on the Azure Load Testing service. This service is Microsoft’s cloud-based load testing service. It is essentially a service where MIcrosoft spins up a whole bunch of compute (engines) in Azure Batch which then runs a URL-test, Apache JMeter test, or Locust test. The compute simulates these tests (with the construct of a virtual user) as if it were a set of your users pounding away at the service.

Azure Load Testing architecture

Since most organizations have some familiarity with Apache JMeter I decided that I’d put together an Apache JMeter test. While there are a ton of JMeter examples for simple API calls, I had a hard time finding samples that involve acquiring an Entra ID access token for authentication to the API. While I could have grabbed an access token and tossed it into Azure Key Vault, I wanted to be a bit more fancy.

Creating the JMeter Test

After a bit of Googling I ended up coming across this blog post and this post which between the two I was able to get something working. I first created the thread group in JMeter and then added a Once Only Controller because I only wanted to obtain the access token once for each virtual user. From there, I added an HTTP Request sampler with the configuration below.

Obtaining Entra ID access token in JMeter

The parameters used in the authentication request are pulled from the environment variables object in the test. The environmental variables for the test are pulled from the Azure Load Testing service instance via a combination of environmental variables and secrets stored in Azure Key Vault (more on that later).

Environmental variables for the JMeter test

Once the request is complete and fetched the access token, I then used the JSON Extractor post-processor to extract the access token from the response and package into a new variable called access_token.

Extracting the access token

Ok sweet, got my access token. Next up I wanted to do a ChatCompletion against the AOAI services behind the API Management (APIM) instance. To do that I added another HTTP Request Sampler and populated it with the details below.

Creating the ChatCompletion request

JMeter has a neat feature where you can pass contents of a CSV file to samplers to dynamically populate the values in the request. I wanted the ability to pass it multiple prompts so I added a config element for a CSV Data Set Config. Now there are a few quirks to using this config element with the Azure Load Testing service. One of those quirks is you do not want to specify any file path. Likely, when the engines are spun up, they’re getting the JMeter test and supporting CSVs dropped into the same directory so it’s not needed. Additionally, your CSV file can’t have header rows so you need to ensure you define the header roles in the variable names as is seen in the screenshot below.

CSV Data Set Config

Last but not least, I needed to ensure the HTTP Request passes the appropriate headers. I added the HTTP Header Manager config element and added the Content-Type and Authorization header which contained a reference to the access token I obtained in the prior HTTP Request.

HTTP Header Manager for ChatCompletion


At that point I had a JMeter test that should work within the Azure Load Testing service. The next step was to deploy the Azure Load Testing Service.

Azure Load Testing Service Instance

Deployment of the Azure Load Testing service instance was pretty straightforward. There really aren’t a ton of options for the actual service instance. The key things to note are that the Azure Load Testing service instances use managed identities to pull secrets or certificates from Azure Key Vault. This meant that along with the Azure Load Testing instance, I needed to deploy a user-assigned managed identity (my preference over system-assigned managed identities), an Azure Key Vault instance, secrets in the Azure Key Vault for a service principal that would be used in my tests, and set some Azure RBAC role assignments. The managed identity needs at least the Azure Key Vault Secrets User RBAC role on the Azure Key Vault instance (yes you should be using RBAC authorization model instead of the old access policies at this point).

What I deployed is highlighted in blue in the image below. I’ll cover the virtual network piece in the next section.

Azure Load Testing Test

At this point I got my JMeter test, my sample ChatCompletions, and Azure Load Testing service instance. Now it’s time to create the test within the Azure Load Testing service.

Creation of tests are a data plane activity and the ability to touch the data plane with IaC is very limited so I opted to use CLI (which has its own problems as we’ll see). Before I deployed the test, I had to create my test configuration. With the service you can define your test configuration in YAML. My test included the code below:

version: v0.1
test_id: genai_gateway_test
displayName: "GenAI Gateway Load Test"
description: "This will load test a Generative AI Gateway by sending ChatCompletions"
testType: JMX
testPlan: ./genai_gateway_test.jmx
engineInstances: 1
configurationFiles:
  - './config/chat_completions.csv'
failureCriteria:
  - percentage(error) > 80
autoStop:
  errorPercentage: 80
  timeWindow: 60
env:
  - name: VIRTUAL_USERS
    value: 10
  - name: RAMP_UP
    value: 1
  - name: LOOP_COUNT
    value: 1
  - name: RESOURCE
    value: 'https://cognitiveservices.azure.com'
    # This is the fully-qualified domain name of your Generative AI Gateway
  - name: OPENAI_ENDPOINT
    value: mygenaigateway.company.com
  - name: OPENAI_DEPLOYMENT_NAME
    value: gpt-4o
  - name: OPENAI_API_VERSION
    value: 2024-04-01-preview
secrets:
    # These are the credentials of the service principal that will be used to make the calls to the Generative AI Gateway
  - name: TENANT_ID
    value: https://mykeyvault.vault.azure.net/secrets/tenantid/38a3b814339944348710b216014f5acd
  - name: CLIENT_ID
    value: https://mykeyvault.vault.azure.net/secrets/clientid/94df372a3530469ea6e4b30064d9dbdc
  - name: CLIENT_SECRET
    value: https://mykeyvault.vault.azure.net/secrets/clientsecret/f8612911116f42fe8c1b77c53ca1b8de
# This property does not seem to work as of 10/2024
keyVaultReferenceIdentity: /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/myrg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/myumi
subnetId: /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/myrg/providers/Microsoft.Network/virtualNetworks/myvnet/subnets/mysubnet
publicIPDisabled: true

Yeah, there’s a lot there. There are a few areas I want to highlight.

The first area is the secrets section. Here I included the Key Vault secret references to the service principal credentials I have sitting in the Azure Key Vault. The keyVaultReferenceIdentity is supposed to set the test to use the managed identity you specify (this didn’t work for me as we’ll see later).

The next area is the subnetId and publicIPDisabled fields. The Azure Load Testing service has the ability to run tests where packets originate from a subnet in your virtual network. This allows you to hit services behind Private Endpoints or on-premises. Given that my APIM instance is deployed in internal mode, that was a requirement for me. I also wanted to control egress traffic from the test engines injected into my subnet. This is where I set the publicIPDisabled field to True. This causes all traffic from the test engines to flow through your preferred network path. Unfortunately, this includes both data plane and management plane traffic. You’ll need to ensure you allow required flows out your Internet egress point.

You can reference documentation for the other fields, but most are descriptive enough that you’ll get the picture.

Now it’s time to deploy the test. You can do this with az cli using the az load test create command.

Post Test Deployment

Done right? Ready to run the test? Nope, not yet.

There were a few properties I set within the YAML config that didn’t seem to take. This might be because the az load test is a preview command, I’m not really sure. Either way, the properties I noticed that did not stick were the keyVaultReferenceIdentity and splitAllCSVs properties. I explained the keyVaultReferenceIdentity property above. The splitAllCSVs property will take the contents of the CSV with your ChatCompletion and will distribute them across multiple engines (if you have multiple engines). If you have a large scale test, this is likely something you may want to do.

To ensure the test can pull the secrets needed to authenticate to Entra ID from Azure Key Vault, I needed to manually set it to use the service’s managed identity because the keyVaultReferenceIdentity property did not seem to work. To do that I logged into the Azure Portal and selected the newly created test GenAI Gateway Load Test and selected to modify the configuration of the test.

Modify configuration of test

Under the parameters section towards the bottom, I was able to select the UMI I configured to be used by the Azure Load Testing service instance.

Set the identity to pull secrets from Key Vault

The other thing you can do with the Azure Load Testing service is pull metrics from supporting components (which the service refers to as server-side metrics). For this, I added the four AOAI instances I have sitting behind my APIM instance. I also needed to configure it to use the UMI associated with the service to pull the metrics (this UMI was granted permissions on the AOAI instances to pull the metrics in case I wanted to use any of them for metrics that drive how my test behaves).

Adding server-side metrics to the test

Once those changes were complete I was good to go. If I was using multiple engines (which I’m wasn’t) and I wanted to split the completions in my CSV across engines, I would have to had to manually set the option for that (another one that doesn’t seem to work in the YAML in my testing). This option is located in the Test Plan section of the test configuration under the Split CSV evenly between Test engines option..

At this point you can begin running your tests.

Summing it up

While it takes a bit of doing, getting the Azure Load Testing service up and running was pretty easy. Because I’m a nice guy, I’ve uploaded sample code for everything I’ve done to this repository. Clone it and make it your own.

There are a ton more options within the Azure Load Testing Service beyond what I went over here so get out there and explore it. A few things to be aware of:

  1. Remember that for consumption-based services like Azure OpenAI, load testing could get expensive if you scale up your test large enough. Be ready for those costs.
  2. If you end up using the VNet injection option for your testing like I did, ensure you have proper networking in place. The compute that runs in your subnet needs to be able to make TCP connections to your Generative AI Gateway. It also needs to be able to resolve the name, so make sure you have DNS properly configured.
  3. You can lock down your Key Vault with the service firewall and the usage of Private Endpoints. In my testing, the Azure Load Testing service looks to be communicating over the Microsoft public IP address so ensure you have Allow Trusted Services option checked.

Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Hello again!

Today I’m back with another post focusing on AOAI (Azure OpenAI Service). My focus falls into two buckets: operations and security. For this post I’m going to cover a topic that falls into the operations bucket.

Last year I covered some of the challenges that arise when tracking token usage when the need arises to use streaming-based ChatCompletions. The challenges center around logging the prompt, response, and token usage. The guidance I provide in that prior post is unchanged for logging prompts and completions, but capturing token usage has gotten much easier. Before I dig into the details, I want to very briefly cover why you should care about and track token usage.

The whole “AI is the new electricity” statement isn’t all hype. Your business units are going to want to experiment with it, especially generative AI, to for optimizing business processes such as shaving time off how long it takes a call center rep to resolve a customer’s problem or automating a portion of what is now a manual limited value-added activity of highly paid employees to free them up to focus on activities that drive more business value. As an organization, you’re going to be charged with providing these services to the developers, data scientists, an AI engineers. The demand will be significant and you gotta figure out a scalable way to provide these services while satisfying security, performance, and availability requirements.

This will typically drive an architecture where capacity for generative AI is pooled and distributed to your business units a core service. Acting as a control point to ensure security, availability, and performance requirements can be met, the architectural concept of a Generative AI Gateway is introduced. This component usually translates to Azure API Management, 3rd party API Gateway, or custom developed solution with “generative ai-specific” functionality layered on top (load balancing, rate limiting based on token usage, token usage tracking, prompt and response logging, caching of prompts and responses to reduce costs and latency, etc).

In Azure you might see a design like the image below where you’re distributing the requests across multiple AOAI instances spread across regions, geo-political boundaries, and subscriptions in order to maximize your quota (number of requests and tokens per model). When you have this type of architecture it’s important to get visibility into the token usage of each application for charge backs and to ensure everyone is getting their fair share of the capacity (i.e. rate limiting).

Example high-level architecture using AOAI

Now let’s align the token usage back to streaming ChatCompletions. With a non-streaming ChatCompletion the API automatically returns the number of prompt tokens, completion tokens, and total tokens that were consumed with the request. This information is easy to intercept at the Generative AI Gateway to use as an input for rate limiting or to pass on to some reporting system for charge backs on token usage.

Non-streaming ChatCompletion returning usage

When performing a streaming ChatCompletion the completion is returned in a series of server events (or chunks). Usage statistics were historically not provided in the response from the AOAI service to my understanding and experience. This forced the application developer or the owner of the Generative AI Gateway to incorporate some custom code using a Tokenizer like tiktoken to manually calculate the total number of tokens. An example of such a solution developed by one of my wonderful peers Shaun Callighan can be found here. This was one of the only (maybe the only?) to approach the problem at the time but sometimes resulted in slightly skewed results from what was estimated by the tokenizer to what the actual numbers were when processed by the AOAI service and billed to the customer.

Streaming ChatCompletion chunks of responses



Microsoft has made this easier with the introduction of the azure-openai-emit-token-metrics policy snippet for APIM (Azure API Management) which can emit token usage for both streaming and non-streaming completions (among other operations) to an App Insights instance. I talk through this at length in this post. However, at this time, it’s supported for a limited set of models and not every customer uses APIM. These customers have had to address the problem using a custom solution like I mentioned earlier.

Earlier this week I was mucking around with a simplistic ChatBot I’m building (FYI, Streamlit is an amazing framework to help build GUIs if you’re terrible at frontend design like I am) and I came across an additional parameter that can be passed when making a streaming ChatCompletion. You can pass an additional parameter called stream_options which will provide the token usage of the ChatCompletion in the the second to last chunk delivered back to the client. I’m not sure when this was introduced or how I missed it, but it removes the need to calculate this yourself with a tokenizer.

 response = client.chat.completions.create(
    model=deployment_name,
    messages= [
        {"role":"user",
         "content": message}
    ],
    max_tokens=200,
    stream=True,
    stream_options={
         "include_usage": True
        }
 )

Below you’ll see a sample response from a streaming ChatCompletion when including the stream_options property. In the chunk before the final chunk (there is a final check not visible in this image), the usage statistics are provided and can be extracted.

This provides a much better option than trying to calculate this yourself. I tested this with 3.5-turbo and 4o (both with text and images) and it gave me back the token usage as expected (I’m using API version 2024-02-01). I threw together some very simple (and if it’s coming from me it’s likely gonna be simple because my coding skills leave a lot to be desired) to capture these metrics and return them as part of the completion.

# Class to support completion and token usage
class ChatMessage:
    def __init__(self, full_response, prompt_tokens, completion_tokens, total_tokens):
        self.full_response = full_response
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens
        self.total_tokens = total_tokens

# Streaming chat completions
async def get_streaming_chat_completion(client, deployment_name, messages, max_tokens):
    response = client.chat.completions.create(
        model=deployment_name,
        messages=messages,
        max_tokens=max_tokens,
        stream=True,
        stream_options={
            "include_usage": True
        }
    )
    assistant_message = st.chat_message("assistant")
    full_response = ""

    with assistant_message:
        message_placeholder = st.empty()

    # Intialize token counts
    t_tokens = 0
    c_tokens = 0
    p_tokens = 0
    usage_dict = None


    for chunk in response:
        if chunk.usage:
            usage_dict = chunk.usage
            if p_tokens == 0:
                p_tokens = usage_dict.prompt_tokens
                c_tokens = usage_dict.completion_tokens
                t_tokens = usage_dict.total_tokens

        if hasattr(chunk, 'choices') and chunk.choices:
            content = chunk.choices[0].delta.content
            if content is not None:
                full_response += content
                message_placeholder.markdown(full_response)

    if full_response == "":
        full_response = "Sorry, I was unable to generate a response."

    return ChatMessage(full_response, p_tokens, c_tokens, t_tokens)

For those of using APIM as a Generative AI Gateway, you won’t have to worry about this for most of the OpenAI models offered through AOAI because the policy snippet I mentioned earlier will be improved to support additional models beyond what it supports today. For those of you using third-party gateways, this is likely relevant and may help to simplify your code and eliminate the discrepancies you see from calculating token usage yourself vs what you’re seeing displayed within the AOAI instance.

Well folks, this post was short and sweet. Hopefully this small tidbit of information helps a few folks out there who were going the tokenizer route. Any simplification these days is welcome!

Azure Networking – Inspecting traffic to Private Endpoints Revisited… Again. Maybe for the last time?

Update 11/4/2024 – Added limitations
Update 10/11/2024 – Updated with generally available announcement

Welcome back! Today I’m going to step back from the Generative AI world and talk about some good ole networking. Networking is one of those technical components of every solution that gets glossed over until the rubber hits the road and the application graduates to “production-worthy”. Sitting happily beside security, it’s the topic I’m most often asked to help out with at Microsoft. I’m going to share a new feature has gone generally available under the radar that is pretty damn cool, even if a bit confusing.

Organizations in the regulated space frequently have security controls where a simple 5-tuple-based firewall rule at OSI layer 4 won’t suffice and traffic inspection needs to occur to analyze layer 7. Take for an example a publicly facing web application deployed to Azure. These applications can be subject to traffic inspection at multiple layers like an edge security service (Akamai, CloudFlare, FrontDoor, etc) and again when the traffic enters the customer’s virtual through a security appliance (F5, Palo Alto, Application Gateway, Azure Firewall, etc). Most of the time you can get away with those two inspection points (edge security service and security appliance deployed into virtual network) for public traffic and one inspection point for private traffic (security appliance deployed into virtual network and umpteenth number of security appliances on-premises). However, that isn’t always the case.

Many customers I work with have robust inspection requirements that may require multiple inspection points within Azure. The two most common patterns where this pops up is when traffic first moves through an Application Gateway or APIM (API Management) instance. In these scenarios some customers want to funnel the traffic through an additional inspection point such as their third-party firewall for additional checks or a centralized choke point managed by information security (in the event Application Gateways / APIM have been democratized). When the backend is a traditional virtual machine or virtual network injected/integrated (think something like an App Service Environment v3) the routing is quite simple and looks like something like the below.

Traffic inspection with traditional virtual machine or VNet Injected/VNet integrated service

In the above image we slap a custom route table on the Application Gateway subnet, and add a user-defined route that says when contacting the subnet containing the frontend resources of the application, it needs to go the firewall first. To ensure the symmetry of return traffic, we put a route table on the frontend subnet with a user-defined route that says communication to the Application Gateway subnet needs to also go to the firewall. The routes in these two route tables are more specific than the system route for the virtual network and take precedence forcing both the incoming and return traffic to flow symmetrically through the firewall. Easy enough.

The routing when inspecting traffic to services which receive their inbound traffic via a Private Endpoint (such as an App Service running in a Premium App Services Plan, a Storage Account, a Key Vault, etc) that inspection gets more challenging. These challenges exist for both controlling the traffic to the Private Endpoint and controlling the return traffic.

When a Private Endpoint is provisioned in a virtual network, a new system route is injected into the route tables of each subnet in that virtual network AND any peered virtual networks. This route is a /32 for the IP address assigned to the network interface associated with the Private Endpoint as seen in the image below.

System route added by the creation of a Private Endpoint in a virtual network

Historically, to work around this you had to drop /32 routes everywhere to override those routes to push the incoming traffic to the Private Endpoints through an inspection point. This was a nightmare at scale as you can imagine. Back in August 2023, Microsoft introduced what they call Private Endpoint Network Policies, which is a property of a subnet that allows you to better manage this routing (in addition to optionally enforcing Network Security Groups on Private Endpoints) by allowing less specific routes to override the more specific Private Endpoint /32 routes. You set this property to Enabled (both this routing feature and network security group enforcement) or RouteTableEnabled (just this routing feature). This property is set on the subnet you place the Private Endpoints into. Yeah I know, confusing because that is not how routing is supposed to work (where less specific routes of the same length override more specific routes), but this is an SDN (software defined network) so they’ll do what they please and you’ll like it.

Private Endpoint route invalid because Private Endpoint Network Policy property set to RouteTableEnabled

While this feature helped to address traffic to the Private Endpoint, handling the return traffic wasn’t so simple. Wrapping a custom route table around a subnet containing Private Endpoints does nothing to control return traffic from the Private Endpoints. They do not care about your user-defined routes and won’t honor them. This created an asymmetric traffic flow where incoming traffic was routed through the inspection point but return traffic bypassed it and went direct to the calling endpoint.

This misconfiguration was very common in customer environments and rarely was noticed because many TCP sessions with Private Endpoints are short lived and thus the calling client isn’t affected by the TCP RST sent by the firewall after X number of minutes. Customers could work around this by SNATing to the NVA’s (inspection point) IP address and ensure the return traffic was sent back to the NVA before it was passed back to the calling client. What made it more confusing was some services “just worked” because Microsoft was handling that symmetry in the data plane of the SDN. Azure Storage was an example of such a service. If you’re interested in understanding the old behavior, check out this post.

Prior asymmetric behavior without SNAT at NVA

You’ll notice I said “prior” behavior. Yes folks, SNATing when using a 3rd-party NVA (announcement is specific to 3rd-party NVAs. Those of you using Azure Firewall in a virtual network, Azure Firewall in a VWAN Secure Hub, or a 3rd-party NVA in a VWAN Secure Hub will need to continue to SNAT for now (As of 11/2024) until this feature is extended to those use case.

I bet you’re thinking “Oh cool, Microsoft is now having Private Endpoints honor user-defined routes in route tables”. Ha ha, that would make far too much sense! Instead Microsoft has chosen to require resource tags on the NICs of the NVAs to remove the SNAT requirement. Yeah, wouldn’t have been my choice either but here we are. Additionally, in my testing, I had it working without the resource tags to get a symmetric flow of traffic. My assumption (and total assumption as an unimportant person at Microsoft) is that this may be the default behavior on some of the newer SDN stacks while older SDN stacks may require the tags. Either way, do what the documentation says and put the tags in place.

As of today (10/11/2024) the generally available documentation is confusing as to what you need to do. I’ve provided some feedback to the author to fix some of the wording, but in the meantime let me explain what you need to do. You need to create a resource tag on either the NIC (non-VMSS) or VM instance (VMSS) that has a key of disableSnatOnPL with a value of true.

Magic of SDN ensuring symmetric flow without SNAT

TLDR; SNAT should no longer be required to ensure symmetric traffic flow when placing an NVA between an endpoint and a Private Endpoint if you have the proper resource tag in place. My testing of the new feature was done in Central US and Canada Central with both Azure Key Vault and Azure SQL. I tested when the calling endpoint was within the same virtual network, when it was in a peered virtual network connected in a hub and spoke environment, and when the calling machine was on-premises calling a private endpoint in a spoke. In all scenarios the NVA showed a symmetric flow of traffic in a packet capture.

Azure OpenAI Service – Tracking Token Usage with APIM

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Yeah, yeah, yeah, I missed posting in July. I have been appropriately shamed on a daily basis by WordPress reminders.

I’m going to make up for it today by covering another of the “Generative AI Gateway” features of APIM (Azure API Management) that were announced a few months back. I’ve already covered the circuit breaker and load balancing and the token-based rate limiting features. These two features have made it far easier to distribute and control the usage of the AOAI (Azure OpenAI Service) that is being offered as a core enterprise service. One of the challenges that isn’t addressed by those features is charge backs.

As I’ve covered in prior posts, you can get away with an instance or two of AOAI dedicated to an app when you have one or two applications at the POC (proof-of-concept) stage. Capacity and charge back isn’t an issue in that model. However, your volume of applications will grow as well as the capacity of tokens and requests those applications require as they move to production. This necessitates AOAI being offered as a core foundational service as basic as DNS or networking. The patterns for doing this involve centrally distributing requests across several instances of AOAI spread across different regions and subscriptions using a feature like the circuit breaker and load balancing features of APIM. Once you have several applications drawing from a common pool, you then need to control how much each of those applications can consume using a feature like the token-based rate limiting feature of APIM.

Common way to scale AOAI service

Wonderful! You’ve built a service that has significant capacity and can service your BUs from a central endpoint. Very cool, but how are you gonna determine who is consuming what volume?

You may think, “That information is returned in the response. I can have the developers use a common code snippet to send that information for each response to a central database where I can track it.” Yeah nah, that ain’t gonna work. First, you ain’t ever gonna get that level of consistency across your enterprise (if you do have this, drop me an email because I want to work there). Second, as of today, the APIs do not return the number of tokens used for streaming based chat completions which will be a large majority of what is being sent to the models.

I know you, and you’re determined. You follow-up with, “Well Matt, I’m simply going to pull the native metrics from each of the AOAI instances I’m load balancing to.” Well yeah, you could do that but guess what? Those only show you the total consumed across the instance and do not provide a dimension for you to determine how much of that total was related to a specific application.

Native metrics and its dimensions for an instance of AOAI

“Well Matt, I’m going to configure diagnostic logging for each of my AOAI instances and check off the Request and Response Logs. Surely that information will be in there!”. You don’t quit do you? Let me shatter your hopes yet again, no that will not work. As I’ve covered in a prior post while the logs do contain the Entra ID object ID (assuming you used Entra ID-based authentication) you won’t find any token counts in those logs either.

AOAI Request and Response Logs

Well fine then, you’re going to use a custom logging solution to capture token usage when it’s returned by the API and calculate it when it isn’t. While yes this does work and does provide a number of additional benefits beyond information for charge backs (and I’m a fan of this pattern) it takes some custom code development and some APIM policy snippet expertise. What if there was an easier way?

That is where the token metrics feature of APIM really shines. This feature allows you to configure APIM to emit a custom metric for the tokens consumed by a Completion, Chat Completion (EVEN STREAMING!!), or Embeddings API call to an AOAI backend with a very basic APIM Policy snippet. You can even add custom dimensions and that is where this feature gets really powerful.

The first step in setting this up is to spin up an instance of Application Insights (if your APIM isn’t already hooked into one) and a Log Analytics Workspace the Application Insights instance will be associated with. Once your App Insights instance is created, you need to modify the settings API in APIM you’ve defined for AOAI and turn on the App Insights integration and enable custom metrics as seen below.

Enable custom metrics in APIM

Next up, you need to modify your APIM policy. In the APIM Policy snippet below I extra a few pieces of data from the request and add them as dimensions to the custom metric. Here I’m extracting the Entra ID app id of security principal accessing the AOAI service (this would be the application’s identity if you’re using Entra ID authentication to the AOAI service) and the model deployment name being called from AOAI which I’ve standardized to be the same as the model name.

         <!-- Extract the application id from the Entra ID access token -->

        <set-variable name="appId" value="@(context.Request.Headers.GetValueOrDefault("Authorization",string.Empty).Split(' ').Last().AsJwt().Claims.GetValueOrDefault("appid", string.Empty))" />

        <!-- Extract the model name from the URL -->

        <set-variable name="uriPath" value="@(context.Request.OriginalUrl.Path)" />
        <set-variable name="deploymentName" value="@(System.Text.RegularExpressions.Regex.Match((string)context.Variables["uriPath"], "/deployments/([^/]+)").Groups[1].Value)" />

        <!-- Emit token metrics to Application Insights -->

        <azure-openai-emit-token-metric namespace="openai-metrics">
            <dimension name="model" value="@(context.Variables.GetValueOrDefault<string>("deploymentName","None"))" />
            <dimension name="client_ip" value="@(context.Request.IpAddress)" />
            <dimension name="appId" value="@(context.Variables.GetValueOrDefault<string>("appId","00000000-0000-0000-0000-000000000000"))" />
        </azure-openai-emit-token-metric>

After making a few calls from my code to APIM, the metrics begin to populate in the App Insights instance. To view those metrics you’ll want to go into the App Insights blade and go to the Monitoring -> Metrics section. Under the Metrics Namespace drop down you’ll see the namespace you’ve created in the policy snippet. I named mine openai-metrics.

Accessing custom metrics in App Insights for token metrics

I can now select metrics based on prompt tokens, completion tokens, and total tokens consumed. Here I select the completion tokens and split the data by the appId, client IP address, and model to give me a view of how many tokens each app is consuming and of which model at any given time span.

Metrics split by dimensions

Very cool right?

As of today, there are some key limitations to be aware of:

  1. Only Chat Completions, Completions, and Embedding API operations are supported today.
  2. Each API operation is further limited by which models it supports. For example, as of August 2024, Chat Completions only supports gpt-3.5 and gpt-4. No 4o support yet unfortunately.
  3. If you’re using a load balanced pool backend, you can’t yet use the actual backend the pool send the request to as a dimension.

Well folks, hopefully this helps you better understand why this functionality was added and the value it provides. While you could do this with another API Gateway (pick your favorite), it likely won’t be as simple as it it with APIM’s policy snippet. Another win for cloud native I guess!

Thanks!