Azure OpenAI Service – Tracking Token Usage with APIM

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Yeah, yeah, yeah, I missed posting in July. I have been appropriately shamed on a daily basis by WordPress reminders.

I’m going to make up for it today by covering another of the “Generative AI Gateway” features of APIM (Azure API Management) that were announced a few months back. I’ve already covered the circuit breaker and load balancing and the token-based rate limiting features. These two features have made it far easier to distribute and control the usage of the AOAI (Azure OpenAI Service) that is being offered as a core enterprise service. One of the challenges that isn’t addressed by those features is charge backs.

As I’ve covered in prior posts, you can get away with an instance or two of AOAI dedicated to an app when you have one or two applications at the POC (proof-of-concept) stage. Capacity and charge back isn’t an issue in that model. However, your volume of applications will grow as well as the capacity of tokens and requests those applications require as they move to production. This necessitates AOAI being offered as a core foundational service as basic as DNS or networking. The patterns for doing this involve centrally distributing requests across several instances of AOAI spread across different regions and subscriptions using a feature like the circuit breaker and load balancing features of APIM. Once you have several applications drawing from a common pool, you then need to control how much each of those applications can consume using a feature like the token-based rate limiting feature of APIM.

Common way to scale AOAI service

Wonderful! You’ve built a service that has significant capacity and can service your BUs from a central endpoint. Very cool, but how are you gonna determine who is consuming what volume?

You may think, “That information is returned in the response. I can have the developers use a common code snippet to send that information for each response to a central database where I can track it.” Yeah nah, that ain’t gonna work. First, you ain’t ever gonna get that level of consistency across your enterprise (if you do have this, drop me an email because I want to work there). Second, as of today, the APIs do not return the number of tokens used for streaming based chat completions which will be a large majority of what is being sent to the models.

I know you, and you’re determined. You follow-up with, “Well Matt, I’m simply going to pull the native metrics from each of the AOAI instances I’m load balancing to.” Well yeah, you could do that but guess what? Those only show you the total consumed across the instance and do not provide a dimension for you to determine how much of that total was related to a specific application.

Native metrics and its dimensions for an instance of AOAI

“Well Matt, I’m going to configure diagnostic logging for each of my AOAI instances and check off the Request and Response Logs. Surely that information will be in there!”. You don’t quit do you? Let me shatter your hopes yet again, no that will not work. As I’ve covered in a prior post while the logs do contain the Entra ID object ID (assuming you used Entra ID-based authentication) you won’t find any token counts in those logs either.

AOAI Request and Response Logs

Well fine then, you’re going to use a custom logging solution to capture token usage when it’s returned by the API and calculate it when it isn’t. While yes this does work and does provide a number of additional benefits beyond information for charge backs (and I’m a fan of this pattern) it takes some custom code development and some APIM policy snippet expertise. What if there was an easier way?

That is where the token metrics feature of APIM really shines. This feature allows you to configure APIM to emit a custom metric for the tokens consumed by a Completion, Chat Completion (EVEN STREAMING!!), or Embeddings API call to an AOAI backend with a very basic APIM Policy snippet. You can even add custom dimensions and that is where this feature gets really powerful.

The first step in setting this up is to spin up an instance of Application Insights (if your APIM isn’t already hooked into one) and a Log Analytics Workspace the Application Insights instance will be associated with. Once your App Insights instance is created, you need to modify the settings API in APIM you’ve defined for AOAI and turn on the App Insights integration and enable custom metrics as seen below.

Enable custom metrics in APIM

Next up, you need to modify your APIM policy. In the APIM Policy snippet below I extra a few pieces of data from the request and add them as dimensions to the custom metric. Here I’m extracting the Entra ID app id of security principal accessing the AOAI service (this would be the application’s identity if you’re using Entra ID authentication to the AOAI service) and the model deployment name being called from AOAI which I’ve standardized to be the same as the model name.

         <!-- Extract the application id from the Entra ID access token -->

        <set-variable name="appId" value="@(context.Request.Headers.GetValueOrDefault("Authorization",string.Empty).Split(' ').Last().AsJwt().Claims.GetValueOrDefault("appid", string.Empty))" />

        <!-- Extract the model name from the URL -->

        <set-variable name="uriPath" value="@(context.Request.OriginalUrl.Path)" />
        <set-variable name="deploymentName" value="@(System.Text.RegularExpressions.Regex.Match((string)context.Variables["uriPath"], "/deployments/([^/]+)").Groups[1].Value)" />

        <!-- Emit token metrics to Application Insights -->

        <azure-openai-emit-token-metric namespace="openai-metrics">
            <dimension name="model" value="@(context.Variables.GetValueOrDefault<string>("deploymentName","None"))" />
            <dimension name="client_ip" value="@(context.Request.IpAddress)" />
            <dimension name="appId" value="@(context.Variables.GetValueOrDefault<string>("appId","00000000-0000-0000-0000-000000000000"))" />
        </azure-openai-emit-token-metric>

After making a few calls from my code to APIM, the metrics begin to populate in the App Insights instance. To view those metrics you’ll want to go into the App Insights blade and go to the Monitoring -> Metrics section. Under the Metrics Namespace drop down you’ll see the namespace you’ve created in the policy snippet. I named mine openai-metrics.

Accessing custom metrics in App Insights for token metrics

I can now select metrics based on prompt tokens, completion tokens, and total tokens consumed. Here I select the completion tokens and split the data by the appId, client IP address, and model to give me a view of how many tokens each app is consuming and of which model at any given time span.

Metrics split by dimensions

Very cool right?

As of today, there are some key limitations to be aware of:

  1. Only Chat Completions, Completions, and Embedding API operations are supported today.
  2. Each API operation is further limited by which models it supports. For example, as of August 2024, Chat Completions only supports gpt-3.5 and gpt-4. No 4o support yet unfortunately.
  3. If you’re using a load balanced pool backend, you can’t yet use the actual backend the pool send the request to as a dimension.

Well folks, hopefully this helps you better understand why this functionality was added and the value it provides. While you could do this with another API Gateway (pick your favorite), it likely won’t be as simple as it it with APIM’s policy snippet. Another win for cloud native I guess!

Thanks!

Azure OpenAI Service – How To Get Insights By Collecting Logging Data

Azure OpenAI Service – How To Get Insights By Collecting Logging Data

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Hello geeks! Yes, I’m back with yet another post on the Azure OpenAI Service. There always seems to be more cool stuff to talk about with this service that isn’t specific to the models themselves. If you follow this blog, you know I’ve spent the past year examining the operational and security aspects of the service. Through trial and error and a ton of discussions with S500 customers across all industries, I’ve learned a ton and my goal has to be share back those lessons learned with the wider community. Today I bring you more nuggets of useful information.

Like any good technology nerd, I’m really nosey. Over the years I’ve learned about all the interesting information web-based services return the response headers and how useful this information can be to centrally capture and correlate to other pieces of logging information. These headers could include things like latency, throttling information, or even usage information that can be used to correlate the costs of your usage of the service. While I had glanced at the response headers from the Azure OpenAI Service when I was doing my work on the granular chargeback and streaming ChatCompletions posts, I hadn’t gone through the headers meticulously. Recently, I was beefing up Shaun Callighan’s excellent logging helper solution with some additional functionality I looked more deeply at the headers and found some cool stuff that was worth sharing.

How to look at the headers (skip if you don’t want to nerd out a bit)

My first go to whenever examining a web service is to power up Fiddler and drop it in between my session and the web service. While this works great on a Windows or MacOS box when you can lazily drop the Fiddler-generated root CA (certificate authority) into whatever certificate store your browser is using to draw its trusted CAs from, it’s a bit more work when conversing with a web service through something like Python. Most SDKs in my experience use the requests module under the hood. In that case it’s a simple matter of passing a kwarg some variant of the option to disable certificate verification in the requests module (usually something like verify=false) like seen below in the azure.identity SDK.

from azure.identity import DefaultAzureCredential, get_bearer_token_provider

try:
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(
            connection_verify=False
        ),
        "https://cognitiveservices.azure.com/.default",
    )
except:
    logging.error('Failed to obtain access token: ', exc_info=True)

Interestingly, the Python openai SDK does not allow for this. Certificate verification cannot be disabled with an override. Great security control from the SDK developers, but no thought of us lazy folks. The openai SDK uses httpx under the hood, so I took the nuclear option and disabled verification of certificates in the module itself. Obviously a dumb way of doing it, but hey lazy people gotta lazy. If you want to use Fiddler, be smarter than me and use one of the methods outlined in this post to trust the root CA generated by Fiddler.

All this to get the headers? Well, because I like you, I’m going to show you a far easier way to look at these headers using the native openai SDK.

The openai SDK doesn’t give you back the headers by default. Instead the response body is parsed neatly for you and a new object is returned. Thankfully, the developers of the library put in a way to get the raw response object back which includes the headers. Instead of using the method chat.completions.create you can use chat.completions.with_raw_response.create. Glancing at the SDK, it seems like all methods supported by both the native client and AzureOpenAI client support the with_raw_response method.

def get_raw_chat_completion(client, deployment_name, message):
    response = client.chat.completions.with_raw_response.create(
    model=deployment_name,
    messages= [
        {"role":"user",
         "content": message}
    ],
    max_tokens=1000,
    )

    return response

Using this alternative method will save you from having to mess with the trusted certificates as long as you’re good with working with a text-based output like the below.

Headers({'date': 'Fri, 17 May 2024 13:18:21 GMT', 'content-type': 'application/json', 'content-length': '2775', 'connection': 'keep-alive', 'cache-control': 'no
-cache, must-revalidate', 'access-control-allow-origin': '*', 'apim-request-id': '01e06cdc-0418-47c9-9864-c914979e9766', 'strict-transport-security': 'max-age=3
1536000; includeSubDomains; preload', 'x-content-type-options': 'nosniff', 'x-ms-region': 'East US', 'x-ratelimit-remaining-requests': '1', 'x-ratelimit-remaini
ng-tokens': '1000', 'x-ms-rai-invoked': 'true', 'x-request-id': '6939d17e-14b2-44b7-82f4-e751f7bb9f8d', 'x-ms-client-request-id': 'Not-Set', 'azureml-model-sess
ion': 'turbo-0301-57d7036d'})

This can be incredibly useful if you’re dropped some type of gateway, such as an APIM (API Management) instance in front of the OpenAI instance for load balancing, authorization, logging, throttling etc. If you’re using APIM, you can my buddy Shaun’s excellent APIM Policy Snippet to troubleshoot a failing APIM policy. Now that I’ve given you a workaround to using Fiddler, I’m going to use Fiddler to explore these headers for the rest of the post because I’m lazy and I like a pretty GUI sometimes.

Examining the response headers and correlating data to diagnostic logs

Here we can see the response headers returned from a direct call to the Azure OpenAI Service.

The headers which should be of interest to you are the x-ms-region, x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, and x-request-id. The x-ms-region is the region where the Azure OpenAI instance you called is located (I’ll explain why this can be useful in a bit). The x-ratelimit headers tell you how close you are to hitting rate limits on a specific instance of a model in an AOAI instance. This is where load balancing and provisioned throughput units can help mitigate the risk of throttling. The load balancing headers are still important to your application devs to pay attention to and account for even if you’re load balancing across multiple instances because load balancing mitigates but doesn’t eliminate the risk of throttling. The final interesting header is the apim-request-id which is the unique identifier of this specific request to the AOAI service. If you’re wondering, yes it looks like the product group has placed the compute running the models behind an instance of Azure API Management.

Let’s first start with the apim-request-id response header. This header is useful because it can be used to correlate a specific request it’s relevant entry in the native diagnostic logging for the Azure OpenAI Service. While I’ve covered the limited use of the diagnostic logging within the service, there are some good nuggets in there which I’ll cover now.

Using the apim-request-id, I can make a query to wherever I’m storing the diagnostic logs for the AOAI instance to pull the record for the specific request. In my example I’m using a Log Analytics Workspace. Below you can see my Kusto query which pulls the relevant record from the RequestResponse category of logs.

Correlating a request to the Azure OpenAI Service to the diagnostic logs

There are a few useful pieces of information in this log entry.

  • DurationMs – This field tells us how long the response took from the Azure OpenAI Service. My favorite use of this field comes when considering non-PTU-based Azure OpenAI instances. Lots of people want to use the service and the underlining models in a standard pay-as-you-go tier can get busy in certain regions at certain times. If you combine this information with the x-ms-region response header you can begin to build a picture of average response times per region at specific times of the day. If you’re load balancing, you can tweak your logic to direct your organization’s prompts to the region that has the lowest response time. Cool right?
  • properties_s.streamType – This field tells you whether or not the request was a streaming-type completion. This can be helpful to give you an idea of how heavily used streaming is in your org. As I’ve covered previously, capturing streaming prompts and completions and calculating token usage can a challenge. This property can help give you an idea how heavily used it is across your org which may drive you to get a solution in place to do that calculation sooner rather than later.
  • properties_s.modelName, modelVersion – More useful information to enrich the full picture of the service usage while being able to trace that information back to specific prompts and responses.
  • objectId – If your developers are using Entra ID-based identities to authenticate to the AOAI service (which you should be doing and avoiding use of API keys where possible), you’ll have the objectid of the specific service principal that made the request.

Awesome things you can do with this information

You are likely beginning to see the value of collecting the response headers, prompt and completions from the request and respond body, and enriching that information from logging data collected from diagnostics logs. With that information you can begin getting a full picture of how the service is being used across your organization.

Examples include:

  • Calculating token usage for organizational chargebacks
  • Optimizing the way you load balance to take advantage of less-used regions for faster response times
  • Making troubleshooting easier by being able to trace a specific response back to which instance it, the latency, and the prompt and completion returned by the API.

There are a ton of amazing things you can do with this data.

How the hell do you centrally collect and visualize this data?

Your first step should be to centrally capturing this data. You can use the APIM pattern that is quite popular or you can build your own solution (I like to refer to this middle tier component as a “Generative AI Gateway”. $50 says that’s the new buzzwords soon enough). Either way, you want this data captured and delivered somewhere. In my demo environment I deliver the data to an Event Hub, do a bit of transformation and dump it into a CosmosDB with Stream Analytics, and the visualize it with PowerBI. An example of the flow I use in my environment is below.

Example flow of how to capture and monetize operational and security data from your Azure OpenAI Usage

The possibilities for the architecture are plentiful, but the value of this data to operations, security, and finance is worth the effort to assemble something in your environment. I hope this post helped to get your more curious about what your usage looks like and how could use this data to optimize operationally, financially, and even throw in a bit more security with more insight into what your users are doing with this GenAI models by reviewing the captured prompts and responses. While there isn’t a lot of regulation around the use of GenAI yet, it’s coming and by capturing this information you’ll be ready to tackle it.

Thanks for reading!