AI Foundry – The Basics

This is a part of my series on AI Foundry:

Happy New Year! Over the last few months of 2024, I was buried in AI Foundry (FKA Azure AI Studio. Hey marketing needs to do something a few times a year.) proof-of-concepts with my best buddy Jose. The product is very new, very powerful, but also very complex to get up and running in a regulated environment. Through many hours labbing scenarios out and troubleshooting with customers, I figured it was time to share some of what I learned.

So what is AI Foundry? If you want the Microsoft explanation of what it is, read the documentation. Here you’ll get the Matt Felton opinion on what it is (god help you). AI Foundry is a toolset intended to help AI Engineers build Generative AI applications. It allows them to interact with the LLMs (large language models) and build complex workflows (via Prompt Flows) in a no code (like Chat Playground) or low code (prompt flow interface) environment. As you can imagine, this is an attractive tool to get the people who know Generative AI quick access to the LLMs so use cases can be validated before expensive development cycles are spent building the pretty front-end and code required to make it a real application.

Before we get into the guts of the use cases Jose and I have run into, I want to start with the basics of how the hell this service is setup. This will likely require a post or two, so grab your coffee and get ready for a crash course.

AI Foundry is built on top of AML (Azure Machine Learning). If you’re ever built out a locked down AML instance, you have some understanding of the many services that work together to provide the service. AI Foundry inherits these components and provides a sleek user interface on top and typically requires additional resources like Azure OpenAI and AI Search (for RAG use cases). Like AML there are lots and lots of pieces that you need to think about, plan for, and implement in the correct manner to make the secured instance of AI Foundry work.

One specific feature of AML plays an important role within AI Foundry and that is the concept of a hub workspace. A hub is an AML workspace that centrally manages the security, networking, compute resources, and quota for children AML workspaces. These child AML workspaces are referred to as projects. The whole goal of the hub is to make it easier for your various business units to do the stuff they need to do with AML/AI Foundry without having to manage the complex pieces like the security and networking. My guidance would be to think about giving each business unit a Foundry Hub that they can group projects of similar environment (prod or non-prod) and data sensitivity.

General relationship between AI Foundry Hub and AI Foundry Projects

Ok, so you get the basic gist of this. When you deploy an AI Foundry instance, behind the scenes an AML workspace designated as a hub is created. Each project you create is a “child” AML workspace of the hub workspace and will inherit some resources from the hub. Now that you’re grounded in that basic piece, let’s talk about the individual components.

The many resources involved with a secured AI Foundry instance

As you can see in the above image, there are a lot of components of this solution that you will likely use if you want to deploy an AI Foundry instance that has the necessary security and networking controls. Let me give you a quick and dirty explanation of each component. I’ll dive deeper into the identity, authorization, and networking aspects of these components in future posts.

  • Managed identities
    • There are lots of managed identities in use with this product. There is a managed identity for the hub, a managed identity for the project, and managed identities for the various compute. One of the challenges of AI Foundry is knowing which managed identity (and if not a managed identity, the user’s identity) is being used to access a specific resource.
  • Azure Storage Account
    • Just like in AML, there is a default storage account associated with the workspace. Unlike with a traditional AML workspace you may be familiar with, the hub feature allows all project workspaces to leverage the same storage account. The storage account is used by the workspaces to store artifacts like logs in blob storage and artifacts like prompt flows in file storage. The hub and projects isolate their data to specific containers (for blob) and folders (for file) with Azure ABAC (holy f*ck a use case for this feature finally) setup such that the managed identities for the workspaces can only access containers/folders for data related to their specific workspace.
  • Azure Key Vault
    • The Azure Key Vault will store any keys used for connections created from within the AI Foundry project. This could be keys for the default storage account or keys used for API access to models you deploy from the model catalog.
  • Azure Container Registry
    • While this is deemed optional, I’d recommend you plan on deploying it. When deploy a prompt flow there uses certain functionality to the managed compute, the container image used isn’t the default runtime and an ACR instance will be spun up automatically without all the security controls you’ll likely want.
  • Azure OpenAI Service
    • This is used for deployment of OpenAI and some Microsoft chat and embedding models
  • Azure AI Service
    • This can be used as an alternative to the Azure OpenAI Service. It has some additional functionality beyond just hosting the models such as speed and the like.
  • Azure AI Search
    • This will be used for anything RAG related. Most likely you’ll see it used with the Chat With Your Data feature of the Chat Playground.
  • Managed Network
    • This is used to host the compute instances, serverless endpoints and managed online endpoints spun up for compute operations within AI Foundry. I’ll do a deep dive into networking within the service in a future post.
  • Azure Firewall
    • If building a secured instance, you’re going to use a managed virtual network that is locked down to all Internet access via outbound rules. Under the hood an Azure Firewall instance (standard for almost all use cases) will be spun up in the managed virtual network. You interact with it through the creation of the outbound rules and can’t directly administer it. However, you will be paying for it.
  • Role Assignments
    • So so so many role assignments. I’ll cover these in a future post.
  • Azure Private DNS
    • Used heavily for interacting with the AI Foundry instance and models/endpoints you deploy. I’ll cover this in an upcoming post.

Are you frightened yet? If not you should be! Don’t worry though, over this series I’ll walk you through the pain points of getting this service up and running. Once you get past the complex configuration, it’s a crazy valuable service that you’ll have a high demand for from your business units.

In the next post I’ll walk through the complexities of authorization within AI Foundry.

Azure OpenAI Service – Load Testing

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Hello again geeks. Yes, yet another Azure OpenAI Service (AOAI) post. I promise this one will be worth your time and you’ll be glad you didn’t have to bash your head against the keyboard like I did putting this one together.

Last week I was chatting with a customer who has started down the journey of providing an enterprise-scale production-ready (Fancy words right? Practicing here so I can fake like I’m a real Microsoft employee) AOAI offering to their business units (BUs). What does a typical “enterprise-ready production-scale” deployment of AOAI look like? Well, it looks similar to what you see below. The goal of this type of deployment is to:

As this customer got ready to open it up to the world, they were interested in doing some load testing on it to see how their Generative AI Gateway (Azure API Management in this case) and their backend AOAI instances would hold up to what they believed would be a production load. Some of my peers had done a similar exercise in the past with the Azure Load Testing service and Apache JMeter for a proof-of-concept. I was curious as to what this would like and how it would so I decided to throw something together, hence the post today.

So yeah, I’ve never touched the Azure Load Testing service nor have I touched JMeter more than once many many moons ago. The first step in the process was to read up on the Azure Load Testing service. This service is Microsoft’s cloud-based load testing service. It is essentially a service where MIcrosoft spins up a whole bunch of compute (engines) in Azure Batch which then runs a URL-test, Apache JMeter test, or Locust test. The compute simulates these tests (with the construct of a virtual user) as if it were a set of your users pounding away at the service.

Azure Load Testing architecture

Since most organizations have some familiarity with Apache JMeter I decided that I’d put together an Apache JMeter test. While there are a ton of JMeter examples for simple API calls, I had a hard time finding samples that involve acquiring an Entra ID access token for authentication to the API. While I could have grabbed an access token and tossed it into Azure Key Vault, I wanted to be a bit more fancy.

Creating the JMeter Test

After a bit of Googling I ended up coming across this blog post and this post which between the two I was able to get something working. I first created the thread group in JMeter and then added a Once Only Controller because I only wanted to obtain the access token once for each virtual user. From there, I added an HTTP Request sampler with the configuration below.

Obtaining Entra ID access token in JMeter

The parameters used in the authentication request are pulled from the environment variables object in the test. The environmental variables for the test are pulled from the Azure Load Testing service instance via a combination of environmental variables and secrets stored in Azure Key Vault (more on that later).

Environmental variables for the JMeter test

Once the request is complete and fetched the access token, I then used the JSON Extractor post-processor to extract the access token from the response and package into a new variable called access_token.

Extracting the access token

Ok sweet, got my access token. Next up I wanted to do a ChatCompletion against the AOAI services behind the API Management (APIM) instance. To do that I added another HTTP Request Sampler and populated it with the details below.

Creating the ChatCompletion request

JMeter has a neat feature where you can pass contents of a CSV file to samplers to dynamically populate the values in the request. I wanted the ability to pass it multiple prompts so I added a config element for a CSV Data Set Config. Now there are a few quirks to using this config element with the Azure Load Testing service. One of those quirks is you do not want to specify any file path. Likely, when the engines are spun up, they’re getting the JMeter test and supporting CSVs dropped into the same directory so it’s not needed. Additionally, your CSV file can’t have header rows so you need to ensure you define the header roles in the variable names as is seen in the screenshot below.

CSV Data Set Config

Last but not least, I needed to ensure the HTTP Request passes the appropriate headers. I added the HTTP Header Manager config element and added the Content-Type and Authorization header which contained a reference to the access token I obtained in the prior HTTP Request.

HTTP Header Manager for ChatCompletion


At that point I had a JMeter test that should work within the Azure Load Testing service. The next step was to deploy the Azure Load Testing Service.

Azure Load Testing Service Instance

Deployment of the Azure Load Testing service instance was pretty straightforward. There really aren’t a ton of options for the actual service instance. The key things to note are that the Azure Load Testing service instances use managed identities to pull secrets or certificates from Azure Key Vault. This meant that along with the Azure Load Testing instance, I needed to deploy a user-assigned managed identity (my preference over system-assigned managed identities), an Azure Key Vault instance, secrets in the Azure Key Vault for a service principal that would be used in my tests, and set some Azure RBAC role assignments. The managed identity needs at least the Azure Key Vault Secrets User RBAC role on the Azure Key Vault instance (yes you should be using RBAC authorization model instead of the old access policies at this point).

What I deployed is highlighted in blue in the image below. I’ll cover the virtual network piece in the next section.

Azure Load Testing Test

At this point I got my JMeter test, my sample ChatCompletions, and Azure Load Testing service instance. Now it’s time to create the test within the Azure Load Testing service.

Creation of tests are a data plane activity and the ability to touch the data plane with IaC is very limited so I opted to use CLI (which has its own problems as we’ll see). Before I deployed the test, I had to create my test configuration. With the service you can define your test configuration in YAML. My test included the code below:

version: v0.1
test_id: genai_gateway_test
displayName: "GenAI Gateway Load Test"
description: "This will load test a Generative AI Gateway by sending ChatCompletions"
testType: JMX
testPlan: ./genai_gateway_test.jmx
engineInstances: 1
configurationFiles:
  - './config/chat_completions.csv'
failureCriteria:
  - percentage(error) > 80
autoStop:
  errorPercentage: 80
  timeWindow: 60
env:
  - name: VIRTUAL_USERS
    value: 10
  - name: RAMP_UP
    value: 1
  - name: LOOP_COUNT
    value: 1
  - name: RESOURCE
    value: 'https://cognitiveservices.azure.com'
    # This is the fully-qualified domain name of your Generative AI Gateway
  - name: OPENAI_ENDPOINT
    value: mygenaigateway.company.com
  - name: OPENAI_DEPLOYMENT_NAME
    value: gpt-4o
  - name: OPENAI_API_VERSION
    value: 2024-04-01-preview
secrets:
    # These are the credentials of the service principal that will be used to make the calls to the Generative AI Gateway
  - name: TENANT_ID
    value: https://mykeyvault.vault.azure.net/secrets/tenantid/38a3b814339944348710b216014f5acd
  - name: CLIENT_ID
    value: https://mykeyvault.vault.azure.net/secrets/clientid/94df372a3530469ea6e4b30064d9dbdc
  - name: CLIENT_SECRET
    value: https://mykeyvault.vault.azure.net/secrets/clientsecret/f8612911116f42fe8c1b77c53ca1b8de
# This property does not seem to work as of 10/2024
keyVaultReferenceIdentity: /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/myrg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/myumi
subnetId: /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/myrg/providers/Microsoft.Network/virtualNetworks/myvnet/subnets/mysubnet
publicIPDisabled: true

Yeah, there’s a lot there. There are a few areas I want to highlight.

The first area is the secrets section. Here I included the Key Vault secret references to the service principal credentials I have sitting in the Azure Key Vault. The keyVaultReferenceIdentity is supposed to set the test to use the managed identity you specify (this didn’t work for me as we’ll see later).

The next area is the subnetId and publicIPDisabled fields. The Azure Load Testing service has the ability to run tests where packets originate from a subnet in your virtual network. This allows you to hit services behind Private Endpoints or on-premises. Given that my APIM instance is deployed in internal mode, that was a requirement for me. I also wanted to control egress traffic from the test engines injected into my subnet. This is where I set the publicIPDisabled field to True. This causes all traffic from the test engines to flow through your preferred network path. Unfortunately, this includes both data plane and management plane traffic. You’ll need to ensure you allow required flows out your Internet egress point.

You can reference documentation for the other fields, but most are descriptive enough that you’ll get the picture.

Now it’s time to deploy the test. You can do this with az cli using the az load test create command.

Post Test Deployment

Done right? Ready to run the test? Nope, not yet.

There were a few properties I set within the YAML config that didn’t seem to take. This might be because the az load test is a preview command, I’m not really sure. Either way, the properties I noticed that did not stick were the keyVaultReferenceIdentity and splitAllCSVs properties. I explained the keyVaultReferenceIdentity property above. The splitAllCSVs property will take the contents of the CSV with your ChatCompletion and will distribute them across multiple engines (if you have multiple engines). If you have a large scale test, this is likely something you may want to do.

To ensure the test can pull the secrets needed to authenticate to Entra ID from Azure Key Vault, I needed to manually set it to use the service’s managed identity because the keyVaultReferenceIdentity property did not seem to work. To do that I logged into the Azure Portal and selected the newly created test GenAI Gateway Load Test and selected to modify the configuration of the test.

Modify configuration of test

Under the parameters section towards the bottom, I was able to select the UMI I configured to be used by the Azure Load Testing service instance.

Set the identity to pull secrets from Key Vault

The other thing you can do with the Azure Load Testing service is pull metrics from supporting components (which the service refers to as server-side metrics). For this, I added the four AOAI instances I have sitting behind my APIM instance. I also needed to configure it to use the UMI associated with the service to pull the metrics (this UMI was granted permissions on the AOAI instances to pull the metrics in case I wanted to use any of them for metrics that drive how my test behaves).

Adding server-side metrics to the test

Once those changes were complete I was good to go. If I was using multiple engines (which I’m wasn’t) and I wanted to split the completions in my CSV across engines, I would have to had to manually set the option for that (another one that doesn’t seem to work in the YAML in my testing). This option is located in the Test Plan section of the test configuration under the Split CSV evenly between Test engines option..

At this point you can begin running your tests.

Summing it up

While it takes a bit of doing, getting the Azure Load Testing service up and running was pretty easy. Because I’m a nice guy, I’ve uploaded sample code for everything I’ve done to this repository. Clone it and make it your own.

There are a ton more options within the Azure Load Testing Service beyond what I went over here so get out there and explore it. A few things to be aware of:

  1. Remember that for consumption-based services like Azure OpenAI, load testing could get expensive if you scale up your test large enough. Be ready for those costs.
  2. If you end up using the VNet injection option for your testing like I did, ensure you have proper networking in place. The compute that runs in your subnet needs to be able to make TCP connections to your Generative AI Gateway. It also needs to be able to resolve the name, so make sure you have DNS properly configured.
  3. You can lock down your Key Vault with the service firewall and the usage of Private Endpoints. In my testing, the Azure Load Testing service looks to be communicating over the Microsoft public IP address so ensure you have Allow Trusted Services option checked.

Azure AI Studio – Chat Playground and API Management

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Hello again folks!

Today, I’m going to be posting my first post in a series on Azure AI Studio. I’ll let the true AI professionals give you the gory details and features of the service. The way my small brain thinks of the service is a platform built on top of AML (Azure Machine Learning) to make building applications that use Generative AI more developer-friendly. You can build and test applications, deploy third-party models, and organize applications into “projects” which can be secured to a specific project team but share resources across an organization via the concept of a hub. I’ll cover more on those pieces in a future blog post, but for today I want to focus on a pattern I was messing around that I think would be appealing to most folks.

One of the neat features of AI Studio is the Chat Playground. The Chat Playground is a web interface for interacting with models you have deployed to Azure AI Studio. You can send prompts and receive completions, adjust parameters such as temperature, and even get a code sample of the code being run by the web interface. The models that can e deployed include OpenAI models deployed to an AOAI (Azure OpenAI Service) instance or third-party models like Meta’s Llama deployed to a serverless endpoint or self-managed compute (in AML called managed online endpoint). For the purposes of this post I’m going to be focusing on OpenAI models deployed to an AOAI instance.

Azure AI Studio Chat Playground

You’re probably looking at this and thinking, “Yeah that is cool… a similar functionality exists in Azure OpenAI Studio and it does the same thing.” That’s correct it does, but for many organizations using the Azure OpenAI Studio’s Chat Playground isn’t an option for a number of different reasons both operational and security-related.

From an operational perspective, the Azure OpenAI Studio’s Chat Playground is designed to communicate directly with the endpoint for an AOAI instance. As I’ve covered in previous posts, this can be problematic. One reason is you’re limited to the quota within the instance which could cause you to hit limits quickly if you direct a whole ton of users to it. Typically, you will load balance across multiple instances deployed to multiple regions across multiple subscriptions as I discuss in my post on load balancing AOAI. The other problem is dealing with internal chargebacks. If I have multiple BUs (business units) hammering away at an instance, I don’t have any easy to determine who which folks in what BU consumed what. While metrics are token usage are captured in the metrics streamed from an instance, there is no way to associate that usage with an individual.

On the security side, communicating directly with the AOAI instance means I can’t review the prompts and responses being sent and received by the service. Many regulated organizations have requirements for these to be captured for review to ensure the service is being used appropriately and sensitive data isn’t being sent that hasn’t been approved to be sent. Additionally, availability of the AOAI instance could be affected by one user going nuts and consuming the full quota.

The challenges outlined above have driven many customers to insert a control point. The industry seems determined to coin this architectural component a Gen AI Gateway so I’ll play along. For you fellow old folks, all a Gen AI Gateway really is an API Gateway with some Gen AI-related features slapped on top of it. It sits between the front-facing user application and the models processing the prompts and responses. The GenAI-specific features available within the gateway help to address the operational and security challenges I’ve outlined above. If you’re curious about the specifics on this, you can check out my post on load balancing, logging, tracking token usage, rate limiting, and extracting useful information from the conversation such as prompts and responses.

Example design and process flow of a Gen AI Gateway

In the image above I’ve included an example of how APIM (Azure API Management) could be used to provide such functionality. Within the customer base I work with at Microsoft, many customers have built something that functions similar to what you see above. A design like this helps to address the operational and security challenges I’ve outlined above.

Wonderful right? Now what the **** does this have to do with AI Studio’s Chat Playground? Well, unlike the Azure OpenAI Studio’s Chat Playground, AI Studio’s offering does support modifying the endpoint to point to your generative AI gateway. How you do this isn’t super intuitive, but it does work. Whether you go this route is totally up to you. Ok, disclaimer is done, let’s talk about how you do this.

One thing to understand about using AI Studio’s Chat Playground is it works the same way that Azure OpenAI Studio’s version works in regards to where the TCP connections are sourced from when making calls to the model. As can be seen in the Fiddler capture below, the TCP connections made when you submit a prompt from the Chat Playground are sourced from the user’s endpoint.

Fiddler capture showing Chat Completion coming from user endpoint

This makes our life much easier because we likely control the path that user’s packet takes and the DNS the user uses which means we can direct that user’s packet to a Gen AI Gateway. For the purposes of this post, my goal is to funnel these prompts and completions through an APIM instance I have in place which has some APIM policy snippets that do some checks and balances and a call a small app (based off an awesome solution assembled by my buddy Shaun Callighan) which logs prompts and responses and calculates token metrics. The data processed by the app are then sent to an Event Hub, processed by Stream Analytics, and dumped into CosmosDB.

APIM between Chat Playground and AOAI

When you want to connect to an AOAI instance from AI Studio’s Chat Playground you add it as a connection. These connections can created at the hub level (think of this as a logical container for the projects) and then shared across projects. When adding the connection you can browse for the instance you want to connect to or enter manually.

Adding a connection to an AOAI instance

If you were to do that you won’t be able to create a deployment of a model or access a deployment of a model deployed in the instances behind it. This is because AI Studio is making calls to the Azure management plane to enumerate the deployments within the instance. Since there isn’t an AOAI with the hostname of your AOAI instance, you’ll be unable to add deployments or pick a deployment from the Chat Playground.

To work around this, you need to add a connection to one of your AOAI instances. This will be your “stub” instance that we’ll modify the endpoint of to point to API Management. If you’re load balancing across multiple AOAI instances behind APIM, you need to ensure that you’ve already created your model deployments and you’ve named them consistently across all of the AOAI instances you’re load balancing to. In the image below, I modify the endpoint to point to my APIM instance. The azure-openai-log-helper path is added to send it to a specific API I have setup on APIM that handles logging. For your environment, you’ll likely just need the hostname.

Modifying the endpoint name

Now before you go running and trying to use the Chat Playground, you’ll have to make a change to the APIM policy. Since the user’s browser is being told to make the call to this endpoint from a different domain (AI Studio’s domain) we need to ensure there is a CORS policy in place on the APIM instance to allow for this, otherwise it will be blocked by APIM. If you forget about this policy you’ll get a back a 200 from the APIM instance but nothing will be in the response.

Your CORS policy could look like the below:

        <cors>
            <allowed-origins>
                <origin>https://ai.azure.com/</origin>
                <origin>https://ai.azure.com</origin>
            </allowed-origins>
            <allowed-methods preflight-result-max-age="300">
                <method>POST</method>
                <method>OPTIONS</method>
            </allowed-methods>
            <allowed-headers>
                <header>authorization</header>
                <header>content-type</header>
                <header>request-id</header>
                <header>traceparent</header>
                <header>x-ms-client-request-id</header>
                <header>x-ms-useragent</header>
            </allowed-headers>
        </cors>

Once you’ve modified your APIM policy with the CORS update, you’ll be good to go! Your requests will now flow through APIM for all the GenAI Gateway goodness.

Chat Completion from AI Studio Chat Playground flowing through APIM

When messing with this I ran into a few things I want to call out:

  1. Do not forget the CORS policy. If you run into a 200 response from APIM with no content, it’s probably the CORS snippet.
  2. If you have a validate-jwt snippet in your APIM policy that includes validating the claim includes cognitivesservices, remove that. The claim passed by AI Studio includes a trailing forward slash which won’t likely match what you get back if you’re using the MSAL library in code. You could certainly include some logic to handle it, but honestly the security benefit is so little from checking the claim just make it easy on yourself and remove the check for the claim. Keep the check that validate-jwt snippet but restrict it to checking the tenant ID in the token.
  3. Chat Playground will pass the content property as the prompt as an array (this is the more modern approach to allow for multi-modal models like GPT-4o which can handle images and audio). If you have an APIM policy in place to parse the request body and extract information you’ll need to update it to also handle when content is passed as an array.
  4. Chat Playground allows for the user to submit an image along with text in the prompt. Ensure your APIM policy is capable of handling prompts like that. Dealing with human users being able to submit images to an LLM and ensuring you’re reviewing that image for DLP and calculating token consumption for streaming Chat Completion is a whole other blog topic that I’m not going to do today. Key thing is you want to account for that. Block images or ensure your policy is capable of handling it if you’re deploying 4o or 4 Vision.

Well folks that sums up this post. I realize this solution is a bit funky, and I’m not gonna tell you to use it. I’m simply putting it out there as an option if you have a business need strong enough to provide a ChatGPT-style solution but don’t have the bandwidth or time to whip up your own application.

Enjoy!

Azure OpenAI Service – Tracking Token Usage with APIM

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Yeah, yeah, yeah, I missed posting in July. I have been appropriately shamed on a daily basis by WordPress reminders.

I’m going to make up for it today by covering another of the “Generative AI Gateway” features of APIM (Azure API Management) that were announced a few months back. I’ve already covered the circuit breaker and load balancing and the token-based rate limiting features. These two features have made it far easier to distribute and control the usage of the AOAI (Azure OpenAI Service) that is being offered as a core enterprise service. One of the challenges that isn’t addressed by those features is charge backs.

As I’ve covered in prior posts, you can get away with an instance or two of AOAI dedicated to an app when you have one or two applications at the POC (proof-of-concept) stage. Capacity and charge back isn’t an issue in that model. However, your volume of applications will grow as well as the capacity of tokens and requests those applications require as they move to production. This necessitates AOAI being offered as a core foundational service as basic as DNS or networking. The patterns for doing this involve centrally distributing requests across several instances of AOAI spread across different regions and subscriptions using a feature like the circuit breaker and load balancing features of APIM. Once you have several applications drawing from a common pool, you then need to control how much each of those applications can consume using a feature like the token-based rate limiting feature of APIM.

Common way to scale AOAI service

Wonderful! You’ve built a service that has significant capacity and can service your BUs from a central endpoint. Very cool, but how are you gonna determine who is consuming what volume?

You may think, “That information is returned in the response. I can have the developers use a common code snippet to send that information for each response to a central database where I can track it.” Yeah nah, that ain’t gonna work. First, you ain’t ever gonna get that level of consistency across your enterprise (if you do have this, drop me an email because I want to work there). Second, as of today, the APIs do not return the number of tokens used for streaming based chat completions which will be a large majority of what is being sent to the models.

I know you, and you’re determined. You follow-up with, “Well Matt, I’m simply going to pull the native metrics from each of the AOAI instances I’m load balancing to.” Well yeah, you could do that but guess what? Those only show you the total consumed across the instance and do not provide a dimension for you to determine how much of that total was related to a specific application.

Native metrics and its dimensions for an instance of AOAI

“Well Matt, I’m going to configure diagnostic logging for each of my AOAI instances and check off the Request and Response Logs. Surely that information will be in there!”. You don’t quit do you? Let me shatter your hopes yet again, no that will not work. As I’ve covered in a prior post while the logs do contain the Entra ID object ID (assuming you used Entra ID-based authentication) you won’t find any token counts in those logs either.

AOAI Request and Response Logs

Well fine then, you’re going to use a custom logging solution to capture token usage when it’s returned by the API and calculate it when it isn’t. While yes this does work and does provide a number of additional benefits beyond information for charge backs (and I’m a fan of this pattern) it takes some custom code development and some APIM policy snippet expertise. What if there was an easier way?

That is where the token metrics feature of APIM really shines. This feature allows you to configure APIM to emit a custom metric for the tokens consumed by a Completion, Chat Completion (EVEN STREAMING!!), or Embeddings API call to an AOAI backend with a very basic APIM Policy snippet. You can even add custom dimensions and that is where this feature gets really powerful.

The first step in setting this up is to spin up an instance of Application Insights (if your APIM isn’t already hooked into one) and a Log Analytics Workspace the Application Insights instance will be associated with. Once your App Insights instance is created, you need to modify the settings API in APIM you’ve defined for AOAI and turn on the App Insights integration and enable custom metrics as seen below.

Enable custom metrics in APIM

Next up, you need to modify your APIM policy. In the APIM Policy snippet below I extra a few pieces of data from the request and add them as dimensions to the custom metric. Here I’m extracting the Entra ID app id of security principal accessing the AOAI service (this would be the application’s identity if you’re using Entra ID authentication to the AOAI service) and the model deployment name being called from AOAI which I’ve standardized to be the same as the model name.

         <!-- Extract the application id from the Entra ID access token -->

        <set-variable name="appId" value="@(context.Request.Headers.GetValueOrDefault("Authorization",string.Empty).Split(' ').Last().AsJwt().Claims.GetValueOrDefault("appid", string.Empty))" />

        <!-- Extract the model name from the URL -->

        <set-variable name="uriPath" value="@(context.Request.OriginalUrl.Path)" />
        <set-variable name="deploymentName" value="@(System.Text.RegularExpressions.Regex.Match((string)context.Variables["uriPath"], "/deployments/([^/]+)").Groups[1].Value)" />

        <!-- Emit token metrics to Application Insights -->

        <azure-openai-emit-token-metric namespace="openai-metrics">
            <dimension name="model" value="@(context.Variables.GetValueOrDefault<string>("deploymentName","None"))" />
            <dimension name="client_ip" value="@(context.Request.IpAddress)" />
            <dimension name="appId" value="@(context.Variables.GetValueOrDefault<string>("appId","00000000-0000-0000-0000-000000000000"))" />
        </azure-openai-emit-token-metric>

After making a few calls from my code to APIM, the metrics begin to populate in the App Insights instance. To view those metrics you’ll want to go into the App Insights blade and go to the Monitoring -> Metrics section. Under the Metrics Namespace drop down you’ll see the namespace you’ve created in the policy snippet. I named mine openai-metrics.

Accessing custom metrics in App Insights for token metrics

I can now select metrics based on prompt tokens, completion tokens, and total tokens consumed. Here I select the completion tokens and split the data by the appId, client IP address, and model to give me a view of how many tokens each app is consuming and of which model at any given time span.

Metrics split by dimensions

Very cool right?

As of today, there are some key limitations to be aware of:

  1. Only Chat Completions, Completions, and Embedding API operations are supported today.
  2. Each API operation is further limited by which models it supports. For example, as of August 2024, Chat Completions only supports gpt-3.5 and gpt-4. No 4o support yet unfortunately.
  3. If you’re using a load balanced pool backend, you can’t yet use the actual backend the pool send the request to as a dimension.

Well folks, hopefully this helps you better understand why this functionality was added and the value it provides. While you could do this with another API Gateway (pick your favorite), it likely won’t be as simple as it it with APIM’s policy snippet. Another win for cloud native I guess!

Thanks!