Azure OpenAI Service – Controlling Outbound Access

Hello again folks! Work has been complete insanity and has been preventing me from posting as of late. Finally, I was able to carve out some time to get a quick blog post done that I’ve been sitting on for a while.

I have blogged extensively on the Azure OpenAI Service (AOAI) and today I will continue with another post in that series. These days you’re more likely consuming AOAI through the new Azure AI Services resource rather than the standalone AOAI resource. The good news is the guidance I provide tonight is relevant to both resources.

One topic I often get asked to discuss with customers is the “old person” aspects of the service, such as the security controls Microsoft makes available to customers using AOAI. Besides the identity-based controls, the most common security controls that come up in the regulated space are the networking controls. While inbound network controls exercised through Private Endpoints or the service firewall (sometimes called the IP firewall) are common, one of the most frequently missed controls within the AOAI service is the outbound network controls. Oddly enough, outbound controls are often overlooked across non-compute PaaS in general.

You may now be asking yourself, “Why the heck would the AOAI service need to make its own outbound network connections and why should I care about it?”. Excellent question, and honestly, not one I thought about much when I first started working with the service; the use cases I’m going to discuss didn’t come up back then because the feature set either didn’t exist or wasn’t commonly used. There are two main use cases I’m going to cover (note there are likely others and these are simply the most common):

The first use case is easily the most common right now. The “chat with your data” feature within AOAI allows you to pass some extra information in your ChatCompletion API call that instructs the AOAI service to query an AI Search index you have populated with data from your environment, extending the model’s knowledge base without completely retraining it. This is essentially a simple way to muck with a retrieval augmented generation (RAG) pattern without having to write the code to orchestrate the interaction between the two services such as detailed in the link above. Instead, the “chat with your data” feature handles the heavy lifting for you if you have a populated index and are willing to add a few additional lines of code. In a future article I’ll go into more depth on this pattern because the complete network and identity interaction is pretty interesting and often misconfigured. For now, understand the basics of it with the flow diagram below. Here also is some sample code if you want to play around with it yourself.
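If you want a feel for what those extra lines look like, below is a minimal sketch using the openai Python SDK. Everything here is a placeholder: the environment variable names, the endpoint and index values, and even the exact data_sources schema can vary by API version, so treat it as a starting point rather than the definitive way to wire this up.

import os
from openai import AzureOpenAI

# All environment variable names below are placeholders for this sketch
client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-02-01"  # use a version that supports the "chat with your data" feature
)

response = client.chat.completions.create(
    model=os.getenv("LLM_DEPLOYMENT_NAME"),
    messages=[{"role": "user", "content": "What does our travel policy say about rental cars?"}],
    # The data_sources entry is what causes the AOAI service to make an outbound
    # call to your AI Search instance on your behalf
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": os.getenv("AI_SEARCH_ENDPOINT"),
                    "index_name": os.getenv("AI_SEARCH_INDEX"),
                    "authentication": {"type": "system_assigned_managed_identity"}
                }
            }
        ]
    },
    max_tokens=200
)

print(response.choices[0].message.content)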

The second use case is when using a multimodal model like GPT-4 or GPT-4o. These models allow you to pass other types of data besides text, such as images and audio. When requesting that an image be analyzed, you have the option of passing the image as base64-encoded data or as a URL. If you pass a URL, the AOAI service will make an outbound network connection to the endpoint specified in the URL to retrieve the image for analysis, as shown below.

response = client.chat.completions.create(
    # Model must be a multimodal model
    model=os.getenv('LLM_DEPLOYMENT_NAME'),
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the image"
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "{{SOME_PUBLIC_URL}}"}
                }
            ]
        }
    ],
    max_tokens=100
)
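If you want to avoid that outbound call entirely, the alternative mentioned above is to read the image locally and pass it base64-encoded as a data URL. Here is a minimal sketch, reusing the same client, with the file path as a placeholder:

import base64

# Read a local image and embed it directly in the request, so the AOAI
# service never has to reach out to fetch it
with open("image.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model=os.getenv('LLM_DEPLOYMENT_NAME'),
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}
                }
            ]
        }
    ],
    max_tokens=100
)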

In both of these scenarios the AOAI service establishes a network connection from the Microsoft public backbone to the resource (such as AI Search in scenario 1 or a public blob in scenario 2). Unlike compute-based PaaS (App Services, Functions, Azure Container Apps, AKS, etc.), Microsoft does not today provide a means for you to send this outbound traffic from AOAI through your virtual network with virtual network injection or virtual network integration. Given that you can’t pass this traffic through your virtual network, how can you mitigate potential data exfiltration risks or poisoning attacks? For example, let’s say an attacker compromises an application’s code and modifies it such that the “chat with your data” feature uses an attacker-controlled instance of AI Search to capture sensitive data in the queries or to poison the responses back to the user with bad data. Maybe an attacker decides to use your AOAI instances to process images stolen from another company and placed on a public endpoint. I’m sure someone more creative could come up with a plethora of attacks. Either way, you want to control what your resources communicate with. The positive news is there is a way to do this today, and likely an even better way to do it tomorrow, when it comes to the AOAI service.

The AOAI (and AI Services) resources fall under the Cognitive Services framework. The benefit of being within the scope of this framework is that they inherit some of its security capabilities, such as support for Private Endpoints and the ability to disable local key-based authentication. Another capability available to AOAI is an outbound network control. On an AOAI or AI Services resource, you can configure two properties to lock down the service’s ability to make outbound network calls. These two properties are:

  • restrictOutboundNetworkAccess – A boolean; when set to true, outbound access is blocked to everything except the exceptions listed in the allowedFqdnList property
  • allowedFqdnList – A list of FQDNs the service is allowed to communicate with for outbound network calls

Using these two controls you can prevent your AOAI or AI Services resource from making outbound network calls except to the list of FQDNs you include. For example, you might whitelist your AI Search instance FQDN for the “chat with your data” feature or your blob storage account endpoint for image analysis. This is a feature I’d highly recommend you enable by default on any new AOAI or AI Service you provision moving forward.

The good news for those of you in the Terraform world is this feature is available directly within the azurerm provider as seen in a sample template below.

resource "azurerm_cognitive_account" "openai" {
  name                = "${local.openai_name}${var.purpose}${var.location_code}${var.random_string}"
  location            = var.location
  resource_group_name = var.resource_group_name
  kind                = "OpenAI"

  custom_subdomain_name = "${local.openai_name}${var.purpose}${var.location_code}${var.random_string}"
  sku_name              = "S0"

  public_network_access_enabled = var.public_network_access
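  # These two arguments map to the restrictOutboundNetworkAccess and allowedFqdnList properties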
  outbound_network_access_restricted = true
  fqdns = var.allowed_fqdn_list

  network_acls {
    default_action = "Deny"
    ip_rules = var.allowed_ips
    bypass = "AzureServices"
  }

  identity {
    type = "SystemAssigned"
  }

  tags = var.tags

  lifecycle {
    ignore_changes = [
      tags["created_date"],
      tags["created_by"]
    ]
  }
}

If a user attempts to circumvent these controls, they will receive a descriptive error stating that outbound access is restricted. For those of you operating in a regulated environment, you should be slapping this setting on every new AOAI or AI Services instance you provision, just like you’re controlling inbound access with a Private Endpoint.

Alright folks, that sums up this quick blog post. Let me summarize the lessons learned:

  1. Be aware of which PaaS Services in Azure are capable of establishing outbound network connectivity and explore the controls available to you to restrict it.
  2. For AOAI and AI Services use the restrictOutboundNetworkAccess and allowedFqdnList properties to block outbound network calls except to the endpoints you specify.
  3. Make controlling outbound access of your PaaS a baseline control. Don’t just focus on inbound, focus on both.

Before I close out, recall that I mentioned a way today (the above) and a way in the future. The latter is the new feature Microsoft recently announced in public preview: Network Security Perimeters. As that feature matures and its list of supported services expands, controlling inbound and outbound network access for PaaS (and getting visibility into it via logs) is going to get far easier. Expect a blog post on that feature in the near future.

Thanks folks!

Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Hello again!

Today I’m back with another post focusing on AOAI (Azure OpenAI Service). My focus falls into two buckets: operations and security. For this post I’m going to cover a topic that falls into the operations bucket.

Last year I covered some of the challenges that arise when tracking token usage for streaming-based ChatCompletions. The challenges center around logging the prompt, the response, and the token usage. The guidance I provided in that prior post is unchanged for logging prompts and completions, but capturing token usage has gotten much easier. Before I dig into the details, I want to very briefly cover why you should care about tracking token usage.

The whole “AI is the new electricity” statement isn’t all hype. Your business units are going to want to experiment with it, especially generative AI, to optimize business processes such as shaving time off how long it takes a call center rep to resolve a customer’s problem, or automating a portion of what is now a manual, limited value-add activity of highly paid employees to free them up to focus on activities that drive more business value. As an organization, you’re going to be charged with providing these services to the developers, data scientists, and AI engineers. The demand will be significant, and you have to figure out a scalable way to provide these services while satisfying security, performance, and availability requirements.

This will typically drive an architecture where capacity for generative AI is pooled and distributed to your business units as a core service. To act as a control point ensuring security, availability, and performance requirements can be met, the architectural concept of a Generative AI Gateway is introduced. This component usually translates to Azure API Management, a third-party API gateway, or a custom-developed solution with generative AI-specific functionality layered on top (load balancing, rate limiting based on token usage, token usage tracking, prompt and response logging, caching of prompts and responses to reduce costs and latency, etc.).

In Azure you might see a design like the image below where you’re distributing requests across multiple AOAI instances spread across regions, geo-political boundaries, and subscriptions in order to maximize your quota (number of requests and tokens per model). When you have this type of architecture, it’s important to get visibility into the token usage of each application for chargebacks and to ensure everyone is getting their fair share of the capacity (i.e. rate limiting).

Example high-level architecture using AOAI

Now let’s align the token usage back to streaming ChatCompletions. With a non-streaming ChatCompletion the API automatically returns the number of prompt tokens, completion tokens, and total tokens that were consumed with the request. This information is easy to intercept at the Generative AI Gateway to use as an input for rate limiting or to pass on to some reporting system for charge backs on token usage.

Non-streaming ChatCompletion returning usage
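If you’re calling the API directly with the openai Python SDK, pulling those counts out looks roughly like this. The deployment name is a placeholder and the client is assumed to already be configured against your AOAI instance:

# Non-streaming call: the SDK returns the usage alongside the completion
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder deployment name
    messages=[{"role": "user", "content": "Summarize our return policy in one sentence."}],
    max_tokens=100
)

print(response.choices[0].message.content)
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")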

When performing a streaming ChatCompletion, the completion is returned as a series of server-sent events (or chunks). In my understanding and experience, usage statistics were historically not provided in the streaming response from the AOAI service. This forced the application developer or the owner of the Generative AI Gateway to incorporate custom code using a tokenizer like tiktoken to manually calculate the total number of tokens. An example of such a solution developed by one of my wonderful peers Shaun Callighan can be found here. This was one of the only approaches (maybe the only?) to the problem at the time, but it sometimes produced slightly skewed results, with the tokenizer’s estimate differing from the actual numbers processed by the AOAI service and billed to the customer.

Streaming ChatCompletion chunks of responses
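To make the old approach concrete, here’s a minimal sketch of the kind of estimation code a gateway or application had to run. It assumes the cl100k_base encoding used by the GPT-3.5/GPT-4 family, and it only approximates what the service actually bills, which is exactly the discrepancy described above:

import tiktoken

# Client-side token estimate for a prompt and a reassembled streamed completion.
# This is an approximation; the counts billed by the service can differ slightly.
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize our return policy in one sentence."
streamed_completion = "Items can be returned within 30 days with a receipt."

prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(streamed_completion))

print(f"Estimated prompt tokens: {prompt_tokens}")
print(f"Estimated completion tokens: {completion_tokens}")
print(f"Estimated total tokens: {prompt_tokens + completion_tokens}")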



Microsoft has made this easier with the introduction of the azure-openai-emit-token-metrics policy snippet for APIM (Azure API Management), which can emit token usage for both streaming and non-streaming completions (among other operations) to an App Insights instance. I talk through this at length in this post. However, at this time it’s supported for a limited set of models, and not every customer uses APIM. Those customers have had to address the problem using a custom solution like the one I mentioned earlier.

Earlier this week I was mucking around with a simplistic ChatBot I’m building (FYI, Streamlit is an amazing framework to help build GUIs if you’re terrible at frontend design like I am) and I came across an additional parameter that can be passed when making a streaming ChatCompletion. The parameter is called stream_options, and it will provide the token usage of the ChatCompletion in the second-to-last chunk delivered back to the client. I’m not sure when this was introduced or how I missed it, but it removes the need to calculate this yourself with a tokenizer.

response = client.chat.completions.create(
    model=deployment_name,
    messages=[
        {"role": "user",
         "content": message}
    ],
    max_tokens=200,
    stream=True,
    stream_options={
        "include_usage": True
    }
)

Below you’ll see a sample response from a streaming ChatCompletion when including the stream_options parameter. In the chunk before the final chunk (there is a final chunk not visible in this image), the usage statistics are provided and can be extracted.

This provides a much better option than trying to calculate this yourself. I tested this with 3.5-turbo and 4o (both with text and images) and it gave me back the token usage as expected (I’m using API version 2024-02-01). I threw together some very simple code (and if it’s coming from me it’s likely gonna be simple because my coding skills leave a lot to be desired) to capture these metrics and return them as part of the completion.

import streamlit as st

# Class to support completion and token usage
class ChatMessage:
    def __init__(self, full_response, prompt_tokens, completion_tokens, total_tokens):
        self.full_response = full_response
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens
        self.total_tokens = total_tokens

# Streaming chat completion (client is assumed to be an already-configured AzureOpenAI client)
def get_streaming_chat_completion(client, deployment_name, messages, max_tokens):
    response = client.chat.completions.create(
        model=deployment_name,
        messages=messages,
        max_tokens=max_tokens,
        stream=True,
        stream_options={
            "include_usage": True
        }
    )
    assistant_message = st.chat_message("assistant")
    full_response = ""

    with assistant_message:
        message_placeholder = st.empty()

    # Initialize token counts
    t_tokens = 0
    c_tokens = 0
    p_tokens = 0
    usage_dict = None

    for chunk in response:
        # The usage statistics arrive in a chunk near the end of the stream
        if chunk.usage:
            usage_dict = chunk.usage
            if p_tokens == 0:
                p_tokens = usage_dict.prompt_tokens
                c_tokens = usage_dict.completion_tokens
                t_tokens = usage_dict.total_tokens

        # Content deltas arrive in the earlier chunks
        if hasattr(chunk, 'choices') and chunk.choices:
            content = chunk.choices[0].delta.content
            if content is not None:
                full_response += content
                message_placeholder.markdown(full_response)

    if full_response == "":
        full_response = "Sorry, I was unable to generate a response."

    return ChatMessage(full_response, p_tokens, c_tokens, t_tokens)
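As a quick usage sketch, a call site in a Streamlit app might look something like the below; the deployment name and message content are placeholders:

# Hypothetical call site inside the Streamlit app
messages = [{"role": "user", "content": "Explain Private Endpoints in two sentences."}]
chat_message = get_streaming_chat_completion(client, "gpt-4o", messages, max_tokens=500)

# Surface the token counts that came back with the completion
st.caption(
    f"Prompt tokens: {chat_message.prompt_tokens} | "
    f"Completion tokens: {chat_message.completion_tokens} | "
    f"Total tokens: {chat_message.total_tokens}"
)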

For those of you using APIM as a Generative AI Gateway, you won’t have to worry about this for most of the OpenAI models offered through AOAI because the policy snippet I mentioned earlier will be improved to support additional models beyond what it supports today. For those of you using third-party gateways, this is likely relevant and may help simplify your code and eliminate the discrepancies between token usage you calculate yourself and what you see displayed within the AOAI instance.

Well folks, this post was short and sweet. Hopefully this small tidbit of information helps a few folks out there who were going the tokenizer route. Any simplification these days is welcome!