Generative AI in Azure for the Generalist – Prompt and Response Logging with API Management

Hello folks!

The rate of change in tech is the most crazy I’ve experienced in my career. What you knew yesterday is quickly replaced with major changes a week or two later. The generative AI space is one of those areas that seems to change on a daily basis, and with these changes comes updated and new patterns and products. Given some major changes over the past few months, I’ve decided to kick off a new blog series that will cover generative AI in Azure for the generalist. The focus will be on folks like myself that sit squarely in the generalist vertical. In this series I’ll cover new topics as well as revisiting topics I’ve covered in the past and how they have changed.

In spirit of that latter point, tonight I’ll be covering an AWESOME new feature in Azure API Management (APIM).

The Background

I’ve talked pretty extensively about APIM’s role in the generative AI space where it provides the features and functionality of the architectural component of a Generative AI Gateway (GenAI Gateway). So what is a GenAI Gateway? Well, you see, someone at Forrester/Gartner needed to create a new phrase that vendors could adopt and sell existing products under, they had a pitch meeting, and yadda yadda yadda. But seriously, in its most simple sense a GenAI Gateway is essentially an API Gateway with additional functionality and features specific to the challenges of doing Generative AI at scale. These challenges can include fine-grained authorization, rate limiting, usage tracking, load balancing, caching, additional logging and monitoring and more.

Common GenAI Gateway functionality

Cloud providers jumped at the chance to add this functionality to their existing native API Gateway products. Microsoft began integrating this functionality into APIM first with load balancing, then with throttling based upon token usage and token tracking for charge backs and sharing model quota across an enterprise, and semantic caching for cost reduction and improved response times. One of the areas that was somewhat of a gap was prompt and response logging.

Back in 2023 I wrote an article about the challenges of prompt and response logging when using a generative AI gateway pattern, and specifically some of the challenges around when APIM was used as the gateway. The history of how folks tried to tackle the issue is pretty interesting context to understand how we ended up where we were.

Before I jump into that history, it’s worth understanding why you should care about prompt and response logging. Those cares are typically grouped in two buckets:

  1. Operational
  2. Security

In the operational bucket we care about these things because they provide great insight into how our users are using these tools to identify commonly asked questions. For example, if we see a question pop up a lot, maybe it’s something we need to add to a user-facing FAQ. Or perhaps we build a workflow into our app that checks commonly asked questions and provides an answer before we call an LLM in order to save some costs and time. There are many creative uses to having these things saved and available.

In the security bucket we care because we want to ensure the LLMs are used responsibly. We don’t want people abusing the LLMs and getting instructions on how to malicious things and we also want to monitor them to ensure we don’t see odd behavior that might be indicative of an attacker who may have compromised a chat bot. Lastly, we capture this because it’s only a matter of a time before some government somewhere in the world pushes legislation that requires us to. It’s coming folks.

Now let’s talk the history of how folks tried to solve this problem.

First, we tried logging requests and responses to Application Insights using the built-in integration with APIM. This worked great until the max tokens for prompts grew too large such that requests and responses started getting truncated. Next, we tried using APIM’s integration with Event Hub (logger) in combination with complex custom APIM policy to parse the request and response, extract the prompt and completion, and deliver to an Event Hub for it to get picked up by some type of automated function and stored in some type of backend data store like a CosmosDB. This worked for a short time where folks were largely experimenting with how the LLMs (large language models) worked with their data but started to fall apart when these LLMs were baked into a chat bot handed out to users (they were also a nightmare to maintain due to frequent API changes to the structure of requests and responses). The reason for this is chat bots demand streaming based completions which deliver the tokens as they generated (which seems more human like) vs the user waiting for the entire completion to be generated. APIM would end up buffering the response and breaking the user experience. To solve this problem, folks were introducing custom code to do the parsing outside of APIM (such as this creative solution by my peer Shaun Callighan). Writing custom code, running it somewhere, and integrating it into APIM was a tough pill to swallow. Most of my customer base either accepted prompt and response logging would be dependent on the developer baking it into their application or they would simply accept not getting that information for the time being.

What’s New

Kind of a shitty situation to be in, right? Well, I have good news for you. Last week the APIM Product Group (PG) released a stellar new feature to support prompt and response logging (both streaming and non-streaming) with a few clicks of the mouse (or slight modifications of code). This morning I had a chance to muck around with it and I wanted to get out this quick article to share with folks the basics of setting this up and provide a bit of detail into how it works (I’ll be updating post this as I experiment more).

The feature is now available directly as an additional log emitted by the APIM instance via the diagnostic settings. This means you can stream these logs to a Log Analytics Workspace, Azure Storage Account, or on to an Event Hub where you send it to any place your heart desires.

Setup

Setting this feature up is pretty cake and requires only a few steps to get it done.

First up you’ll need to enable the additional log in diagnostic settings as seen below.

New diagnostic setting

Once the new diagnostic setting is enabled, you then need to enable it for your API that represents your instance of the an Azure OpenAI resource(s) or Azure AI Foundry (FKA Azure AI Service) instance hosting your LLMs.

Enabling feature in API

Once the feature is enabled in both places, the events should begin to get captured in around 15 or so minutes. I chose to send mine to a Log Analytics Workspace and had a new table named ApiManagementGatewayLlmLogs appear (took about 15 minutes to finally appear) which contains events related to my operations against the LLMs. Each log entry represents a 32KB chunk of the request and response for up to 2MB. The SequenceNumber field is used to denote the order of the chunks as seen in the image below with the CorrelationId field requesting the unique identifier for each request and response.

Expanding an event gives you the ability to review the prompt and response in full detail. This particular request spanned three separate events (sequence 0-2) with the first sequence (0) containing the prompt, completion, and total tokens and the second sequence (1) including the prompt and last sequence (2) containing the model’s response.

Example prompt and response logging event

I tested with both a multi-modal model (gpt-4o) and a reasoning model (o1) and both sets of events were captured. I haven’t seen an authoritative list for which models are supported, but when I do I’ll update this post with a link.

I’m also waiting to hear back from the PG as to how APIM determines it’s a call to an LLM. My guess is by operation name, but waiting on that response as well. I haven’t tested other operations such as creating embeddings yet, so if you do, feel free to reply to this post with your findings. If I’m able to get a full list of operations supported by this logging, I’ll update the post.

Wrapping It Up

That about sums up this quick post. My main goal here was to publicize this new feature because it’s a real game changer for APIM and addresses a major pain point of Generative AI Gateways in general. It’s been really cool to see this from the beginning, and I’m not sure about other folks, but I love understanding the journey a technology takes, the new problems that pop up, and the solutions that solve those problems. It really helps give context to why the solution looks the way it does.

See you next post!

Simple Patterns for Chatting with Your Data – Using the Microsoft public backbone

Hello again folks! Recently, I’ve been working with some far more intelligent peers (such as my buddy Jose Medina Gomez, you should definitely check out his repos because he has some awesome stuff there) on getting some new-to-Azure customers up and running in the GenAI (generative AI) space. Specifically, these customers had some custom data they wanted to use the LLMs (large language models) to reason on and answer questions about. This called for using a RAG (retrieval-augmented generation) pattern to provide an LLM access to an external knowledge base. I thought it would be helpful to other folks out there like myself that are new to this world to document some simple patterns to do this type of stuff that keep security in mind. I’ll cover these over a few posts with this being the first.

The Pattern

The first pattern I want to cover is what I call the “Microsoft public backbone” pattern. This pattern is ideal for customers to minimal to no Azure presence who need something up and running quickly with some basic security guardrails. The pattern looks like what you see below:

Microsoft-backbone Pattern

The key benefits of this pattern are:

  • All traffic between Microsoft PaaS (platform-as-a-service) services flows over the Microsoft public backbone and the organization’s application communicates with the services over the Microsoft public backbone.
  • All Microsoft PaaS services use the built-in service firewall to control inbound traffic.
  • Microsoft PaaS services that support outbound network controls use those controls to mitigate the risk of data exfiltration.
  • Authentication between each component uses Entra ID based authentication and Azure RBAC authorization.
  • Minimizes costs by choosing more affordable SKUs where possible.
  • Captures logs where available.

What I like about this pattern is it is super simple to get up and running (can take less than one hour) and provides decent security controls with minimal headache. It’s by no means production-ready pattern for reasons I’ll discuss further in this post, but for a quick proof-of-concept or for getting your feet wet with RAG-like patterns, this is a great choice.

I’ll now spend a few moments providing detail to each of the benefits I outlined above.

The Benefits and Considerations

Benefit 1: All traffic between Microsoft PaaS (platform-as-a-service) services flows over the Microsoft public backbone and the organization’s application communicates with the services over the Microsoft public backbone

Simplicity is the name of the game here folks. By keeping keeping all communication to the Microsoft public backbone you avoid the complexities of Private Link integration. For organizations that are new to Azure and don’t have a platform landing zone (hybrid connectivity, network inspection, Internet egress support for Azure resources, DNS forwarding for Private Link) this pattern can be done without much effort. As an added benefit, PaaS to PaaS stays over the Microsoft public backbone providing you with the the security controls Microsoft provides across their public backbone.

Benefit 2: All Microsoft PaaS services use the built-in service firewall to control inbound traffic

Almost all (I’m sure there are some exceptions that aren’t top of mind) Microsoft PaaS services provide a basic built-in firewall I refer to as the service firewall. The service firewall is off by default, but can be toggled on to restrict inbound traffic to the public endpoint for the PaaS (which ever PaaS has). Most commonly (every PaaS service seems to work a bit differently) you can create exceptions to the firewall based on IP address or allow “trusted” Microsoft services to bypass the firewall. Additionally, Azure Storage has is its capability which allows to configure specific resource instances to bypass the firewall based on the resource instance identifier and its managed identity.

The “trusted” Microsoft service exception needs a bit more explaining. Most Azure PaaS (again, there are always exceptions because Microsoft loves its snowflakes) has a checkbox in the Portal with text like seen in the screenshot below. This checkbox allows traffic from a specific set of Azure services (identified by their public IP addresses) to bypass the service firewall. Today, this will be a box you will often need to check whenever you are doing PaaS to PaaS. The key thing to understand about this checkbox is it’s all public IPs associated with whatever the “trusted” services are the specific product group identifies. This means instances not owned by you could be allowed to bypass the service firewall (making authentication and authorization critical). Thankfully, the upcoming Network Security Perimeters feature will likely address this gap and make this box something of the past.

Trusted services bypass option

Benefit 3: Microsoft PaaS services that support outbound network controls use those controls to mitigate the risk of data exfiltration

While controlling inbound traffic for a PaaS is typically a Private Endpoint or service firewall (or eventually Network Security Perimeters) use case, controlling outbound traffic tends a bit a bit more tricky. For many compute-based services (AKS, App Services, Azure Container Services, etc) you are able to force outbound traffic through your virtual network allowing you to get visibility into the traffic and control what that service can make outbound network connections to.

With PaaS services like the ones used in this architecture, these types of virtual network integration aren’t an option. For most non-compute-based PaaS you are essentially SOL (I’ll let you figure out this acronym yourself). The services that fall under the Cognitive Services framework (such as Azure OpenAI Service and AI Services) support outbound traffic controls. You can check out my prior post for the details on those controls. In this architecture we use Azure OpenAI Services so we can take advantage of those outbound controls.

Restricting outbound access in Cognitive Services

Controlling outbound access from a PaaS will be another place Network Security Perimeters will become the predominant control mechanism.

Benefit 4: Authentication between each component uses Entra ID based authentication and Azure RBAC authorization

In this pattern Entra ID-based authentication and Azure RBAC authorization is used at each hop for human-to-service and service-to-service communication. Users interacting with these service will use their Entra ID user identities which are typically synchronized from an on-premises Windows Active Directory. Non-humans (applications and services) will use Entra ID service principals to authenticate to each other. This will either be a standard service principal, identified by a client id and client secret (for you AWS folks this is essentially your IAM User), or a special type of service principal called a managed identity (for those of you coming from AWS, this is as close to an IAM Role as Azure gets).

Azure RBAC roles are assigned with least privilege in mind. Users (or groups) are assigned the minimal permissions they need to upload data to the storage account and perform needed functions with the PaaS to load and query the data. Services are provided the necessary permissions they need to interact.

Benefit 5: Minimizes costs by choosing more affordable SKUs where possible

Costs are already pretty cheap with this pattern. This pattern minimizes costs further by sacrificing the Shared Private Access feature of AI Search. Yeah, you lose on the fuzzy feeling of the communication between AI Search and Azure Storage or Azure OpenAI Service happening over Private Link, but you save some money with the more basic SKU and still get the security of the Microsoft public backbone and the service firewalls.

Note that this design choice is made to optimize costs. Performance within the Basic SKU may not be sufficient for your use case.

Benefit 6: Captures logs where available

Finally, let’s look at logging. In this pattern you’ll get your management plane activities (actions on the resources) via Azure Activity Logs and you’ll get data plane (actions on the data held by the resources) activities via diagnostic settings delivering logs to a Log Analytics Workspace.

Each of these resources has a selection of logs available. Some are “ok” (Azure OpenAI Service) and some are “meh” (Azure AI Search). However, you will want all of these logs for both security and operational use.

The Considerations

There can’t only be benefits, right? The major considerations of this pattern is it’s very much built for proof-of-concept. You get basic network security controls with the service firewall, but no inspection of traffic unless you have an inspection point on-premises in front of the developer. Additionally, before communication from the developer gets to Azure it will have to traverse the public Internet before it gets to the Microsoft public backbone. While all of your communication will happen over TLS, you don’t get the security benefits of wrapping that encrypted session with an IPSec tunnel or funneling it over a known path and operational benefits of consistent latency with ExpressRoute.

Scalability of AI Search is another consideration. The Basic SKU will offer you a limited amount of scale.

On the LLM front, this pattern only allows you to deploy models available within an Azure OpenAI Service (or AI Services) instance (thanks to Jose for highlighting this consideration). There are options to adjust this pattern to use other LLMs, but it will require the introduction of AI Foundry which is quite the beast.

There are likely others I’m missing, but this is still a great little pattern to see what the LLMs can do that comes wrapped with decent security controls and requires minimal coding.

Loading Data

So you’ve decided that the benefits and considerations make sense to you and you want to move ahead, or maybe you’re just dipping your toes into this world and you want to muck around with things. Now you’re left wondering, “Ok I set this thing up like you documented above, how the heck do I use it?”

Alrighty, I’m going to show you the the quick and dirty way. Do not assume the way I’m going to show you is the only way to do this pattern. There are lots of variations, especially on how you chunk and load the data into AI Search. My advice to you in that department would be to work with the data folks at your organization or engage a Microsoft solutions architect on the optimal way to chunk and load your data. Do it wrong, and the responses from the LLMs will be crappy. After watching my buddy Jose and many of his peers, it’s very much an art form that requires experience and experimentation.

For the less experienced folks like myself, there is a built-in wizard within AI Search that helps to chunk and vectorize the data. If you open the Azure Portal you’ll see an option called Import and Vectorize as seen in the screenshot below.

The easy button

Clicking that option will open up the wizard (yes Microsoft still loves its wizards). On the first screen you’ll select the Azure Blob Storage option. On the next screen you’ll configure the options below. If you’ve set things up as I’ve outlined them in the initial pattern diagram (RBAC and network controls) this will work like a champ (don’t forget to deploy a chat model like gpt-4o and embedding model like text-embedding-3-large to the AOAI (Azure OpenAI Service) instance). I’m assuming you already created a container in the Azure Storage account and uploaded data (like some PDFs). I’ve found this useful when referencing and consuming RFCs to confirm my understanding.

Here you’re specifying that the AI Search instance grab the data you’ve uploaded to the blob container using its system-assigned managed identity.

Connect to your data

The next screen provides us with the options to vectorize (or create embeddings) for our data. We can then use AI Search to query both the text-based chunks and vectors to optimize the results we return to the LLM. Here I’m selecting to use an embedding model deployed to an Azure OpenAI instance. More advanced scenarios you may choose to incorporate other embeddings you’ve built yourself or sourced from the model marketplace and deployed to AI Foundry (thanks to Jose for mentioning this).

I also select to use the deployed text-embedding-3-large embedding model and am again using the AI Search managed identity to call the Azure OpenAI service to create the embeddings.

Vectorize your text

The data I’m using (10K financial reports) doesn’t have any images so I ignore the Vectorize and enrich your images option.

Finally, I opt to use the semantic ranker (great article on this) to improve the results of my queries to AI Search and leave the other options as the default since this is a one time operation. If you are doing this regularly, getting a good data pipeline in place (either push or pull) is mission critical (another learning from my buddy Jose). Someone smarter than me can help you with that.

Review the settings and opt to create the indexer. The full run will only take a few minutes if you don’t have a ton of content. For larger data volume, get yourself some coffee and work on something else while you wait. If you have any failures at this step, it will likely be you don’t have the networking controls setup correctly. Review the image I posted at the beginning of this post and get busy with the resource logs (it’s good experience!).

Indexer in progress

Once it’s complete, you’ll see a screen like this if you select the indexer that was created. It will will show you how many of the docs it pulled from the container and how many were successfully imaged.

Successful run. Yay!

Next you can navigate to the index and run a test search. Here you’ll get back the relevant records and you can muck around with direct searches against the index to get a feel for the structure of the chunked data. If you don’t get an responses it’s likely an RBAC or networking issue. For RBAC, ensure you granted yourself both management plane (Search Service Contributor) and data plane (Search Data Index Reader or Contributor).

Directly searching chunked data

Chatting with your data

Alright, your data has been pulled into AI Search. How do you go about extending this knowledge base to the LLM? There are a ton of ways to do it, but for something quick and dirty, I’m a fan of either writing some simple Python code or using the Chat Playground via your Azure OpenAI Services or AI Services instance. For this blog, I’m going to be lazy and focus on the latter.

For this you’ll want to navigate to the AOAI instance and select the “Explore Azure AI Foundry portal” link. No this isn’t actually AI Foundry and is instead the rebranded (and standardized) Azure OpenAI Playground incorporated into a Foundry-like interface.

Entering the Azure AI Foundry portal

Once entering the new portal you’ll be dropped into the Chat playground. Here you’ll want to use the Add your data link and then Add a data source link as seen below.

Importing index to Chat Playground

On the next screen I choose to add the index I created earlier during my data import also choosing to use vector-based searches to improve the quality of the search results returned to the LLM. This is where the embedding model I deployed earlier comes into play as seen in the image below.

Adding data source

On the next screen I opt to do a hybrid + semantic search to ensure I get the best results out of a typical keyword search, vector search, and semantic search. A default semantic search configuration was created for you when you imported the data into AI Search.

Data management screen

Lastly, I choose to use the system-assigned managed identity of the AOAI instances when calls are made from the AOAI instance to AI Search. This is where the Azure RBAC assignments I show in the original diagram come into play. Any missing permissions on the managed identity will pop up for you here during the validation stage.

Data Connection screen

After saving and closing I’m good to go! In the chat window I can ask a question such as “How many shares did Microsoft buy back in 2024?” The LLM optimizes my query for AI Search, creates vector-based embeddings of my question, performs the hybrid and semantic search against the AI Search instance, summarizes the results, and returns them to the Chat Playground.

Chat with your data process

Below you see the answer to my question with citations back to the original chunked data in AI Search. Cool shit right?

LLM results

If you’re just dipping your toes into this world or you’re an organization validating that the Azure platform’s AI Services can do what you need them to do before you invest heavily into the platform, this is a great pattern to mess around with. It’s super easy to get up and running, doesn’t require a deep understanding of Azure, and still provides foundational security controls that every organization should have in place. All this in a quick few hours of work.

In upcoming posts I’ll showcase some variations of this pattern such as the incorporation of Private Link and using CoPilot Studio as a frontend to build a quick and simple Teams bot using a small variation of this pattern (this was a really fun one Denis Rougeau, Aga Shirazi, Jose and I have been rolling out to a few customers. Super excited to talk more about that one!).

Until next time!

Azure OpenAI Service – Controlling Outbound Access

Hello again folks! Work has been complete insanity and has been preventing me from posting as of late. Finally, I was able to carve out some time to get a quick blog post done that I’ve been sitting on for while.

I have blogged extensively on the Azure OpenAI Service (AOAI) and today I will continue with another post in that series. These days you folks are more likely using AOAI through the new Azure AI Services resource versus using the AOAI resource. The good news is the guidance I will provide tonight will be relevant to both resources.

One topic I often get asked to discuss with customers is the “old person” aspects of the service such as the security controls Microsoft makes available to customers when using the AOAI service. Besides the identity-based controls, the most common security control that pops up in the regulated space is the available networking controls. While incoming network controls exercised through Private Endpoints or the service firewall (sometimes called the IP firewall) are common, one of the often missed controls within the AOAI service is the outbound network controls. Oddly enough, this is missed often in non-compute PaaS.

You may now be asking yourself, “Why the heck would the AOAI service need to make its own outbound network connections and why should I care about it?”. Excellent question, and honestly, not one I thought about much when I first started working with the service because the use cases I’m going to discuss didn’t come up because the feature set didn’t exist or it wasn’t commonly used. There are two main use cases I’m going to cover (note there are likely others and these are simply the most common):

The first use case is easily the most common right now. The “chat with your data” feature is a feature within AOAI that allows you to pass some extra information in your ChatCompletion API call that will instruct the AOAI service to query an AI Search index that you have populated with data from your environment to extend the model’s knowledge base without completely retraining it. This is essentially a simple way to muck with a retrieval augmented generation (RAG) pattern without having to write the code to orchestrate the interaction between the two services such as detailed in the link above. Instead, the “chat with your data” feature handles the heavy lifting of this for you if you have a populated index and want to add a few additional lines of code. In a future article I’ll go into more depth around this pattern because understanding the complete network and identity interaction is pretty interesting and often misconfigured. For now, understand the basics of it with the flow diagram below. Here also is some sample code if you want to play around with it yourself.

The second use case is when using a multimodal model like GPT-4 or GPT-4o. These models allow for you to pass them other types of data besides text such as images and audio. When requesting an image be analyzed, you have the option of passing the image as base64 or you can pass it a URL. If you pass it a URL the AOAI service will make an outbound network connection to the endpoint specified in the URL to retrieve the image for analysis

 response = client.chat.completions.create(
            ## Model must be a multimodal model
            model=os.getenv('LLM_DEPLOYMENT_NAME'),
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Describe the image"
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": "{{SOME_PUBLIC_URL}}"}
                        }
                    ]
                }
            ],
            max_tokens=100         
        )

In both of these scenarios the AOAI service establishes a network connection from the Microsoft public backbone to the resource (such as AI Search in scenario 1 or a public blob in scenario 2). Unlike compute-based PaaS (App Services, Functions, Azure Container Apps, AKS, etc), today Microsoft does not provide a means for you to send this outbound traffic from AOAI through your virtual network with virtual network injection or virtual network integration. Given that you can’t pass this traffic through your virtual network, how can you mitigate potential data exfiltration risks or poisoning attacks? For example, let’s say an attacker compromises an application’s code and modifies it such that the “chat with your data” feature uses an attacker’s instance of AI Search to capture sensitive data in the queries or poisoning the responses back to the user with bad data. Maybe an attacker decides to use your AOAI instances to process images stolen from another company and placed on a public endpoint. I’m sure someone more creative could come up with a plethora of attacks. Either way, you want to control what your resources communicate with. The positive news is there is a way to do this today and likely an even better way to do this tomorrow when it comes to the AOAI service.

The AOAI (and AI Services) resources fall under the Cognitive Services framework. The benefit of being within the scope of this framework is they inherit some of the security capabilities of this framework. Some examples include support for Private Endpoints or disabling local key-based authentication. Another feature that is available to AOAI is an outbound network control. On an AOAI or AI Services resource, you can configure two properties to lock down the services’ ability to make outbound network calls. These two properties are:

  • restrictOutboundNetworkAccess – Boolean and set to true to block outbound access to everything but the exceptions listed in the allowedFqdnList property
  • allowedFqdnList – A list of FQDNs the service should be able to communicate with for outbound network calls

Using these two controls you can prevent your AOAI or AI Services resource from making outbound network calls except to the list of FQDNs you include. For example, you might whitelist your AI Search instance FQDN for the “chat with your data” feature or your blob storage account endpoint for image analysis. This is a feature I’d highly recommend you enable by default on any new AOAI or AI Service you provision moving forward.

The good news for those of you in the Terraform world is this feature is available directly within the azurerm provider as seen in a sample template below.

resource "azurerm_cognitive_account" "openai" {
  name                = "${local.openai_name}${var.purpose}${var.location_code}${var.random_string}"
  location            = var.location
  resource_group_name = var.resource_group_name
  kind                = "OpenAI"

  custom_subdomain_name = "${local.openai_name}${var.purpose}${var.location_code}${var.random_string}"
  sku_name              = "S0"

  public_network_access_enabled = var.public_network_access
  outbound_network_access_restricted = true
  fqdns = var.allowed_fqdn_list

  network_acls {
    default_action = "Deny"
    ip_rules = var.allowed_ips
    bypass = "AzureServices"
  }

  identity {
    type = "SystemAssigned"
  }

  tags = var.tags

  lifecycle {
    ignore_changes = [
      tags["created_date"],
      tags["created_by"]
    ]
  }
}

If a user attempts to circumvent these controls they will receive a descriptive error stating that outbound access is restricted. For those of you operating in a regulated environment, you should be slapping this setting on every new AOAI or AI Service instance you provision just like you’re doing with controlling inbound access with a Private Endpoint.

Alright folks, that sums up this quick blog post. Let me summarize the lessons learned:

  1. Be aware of which PaaS Services in Azure are capable of establishing outbound network connectivity and explore the controls available to you to restrict it.
  2. For AOAI and AI Services use the restrictOutboundNetworkAccess and allowedFqdnList properties to block outbound network calls except to the endpoints you specify.
  3. Make controlling outbound access of your PaaS a baseline control. Don’t just focus on inbound, focus on both.

Before I close out, recall that I mentioned a way today (the above) and a way in the future. The latter will be the new feature Microsoft announced into public preview Network Security Perimeters. As that feature matures and its list of supported services expands, controlling inbound and outbound network access for PaaS (and getting visibility into it via logs) is going to get far easier. Expect a blog post on that feature in the near future.

Thanks folks!

AI Foundry – The Basics

This is a part of my series on AI Foundry:

Happy New Year! Over the last few months of 2024, I was buried in AI Foundry (FKA Azure AI Studio. Hey marketing needs to do something a few times a year.) proof-of-concepts with my best buddy Jose. The product is very new, very powerful, but also very complex to get up and running in a regulated environment. Through many hours labbing scenarios out and troubleshooting with customers, I figured it was time to share some of what I learned.

So what is AI Foundry? If you want the Microsoft explanation of what it is, read the documentation. Here you’ll get the Matt Felton opinion on what it is (god help you). AI Foundry is a toolset intended to help AI Engineers build Generative AI applications. It allows them to interact with the LLMs (large language models) and build complex workflows (via Prompt Flows) in a no code (like Chat Playground) or low code (prompt flow interface) environment. As you can imagine, this is an attractive tool to get the people who know Generative AI quick access to the LLMs so use cases can be validated before expensive development cycles are spent building the pretty front-end and code required to make it a real application.

Before we get into the guts of the use cases Jose and I have run into, I want to start with the basics of how the hell this service is setup. This will likely require a post or two, so grab your coffee and get ready for a crash course.

AI Foundry is built on top of AML (Azure Machine Learning). If you’re ever built out a locked down AML instance, you have some understanding of the many services that work together to provide the service. AI Foundry inherits these components and provides a sleek user interface on top and typically requires additional resources like Azure OpenAI and AI Search (for RAG use cases). Like AML there are lots and lots of pieces that you need to think about, plan for, and implement in the correct manner to make the secured instance of AI Foundry work.

One specific feature of AML plays an important role within AI Foundry and that is the concept of a hub workspace. A hub is an AML workspace that centrally manages the security, networking, compute resources, and quota for children AML workspaces. These child AML workspaces are referred to as projects. The whole goal of the hub is to make it easier for your various business units to do the stuff they need to do with AML/AI Foundry without having to manage the complex pieces like the security and networking. My guidance would be to think about giving each business unit a Foundry Hub that they can group projects of similar environment (prod or non-prod) and data sensitivity.

General relationship between AI Foundry Hub and AI Foundry Projects

Ok, so you get the basic gist of this. When you deploy an AI Foundry instance, behind the scenes an AML workspace designated as a hub is created. Each project you create is a “child” AML workspace of the hub workspace and will inherit some resources from the hub. Now that you’re grounded in that basic piece, let’s talk about the individual components.

The many resources involved with a secured AI Foundry instance

As you can see in the above image, there are a lot of components of this solution that you will likely use if you want to deploy an AI Foundry instance that has the necessary security and networking controls. Let me give you a quick and dirty explanation of each component. I’ll dive deeper into the identity, authorization, and networking aspects of these components in future posts.

  • Managed identities
    • There are lots of managed identities in use with this product. There is a managed identity for the hub, a managed identity for the project, and managed identities for the various compute. One of the challenges of AI Foundry is knowing which managed identity (and if not a managed identity, the user’s identity) is being used to access a specific resource.
  • Azure Storage Account
    • Just like in AML, there is a default storage account associated with the workspace. Unlike with a traditional AML workspace you may be familiar with, the hub feature allows all project workspaces to leverage the same storage account. The storage account is used by the workspaces to store artifacts like logs in blob storage and artifacts like prompt flows in file storage. The hub and projects isolate their data to specific containers (for blob) and folders (for file) with Azure ABAC (holy f*ck a use case for this feature finally) setup such that the managed identities for the workspaces can only access containers/folders for data related to their specific workspace.
  • Azure Key Vault
    • The Azure Key Vault will store any keys used for connections created from within the AI Foundry project. This could be keys for the default storage account or keys used for API access to models you deploy from the model catalog.
  • Azure Container Registry
    • While this is deemed optional, I’d recommend you plan on deploying it. When deploy a prompt flow there uses certain functionality to the managed compute, the container image used isn’t the default runtime and an ACR instance will be spun up automatically without all the security controls you’ll likely want.
  • Azure OpenAI Service
    • This is used for deployment of OpenAI and some Microsoft chat and embedding models
  • Azure AI Service
    • This can be used as an alternative to the Azure OpenAI Service. It has some additional functionality beyond just hosting the models such as speed and the like.
  • Azure AI Search
    • This will be used for anything RAG related. Most likely you’ll see it used with the Chat With Your Data feature of the Chat Playground.
  • Managed Network
    • This is used to host the compute instances, serverless endpoints and managed online endpoints spun up for compute operations within AI Foundry. I’ll do a deep dive into networking within the service in a future post.
  • Azure Firewall
    • If building a secured instance, you’re going to use a managed virtual network that is locked down to all Internet access via outbound rules. Under the hood an Azure Firewall instance (standard for almost all use cases) will be spun up in the managed virtual network. You interact with it through the creation of the outbound rules and can’t directly administer it. However, you will be paying for it.
  • Role Assignments
    • So so so many role assignments. I’ll cover these in a future post.
  • Azure Private DNS
    • Used heavily for interacting with the AI Foundry instance and models/endpoints you deploy. I’ll cover this in an upcoming post.

Are you frightened yet? If not you should be! Don’t worry though, over this series I’ll walk you through the pain points of getting this service up and running. Once you get past the complex configuration, it’s a crazy valuable service that you’ll have a high demand for from your business units.

In the next post I’ll walk through the complexities of authorization within AI Foundry.