Simple Patterns for Chatting with Your Data – Using the Microsoft public backbone

Hello again folks! Recently, I’ve been working with some far more intelligent peers (such as my buddy Jose Medina Gomez, you should definitely check out his repos because he has some awesome stuff there) on getting some new-to-Azure customers up and running in the GenAI (generative AI) space. Specifically, these customers had some custom data they wanted to use the LLMs (large language models) to reason on and answer questions about. This called for using a RAG (retrieval-augmented generation) pattern to provide an LLM access to an external knowledge base. I thought it would be helpful to other folks out there like myself that are new to this world to document some simple patterns to do this type of stuff that keep security in mind. I’ll cover these over a few posts with this being the first.

The Pattern

The first pattern I want to cover is what I call the “Microsoft public backbone” pattern. This pattern is ideal for customers to minimal to no Azure presence who need something up and running quickly with some basic security guardrails. The pattern looks like what you see below:

Microsoft-backbone Pattern

The key benefits of this pattern are:

  • All traffic between Microsoft PaaS (platform-as-a-service) services flows over the Microsoft public backbone and the organization’s application communicates with the services over the Microsoft public backbone.
  • All Microsoft PaaS services use the built-in service firewall to control inbound traffic.
  • Microsoft PaaS services that support outbound network controls use those controls to mitigate the risk of data exfiltration.
  • Authentication between each component uses Entra ID based authentication and Azure RBAC authorization.
  • Minimizes costs by choosing more affordable SKUs where possible.
  • Captures logs where available.

What I like about this pattern is it is super simple to get up and running (can take less than one hour) and provides decent security controls with minimal headache. It’s by no means production-ready pattern for reasons I’ll discuss further in this post, but for a quick proof-of-concept or for getting your feet wet with RAG-like patterns, this is a great choice.

I’ll now spend a few moments providing detail to each of the benefits I outlined above.

The Benefits and Considerations

Benefit 1: All traffic between Microsoft PaaS (platform-as-a-service) services flows over the Microsoft public backbone and the organization’s application communicates with the services over the Microsoft public backbone

Simplicity is the name of the game here folks. By keeping keeping all communication to the Microsoft public backbone you avoid the complexities of Private Link integration. For organizations that are new to Azure and don’t have a platform landing zone (hybrid connectivity, network inspection, Internet egress support for Azure resources, DNS forwarding for Private Link) this pattern can be done without much effort. As an added benefit, PaaS to PaaS stays over the Microsoft public backbone providing you with the the security controls Microsoft provides across their public backbone.

Benefit 2: All Microsoft PaaS services use the built-in service firewall to control inbound traffic

Almost all (I’m sure there are some exceptions that aren’t top of mind) Microsoft PaaS services provide a basic built-in firewall I refer to as the service firewall. The service firewall is off by default, but can be toggled on to restrict inbound traffic to the public endpoint for the PaaS (which ever PaaS has). Most commonly (every PaaS service seems to work a bit differently) you can create exceptions to the firewall based on IP address or allow “trusted” Microsoft services to bypass the firewall. Additionally, Azure Storage has is its capability which allows to configure specific resource instances to bypass the firewall based on the resource instance identifier and its managed identity.

The “trusted” Microsoft service exception needs a bit more explaining. Most Azure PaaS (again, there are always exceptions because Microsoft loves its snowflakes) has a checkbox in the Portal with text like seen in the screenshot below. This checkbox allows traffic from a specific set of Azure services (identified by their public IP addresses) to bypass the service firewall. Today, this will be a box you will often need to check whenever you are doing PaaS to PaaS. The key thing to understand about this checkbox is it’s all public IPs associated with whatever the “trusted” services are the specific product group identifies. This means instances not owned by you could be allowed to bypass the service firewall (making authentication and authorization critical). Thankfully, the upcoming Network Security Perimeters feature will likely address this gap and make this box something of the past.

Trusted services bypass option

Benefit 3: Microsoft PaaS services that support outbound network controls use those controls to mitigate the risk of data exfiltration

While controlling inbound traffic for a PaaS is typically a Private Endpoint or service firewall (or eventually Network Security Perimeters) use case, controlling outbound traffic tends a bit a bit more tricky. For many compute-based services (AKS, App Services, Azure Container Services, etc) you are able to force outbound traffic through your virtual network allowing you to get visibility into the traffic and control what that service can make outbound network connections to.

With PaaS services like the ones used in this architecture, these types of virtual network integration aren’t an option. For most non-compute-based PaaS you are essentially SOL (I’ll let you figure out this acronym yourself). The services that fall under the Cognitive Services framework (such as Azure OpenAI Service and AI Services) support outbound traffic controls. You can check out my prior post for the details on those controls. In this architecture we use Azure OpenAI Services so we can take advantage of those outbound controls.

Restricting outbound access in Cognitive Services

Controlling outbound access from a PaaS will be another place Network Security Perimeters will become the predominant control mechanism.

Benefit 4: Authentication between each component uses Entra ID based authentication and Azure RBAC authorization

In this pattern Entra ID-based authentication and Azure RBAC authorization is used at each hop for human-to-service and service-to-service communication. Users interacting with these service will use their Entra ID user identities which are typically synchronized from an on-premises Windows Active Directory. Non-humans (applications and services) will use Entra ID service principals to authenticate to each other. This will either be a standard service principal, identified by a client id and client secret (for you AWS folks this is essentially your IAM User), or a special type of service principal called a managed identity (for those of you coming from AWS, this is as close to an IAM Role as Azure gets).

Azure RBAC roles are assigned with least privilege in mind. Users (or groups) are assigned the minimal permissions they need to upload data to the storage account and perform needed functions with the PaaS to load and query the data. Services are provided the necessary permissions they need to interact.

Benefit 5: Minimizes costs by choosing more affordable SKUs where possible

Costs are already pretty cheap with this pattern. This pattern minimizes costs further by sacrificing the Shared Private Access feature of AI Search. Yeah, you lose on the fuzzy feeling of the communication between AI Search and Azure Storage or Azure OpenAI Service happening over Private Link, but you save some money with the more basic SKU and still get the security of the Microsoft public backbone and the service firewalls.

Note that this design choice is made to optimize costs. Performance within the Basic SKU may not be sufficient for your use case.

Benefit 6: Captures logs where available

Finally, let’s look at logging. In this pattern you’ll get your management plane activities (actions on the resources) via Azure Activity Logs and you’ll get data plane (actions on the data held by the resources) activities via diagnostic settings delivering logs to a Log Analytics Workspace.

Each of these resources has a selection of logs available. Some are “ok” (Azure OpenAI Service) and some are “meh” (Azure AI Search). However, you will want all of these logs for both security and operational use.

The Considerations

There can’t only be benefits, right? The major considerations of this pattern is it’s very much built for proof-of-concept. You get basic network security controls with the service firewall, but no inspection of traffic unless you have an inspection point on-premises in front of the developer. Additionally, before communication from the developer gets to Azure it will have to traverse the public Internet before it gets to the Microsoft public backbone. While all of your communication will happen over TLS, you don’t get the security benefits of wrapping that encrypted session with an IPSec tunnel or funneling it over a known path and operational benefits of consistent latency with ExpressRoute.

Scalability of AI Search is another consideration. The Basic SKU will offer you a limited amount of scale.

On the LLM front, this pattern only allows you to deploy models available within an Azure OpenAI Service (or AI Services) instance (thanks to Jose for highlighting this consideration). There are options to adjust this pattern to use other LLMs, but it will require the introduction of AI Foundry which is quite the beast.

There are likely others I’m missing, but this is still a great little pattern to see what the LLMs can do that comes wrapped with decent security controls and requires minimal coding.

Loading Data

So you’ve decided that the benefits and considerations make sense to you and you want to move ahead, or maybe you’re just dipping your toes into this world and you want to muck around with things. Now you’re left wondering, “Ok I set this thing up like you documented above, how the heck do I use it?”

Alrighty, I’m going to show you the the quick and dirty way. Do not assume the way I’m going to show you is the only way to do this pattern. There are lots of variations, especially on how you chunk and load the data into AI Search. My advice to you in that department would be to work with the data folks at your organization or engage a Microsoft solutions architect on the optimal way to chunk and load your data. Do it wrong, and the responses from the LLMs will be crappy. After watching my buddy Jose and many of his peers, it’s very much an art form that requires experience and experimentation.

For the less experienced folks like myself, there is a built-in wizard within AI Search that helps to chunk and vectorize the data. If you open the Azure Portal you’ll see an option called Import and Vectorize as seen in the screenshot below.

The easy button

Clicking that option will open up the wizard (yes Microsoft still loves its wizards). On the first screen you’ll select the Azure Blob Storage option. On the next screen you’ll configure the options below. If you’ve set things up as I’ve outlined them in the initial pattern diagram (RBAC and network controls) this will work like a champ (don’t forget to deploy a chat model like gpt-4o and embedding model like text-embedding-3-large to the AOAI (Azure OpenAI Service) instance). I’m assuming you already created a container in the Azure Storage account and uploaded data (like some PDFs). I’ve found this useful when referencing and consuming RFCs to confirm my understanding.

Here you’re specifying that the AI Search instance grab the data you’ve uploaded to the blob container using its system-assigned managed identity.

Connect to your data

The next screen provides us with the options to vectorize (or create embeddings) for our data. We can then use AI Search to query both the text-based chunks and vectors to optimize the results we return to the LLM. Here I’m selecting to use an embedding model deployed to an Azure OpenAI instance. More advanced scenarios you may choose to incorporate other embeddings you’ve built yourself or sourced from the model marketplace and deployed to AI Foundry (thanks to Jose for mentioning this).

I also select to use the deployed text-embedding-3-large embedding model and am again using the AI Search managed identity to call the Azure OpenAI service to create the embeddings.

Vectorize your text

The data I’m using (10K financial reports) doesn’t have any images so I ignore the Vectorize and enrich your images option.

Finally, I opt to use the semantic ranker (great article on this) to improve the results of my queries to AI Search and leave the other options as the default since this is a one time operation. If you are doing this regularly, getting a good data pipeline in place (either push or pull) is mission critical (another learning from my buddy Jose). Someone smarter than me can help you with that.

Review the settings and opt to create the indexer. The full run will only take a few minutes if you don’t have a ton of content. For larger data volume, get yourself some coffee and work on something else while you wait. If you have any failures at this step, it will likely be you don’t have the networking controls setup correctly. Review the image I posted at the beginning of this post and get busy with the resource logs (it’s good experience!).

Indexer in progress

Once it’s complete, you’ll see a screen like this if you select the indexer that was created. It will will show you how many of the docs it pulled from the container and how many were successfully imaged.

Successful run. Yay!

Next you can navigate to the index and run a test search. Here you’ll get back the relevant records and you can muck around with direct searches against the index to get a feel for the structure of the chunked data. If you don’t get an responses it’s likely an RBAC or networking issue. For RBAC, ensure you granted yourself both management plane (Search Service Contributor) and data plane (Search Data Index Reader or Contributor).

Directly searching chunked data

Chatting with your data

Alright, your data has been pulled into AI Search. How do you go about extending this knowledge base to the LLM? There are a ton of ways to do it, but for something quick and dirty, I’m a fan of either writing some simple Python code or using the Chat Playground via your Azure OpenAI Services or AI Services instance. For this blog, I’m going to be lazy and focus on the latter.

For this you’ll want to navigate to the AOAI instance and select the “Explore Azure AI Foundry portal” link. No this isn’t actually AI Foundry and is instead the rebranded (and standardized) Azure OpenAI Playground incorporated into a Foundry-like interface.

Entering the Azure AI Foundry portal

Once entering the new portal you’ll be dropped into the Chat playground. Here you’ll want to use the Add your data link and then Add a data source link as seen below.

Importing index to Chat Playground

On the next screen I choose to add the index I created earlier during my data import also choosing to use vector-based searches to improve the quality of the search results returned to the LLM. This is where the embedding model I deployed earlier comes into play as seen in the image below.

Adding data source

On the next screen I opt to do a hybrid + semantic search to ensure I get the best results out of a typical keyword search, vector search, and semantic search. A default semantic search configuration was created for you when you imported the data into AI Search.

Data management screen

Lastly, I choose to use the system-assigned managed identity of the AOAI instances when calls are made from the AOAI instance to AI Search. This is where the Azure RBAC assignments I show in the original diagram come into play. Any missing permissions on the managed identity will pop up for you here during the validation stage.

Data Connection screen

After saving and closing I’m good to go! In the chat window I can ask a question such as “How many shares did Microsoft buy back in 2024?” The LLM optimizes my query for AI Search, creates vector-based embeddings of my question, performs the hybrid and semantic search against the AI Search instance, summarizes the results, and returns them to the Chat Playground.

Chat with your data process

Below you see the answer to my question with citations back to the original chunked data in AI Search. Cool shit right?

LLM results

If you’re just dipping your toes into this world or you’re an organization validating that the Azure platform’s AI Services can do what you need them to do before you invest heavily into the platform, this is a great pattern to mess around with. It’s super easy to get up and running, doesn’t require a deep understanding of Azure, and still provides foundational security controls that every organization should have in place. All this in a quick few hours of work.

In upcoming posts I’ll showcase some variations of this pattern such as the incorporation of Private Link and using CoPilot Studio as a frontend to build a quick and simple Teams bot using a small variation of this pattern (this was a really fun one Denis Rougeau, Aga Shirazi, Jose and I have been rolling out to a few customers. Super excited to talk more about that one!).

Until next time!

Azure OpenAI Service – Controlling Outbound Access

Hello again folks! Work has been complete insanity and has been preventing me from posting as of late. Finally, I was able to carve out some time to get a quick blog post done that I’ve been sitting on for while.

I have blogged extensively on the Azure OpenAI Service (AOAI) and today I will continue with another post in that series. These days you folks are more likely using AOAI through the new Azure AI Services resource versus using the AOAI resource. The good news is the guidance I will provide tonight will be relevant to both resources.

One topic I often get asked to discuss with customers is the “old person” aspects of the service such as the security controls Microsoft makes available to customers when using the AOAI service. Besides the identity-based controls, the most common security control that pops up in the regulated space is the available networking controls. While incoming network controls exercised through Private Endpoints or the service firewall (sometimes called the IP firewall) are common, one of the often missed controls within the AOAI service is the outbound network controls. Oddly enough, this is missed often in non-compute PaaS.

You may now be asking yourself, “Why the heck would the AOAI service need to make its own outbound network connections and why should I care about it?”. Excellent question, and honestly, not one I thought about much when I first started working with the service because the use cases I’m going to discuss didn’t come up because the feature set didn’t exist or it wasn’t commonly used. There are two main use cases I’m going to cover (note there are likely others and these are simply the most common):

The first use case is easily the most common right now. The “chat with your data” feature is a feature within AOAI that allows you to pass some extra information in your ChatCompletion API call that will instruct the AOAI service to query an AI Search index that you have populated with data from your environment to extend the model’s knowledge base without completely retraining it. This is essentially a simple way to muck with a retrieval augmented generation (RAG) pattern without having to write the code to orchestrate the interaction between the two services such as detailed in the link above. Instead, the “chat with your data” feature handles the heavy lifting of this for you if you have a populated index and want to add a few additional lines of code. In a future article I’ll go into more depth around this pattern because understanding the complete network and identity interaction is pretty interesting and often misconfigured. For now, understand the basics of it with the flow diagram below. Here also is some sample code if you want to play around with it yourself.

The second use case is when using a multimodal model like GPT-4 or GPT-4o. These models allow for you to pass them other types of data besides text such as images and audio. When requesting an image be analyzed, you have the option of passing the image as base64 or you can pass it a URL. If you pass it a URL the AOAI service will make an outbound network connection to the endpoint specified in the URL to retrieve the image for analysis

 response = client.chat.completions.create(
            ## Model must be a multimodal model
            model=os.getenv('LLM_DEPLOYMENT_NAME'),
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Describe the image"
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": "{{SOME_PUBLIC_URL}}"}
                        }
                    ]
                }
            ],
            max_tokens=100         
        )

In both of these scenarios the AOAI service establishes a network connection from the Microsoft public backbone to the resource (such as AI Search in scenario 1 or a public blob in scenario 2). Unlike compute-based PaaS (App Services, Functions, Azure Container Apps, AKS, etc), today Microsoft does not provide a means for you to send this outbound traffic from AOAI through your virtual network with virtual network injection or virtual network integration. Given that you can’t pass this traffic through your virtual network, how can you mitigate potential data exfiltration risks or poisoning attacks? For example, let’s say an attacker compromises an application’s code and modifies it such that the “chat with your data” feature uses an attacker’s instance of AI Search to capture sensitive data in the queries or poisoning the responses back to the user with bad data. Maybe an attacker decides to use your AOAI instances to process images stolen from another company and placed on a public endpoint. I’m sure someone more creative could come up with a plethora of attacks. Either way, you want to control what your resources communicate with. The positive news is there is a way to do this today and likely an even better way to do this tomorrow when it comes to the AOAI service.

The AOAI (and AI Services) resources fall under the Cognitive Services framework. The benefit of being within the scope of this framework is they inherit some of the security capabilities of this framework. Some examples include support for Private Endpoints or disabling local key-based authentication. Another feature that is available to AOAI is an outbound network control. On an AOAI or AI Services resource, you can configure two properties to lock down the services’ ability to make outbound network calls. These two properties are:

  • restrictOutboundNetworkAccess – Boolean and set to true to block outbound access to everything but the exceptions listed in the allowedFqdnList property
  • allowedFqdnList – A list of FQDNs the service should be able to communicate with for outbound network calls

Using these two controls you can prevent your AOAI or AI Services resource from making outbound network calls except to the list of FQDNs you include. For example, you might whitelist your AI Search instance FQDN for the “chat with your data” feature or your blob storage account endpoint for image analysis. This is a feature I’d highly recommend you enable by default on any new AOAI or AI Service you provision moving forward.

The good news for those of you in the Terraform world is this feature is available directly within the azurerm provider as seen in a sample template below.

resource "azurerm_cognitive_account" "openai" {
  name                = "${local.openai_name}${var.purpose}${var.location_code}${var.random_string}"
  location            = var.location
  resource_group_name = var.resource_group_name
  kind                = "OpenAI"

  custom_subdomain_name = "${local.openai_name}${var.purpose}${var.location_code}${var.random_string}"
  sku_name              = "S0"

  public_network_access_enabled = var.public_network_access
  outbound_network_access_restricted = true
  fqdns = var.allowed_fqdn_list

  network_acls {
    default_action = "Deny"
    ip_rules = var.allowed_ips
    bypass = "AzureServices"
  }

  identity {
    type = "SystemAssigned"
  }

  tags = var.tags

  lifecycle {
    ignore_changes = [
      tags["created_date"],
      tags["created_by"]
    ]
  }
}

If a user attempts to circumvent these controls they will receive a descriptive error stating that outbound access is restricted. For those of you operating in a regulated environment, you should be slapping this setting on every new AOAI or AI Service instance you provision just like you’re doing with controlling inbound access with a Private Endpoint.

Alright folks, that sums up this quick blog post. Let me summarize the lessons learned:

  1. Be aware of which PaaS Services in Azure are capable of establishing outbound network connectivity and explore the controls available to you to restrict it.
  2. For AOAI and AI Services use the restrictOutboundNetworkAccess and allowedFqdnList properties to block outbound network calls except to the endpoints you specify.
  3. Make controlling outbound access of your PaaS a baseline control. Don’t just focus on inbound, focus on both.

Before I close out, recall that I mentioned a way today (the above) and a way in the future. The latter will be the new feature Microsoft announced into public preview Network Security Perimeters. As that feature matures and its list of supported services expands, controlling inbound and outbound network access for PaaS (and getting visibility into it via logs) is going to get far easier. Expect a blog post on that feature in the near future.

Thanks folks!

AI Foundry – Encryption with CMK

This is a part of my series on AI Foundry:

Over the past week I’ve been spending time messing around with different AI Foundry implementations. The most recent implementation is one of the more complex. After much struggling with getting it up and running, I figured I’d share my experience (and frustrations) with you all. Let’s delve into the complexities (and pain) of an AI Foundry instance which uses a CMK (customer-managed key) for encryption.

Encryption in Azure

Let’s start by doing a very quick overview of encryption in Azure. Before even that, let’s talk about what encryption is. For that I will source the definition direct from NIST (National Institute for Standards and Technology) “Cryptographic transformation of data (called “plaintext”) into a form (called “ciphertext”) that conceals the data’s original meaning to prevent it from being known or used”. In layman’s terms, “make data look like junk to prevent bad keys from viewing it”. When we talk about encryption, we typically cover two strategies: in use and in transit (there is also encryption in use which is pretty neat if you need some good bedtime material). In transit is all about how we encrypt the data as it flows over a network from point A to point B. When we talk about at rest, we’re talking about the data as it sits on a some type of storage device and waits to be accessed.

Encryption in transit within Azure will come by default for almost all use cases in Azure where you are using PaaS (platform-as-a-service). You’ll be on the hook for using encrypted protocols if you build your own stuff on a virtual machine or the like. Encryption at rest is the strategy relevant to this post. Like most CSPs (Cloud Service Providers), Microsoft will typically encrypt data by default. The data (such as your file in Azure Storage) is encrypted by a symmetric encryption algorithm such as AES256. The secret key to decrypt the data is often referred to as the DEK (Data Encryption Key). The DEK is then encrypted using an asymmetric encryption algorithm such as RSA and this key is referred to as the KEK (Key Encryption Key). These encryption operations, and the keys used to support them, are typically performed on some HSM (hardware security module) the cloud provider manages behind the scenes. This model results in a good balance of performance (key used for bulk operations is small, KEK can be rotated without having to decrypt and re-encrypt the data) and security (larger key used to secure smaller key). This process is called envelope encryption.

Envelope Encryption

You may be asking “What about the encrypted DEK, where is that stored?” Great question! The answer is, it depends. Some CSPs attach it as metadata to the encrypted cipher text (This was the way Amazon would store encrypted DEKs for SSE-CMK for S3 in the past. Not sure if they still do this today) or they may discard it altogether, and retain the required data to re-create it at decryption time (this is the way Azure Storage does it).

For most Azure services (and you should validate this with each service) where there is persistent data, Microsoft is encrypting that data using the method above. The KEK in this instance is managed by Microsoft in some Microsoft-managed Key Vault instance (where operations on the key and storage of the key at rest is performed by Microsoft-managed HSMs) deployed to an Azure subscription in a Microsoft-managed Entra ID tenant. Microsoft manages access to the KEK and rotation of the KEK. If you’re ever curious of how often Microsoft manages its keys, you’ll find that information buried in the compliance documentation in the Microsoft Service Trust Portal. When Microsoft manages the KEK this is sometimes referred to as an MMK (Microsoft Managed Key), PMK (Provider Managed Key, my preferred term), or some similar term. PMKs should be your go to if they align with your organizational controls because it’s the least overhead and will introduce the fewest limitations to your designs.

If your organizational security policies require you to control access to the key, maintain logs of when the keys are used, and control rotation of the keys, you are going to be living in CMK (customer manage key) land. In that land the DEKs created by the platform are going to be encrypted by a KEK (or CMK) living in an instance of Azure Key Vault in an Azure subscription in your Entra ID tenant. It will be on you to grant access to the services that require access to that key and it will be on you to rotate that key. You will also get access to when and by who the key is used by.

I could go on and on about this topic because I find the whole cryptography field (not the math stuff, that goes over my head. There is a reason I dropped CompSci in college) really interesting. But alas, we need to get back to AI Foundry with CMKs.

AI Foundry and CMKs

Like I’ve covered in my earlier posts, every AI Foundry instance is a whole bunch of resources that are looped together with both direct and indirect relationships. AI Foundry is built on top of AML (Azure Machine Learning) and like AML, its core resource is the workspace. Unlike the AML you’re likely used to, instead of standalone workspaces, AI Foundry utilizes AML Hub and Project workspaces. I’ve explained the concept in my first post, but you think of these workspaces as having sort of a parent (hub) and child (project) relationship. This yields some of the benefits I discussed in my first post such as the ability to share connections. Encryption of the workspace and its data is what I will be focusing on.

AI Foundry Resources

It’s important to understand AI Foundry’s relationship with AML because it uses the same patterns for CMK that AML uses. Behind the scenes of every AML / AI Foundry workspace is a CosmosDB, AI Search, and (not the default) Storage Account (likely other components as well but this is what’s important for this conversation). Normally, you don’t see these resources as they’re deployed in an Azure subscription associated to a Microsoft tenant. These resources have the following purpose:

  • Azure Cosmos DB – Stores metadata for Azure AI projects and tools including index names, flow creation timestamps, deployment tabs, and metrics.
  • Azure AI Search – Stores indices that are used to help query AI Foundry content such as an index with all your model deployment names.
  • Azure Storage Account – Stores instructions for customization tasks such as JSON representation of flows you create.

The data stored within these resources is encrypted by a PMK by default. When you opt to create an AI Foundry instance, the data in these resources are what you are encrypting with your CMK. There are two methods to encrypt this data. One method is generally available and should be familar to those of you who have used AML with CMKs in the past and the other method is still in preview.

With the first method the resources above are deployed to a Microsoft-managed resource group within your subscription. This resource group will have a naming convention similar to azureml-YOURWORKSPACENAME_SOMEGUID. Unlike most patterns where Microsoft creates a resource group in your subscription, there are no denyAssignments preventing you from modifying this resources . This means you can muck these resources as much as you please if you’d like a painful experience (which will break shit when you do as we will see later). Take note you will pay additional costs for each of these resources since they now exist within your subscription.

The second method is referred to as “service-side storage”. In this model the Cosmos DB instance and AI Search instance are no longer stored in your subscription and instead are stored in an Azure subscription in a Microsoft-managed Entra ID tenant. There are still additional costs for this option since it still requires dedicated resources.

With both of these methods, you point the AI Foundry instance to a key stored in an instance of Azure Key Vault in your environment. The user-assigned managed identity associated with your hub needs appropriate permissions on the key. One catch here is the vault has to be configured with legacy Key Vault access policies. You cannot use Azure RBAC at this time (someone can correct me if this isn’t true anymore).

Now that you understand the basics of how CMK is handled, let’s focus on the first method since that is what is generally available today. There are lots of fun ways for this one to break.

The Many Ways to Break Shit

Remember when I mentioned you could modify the resources in the managed resource group? Well, that bit me when I did made modifications and it bit me when some automation run by the bosses upstairs ran at night.

One thing that popped out to me was the Cosmos DB and AI Search instances both had the service firewalls set to allow all public traffic. The Azure Storage Account firewall is enabled and configured to allow only trusted Azure services. Given I like to lock public network access down, I decided to see what would happen if I enabled the service firewall on the other two resources.

The first thing I tried to lock down was the AI Search instance. For this I disabled public access and allowed the trusted Azure services exception. After waiting around 10 minutes for the changes to take effect (really Microsoft?) I started to receive the error below indicating that the change seemed to break the service. The diagnostic logs for the AI Search instance showed it throwing 403 forbidden back (yes folks you can enable diagnostic settings for these resources and handle them the same way you would your own resources). Ok, so can’t mess with that firewall setting.

Error when managed AI Search or CosmosDB is busted

After re-enabling public network access (and waiting another ten minutes) to the AI Search instance to get back to steady state, I decided to look at Cosmos DB. When an AI Foundry workspace is configured with a Private Endpoint, Microsoft creates a Private Endpoint for the Cosmos DB instance in some Microsoft-managed virtual network. Reviewing diagnostic logs does confirm the traffic to the Cosmos DB is coming in via RFC1918 traffic so wherever this private endpoint is, it does get used.

This lead me down to the path of disabling public access for the Cosmos DB instance (which took another 10 minutes). I then tested various operations in AI Foundry and didn’t run into an issue. It SEEMS like this is supported, but I’m still waiting on the product group to confirm for me. Once I hear back, I’ll update this post.

Ok, so those were the things that I did with the resources. Now let’s talk about how automation by folks who mean well can cause chaos.

One morning I woke up to do more testing with this deployment and I started to get the same error I received when I locked down AI Search. After blowing up the lab a few times to vet it wasn’t something I broke mucking with the resources, I dug into the diagnostic logs. Within the Cosmos DB data plane diagnostic logs, I noticed that every call was being spit back complaining invalid access key. A quick check of the disableLocalAuth property of the Cosmos DB instance confirmed that some automation run by the bosses upstairs had disabled local authentication in Cosmos which breaks this whole integration.

A bit more digging showed that all three resources rely upon access keys instead of Entra ID authentication and Azure RBAC authorization. This is likely due to the multi-tenant aspects of the solution (some stuff in your tenant and some in Microsoft’s tenant). I wanted to call this one out, because I know many folks out there disable local access keys for data plane access to services. You’ll need to put in the appropriate exceptions for these three managed resources. If you don’t, you will break your Foundry instance. If you’re not sure if this is your issue, review the diagnostic logs of the resources and check to see if your error is similar to the one I provided above.

Lessons Learned

Hopefully this helps you understand a bit better about the benefits and considerations (more so considerations in this scenario) of using a CMK with AI Foundry. Fingers crossed service-side encryption mechanism in preview goes generally available in the near future and makes this painful process obselete. Until then, here are some suggestion:

  1. Remember that these resources can be modified. You should avoid making any modification (exempting diagnostic settings) to these resources.
  2. Enable diagnostic logging on the resources in the managed resource group. If anything, they will be helpful to troubleshoot if something breaks whether it be identity or networking related.
  3. Understand there is additional cost of this implementation which isn’t typical of CMK usage in my experience (minus cost of the key which is standard in every CMK integration).

AI Foundry – Credential vs Identity Data Stores

This is a part of my series on AI Foundry:

Hello again folks. Today, I’m going to continue my series on AI Foundry. I’ve been scratching my head on how best to tackle this series, because the service consists of so many foundational services plumbed together into a larger solution so there is a lot to talk about. The product can be complicated when implementing it with all the security bells and whistles. Getting it right requires a solid baseline understanding of the foundational components security capabilities (such as Azure Storage, Azure Key Vault, etc) and how these components work together for the purposes of AI Foundry.

The many components of an AI Foundry deployment

For the purposes of this post, I’m going to focus in on Azure Storage, specifically the storage account associated with the AI Foundry Hub. I will refer to this storage account as the default storage account. As I covered in my first post, AI Foundry is built on top of Azure Machine Learning. Like Azure Machine Learning, AI Foundry uses the default storage account to store artifacts created by the AI Foundry hub and projects. This includes files for the Prompt Flows you create, files used by the compute provisioned in the managed virtual network, and other artifacts related to the functionality of the product. This storage account is shared across the AI Foundry hub and all projects created within it.

The default storage account is critical to the functionality and if you muck up the identity or networking configuration, the product simply won’t work. The errors you’ll receive won’t always indicate an obvious problem with your storage account configuration. To help you avoid mucking up the identity portion, I’m going to use this post to explain your options for identity integration with the default storage account.

AI Foundry uses workspace connection resources to connect to external resources outside of the workspace. This includes the default storage account, AOAI (Azure OpenAI Service) or AI Service instance, and the like. When you create a connection in AI Foundry, you configure how the workspace should authenticate to the resource (determined by the authType property of the connection) when called by a user. This will most commonly be either Entra ID or an API key. In the example below, you see I have a connection object for an AI Search instance set to use Entra authentication by configuring the authType to AAD.

 {
      "id": "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rgaifeus296/providers/Microsoft.MachineLearningServices/workspaces/aifhaifoundryeus296/connections/connaisaifoundryeus296",
      "location": null,
      "name": "connmysearchservice",
      "properties": {
        "authType": "AAD",
        "category": "CognitiveSearch",
        "createdByWorkspaceArmId": "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rgaifeus296/providers/Microsoft.MachineLearningServices/workspaces/aifhaifoundryeus296",
        "error": "Network Service does not have permission to check resource /subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rgaifeus296/providers/Microsoft.Search/searchServices/aisaifoundryeus296 details. Please consider grant Azure Machine Learning (appId: 0736f41a-0425-4b46-bdb5-1563eff02385) read or contributor access to connected resource.",
        "expiryTime": null,
        "group": "AzureAI",
        "isSharedToAll": true,
        "metadata": {
          "ApiType": "Azure",
          "ApiVersion": "2024-05-01-preview",
          "DeploymentApiVersion": "2023-11-01",
          "ResourceId": "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rgaifeus296/providers/Microsoft.Search/searchServices/mysearchservice"
        },
        "peRequirement": "NotApplicable",
        "peStatus": "NotApplicable",
        "sharedUserList": [],
        "target": "https://mysearchservice.search.windows.net",
        "useWorkspaceManagedIdentity": false
      },
      "systemData": {
        "createdAt": "2025-01-12T23:19:01.8005674Z",
        "createdBy": "d34d51b2-34b4-45d9-b6a8-cc5422eb400a",
        "createdByType": "Application",
        "lastModifiedAt": "2025-01-12T23:19:01.8005674Z",
        "lastModifiedBy": "d34d51b2-34b4-45d9-b6a8-cc5422eb400a",
        "lastModifiedByType": "Application"
      },
      "tags": null,
      "type": "Microsoft.MachineLearningServices/workspaces/connections"
    }

Creating this connection allows me to use the AI Search instance within the AI Foundry hub and projects such as using it within the ChatPlayground Chat With Your Data feature. When the connection object is called, an Entra ID identity will be used. This could be the user’s identity, it could a project’s managed identity, or it could even be a managed-online endpoint’s managed identity. In all cases, the identity will be an Entra ID identity that can be authenticated to the tenant and the actions it is authorized to do are determined by its Azure RBAC assignments. It’s critical to understand that if you choose Entra ID-based authentication, you need to have proper permissions in place.

When a new AI Foundry hub is created, it will either create new storage account or integrate with an existing storage account to be used as the default storage account. During setup via the Portal, in the identity section you’ll see the option to choose credential-based or identity-based authentication when connecting to the default storage account. By default, credential-based access will be used. If you are provisioning via Terraform (which as of right now will require you to use the AzApi resource provider) you would set the properties.systemDatastoresAuthMode property to either accesskey or identity. As of the date of this blog, this property still is not documented in the REST API documentation that I could find, however, it will work when referencing it with API version Microsoft.MachineLearningServices/workspaces@2024-10-01-preview.

Credential or Identity-based access

So why would you choose identity-based access if you have to additionally provision the relevant security principals with access via RBAC? Before I answer that, let me do a quick recap on authorization in Azure. As I cover in my series on Azure authorization, services like storage have both a management plane and data plane. While the management plane is always Entra ID-based authentication and Azure RBAC, the data plane for most services (storage included) can use either Entra ID/Azure RBAC or API keys (via Storage Access Keys and SAS tokens). Usage of any type of static key typically grants the security principal using the key complete access to the data plane. Additionally, determining who is using the key at any given time is mostly impossible. For that reason, choosing to use Entra ID/Azure RBAC should be your preference wherever possible. Entra ID will give your traceability back to the security principal that touched the resource and Azure RBAC will give you the ability to assign granular permissions across the data plane.

Management plane versus data plane

If you instead select credential-based authentication a few things happen. When the new AI Foundry hub is created the connections made to the default storage account will be configured to use a SAS token. Any security principal with read access to the workspace can use that connection information for the storage account from within an AI Foundry project to connect to the storage account using those credentials. This means no audibility about what user is doing what with the storage account. This goes for any connection you share across projects that use an API key. Not good.

Default storage account configured to use credential-based authentication

It’s worth understanding the Key Vault resource used by AI Foundry in this scenario. When selecting credential-based authentication for the default storage account, the storage access keys for the storage account are stored in the Key Vault. Both the AI Foundry hub and projects under the hub are granted access to the secrets via Key Vault access policies. Yuck and yuck. Users do not get access to the Key Vault itself. Foundry simply enables them to exercise the use of the credential via permissions over teh connection object within the Foundry hub or project. When using identity-based authentication and Entra ID for your connections, the Azure Key Vault will be used minimally (such as being used if you deploy a model from the model catalog to managed online endpoint and select key-based authentication) to none.

Hopefully at this point I’ve sold you on the benefits of using the identity-based authentication to the default storage account (and Entra ID for connected resources). As a quick recap, if you care about least privilege and audibility, you’ll choose identity-based authentication. The main consideration of choosing identity-based authentication for the default storage account is that you need to get Azure RBAC right or else shit will break. Oh yes will it break.

If you configure your AI Foundry instance with a SMI (system-assigned managed identity) for the hub and projects, the required permissions on the default storage account will be granted for these identities. This includes:

  • Hub identity
    • Storage Blob Data Contributor
    • Storage File Data Privileged Contributor
  • Project identity
    • Storage Account Contributor
    • Storage Blob Data Contributor
    • Storage File Data Privileged Contributor
    • Storage Table Data Contributor

If you’re nosy like I am, you’ll notice the Azure RBAC assignments for both identities for the hub and project have an ABAC condition attached (yes an actual use case!). I plan on covering ABAC conditions in depth in my authorization series, but essentially they are a way of scoping the access to an attribute of the security principal, resource, or session. Within AI Foundry, they are used to limit the managed identities to accessing the blob containers specific to their underlining AML workspace. This helps to prevent the managed identity of one project from accessing artifacts produced by another project. For example, here are the conditions associated with my hub’s managed identity:

(
 (
  !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete'})
  AND
  !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read'})
  AND
  !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write'})
  AND
  !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/move/action'})
  AND
  !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/add/action'})
 )
 OR 
 (
  @Resource[Microsoft.Storage/storageAccounts/blobServices/containers:name] StringStartsWithIgnoreCase '67b8ddaa-f77e-4d12-b9ca-440326274da9'
 )
)

If you opt to use a UMI (user-assigned managed identity) for the AI Foundry hub you’ll need to manually grant these permissions to the UMI prior to provisioning the hub. You should try to include these conditions.

As I mentioned earlier, there are three primary sets of identities that hit the resources for an AI Foundry. These include the hub/project identity, user identity, and compute identities. If you opt to use identity-based authentication to the storage account, you will need to ensure you grant your users appropriate permissions on the storage account. When a user does something like create a prompt flow, the user’s identity context is used to access the file endpoint in the storage account to create a file share that will contain prompt flows they create.

This typically includes:

  • Storage Blob Data Contributor
  • Storage File Data Privileged Contributor

If you’re spinning up a managed-online endpoint, you will need to grant that managed identity (if using an UMI, these are automatically added if using an SMI):

  • Storage Blob Data Reader
  • Storage Blob Data Contributor

The last thing I want to mention is specific to if you creating Private Endpoints for your default storage account (which for a secure AI Foundry, you should be). Ensure you grant each AI Foundry project managed identity Reader over the private endpoints (both file and blob) for the default storage account. This is required when previewing data from the AI Foundry Portal for use cases like uploading data for fine-tuning a model. I’m not sure where this requirement comes from, but if you don’t include it, your users will run into weird permission errors when attempting to upload data to the default storage account from within AI Foundry.

Let’s sum things up:

  • The default storage account configuration is critical to successful use of the product. Muck up authorization and prepared for pain.
  • Use identity-based authentication for connectivity to the default storage account. This will ensure auditability for who accesses what.
  • Use Entra ID authentication for your AI Foundry connections wherever possible. This will give you auditability and the ability to scope permissions via Azure RBAC.
  • If you using identity-based authentication, ensure you put in place the right permissions for the hub/project (done automatically if using SMI), user, and compute identities.
  • If you’re having trouble with users uploading data for fine-tuning via AI Foundry, your project is probably missing the read permissions over the default storage account private endpoints.
  • If you’re having trouble provisioning a managed online endpoint that is using an UMI, you are probably missing permissions on the default storage account.

That wraps up this post. Thanks folks!