AI Foundry – Encryption with CMK

This is a part of my series on AI Foundry:

Over the past week I’ve been spending time messing around with different AI Foundry implementations. The most recent implementation is one of the more complex. After much struggling with getting it up and running, I figured I’d share my experience (and frustrations) with you all. Let’s delve into the complexities (and pain) of an AI Foundry instance which uses a CMK (customer-managed key) for encryption.

Encryption in Azure

Let’s start by doing a very quick overview of encryption in Azure. Before even that, let’s talk about what encryption is. For that I will source the definition direct from NIST (National Institute for Standards and Technology) “Cryptographic transformation of data (called “plaintext”) into a form (called “ciphertext”) that conceals the data’s original meaning to prevent it from being known or used”. In layman’s terms, “make data look like junk to prevent bad keys from viewing it”. When we talk about encryption, we typically cover two strategies: in use and in transit (there is also encryption in use which is pretty neat if you need some good bedtime material). In transit is all about how we encrypt the data as it flows over a network from point A to point B. When we talk about at rest, we’re talking about the data as it sits on a some type of storage device and waits to be accessed.

Encryption in transit within Azure will come by default for almost all use cases in Azure where you are using PaaS (platform-as-a-service). You’ll be on the hook for using encrypted protocols if you build your own stuff on a virtual machine or the like. Encryption at rest is the strategy relevant to this post. Like most CSPs (Cloud Service Providers), Microsoft will typically encrypt data by default. The data (such as your file in Azure Storage) is encrypted by a symmetric encryption algorithm such as AES256. The secret key to decrypt the data is often referred to as the DEK (Data Encryption Key). The DEK is then encrypted using an asymmetric encryption algorithm such as RSA and this key is referred to as the KEK (Key Encryption Key). These encryption operations, and the keys used to support them, are typically performed on some HSM (hardware security module) the cloud provider manages behind the scenes. This model results in a good balance of performance (key used for bulk operations is small, KEK can be rotated without having to decrypt and re-encrypt the data) and security (larger key used to secure smaller key). This process is called envelope encryption.

Envelope Encryption

You may be asking “What about the encrypted DEK, where is that stored?” Great question! The answer is, it depends. Some CSPs attach it as metadata to the encrypted cipher text (This was the way Amazon would store encrypted DEKs for SSE-CMK for S3 in the past. Not sure if they still do this today) or they may discard it altogether, and retain the required data to re-create it at decryption time (this is the way Azure Storage does it).

For most Azure services (and you should validate this with each service) where there is persistent data, Microsoft is encrypting that data using the method above. The KEK in this instance is managed by Microsoft in some Microsoft-managed Key Vault instance (where operations on the key and storage of the key at rest is performed by Microsoft-managed HSMs) deployed to an Azure subscription in a Microsoft-managed Entra ID tenant. Microsoft manages access to the KEK and rotation of the KEK. If you’re ever curious of how often Microsoft manages its keys, you’ll find that information buried in the compliance documentation in the Microsoft Service Trust Portal. When Microsoft manages the KEK this is sometimes referred to as an MMK (Microsoft Managed Key), PMK (Provider Managed Key, my preferred term), or some similar term. PMKs should be your go to if they align with your organizational controls because it’s the least overhead and will introduce the fewest limitations to your designs.

If your organizational security policies require you to control access to the key, maintain logs of when the keys are used, and control rotation of the keys, you are going to be living in CMK (customer manage key) land. In that land the DEKs created by the platform are going to be encrypted by a KEK (or CMK) living in an instance of Azure Key Vault in an Azure subscription in your Entra ID tenant. It will be on you to grant access to the services that require access to that key and it will be on you to rotate that key. You will also get access to when and by who the key is used by.

I could go on and on about this topic because I find the whole cryptography field (not the math stuff, that goes over my head. There is a reason I dropped CompSci in college) really interesting. But alas, we need to get back to AI Foundry with CMKs.

AI Foundry and CMKs

Like I’ve covered in my earlier posts, every AI Foundry instance is a whole bunch of resources that are looped together with both direct and indirect relationships. AI Foundry is built on top of AML (Azure Machine Learning) and like AML, its core resource is the workspace. Unlike the AML you’re likely used to, instead of standalone workspaces, AI Foundry utilizes AML Hub and Project workspaces. I’ve explained the concept in my first post, but you think of these workspaces as having sort of a parent (hub) and child (project) relationship. This yields some of the benefits I discussed in my first post such as the ability to share connections. Encryption of the workspace and its data is what I will be focusing on.

AI Foundry Resources

It’s important to understand AI Foundry’s relationship with AML because it uses the same patterns for CMK that AML uses. Behind the scenes of every AML / AI Foundry workspace is a CosmosDB, AI Search, and (not the default) Storage Account (likely other components as well but this is what’s important for this conversation). Normally, you don’t see these resources as they’re deployed in an Azure subscription associated to a Microsoft tenant. These resources have the following purpose:

  • Azure Cosmos DB – Stores metadata for Azure AI projects and tools including index names, flow creation timestamps, deployment tabs, and metrics.
  • Azure AI Search – Stores indices that are used to help query AI Foundry content such as an index with all your model deployment names.
  • Azure Storage Account – Stores instructions for customization tasks such as JSON representation of flows you create.

The data stored within these resources is encrypted by a PMK by default. When you opt to create an AI Foundry instance, the data in these resources are what you are encrypting with your CMK. There are two methods to encrypt this data. One method is generally available and should be familar to those of you who have used AML with CMKs in the past and the other method is still in preview.

With the first method the resources above are deployed to a Microsoft-managed resource group within your subscription. This resource group will have a naming convention similar to azureml-YOURWORKSPACENAME_SOMEGUID. Unlike most patterns where Microsoft creates a resource group in your subscription, there are no denyAssignments preventing you from modifying this resources . This means you can muck these resources as much as you please if you’d like a painful experience (which will break shit when you do as we will see later). Take note you will pay additional costs for each of these resources since they now exist within your subscription.

The second method is referred to as “service-side storage”. In this model the Cosmos DB instance and AI Search instance are no longer stored in your subscription and instead are stored in an Azure subscription in a Microsoft-managed Entra ID tenant. There are still additional costs for this option since it still requires dedicated resources.

With both of these methods, you point the AI Foundry instance to a key stored in an instance of Azure Key Vault in your environment. The user-assigned managed identity associated with your hub needs appropriate permissions on the key. One catch here is the vault has to be configured with legacy Key Vault access policies. You cannot use Azure RBAC at this time (someone can correct me if this isn’t true anymore).

Now that you understand the basics of how CMK is handled, let’s focus on the first method since that is what is generally available today. There are lots of fun ways for this one to break.

The Many Ways to Break Shit

Remember when I mentioned you could modify the resources in the managed resource group? Well, that bit me when I did made modifications and it bit me when some automation run by the bosses upstairs ran at night.

One thing that popped out to me was the Cosmos DB and AI Search instances both had the service firewalls set to allow all public traffic. The Azure Storage Account firewall is enabled and configured to allow only trusted Azure services. Given I like to lock public network access down, I decided to see what would happen if I enabled the service firewall on the other two resources.

The first thing I tried to lock down was the AI Search instance. For this I disabled public access and allowed the trusted Azure services exception. After waiting around 10 minutes for the changes to take effect (really Microsoft?) I started to receive the error below indicating that the change seemed to break the service. The diagnostic logs for the AI Search instance showed it throwing 403 forbidden back (yes folks you can enable diagnostic settings for these resources and handle them the same way you would your own resources). Ok, so can’t mess with that firewall setting.

Error when managed AI Search or CosmosDB is busted

After re-enabling public network access (and waiting another ten minutes) to the AI Search instance to get back to steady state, I decided to look at Cosmos DB. When an AI Foundry workspace is configured with a Private Endpoint, Microsoft creates a Private Endpoint for the Cosmos DB instance in some Microsoft-managed virtual network. Reviewing diagnostic logs does confirm the traffic to the Cosmos DB is coming in via RFC1918 traffic so wherever this private endpoint is, it does get used.

This lead me down to the path of disabling public access for the Cosmos DB instance (which took another 10 minutes). I then tested various operations in AI Foundry and didn’t run into an issue. It SEEMS like this is supported, but I’m still waiting on the product group to confirm for me. Once I hear back, I’ll update this post.

Ok, so those were the things that I did with the resources. Now let’s talk about how automation by folks who mean well can cause chaos.

One morning I woke up to do more testing with this deployment and I started to get the same error I received when I locked down AI Search. After blowing up the lab a few times to vet it wasn’t something I broke mucking with the resources, I dug into the diagnostic logs. Within the Cosmos DB data plane diagnostic logs, I noticed that every call was being spit back complaining invalid access key. A quick check of the disableLocalAuth property of the Cosmos DB instance confirmed that some automation run by the bosses upstairs had disabled local authentication in Cosmos which breaks this whole integration.

A bit more digging showed that all three resources rely upon access keys instead of Entra ID authentication and Azure RBAC authorization. This is likely due to the multi-tenant aspects of the solution (some stuff in your tenant and some in Microsoft’s tenant). I wanted to call this one out, because I know many folks out there disable local access keys for data plane access to services. You’ll need to put in the appropriate exceptions for these three managed resources. If you don’t, you will break your Foundry instance. If you’re not sure if this is your issue, review the diagnostic logs of the resources and check to see if your error is similar to the one I provided above.

Lessons Learned

Hopefully this helps you understand a bit better about the benefits and considerations (more so considerations in this scenario) of using a CMK with AI Foundry. Fingers crossed service-side encryption mechanism in preview goes generally available in the near future and makes this painful process obselete. Until then, here are some suggestion:

  1. Remember that these resources can be modified. You should avoid making any modification (exempting diagnostic settings) to these resources.
  2. Enable diagnostic logging on the resources in the managed resource group. If anything, they will be helpful to troubleshoot if something breaks whether it be identity or networking related.
  3. Understand there is additional cost of this implementation which isn’t typical of CMK usage in my experience (minus cost of the key which is standard in every CMK integration).

AI Foundry – The Basics

This is a part of my series on AI Foundry:

Happy New Year! Over the last few months of 2024, I was buried in AI Foundry (FKA Azure AI Studio. Hey marketing needs to do something a few times a year.) proof-of-concepts with my best buddy Jose. The product is very new, very powerful, but also very complex to get up and running in a regulated environment. Through many hours labbing scenarios out and troubleshooting with customers, I figured it was time to share some of what I learned.

So what is AI Foundry? If you want the Microsoft explanation of what it is, read the documentation. Here you’ll get the Matt Felton opinion on what it is (god help you). AI Foundry is a toolset intended to help AI Engineers build Generative AI applications. It allows them to interact with the LLMs (large language models) and build complex workflows (via Prompt Flows) in a no code (like Chat Playground) or low code (prompt flow interface) environment. As you can imagine, this is an attractive tool to get the people who know Generative AI quick access to the LLMs so use cases can be validated before expensive development cycles are spent building the pretty front-end and code required to make it a real application.

Before we get into the guts of the use cases Jose and I have run into, I want to start with the basics of how the hell this service is setup. This will likely require a post or two, so grab your coffee and get ready for a crash course.

AI Foundry is built on top of AML (Azure Machine Learning). If you’re ever built out a locked down AML instance, you have some understanding of the many services that work together to provide the service. AI Foundry inherits these components and provides a sleek user interface on top and typically requires additional resources like Azure OpenAI and AI Search (for RAG use cases). Like AML there are lots and lots of pieces that you need to think about, plan for, and implement in the correct manner to make the secured instance of AI Foundry work.

One specific feature of AML plays an important role within AI Foundry and that is the concept of a hub workspace. A hub is an AML workspace that centrally manages the security, networking, compute resources, and quota for children AML workspaces. These child AML workspaces are referred to as projects. The whole goal of the hub is to make it easier for your various business units to do the stuff they need to do with AML/AI Foundry without having to manage the complex pieces like the security and networking. My guidance would be to think about giving each business unit a Foundry Hub that they can group projects of similar environment (prod or non-prod) and data sensitivity.

General relationship between AI Foundry Hub and AI Foundry Projects

Ok, so you get the basic gist of this. When you deploy an AI Foundry instance, behind the scenes an AML workspace designated as a hub is created. Each project you create is a “child” AML workspace of the hub workspace and will inherit some resources from the hub. Now that you’re grounded in that basic piece, let’s talk about the individual components.

The many resources involved with a secured AI Foundry instance

As you can see in the above image, there are a lot of components of this solution that you will likely use if you want to deploy an AI Foundry instance that has the necessary security and networking controls. Let me give you a quick and dirty explanation of each component. I’ll dive deeper into the identity, authorization, and networking aspects of these components in future posts.

  • Managed identities
    • There are lots of managed identities in use with this product. There is a managed identity for the hub, a managed identity for the project, and managed identities for the various compute. One of the challenges of AI Foundry is knowing which managed identity (and if not a managed identity, the user’s identity) is being used to access a specific resource.
  • Azure Storage Account
    • Just like in AML, there is a default storage account associated with the workspace. Unlike with a traditional AML workspace you may be familiar with, the hub feature allows all project workspaces to leverage the same storage account. The storage account is used by the workspaces to store artifacts like logs in blob storage and artifacts like prompt flows in file storage. The hub and projects isolate their data to specific containers (for blob) and folders (for file) with Azure ABAC (holy f*ck a use case for this feature finally) setup such that the managed identities for the workspaces can only access containers/folders for data related to their specific workspace.
  • Azure Key Vault
    • The Azure Key Vault will store any keys used for connections created from within the AI Foundry project. This could be keys for the default storage account or keys used for API access to models you deploy from the model catalog.
  • Azure Container Registry
    • While this is deemed optional, I’d recommend you plan on deploying it. When deploy a prompt flow there uses certain functionality to the managed compute, the container image used isn’t the default runtime and an ACR instance will be spun up automatically without all the security controls you’ll likely want.
  • Azure OpenAI Service
    • This is used for deployment of OpenAI and some Microsoft chat and embedding models
  • Azure AI Service
    • This can be used as an alternative to the Azure OpenAI Service. It has some additional functionality beyond just hosting the models such as speed and the like.
  • Azure AI Search
    • This will be used for anything RAG related. Most likely you’ll see it used with the Chat With Your Data feature of the Chat Playground.
  • Managed Network
    • This is used to host the compute instances, serverless endpoints and managed online endpoints spun up for compute operations within AI Foundry. I’ll do a deep dive into networking within the service in a future post.
  • Azure Firewall
    • If building a secured instance, you’re going to use a managed virtual network that is locked down to all Internet access via outbound rules. Under the hood an Azure Firewall instance (standard for almost all use cases) will be spun up in the managed virtual network. You interact with it through the creation of the outbound rules and can’t directly administer it. However, you will be paying for it.
  • Role Assignments
    • So so so many role assignments. I’ll cover these in a future post.
  • Azure Private DNS
    • Used heavily for interacting with the AI Foundry instance and models/endpoints you deploy. I’ll cover this in an upcoming post.

Are you frightened yet? If not you should be! Don’t worry though, over this series I’ll walk you through the pain points of getting this service up and running. Once you get past the complex configuration, it’s a crazy valuable service that you’ll have a high demand for from your business units.

In the next post I’ll walk through the complexities of authorization within AI Foundry.