AI Foundry – Encryption with CMK

This is part of my series on AI Foundry.

Over the past week I’ve been spending time messing around with different AI Foundry implementations. The most recent implementation is one of the more complex. After much struggling with getting it up and running, I figured I’d share my experience (and frustrations) with you all. Let’s delve into the complexities (and pain) of an AI Foundry instance which uses a CMK (customer-managed key) for encryption.

Encryption in Azure

Let’s start by doing a very quick overview of encryption in Azure. Before even that, let’s talk about what encryption is. For that I will source the definition directly from NIST (National Institute of Standards and Technology): “Cryptographic transformation of data (called ‘plaintext’) into a form (called ‘ciphertext’) that conceals the data’s original meaning to prevent it from being known or used”. In layman’s terms, “make data look like junk to prevent bad guys from viewing it”. When we talk about encryption, we typically cover two strategies: at rest and in transit (there is also encryption in use, which is pretty neat if you need some good bedtime material). In transit is all about how we encrypt the data as it flows over a network from point A to point B. When we talk about at rest, we’re talking about the data as it sits on some type of storage device and waits to be accessed.

Encryption in transit within Azure will come by default for almost all use cases in Azure where you are using PaaS (platform-as-a-service). You’ll be on the hook for using encrypted protocols if you build your own stuff on a virtual machine or the like. Encryption at rest is the strategy relevant to this post. Like most CSPs (Cloud Service Providers), Microsoft will typically encrypt data by default. The data (such as your file in Azure Storage) is encrypted by a symmetric encryption algorithm such as AES256. The secret key to decrypt the data is often referred to as the DEK (Data Encryption Key). The DEK is then encrypted using an asymmetric encryption algorithm such as RSA and this key is referred to as the KEK (Key Encryption Key). These encryption operations, and the keys used to support them, are typically performed on some HSM (hardware security module) the cloud provider manages behind the scenes. This model results in a good balance of performance (key used for bulk operations is small, KEK can be rotated without having to decrypt and re-encrypt the data) and security (larger key used to secure smaller key). This process is called envelope encryption.
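To make the DEK / KEK relationship concrete, here is a minimal local sketch of envelope encryption using the openssl command line. It is only an illustration under my own assumptions: the file names, the use of AES-256-CBC with an explicit IV, and RSA-OAEP for wrapping are choices I made for the demo, not a description of what Azure does internally, and real services keep the KEK inside an HSM rather than in a PEM file.

```shell
set -eu
work=$(mktemp -d)
printf 'my file in Azure Storage\n' > "$work/plain.txt"

# DEK: a random 256-bit symmetric key, plus a 128-bit IV for AES-256-CBC.
dek=$(openssl rand -hex 32)
iv=$(openssl rand -hex 16)

# Bulk-encrypt the data with the DEK.
# (-K puts the key on the command line; acceptable for a demo only.)
openssl enc -aes-256-cbc -K "$dek" -iv "$iv" \
  -in "$work/plain.txt" -out "$work/data.enc"

# KEK: an RSA key pair standing in for the key held in Key Vault.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 \
  -out "$work/kek.pem" 2>/dev/null
openssl pkey -in "$work/kek.pem" -pubout -out "$work/kek.pub.pem"

# Wrap (encrypt) the DEK with the KEK. Only the wrapped DEK is kept.
printf '%s' "$dek" > "$work/dek.hex"
openssl pkeyutl -encrypt -pubin -inkey "$work/kek.pub.pem" \
  -pkeyopt rsa_padding_mode:oaep \
  -in "$work/dek.hex" -out "$work/dek.wrapped"
rm "$work/dek.hex"

# Decrypt path: unwrap the DEK with the KEK, then decrypt the data.
unwrapped=$(openssl pkeyutl -decrypt -inkey "$work/kek.pem" \
  -pkeyopt rsa_padding_mode:oaep -in "$work/dek.wrapped")
openssl enc -d -aes-256-cbc -K "$unwrapped" -iv "$iv" \
  -in "$work/data.enc" -out "$work/roundtrip.txt"

cmp "$work/plain.txt" "$work/roundtrip.txt" && echo "round trip OK"
```

Notice that rotating the KEK in this sketch would only require unwrapping and re-wrapping the small dek.wrapped file. The bulk ciphertext in data.enc never has to be touched, which is the performance benefit described above.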

Envelope Encryption

You may be asking “What about the encrypted DEK, where is that stored?” Great question! The answer is, it depends. Some CSPs attach it as metadata to the encrypted ciphertext (this was the way Amazon would store encrypted DEKs for SSE-KMS for S3 in the past; not sure if they still do this today) or they may discard it altogether and retain the data required to re-create it at decryption time (this is the way Azure Storage does it).

For most Azure services (and you should validate this with each service) where there is persistent data, Microsoft is encrypting that data using the method above. The KEK in this instance is managed by Microsoft in some Microsoft-managed Key Vault instance (where operations on the key and storage of the key at rest are performed by Microsoft-managed HSMs) deployed to an Azure subscription in a Microsoft-managed Entra ID tenant. Microsoft manages access to the KEK and rotation of the KEK. If you’re ever curious how often Microsoft rotates its keys, you’ll find that information buried in the compliance documentation in the Microsoft Service Trust Portal. When Microsoft manages the KEK this is sometimes referred to as an MMK (Microsoft Managed Key), PMK (Provider Managed Key, my preferred term), or some similar term. PMKs should be your go-to if they align with your organizational controls because they carry the least overhead and will introduce the fewest limitations to your designs.

If your organizational security policies require you to control access to the key, maintain logs of when the keys are used, and control rotation of the keys, you are going to be living in CMK (customer-managed key) land. In that land the DEKs created by the platform are going to be encrypted by a KEK (or CMK) living in an instance of Azure Key Vault in an Azure subscription in your Entra ID tenant. It will be on you to grant access to the services that require access to that key and it will be on you to rotate that key. You will also get visibility into when and by whom the key is used.

I could go on and on about this topic because I find the whole cryptography field (not the math stuff, that goes over my head. There is a reason I dropped CompSci in college) really interesting. But alas, we need to get back to AI Foundry with CMKs.

AI Foundry and CMKs

Like I’ve covered in my earlier posts, every AI Foundry instance is a whole bunch of resources that are looped together with both direct and indirect relationships. AI Foundry is built on top of AML (Azure Machine Learning) and like AML, its core resource is the workspace. Unlike the AML you’re likely used to, instead of standalone workspaces, AI Foundry utilizes AML Hub and Project workspaces. I’ve explained the concept in my first post, but you can think of these workspaces as having sort of a parent (hub) and child (project) relationship. This yields some of the benefits I discussed in my first post such as the ability to share connections. Encryption of the workspace and its data is what I will be focusing on.

AI Foundry Resources

It’s important to understand AI Foundry’s relationship with AML because it uses the same patterns for CMK that AML uses. Behind the scenes of every AML / AI Foundry workspace is a CosmosDB, AI Search, and (not the default) Storage Account (likely other components as well but this is what’s important for this conversation). Normally, you don’t see these resources as they’re deployed in an Azure subscription associated to a Microsoft tenant. These resources have the following purpose:

  • Azure Cosmos DB – Stores metadata for Azure AI projects and tools including index names, flow creation timestamps, deployment tabs, and metrics.
  • Azure AI Search – Stores indices that are used to help query AI Foundry content such as an index with all your model deployment names.
  • Azure Storage Account – Stores instructions for customization tasks such as JSON representation of flows you create.

The data stored within these resources is encrypted by a PMK by default. When you opt to create an AI Foundry instance with a CMK, the data in these resources is what you are encrypting with your CMK. There are two methods to encrypt this data. One method is generally available and should be familiar to those of you who have used AML with CMKs in the past, and the other method is still in preview.

With the first method the resources above are deployed to a Microsoft-managed resource group within your subscription. This resource group will have a naming convention similar to azureml-YOURWORKSPACENAME_SOMEGUID. Unlike most patterns where Microsoft creates a resource group in your subscription, there are no denyAssignments preventing you from modifying these resources. This means you can muck with these resources as much as you please if you’d like a painful experience (which will break shit when you do, as we will see later). Take note you will pay additional costs for each of these resources since they now exist within your subscription.

The second method is referred to as “service-side storage”. In this model the Cosmos DB instance and AI Search instance are no longer stored in your subscription and instead are stored in an Azure subscription in a Microsoft-managed Entra ID tenant. There are still additional costs for this option since it still requires dedicated resources.

With both of these methods, you point the AI Foundry instance to a key stored in an instance of Azure Key Vault in your environment. The user-assigned managed identity associated with your hub needs appropriate permissions on the key. One catch here is the vault has to be configured with legacy Key Vault access policies. You cannot use Azure RBAC at this time (someone can correct me if this isn’t true anymore).
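As a rough sketch of what granting that access can look like with legacy access policies, the helper below wraps az keyvault set-policy. The function name and its arguments are hypothetical, and the permission set (get, wrapKey, unwrapKey) is my assumption of what “appropriate permissions” means for a key used as a KEK, so validate it against the current documentation before relying on it.

```shell
# Hypothetical helper: grant a managed identity key permissions through a
# legacy Key Vault access policy (not Azure RBAC).
grant_cmk_access() {
  local vault_name=$1          # name of the Key Vault holding the CMK
  local identity_object_id=$2  # object (principal) ID of the managed identity
  # Assumption: get + wrapKey + unwrapKey is the permission set needed for
  # envelope encryption with a customer-managed KEK.
  az keyvault set-policy \
    --name "$vault_name" \
    --object-id "$identity_object_id" \
    --key-permissions get wrapKey unwrapKey
}
```

You would call it as grant_cmk_access MY_VAULT_NAME IDENTITY_OBJECT_ID, where both values are placeholders for your own environment.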

Now that you understand the basics of how CMK is handled, let’s focus on the first method since that is what is generally available today. There are lots of fun ways for this one to break.

The Many Ways to Break Shit

Remember when I mentioned you could modify the resources in the managed resource group? Well, that bit me when I made modifications and it bit me when some automation run by the bosses upstairs ran at night.

One thing that popped out to me was the Cosmos DB and AI Search instances both had the service firewalls set to allow all public traffic. The Azure Storage Account firewall is enabled and configured to allow only trusted Azure services. Given I like to lock public network access down, I decided to see what would happen if I enabled the service firewall on the other two resources.

The first thing I tried to lock down was the AI Search instance. For this I disabled public access and allowed the trusted Azure services exception. After waiting around 10 minutes for the changes to take effect (really Microsoft?) I started to receive the error below indicating that the change seemed to break the service. The diagnostic logs for the AI Search instance showed it throwing 403 forbidden back (yes folks you can enable diagnostic settings for these resources and handle them the same way you would your own resources). Ok, so can’t mess with that firewall setting.

Error when managed AI Search or CosmosDB is busted

After re-enabling public network access (and waiting another ten minutes) to the AI Search instance to get back to steady state, I decided to look at Cosmos DB. When an AI Foundry workspace is configured with a Private Endpoint, Microsoft creates a Private Endpoint for the Cosmos DB instance in some Microsoft-managed virtual network. Reviewing diagnostic logs does confirm the traffic to the Cosmos DB is coming from RFC 1918 addresses, so wherever this private endpoint is, it does get used.

This led me down the path of disabling public access for the Cosmos DB instance (which took another 10 minutes). I then tested various operations in AI Foundry and didn’t run into an issue. It SEEMS like this is supported, but I’m still waiting on the product group to confirm for me. Once I hear back, I’ll update this post.

Ok, so those were the things that I did with the resources. Now let’s talk about how automation by folks who mean well can cause chaos.

One morning I woke up to do more testing with this deployment and I started to get the same error I received when I locked down AI Search. After blowing up the lab a few times to vet it wasn’t something I broke mucking with the resources, I dug into the diagnostic logs. Within the Cosmos DB data plane diagnostic logs, I noticed that every call was being spit back with a complaint about an invalid access key. A quick check of the disableLocalAuth property of the Cosmos DB instance confirmed that some automation run by the bosses upstairs had disabled local authentication in Cosmos, which breaks this whole integration.

A bit more digging showed that all three resources rely upon access keys instead of Entra ID authentication and Azure RBAC authorization. This is likely due to the multi-tenant aspects of the solution (some stuff in your tenant and some in Microsoft’s tenant). I wanted to call this one out, because I know many folks out there disable local access keys for data plane access to services. You’ll need to put in the appropriate exceptions for these three managed resources. If you don’t, you will break your Foundry instance. If you’re not sure if this is your issue, review the diagnostic logs of the resources and check to see if your error is similar to the one I provided above.
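If you want a quick way to check whether key-based authentication has been switched off on the three managed resources, something like the sketch below may help. The function name is hypothetical, and while disableLocalAuth on Cosmos DB is the property I checked above, the disableLocalAuth query for AI Search and the allowSharedKeyAccess query for the Storage Account are my assumptions about the az CLI output shape, so confirm them against your own az ... show output.

```shell
# Hypothetical helper: report whether key-based (local) auth is still enabled
# on the Cosmos DB, AI Search, and Storage resources in the managed group.
check_local_auth() {
  local rg=$1 cosmos=$2 search=$3 storage=$4
  local cosmos_disabled search_disabled shared_key

  # Normalize to lowercase so "True" and "true" are treated the same.
  cosmos_disabled=$(az cosmosdb show --resource-group "$rg" --name "$cosmos" \
    --query disableLocalAuth --output tsv | tr '[:upper:]' '[:lower:]')
  search_disabled=$(az search service show --resource-group "$rg" --name "$search" \
    --query disableLocalAuth --output tsv | tr '[:upper:]' '[:lower:]')
  shared_key=$(az storage account show --resource-group "$rg" --name "$storage" \
    --query allowSharedKeyAccess --output tsv | tr '[:upper:]' '[:lower:]')

  echo "cosmos disableLocalAuth=$cosmos_disabled"
  echo "search disableLocalAuth=$search_disabled"
  echo "storage allowSharedKeyAccess=$shared_key"

  if [ "$cosmos_disabled" = "true" ] || [ "$search_disabled" = "true" ] \
    || [ "$shared_key" = "false" ]; then
    echo "WARNING: key-based auth is disabled on at least one managed resource"
    return 1
  fi
}
```

Run it as check_local_auth MANAGED_RG COSMOS_NAME SEARCH_NAME STORAGE_NAME (all placeholders). A non-zero return code is a hint that some policy or automation has flipped one of these settings.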

Lessons Learned

Hopefully this helps you understand a bit better the benefits and considerations (more so considerations in this scenario) of using a CMK with AI Foundry. Fingers crossed the service-side encryption mechanism in preview goes generally available in the near future and makes this painful process obsolete. Until then, here are some suggestions:

  1. Remember that these resources can be modified. You should avoid making any modifications (other than diagnostic settings) to these resources.
  2. Enable diagnostic logging on the resources in the managed resource group. If anything, they will be helpful to troubleshoot if something breaks whether it be identity or networking related.
  3. Understand there is additional cost of this implementation which isn’t typical of CMK usage in my experience (minus cost of the key which is standard in every CMK integration).

Deploying Resources Across Multiple Azure tenants

Hello fellow geeks! Today I’m going to take a break from my AI Foundry series and save your future self some time by walking you through a process I had to piece together from disparate links, outdated documentation, and experimentation. Despite what you hear, I’m a nice guy like that!

The Problem

Recently, I’ve been experimenting with the AVNM (Azure Virtual Network Manager) IPAM (IP address management) solution, which is currently in public preview. In the future I’ll do a blog series on the product, but today I’m going to focus on some of the processes I went through to get a POC (proof-of-concept) working with this feature across two tenants. The scenario was a management and managed tenant concept where the management tenant is the authority for the pools of IP addresses the managed tenant can draw upon for the virtual networks it creates.

Let’s first level set on terminology. When I say Azure tenant, what I’m really talking about is the Entra ID tenant the Microsoft Azure subscriptions are associated with. Every subscription in Azure can be associated with only one Entra ID tenant. Entra ID provides identity and authentication services to associated Azure subscriptions. Note that I excluded authorization, because Azure has its own authorization engine in Azure RBAC (role-based access control).

Relationship between Entra ID tenant and Azure resources

Without deep diving into AVNM, its IPAM feature uses the concept of “pools”, which are collections of IP CIDR blocks. Pools can have a parent and child relationship where one large pool can be carved into smaller pools. Virtual networks in the same region as the pool can be associated with these pools (either before or after creation of the virtual network) to draw down upon the CIDR blocks associated with the pool. You also have the option of creating an object called a Static CIDR, which can be used to represent the consumption of IP space on-premises or in another cloud. For virtual networks, as resources are provisioned into them, IPAM will report how many of the allocated IP addresses in a specific pool are being used. This allows you to track how much IP space you’ve consumed across your Azure estate.

AVNM IPAM Resources

My primary goal in this scenario was to create a virtual network in TenantB that would draw down on an AVNM address pool in TenantA. This way I could emulate a parent company managing the IP allocation and usage across its many child companies which could be spread across multiple Azure tenants. To do this I needed to:

  1. Create an AVNM instance in TenantA.
  2. Set up the cross-tenant AVNM feature in both tenants.
  3. Create a multi-tenant service principal in TenantB.
  4. Create a stub service principal in TenantA representing the service principal in TenantB.
  5. Grant the stub service principal the IPAM Pool User Azure RBAC role.
  6. Create a new virtual network in TenantB and reference the IPAM pool in TenantA.

My architecture would look similar to the image below.

Multi-Tenant AVNM IPAM Architecture

With all that said, I’m now going to get into the purpose of this post, which is focusing on steps 3, 4, and 6.

Multi-Tenant Service Principals

Service principals are objects in Entra ID used to represent non-human identities. They are similar to an AWS IAM user but cannot be used for interactive login. The whole purpose is non-interactive login by a non-human. Yes, even the Azure Managed Identity you’ve been using is a service principal under the hood.

Unlike Entra ID users, service principals can’t be added to another tenant through the Entra B2B feature. To make a service principal available across multiple tenants you need to create what is referred to as a multi-tenant service principal. A multi-tenant service principal has an authoritative tenant (I’ll refer to this as the trusted tenant) where the service principal exists as an object with a credential. The service principal has an attribute named appId, which is a unique GUID representing the app across all of Entra. Other tenants (I’ll refer to these as trusting tenants) can then create a stub service principal in their tenant by specifying this appId at creation. Entra ID will represent this stub service principal in the trusting tenant as an Enterprise Application within Entra.

Multi-Tenant Service Principal in Trusted Tenant

For my use case I wanted to have my multi-tenant service principal stored authoritatively in TenantB (the managed tenant) because that is where I would be deploying my virtual network via Terraform. I had an existing service principal I was already using, so I ran the command below to update the existing service principal to support multi-tenancy. The signInAudience attribute is what dictates whether a service principal is single-tenant (AzureADMyOrg) or multi-tenant (AzureADMultipleOrgs).

az login --tenant <TENANTB_ID>
az ad app update --id "d34d51b2-34b4-45d9-b6a8-XXXXXXXXXXXX" --set signInAudience=AzureADMultipleOrgs

Once my service principal was updated to a multi-tenant service principal I next had to provision it into TenantA (management tenant) using the command below.

az login --tenant <TENANTA_ID>
az ad sp create --id "d34d51b2-34b4-45d9-b6a8-XXXXXXXXXXXX"

The id parameter in each command is the appId property of the service principal. By creating a new service principal in TenantA with the same appId I am creating the stub service principal for my service principal in TenantB.
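If you want to confirm the stub landed correctly, a small check like the one below can help. This is a sketch under assumptions: the function name is hypothetical, and I am assuming the service principal object returned by az ad sp show exposes the owning tenant as appOwnerOrganizationId (which is what Microsoft Graph calls it). Run it while logged in to TenantA.

```shell
# Hypothetical helper: show the stub service principal and its home tenant.
show_stub_sp() {
  local app_id=$1   # the appId shared by the service principal and its stub
  # id is the stub's own object ID in the current tenant;
  # appOwnerOrganizationId (assumed property) is the tenant that owns the app.
  az ad sp show --id "$app_id" \
    --query "{appId:appId, objectId:id, homeTenant:appOwnerOrganizationId}" \
    --output json
}
```

In my scenario I would expect homeTenant to come back as the TenantB ID even though the command runs against TenantA, since TenantB is where the credential lives.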

Many of the guides you’ll find online will tell you that you need to grant Admin Consent. I did not find this necessary. I’m fairly certain it’s not necessary because the service principal does not need any tenant-wide permissions and won’t be acting on behalf of any user. Instead, it will exercise its direct permissions against the ARM API (Azure Resource Manager) based on the Azure RBAC role assignments created for it.

Once these commands were run, the service principal appeared as an Enterprise Application in TenantA. From there I was able to log into TenantA and create an RBAC role assignment associating the IPAM Pool User role with the service principal.
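For reference, the role assignment can also be created from the CLI. The helper below is a minimal sketch: the function name is hypothetical, and scoping the assignment directly to the IPAM pool resource is my assumption (a broader scope such as the network manager or resource group would also be a valid choice depending on your needs).

```shell
# Hypothetical helper: assign the IPAM Pool User role to the stub
# service principal. Run while logged in to TenantA.
grant_ipam_pool_user() {
  local app_id=$1    # appId of the multi-tenant service principal
  local pool_id=$2   # full resource ID of the IPAM pool in TenantA
  az role assignment create \
    --assignee "$app_id" \
    --role "IPAM Pool User" \
    --scope "$pool_id"
}
```

With the values from this post, pool_id would look like /subscriptions/11487ac1-b0f2-4b3a-84fa-XXXXXXXXXXXX/resourceGroups/rg-avnm-test/providers/Microsoft.Network/networkManagers/test/ipamPools/main.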

Creating The New VNet in TenantB with Terraform… and Failing

At this point I was ready to create the new virtual network in TenantB with an address space allocated from an IPAM pool in TenantA. Like any sane human being writing Terraform code, my first stop was the reference documentation for the AzureRm provider. Sadly my spirits were quickly crushed (as often happens with that provider) because the provider version I was using (4.21.1) does not yet support the ipamPoolPrefixAllocations property for virtual networks. I chatted with the product group and support for it will be coming soon.

When the AzureRm provider fails (as it feels like it often does with any new feature), my fallback is the AzApi provider. Given that AzApi is a very light overlay on top of the ARM REST API, I was confident I’d be able to use it with the proper ARM REST API version to create my virtual network. I wrote my code and ran my terraform apply only to run into an error.

Forbidden({"error":{"code":"LinkedAuthorizationFailed","message":"The client has permission to perform action 'Microsoft.Network/networkManagers/ipamPools/associateResourcesToPool/action' on scope '/subscriptions/97515654-3331-440d-8cdf-XXXXXXXXXXXX/resourceGroups/rg-demo-avnm/providers/Microsoft.Network/virtualNetworks/vnettesting', however the current tenant '6c80de31-d5e4-4029-93e4-XXXXXXXXXXXX' is not authorized to access linked subscription '11487ac1-b0f2-4b3a-84fa-XXXXXXXXXXXX'."}})

When performing cross-tenant activities via the ARM API, the platform needs to authenticate the security principal to both Entra tenants. From a raw REST API call this can be accomplished by adding the x-ms-authorization-auxiliary header to the headers in the API call. In this header you include a bearer token for the second Entra ID tenant that you need to authenticate to.
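As a rough illustration of what that raw call looks like, here is a sketch using curl. The function name and argument order are hypothetical. The part that matters is that the token for the tenant owning the target resource goes in the standard Authorization header, while the token for the other tenant goes in x-ms-authorization-auxiliary, also prefixed with Bearer.

```shell
# Hypothetical helper: PUT to the ARM API with tokens for two tenants.
arm_put_cross_tenant() {
  local primary_token=$1    # token for the tenant that owns the target resource
  local auxiliary_token=$2  # token for the tenant that owns the linked resource
  local url=$3              # full ARM URL including api-version
  local body=$4             # JSON request body
  curl --silent --show-error --request PUT "$url" \
    --header "Authorization: Bearer ${primary_token}" \
    --header "x-ms-authorization-auxiliary: Bearer ${auxiliary_token}" \
    --header "Content-Type: application/json" \
    --data "$body"
}
```

In my scenario the primary token would come from TenantB (where the virtual network is created) and the auxiliary token from TenantA (where the IPAM pool lives).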

Both the AzureRm and AzApi providers support this feature through the auxiliary_tenant_ids property of the provider. Passing that property will cause REST calls to be made to the Entra ID login endpoints for each tenant to obtain an access token. The tenant specified in auxiliary_tenant_ids has its bearer token passed in the API calls in the x-ms-authorization-auxiliary header. Well, that’s the way it’s supposed to work. However, after some Fiddler captures I noticed it was not happening with AzApi v2.1.0 and v2.2.0. After some Googling I turned up this GitHub repository issue where someone was reporting this as far back as February 2024. It was supposedly resolved in v1.13.1, but I noticed a person posting just a few weeks ago that it was still broken. My testing also seemed to indicate it is still busted.

What to do now? My next thought was to use the AzureRm provider and pass an ARM template using the azurerm_resource_group_template_deployment resource. I dug deep into the recesses of my brain to surface my ARM template skills and I whipped up a template. I typed in terraform apply and crossed my fingers. My Fiddler capture showed both access tokens being retrieved (YAY!) and being passed in the underlying API call, but sadly the deployment still failed. I had forgotten that ARM templates do not support referencing resources outside the Entra ID tenant the deployment is being pushed to. Foiled again.

My only avenue left was a REST API call. For that I used the az rest command (greatest thing since sliced bread to hit ARM endpoints). Unlike PowerShell, the az CLI does not support any special option for auxiliary tenants. Instead, I needed to run az login against each tenant and store the second tenant’s bearer token in a variable.

az login --service-principal --username "d34d51b2-34b4-45d9-b6a8-XXXXXXXXXXXX" --password "CLIENT_SECRET" --tenant "<TENANTB_ID>"

az login --service-principal --username "d34d51b2-34b4-45d9-b6a8-XXXXXXXXXXXX" --password "CLIENT_SECRET" --tenant "<TENANTA_ID>"

auxiliaryToken=$(az account get-access-token \
--resource=https://management.azure.com/ \
--tenant "<TENANTA_ID>" \
--query accessToken -o tsv)

Once I had my bearer tokens, the next step was to make my REST API call.

az rest --method put \
  --uri "https://management.azure.com/subscriptions/97515654-3331-440d-8cdf-XXXXXXXXXXXX/resourceGroups/rg-demo-avnm/providers/Microsoft.Network/virtualNetworks/vnettesting?api-version=2022-07-01" \
  --headers "x-ms-authorization-auxiliary=Bearer ${auxiliaryToken}" \
  --body '{
    "location": "centralus",
    "properties": {
      "addressSpace": {
        "ipamPoolPrefixAllocations": [
          {
            "numberOfIpAddresses": "100",
            "pool": {
              "id": "/subscriptions/11487ac1-b0f2-4b3a-84fa-XXXXXXXXXXXX/resourceGroups/rg-avnm-test/providers/Microsoft.Network/networkManagers/test/ipamPools/main"
            }
          }
        ]
      }
    }
  }'

I received a 200 status code, which meant my virtual network was created successfully. Sure enough, the new virtual network appeared in TenantB, and in TenantA I saw the virtual network associated with the IPAM pool.

Summing It Up

Hopefully the content above saves someone from wasting far too much time trying to get cross-tenant stuff to work in a non-interactive manner. While my end solution isn’t what I’d prefer to do, it was my only option due to the issues with the Terraform providers. Hopefully, the issue with the AzApi provider is remediated soon. For AVNM IPAM specifically, support in the AzureRm provider should be here soon, so the usage of AzApi will likely not be necessary.

What I hope you took out of this is a better understanding of how cross tenant actions like this work under the hood from an identity, authentication, and authorization perspective. You should also have a better understanding of what is happening (or not happening) under the hood of those Terraform providers we hold so dear.

TLDR;

When performing cross tenant deployments here is your general sequence of events:

  1. Create a multi-tenant service principal in Tenant A.
  2. Create a stub service principal in Tenant B.
  3. Grant the stub service principal in Tenant B the appropriate Azure RBAC permissions.
  4. Obtain an access token from both Tenant A and Tenant B.
  5. Package one of the access tokens in the x-ms-authorization-auxiliary header in your request and make your request. You can use the az rest command like I did above or use another tool. Key thing is to make sure you pass it in addition to the standard Authorization header.
  6. ???
  7. Profit!

Thanks for reading!