AI Foundry – Identity, Authentication and Authorization

Posted on January 27, 2025 by mattfeltonma

This is a part of my series on AI Foundry:

Updates:

3/17/2025 – Updated diagrams to include new identities and RBAC roles that are recommended as a minimum

Yes, I’m going to re-use the outline from my Azure OpenAI series. You wanna fight about it? This means we’re going to now talk about one of the most important (as per usual) and complicated (oh so complicated) topics in AI Foundry: identity, authentication, and authorization. If you haven’t read my prior two posts, you should take a few minutes and read through them. They’ll give you the baseline you’ll need to get the most out of this post. So put on your coffee, break out the metal clips to keep your eyes open Clockwork Orange-style, and prepare for a dip into the many ways identity, authN, and authZ are handled within the service.

As I covered in my first post, Foundry is made up of a ton of different services. Each of these services plays a part in features within Foundry, some may support multiple forms of authentication, and most will be accessed by the many types of identities used within the product. Understanding how each identity is used will be critical in getting authorization right. Missing Azure RBAC role assignments are the number one misconfiguration (right above networking, which is also complicated as we’ll see in a future post).

Let’s start first with identity. There will generally be four types of identities used in AI Foundry. These identities will be a combination of human identities and non-human identities. Your humans will be your AI Engineers, developers, and central IT and will use their Entra ID user identities. Your non-humans will include the AI Foundry hub, project, and compute you provision for different purposes. In general, identities are used in the following way (this is not inclusive of all things, just the ones I’ve noticed):

Humans
- Entra ID Users
  - Actions within Azure Portal
  - Actions within AI Foundry Studio
    - Running a prompt flow from the GUI
    - Using the Chat Playground to send prompts to an LLM
    - Running the Chat-With-Your-Data workflow within the Chat Playground
    - Creating a new project within a hub
  - Actions using Azure CLI such as sending an inference to a managed online endpoint that supports Entra ID authentication
Non-Humans
- AI Foundry Hub Managed Identity
  - Accessing the Azure Key Vault associated with the Foundry instance to create secrets or pull secrets when AI Foundry connections are created using credentials versus Entra ID
  - Modify properties of the default Azure Storage Account such as setting CORS policies
  - Creating managed private endpoints for hub resources if a managed virtual network is used
- AI Foundry Project Managed Identity
  - Accessing the Azure Key Vault associated with the Foundry instance to create secrets or pull secrets when AI Foundry connections are created using credentials versus Entra ID
  - Creating blob containers for project where project artifacts such as logs and metrics are stored
  - Creating file share for project where project artifacts such as user-created Prompt Flow files are stored
- Compute
  - Pulling container image from Azure Container Registry when deploying prompt flows that require custom environments
  - Accessing default storage account project blob container to pull data needed to boot
  - Much much more in this category. Really depends on what you’re doing

Alright, so you understand the identities that will be used and you have a general idea of how they’ll be used to perform different tasks within the Foundry ecosystem. Let’s now talk authentication.

Authentication in Foundry isn’t too complicated (in comparison to identity and authorization). Authenticating to the Azure Portal and the Foundry Studio is always going to be Entra ID-based authentication. Authentication to other Azure resources from Foundry is where it can get interesting. As I covered in my prior post, Foundry will typically support two methods of authentication: Entra ID and API key based (which Foundry often refers to as credentials). If at all possible, you’ll want to lean into Entra ID-based authentication whenever you access a resource from Foundry. As we’ll see in the next section around authorization, this will have benefits. Besides authorization, you’ll also get auditability because the logs will show the actual security principal that accessed the resource.

If you opt to use credential-based authentication for your connections to Azure resources, you’ll lose out in a few different areas. When credential-based authentication is used, users will access connected resources within Foundry using the keys stored in the Foundry connection object. This means the user assumes whatever permissions the key has (which is typically all data-plane permissions but could be more restrictive in instances like a SAS token). Besides the authorization challenges, you’ll also lose out on traceability. AI Foundry (and the underlying Azure Machine Learning) has some authorization (via Azure RBAC roles) that is used to control access to connections, but very little in the way of auditing who exercised what connection when. For these reasons, you should use Entra ID where possible.
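To make the traceability difference concrete, here is a minimal sketch of what the two authentication styles look like on the wire. The `api-key` header name follows the Azure OpenAI convention (other services name it differently), and the function name and token values are placeholders I made up for illustration.

```python
from typing import Dict, Optional


def build_auth_headers(api_key: Optional[str] = None,
                       bearer_token: Optional[str] = None) -> Dict[str, str]:
    """Return HTTP headers for one of the two authentication styles.

    Exactly one of api_key or bearer_token must be supplied.
    """
    if (api_key is None) == (bearer_token is None):
        raise ValueError("Supply exactly one of api_key or bearer_token")
    if bearer_token is not None:
        # Entra ID: the token identifies the calling security principal,
        # so the downstream service can log who made the request.
        return {"Authorization": f"Bearer {bearer_token}"}
    # Key-based: every caller sharing this key looks identical to the service.
    return {"api-key": api_key}


# Placeholder values only; a real token would come from Entra ID.
print(build_auth_headers(bearer_token="<entra-id-access-token>"))
print(build_auth_headers(api_key="<shared-key>"))
```

With the key, nothing in the request says who is calling. That is the whole auditing problem in one line.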

Ready for authorization? Nah, not yet. Before we get into authorization, it’s important to understand that these identities can generally be used in two ways: direct or indirect (on-behalf-of). For example, let’s say you run a Prompt Flow from the AI Foundry interface. While the code runs on a serverless compute provisioned in a Microsoft-managed network (more on that in a future post), the identity context it uses to access downstream resources is actually yours. Now if you deploy that same prompt flow to a managed online endpoint, the code will run on that endpoint and use the managed identity assigned to the endpoint’s compute. Not so simple, is it?

So how do you know which identity will be used? Observe my general guidance from up above. If you’re running things from the GUI, it’s likely your identity; if you’re deploying stuff to compute, it’s likely the identity associated with the compute. There are exceptions to the rule. For example, when you attempt to upload data for fine-tuning or use the on-your-own-data feature in the Chat Playground, and your default storage account is behind a private endpoint, your identity will be used to access the data, but the managed identity associated with the project is used to access the private endpoint resource. Why does it need access to the Private Endpoint? I got no idea, it just does. If you don’t grant it, good luck to you poor soul because you’re going to have a hell of a time troubleshooting it.

Another interesting deviation is when using the Chat Playground Chat With Your Data feature. If you opt to add your data and build the index directly within AI Foundry, there will be a mixed usage of the user identity, AI Search managed identity (which communicates with the embedding models deployed in the AI Services or Azure OpenAI instance to create the vector representations of the chunks in the index), and AI Services or Azure OpenAI managed identity (creates index and data sources in AI Search). It can get very complex.
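The rule of thumb and the private endpoint exception can be written down as a tiny lookup. This is only a sketch of my guidance, not anything the service exposes. The function name, the `run_from` values, and the flag are all hypothetical, and it deliberately ignores the Chat With Your Data mix of identities.

```python
from typing import List

USER = "user (Entra ID)"
PROJECT_MI = "project managed identity"
COMPUTE_MI = "compute managed identity"


def likely_identities(run_from: str,
                      upload_to_private_endpoint_storage: bool = False) -> List[str]:
    """Guess which identities touch downstream resources.

    run_from: "gui" for actions started in the Foundry interface,
              "deployed_compute" for code deployed to compute such as
              a managed online endpoint.
    """
    if run_from == "gui":
        identities = [USER]
        if upload_to_private_endpoint_storage:
            # Exception: the project identity reads the private endpoint
            # resource while the user identity accesses the data.
            identities.append(PROJECT_MI)
        return identities
    if run_from == "deployed_compute":
        return [COMPUTE_MI]
    raise ValueError(f"Unknown run_from value: {run_from!r}")


print(likely_identities("gui"))
print(likely_identities("gui", upload_to_private_endpoint_storage=True))
print(likely_identities("deployed_compute"))
```

Treat the answer as a starting guess and confirm it against the downstream service logs.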

The image below represents most of the flows you’ll come across.

**The many AI Foundry authentication flows and identity patterns**

Okay, now authorization? Yes, authorization. I’m not one for bullshitting, so I’ll just tell you up front authorization in Foundry can be hard. It gets even harder when you lock down networking because often the error messages you will receive are the same for blocked traffic and failed authorization. The complexities of authorization are exactly why I spent so much time explaining identity and authentication to you. I wish I could tell you every permission in every scenario, but it would take many, many posts to do that. Instead, I’d advise you to do what I sometimes fail to do, which is RTFM (go ahead and Google that). This particular product group has made strong efforts to document required permissions, so your first stop should always be the Foundry public documentation. In some instances, you will also need to consult the Azure Machine Learning documentation (again, this is built on top of AML) because the Foundry documentation sometimes assumes you already know a feature is inherited from AML (yeah, not fair but it’s reality).

In general, at an absolute minimum, the permissions assigned to the identities below will get you started as of the date of this post (updated 3/17/2025).

As I covered in my prior posts, the AI Foundry Hub can use either a system-assigned or user-assigned managed identity. You won’t hear me say this often, but just use the system-assigned managed identity if you can for the hub. The required permissions will be automatically assigned and it will be one less thing for you to worry about. Using the permissions listed above should work for a user-assigned managed identity as well (this is on my backburner to re-validate).

A project will always use a system-assigned managed identity. The only permission listed above that you’ll need to manually grant is Reader over the Private Endpoint for the default storage account. This is only required if you’re using a private endpoint for your default storage account. There may be additional permissions required by the project depending on the activities you are performing and data you are accessing.

On the user side, the permissions above will put you in a good place for your typical developer or AI engineer to use most of the features within Foundry. If you’re interacting with other resources (such as an AI Search index when using the on-your-own-data feature) you’ll need to ensure the user is granted appropriate permissions to those resources as well (typically Search Service Contributor on the management plane to list and create indexes, and Search Index Data Contributor on the data plane to create and view records within an index). If your user is fine-tuning a model that is deployed within the Azure OpenAI or AI Service instance, they may additionally need the Azure OpenAI Service Contributor role (to upload the file via Foundry for fine-tuning). Yeah, lots of scenarios and lots of varying permissions for the user, but that covers the most common ones I’ve run into.

Lastly, we have the compute identities. There is no standard here. If you’ve deployed a prompt flow to a managed online endpoint, the compute will need the permissions to connect to the resources behind the connections (again assuming Entra ID is configured for the connection; if using credentials, Azure Machine Learning Workspace Secrets Reader on the project is likely sufficient). Using a prompt flow that requires a custom environment may require an image to be pushed to the Azure Container Registry, which the compute will pull, so it will need the AcrPull RBAC role assignment on the registry.

Complicated right? What happens when stuff doesn’t work? Well, in that scenario you need to look at the logs (both Azure Activity Log and diagnostic logging for the relevant service such as blob, Search, OpenAI and the like). That will tell you what the user is failing to do (again, only if you’re using Entra ID for your connections) and help you figure out what needs to be added from a permissions perspective. If you’re using credentials for your connections, the most common issue with them is with the default storage account where the storage account has had the storage access keys disabled.

Here are the key things I want you to take away from this:

- Know the identity being used. If you don’t know which identity is being used, you’ll never get authorization right. Use the logs of the downstream service if you’re unsure. Remember, management plane stuff is in the Azure Activity Log and data plane stuff is in diagnostic logs.
- Use Entra ID authentication where possible. Yeah it will make your Azure RBAC a bit more challenging, but you can scope the access AND understand who the hell is doing what.
- RTFM where possible. Most of this is buried in the public documentation (sometimes you need to put on your Indiana Jones hat). Remember that if you don’t find it in Foundry documentation, look to Azure Machine Learning.
- Use the above information as a general guide to get the basic environment set up. You’ll build from that basic foundation.

Alrighty folks, your eyes are likely heavy. I hope this helps a few souls out there who are struggling with getting this product up and running. If you know me, you know I’m no fan boy, but this particular product is pretty damn awesome to get us non-devs immediately getting value from generative AI. It may take some effort to get this product running, but it’s worth it!

Thanks and see you next post!

AI Foundry – Credential vs Identity Data Stores

Posted on January 13, 2025 by mattfeltonma

This is a part of my series on AI Foundry:

Hello again folks. Today, I’m going to continue my series on AI Foundry. I’ve been scratching my head on how best to tackle this series, because the service consists of so many foundational services plumbed together into a larger solution that there is a lot to talk about. The product can be complicated when implementing it with all the security bells and whistles. Getting it right requires a solid baseline understanding of the foundational components’ security capabilities (such as Azure Storage, Azure Key Vault, etc.) and how these components work together for the purposes of AI Foundry.

**The many components of an AI Foundry deployment**

For the purposes of this post, I’m going to focus in on Azure Storage, specifically the storage account associated with the AI Foundry Hub. I will refer to this storage account as the default storage account. As I covered in my first post, AI Foundry is built on top of Azure Machine Learning. Like Azure Machine Learning, AI Foundry uses the default storage account to store artifacts created by the AI Foundry hub and projects. This includes files for the Prompt Flows you create, files used by the compute provisioned in the managed virtual network, and other artifacts related to the functionality of the product. This storage account is shared across the AI Foundry hub and all projects created within it.

The default storage account is critical to the functionality and if you muck up the identity or networking configuration, the product simply won’t work. The errors you’ll receive won’t always indicate an obvious problem with your storage account configuration. To help you avoid mucking up the identity portion, I’m going to use this post to explain your options for identity integration with the default storage account.

AI Foundry uses workspace connection resources to connect to external resources outside of the workspace. This includes the default storage account, AOAI (Azure OpenAI Service) or AI Service instance, and the like. When you create a connection in AI Foundry, you configure how the workspace should authenticate to the resource (determined by the authType property of the connection) when called by a user. This will most commonly be either Entra ID or an API key. In the example below, you see I have a connection object for an AI Search instance set to use Entra authentication by configuring the authType to AAD.

 {
      "id": "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rgaifeus296/providers/Microsoft.MachineLearningServices/workspaces/aifhaifoundryeus296/connections/connaisaifoundryeus296",
      "location": null,
      "name": "connmysearchservice",
      "properties": {
        "authType": "AAD",
        "category": "CognitiveSearch",
        "createdByWorkspaceArmId": "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rgaifeus296/providers/Microsoft.MachineLearningServices/workspaces/aifhaifoundryeus296",
        "error": "Network Service does not have permission to check resource /subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rgaifeus296/providers/Microsoft.Search/searchServices/aisaifoundryeus296 details. Please consider grant Azure Machine Learning (appId: 0736f41a-0425-4b46-bdb5-1563eff02385) read or contributor access to connected resource.",
        "expiryTime": null,
        "group": "AzureAI",
        "isSharedToAll": true,
        "metadata": {
          "ApiType": "Azure",
          "ApiVersion": "2024-05-01-preview",
          "DeploymentApiVersion": "2023-11-01",
          "ResourceId": "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rgaifeus296/providers/Microsoft.Search/searchServices/mysearchservice"
        },
        "peRequirement": "NotApplicable",
        "peStatus": "NotApplicable",
        "sharedUserList": [],
        "target": "https://mysearchservice.search.windows.net",
        "useWorkspaceManagedIdentity": false
      },
      "systemData": {
        "createdAt": "2025-01-12T23:19:01.8005674Z",
        "createdBy": "d34d51b2-34b4-45d9-b6a8-cc5422eb400a",
        "createdByType": "Application",
        "lastModifiedAt": "2025-01-12T23:19:01.8005674Z",
        "lastModifiedBy": "d34d51b2-34b4-45d9-b6a8-cc5422eb400a",
        "lastModifiedByType": "Application"
      },
      "tags": null,
      "type": "Microsoft.MachineLearningServices/workspaces/connections"
    }

Creating this connection allows me to use the AI Search instance within the AI Foundry hub and projects, such as using it within the Chat Playground Chat With Your Data feature. When the connection object is called, an Entra ID identity will be used. This could be the user’s identity, it could be a project’s managed identity, or it could even be a managed online endpoint’s managed identity. In all cases, the identity will be an Entra ID identity that can be authenticated to the tenant, and the actions it is authorized to do are determined by its Azure RBAC assignments. It’s critical to understand that if you choose Entra ID-based authentication, you need to have proper permissions in place.
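If you have a pile of connections and want to know which ones are not using Entra ID, the authType property is the thing to check. Below is a minimal sketch that works on connection objects shaped like the one above. The only authType value confirmed in this post is AAD; the `ApiKey` value and the second connection name in the sample are assumptions for illustration, as is the function name.

```python
import json
from typing import Dict, List


def credential_based_connections(connections: List[Dict]) -> List[str]:
    """Return the names of connections whose authType is anything but AAD."""
    flagged = []
    for conn in connections:
        auth_type = conn.get("properties", {}).get("authType")
        if auth_type != "AAD":
            flagged.append(conn.get("name", "<unnamed>"))
    return flagged


# Trimmed-down sample; the second entry is hypothetical.
sample = json.loads("""[
  {"name": "connmysearchservice", "properties": {"authType": "AAD"}},
  {"name": "conn-hypothetical-aoai", "properties": {"authType": "ApiKey"}}
]""")

print(credential_based_connections(sample))  # ['conn-hypothetical-aoai']
```

Anything this flags is a connection where users exercise a shared credential rather than their own identity.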

When a new AI Foundry hub is created, it will either create a new storage account or integrate with an existing storage account to be used as the default storage account. During setup via the Portal, in the identity section you’ll see the option to choose credential-based or identity-based authentication when connecting to the default storage account. By default, credential-based access will be used. If you are provisioning via Terraform (which as of right now will require you to use the AzAPI provider) you would set the properties.systemDatastoresAuthMode property to either accesskey or identity. As of the date of this blog, this property is still not documented in the REST API documentation that I could find; however, it will work when referencing it with API version Microsoft.MachineLearningServices/workspaces@2024-10-01-preview.
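As a rough sketch, this is the shape of the fragment you would merge into the hub workspace body, whether that body lives in an AzAPI resource or a raw REST call. Only the property name, its two values, and the API version come from this post. The helper name is made up, and the rest of the workspace body (storage account, Key Vault, and so on) is intentionally left out.

```python
import json

# Values named in this post; everything else about the body is omitted.
ALLOWED_AUTH_MODES = {"accesskey", "identity"}
RESOURCE_TYPE = "Microsoft.MachineLearningServices/workspaces@2024-10-01-preview"


def hub_auth_mode_fragment(auth_mode: str = "identity") -> dict:
    """Build the body fragment that selects the default storage auth mode."""
    if auth_mode not in ALLOWED_AUTH_MODES:
        raise ValueError(f"auth_mode must be one of {sorted(ALLOWED_AUTH_MODES)}")
    return {"properties": {"systemDatastoresAuthMode": auth_mode}}


print(RESOURCE_TYPE)
print(json.dumps(hub_auth_mode_fragment("identity"), indent=2))
```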

So why would you choose identity-based access if you have to additionally provision the relevant security principals with access via RBAC? Before I answer that, let me do a quick recap on authorization in Azure. As I cover in my series on Azure authorization, services like storage have both a management plane and data plane. While the management plane is always Entra ID-based authentication and Azure RBAC, the data plane for most services (storage included) can use either Entra ID/Azure RBAC or API keys (via Storage Access Keys and SAS tokens). Usage of any type of static key typically grants the security principal using the key complete access to the data plane. Additionally, determining who is using the key at any given time is mostly impossible. For that reason, choosing to use Entra ID/Azure RBAC should be your preference wherever possible. Entra ID will give you traceability back to the security principal that touched the resource and Azure RBAC will give you the ability to assign granular permissions across the data plane.

If you instead select credential-based authentication, a few things happen. When the new AI Foundry hub is created, the connections made to the default storage account will be configured to use a SAS token. Any security principal with read access to the workspace can use that connection information from within an AI Foundry project to connect to the storage account using those credentials. This means no auditability of which user is doing what with the storage account. This goes for any connection you share across projects that uses an API key. Not good.

**Default storage account configured to use credential-based authentication**

It’s worth understanding the Key Vault resource used by AI Foundry in this scenario. When selecting credential-based authentication for the default storage account, the storage access keys for the storage account are stored in the Key Vault. Both the AI Foundry hub and projects under the hub are granted access to the secrets via Key Vault access policies. Yuck and yuck. Users do not get access to the Key Vault itself. Foundry simply enables them to exercise the use of the credential via permissions over the connection object within the Foundry hub or project. When using identity-based authentication and Entra ID for your connections, the Azure Key Vault will be used minimally (such as if you deploy a model from the model catalog to a managed online endpoint and select key-based authentication) or not at all.

Hopefully at this point I’ve sold you on the benefits of using identity-based authentication to the default storage account (and Entra ID for connected resources). As a quick recap, if you care about least privilege and auditability, you’ll choose identity-based authentication. The main consideration when choosing identity-based authentication for the default storage account is that you need to get Azure RBAC right or else shit will break. Oh yes will it break.

If you configure your AI Foundry instance with a SMI (system-assigned managed identity) for the hub and projects, the required permissions on the default storage account will be granted for these identities. This includes:

Hub identity
- Storage Blob Data Contributor
- Storage File Data Privileged Contributor
Project identity
- Storage Account Contributor
- Storage Blob Data Contributor
- Storage File Data Privileged Contributor
- Storage Table Data Contributor

If you’re nosy like I am, you’ll notice the Azure RBAC assignments for both the hub and project identities have an ABAC condition attached (yes, an actual use case!). I plan on covering ABAC conditions in depth in my authorization series, but essentially they are a way of scoping the access to an attribute of the security principal, resource, or session. Within AI Foundry, they are used to limit the managed identities to accessing the blob containers specific to their underlying AML workspace. This helps to prevent the managed identity of one project from accessing artifacts produced by another project. For example, here are the conditions associated with my hub’s managed identity:

(
 (
  !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete'})
  AND
  !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read'})
  AND
  !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write'})
  AND
  !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/move/action'})
  AND
  !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/add/action'})
 )
 OR 
 (
  @Resource[Microsoft.Storage/storageAccounts/blobServices/containers:name] StringStartsWithIgnoreCase '67b8ddaa-f77e-4d12-b9ca-440326274da9'
 )
)

If you opt to use a UMI (user-assigned managed identity) for the AI Foundry hub you’ll need to manually grant these permissions to the UMI prior to provisioning the hub. You should try to include these conditions.
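If the condition syntax shown earlier makes your eyes cross, here is a plain Python model of its logic. It is a simplified sketch, not how Azure evaluates conditions: it uses exact string matching for the actions, it only models the condition (the role still has to grant the action in the first place), and the function name and the container suffix in the example are made up.

```python
# The five blob actions the condition guards, copied from the condition above.
GUARDED_BLOB_ACTIONS = {
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete",
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read",
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write",
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/move/action",
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/add/action",
}


def condition_allows(action: str, container_name: str, workspace_id: str) -> bool:
    """Model of: (not a guarded blob action) OR (container name starts with ID)."""
    not_guarded = action not in GUARDED_BLOB_ACTIONS
    # StringStartsWithIgnoreCase: case-insensitive prefix match on the name.
    container_matches = container_name.lower().startswith(workspace_id.lower())
    return not_guarded or container_matches


workspace_id = "67b8ddaa-f77e-4d12-b9ca-440326274da9"
read = "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"

# Hypothetical container names for illustration.
print(condition_allows(read, workspace_id + "-example", workspace_id))  # True
print(condition_allows(read, "some-other-workspace", workspace_id))     # False
```

Reads against the workspace's own containers pass, reads against anyone else's containers do not, and non-blob actions are left alone by the condition.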

As I mentioned earlier, there are three primary sets of identities that hit the resources for an AI Foundry. These include the hub/project identity, user identity, and compute identities. If you opt to use identity-based authentication to the storage account, you will need to ensure you grant your users appropriate permissions on the storage account. When a user does something like create a prompt flow, the user’s identity context is used to access the file endpoint in the storage account to create a file share that will contain prompt flows they create.

This typically includes:

- Storage Blob Data Contributor
- Storage File Data Privileged Contributor

If you’re spinning up a managed online endpoint, you will need to grant its managed identity the following (manually if using a UMI; these are added automatically if using an SMI):

- Storage Blob Data Reader
- Storage Blob Data Contributor
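The role lists in this post are easy to turn into a quick gap check. The sketch below hard-codes the roles on the default storage account exactly as listed above and reports what an identity is missing. The identity kind labels and the function name are my own, and you would have to feed it the identity's actual role assignments from wherever you export them.

```python
from typing import Dict, Iterable, List, Set

# Roles on the default storage account, as listed in this post.
REQUIRED_ROLES: Dict[str, Set[str]] = {
    "hub": {
        "Storage Blob Data Contributor",
        "Storage File Data Privileged Contributor",
    },
    "project": {
        "Storage Account Contributor",
        "Storage Blob Data Contributor",
        "Storage File Data Privileged Contributor",
        "Storage Table Data Contributor",
    },
    "user": {
        "Storage Blob Data Contributor",
        "Storage File Data Privileged Contributor",
    },
    "managed_online_endpoint": {
        "Storage Blob Data Reader",
        "Storage Blob Data Contributor",
    },
}


def missing_roles(identity_kind: str, assigned: Iterable[str]) -> List[str]:
    """Return the required roles the identity does not yet hold, sorted."""
    return sorted(REQUIRED_ROLES[identity_kind] - set(assigned))


print(missing_roles("user", ["Storage Blob Data Contributor"]))
# ['Storage File Data Privileged Contributor']
```

This covers the minimum for the default storage account only. Other connected resources have their own role requirements.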

The last thing I want to mention applies if you are creating Private Endpoints for your default storage account (which for a secure AI Foundry, you should be). Ensure you grant each AI Foundry project managed identity Reader over the private endpoints (both file and blob) for the default storage account. This is required when previewing data from the AI Foundry Portal for use cases like uploading data for fine-tuning a model. I’m not sure where this requirement comes from, but if you don’t include it, your users will run into weird permission errors when attempting to upload data to the default storage account from within AI Foundry.
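Since this is one assignment per project per private endpoint, it adds up quickly. Here is a small sketch that generates the Azure CLI commands for every combination. The principal IDs and private endpoint resource IDs are placeholders you would need to look up yourself, and the function name is made up. Review the output before running anything.

```python
import shlex
from typing import List


def reader_assignment_commands(project_principal_ids: List[str],
                               private_endpoint_ids: List[str]) -> List[List[str]]:
    """Build one 'az role assignment create' command per project and endpoint."""
    commands = []
    for principal_id in project_principal_ids:
        for endpoint_id in private_endpoint_ids:
            commands.append([
                "az", "role", "assignment", "create",
                "--assignee-object-id", principal_id,
                "--assignee-principal-type", "ServicePrincipal",
                "--role", "Reader",
                "--scope", endpoint_id,
            ])
    return commands


# Placeholder values; substitute your own object IDs and resource IDs.
projects = ["<project-managed-identity-object-id>"]
endpoints = [
    "<blob-private-endpoint-resource-id>",
    "<file-private-endpoint-resource-id>",
]

for cmd in reader_assignment_commands(projects, endpoints):
    print(shlex.join(cmd))
```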

Let’s sum things up:

- The default storage account configuration is critical to successful use of the product. Muck up authorization and prepare for pain.
- Use identity-based authentication for connectivity to the default storage account. This will ensure auditability for who accesses what.
- Use Entra ID authentication for your AI Foundry connections wherever possible. This will give you auditability and the ability to scope permissions via Azure RBAC.
- If you’re using identity-based authentication, ensure you put in place the right permissions for the hub/project (done automatically if using SMI), user, and compute identities.
- If you’re having trouble with users uploading data for fine-tuning via AI Foundry, your project is probably missing the read permissions over the default storage account private endpoints.
- If you’re having trouble provisioning a managed online endpoint that is using a UMI, you are probably missing permissions on the default storage account.

That wraps up this post. Thanks folks!

AI Foundry – The Basics

Posted on January 8, 2025 by mattfeltonma

This is a part of my series on AI Foundry:

Happy New Year! Over the last few months of 2024, I was buried in AI Foundry (FKA Azure AI Studio. Hey marketing needs to do something a few times a year.) proof-of-concepts with my best buddy Jose. The product is very new, very powerful, but also very complex to get up and running in a regulated environment. Through many hours labbing scenarios out and troubleshooting with customers, I figured it was time to share some of what I learned.

So what is AI Foundry? If you want the Microsoft explanation of what it is, read the documentation. Here you’ll get the Matt Felton opinion on what it is (god help you). AI Foundry is a toolset intended to help AI Engineers build Generative AI applications. It allows them to interact with the LLMs (large language models) and build complex workflows (via Prompt Flows) in a no code (like Chat Playground) or low code (prompt flow interface) environment. As you can imagine, this is an attractive tool to get the people who know Generative AI quick access to the LLMs so use cases can be validated before expensive development cycles are spent building the pretty front-end and code required to make it a real application.

Before we get into the guts of the use cases Jose and I have run into, I want to start with the basics of how the hell this service is setup. This will likely require a post or two, so grab your coffee and get ready for a crash course.

AI Foundry is built on top of AML (Azure Machine Learning). If you’ve ever built out a locked-down AML instance, you have some understanding of the many services that work together to provide the service. AI Foundry inherits these components, provides a sleek user interface on top, and typically requires additional resources like Azure OpenAI and AI Search (for RAG use cases). Like AML, there are lots and lots of pieces that you need to think about, plan for, and implement in the correct manner to make a secured instance of AI Foundry work.

One specific feature of AML plays an important role within AI Foundry and that is the concept of a hub workspace. A hub is an AML workspace that centrally manages the security, networking, compute resources, and quota for child AML workspaces. These child AML workspaces are referred to as projects. The whole goal of the hub is to make it easier for your various business units to do the stuff they need to do with AML/AI Foundry without having to manage the complex pieces like the security and networking. My guidance would be to think about giving each business unit a Foundry Hub in which they can group projects of similar environment (prod or non-prod) and data sensitivity.

**General relationship between AI Foundry Hub and AI Foundry Projects**

Ok, so you get the basic gist of this. When you deploy an AI Foundry instance, behind the scenes an AML workspace designated as a hub is created. Each project you create is a “child” AML workspace of the hub workspace and will inherit some resources from the hub. Now that you’re grounded in that basic piece, let’s talk about the individual components.

**The many resources involved with a secured AI Foundry instance**

As you can see in the above image, there are a lot of components of this solution that you will likely use if you want to deploy an AI Foundry instance that has the necessary security and networking controls. Let me give you a quick and dirty explanation of each component. I’ll dive deeper into the identity, authorization, and networking aspects of these components in future posts.

Managed identities
- There are lots of managed identities in use with this product. There is a managed identity for the hub, a managed identity for the project, and managed identities for the various compute. One of the challenges of AI Foundry is knowing which managed identity (and if not a managed identity, the user’s identity) is being used to access a specific resource.
Azure Storage Account
- Just like in AML, there is a default storage account associated with the workspace. Unlike with a traditional AML workspace you may be familiar with, the hub feature allows all project workspaces to leverage the same storage account. The storage account is used by the workspaces to store artifacts like logs in blob storage and artifacts like prompt flows in file storage. The hub and projects isolate their data to specific containers (for blob) and folders (for file) with Azure ABAC (holy f*ck a use case for this feature finally) setup such that the managed identities for the workspaces can only access containers/folders for data related to their specific workspace.
Azure Key Vault
- The Azure Key Vault will store any keys used for connections created from within the AI Foundry project. This could be keys for the default storage account or keys used for API access to models you deploy from the model catalog.
Azure Container Registry
- While this is deemed optional, I’d recommend you plan on deploying it. When you deploy a prompt flow that uses certain functionality to the managed compute, the container image used isn’t the default runtime and an ACR instance will be spun up automatically without all the security controls you’ll likely want.
Azure OpenAI Service
- This is used for deployment of OpenAI and some Microsoft chat and embedding models
Azure AI Service
- This can be used as an alternative to the Azure OpenAI Service. It has some additional functionality beyond just hosting the models, such as speech and the like.
Azure AI Search
- This will be used for anything RAG related. Most likely you’ll see it used with the Chat With Your Data feature of the Chat Playground.
Managed Network
- This is used to host the compute instances, serverless endpoints and managed online endpoints spun up for compute operations within AI Foundry. I’ll do a deep dive into networking within the service in a future post.
Azure Firewall
- If building a secured instance, you’re going to use a managed virtual network that is locked down to all Internet access via outbound rules. Under the hood an Azure Firewall instance (standard for almost all use cases) will be spun up in the managed virtual network. You interact with it through the creation of the outbound rules and can’t directly administer it. However, you will be paying for it.
Role Assignments
- So so so many role assignments. I’ll cover these in a future post.
Azure Private DNS
- Used heavily for interacting with the AI Foundry instance and models/endpoints you deploy. I’ll cover this in an upcoming post.

Are you frightened yet? If not you should be! Don’t worry though, over this series I’ll walk you through the pain points of getting this service up and running. Once you get past the complex configuration, it’s a crazy valuable service that you’ll have a high demand for from your business units.

In the next post I’ll walk through the complexities of authorization within AI Foundry.

DNS in Microsoft Azure – Private DNS Fallback

Posted on December 3, 2024 by mattfeltonma

This is part of my series on DNS in Microsoft Azure.

Updates:

7/30/2025 – Updated blog to reflect feature is now generally available

Hello folks! I wanted to get at least one blog post in before 2025 so today I’m going to bring the conversation back to DNS once again. I’m going to be hitting on an advanced topic today, so if you’re unfamiliar with DNS in Azure, read up on my prior posts. I’m going to be skipping through much of the basics.

Today we’re going to talk about one of the challenges that tends to pop up when customers begin to heavily use PrivateLink Private Endpoints and Azure Private DNS. You will likely run into this challenge at some point (if you haven’t already) when you attempt to collaborate with another organization using Azure, when using services like Microsoft Fabric where one BU (business unit) manages Azure and another manages Fabric, or when working across multiple Entra ID tenants.

Brew your coffee, we’re about to dive into the weeds!

As I’ve covered in past posts, Microsoft provides you out-of-the-box DNS resolution for each VNet (virtual network) via the Azure-provided DNS service (I’m going to refer to it as the WireServer for the rest of this post). The WireServer can be reached at 168.63.129.16 from endpoints deployed to the virtual network and will route DNS queries to either Microsoft public DNS resolvers or to Private DNS Zones. Private DNS Zones allow customers to host internally-facing DNS namespaces and are very commonly used with PrivateLink Private Endpoints for Microsoft PaaS (platform-as-a-service) services due to the automatic lifecycle management of the A records for the Private Endpoints. This is where our challenge begins to rear its ugly head.

**Example DNS Resolution for Private Endpoints when using Private DNS**

Alrighty, I get it. You know all this and it’s boring you. Let’s get to the good stuff.

What if you need to collaborate with another organization and they also use Private Endpoints? How might this cause some issues?

Let’s take a scenario where Bob works for Contoso and Alice works for Fabrikam. Alice over at Fabrikam produces a daily dump of data from a financial system to an Azure Storage Account as a blob. Bob over at Contoso pulls that data down into his environment for analysis by employees of Contoso. Alice provides this dump to over a hundred customers. Due to this large volume of customers, she has opted to provide it over a public endpoint only.

**Bob living the good life with resolution working as he expects**

This process has been working flawlessly for years and Bob’s life has been good. One day, Bob’s life isn’t good and his automation fails. After lots of troubleshooting involving both Contoso and Fabrikam, it’s determined that DNS resolution is failing when trying to resolve the name of the storage account.

As it turns out, Alice’s Information Security team made it a standard to use Private Endpoints and she turned on a Private Endpoint for the storage account. The creation of the Private Endpoint creates a CNAME for the storage account in public DNS pointing to fabrikam.privatelink.blob.core.windows.net. Since Contoso has the privatelink.blob.core.windows.net Private DNS Zone configured in its environment, Bob’s query gets redirected to Contoso’s Private DNS Zone, which doesn’t have the record and instead returns an NXDOMAIN.

**Bob having a bad day with the DNS resolution failing due to Fabrikam turning on a Private Endpoint for the storage account**
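If you find yourself in Bob's shoes, a quick check from the affected machine can tell you whether the name resolves at all and whether it comes back as a private or public IP. Below is a minimal sketch in Python, not anything from the Azure tooling. The function name `classify_resolution` and the injectable `resolver` parameter are my own for illustration; by default it uses the operating system's resolver through `socket.getaddrinfo`.

```python
import ipaddress
import socket


def classify_resolution(fqdn, resolver=socket.getaddrinfo):
    """Return 'unresolved', 'private', or 'public' for a hostname.

    'unresolved' covers NXDOMAIN and any other lookup failure.
    'private' means every returned address is in a private range,
    which is what you would expect from a Private Endpoint record.
    """
    try:
        results = resolver(fqdn, 443)
    except socket.gaierror:
        return "unresolved"

    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address (strip any IPv6 scope id).
    addresses = {info[4][0].split("%")[0] for info in results}
    if not addresses:
        return "unresolved"
    if all(ipaddress.ip_address(addr).is_private for addr in addresses):
        return "private"
    return "public"
```

Calling `classify_resolution("fabrikam.blob.core.windows.net")` (a made-up storage account name) from a machine in Bob's virtual network would be expected to return `unresolved` in the broken state described above.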

Historically, this has been a pain to deal with. Customers have had to work around it by creating local host records (yuck), defining the FQDN (fully-qualified domain name) for the storage account as a zone, or creating conditional forwarders for specific FQDNs in their on-premises DNS service. While all of these will work, they can become a real headache at scale and can make troubleshooting resolution a complete nightmare. Yes, there is always the option of the 3rd party injecting a Private Endpoint into your virtual network, but I rarely see this occur across my customer base in situations where 3rd parties are servicing a large number of customers. Likely due to complexity and cost (yes Private Endpoints and the data transferring through them do have costs and can add up with large amounts of data).

Microsoft introduced a new feature in 2024 called “Fallback to Internet for Private DNS” which seeks to address this problem once and for all. With this feature customers can configure whether resolution should fall back to public DNS on a per virtual network link basis for each Private DNS Zone. This means you can pick which Private DNS Zones fall back to public DNS. Maybe you want to do it for privatelink.blob.core.windows.net, but not privatelink.database.windows.net. If you use different resolution paths (meaning separate virtual network links) for production and non-production, you can choose to fall back only for non-production while keeping today’s behavior for production. This gives you a ton of flexibility in how you handle resolution.

In the Azure Portal you will see an option in a virtual network link called Enable fallback to Internet. When you enable this option Azure DNS will fall back to public DNS resolution if it can’t find a record in a Private DNS Zone. Under the hood this option controls the resolution policy on the virtual network link: with fallback off it’s set to the value of Default and with fallback on it’s set to the value of NxDomainRedirect.

**New option in Azure Portal to enable DNS fallback**
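If you'd rather set this outside of the Portal, the same setting can be expressed as a property on the virtual network link resource. The sketch below only builds the request URL and body for an ARM PUT call and does not send anything. Treat the details as assumptions to verify against the current REST reference: the `resolutionPolicy` property name and the `2024-06-01` API version are what I'd expect to work, and the function name and its parameters are hypothetical.

```python
def build_link_request(subscription_id, resource_group, zone_name,
                       link_name, vnet_id, enable_fallback,
                       api_version="2024-06-01"):
    """Build the URL and body for updating a Private DNS virtual network link.

    api_version and the 'resolutionPolicy' property name are assumptions;
    check them against the Azure Private DNS REST reference before use.
    """
    url = (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.Network"
        f"/privateDnsZones/{zone_name}"
        f"/virtualNetworkLinks/{link_name}"
        f"?api-version={api_version}"
    )
    body = {
        "location": "global",
        "properties": {
            "virtualNetwork": {"id": vnet_id},
            "registrationEnabled": False,
            # Default = no fallback, NxDomainRedirect = fall back to public DNS
            "resolutionPolicy": "NxDomainRedirect" if enable_fallback else "Default",
        },
    }
    return url, body
```

Because the setting lives on the link and not the zone, you would build one of these per virtual network link you want to change, which is what makes the production versus non-production split possible.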

If we revisit Bob’s challenge, he can now resolve it by enabling fallback on the virtual network link used by his endpoint’s resolution path for the privatelink.blob.core.windows.net zone. When the WireServer receives back an NXDOMAIN from the Private DNS Zone, it will then try to resolve the name via public DNS, yielding the public endpoint IP Bob needs for Fabrikam’s storage account.

**DNS resolution with fallback in place**
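To make the before and after concrete, here is a toy model of the resolution flow in Python. It is a simplification I put together to illustrate the behavior described in this post, not how Azure DNS is actually implemented. The record values are made up, the real public CNAME chain has more hops than shown here, and names like `LinkedZone` and `resolve` are hypothetical.

```python
from dataclasses import dataclass, field

# Simplified public DNS: name -> (record type, value). Values are made up.
PUBLIC_DNS = {
    "fabrikam.blob.core.windows.net": ("CNAME", "fabrikam.privatelink.blob.core.windows.net"),
    "fabrikam.privatelink.blob.core.windows.net": ("A", "20.60.1.1"),
    "contoso.blob.core.windows.net": ("CNAME", "contoso.privatelink.blob.core.windows.net"),
    "contoso.privatelink.blob.core.windows.net": ("A", "20.60.2.2"),
}


@dataclass
class LinkedZone:
    """A Private DNS Zone as seen through one virtual network link."""
    name: str
    records: dict = field(default_factory=dict)
    resolution_policy: str = "Default"  # or "NxDomainRedirect"


def resolve(fqdn, linked_zones, public_dns=PUBLIC_DNS, max_hops=5):
    name = fqdn
    for _ in range(max_hops):
        zone = next((z for z in linked_zones
                     if name == z.name or name.endswith("." + z.name)), None)
        if zone is not None:
            if name in zone.records:
                return zone.records[name]      # private answer wins
            if zone.resolution_policy != "NxDomainRedirect":
                return "NXDOMAIN"              # today's default behavior
            # Fallback enabled: continue on to public DNS.
        record = public_dns.get(name)
        if record is None:
            return "NXDOMAIN"
        record_type, value = record
        if record_type == "A":
            return value
        name = value                           # follow the CNAME
    return "NXDOMAIN"


contoso_records = {"contoso.privatelink.blob.core.windows.net": "10.0.0.4"}
without_fallback = [LinkedZone("privatelink.blob.core.windows.net", contoso_records)]
with_fallback = [LinkedZone("privatelink.blob.core.windows.net", contoso_records,
                            resolution_policy="NxDomainRedirect")]
```

In this model Contoso's own storage account keeps resolving to its private IP either way, since the private zone has a record for it. Only the lookup for Fabrikam's account changes, going from NXDOMAIN to the public IP once the link's policy is NxDomainRedirect.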

This feature makes dealing with the scenario way more straightforward. I haven’t heard a good reason to not enable this by default. If you have one in mind, definitely post in the comments.

So your key takeaways:

The usage of Private Endpoints across organizations can create split-brain DNS-like scenarios that require lots of DNS record management overhead.
This feature will help to address those scenarios. I haven’t found a good reason not to enable it by default.

Thanks for reading!

Journey Of The Geek

The chronicles of a Bostonian tech geek navigating through life and technology