DNS in Microsoft Azure – DNS Security Policies

This is part of my series on DNS in Microsoft Azure.

Hi there folks! After a busy July packed with a vacation and an insane amount of work, I’m back with a new post. Today I’m going to cover a new feature that has been years in the making. Yes folks, DNS query logging is now native to the platform with DNS Security Policies reaching GA (general availability) last month. No longer will you have to engineer around this long, painful gap. In this post I’ll walk through what this new resource is, what it can do (beyond DNS query logging), cover the use cases I’ve tested with it, show you some samples of the logs, and finally cover some potential designs to incorporate it. Let’s dive in!

A long time coming

If you’ve ever spent time troubleshooting a connection error or trying to detect, block, and analyze malware, you’re likely familiar with the value of DNS query logs. The former makes them a must for day-to-day operations and the latter makes them a critical piece of data for information security. Historically, it’s been a pain to gather these in Microsoft Azure. The wire server (magic IP, 168 address, whatever your favorite nickname) that is made available within a virtual network to use Azure’s built-in DNS resolution service has lacked the capability to capture DNS queries. This meant queries from compute within your virtual network that were resolving to Azure Private DNS zones or a public DNS zone via Azure-provided DNS weren’t captured. Even the introduction of the Azure Private Resolver didn’t address this gap. This led to customers with requirements to capture DNS query logs having to get fancy.

The most common pattern customers used to address this gap was to introduce a third-party DNS service like an Infoblox, Bluecat, BIND server, or even Windows DNS Server that all compute running within Azure would use for resolution. While customers were able to use this pattern to get the logs, it meant more virtual machines, more costs, more overhead, and it was typically too expensive to implement for workloads that may require complete isolation and didn’t fit into a typical hub and spoke pattern.

Example design for BYODNS for query logging

When the Azure Private Resolver service was introduced along with DNS Forwarding Rule Sets, customers using Azure Firewall had the option of ditching the third-party DNS service and using Azure Firewall’s DNS proxy service, which included DNS query logging (kind of odd it went there first, right?). This was another common pattern I saw pop up in that Azure Firewall customer base.

Example design using Azure Firewall for DNS query logging and Azure Private DNS Resolver

Beyond whatever other creative ways customers were addressing this gap, it was a gap and it was costing customers extra money. Enter DNS Security Policies to save the day.

DNS Security Policies Components

DNS Security Policies provide 2 core functions today:

  1. DNS query filtering
  2. DNS query logging

Before I dive into those features in depth, I’m a fan of looking at the resource as a whole from the API layer to get an idea of the components, their purpose, and their relationships.

DNS Security Policies and related resources

DNS Security Policies fall under the Microsoft.Network resource provider and are regional resources. The simplest way to understand a resource provider is to think of a namespace in traditional programming. Within a namespace there are resource types (think classes) with specific resource operations. Within the Microsoft.Network resource provider, there are three direct child resources that are key here.

You’ll notice the Microsoft Learn documentation uses different terminology from what the API uses for some of the resources. To keep things simple, I’ll be using the Microsoft Learn terminology. Here is a quick cheat sheet:

  • DNS Resolver Policies -> DNS Security Policy
  • DNS Security Rules -> DNS Traffic Rules
  • DNS Resolver Domain Lists -> Domain Lists

Each DNS Security Policy has two child resources: DNS Traffic Rules and Virtual Network Links. DNS Traffic Rules are the guts of your logic for the DNS Security Policy. Each policy can have up to 10 rules (as of August 2025). Each rule consists of a priority (100 – 65000), an action (block, allow, alert), and a related domain list (I’ll cover these in a few). You can create multiple rules and order them by priority similar to the screenshot below.

DNS Traffic Rules example

Based on the above logic, the DNS Security Policy triggers a rule when the requested domain matches the rule’s associated domain list. If the domain being requested is in the list associated with the priority 100 rule, the query is blocked. If not, it’s then processed by the alert rule (which seems to do nothing in my experience, as I’ll cover later). Finally, it hits the last rule, which allows it through but logs it.

As I covered above, each rule is associated with one or more domain lists. Domain lists are sibling resources to DNS Security Policies. By being a sibling vs a child, they can be re-used across multiple DNS Security Policies (and whatever other use Microsoft comes up with). This allows you to define your domain lists centrally and re-use them across multiple rule sets if, for example, you wanted to maintain your domain lists consistently across environments (test/qa/prod/etc). Domain lists are pretty simple resources consisting of a domain name or wildcard (denoted by a period). It’s important to understand how the domains will be processed. For example (I’m going to steal this direct from the docs), if you allow contoso.com at rule 100 but block bad.contoso.com at rule 110, the query to bad.contoso.com will be allowed because it falls under contoso.com, which was allowed by a higher priority rule.

Example of a domain list
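To make the priority-plus-matching behavior concrete, here’s a rough mental model in Python. To be clear, this is my own simplification for illustration (the rule set, the catch-all entry, and the matching function are all made up, not the service’s actual implementation), but it captures why the higher priority allow on contoso.com wins over the lower priority block on bad.contoso.com.

# Rough mental model of how DNS Traffic Rules evaluate a query.
# Illustrative only; the rule set and the catch-all "." entry are made up.
RULES = [
    # (priority, action, domain list entries)
    (100, "Allow", ["contoso.com."]),      # trailing period = wildcard, covers subdomains
    (110, "Block", ["bad.contoso.com."]),
    (65000, "Allow", ["."]),               # assumed catch-all used for "log everything"
]

def matches(query: str, entry: str) -> bool:
    """A query matches an entry if it equals it or falls underneath it."""
    entry, query = entry.rstrip("."), query.rstrip(".")
    return entry == "" or query == entry or query.endswith("." + entry)

def evaluate(query: str) -> str:
    """Rules are processed in priority order; the first matching rule wins."""
    for priority, action, domains in sorted(RULES, key=lambda rule: rule[0]):
        if any(matches(query, entry) for entry in domains):
            return f"{action} (rule priority {priority})"
    return "Allow (no rule matched)"

print(evaluate("bad.contoso.com"))  # Allow (rule priority 100), the gotcha from the docs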

The virtual network link resource is the other child of the DNS Security Policy. It functions similarly to the virtual network links on Private DNS Zones in that it associates the DNS Security Policy with a virtual network where it will process queries sent through the wire server (Azure-provided DNS). Each virtual network can be linked to one DNS Security Policy, but each DNS Security Policy can be linked to multiple virtual networks. This allows you to use them for virtual networks connected in a hub-and-spoke-like architecture with centralized DNS as well as for virtual networks that may require complete network isolation.

Example of DNS Security Policy virtual network links

DNS Security Policies support diagnostic logging. This allows you to send each query captured by the policy to storage, an event hub, or a Log Analytics workspace. If using a Log Analytics workspace, the logs are written to a table named DNSQueryLogs. Log entries will look like the below. You’ll get the key pieces of information such as the source IP address of the query and the action taken on it. Here you’ll see the query was denied, which is indicated by the ResolverPolicyRuleAction field. The values here will be “Deny” for blocks, “None” for alerts, and “Allow” for anything allowed.

Example of DNS Query Logs log entry
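If you’re sending these to a Log Analytics workspace, you can also pull the denies programmatically. Here’s a minimal sketch using the azure-monitor-query Python package; the workspace ID is a placeholder, and aside from ResolverPolicyRuleAction and TimeGenerated the projected column names are my guesses at the schema, so adjust them to what you see in your own workspace.

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Pull the last 24 hours of queries blocked by a DNS Security Policy.
# Column names other than ResolverPolicyRuleAction and TimeGenerated are guesses.
kql = """
DNSQueryLogs
| where ResolverPolicyRuleAction == "Deny"
| project TimeGenerated, SourceIpAddress, QueryName, ResolverPolicyRuleAction
| order by TimeGenerated desc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-guid>",  # placeholder
    query=kql,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(list(row))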

When the query is denied, instead of getting back an NXDOMAIN, the machine making the query receives a CNAME of blockpolicy.azuredns.invalid indicating the query has been blocked by a DNS Security Policy. This is much better behavior than an NXDOMAIN because now we know the culprit behind the failed DNS query.

Example of DNS query being denied by DNS Security Policy
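You can verify this behavior yourself from a VM inside a linked virtual network. Here’s a quick sketch with dnspython that sends an A query straight to the wire server and checks the answer for the block sentinel; the domain being tested is a placeholder for something in one of your block lists.

import dns.message
import dns.query
import dns.rdatatype

WIRE_SERVER = "168.63.129.16"  # the Azure wire server (the "168 address")
BLOCK_SENTINEL = "blockpolicy.azuredns.invalid"

def is_blocked_by_policy(name: str) -> bool:
    """Send an A query to the wire server and look for the block CNAME in the answer."""
    query = dns.message.make_query(name, dns.rdatatype.A)
    response = dns.query.udp(query, WIRE_SERVER, timeout=5)
    for rrset in response.answer:
        if rrset.rdtype == dns.rdatatype.CNAME and any(
            str(rdata.target).rstrip(".") == BLOCK_SENTINEL for rdata in rrset
        ):
            return True
    return False

# "blocked.example.com" is a stand-in for a domain in one of your Block rules.
print(is_blocked_by_policy("blocked.example.com"))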

To visualize how the allow and deny work, I threw together two quick and dirty visual representations.

Example of how Allow and Block DNS Traffic Rules work

Scenarios you may be wondering about

Like many of you, I’m curious to see what does and doesn’t work. I went through and tested a variety of scenarios. Here are a few below along with my results when using these policies:

  • Machine using an external DNS server and not using the wire server (magic IP, 168 address, etc)
    • Query is not logged by DNS Security Policies
  • Machine using the wire server in its virtual network
    • Query is captured
  • Machine using Private DNS Resolver in the same virtual network
    • Query is captured
  • Machine using a DNS Proxy which sits in front of the Private DNS Resolver
    • Query is captured
  • Machine queries an A record or PTR record
    • Query is captured
  • Machine queries AAAA record
    • Query is captured
  • Machine queries using TCP-based query instead of UDP-based query
    • Query is captured
  • PaaS Services tested successfully
    • Azure Bastion
    • Azure Firewall

How might you use this?

So now you better understand how the service works and what it does. I’ll now tell you how I’d use it. I’m sure folks smarter than me will come out with more effective ways, but here is how I’m envisioning it now.

Based on the testing I’ve done (and testing done by one of my wonderful peers, Chris Jasset), the DNS Security Policies seem to take effect at the wire server. This means you’ll want to link the policies to the virtual networks where DNS packets are directed to the wire server. In a centralized DNS design such as the one below, this would be linked to the virtual network containing the Azure Private DNS Resolver or third-party DNS solution. You would need one DNS Security Policy per region given they are regional in nature.

Sample design for centralized DNS resolution

If you’re using a distributed DNS model, or have isolated virtual networks, your design would look something more like below. Here the DNS Security Policies are linked to each virtual network to ensure the packet is captured at the wire server of the virtual network where the query originates.

Sample design for distributed DNS Security Policy

As for domain lists, I think most organizations will end up with three separate ones: one for block, one for alert (again, I don’t find this super useful as of now), and one for allow. These domain lists could be established in a production subscription and shared across lower environments to ensure consistency of blocked domains across environments.

Summing it up

There are a few big takeaways for you this post:

  • It’s time to revisit how you’re capturing DNS query logs. If your only reason for implementing a third-party DNS service was DNS query logging, you may want to revisit that to see if this new solution is more cost effective.
  • Just like Azure Private DNS, don’t forget to link your policy to the right virtual network. Whichever virtual network is sending DNS queries to the wire server is where these should be linked.
  • DNS query logs are very chatty. You may want to look at ways of optimizing what you capture (if you’re sending it to a third-party logging solution) or how much you retain (if you’re keeping it in a Log Analytics Workspace). This is especially true if you use a wildcard in the allow rule to capture everything. PaaS especially is very chatty. If you aren’t careful about this, you’ll owe Microsoft a big fat check by the end of that first month.

Lastly, I threw together some samples of the creation of these resources in Terraform if you’re curious. You can find the code here.

Well folks, hopefully you learned something new today. Thanks as always for taking the time to read the content!


Azure Authorization – Azure ABAC (Attribute-based Access Control)

This is part of my series on Azure Authorization.

  1. Azure Authorization – The Basics
  2. Azure Authorization – Azure RBAC Basics
  3. Azure Authorization – actions and notActions
  4. Azure Authorization – Resource Locks and Azure Policy denyActions
  5. Azure Authorization – Azure RBAC Delegation
  6. Azure Authorization – Azure ABAC (Attribute-based Access Control)

Welcome back fellow geeks.

I do a lot of learning and educational sessions with my customer base. The volume pretty much demands reusable content which means I gotta build decks and code samples… and worse maintain them. The maintenance piece typically consists of me mentally promising myself to update the content and kicking the can down the road for a few months. Eventually, I get around to updating the content.

This month I was doing some updates to my content around Azure Authorization and decided to spend a bit more time with Azure ABAC (attribute-based access control). For those of you unfamiliar with Azure ABAC, well, it’s no surprise because the use cases are so very limited as of today. Limited as the use cases are, it’s worthwhile functionality to understand because Microsoft uses it in its own products and you may have use cases where it makes sense.

The Dream of ABAC

Let’s first touch briefly on the differences between role-based access control (RBAC) and attribute-based access control (ABAC). Attribute-based access control has been the dream for the security industry for as long as I can remember. RBAC has been the predominant authorization mechanism in the majority of applications over the years. The challenge with RBAC is it has typically translated to basic group membership, where an application authorizes a user solely on whether or not the user is in a group. Access to these groups would typically come through some type of request for membership and implementation by a central governance team. Those processes have tended to be not super user friendly and the access has tended to be very coarse-grained.

ABAC meanwhile promised more fine-grained access based upon attributes of the security principal, resource, or whatever your mind can dream up. Sounds awesome right? Well it is, but it largely remained a dream in the mainstream world with a few attempts such as Windows Dynamic Access Control (Before you comment, yeah I get you may have had some cool apps doing this stuff years ago and that is awesome, but let’s stick with the majority). This began to change when cloud came around with the introduction of more modern protocols and standards such as SAML, OIDC, and OAuth. These protocols provide more flexibility with how the identity provider packages attributes about the user in the token delivered to the service provider/resource provider/what have you.

When it came to the Azure cloud, Microsoft went the traditional RBAC path for much of the platform. A user or group gets placed in an Azure RBAC role and gets access. I explain Azure RBAC in my other posts on RBAC. There is a bit of flexibility on the Entra ID side for the initial access token via Entra ID Conditional Access, but it’s pure RBAC in the Azure realm. This was the story for many years of Azure.

In 2021 Microsoft decided something more flexible was needed and introduced Azure ABAC. The world rejoiced… right? Nah, not really. While the introduction of ABAC was awesome, its scope of use was and still is extremely limited. As of the date of this blog, ABAC is only usable for Azure Storage blob and queue operations. All is not lost though, there are some great use cases for this feature so it’s important to understand how it works.

How does ABAC work?

Alright, history lesson and complaining about limited scope aside, let’s now explore how the feature works.

ABAC is facilitated through additional properties on Azure RBAC role assignment resources. I’m going to assume you understand the ins and outs of role assignments. If you don’t, check out my prior post on the topic. In its simplest sense, an Azure RBAC role assignment is the association of a role to a security principal, granting that principal the permissions defined in the role over a particular scope of resources. As I’ve covered previously, role assignments are Azure resources with defined sets of properties. The properties we care about for the scope of this discussion are the conditionVersion and condition properties. The conditionVersion property will always have a value of 2.0 for now. The condition property is where we work our ABAC magic.
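Before digging into the condition language itself, here’s a minimal sketch of what attaching a condition to a role assignment looks like programmatically using the azure-mgmt-authorization Python package. The subscription, scope, and principal are placeholders, the condition string is a simplified version of the examples we’ll unpack below, and I’m assuming a recent SDK version where RoleAssignmentCreateParameters exposes condition and conditionVersion, so double-check against your SDK before relying on it.

import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

SUBSCRIPTION_ID = "<subscription-guid>"  # placeholder
client = AuthorizationManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Scope of the assignment: a hypothetical storage account
scope = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/rg-data"
    "/providers/Microsoft.Storage/storageAccounts/stdemo"
)

# Simplified condition: only allow blob reads when the blob carries an
# access_level index tag of "low" (same expression syntax unpacked below).
condition = (
    "((!(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read'}"
    " AND !SubOperationMatches{'Blob.List'}))"
    " OR "
    "(@Resource[Microsoft.Storage/storageAccounts/blobServices/containers/blobs/tags:access_level<$key_case_sensitive$>]"
    " ForAnyOfAnyValues:StringEquals {'low'}))"
)

client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),  # role assignment names are GUIDs
    parameters=RoleAssignmentCreateParameters(
        # Storage Blob Data Reader built-in role (verify the ID against the docs)
        role_definition_id=f"/subscriptions/{SUBSCRIPTION_ID}/providers/Microsoft.Authorization"
        "/roleDefinitions/2a2b9908-6ea1-4ae2-8e65-a410df84e7d1",
        principal_id="<object-id-of-user-or-group>",  # placeholder
        condition=condition,
        condition_version="2.0",
    ),
)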

The condition property is made up of a series of conditions which each consist of an action and one or more expressions. The logic for conditions is kinda weird, so I’ll walk you through it using some of the examples from the documentation as well as a complex condition I threw together. First, let’s look at the general structure.

Structure of conditions used in ABAC

In the above image you can see the basic building blocks of a condition. Looks super confusing and complicated right? I know it did to me at first. Thankfully, the kind souls who write the public documentation broke this down in a more friendly programming-like way.

Far more simple explanation of conditions

In each condition property we first have the action line, where the logic looks to see if the action being performed by the security principal doesn’t (note the exclamation point, which negates what’s in the parentheses) match the action we’re applying the conditions to. You’ll commonly see a line like:

!(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read'} AND !SubOperationMatches{'Blob.List'})

This line is saying that if the action isn’t blobs/read (which would be a data plane call to read the contents of the blob) then the line should evaluate to true. If it evaluates to true, then the access is allowed and the expressions are not evaluated any further.

After this line we have the expressions, which are only evaluated when the first line evaluates to false (which, in the example I just covered, would mean the security principal is trying to read the content of a blob). The expressions support four categories of what Microsoft refers to as condition features, each in various states of GA (general availability) or preview (refer to the documentation for those details). These categories include:

  • Requests
  • Environment
  • Resource
  • Principal (security principal)

These four categories give you a ton of flexibility. Requests covers the details of the request to storage, such as limiting a user to specific blob prefixes based on the prefix within the request. Environment can be used to limit the user to accessing the resource from a specific Private Endpoint or over Private Link in general (think defense-in-depth here). The resource feature exposes properties of the resource being accessed, and the one I find most flexible is blob index tags. Lastly, we have the security principal, and this is where you can muck around with custom security attributes in Entra ID (very cool feature if you haven’t touched it).

In a given condition we can have multiple expressions and within the condition property we can string together multiple conditions with AND and OR logic. I’m a big believer in going big or going home, so let’s take a look at a complex condition.

Diving into the Deep End

Let’s say I have a whole bunch of data I need to make available via blobs in an Azure Storage Account. I have a strict requirement to use a single storage account, and the blobs I’m going to store have different data classifications denoted by a blob index tag key named access_level. Blobs without this key are accessible by everyone, while blobs classified high, medium, or low are only accessible by users with approval for the same or higher access levels (example: a user with a high access level can access high, medium, low, and data with no access level). Lastly, I have a requirement that data at the high access level can only be accessed during business hours.

I use a custom security attribute in Entra ID called accesslevel under an attribute set named organization to denote a user’s approved access level.
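For the blob side of this scenario, the data needs to carry the access_level blob index tag in the first place. Here’s a minimal sketch using the azure-storage-blob package to upload a document with that tag; the account, container, and file names are all placeholders.

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)

container = service.get_container_client("classified-docs")  # placeholder container

# Upload a document and stamp it with the access_level blob index tag the
# conditions below key off of.
with open("q3-financials.pdf", "rb") as data:  # placeholder file
    container.upload_blob(
        name="q3-financials.pdf",
        data=data,
        overwrite=True,
        tags={"access_level": "high"},
    )

# Tags can also be set or changed after the fact.
container.get_blob_client("q3-financials.pdf").set_blob_tags({"access_level": "medium"})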

Here is how that policy would break down.

My first condition is built to allow users to read any blobs that don’t have the access_level tag.

# Condition that allows users within scope of the assignment access to documents that do not have an access level tag
(
  (
    # If the action being performed doesn't match blobs/read then result in true and allow access
    !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read'} AND !SubOperationMatches{'Blob.List'})
  )
  OR 
  (
    # If the blob doesn't have a blob index tag with a key of access_level then allow access
    NOT @Resource[Microsoft.Storage/storageAccounts/blobServices/containers/blobs/tags&$keys$&] ForAnyOfAnyValues:StringEquals {'access_level'}
  )
)

If the blob does have an access tag, I want to start incorporating my logic. The next condition I include allows users with the accesslevel security attribute set to high to read blobs with a blob index tag of access_level equal to low or medium. I also allow them to read blobs tagged with high if it’s between 9AM and 5PM EST.

# Condition that allows users within scope of the assignment to access medium and low tagged data if they have a custom 
# security attribute of accesslevel set to high. High data can also be read within working hours
OR
(
 (
   # If the action being performed doesn't match blobs/read then result in true and allow access
   !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read'} AND !SubOperationMatches{'Blob.List'})
 )
 OR 
 (
   # If the blob has an index tag of access_level with a value of medium or low allow the user access if they have a custom security
   # attribute of organization_accesslevel set to high
   @Resource[Microsoft.Storage/storageAccounts/blobServices/containers/blobs/tags:access_level<$key_case_sensitive$>] ForAnyOfAnyValues:StringEquals {'medium', 'low'}
   AND
   @Principal[Microsoft.Directory/CustomSecurityAttributes/Id:organization_accesslevel] StringEquals 'high'
 )
 OR
 (
   # If the blob has an index tag of access_level with a value of high allow the user access if they have a custom security
   # attribute of organization_accesslevel set to high and it's within working hours
   @Resource[Microsoft.Storage/storageAccounts/blobServices/containers/blobs/tags:access_level<$key_case_sensitive$>] ForAnyOfAnyValues:StringEquals {'high'}
   AND
   @Principal[Microsoft.Directory/CustomSecurityAttributes/Id:organization_accesslevel] StringEquals 'high'
   AND
   @Environment[UtcNow] DateTimeGreaterThan '2025-06-09T12:00:00.0Z'
   AND
   @Environment[UtcNow] DateTimeLessThan '2045-06-09T21:00:00.0Z'
 )
)

Next up is users with medium access level. These users are granted access to data tagged medium or low.

# Condition that allows users within scope of the assignment to access medium and low tagged data if they have a custom 
# security attribute of accesslevel set to medium
OR
(
  (
    # If the action being performed doesn't match blobs/read then result in true and allow access
    !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read'} AND !SubOperationMatches{'Blob.List'})
  )
  OR 
  (
    # If the blob has an index tag of access_level with a value of medium or low allow the user access if they have a custom security
    # attribute of organization_accesslevel set to medium
    @Resource[Microsoft.Storage/storageAccounts/blobServices/containers/blobs/tags:access_level<$key_case_sensitive$>] ForAnyOfAnyValues:StringEquals {'medium', 'low'}
    AND
    @Principal[Microsoft.Directory/CustomSecurityAttributes/Id:organization_accesslevel] StringEquals 'medium'
 )
)

Finally, I allow users with low access level to access data tagged as low.

# Condition that allows users within scope of the assignment to access low tagged data if they have a custom 
# security attribute of accesslevel set to low
OR
(
 (
   # If the action being performed doesn't match blobs/read then result in true and allow access
   !(ActionMatches{'Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read'} AND !SubOperationMatches{'Blob.List'})
 )
 OR 
 (
   # If the blob has an index tag of access_level with a value of low allow the user access if they have a custom security
   # attribute of organization_accesslevel set to low
   @Resource[Microsoft.Storage/storageAccounts/blobServices/containers/blobs/tags:access_level<$key_case_sensitive$>] ForAnyOfAnyValues:StringEquals {'low'}
   AND
   @Principal[Microsoft.Directory/CustomSecurityAttributes/Id:organization_accesslevel] StringEquals 'low'
 )
)

Notice how I separated each condition using OR. If the first condition resolves to false, then the next condition is evaluated until access is granted or all conditions are exhausted. Neat right?

Summing it up

So why should you care about this if its use case is so limited? Well, you should care because that is ABAC’s use case today, and it will likely be expanded in the future. Furthermore, ABAC allows you to be more granular in how you grant access to data in Azure Storage (again, blob or queue only). You likely have use cases where this can provide another layer of security to further constrain a security principal’s access. You’ll also see these conditions used in Microsoft’s products such as AI Foundry.

The other reason it’s helpful to understand the language used for conditions is that conditions are expanding into other services such as Azure RBAC Delegation (which, if you aren’t using, you should be). While the language can be complex, it does make sense once you muck around with it a bit.

A final bit of guidance here: don’t try to write conditions by hand. Use the visual builder in the Azure Portal as seen below. It will help you get some basic conditions in place that you can further modify directly via the code view.

Azure Portal Condition Builder

Next time you’re locking down an Azure storage account, think about whether or not you can further restrict humans and non-humans alike based on the attributes discussed today. The main places I’ve seen this used are for user profiles, further restricting user access to specific subsets of data (similar to the one I walked through above), or even adding an additional layer of network security baked directly into the role assignment itself.

See you next post!

Simple Patterns for Chatting with Your Data – Using the Microsoft public backbone

Hello again folks! Recently, I’ve been working with some far more intelligent peers (such as my buddy Jose Medina Gomez, you should definitely check out his repos because he has some awesome stuff there) on getting some new-to-Azure customers up and running in the GenAI (generative AI) space. Specifically, these customers had some custom data they wanted LLMs (large language models) to reason over and answer questions about. This called for using a RAG (retrieval-augmented generation) pattern to provide an LLM access to an external knowledge base. I thought it would be helpful to other folks out there like myself who are new to this world to document some simple patterns for doing this type of thing that keep security in mind. I’ll cover these over a few posts, with this being the first.

The Pattern

The first pattern I want to cover is what I call the “Microsoft public backbone” pattern. This pattern is ideal for customers with minimal to no Azure presence who need something up and running quickly with some basic security guardrails. The pattern looks like what you see below:

Microsoft-backbone Pattern

The key benefits of this pattern are:

  • All traffic between Microsoft PaaS (platform-as-a-service) services flows over the Microsoft public backbone and the organization’s application communicates with the services over the Microsoft public backbone.
  • All Microsoft PaaS services use the built-in service firewall to control inbound traffic.
  • Microsoft PaaS services that support outbound network controls use those controls to mitigate the risk of data exfiltration.
  • Authentication between each component uses Entra ID based authentication and Azure RBAC authorization.
  • Minimizes costs by choosing more affordable SKUs where possible.
  • Captures logs where available.

What I like about this pattern is it’s super simple to get up and running (it can take less than one hour) and provides decent security controls with minimal headache. It’s by no means a production-ready pattern for reasons I’ll discuss further in this post, but for a quick proof-of-concept or for getting your feet wet with RAG-like patterns, this is a great choice.

I’ll now spend a few moments providing detail to each of the benefits I outlined above.

The Benefits and Considerations

Benefit 1: All traffic between Microsoft PaaS (platform-as-a-service) services flows over the Microsoft public backbone and the organization’s application communicates with the services over the Microsoft public backbone

Simplicity is the name of the game here folks. By keeping all communication on the Microsoft public backbone you avoid the complexities of Private Link integration. For organizations that are new to Azure and don’t have a platform landing zone (hybrid connectivity, network inspection, Internet egress support for Azure resources, DNS forwarding for Private Link) this pattern can be done without much effort. As an added benefit, PaaS to PaaS stays on the Microsoft public backbone, providing you with the security controls Microsoft provides across their public backbone.

Benefit 2: All Microsoft PaaS services use the built-in service firewall to control inbound traffic

Almost all (I’m sure there are some exceptions that aren’t top of mind) Microsoft PaaS services provide a basic built-in firewall I refer to as the service firewall. The service firewall is off by default, but it can be toggled on to restrict inbound traffic to the public endpoint of the PaaS (which every PaaS has). Most commonly (every PaaS service seems to work a bit differently) you can create exceptions to the firewall based on IP address or allow “trusted” Microsoft services to bypass the firewall. Additionally, Azure Storage has its own capability that allows you to configure specific resource instances to bypass the firewall based on the resource instance identifier and its managed identity.

The “trusted” Microsoft service exception needs a bit more explaining. Most Azure PaaS services (again, there are always exceptions because Microsoft loves its snowflakes) have a checkbox in the Portal with text like what you see in the screenshot below. This checkbox allows traffic from a specific set of Azure services (identified by their public IP addresses) to bypass the service firewall. Today, this will be a box you will often need to check whenever you are doing PaaS to PaaS. The key thing to understand about this checkbox is that it covers all public IPs associated with whatever “trusted” services the specific product group identifies. This means instances not owned by you could be allowed to bypass the service firewall (making authentication and authorization critical). Thankfully, the upcoming Network Security Perimeters feature will likely address this gap and make this box a thing of the past.

Trusted services bypass option
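To make those service firewall settings concrete, here’s a minimal sketch using the azure-mgmt-storage Python package that denies public traffic by default, allows a known IP, and checks that trusted services box programmatically. The subscription, resource group, account name, and IP are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    IPRule,
    NetworkRuleSet,
    StorageAccountUpdateParameters,
)

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-guid>")  # placeholder

client.storage_accounts.update(
    resource_group_name="rg-genai-poc",  # placeholder
    account_name="stgenaipoc",           # placeholder
    parameters=StorageAccountUpdateParameters(
        network_rule_set=NetworkRuleSet(
            default_action="Deny",                                  # block public traffic by default
            ip_rules=[IPRule(ip_address_or_range="203.0.113.10")],  # your office/egress IP
            bypass="AzureServices",                                 # the "trusted Microsoft services" checkbox
        )
    ),
)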

Benefit 3: Microsoft PaaS services that support outbound network controls use those controls to mitigate the risk of data exfiltration

While controlling inbound traffic for a PaaS is typically a Private Endpoint or service firewall (or eventually Network Security Perimeters) use case, controlling outbound traffic tends to be a bit more tricky. For many compute-based services (AKS, App Services, Azure Container Apps, etc) you are able to force outbound traffic through your virtual network, allowing you to get visibility into the traffic and control what that service can make outbound network connections to.

With PaaS services like the ones used in this architecture, these types of virtual network integration aren’t an option. For most non-compute-based PaaS you are essentially SOL (I’ll let you figure out that acronym yourself). However, the services that fall under the Cognitive Services framework (such as the Azure OpenAI Service and AI Services) support outbound traffic controls. You can check out my prior post for the details on those controls. In this architecture we use the Azure OpenAI Service so we can take advantage of those outbound controls.

Restricting outbound access in Cognitive Services

Controlling outbound access from a PaaS will be another place Network Security Perimeters will become the predominant control mechanism.

Benefit 4: Authentication between each component uses Entra ID based authentication and Azure RBAC authorization

In this pattern, Entra ID-based authentication and Azure RBAC authorization are used at each hop for human-to-service and service-to-service communication. Users interacting with these services will use their Entra ID user identities, which are typically synchronized from an on-premises Windows Active Directory. Non-humans (applications and services) will use Entra ID service principals to authenticate to each other. This will either be a standard service principal identified by a client id and client secret (for you AWS folks, this is essentially your IAM User) or a special type of service principal called a managed identity (for those of you coming from AWS, this is as close to an IAM Role as Azure gets).

Azure RBAC roles are assigned with least privilege in mind. Users (or groups) are assigned the minimal permissions they need to upload data to the storage account and perform needed functions with the PaaS to load and query the data. Services are provided the necessary permissions they need to interact.

Benefit 5: Minimizes costs by choosing more affordable SKUs where possible

Costs are already pretty low with this pattern. This pattern minimizes them further by sacrificing the Shared Private Access feature of AI Search. Yeah, you lose the warm fuzzy feeling of the communication between AI Search and Azure Storage or the Azure OpenAI Service happening over Private Link, but you save some money with the more basic SKU and still get the security of the Microsoft public backbone and the service firewalls.

Note that this design choice is made to optimize costs. Performance within the Basic SKU may not be sufficient for your use case.

Benefit 6: Captures logs where available

Finally, let’s look at logging. In this pattern you’ll get your management plane activities (actions on the resources) via Azure Activity Logs and you’ll get data plane (actions on the data held by the resources) activities via diagnostic settings delivering logs to a Log Analytics Workspace.

Each of these resources has a selection of logs available. Some are “ok” (Azure OpenAI Service) and some are “meh” (Azure AI Search). However, you will want all of these logs for both security and operational use.

The Considerations

There can’t only be benefits, right? The major consideration of this pattern is that it’s very much built for proof-of-concept. You get basic network security controls with the service firewall, but no inspection of traffic unless you have an inspection point on-premises in front of the developer. Additionally, before communication from the developer gets to Azure, it will have to traverse the public Internet before it reaches the Microsoft public backbone. While all of your communication will happen over TLS, you don’t get the security benefits of wrapping that encrypted session in an IPSec tunnel or funneling it over a known path, nor the operational benefits of consistent latency that come with ExpressRoute.

Scalability of AI Search is another consideration. The Basic SKU will offer you a limited amount of scale.

On the LLM front, this pattern only allows you to deploy models available within an Azure OpenAI Service (or AI Services) instance (thanks to Jose for highlighting this consideration). There are options to adjust this pattern to use other LLMs, but it will require the introduction of AI Foundry which is quite the beast.

There are likely others I’m missing, but this is still a great little pattern to see what the LLMs can do that comes wrapped with decent security controls and requires minimal coding.

Loading Data

So you’ve decided that the benefits and considerations make sense to you and you want to move ahead, or maybe you’re just dipping your toes into this world and you want to muck around with things. Now you’re left wondering, “Ok I set this thing up like you documented above, how the heck do I use it?”

Alrighty, I’m going to show you the quick and dirty way. Do not assume the way I’m going to show you is the only way to do this pattern. There are lots of variations, especially in how you chunk and load the data into AI Search. My advice to you in that department would be to work with the data folks at your organization or engage a Microsoft solutions architect on the optimal way to chunk and load your data. Do it wrong, and the responses from the LLMs will be crappy. After watching my buddy Jose and many of his peers, it’s very much an art form that requires experience and experimentation.

For the less experienced folks like myself, there is a built-in wizard within AI Search that helps to chunk and vectorize the data. If you open the Azure Portal you’ll see an option called Import and Vectorize as seen in the screenshot below.

The easy button

Clicking that option will open up the wizard (yes, Microsoft still loves its wizards). On the first screen you’ll select the Azure Blob Storage option. On the next screen you’ll configure the options below. If you’ve set things up as I’ve outlined them in the initial pattern diagram (RBAC and network controls), this will work like a champ (don’t forget to deploy a chat model like gpt-4o and an embedding model like text-embedding-3-large to the AOAI (Azure OpenAI Service) instance). I’m assuming you already created a container in the Azure Storage account and uploaded data (like some PDFs). I’ve found this useful when referencing and consuming RFCs to confirm my understanding.

Here you’re specifying that the AI Search instance grab the data you’ve uploaded to the blob container using its system-assigned managed identity.

Connect to your data

The next screen provides us with the options to vectorize (or create embeddings for) our data. We can then use AI Search to query both the text-based chunks and the vectors to optimize the results we return to the LLM. Here I’m selecting to use an embedding model deployed to an Azure OpenAI instance. In more advanced scenarios you may choose to incorporate other embeddings you’ve built yourself or sourced from the model marketplace and deployed to AI Foundry (thanks to Jose for mentioning this).

I also select to use the deployed text-embedding-3-large embedding model and am again using the AI Search managed identity to call the Azure OpenAI service to create the embeddings.

Vectorize your text

The data I’m using (10K financial reports) doesn’t have any images so I ignore the Vectorize and enrich your images option.

Finally, I opt to use the semantic ranker (great article on this) to improve the results of my queries to AI Search and leave the other options as the default since this is a one time operation. If you are doing this regularly, getting a good data pipeline in place (either push or pull) is mission critical (another learning from my buddy Jose). Someone smarter than me can help you with that.

Review the settings and opt to create the indexer. The full run will only take a few minutes if you don’t have a ton of content. For larger data volumes, get yourself some coffee and work on something else while you wait. If you have any failures at this step, it will likely be that you don’t have the networking controls set up correctly. Review the image I posted at the beginning of this post and get busy with the resource logs (it’s good experience!).

Indexer in progress

Once it’s complete, you’ll see a screen like this if you select the indexer that was created. It will show you how many of the docs it pulled from the container and how many were successfully indexed.

Successful run. Yay!

Next you can navigate to the index and run a test search. Here you’ll get back the relevant records and you can muck around with direct searches against the index to get a feel for the structure of the chunked data. If you don’t get any responses it’s likely an RBAC or networking issue. For RBAC, ensure you granted yourself both management plane (Search Service Contributor) and data plane (Search Index Data Reader or Contributor) roles.

Directly searching chunked data

Chatting with your data

Alright, your data has been pulled into AI Search. How do you go about extending this knowledge base to the LLM? There are a ton of ways to do it, but for something quick and dirty, I’m a fan of either writing some simple Python code or using the Chat Playground via your Azure OpenAI Services or AI Services instance. For this blog, I’m going to be lazy and focus on the latter.
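That said, if you do want to go the Python route, the call looks roughly like the sketch below. I’m using the openai package’s AzureOpenAI client with Entra ID authentication and the data_sources extension body; the endpoint, index name, and api-version are placeholders, and the exact payload shape varies by api-version (older versions used a different schema), so treat this as a starting point rather than gospel.

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-instance>.openai.azure.com",  # placeholder
    azure_ad_token_provider=token_provider,
    api_version="2024-02-01",  # use an api-version that supports data_sources
)

response = client.chat.completions.create(
    model="gpt-4o",  # your chat model deployment name
    messages=[{"role": "user", "content": "How many shares did Microsoft buy back in 2024?"}],
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": "https://<your-search-instance>.search.windows.net",  # placeholder
                    "index_name": "<your-index-name>",  # placeholder
                    # Hybrid/semantic query types and the embedding deployment can be
                    # layered on here; check the On Your Data reference for your api-version.
                    "authentication": {"type": "system_assigned_managed_identity"},
                },
            }
        ]
    },
)

print(response.choices[0].message.content)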

For this you’ll want to navigate to the AOAI instance and select the “Explore Azure AI Foundry portal” link. No, this isn’t actually AI Foundry; it’s the rebranded (and standardized) Azure OpenAI Playground incorporated into a Foundry-like interface.

Entering the Azure AI Foundry portal

Once you enter the new portal you’ll be dropped into the Chat playground. Here you’ll want to use the Add your data link and then the Add a data source link as seen below.

Importing index to Chat Playground

On the next screen I choose to add the index I created earlier during my data import, also choosing to use vector-based searches to improve the quality of the search results returned to the LLM. This is where the embedding model I deployed earlier comes into play, as seen in the image below.

Adding data source

On the next screen I opt to do a hybrid + semantic search to ensure I get the best results out of a typical keyword search, vector search, and semantic search. A default semantic search configuration was created for you when you imported the data into AI Search.

Data management screen

Lastly, I choose to use the system-assigned managed identity of the AOAI instance when calls are made from the AOAI instance to AI Search. This is where the Azure RBAC assignments I show in the original diagram come into play. Any missing permissions on the managed identity will pop up for you here during the validation stage.

Data Connection screen

After saving and closing I’m good to go! In the chat window I can ask a question such as “How many shares did Microsoft buy back in 2024?” The LLM optimizes my query for AI Search, creates vector-based embeddings of my question, performs the hybrid and semantic search against the AI Search instance, summarizes the results, and returns them to the Chat Playground.

Chat with your data process

Below you see the answer to my question with citations back to the original chunked data in AI Search. Cool shit right?

LLM results

If you’re just dipping your toes into this world or you’re an organization validating that the Azure platform’s AI Services can do what you need them to do before you invest heavily into the platform, this is a great pattern to mess around with. It’s super easy to get up and running, doesn’t require a deep understanding of Azure, and still provides foundational security controls that every organization should have in place. All this in a quick few hours of work.

In upcoming posts I’ll showcase some variations of this pattern such as the incorporation of Private Link and using CoPilot Studio as a frontend to build a quick and simple Teams bot using a small variation of this pattern (this was a really fun one Denis Rougeau, Aga Shirazi, Jose and I have been rolling out to a few customers. Super excited to talk more about that one!).

Until next time!

Azure OpenAI Service – Controlling Outbound Access

Hello again folks! Work has been complete insanity and has been preventing me from posting as of late. Finally, I was able to carve out some time to get a quick blog post done that I’ve been sitting on for a while.

I have blogged extensively on the Azure OpenAI Service (AOAI) and today I will continue with another post in that series. These days you folks are more likely using AOAI through the new Azure AI Services resource versus using the AOAI resource. The good news is the guidance I will provide tonight will be relevant to both resources.

One topic I often get asked to discuss with customers is the “old person” aspects of the service, such as the security controls Microsoft makes available to customers when using the AOAI service. Besides the identity-based controls, the most common security control that pops up in the regulated space is the available networking controls. While inbound network controls exercised through Private Endpoints or the service firewall (sometimes called the IP firewall) are common, one of the often missed controls within the AOAI service is the outbound network controls. Oddly enough, this is often missed across non-compute PaaS.

You may now be asking yourself, “Why the heck would the AOAI service need to make its own outbound network connections, and why should I care about it?” Excellent question, and honestly, not one I thought about much when I first started working with the service because the use cases I’m going to discuss either didn’t exist yet or weren’t commonly used. There are two main use cases I’m going to cover (note there are likely others; these are simply the most common):

The first use case is easily the most common right now. The “chat with your data” feature within AOAI allows you to pass some extra information in your ChatCompletion API call that instructs the AOAI service to query an AI Search index you have populated with data from your environment, extending the model’s knowledge base without completely retraining it. This is essentially a simple way to muck with a retrieval-augmented generation (RAG) pattern without having to write the code to orchestrate the interaction between the two services such as detailed in the link above. Instead, the “chat with your data” feature handles the heavy lifting for you if you have a populated index and are willing to add a few additional lines of code. In a future article I’ll go into more depth on this pattern because understanding the complete network and identity interaction is pretty interesting and often misconfigured. For now, understand the basics of it with the flow diagram below. Here also is some sample code if you want to play around with it yourself.

The second use case is when using a multimodal model like GPT-4 or GPT-4o. These models allow you to pass them other types of data besides text, such as images and audio. When requesting an image be analyzed, you have the option of passing the image as base64 or passing it a URL. If you pass it a URL, the AOAI service will make an outbound network connection to the endpoint specified in the URL to retrieve the image for analysis.

import os

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Client setup added for completeness; the endpoint env var and api-version are illustrative.
client = AzureOpenAI(
    azure_endpoint=os.getenv("AOAI_ENDPOINT"),
    azure_ad_token_provider=get_bearer_token_provider(
        DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
    ),
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    # Model must be a multimodal model
    model=os.getenv("LLM_DEPLOYMENT_NAME"),
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the image"
                },
                {
                    "type": "image_url",
                    # The service fetches this URL itself, which is the outbound call in question
                    "image_url": {"url": "{{SOME_PUBLIC_URL}}"}
                }
            ]
        }
    ],
    max_tokens=100
)

In both of these scenarios the AOAI service establishes a network connection from the Microsoft public backbone to the resource (such as AI Search in scenario 1 or a public blob in scenario 2). Unlike compute-based PaaS (App Services, Functions, Azure Container Apps, AKS, etc), today Microsoft does not provide a means for you to send this outbound traffic from AOAI through your virtual network with virtual network injection or virtual network integration. Given that you can’t pass this traffic through your virtual network, how can you mitigate potential data exfiltration risks or poisoning attacks? For example, let’s say an attacker compromises an application’s code and modifies it such that the “chat with your data” feature uses an attacker’s instance of AI Search to capture sensitive data in the queries or to poison the responses back to the user with bad data. Maybe an attacker decides to use your AOAI instance to process images stolen from another company and placed on a public endpoint. I’m sure someone more creative could come up with a plethora of attacks. Either way, you want to control what your resources communicate with. The positive news is there is a way to do this today, and likely an even better way to do it tomorrow when it comes to the AOAI service.

The AOAI (and AI Services) resources fall under the Cognitive Services framework. The benefit of being within the scope of this framework is they inherit some of its security capabilities. Some examples include support for Private Endpoints and the ability to disable local key-based authentication. Another feature available to AOAI is an outbound network control. On an AOAI or AI Services resource, you can configure two properties to lock down the service’s ability to make outbound network calls. These two properties are:

  • restrictOutboundNetworkAccess – A Boolean; set it to true to block outbound access to everything but the exceptions listed in the allowedFqdnList property
  • allowedFqdnList – A list of FQDNs the service should be able to communicate with for outbound network calls

Using these two controls you can prevent your AOAI or AI Services resource from making outbound network calls except to the list of FQDNs you include. For example, you might whitelist your AI Search instance’s FQDN for the “chat with your data” feature or your blob storage account endpoint for image analysis. This is a feature I’d highly recommend you enable by default on any new AOAI or AI Services resource you provision moving forward.

The good news for those of you in the Terraform world is this feature is available directly within the azurerm provider as seen in a sample template below.

resource "azurerm_cognitive_account" "openai" {
  name                = "${local.openai_name}${var.purpose}${var.location_code}${var.random_string}"
  location            = var.location
  resource_group_name = var.resource_group_name
  kind                = "OpenAI"

  custom_subdomain_name = "${local.openai_name}${var.purpose}${var.location_code}${var.random_string}"
  sku_name              = "S0"

  public_network_access_enabled = var.public_network_access

  # Maps to restrictOutboundNetworkAccess - block outbound calls except to the FQDNs below
  outbound_network_access_restricted = true

  # Maps to allowedFqdnList (e.g. your AI Search or storage account FQDN)
  fqdns = var.allowed_fqdn_list

  network_acls {
    default_action = "Deny"
    ip_rules = var.allowed_ips
    bypass = "AzureServices"
  }

  identity {
    type = "SystemAssigned"
  }

  tags = var.tags

  lifecycle {
    ignore_changes = [
      tags["created_date"],
      tags["created_by"]
    ]
  }
}

If a user attempts to circumvent these controls they will receive a descriptive error stating that outbound access is restricted. For those of you operating in a regulated environment, you should be slapping this setting on every new AOAI or AI Service instance you provision just like you’re doing with controlling inbound access with a Private Endpoint.

Alright folks, that sums up this quick blog post. Let me summarize the lessons learned:

  1. Be aware of which PaaS Services in Azure are capable of establishing outbound network connectivity and explore the controls available to you to restrict it.
  2. For AOAI and AI Services use the restrictOutboundNetworkAccess and allowedFqdnList properties to block outbound network calls except to the endpoints you specify.
  3. Make controlling outbound access of your PaaS a baseline control. Don’t just focus on inbound, focus on both.

Before I close out, recall that I mentioned a way to do this today (the above) and a way in the future. The latter will be the new feature Microsoft announced into public preview, Network Security Perimeters. As that feature matures and its list of supported services expands, controlling inbound and outbound network access for PaaS (and getting visibility into it via logs) is going to get far easier. Expect a blog post on that feature in the near future.

Thanks folks!