A look at the Azure DNS Private Resolver

Hello again!

Today I’m going to cover the new Azure DNS Private Resolver feature that recently went into public preview. I’ve written extensively about Azure DNS in the past and I recommend reading through that series if you’re new to the platform. It has grown to be significantly important in Azure architectures due to its role in name resolution for Private Endpoints. A common pain point for customers using Private Endpoints from on-premises is the requirement to have a VM in Azure capable of acting as a DNS proxy. This is explained in detail in this post. The Azure DNS Private Resolver seeks to ease that pain by providing a managed DNS solution capable of acting as a DNS proxy and conditional forwarder facilitating hybrid DNS resolution (for those of you coming from AWS, this is Azure’s Route 53 Resolver). Alexis Plantin beat me to the punch and put together a great write-up on the basics of the feature so my focus instead be on some additional scenarios and a pattern that I tested and validated.

I’m a big fan of keeping infrastructure services such as DNS centralized and under the management of central IT. This is one reason I’m partial to a landing zone with a dedicated shared services virtual network attached to the transit virtual network as illustrated in the image below. In this shared services virtual network you put your DNS, patching/update infrastructure, and potentially identity services such as Windows Active Directory. The virtual network and its resources can then be dropped into a dedicated subscription and locked down to central IT. Additionally, as an added bonus, keeping the transit virtual network dedicated to firewalls and virtual network gateways makes the eventual migration to Azure Virtual WAN that must easier.

Common landing zone design

The design I had in mind would place the Private Resolver in the shared services virtual network and would funnel all traffic to and from the resolver and on-premises or another spoke through the firewall in the transit virtual network. This way I could control the conversation, inspect the traffic if needed, and centrally log it. The lab environment I built to test the design is pictured below.

Lab environment

The first question I had was whether or not the inbound endpoint would obey the user defined routes in the custom route table I associated with the inbound endpoint subnet. To test this theory I made a DNS query from the VM running in spoke 2 to resolve an A record in a Private DNS Zone. This Private DNS Zone was only linked to the virtual network where the Private Resolvers were. If the inbound endpoint wasn’t capable of obeying the custom routes, then the return traffic would be dropped and my query would fail.

Result of query from VM in another spoke

Success! The inbound endpoint is returning traffic back through the firewall. Logs on the firewall confirm the traffic flowing through.

Firewall logs showing DNS traffic from spoke

Next I wanted to see if traffic from the outbound endpoint would obey the custom routes. To test this, I configured a DNS forwarding rule (conditional forwarding component of the service) to send all DNS queries for jogcloud.com back to the domain controller running in my lab. I then performed a DNS query from the VM running in spoke 2.

Firewall logs showing DNS traffic to on-premises

Success again as the query was answered! The traffic from the outbound endpoint is seen traversing the firewall on its way to my domain controller on-premises. This confirmed that both the inbound and outbound endpoints obey custom routing making the design I presented above viable.

Beyond the above, I also confirmed the Private Resolver is capable of resolving reverse lookup zones (for PTR records). I was happy to see reverse zones weren’t forgotten.

One noticeable gap today is the Private Resolver does not yet offer DNS query logging. If that is important to you, you may want to retain your existing DNS Proxy. If you happen to be using Azure Firewall, you could make use of the DNS Proxy feature which allows for logging of DNS queries. Azure Firewall could then be configured to use the Private Resolver as its resolver providing that conditional forward capability Azure Firewall’s DNS Proxy feature lacks.

That wraps up this post. Keep in mind that this feature is still in public preview, so do not go deploying it into production. I ran into some bugginess when I initially deployed it where it locked up a virtual network after a failed deployment, so you have been warned. Fingers crossed the existing capabilities make it to GA and DNS query logging comes in the future.

Thanks!

Private Endpoints Revisited: NSGs and UDRs

Private Endpoints Revisited: NSGs and UDRs

Welcome back fellow geeks!

It’s been a while since my last post. For the past few months I’ve been busy renewing some AWS certificates and putting together some Azure networking architectures on GitHub. A new post was long overdue, so I thought it would be fun to circle back to Private Endpoints yet again. I’ve written extensively about the topic over the past few years, yet there always seems more to learn.

There have historically been two major pain points with Private Endpoints which include routing complexity when trying to inspect traffic to Private Endpoints and a lack of NSG (network security groups) support. Late last year Microsoft announced in public preview a feature to help with the routing and support for NSGs. I typically don’t bother tinkering with features in public preview because the features often change once GA (generally available) or never make it to GA. Now that these two features are further along and likely close to GA, it was finally a good time to experiment with them.

I built out a simple lab with a hub and spoke architecture. There were two spoke VNets (virtual networks). The first spoke contained a single subnet with a VM, a private endpoint for a storage account, and a private endpoint for a Key Vault. The second spoke contained a single VM. Within the hub I had two subnets. One subnet contained a VM which would be used to route traffic between spokes, and the other contained a VM for testing routing changes. All VMs ran Ubuntu. Private DNS zones were defined for both Key Vault and blob storage and linked to all VNets to keep DNS simple for this test case.

Lab setup

Each spoke subnet had a custom route table assigned with a route to the other spoke set with a next hop as the VM acting as a router in the hub.

Once the lab was setup, I had to enable the preview features in the subscription I wanted to test with. This was done using the az feature command.

az feature register --namespace Microsoft.Network --name AllowPrivateEndpointNSG

The feature took about 30 minutes to finish registering. You can use the command below to track the registration process. While the feature is registering the state will report as registering, and when complete the state will be registered.

az feature show --namespace Microsoft.Network --name AllowPrivateEndpointNSG

Once the feature was ready to go, I first decided to test the new UDR feature. As I’ve covered in a prior post, creation of a private endpoint in a VNet creates a /32 route for the private endpoint’s IP address in both the VNet it is provisioned into as well as any directly peered VNets. This can be problematic when you need to route traffic coming from on-premises through a security appliance like a Palo Alto firewall running in Azure to inspect the traffic and perform IDS/IPS. Since Azure has historically selected the most specific prefix match for routing, and the GatewaySubnet in the hub would contain the /32 routes, you would be forced to create unique /32 UDRs (user-defined routes) for each private endpoint. This can created a lot of overhead and even risked hitting the maximum of 400 UDRs per route table.

With the introduction of these new features, Microsoft has made it easier to deal with the /32 routes. You can now create a more summarized UDR and that will take precedence over the more specific system route. Yes folks, I know this is confusing. Personally, I would have preferred Microsoft had gone the route of a toggle switch which would disable the /32 route from propagating into peered VNets. However, we have what we have.

Let’s take a look at it in action. The image below shows the effective routes on the network interface associated with the second VM in the hub. The two /32 routes for the private endpoints are present and active.

Effective routes for second VM in the hub

Before the summarized UDR can be added, you need to set a property on the subnet containing the Private Endpoint. There is a property on each subnet in a virtual network named PrivateEndpointNetworkPolicies. When a private endpoint is created in a subnet this property is set disabled. This property needs to be set to enabled. This is done by setting the –disable-private-endpoint-network-policies parameter to false as seen below.

az network vnet subnet update \
  --disable-private-endpoint-network-policies false \
  --name snet-pri \
  --resource-group rg-demo-routing \
  --vnet-name vnet-spoke-1

I then created a route table, added a route for 10.1.0.0/16 with a next hop of the router at 10.0.0.4, and assigned the route table to the second VM’s subnet. The effective routes on the network interface for the second VM now show the two /32s as invalid with new route now active.

Effective routes for second VM in the hub after routing change

A quick tcpdump on the router shows the traffic flowing through the router as we have defined in our routes.

tcpdump of router

For fun, let’s try that same wget on the Key Vault private endpoint.

Uh-oh. Why aren’t I getting back a 404 and why am I not seeing the other side of conversation on the router? If you guessed asymmetric routing you’d be spot on! To fix this I would need to setup the iptables on my router to NAT (network address translation) to the router’s address. The reason attaching a route table to the spoke 1 subnet the private endpoints wouldn’t work is because private endpoints do not honor UDRs. I imagine you’re scratching your head asking why it worked with the storage account and not with Key Vault? Well folks, it’s because Microsoft does something funky at the SDN (software defined networking) layer for storage that is not done for any other service’s private endpoints. I bring this up because I wasted a good hour scratching my head as to why this was working without NAT until I came across that buried issue in the Microsoft documentation. So take this nugget of knowledge with you, storage account private endpoint networking works differently from all other PaaS service private networking in Azure. Sadly, when things move fast, architectural standards tend to be one of the things that fall into the “we’ll get back to that on a later release” bucket.

So wonderful, the pain of /32 routes is gone! Sure we still need to NAT because private endpoints still don’t honor UDRs attached to their subnet, but the pain is far less than it was with the /32 mess. One thing to take from this gain is there is a now a disclaimer to Azure routing precedence. When it comes to routing with private endpoints, UDRs take precedence over the system route even if the system route is more specific.

Now let’s take a look at NSG support. I next created an NSG and associated it to the private endpoint’s subnet. I added a deny rule blocking all https traffic from the VM in spoke 2.

NSG applied to private endpoint subnet

Running a wget on the VM in spoke 2 to the blob storage endpoint on the private endpoint in spoke 1 returns the file successfully. The NSG does not take effect simply by enabling the feature. The PrivateEndpointNetworkPolicies property I mentioned above must be set to enabled.

After setting the property, the change takes about a minute or two to complete. Once complete, running another wget from the VM in spoke 2 failed to make a connection validating the NSG is working as expected.

NSG blocking connection

One thing to be aware of is NSG flow logs will not log the connection at this time. Hopefully this will be worked out by GA.

Well folks that’s it for this post. The key things you should take aware are the following:

  • Testing these features requires registering the feature on the subscription AND setting the PrivateEndpointNetworkPolicies property to enabled on the subnet. Keep in mind setting this property is required for both UDR summarization and enabling NSG support. (Thank you to my peer Silvia Wibowo for pointing out that it is also required for UDR summarization).
  • NAT is still required to ensure symmetric routing when traffic is coming from on-premises or another spoke. The only exception are private endpoints for Azure Storage because it operates different at the SDN.
  • The UDR feature for private endpoints makes less-specific UDRs take precedence over the more specific private endpoint system route.
  • NSG and summarized UDR support for private endpoints are still in public preview and are not recommended for production until GA.
  • NSG Flow Logs do not log connection attempts to private endpoints at this time.

See you next post!

A Deep Dive Into Azure Route Server

Hello again fellow geeks.

I recently had a customer reach out to me with an interest in learning more about ARS (Azure Route Server). The customer hoped it might ease some of the burden of managing routing across a large Azure implementation. I had yet to mess around with ARS since it only recently went GA (generally available), so I figured this would be a great opportunity to build out a lab and give it a whirl. Let’s get to it!

ARS is one of Microsoft’s new networking offerings in Azure offering a managed routing service that is hooked into Azure’s SDN (software defined network). The way I like to think of the service is a couple of VMs (virtual machines) that are managed by Microsoft, running a BGP (Border Gateway Protocol) service, and which have the ability to program routes directly into VNets (virtual networks). In addition to introducing some pretty cool new networking patterns such as an SD WAN and ExpressRoute pattern and dual-homed network, the feature that most interested my customer was its capability of BGP peering with a customer’s NVA (network virtual appliance) such as a Palo Alto or Cisco appliance and injecting those learned dynamic routes into the VNet. This is the feature which I’ll cover in this blog post.

I did some thinking about how I wanted to lab this out and what I wanted to use as an NVA. Since the behavior I wanted to test was primarily BGP, I figured I’d keep it simple and run a Linux VM which would host a lightweight BGP service. I found a wonderful post from Adam Stuart (seriously a great read and really cool Azure networking pattern in his post) which mentioned Exabgp. After doing a bit of research on it, it looked relatively easy to setup and use so the choice was made (Thanks Adam!) In addition to the NVA, I decided to build out a hub-and-spoke architecture and make use of my home pfSense appliance for S2S VPN (site-to-site virtual private network) connectivity and to replicate an on-premises environment. The result was the lab pictured below (image 1).

Image 1 – Azure Route Server Lab

One thing you may notice in the above image is ARS has a public IP address associated to it. Remember when I mentioned this is a couple of Microsoft-managed VMs? Well this public IP facilitates Microsoft management of the VMs similar to other managed services such as Azure SQL MI (Managed Instance). Before you ask, no you can’t associate an NSG (Network Security Group) to the subnet ARS is provisioned into, and the subnet has to be named RouteServerSubnet.

I setup a S2S VPN connection and BGP peering between my pfSense appliance and Azure VPN Gateway and advertised the set of routes documented in the lab diagram (image 1). I then provisioned a couple of Ubuntu VMs with two in the hub, one in the first spoke, and one in the second spoke.

Exabgp was a bit painful to setup because the documentation out there is a bit sparse and I had to piece it together from multiple sources. Given that, I’m going to spend a few minutes walking through the setup.

I first setup my ARS using the instructions in this link. Ensure you use a valid ASN for your NVA (autonomous system number).

Once ARS is setup, you can begin setting up Exabgp on the VM using the commands below.

sudo apt update
sudo apt install exabgp

You’ll then need to modify the service file located in /lib/systemd/system/exabgp.service to uncomment the lines below.

...
[Service]
#User=exabgp
#Group=exabgp
Environment=exabgp_daemon_daemonize=false
PermissionsStartOnly=true
ExecStartPre=-mkfifo /run/exabgp.in
...

Now you’ll need to create a configuration file for exabgp and save it to /etc/exabgp/exabgp.conf. Below is the configuration file I used for the lab.

neighbor 10.0.2.4 {
        router-id 10.0.0.4;
        local-address 10.0.0.4;
        local-as 65010;
        peer-as 65515;

        static {
                route 192.168.10.0/24 next-hop 10.0.0.4;
        #       route 192.168.1.0/24 next-hop 10.0.0.4 as-path [65010 65010] community [65010:2];
                route 192.168.1.0/24 next-hop 10.0.0.4 community [65010:2];
                route 10.1.0.0/16 next-hop 10.0.0.4;
                route 10.10.0.0/16 next-hop 10.0.0.4;
                route 0.0.0.0/0 next-hop 10.0.0.4;
        }
}
neighbor 10.0.2.5 {
        router-id 10.0.0.4;
        local-address 10.0.0.4;
        local-as 65010;
        peer-as 65515;

        static {
                route 192.168.10.0/24 next-hop 10.0.0.4;
        #       route 192.168.1.0/24 next-hop 10.0.0.4 as-path [65010 65010] community [65010:2];
                route 192.168.1.0/24 next-hop 10.0.0.4 community [65010:2];
                route 10.1.0.0/16 next-hop 10.0.0.4;
                route 10.10.0.0/16 next-hop 10.0.0.4;
                route 0.0.0.0/0 next-hop 10.0.0.4;
        }
}

Each instance of ARS is configured to be highly available and is deployed across availability zones if the region supports it. It comes with two BGP peering IPs you’ll need to peer with and advertise your routes to. If you only peer one with or advertise different routes, you’ll get funky behaviors. In the configuration file above, each JSON object represents one of these peers. You’ll notice I’m advertising a number of routes which will demonstrate different behaviors I’ll walk through with this post.

Once you’ve done the above you can start the service and validate that it successfully started.

sudo systemctl start exabgp
sudo systemctl status exabgp

Give it a minute and then you can run the following command to output the routes the exabgp service has learned from ARS.

sudo exabgpcli show adj-rib in extensive

In the output below (image 2) you can see ARS advertising the address space of the hub and spoke1 Vnet.

Image 2 – Exabgp output

So why isn’t spoke 2 being advertised? Well ARS requires the peerings between the VNets to be configured with UseRemoteGateway and AllowGatewayTransit properties (referred to in the Portal as Use the remote virtual network’s gateway or Route Server and Use this virtual network’s gateway or Route Server) in order for ARS to propagate routes between the Vnets. As you’ll note from the lab image, I enabled these settings for spoke 1 but not spoke 2.

This requirement does present a challenge in that when it’s enabled ARS propagates routes to the peered Vnets, but so does your ExpressRoute or VPN Gateway. My customer base typically requires all traffic leaving the spoke to flow through a security appliance. If you do not enable this setting the spoke only knows about itself and the hub VNet. To direct the traffic through an appliance in a hub you only need two UDRs (one for 0.0.0.0/0 and one for the hub VNet address space). This makes it easy to stamp out spokes from a networking perspective and optionally audit and enforce with Azure Policy.

Looking at the effective routes of the VM in spoke 1 (image 3), you can see routes highlighted in red are routes being propagated into the spoke by my VPN Gateway which is receiving them from my pfSense appliance. To ensure traffic between on-premises and Azure flows through a security appliance you’d need to define all of the routes you’re propagating over your ExpressRoute/VPN in your NVA. This way ARS propagates them to the peered VNets and overrides the ones coming in from the gateway.

Image 3 – Spoke 1 Effective Routes

Recall from my lab image (image 1), I’m sending 192.168.0.1/24, 192.168.2.0/24, 192.168.3.0/24, and 192.168.4.0/24 over BGP to Azure. In the image above (image 3) you’ll notice three out of four of these routes are coming from my VPN Gateway (10.0.3.5), but 192.168.0.1/24 is coming from Exabgp VM (10.0.0.4). The documentation says that when two routes for the same address space but different AS PATH lengths are received from different NVAs, ARS will only program the route with the shorter AS PATH. Apparently this isn’t just a trait of ARS and is a trait of the Azure SDN.

To validate this is the case, I edited my Exabgp config file and appended a longer AS PATH length to the route as seen below.

Exabgp config with modified AS PATH

After about a minute, the route table on my spoke 1 VM now displays the route for 192.168.0.1 coming from the VPN Gateway because the route coming from my Exabgp now has a longer AS PATH. This indicates that when two routes for the same address space are advertised, the route with the shortest AS PATH gets programmed to the VNet while the route with the longer AS PATH is seemingly discarded. I would have expected to see the 192.168.0.1/24 coming from on-premises with an Active value of Invalid, but instead it’s discarded.

Image 4 – VPN Gateway with shorter AS PATH

The next route I want to look at it is 10.1.0.0/16 which is the address space of spoke 1. I configured the Exabgp VM to advertise this route to ARS. Looking at the effective routes for the VM in the hub (VM-DEMO) the only route for this address space is the system route for the peering. This is because system routes for the VNet, VNet peering, or service endpoints are the preferred routes, even if BGP routes are more specific. This means you can’t use ARS to push routes that would force traffic from a spoke to flow through an NVA in the hub because of the VNet peering system route. For that use case it looks like you will need to continue to use UDRs (user defined routes)

I have the default route of 0.0.0.0/0 which I am advertising from the Exabgp VM. You can see in the above images is propagating to both the hub and spoke. Take note that if you advertise this route, you’re going to want to ensure you place a route table and UDR on the subnet your NVAs are in. Otherwise you’ll run into a scenario where you’ll have a routing loop because the default route would be received by the NVAs subnet.

Let’s pull the VPN Gateway into the mix. ARS does support BGP peering with an ExpressRoute or VPN Gateway. This allows you to propagate the routes ARS is learning from the NVA back on-premises. You enable this functionality by enabling the Branch-to-branch feature of ARS. In my lab environment, I commented out the default route in my Exabgp conf file because I didn’t want to send that back on-premises because I don’t know how to filter out routes in FRRoute (the BGP service running on pfSense). I then ran an az network vnet-gateway list-learned-routes and confirmed my VNG is now receiving routes from ARS as seen in the image below (image 5).

Image 5 – VPN Gateway learned routes

On the pfSense side I was now able to see the routes coming from my NVA coming over the VPN Gateway and back my on-premises appliance. These are the routes with an AS PATH of 65510 65515 65010 in the image below (image 6).

Image 6 – pfSense learned routes

The above shows that the routes coming from the NVA are being propagated back on-premises. What about the routes coming from on-premises? Are they being propagated all the way to the NVA? Let’s check it out!

To verify this I printed the routes ARS is advertising to the NVA about using az network routeserver peering list-advertised-routes. The routes in bold are the routes coming from the pfSense appliance showing that ARS is receiving the routes and advertising them to the Exabgp VM.

RouteServiceRole_IN_0:
- asPath: '65515'
  localAddress: 10.0.2.4
  network: 10.0.0.0/16
  nextHop: 10.0.2.4
  origin: Igp
  weight: 0
- asPath: '65515'
  localAddress: 10.0.2.4
  network: 10.1.0.0/16
  nextHop: 10.0.2.4
  origin: Igp
  weight: 0
- asPath: 65515-65510-65501
  localAddress: 10.0.2.4
  network: 192.168.4.0/24
  nextHop: 10.0.2.4
  origin: Igp
  weight: 0
- asPath: 65515-65510-65501
  localAddress: 10.0.2.4
  network: 192.168.2.0/24
  nextHop: 10.0.2.4
  origin: Igp
  weight: 0
- asPath: 65515-65510-65501
  localAddress: 10.0.2.4
  network: 192.168.3.0/24
  nextHop: 10.0.2.4
  origin: Igp
  weight: 0

Lastly, I wanted to see the routes and details on the Exabpg VM. For that I ran the command below on the VM.

sudo exabgpcli show adj-rib in extensive
Image 7 – Exabgp learned routes

In the image above (image 7) the on-premises routes for the 192.168.0 address spaces have been learned and have the AS PATH leading back on-premises. Note the community value of 65517:65517 being attached to the routes. That is not coming from my pfSense appliance but rather the VPN Gateway. The community tag is used to identify these routes as coming from a VPN Gateway and are used by Microsoft for filtering.

Well folks that about wraps it up. The key takeaways of my time with ARS were the following:

  • ARS requires the UseRemoteGateway and AllowGatewayTransit properties be configured on VNet peering. This makes managing traffic flow between an spoke and on-premises a bit more complicated. Instead of defining a 0.0.0.0/0 route and calling it a day, you would need to define all the routes in your NVA and propagate them using ARS.

    This isn’t necessarily a bad thing, you’re just shifting management of routing outside the Azure control plane and into the NVA data plane. That may be your preference.
  • When multiple routes for the same address space and different length AS PATHs are propagated to a VNet, the Azure SDN will only program the route the shortest AS PATH. The other route with the longer AS PATH is discarded.
  • You can’t use an NVA and ARS to propagate the hub and spoke VNet address spaces back on-premises because system routes for the VNet, VNet peering, and Service Endpoints supersede routes propagated via BGP. Instead, you could use a larger summarized route encompassing the entire address space for the VNets in your region and propagate that back on-premises using the NVA and ARS.
  • You can use an NVA and ARS to propagate routes from Azure back to on-premises. A use case might be that you want all Internet-bound traffic to egress out of Azure because you get better performance than you’re getting from your ISP on-premises. I’ve never seen it, but nevertheless. 🙂

Hopefully some of the above helps you become more familiar with Azure Route Server and what the benefits and considerations are.

See you next post!

Behavior of Azure Event Hub Network Security Controls

Behavior of Azure Event Hub Network Security Controls

Welcome back fellow geeks.

In this post I’ll be covering the behavior of the network security controls available for Azure Event Hub and the “quirk” (trying to be nice here) in enforcement of those controls specifically for Event Hub. This surfaced with one of my customers recently, and while it is publicly documented, I figured it was worthy of a post to explain why an understanding of this “quirk” is important.

As I’ve mentioned in the past, my primary customer base consists of customers in regulated industries. Given the strict laws and regulations these organizations are subject to, security is always at the forefront in planning for any new workload. Unless you’ve been living under a rock, you’ve probably heard about the recent “issue” identified in Microsoft’s CosmosDB service, creatively named ChaosDB. I’ll leave the details to the researchers over at Wiz. Long story short, the “issue” reinforced the importance of practicing defense-in-depth and ensuring network controls are put in place where they are available to supplement identity controls.

You may be asking how this relates to Event Hub? Like CosmosDB and many other Azure PaaS (platform-as-a-service) services, Event Hub has more than one method to authenticate and authorize to access to both the control plane and the data plane. The differences in these planes are explained in detail here, but the gist of it is the control plane is where interactions with the Azure Resource Manager (ARM) API occurs and involves such operations as creating an Event Hub, enabling network controls like Private Endpoints, and the like. The data plane on the other hand consists of interactions with the Event Hub API and involves operations such as sending an event or receiving an event.

Azure AD authentication and Azure RBAC authorization uses modern authentication protocols such as OpenID Connect and OAuth which comes with the contextual authorization controls provided by Azure AD Conditional Access and the granularity for authorization provided by Azure RBAC. This combination allows you to identify, authenticate, and authorize humans and non-humans accessing the Event Hub on both planes. Conditional access can be enforced to get more context about a user’s authentication (location, device, multi-factor) to make better decisions as to the risk of an authentication. Azure RBAC can then be used to achieve least privilege by granting the minimum set of permissions required. Azure AD and Azure RBAC is the recommended way to authenticate and authorize access to Event Hub due to the additional security features and modern approach to identity.

The data plane has another method to authenticate and authorize into Event Hubs which uses shared access keys. There are few PaaS services in Azure that allow for authentication using shared access keys such as Azure Storage, CosmosDB, and Azure Service Bus. These shared keys are generated at the creation time of the resource and are the equivalent to root-level credentials. These keys can be used to create SAS (shared access signatures) which can then be handed out to developers or applications and scoped to a more limited level of access and set with a specific start and expiration time. This makes using SAS a better option than using the access keys if for some reason you can’t use Azure AD. However, anyone who has done key management knows it’s an absolute nightmare of which you should avoid unless you really want to make your life difficult, hence the recommendation to use Azure AD for both the control plane and data plane.

Whether you’re using Azure AD or SAS, the shared access keys remain a means to access the resource with root-like privileges. While access to these keys can be controlled at the control plane using Azure RBAC, the keys are still there and available for use. Since the usage of these keys when interacting with the data plane is outside Azure AD, it means conditional access controls aren’t an option. Your best bet in locking down the usage of these keys is to restrict access at the control plane as to who can retrieve the access keys and use the network controls available for the service. Event Hubs supports using Private Endpoints, Service Endpoint, and the IP-based restrictions via what I will refer to as the service firewall.

If you’ve used any Azure PaaS service such as Key Vault or Azure Storage, you should be familiar with the service firewall. Each instance of a PaaS service in Azure has a public IP address exposes that service to the Internet. By default, the service allows all traffic from the Internet and the method of controlling access to the service is through identity-based controls provided by the supported authentication and authorization mechanism.

Access to the service through the public IP can be restricted to a specific set of IPs, specific Virtual Networks, or to Private Endpoints. I want quickly to address a common point of confusion for customers; when locking down access to a specific set of Virtual Networks such as available in this interface, you are using Service Endpoints. This list should be empty because there are few very situations where you are required to use a Service Endpoint now because of the availability of Private Endpoints. Private Endpoints are the strategic direction for Microsoft and provide a number of benefits over Service Endpoints such as making the service routable from on-premises over an ExpressRoute or VPN connection and mitigating the risk of data exfiltration that comes with the usage of Service Endpoints. If you’ve been using Azure for any length of time, it’s worthwhile auditing for the usage of Service Endpoints and replacing them where possible.

Service Firewall options

Now this is where the “quirk” of Event Hubs comes in. As I mentioned earlier, most of the Azure PaaS services have a service firewall with a similar look and capabilities as above. If you were to set the the service firewall to the setting above where the “Selected Networks” option is selected and no IP addresses, Virtual Networks, or Private Endpoints have been exempted you would assume all traffic is blocked to the service right? If you were talking about a service such as Azure Key Vault, you’d be correct. However, you’d be incorrect with Event Hubs.

The “quirk” in the implementation of the service firewall for Event Hub is that it is still accessible to the public Internet when the Selected Networks option is set. You may be thinking, well what if I enabled a Private Endpoint? Surely it would be locked down then right? Wrong, the service is still fully accessible to the public Internet. While this “quirk” is documented in the public documentation for Event Hubs, it’s inconsistent to the behaviors I’ve observed in other Microsoft PaaS services with a similar service firewall configuration. The only way to restrict access to the public Internet is to add a single entry to the IP list.

Note in public documentation

So what does this mean to you? Well this means if you have any Event Hubs deployed and you don’t have a public IP address listed in the IP rules, then your Event Hub is accessible from the Internet even if you’ve enabled a Private Endpoint. You may be thinking it’s not a huge deal since “accessible” means open for TCP connections and that authentication and authorization still needs to occur and you have your lovely Azure AD conditional access controls in place. Remember how I covered the shared access key method of accessing the Event Hubs data plane? Yeah, anyone with access to those keys now has access to your Event Hub from any endpoint on the Internet since Azure AD controls don’t come into play when using keys.

Now that I’ve made you wish you wore your brown pants, there are a variety of mitigations you can put in place to mitigate the risk of someone exploiting this. Most of these are taken directly from the security baseline Microsoft publishes for the service. These mitigations include (but are not limited to):

  • Use Azure RBAC to restrict who has access to the share access keys.
  • Take an infrastructure-as-code approach when deploying new Azure Event Hubs to ensure new instances are configured for Azure AD authentication and authorization and that the service firewall is properly configured.
  • Use Azure Policy to enforce Event Hubs be created with correctly configured network controls which include the usage of Private Endpoints and at least one IP address in the IP Rules. You can use the built-in policies for Event Hub to enforce the Private Endpoint in combination with this community policy to ensure Event Hubs being created include at least one IP of your choosing. Make sure to populate the parameter with at least one IP address which could be a public IP you own, a non-publicly routable IP, or a loopback address.
  • Use Azure Policy to audit for existing Event Hubs that may be publicly available. You can use this policy which will look for Event Hub hubs with a default action of Allow or Event Hubs with an empty IP list.
  • Rotate the access keys on a regular basis and whenever someone who had access to the keys changes roles or leaves the organization. Note that rotating the access keys will invalidate any SAS, so ensure you plan this out ahead of time. Azure Storage is another service with access keys and this article provides some advice on how to handle rotating the access keys and the repercussions of doing it.

If you’re an old school security person, not much of the above should be new to you. Sometimes it’s the classic controls that work best. 🙂 For those of you that may want to test this out for yourselves and don’t have the coding ability to leverage one of the SDKs, take a look at the Event Hub add-in for Visual Studio Code. It provides a very simplistic interface for testing sending and receiving messages to an Event Hub.

Well folks, I hope you’ve found this post helpful. The biggest piece of advice I can give you is to ensure you read documentation thoroughly whenever you put a new service in place. Never assume one service implements a capability in the same way (I know, it hurts the architect in me as well), so make sure you do your own security testing to validate any controls which fall into the customer responsibility column.

Have a great week!