Revisiting UDR improvements for Private Endpoints

Hello folks! It's been a busy past few months. I've been neck deep in summer activities, customer work, and building some learning labs for the wider Azure community. I finally had some time today to dig into the NSG and improved routing features for Private Endpoints, which hit GA (general availability) last month. While I wrote about the routing changes when the features were in public preview, I wanted to do a bit more digging now that they're officially GA. In this post I'll take a closer look at the routing changes and try to clear up some of the confusion I've come across about what this feature actually does.

If you work for a company using Azure, you've likely come across Private Endpoints. I've written extensively about the feature over the past few years, covering some of the quirks introduced when using it at scale in an enterprise. I'd encourage you to review some of those other posts if you're unfamiliar with Private Endpoints or you're interested in the challenges that drove feature changes such as the NSG and improved routing features.

At the most basic level, Private Endpoints are a way to control network access to instances of PaaS (platform-as-a-service) services you consume in Microsoft Azure (they can also be used for Private Link services you build yourself). Like most public clouds, every instance of a PaaS service in Azure is by default available over a public IP. While there are some basic layer 3 controls, such as the IP restrictions offered for Azure App Services or the basic firewall that comes with Azure Storage, the service is only accessible directly via its public IP address. From an operations perspective, this can lead to inconsistent performance when users access the services, since the access is over an Internet connection. On the security side of the fence, it can make requirements to inspect and mediate the traffic with full-featured security appliances problematic. There can even be a risk of data exfiltration if you are forced to allow access to the Internet for an entire service (such as *.blob.core.windows.net). Additionally, you may have internal policies driven by regulation that restrict sensitive data to being accessible only within your more heavily controlled private network.

PaaS with no Private Endpoint

Private Endpoints help solve the issues above by creating a network endpoint (virtual network interface) for the instance of your PaaS service inside your Azure VNet (virtual network). This can help provide consistent performance when accessing the application because the traffic can now flow over an ExpressRoute private peering instead of the user's Internet connection. Now that traffic is flowing through your private network, you can direct it to security appliances such as a Palo Alto firewall to centrally mediate, log, and optionally inspect traffic up to and including layer 7. Each endpoint is also tied to a specific instance of a service, which can mitigate the risk of data exfiltration since you can block Internet access to the service as a whole while still reaching your instance over the endpoint.

PaaS with Private Endpoint

While this was possible prior to the new routing improvements that went into GA in August, it was challenging to manage at scale. I cover the challenge in detail in this post, but the general gist is that the Azure networking fabric creates a /32 system route for the endpoint in every subnet within the virtual network where the Private Endpoint is placed, as well as in any directly peered VNets. If you're familiar with the basics of Azure routing, you'll understand how this could be problematic when the traffic needs to be routed through a security appliance for mediation, logging, or inspection. To get around this problem, customers had to create /32 UDRs (user-defined routes) to override these system routes. In a hub and spoke architecture with enough Private Endpoints, this can hit the limit of routes allowed on a route table.

An example of an architecture that historically solved for this is shown below. If you have a user on-premises (A) trying to get to a Private Endpoint in the spoke (H) through the Application Gateway (L), and you have a requirement to inspect that traffic with a security appliance (F, E), you need to create a /32 route on the Application Gateway's subnet to direct the traffic back to the security appliance. If that traffic is instead destined for some other type of service that isn't fronted by an App Gateway (such as a Log Analytics Workspace or an Azure SQL instance), those UDRs need to be placed on the route table of the Virtual Network Gateway (B). The latter scenario is where scale and SNAT (see my other post for detail on this) can quickly become a problem.

Common workaround for inspection of Private Endpoint traffic

To demonstrate the feature, I'm going to use my basic hub and spoke lab with the addition of an App Service running a very basic Python Flask application I wrote to show header and IP information from a web request. I've additionally set up an S2S VPN connection with a pfSense appliance I have running at home, which is exchanging routes via BGP with the Virtual Network Gateway. The resulting lab looks like the diagram below.

Lab environment

Since Microsoft still has no simple way to enumerate effective routes without a VM's NIC sitting in the subnet, and I wanted to see the system routes that the Virtual Network Gateway was getting (az network vnet-gateway list-learned-routes will not show you these), I created a new subnet and plopped a VM into it. Looking at the route table, the /32 route for the Private Endpoint was present.
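For reference, here is roughly how I pulled the effective routes once the VM was in place. This is a minimal sketch; the NIC and resource group names below are placeholders for my lab:

# Dump the effective routes programmed on the test VM's NIC
# (the NIC must be attached to a running VM for this to work)
az network nic show-effective-route-table \
  --resource-group rg-lab \
  --name nic-vm-test \
  --output table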

Private Endpoint /32 route

Since this was temporary and I didn't want to mess with DNS in my on-premises lab, I created a hosts file entry on the on-premises machine for the App Service's FQDN pointing to the Private Endpoint IP address. I then accessed the service from a web browser on that machine. The contents of the web request show the IP address of my machine as expected, because my traffic is entering the Azure networking plane via my S2S VPN and going straight to the Private Endpoint for the App Service.

Request without new Private Endpoint features turned on

As I covered earlier, prior to these new features being introduced, to get this traffic going through my Azure Firewall instance I would have had to create a /32 UDR on the Virtual Network Gateway's route table, and I would have had to SNAT at the firewall to ensure traffic symmetry (the SNAT component is covered in a prior post). The new feature lifts the requirement for the /32 route, but in a very interesting way.

The golden rule of networking has long been that the most specific route is the preferred route. For example, in Azure the /32 system route for the Private Endpoint will be the preferred route even if you put in a static route for the subnet's CIDR block (a /24 for example). The new routing feature for Private Endpoints does not follow this rule, as we'll see.

Support for NSGs and routing improvements for Private Endpoints is disabled by default. Each subnet in a VNet has a property called privateEndpointNetworkPolicies, which is set to disabled by default. Swapping this property from disabled to enabled turns on the new features. One thing to note is you only have to enable this on the subnet containing the Private Endpoint.
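If you prefer the CLI over the portal, flipping the property looks something like the below. This is a sketch using the same parameter I show later in this series; the subnet name matches my lab, while the VNet and resource group names are placeholders:

az network vnet subnet update \
  --disable-private-endpoint-network-policies false \
  --name snet-app \
  --resource-group rg-lab \
  --vnet-name vnet-workload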

In my lab environment I flipped the property on the snet-app subnet in the workload VNet. Looking back at the route table for the VM in the transit virtual network, we now see that the /32 route has been marked invalid. The /16 route pointing all traffic destined for the workload VNet at the Azure Firewall is now the route the traffic will take, which allows me to mediate and optionally inspect the traffic.

Route table after privateEndpointNetworkPolicies property enabled on Private Endpoint subnet

Refreshing the web page from the on-premises VM now shows a source IP of 10.0.2.5, which is one of the IPs in the Azure Firewall subnet. Take note that I have an application rule in place in Azure Firewall, which means it uses its transparent proxy feature to ensure traffic symmetry. If I had a network rule in place instead, I'd have to ensure Azure Firewall is SNATing my traffic (which it won't do by default for RFC 1918 traffic). While some services (Azure Storage being one of them) will work without SNAT with Private Endpoints, it's best practice to SNAT since all other services require it. The requirement will likely be addressed in a future release.

Request with new routing features enabled

While the support for NSGs for Private Endpoints is awesome, the routing improvements are a feature that shouldn’t be overlooked. Let me summarize the key takeaways:

  • Routing improvements (the docs call it UDR support, which I think is a poor and confusing description) for Private Endpoints are officially generally available.
  • SNAT is still required for most services, and is a best practice across the board, to ensure return traffic from Private Endpoints takes the same route back to the user.
  • The privateEndpointNetworkPolicies property only needs to be set on the subnet containing the Private Endpoints. The routing improvements will then be active for those Private Endpoints for any route table assigned to a subnet within the Private Endpoint’s VNet or any directly peered VNets.
  • Even though the /32 route is still there, it is now invalidated by a less specific UDR when this setting is enabled on a Private Endpoint's subnet. You could create a UDR for the subnet CIDR containing the Private Endpoints or for the entire VNet as I did in this lab. Remember this is an exception to the route specificity rule.

Well folks, that sums up this post. Hopefully you got some value out of it!

VirtualNetwork Service Tag and Network Security Groups

Hello fellow geeks!

Earlier this week I was messing around with Kubernetes, SSHing into the nodes, and I ran into an interesting quirk of NSGs (Network Security Groups). I noticed that traffic I did not expect to be allowed through the NSG was making it through. A bit of digging led me down the path of a documented, but not well known, behavior of the VirtualNetwork service tag when used in NSG security rules. Today I'm going to walk through that behavior, why you should care, and what you can do to avoid being surprised like I was.

NSGs are layer 4 stateful firewalls that operate at the SDN (software-defined networking) layer. They serve a similar purpose and function in much the same way as AWS Security Groups. NSGs are used for microsegmentation within and across Virtual Networks, typically supplementing the centralized control provided by a security appliance such as Azure Firewall or a Palo Alto firewall. They are associated with a subnet (best practice) or a NIC (network interface) (there are few use cases for this). Each contains a collection of security rules, which includes default rules and user-defined rules. NSG security rules are processed by priority and are matched based on a 5-tuple.

As described in the previous link, service tags can be used within NSG security rules to simplify access to Azure resources. Service tags contain a summarized list of IPs that is managed by Microsoft. This makes life far easier, because whitelisting the IPs for something like Azure Storage in NSG rules would be a nightmarish task that would require customer-created automation to keep the rules up to date as IPs are added to or removed from the underlying service. The benefit of service tags does come with a consideration, as we'll see in this post.

Each subnet or NIC can have one NSG applied to it, but an NSG can be applied to multiple subnets or NICs. When NSGs are applied at both the subnet and NIC, the processing for inbound traffic is detailed here and for outbound traffic here.

Now that you know the basics of NSGs, let me talk a bit about the lab. For this lab I used my simple hub and spoke lab with a few modifications. I added an Ubuntu VM running in the application subnet (snet-app) in the workload spoke virtual network. I've also temporarily removed the UDR from the custom route table on the application subnet. The NSG applied to that subnet contains only the default NSG rules. The lab architecture can be seen below.

Lab environment

Reviewing the NSG applied to the application subnet, the three default inbound rules are present as expected. The rule I’m going to look more deeply at is the AllowVnetInBound rule highlighted below. Specifically, I’m going to show you how to look at the IPs behind a service tag.

Default Inbound NSG Security Rules

To see the IPs associated with a service tag, I'm going to use the Effective security rules tool in Azure's Network Watcher. If you're unfamiliar with Network Watcher, you're missing out. It contains a plethora of useful tools to help diagnose network connectivity. The Effective security rules tool looks at the NSGs applied to a NIC at both the subnet and NIC level to provide you with a holistic view of what traffic is allowed by the combination of NSGs applied at each level.

Effective security rules tool in Network Watcher
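You can pull the same data from the CLI if that's more your speed. A quick sketch, with the NIC and resource group names as placeholders for my lab; the JSON output includes the expanded address prefixes behind each service tag in the effective rules:

# List the effective (combined subnet + NIC) security rules for the VM's NIC
az network nic list-effective-nsg \
  --resource-group rg-lab \
  --name nic-vm-app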

One of the lesser known features of the tool is it gives you the ability to look at the IPs included within a service tag for a specific NSG security rule. In the image below you will see that the IPs included in the VirtualNetwork service tag are the workload virtual network IP range (10.2.0.0/16), the peered transit virtual network IP range (10.0.0.0/16), and the Azure “magic IP” 168.63.129.16. This is likely what you expected to see in the VirtualNetwork tag.

VirtualNetwork service tag contents without UDR

Remember when I said I removed the UDR for the default route from the custom route table applied to the application subnet? I then added that route back in, pointed it to the Azure Firewall, waited about 2 minutes, then re-ran the Effective security rules tool.

VirtualNetwork service tag contents with UDR of default route

My first reaction to seeing all IP addresses now allowed through the VirtualNetwork tag was pretty much the Scanners head explosion GIF (classic if you haven’t seen it). It turns out this behavior is documented. The VirtualNetwork service tag has the following explanation:

The virtual network address space (all IP address ranges defined for the virtual network), all connected on-premises address spaces, peered virtual networks, virtual networks connected to a virtual network gateway, the virtual IP address of the host, and address prefixes used on user-defined routes. This tag might also contain default routes.

https://docs.microsoft.com/en-us/azure/virtual-network/service-tags-overview#available-service-tags

The part of that excerpt you need to care about is the piece stating that the tag includes the address prefixes used on user-defined routes. This means the prefixes in the UDRs you place on a custom route table applied to the subnet are added to the VirtualNetwork service tag in the NSG security rules used by the NSGs applied to your resource. I'm not sure why this behavior was implemented, but it can impact separation of duties, where you'd have a networking team managing the routing within route tables and a security team managing which traffic is allowed in or out with NSGs. If someone has control over the route tables, they can influence the VirtualNetwork service tag prefixes, which in turn influences the behavior of the default NSG security rules and any others using that tag.

If you're like me, your first level of panic was around the risk of this allowing traffic from the public Internet inbound to the resource if the resource had a public IP. You can rest easy: my testing showed this is not possible, even with an additional UDR in place to ensure a symmetric flow for traffic coming in directly via the public IP. It's likely Microsoft is doing some type of filtering at the SDN layer that excludes traffic identified as sourced from the Internet from matching this security rule.

It gets more interesting when you use the IP Flow Verify tool in Network Watcher. Here I picked a random public IP and tested an inbound flow. The tool reports the flow as being allowed by the default AllowVnetInBound rule. Take note of this behavior because it could lead to confusion with your Information Security team or third-party auditors.

IP Flow Verify showing flow is allowed
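The same check can be scripted if you want to reproduce it. A rough sketch using the Network Watcher CLI; the VM name, resource group, and addresses are placeholders, with 203.0.113.10 standing in for the random public IP:

# Ask Network Watcher whether an inbound flow from a public IP would be allowed
az network watcher test-ip-flow \
  --resource-group rg-lab \
  --vm vm-app \
  --direction Inbound \
  --protocol TCP \
  --local 10.2.0.4:443 \
  --remote 203.0.113.10:60000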

The second level of panic I had was that this rule would allow any endpoint that has connectivity to my Virtual Network (such as other Virtual Networks attached as spokes to the hub Virtual Network) full connectivity to the endpoints behind the NSG. This concern is legitimate and was the reason I originally went down the rabbit hole. Traffic from a VM in the Shared Services Virtual Network is allowed full network connectivity to the VM in the application subnet, since the VirtualNetwork service tag now includes all IPv4 addresses (note this traffic was allowed through the Azure Firewall).

So why should you care about any of this? You should care because the programmed behavior of adding prefixes from UDRs to the VirtualNetwork service tag means those with control over the custom route tables (typically the networking team) have the ability to affect which traffic is allowed through an NSG whenever an NSG security rule uses the VirtualNetwork service tag. From a separation of duties perspective, this is far from optimal. Additionally, since most hub and spoke architectures use a UDR with a default route of 0.0.0.0/0, unless you have a user-defined deny security rule in place, you are affected by this. Lastly, it goes to show that tools such as IP Flow Verify, which evaluate the SDN rule set, can produce confusing results.

Thankfully there are some great ways to mitigate this risk. You could use Azure Policy to audit, deny, or remediate NSGs that are deployed without a default deny rule. There are some great examples of remediation in the community GitHub. Funneling workload-to-workload and user-to-workload traffic through a security appliance such as Azure Firewall running in the transit Virtual Network is another great risk mitigator. Lastly, tightly controlling access to your route tables and limiting use of the VirtualNetwork service tag are other options.
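As an example of that first mitigation, a low-priority catch-all deny rule beneath your explicit allow rules removes the reliance on the default AllowVnetInBound rule. A minimal sketch, with the NSG and resource group names as placeholders:

# Explicit deny-all inbound rule; anything you want to allow needs its own
# rule at a higher priority (lower number) than this one
az network nsg rule create \
  --resource-group rg-lab \
  --nsg-name nsg-snet-app \
  --name DenyAllInbound \
  --priority 4000 \
  --direction Inbound \
  --access Deny \
  --protocol '*' \
  --source-address-prefixes '*' \
  --source-port-ranges '*' \
  --destination-address-prefixes '*' \
  --destination-port-ranges '*'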

Well folks, that wraps up this post. Hopefully the information was useful and you can leverage some of it to more tightly secure your Azure environment.

Have a great week!

A look at the Azure DNS Private Resolver

10/12/22 Update – Private Resolver is now Generally Available!

Hello again!

Today I'm going to cover the new Azure DNS Private Resolver feature that recently went into public preview. I've written extensively about Azure DNS in the past, and I recommend reading through that series if you're new to the platform. The service has grown significantly in importance in Azure architectures due to its role in name resolution for Private Endpoints. A common pain point for customers using Private Endpoints from on-premises is the requirement to have a VM in Azure capable of acting as a DNS proxy. This is explained in detail in this post. The Azure DNS Private Resolver seeks to ease that pain by providing a managed DNS solution capable of acting as a DNS proxy and conditional forwarder to facilitate hybrid DNS resolution (for those of you coming from AWS, this is Azure's equivalent of the Route 53 Resolver). Alexis Plantin beat me to the punch and put together a great write-up on the basics of the feature, so my focus will instead be on some additional scenarios and a pattern that I tested and validated.

I'm a big fan of keeping infrastructure services such as DNS centralized and under the management of central IT. This is one reason I'm partial to a landing zone with a dedicated shared services virtual network attached to the transit virtual network, as illustrated in the image below. In this shared services virtual network you put your DNS, patching/update infrastructure, and potentially identity services such as Windows Active Directory. The virtual network and its resources can then be dropped into a dedicated subscription and locked down to central IT. As an added bonus, keeping the transit virtual network dedicated to firewalls and virtual network gateways makes the eventual migration to Azure Virtual WAN that much easier.

Common landing zone design

The design I had in mind would place the Private Resolver in the shared services virtual network and would funnel all traffic to and from the resolver and on-premises or another spoke through the firewall in the transit virtual network. This way I could control the conversation, inspect the traffic if needed, and centrally log it. The lab environment I built to test the design is pictured below.

Lab environment

The first question I had was whether the inbound endpoint would obey the user-defined routes in the custom route table I associated with the inbound endpoint's subnet. To test this, I made a DNS query from the VM running in spoke 2 to resolve an A record in a Private DNS Zone. This Private DNS Zone was linked only to the virtual network where the Private Resolver lives. If the inbound endpoint wasn't capable of obeying the custom routes, the return traffic would be dropped and my query would fail.
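On the spoke 2 VM the test itself is just a simple lookup against the resolver's inbound endpoint. A sketch of what that looks like; the record name and the inbound endpoint IP below are made-up placeholders for my lab:

# Query an A record in the Private DNS Zone directly against the inbound endpoint
dig @10.3.0.68 vm01.privatezone.local +short

If the spoke VNet's DNS servers are pointed at the inbound endpoint, a plain dig or nslookup without the @ works the same way.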

Result of query from VM in another spoke

Success! The inbound endpoint is returning traffic back through the firewall. Logs on the firewall confirm the traffic flowing through.

Firewall logs showing DNS traffic from spoke

Next I wanted to see if traffic from the outbound endpoint would obey the custom routes. To test this, I configured a DNS forwarding rule (the conditional forwarding component of the service) to send all DNS queries for jogcloud.com back to the domain controller running in my on-premises lab. I then performed a DNS query from the VM running in spoke 2.
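The query itself looked something like the below; the host name is a made-up placeholder and the inbound endpoint IP is the same placeholder as before. The resolver matches the jogcloud.com rule and forwards the query out the outbound endpoint, through the firewall, to the on-premises domain controller:

# Resolve an on-premises record via the resolver's conditional forwarding rule
dig @10.3.0.68 dc01.jogcloud.com +short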

Firewall logs showing DNS traffic to on-premises

Success again as the query was answered! The traffic from the outbound endpoint is seen traversing the firewall on its way to my domain controller on-premises. This confirmed that both the inbound and outbound endpoints obey custom routing making the design I presented above viable.

Beyond the above, I also confirmed the Private Resolver is capable of resolving reverse lookup zones (for PTR records). I was happy to see reverse zones weren’t forgotten.

One noticeable gap today is that the Private Resolver does not yet offer DNS query logging. If that is important to you, you may want to retain your existing DNS proxy. If you happen to be using Azure Firewall, you could make use of its DNS Proxy feature, which allows for logging of DNS queries. Azure Firewall could then be configured to use the Private Resolver as its resolver, providing the conditional forwarding capability that Azure Firewall's DNS Proxy feature lacks.

That wraps up this post.

Thanks!

Private Endpoints Revisited: NSGs and UDRs

Update September 2022 – Route summarization and NSGs are now generally available for Private Endpoints!

Welcome back fellow geeks!

It's been a while since my last post. For the past few months I've been busy renewing some AWS certifications and putting together some Azure networking architectures on GitHub. A new post was long overdue, so I thought it would be fun to circle back to Private Endpoints yet again. I've written extensively about the topic over the past few years, yet there always seems to be more to learn.

There have historically been two major pain points with Private Endpoints: routing complexity when trying to inspect traffic to Private Endpoints, and a lack of NSG (network security group) support. Late last year Microsoft announced a public preview of features to help with the routing and to add support for NSGs. I typically don't bother tinkering with features in public preview because they often change before GA (general availability) or never make it to GA. Now that these two features are further along and likely close to GA, it was finally a good time to experiment with them.

I built out a simple lab with a hub and spoke architecture. There were two spoke VNets (virtual networks). The first spoke contained a single subnet with a VM, a private endpoint for a storage account, and a private endpoint for a Key Vault. The second spoke contained a single VM. Within the hub I had two subnets. One subnet contained a VM which would be used to route traffic between spokes, and the other contained a VM for testing routing changes. All VMs ran Ubuntu. Private DNS zones were defined for both Key Vault and blob storage and linked to all VNets to keep DNS simple for this test case.

Lab setup

Each spoke subnet had a custom route table assigned with a route to the other spoke, using the VM acting as a router in the hub as the next hop.

Once the lab was set up, I had to enable the preview features in the subscription I wanted to test with. This was done using the az feature command.

az feature register --namespace Microsoft.Network --name AllowPrivateEndpointNSG

The feature took about 30 minutes to finish registering. You can use the command below to track the registration progress. While the feature is registering the state will report as Registering, and when complete the state will be Registered.

az feature show --namespace Microsoft.Network --name AllowPrivateEndpointNSG

Once the feature was ready to go, I first decided to test the new UDR feature. As I've covered in a prior post, creation of a private endpoint in a VNet creates a /32 route for the private endpoint's IP address in both the VNet it is provisioned into and any directly peered VNets. This can be problematic when you need to route traffic coming from on-premises through a security appliance, like a Palo Alto firewall running in Azure, to inspect the traffic and perform IDS/IPS. Since Azure has historically selected the most specific prefix match for routing, and the GatewaySubnet in the hub would contain the /32 routes, you would be forced to create unique /32 UDRs (user-defined routes) for each private endpoint. This creates a lot of overhead and even risks hitting the maximum of 400 UDRs per route table.

With the introduction of these new features, Microsoft has made it easier to deal with the /32 routes. You can now create a more summarized UDR, and it will take precedence over the more specific system route. Yes folks, I know this is confusing. Personally, I would have preferred Microsoft had gone the route of a toggle switch that disables the /32 route from propagating into peered VNets. However, we have what we have.

Let’s take a look at it in action. The image below shows the effective routes on the network interface associated with the second VM in the hub. The two /32 routes for the private endpoints are present and active.

Effective routes for second VM in the hub

Before the summarized UDR can be added, you need to set a property on the subnet containing the Private Endpoint. Each subnet in a virtual network has a property named PrivateEndpointNetworkPolicies. When a private endpoint is created in a subnet this property is set to disabled, and it needs to be set to enabled. This is done by setting the --disable-private-endpoint-network-policies parameter to false as seen below.

az network vnet subnet update \
  --disable-private-endpoint-network-policies false \
  --name snet-pri \
  --resource-group rg-demo-routing \
  --vnet-name vnet-spoke-1

I then created a route table, added a route for 10.1.0.0/16 with a next hop of the router at 10.0.0.4, and assigned the route table to the second VM's subnet. The effective routes on the network interface for the second VM now show the two /32s as invalid, with the new route now active.
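For those following along, that step looks roughly like this. The prefix and next hop match my lab; the route table, hub VNet, and subnet names are placeholders:

az network route-table create \
  --resource-group rg-demo-routing \
  --name rt-hub-test

# Send spoke 1 traffic (10.1.0.0/16) to the router VM at 10.0.0.4
az network route-table route create \
  --resource-group rg-demo-routing \
  --route-table-name rt-hub-test \
  --name udr-spoke-1-via-router \
  --address-prefix 10.1.0.0/16 \
  --next-hop-type VirtualAppliance \
  --next-hop-ip-address 10.0.0.4

# Associate the route table with the second VM's subnet
az network vnet subnet update \
  --resource-group rg-demo-routing \
  --vnet-name vnet-hub \
  --name snet-test \
  --route-table rt-hub-test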

Effective routes for second VM in the hub after routing change

After kicking off a wget against the storage account's blob endpoint behind the private endpoint, a quick tcpdump on the router shows the traffic flowing through the router as we defined in our routes.

tcpdump of router
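If you want to run the capture yourself, something along these lines on the router VM does the trick (the interface name and the private endpoint's IP are placeholders):

sudo tcpdump -ni eth0 'tcp port 443 and host 10.1.0.5'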

For fun, let’s try that same wget on the Key Vault private endpoint.

Uh-oh. Why am I not getting back a 404, and why am I not seeing the other side of the conversation on the router? If you guessed asymmetric routing, you'd be spot on! To fix this I would need to set up iptables on my router to NAT (network address translation) the traffic to the router's address. The reason attaching a route table to the spoke 1 subnet containing the private endpoints wouldn't work is that private endpoints do not honor UDRs. I imagine you're scratching your head asking why it worked with the storage account and not with Key Vault. Well folks, it's because Microsoft does something funky at the SDN (software-defined networking) layer for storage that is not done for any other service's private endpoints. I bring this up because I wasted a good hour scratching my head as to why this was working without NAT until I came across that buried note in the Microsoft documentation. So take this nugget of knowledge with you: storage account private endpoint networking works differently from all other PaaS service private networking in Azure. Sadly, when things move fast, architectural standards tend to be one of the things that fall into the "we'll get back to that in a later release" bucket.
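For completeness, the NAT fix on a Linux router is only a couple of lines of iptables. This is a sketch under the assumption the router forwards out eth0 and spoke 1 is 10.1.0.0/16; MASQUERADE rewrites the source to the router's own address, which keeps the return traffic symmetric:

# Make sure the VM is actually forwarding packets
sudo sysctl -w net.ipv4.ip_forward=1

# SNAT traffic destined for spoke 1 to the router's address
sudo iptables -t nat -A POSTROUTING -o eth0 -d 10.1.0.0/16 -j MASQUERADE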

So wonderful, the pain of /32 routes is gone! Sure, we still need to NAT because private endpoints still don't honor UDRs attached to their subnet, but the pain is far less than it was with the /32 mess. One thing to take away from this change is that there is now an exception to Azure routing precedence: when it comes to routing to private endpoints, UDRs take precedence over the system route even if the system route is more specific.

Now let's take a look at NSG support. I next created an NSG and associated it with the private endpoint's subnet. I added a deny rule blocking all HTTPS traffic from the VM in spoke 2.
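The deny rule looked roughly like this; the NSG name and the spoke 2 prefix are placeholders for my lab, while the resource group matches the earlier commands:

# Block HTTPS from spoke 2 to anything in the private endpoint subnet
az network nsg rule create \
  --resource-group rg-demo-routing \
  --nsg-name nsg-snet-pri \
  --name DenyHttpsFromSpoke2 \
  --priority 100 \
  --direction Inbound \
  --access Deny \
  --protocol Tcp \
  --source-address-prefixes 10.2.0.0/16 \
  --destination-address-prefixes '*' \
  --destination-port-ranges 443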

NSG applied to private endpoint subnet

Running a wget from the VM in spoke 2 to the blob storage endpoint on the private endpoint in spoke 1 still returns the file successfully. The NSG does not take effect simply by registering the preview feature; the PrivateEndpointNetworkPolicies property I mentioned above must also be set to enabled.

After setting the property, the change took about a minute or two to complete. Once complete, running another wget from the VM in spoke 2 failed to make a connection, validating that the NSG is working as expected.

NSG blocking connection

One thing to be aware of is that NSG flow logs will not log these connections at this time. Hopefully this will be worked out by GA.

Well folks, that's it for this post. The key takeaways are the following:

  • Testing these features requires registering the feature on the subscription AND setting the PrivateEndpointNetworkPolicies property to enabled on the subnet. Keep in mind setting this property is required for both UDR summarization and NSG support. (Thank you to my peer Silvia Wibowo for pointing out that it is also required for UDR summarization.)
  • NAT is still required to ensure symmetric routing when traffic is coming from on-premises or another spoke. The only exception is private endpoints for Azure Storage, because storage operates differently at the SDN layer.
  • The UDR feature for private endpoints makes less-specific UDRs take precedence over the more specific private endpoint system route.
  • NSG and summarized UDR support for private endpoints are still in public preview and are not recommended for production until GA.
  • NSG Flow Logs do not log connection attempts to private endpoints at this time.

See you next post!