Last year Microsoft announced the general availability of Private Endpoint support for App Services. This was a feature I was particularly excited about because when combined with regional Vnet integration, it offered an alternative to an ASE (App Service Environment) for customers who wanted to run internally-facing applications or who didn’t feel comfortable exposing their publicly facing application directly via a public IP.
Over the past year I created a few different labs that demonstrated these features including a web app and function app. Each lab was built around a hub and spoke architecture where the application was serving as an internally-facing application. Given my recent post on Azure Firewall Premium, and the fact I still had the lab environment up and running, I thought it would be interesting to switch the virtual machine out and switch App Services in, which resulted in the lab environment pictured below.
I established the following goals for the lab environment:
Perform IDPS on traffic to the web app
Mediate and inspect traffic initiated from the web app to third-party APIs
Expose a web application to the Internet without giving it a direct public IP address
To accomplish these goals I was able to re-use much of the pattern I described in my last post.
The first step in the process was to create a very simple hub and spoke architecture as pictured above. The environment would consist of a transit Vnet (virtual network), shared services Vnet, and spoke Vnet. The transit Vnet would contain the guts of the solution with an Azure Firewall Premium SKU instance and Azure Application Gateway v2 instance. In the shared services Vnet I used a Windows Server VM (virtual machine) running the DNS Server service and providing DNS resolution for the environment. I could have taken a shortcut here and used Azure Firewall’s DNS proxy capability, but I like to leave the options open for conditional forwarding which Azure Firewall and Azure Private DNS do not support at this time. Finally, in the spoke Vnet I created two subnets. One subnet hosted the private endpoint for App Services and the other subnet was delegated to App Services to support regional Vnet integration.
Once all components were in place, I deployed a simple Python web application I wrote. All it does it queries two public APIs, one for the current time and the other for a random Breaking Bad quote (greatest show of all time!). I took the cheating way out and deployed the app directly via Visual Studio using the Azure App Service add-in prior to deploying the private endpoint and regional Vnet integration capabilities. This allowed me to test the app to validate it was still working as intended while also allowing me to deploy the app from my home machine. Deploying a private endpoint for app services locks down not only access to the running application but also to the SCM (source control management) endpoint that is used for deploying code to the app.
The next step was to create the Azure Private DNS Zone for app services which is named privatelink.azurewebsites.net. I then linked this zone to my shared services Vnet and setup the DNS server running in that Vnet with a standard forward to 220.127.116.11. This setup allows DNS queries made to the Private DNS zone to be resolved by the DNS server. I’ve written extensively about how DNS works with Azure Private Link so I won’t go into any more detail on that flow. Last step in the DNS process is to configure each Vnet to use the DNS server IP in its DNS Server settings. This also needs to configured directly on the Azure Firewall.
Now that the infrastructure would be able to resolve queries to the Private DNS Zone used by App Services, it was time to create the private endpoint for the web app. This registered the appropriate DNS records in the Private DNS Zone for my web app. With the private endpoint created, the last step was to enable Vnet integration.
At this point I had all the guts of my solution and it was time to configure traffic to flow the way I wanted it to flow. Before I could begin work in Azure I needed to do the work outside of Azure. This involved creating an A record in my DNS hosting provider to point the public DNS name of my web app to the AGW’s (Application Gateway) public IP. This name can be whatever you want as long as you provide a certificate to the AGW such that it will identify itself as that name to the user. The AGW configuration is very straightforward and this tutorial will get you most of the way there. One thing to note is you’ll need to set the backend to point to the name given to the app service. This allows you to take advantage of the certificate Microsoft automatically provisions to the given app service.
Since I wanted to route traffic destined for the web app through the Azure Firewall Premium instance, I needed to ensure the AGW trusted the certificate served up by Azure Firewall. This is done by modifying the HTTP setting used in the AGW rule for the app. Here you can upload the root certificate that issued the Azure Firewall Premium intermediate certificate.
Now that the AGW is configured, I needed to create a route table for the subnet the AGW has its private IP address in. In this route table I disabled BGP (border gateway protocol) propagation to ensure the default route since this AGW v2 requires a default route pointing directly to the Internet. Now this is where it gets interesting from a routing perspective. Whenever you create a private endpoint for a service, a system route is added to the route table of all the subnets within the private endpoint’s vnet (virtual network) as well as any vnet the private endpoint’s vnet is peered with. If I want traffic from the AGW to route through the Azure Firewall, I need to override that system route with a /32 UDR. As you can imagine, this can become extremely tedious at scale and even risks hitting the max routes per route table depending on the scale we’re talking about. On the positive side, the issues around this are something Microsoft is aware of so hopefully that means this will be addressed at some point. In the meantime you’ll need to use the /32 route in a pattern such as this.
Azure Firewall Premium needs to be configured to allow traffic from the AGW to the web app and inspect this traffic. You can check out my last post for instructions on configured Azure Firewall to perform these activities.
Excellent, work is done right? Nope! If we influence routing on one side, we have to ensure the other side routes the same way. Toss a route table on the private endpoint subnet you say? Sorry, that isn’t supported my friend. Instead I needed to enable the Azure Firewall instance to SNAT (source NAT) traffic to the spoke Vnet. This ensures routing is symmetric and will eliminate any potential connection issues by creating only the UDR on the AGW subnet.
Lastly, I needed to create and apply a route table to the subnet I delegated to App Services for regional Vnet integration. This route table can be very simple and be configured with a single UDR for the default route pointing to the Azure Firewall private IP. This results in the traffic flow below:
This one was another fun pattern to solve. I got a chance to mess around more with AGW and also get a pattern I had theorized would work actually working. I find implementing the pretty pictures I draw helps drive home the benefits and considerations of such a pattern. With this pattern specifically, it suffers from similar considerations as the pattern in my last post. These include:
Challenges with observability
Operational overhead of certificate management
Possible latency issues depending on latency requirements and traffic patterns
This pattern tacks on more operational overhead by requiring that /32 route be added to the AGW route table each time a new app service is provisioned with a private endpoint. Observability further suffers when tracing a packet flow end-to-end due to the additional layer of SNAT required via Azure Firewall. One thing to note about this is you may be able to get around the SNAT requirement for web-specific traffic because of the transparent proxy functionality behind Azure Firewall application rules. I want to highlight I have not tested this myself and I typically tend to SNAT for this use case in my lab because many of my customers may use similar patterns but with a 3rd party firewall.
So there you go folks, another pattern to add to your inventory. Hopefully we’ll see the /32 issue issue private endpoints resolved sometime in the near future. Have a great weekend!
I recently had a customer that was interested in staying as purely cloud native as possible, which included any centralized firewall that would be in use. Microsoft has offered up Azure Firewall for a while now and it is a great solution if you’re looking for a very basic fully-managed firewall. Here are some of the neater features of the solution:
Unfortunately this basic feature set rarely satisfied the more regulated customer base I tend to work with. Many of these customers went with the full featured security appliances such as those offered by Palo Alto, Fortigate, and the like. One of the largest gaps in Azure Firewall when compared to the 3rd party vendors was the lack DPI (deep packet inspection) and IDS (intrusion detection system) / IPS (intrusion prevention system) capabilities. Microsoft heard the feedback from its customers and back in February of 2021 made the Azure Firewall Premium SKU available in public preview with a collection of features such as TLS (transport layer security) Inspection, IDPS (intrusion detection prevention system), URL filtering, and improved web category filtering. The addition of these capabilities now has made Azure Firewall a much more appealing cloud-native solution.
I had yet to spend any significant time experimenting with the Premium SKU (I make it a habit to not invest a ton of time into preview features). However, this customer gave me the opportunity to dive into the TLS inspection and IDPS capabilities. These capabilities will be the subject of this post and I’ll spend some time describing the architectural pattern I built out and experimented with.
This particular customer had a requirement to perform DPI and IDPS on incoming web-based traffic from the Internet. I asked the customer to provide the control set they needed to satisfy such that we could map those controls to the technical controls available in Azure-native services. The hope was, since this was web-based traffic only and used across multiple regions, we might be able to satisfy all the controls via a WAF (web application firewall) such as Azure Front Door and supplement with layer 7 load balancing with Azure Application Gateway within a given region. The rest of the traffic, non-web, would be delivered to a firewall running in parallel. This pattern is becoming more common place as the WAFs grow in functionality and feature set.
Unfortunately the above pattern was not an option for the customer because they wanted to maintain a centralized funnel for all traffic via a firewall. This is not an uncommon ask. This meant I had to get the traffic coming in from the WAF to funnel through Azure Firewall. For this pattern to work end to end I would need layer 7 load balancing so that meant I needed an Application Gateway as well. The question was do I place the Azure Firewall before or after the Application Gateway? For the answer to this question I went to the Microsoft documentation. Typically the public documentation leaves a lot to be desired when it comes to identifying the benefits and considerations of a particular pattern (oh how I long for the days of Technet-quality documentation), however the documentation around these patterns is stellar.
After quickly reading the benefits and considerations about the two patterns, the decision looked like it was made for me. The pattern where the firewall is placed after the application gateway aligned with my customer’s use case. It specifically covered TLS inspection and IDPS through Azure Firewall Premium. Curious as to why this TLS inspection at Azure Firewall wasn’t mentioned in the other use case where Azure Firewall is placed in front of Application Gateway, I went down a rabbit hole.
My first stop was the public documentation for the Azure Firewall Premium SKU. Since the feature is still public preview there are a fair amount of limitations but none of the limitations that weren’t planned to be fixed by GA (general availability) looked like show stoppers. However, in the section for TLS Inspection, I noticed this blurb, “Azure Firewall Premium terminates outbound and east-west TLS connections.” I reached out to some internal communities within Microsoft and confirmed that “at this time” Azure Firewall isn’t capable of performing TLS Inspection on traffic coming in the public interface. This limitation meant that I had to get the traffic received from the WAF over to the internal interface and the best way to do this was to intake it from the WAF through the Application Gateway. This would be the pattern I’d experiment with.
To keep things simple, I focused on a single region and didn’t include a WAF. Load balancing across regions could be done with the customer’s 3rd party WAF where the WAF would resolve to the appropriate regional Application Gateway v2 instance public IP depending on the load balancing pattern (such as geo-location) the customer was using. Once the traffic is received from the WAF, the Application Gateway terminates the TLS session so that it can inspect the URL and host headers and direct the traffic to the appropriate backend, which in this case was the single web server running IIS (Internet Information Services) in a peered spoke virtual network.
To ensure the traffic leaving the Application Gateway funnels through the Azure Firewall instance, I attached a route table to the Application Gateway. This route table was configured with BGP propagation disabled (to ensure a default route couldn’t be accidentally propagated in) and with a single UDR (user-defined route) which contained the spoke’s virtual network CIDR (Classless Inter-Domain Routing) block with a next hop of the Azure Firewall private IP address. Since UDRs take precedence over system routes, this route would invalidate the system route for the peering. On the web server subnet in the spoke I had a similar route table with a single UDR which contained the transit virtual network CIDR block with a next hop of the Azure Firewall private IP address. This ensured that any communication between the two would flow through the Azure Firewall.
To ensure DNS would resolve, I created an A record in public DNS for sample1.geekintheweeds.com pointing to the public IP address of the Application Gateway. Within Azure, I built a Windows server, installed the DNS service, and created a forward lookup zone for geekintheweeds.com with an A record named sample1 pointing to the web server. Within each virtual network I configured the Windows Server IP address in the DNS Server settings (note that if you do this after you provision Application Gateway, you’ll need to stop and start it). Within the Azure Firewall, I set it to use the server as its DNS server.
Now that the necessary plumbing was in place I needed to put in the appropriate certificates for the Application Gateway, Azure Firewall, and the Web Server. This is where this setup can get ugly. Since this was a lab, I generated all of my certificates from a private CA (certificate authority) I have running in my home lab. Since these CAs are only used for testing the CA issues all certificates without a CDP (certificate revocation distribution point) to keep things simple by avoiding requiring any of those network flows for revocation check lookups. In a production environment you’d want to issue the certificates for the Application Gateway from a trusted public CA so you don’t have to worry about CDPs/OCSP (online certificate status protocol) exposing the endpoints for these flows. For Azure Firewall and the web server you’d be fine using certificates issued by a private CA as long as you ensured appropriate validation endpoints were available.
The Application Gateway and web server will use your standard web server certificate. The Azure Firewall is a different story. To support TLS Interception you’ll need to provide it with an intermediary certificate. This type of certificate allows Azure Firewall to generate certificates on the fly to impersonate the services it’s intercepting traffic for. This link explains the finer details of the requirements. Also note that the certificate needs to be imported to an instance of Key Vault which Azure Firewall accesses using a user-assigned managed identity.
Once the Application Gateway was configured in a similar manner as outlined here, I was good to test. Accessing the server from a machine running in my home lab successfully displayed the standard IIS website as hoped! When I viewed the Azure Firewall logs the full URL being accessed by the user was visible proving TLS inspection was working as expected. Success!
The patterns works but there are a number of considerations.
The biggest consideration being Azure Firewall Premium is still in public preview. Regardless of what you may hear from those within Microsoft or outside of Microsoft, DO NOT USE PUBLIC PREVIEW FEATURES IN PRODUCTION. I’d go as far as cautioning against even using them in planned designs. By the time these new products or features make it to GA, they can and often do change, sometimes for the better and sometimes for the worst. If you choose to use these products or features in an upcoming design, make sure to have a plan B if the product doesn’t hit GA or hits GA without the features you need. Remember that Microsoft’s targeted release dates are often moving targets that accelerate or decelerate depending on the feedback from public preview. Unless you have a contractual agreement with a vendor to deliver upon a specific date and there are real penalties to the vendor for not doing that, you should have a fully vetted and tested plan B ready to go.
Outside of my ranting about usage of public preview products and features, here are some other considerations with this pattern:
Challenges with observability
Operational overhead of certificate management
Possible latency issues depending on latency requirements and traffic patterns
In this design the Application Gateway will be SNATing the traffic it receives from the Internet users. To understand the session end to end and correlate the logs from the WAF to the Application Gateway, through the Firewall, to the web server, you’ll need to ensure you’re capturing the x-forwarded-for header and using it to identify the user’s original source IP. This will definitely add complexity to the observability of the environment. Tack on the many mediation points, and identifying where traffic is getting rejected (WAF, App Gateway, firewall, NSG, local machine firewall) will require a strong logging and correlation system.
This pattern is going to require at least 3 separate certificates which will likely be a mix of certificates issued by both public CAs and private CAs. Certificate lifecycle management is a significantly challenging operational task and is often the cause of service outages. If you opt to use a pattern such as this, you’ll need to ensure your operational monitoring and alerting processes around certificate lifecycle management are solid. In addition, you will also need to manage revocation network flows. In a past life, these were the flows I observed that would often bite organizations.
Lastly, this pattern involves a lot of hops where the traffic is being decrypted and re-encrypted. This requires compute time which could add to latency. These latency issues could be impactful depending on the latency requirements of the application and what type and volume of data is flowing between the user and web server.
I have to admit I enjoyed labbing this on out. These days I usually spend a majority of my time in governance conversations focusing on people and process. Getting back into the technology and spending some time playing with the Azure Firewall Premium SKU and Application Gateway was a great learning experience. It will be interesting to see over time how well it will compete against the behemoths of the industry such as Palo Alto.
Over the next week I’ll be making some small tweaks to this design to see whether I can stick the Key Vault behind a Private Endpoint (documentation is unclear as to whether or not this is supported) and messing with the logs provided by both the Azure Firewall and Application Gateway to see how challenging correlation of sessions is.
I hope you have a great long weekend and see you next post!
Azure Arc is Microsoft’s attempt to extend the Azure management plane and Azure capabilities to resources running on-premises and other clouds. As the date of this blog these resources include Windows and Linux machines and Kubernetes clusters running on-premises or in another cloud like AWS (Amazon Web Services). Integrating these resources with Azure Arc projects them into the Azure management plane making them resources of Azure.
One of the interesting capabilities of Azure Arc that captured my eye was the ability to use a system-assigned management identity. I’ve written in the past about managed identities so I’ll stick to covering the very basics of them today. A managed identity is simply some Microsoft-managed automation on top of an Azure AD service principal, which is best explained as a security principal used for non-humans. One of the primary benefits managed identities provider over traditional service principals is automatic credential rotation. If you come from the AWS world, managed identities are similar to AWS IAM roles.
Those of you that manage service principals of scale are surely familiar with the pain of having to work with application owners to rotate the credentials of service principals in use within corporate applications. Beyond the operational overhead, the security risk exists for an attacker to obtain the service principal credentials and use them outside of Azure. Since service principals are not yet subject to Azure AD Conditional access, rotation of credentials becomes a critical control to have in place. Both this operational overhead and security risk make the automated rotation of credentials capability of managed identities a must have.
Managed identities are normally only available for use by resources running in Azure like Azure VMs, App Service instances, and the like. In the past, if you were coming from outside of Azure such as on-premises or another cloud like AWS, you’d be forced to use a service principal and struggle with the challenges I’ve outlined above. Azure Arc introduces the ability to leverage managed identities from outside of Azure.
This made me very curious as to how this was being accomplished for a machine running outside of Azure. The official documentation goes into some detail. The documentation explains a few key items:
A system-assigned managed identity is provisioned for the Arc-enabled server when the server is onboarded to Arc
The Azure Instance Metadata Service (IDMS) is configured with the managed identity’s service principal client id and certificate
Code running on the machine can request an access token from the IDMS
The IDMS challenges the application code to prove it is privileged enough to obtain the access token by requiring it provide a secret to attest that it is highly privileged
This information left me with a few questions:
What exactly is the challenge it’s hitting the application with?
Since the service principal is using a certificate-based credential, the private key has to be stored on the machine. If so where and how easy would it be for an attacker to steal it?
How easy would it be to steal the identity of the Azure Arc machine in order to impersonate it on another machine?
To answer these questions I decided to deploy Azure Arc to a number of Windows and Linux machines I have running on my home instance of Hyper-V. That would give me god access to each machine and the ability to deploy whatever toolset I wanted to take a peek behind the curtains. Using the service-principal onboarding process I deployed Azure Arc to a number of Windows and Linux machines running in my home lab. Since I’m stronger in Windows, I decided I’d try to extract the service principal credentials from a Windows machine and see if I could use them on Azure Arc-enabled Linux machine.
The technical overview documentation covers how Windows and Linux machines connect into Azure Arc and communicate. There are three processes that run to support Arc connectivity which include the Azure Hybrid Instance Metadata Service (HIMDS), Guest Configuration Arc Service, and Guest Configuration Extension Service. The service I was most interested in was the HIDMS service which is in charge of authenticating and obtaining access tokens from Azure.
To better understand the service, I first referenced the service in the Services MMC (Microsoft Management Console). This gave me the path to the executable. Poking around the source directory showed some dsc logs and the location of the processes involved with Azure Arc but nothing too interesting.
My next stop was the configuration information. As documented in the technical overview documentation, the configuration data is stored in a json file in the %ProgramData%\AzureConnectedMachineAgent\Config directory. The key pieces of information we can extract from this file are the tenant id, subscription id, certificate thumbprint the service principal is using and most importantly the client id. That’s a start!
I then wondered what else might be stored through this directory hierarchy. Popping up a directory, I came saw the Certs subdirectory. Could it really be this easy? Would this directory contain the private-key certificate? Navigating into the directory does indeed show a certificate file. However, attempting to open the certificate up using the Crypto Shell Extensions resulted in an error stating it’s not a valid certificate. No big deal, maybe the extension is wrong right? I then took the file and ran it through a few Python scripts I have to enumerate certificate data, but unfortunately no certificate format I tried worked. After a quick discussion with my good friend Armen Kaleshian, we both came to the conclusion that the certificate is more than likely wrapped with some type of symmetric encryption.
At this point I decided to take a step back and observe the HIMDS process to ensure this file is being used by the process, and if so, whether I could figure where the key that was used to wrap the certificate was stored. For that I used Process Monitor (the old tools are still the best!). After restarting the HIMDS service and sifting through the capture information, I found the mentions of the process reading the certificate file after reading from the agent config. In the past I’ve seen applications store the symmetric key used to wrap information like this in the registry, but I could not find any obvious calls to the registry indicating this. After a further conversation with Armen, we theorized the symmetric key might be stored in the himds.exe executable itself which would mean every Azure Arc instance would be using the same key to wrap the certificate. This would mean all one would have to do is move the certificate and configuration file to a new machine and one would be able to use the credentials to impersonate the machine’s managed identity (more on that later).
At this point I was fairly certain I had answered question number 2, but I wanted to explore question number 3 a bit more. For this I leveraged the PowerShell code sample provided in the documentation. Breaking down the code, I observed a web request is made to the IDMS endpoint. From the response from the IDMS endpoint, a returned header is extracted which contains the directory name a file is stored in. The contents of this file are extracted and then included as an authorization header using basic authentication. The resulting response from the IDMS contains the access token used to communicate with the relevant Microsoft cloud API.
By default the security principal doesn’t have any permissions in the ARM (Azure Resource Manager) API, so I granted it reader rights on the resource group and wrote some simple Python code to query for a list of resources in the in a resource group. I was able to successfully obtain the access token and got back a list of resources.
This indicated to me that the additional challenge introduced with the IDMS running on Arc requires the process to obtain a secret generated by the IDMS that is placed in the %PROGRAMDATA\AzureConnectedMachineAgent\Tokens directory. This directory is locked down for access to the security principal the himds service runs as, SYSTEM, the Administrators group, and most importantly the Hybrid agent extension applications group. After the secret is obtain from the file in this directory, it is used as a secret to perform basic authentication with the IDMS service to obtain the access token. As the official documentation states, this is the group you would need to add the security principal running any processes you wanted to use the local IDMS. Question 1 had now been answered.
That left me with question 3. Based on what I learned so far, an attacker who compromised the machine or a security principal with sufficient permissions on the machine to access the %PROGRAMDATA%\AzureConnectedMachineAgent\ subdirectories could obtain the configuration of the agent and the encrypted certificate. Compromise could also come in the form of compromise of a process that was running as a security principal which was a member of the Hybrid agent extension applications group. Even with the data collected from the agentconfig.json file and the certificate file, the attacker would still be short of the symmetric key used to unwrap the certificate to make it available for use.
Here I took a gamble and tested the theory Armen and I had that the symmetric key was stored in the himds executable. I copied the agentconfig.json file and certificate file and move them to another machine. To eliminate the possibility of the configuration and credentials file being specific to the Windows operating system, I spun up an Ubuntu VM in my Hyper-V cluster. I then installed and configured the agent with a completely separate Azure AD tenant.
Once that was complete, I moved the agentconfig.json and myCert file from the existing Windows machine to the new Ubuntu VM and replaced the existing files. Re-running the same Python script (substituting out the variables for the actual URIs listed in the technical overview) I received back the same listing of resources from the resource group. This proved that if you’re able to get access to both the agentconfig.json and myCert files on an Azure Arc-enabled machine, you can move them to any other machine (regardless of operating system) and re-use them to impersonate source machine and exercise any permissions its been granted access.
Let’s sum up the findings:
The agentconfig.json file contains sensitive information including the tenant id, subscription id, and client id of the service principal associated with the Azure Arc-enabled machine.
The credential for the service principal is stored as a certificate file in the myCert file in the %PROGRAMDATA%\AzureConnectedMachineAgent\certs directory. This file is probably wrapped with some type of symmetric encryption.
The challenge the documentation speaks to involves the creation of a password by the HIMDS process which is then stored in the %PROGRAMDATA%\AzureConnectedMachineAgent\tokens directory. This directory is locked down to privileged users. The password contained in the file created by the HIMDS process is then passed back in a request to the IDMS service running on the machine using the basic HTTP authentication where it is validated and an access token is returned. This is used a method to prove the process requesting the token is privileged.
The agentconfig.json and myCert file can be moved from one Arc-enabled machine to another to be used to impersonate the source machine. This provided further evidence that the symmetric key used to encrypt the certificate file (myCert) is hardcoded into the executable.
It should come as no surprise to you follow old folks that the resulting findings stress the importance traditional security controls. A few recommendations I’d make are as follows:
If you’re going to use the managed identity available to an Arc-enabled server for applications running on that machine, ensure you’re granting that managed identity only the permissions it requires and nothing beyond it.
Limit the human and non-human actors who have administrator privileges on Azure Arc-enabled machines.
Tightly control which security principals are members of the Hybrid agent extension applications group.
Logging and monitor
Log managed identity sign-ins and monitor for suspicious activity. Until Conditional Access is extended to service principals, it can’t be leveraged to further restrict where these security principals can be used from.
Log and monitor the security events on the Azure Arc-enabled servers
Create a process to review the use case and security risks of applications that want to leverage this functionality.
Perform access reviews to validate any additional permissions granted to these managed identities are still warranted, and if not, remove them.
Nothing in the above is new or fancy but all things you should be doing today. The biggest take away you need here is by extending the Azure management plane you get great new features but you also introduce a potential new risk where a compromised Azure Arc-enabled machine outside of Azure could be used to impact the security of your Azure implementation. If you’re using service principals today (or hell, if you’re simply having users access Azure from their desktops or laptops) you already have this risk, but it’s always worthwhile understanding the different vectors available to attackers.
At the end of the day I love this feature. It is leagues better than how most organizations are handling service principal usage outside of Azure today where the credentials are often hardcoded into code into code, dumped into a file in an unsecured directory, or never rotated for years. There are tradeoffs, but as I said earlier, none of the mitigations are things you shouldn’t already be doing today. Outside of this capability, there are a lot of benefits to Azure Arc and I’ll be very interested to see how Microsoft grows this offering and extends it beyond.
Hi folks! It’s been a while since my last post. This was due to spending a significant amount of cycles building out a new deployable lab environment for Azure that is a simplified version of the Enterprise-Scale Landing Zone. You can check it out the fruits of that labor here.
Recently I had two customers adopt the encryption at host feature of Azure VMs (virtual machines). The quick and dirty on this feature is it exists to mitigate the threat of an attacker accessing the data on the VHDs (virtual hard disks) belonging to your VM in the event that an attacker has compromised a physical host in an Azure datacenter which is running your VM. The VHDs backing your VM are temporarily cached on a physical host when the VM is running. Encryption at host encrypts the cached VHDs for the time they exist cached on that physical host. For more detail check out my peer and friend Sebastian Hooker’s blog post on the topic.
Both customers asked me if there was a simple and easy way to report on which VMs were enabled the feature and which were not. This sounded like the perfect use case for Azure Policy, so down that road I went and hence this blog post. I dove neck deep into Azure Policy about a year and a half ago when it was a newer addition to Azure and had written up a tips and tricks post. I was curious as to how relevant the content was given the state of the service today. Upon review of the content, I found many of the tips still relevant even though the technical means to use them had changed (for the better). Instead of walking through each tip in a generic way, I thought it would be more valuable to walk through the process I took to develop an Azure Policy for this use case.
Step 1 – Understand the properties of the Azure Resource
Before I could begin writing the policy I needed to understand what the data structure returned by ARM (Azure Resource Manager) for a VM enabled with encryption at host looks like. This meant I needed to create a VM and enable it for encryption at host using the process documented here. Once VM was created and configured, I then broke out Azure CLI and requested the properties of the virtual machine using the command below.
az vm show --name vmworkload1 --resource-group RG-WORKLOAD-36FCD6EC
This gave me back the properties of the VM which I then quickly searched through to find the block below indicating encryption at host was enabled for the VM.
When crafting an Azure Policy, always review the resource properties when the feature or property you want to check is both enabled and disabled.
Running the same command as above for a VM without encryption at host enabled yielded the following result.
Now if I had assumed the encryptionAtHost property had been set to false for VMs without the feature enabled and had written a searching for VMs with the encryptionAtHost property equal to False, my policy would not have yielded the correct result. I’d want to use conditional exists instead of equals. This is why it’s critical to always review a sample resource both with and without the feature enabled.
Step 2 – Determine if Azure Policy can evaluate this property
Wonderful, I know my property, I understand the structure, and I know what it looks like when the feature is enabled or disabled. Now I need to determine if Azure Policy can evaluate it. Azure Policy accesses the properties of resources using something something called an alias. A given alias maps back to the path of a specific property of a resource for a given API version. More than likely there is some type of logic behind an alias which makes an API call to get the value of a specific property for a given resource.
Now for the bad news, there isn’t always an alias for the property you want to evaluate. The good news is the number of aliases has dramatically increased since I last looked at them over a year ago. To validate whether the property you wish to validate has an alias you can use the methods outlined in this link. For my purposes I used the AZ CLI command and grepped for encryptionAtHost.
az provider show --namespace Microsoft.Compute --expand "resourceTypes/aliases" --query "resourceTypes.aliases.name" | grep encryptionAtHost
The return value displayed three separate aliases. One alias for the VM resource type and two aliases for the VMSS (virtual machine scale set) type.
Seeing these aliases validated that Azure Policy has the ability to query the property for both VM and VMSS resource types.
Always validate the property you want Azure Policy to validate has an alias presentbefore spending cycles writing the policy.
Next up I needed to write the policy.
Step 3 – Writing the Azure Policy
The Azure Policy language isn’t the most initiative unfortunately. Writing complex Azure Policy is very similar to writing good AWS IAM policies in that it takes lots of practice and testing. I won’t spend cycles describing the Azure Policy structure as it’s very well documented here. However, I do have one link I use as a refresher whenever I return to writing policies. It does a great job describing the differences between the conditions.
You can write your policy using any editor that you wish, but Visual Studio Code has a nice add-in for Azure Policy. The add in can be used to examine the Azure Policies associated with a given Azure Subscription or optionally to view the aliases available for a given property (I didn’t find this to be as reliable as using querying via CLI).
Now that I understood the data structure of the properties and validated that the property had an alias, writing the policy wasn’t a big challenge.
Let me walk through the relevant sections of the policy where I set the value to something for a reason. For the mode section I chose to use the “Full” mode since Microsoft recommends setting this to full unless you’re evaluating tags or locations. In the parameters section I left it blank since there was no relevant parameters that I wanted to add to the policy. You’ll often use parameters when you want to give the person assigning the policy the ability to determine the effect. Other use cases could be allowing the user to specify a list of tags you want applied or a list of allowed VM SKUs.
Finally we have the policyRule section which is the guts of the policy. In the policy I’ve written I have three rules which each include two conditions. The policy will fire the effect of Audit when any of the three rules evaluates as true. For a rule to evaluate as true both conditions must be true. For example, in the rule below I’m checking the resource type to validate it’s a virtual machine and then I’m using the encryption at host alias to check whether or not the property exists on the resource. If the resource is a virtual machine and the encryptionAtHost property doesn’t exist, the rule evaluates as true and the Azure Policy engine marks the resource as non-compliant.
Step 4 – Create the policy definition and assign the policy
Now that I’ve written my policy in my favorite editor, I’m ready to create the definition. Similar to Azure RBAC roles (role based access control) Azure Policies uses the concept of a definition and an assignment. Definitions can be created via the Portal, CLI, PowerShell, ARM templates, or ARM REST API like any Azure resource. For the purposes of simplicity, I’ll be creating the definition and assigning the policy through the Azure Portal.
Before I create the definition I need to plan where I create the definitions. Definitions can be created at the Management Group or Subscription scope. Creating a definition at the Management Group allows the policy to be assigned to the Management Group, child subscriptions of the Management Group, child resource groups of the subscriptions, or child resources of a resource group. I recommend creating the definitions as part of an initiative and create the definitions for the initiatives at a Management Group scope to minimize operational overhead. For this demonstration I’ll be creating a standalone definition at the subscription scope.
To do this you’ll want to search for Azure Policy in the Azure Portal. Once you’re in the Azure Policy blade you’ll want to Definitions under the Authoring section seen below. Click on the + Policy definition link.
On the next screen you’ll define some metadata about the policy and provide the policy logic. Make sure to use a useful display name and description when you’re doing this in a real environment.
Next we need to add our policy logic which you can copy and paste from your editor. The engine will validate that the policy is properly formatted. Once complete click the Save button as seen below.
Now that you’ve created the policy definition you can create the assignment. I’m not a fan of recreating documentation, so use the steps at this link to create the assignment.
Step 5 – Test the policy and review the results
Policy evaluations are performed under the following circumstances. When I first started using policy a year ago, triggering a manual evaluation required an API call. Thankfully Microsoft incorporated this functionality into CLI and PowerShell. In CLI you can use the command below to trigger a manual policy evaluation.
az policy state trigger-scan
Take note that after adding a new policy assignment, I’ve found that running a scan immediately after adding the assignment results in the new assignment not being evaluated. Since I’m impatient I didn’t test how long I needed to wait and instead ran another scan after the first one. Each scan can take up 10 minutes or longer depending on the number of resources being evaluated so prepare for some pain when doing your testing.
Once evaluation is complete I can go to the Azure Policy blade and the Overview section to see a summary of compliance of the resources.
Clicking into the policy shows me the details of the compliant and non-compliant resources. As expected the VMs, VMSS, and VMSS instances not enabled for encryption at host are reporting as non-compliant. The ones set for host-based encryption report to as compliant.
That’s all there is to it!
So to summarize the tips from the post:
Before you create a new policy check to see if it already exists to save yourself some time. You can check the built-in policies in the Portal, the Microsoft samples, or the community repositories. Alternatively, take a crack it yourself then check your logic versus the logic someone else came up with to practice your policy skills.
Always review the resource properties of a resource instance with that you would want to report on and one you wouldn’t want to report on. This will help make sure you make good use of equals vs exists.
Validate that the properties you want to check are exposed to the Azure Policy engine. To do this check the aliases using CLI, PowerShell, or the Visual Studio add-in.
Create your policy definitions at the management group scope to minimize operational overhead. Beware of tenant and subscription limits for Azure Policy.
Perform significant testing on your policy to validate the policy is reporting on the resources you expect it to.
When you’re new to Azure it’s fine to use the built-in policies directly. However, as your organization matures move away from using them directly and instead create a custom policy with the built-in policy logic. This will help to ensure the logic of your policies only change when you change it.
I hope this post helps you in your Azure Policy journey. The policy in this post and some others I’ve written in the past can be found in this repo. Don’t get discouraged if you struggle at first. It takes time and practice to whip up policies and even then some are a real challenge.