Experimenting with Azure Arc

Experimenting with Azure Arc

Hello fellow geeks! Life got busy this year and I haven’t had much time to post. I’ve been been working on some fun side projects including an enterprise-like environment for Azure, some security control mappings, and some reusable artifacts to simplify customer journeys into Azure. Outside of those efforts, I’ve spent some time addressing technical skill gaps, one of which was Azure Arc which is the topic of this post today.

Azure Arc is Microsoft’s attempt to extend the Azure management plane and Azure capabilities to resources running on-premises and other clouds. As the date of this blog these resources include Windows and Linux machines and Kubernetes clusters running on-premises or in another cloud like AWS (Amazon Web Services). Integrating these resources with Azure Arc projects them into the Azure management plane making them resources of Azure.

Azure resources

Once the resources are projected into Azure many capabilities of the platform can be used to assist with managing the resources. Examples include installing VM (virtual machine) extensions Microsoft Monitoring Agent extension to deliver logs and metrics to Azure Monitor, tracking changes and the software inventory of a machine using Azure Automation, or even auditing a machine’s compliance to a specific set of controls using Azure Policy. On the Kubernetes side you can monitor your on-premises clusters using Azure Container Insights or even deploy an App Service on your on-premises cluster. These capabilities continue to grow so check the official documentation for up to date information.

One of the interesting capabilities of Azure Arc that captured my eye was the ability to use a system-assigned management identity. I’ve written in the past about managed identities so I’ll stick to covering the very basics of them today. A managed identity is simply some Microsoft-managed automation on top of an Azure AD service principal, which is best explained as a security principal used for non-humans. One of the primary benefits managed identities provider over traditional service principals is automatic credential rotation. If you come from the AWS world, managed identities are similar to AWS IAM roles.

Those of you that manage service principals of scale are surely familiar with the pain of having to work with application owners to rotate the credentials of service principals in use within corporate applications. Beyond the operational overhead, the security risk exists for an attacker to obtain the service principal credentials and use them outside of Azure. Since service principals are not yet subject to Azure AD Conditional access, rotation of credentials becomes a critical control to have in place. Both this operational overhead and security risk make the automated rotation of credentials capability of managed identities a must have.

Managed identities are normally only available for use by resources running in Azure like Azure VMs, App Service instances, and the like. In the past, if you were coming from outside of Azure such as on-premises or another cloud like AWS, you’d be forced to use a service principal and struggle with the challenges I’ve outlined above. Azure Arc introduces the ability to leverage managed identities from outside of Azure.

This made me very curious as to how this was being accomplished for a machine running outside of Azure. The official documentation goes into some detail. The documentation explains a few key items:

  • A system-assigned managed identity is provisioned for the Arc-enabled server when the server is onboarded to Arc
  • The Azure Instance Metadata Service (IDMS) is configured with the managed identity’s service principal client id and certificate
  • Code running on the machine can request an access token from the IDMS
  • The IDMS challenges the application code to prove it is privileged enough to obtain the access token by requiring it provide a secret to attest that it is highly privileged

This information left me with a few questions:

  1. What exactly is the challenge it’s hitting the application with?
  2. Since the service principal is using a certificate-based credential, the private key has to be stored on the machine. If so where and how easy would it be for an attacker to steal it?
  3. How easy would it be to steal the identity of the Azure Arc machine in order to impersonate it on another machine?

To answer these questions I decided to deploy Azure Arc to a number of Windows and Linux machines I have running on my home instance of Hyper-V. That would give me god access to each machine and the ability to deploy whatever toolset I wanted to take a peek behind the curtains. Using the service-principal onboarding process I deployed Azure Arc to a number of Windows and Linux machines running in my home lab. Since I’m stronger in Windows, I decided I’d try to extract the service principal credentials from a Windows machine and see if I could use them on Azure Arc-enabled Linux machine.

The technical overview documentation covers how Windows and Linux machines connect into Azure Arc and communicate. There are three processes that run to support Arc connectivity which include the Azure Hybrid Instance Metadata Service (HIMDS), Guest Configuration Arc Service, and Guest Configuration Extension Service. The service I was most interested in was the HIDMS service which is in charge of authenticating and obtaining access tokens from Azure.

To better understand the service, I first referenced the service in the Services MMC (Microsoft Management Console). This gave me the path to the executable. Poking around the source directory showed some dsc logs and the location of the processes involved with Azure Arc but nothing too interesting.

HIMDS Service

My next stop was the configuration information. As documented in the technical overview documentation, the configuration data is stored in a json file in the %ProgramData%\AzureConnectedMachineAgent\Config directory. The key pieces of information we can extract from this file are the tenant id, subscription id, certificate thumbprint the service principal is using and most importantly the client id. That’s a start!

{"subscriptionId":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX","resourceGroup":"rg-onpremlab","resourceName":"SERVERWSQL01","tenantId":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX","location":"eastus2","vmId":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX","vmUuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXX","certificateThumbprint":"be90bae2484ff6d495XXXXXXXXXXXXX","clientId":"XXXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX","cloud":"AzureCloud","privateLinkScope":"","namespace":"","correlationId":"XXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXX"}

I then wondered what else might be stored through this directory hierarchy. Popping up a directory, I came saw the Certs subdirectory. Could it really be this easy? Would this directory contain the private-key certificate? Navigating into the directory does indeed show a certificate file. However, attempting to open the certificate up using the Crypto Shell Extensions resulted in an error stating it’s not a valid certificate. No big deal, maybe the extension is wrong right? I then took the file and ran it through a few Python scripts I have to enumerate certificate data, but unfortunately no certificate format I tried worked. After a quick discussion with my good friend Armen Kaleshian, we both came to the conclusion that the certificate is more than likely wrapped with some type of symmetric encryption.

Service Principal certificate

At this point I decided to take a step back and observe the HIMDS process to ensure this file is being used by the process, and if so, whether I could figure where the key that was used to wrap the certificate was stored. For that I used Process Monitor (the old tools are still the best!). After restarting the HIMDS service and sifting through the capture information, I found the mentions of the process reading the certificate file after reading from the agent config. In the past I’ve seen applications store the symmetric key used to wrap information like this in the registry, but I could not find any obvious calls to the registry indicating this. After a further conversation with Armen, we theorized the symmetric key might be stored in the himds.exe executable itself which would mean every Azure Arc instance would be using the same key to wrap the certificate. This would mean all one would have to do is move the certificate and configuration file to a new machine and one would be able to use the credentials to impersonate the machine’s managed identity (more on that later).

Procmon capture of himds.exe

At this point I was fairly certain I had answered question number 2, but I wanted to explore question number 3 a bit more. For this I leveraged the PowerShell code sample provided in the documentation. Breaking down the code, I observed a web request is made to the IDMS endpoint. From the response from the IDMS endpoint, a returned header is extracted which contains the directory name a file is stored in. The contents of this file are extracted and then included as an authorization header using basic authentication. The resulting response from the IDMS contains the access token used to communicate with the relevant Microsoft cloud API.

$apiVersion = "2020-06-01"
$resource = "https://management.azure.com/"
$endpoint = "{0}?resource={1}&api-version={2}" -f $env:IDENTITY_ENDPOINT,$resource,$apiVersion
$secretFile = ""
try
{
    Invoke-WebRequest -Method GET -Uri $endpoint -Headers @{Metadata='True'} -UseBasicParsing
}
catch
{
    $wwwAuthHeader = $_.Exception.Response.Headers["WWW-Authenticate"]
    if ($wwwAuthHeader -match "Basic realm=.+")
    {
        $secretFile = ($wwwAuthHeader -split "Basic realm=")[1]
    }
}
Write-Host "Secret file path: " $secretFile`n
$secret = cat -Raw $secretFile
$response = Invoke-WebRequest -Method GET -Uri $endpoint -Headers @{Metadata='True'; Authorization="Basic $secret"} -UseBasicParsing
if ($response)
{
    $token = (ConvertFrom-Json -InputObject $response.Content).access_token
    Write-Host "Access token: " $token
}

By default the security principal doesn’t have any permissions in the ARM (Azure Resource Manager) API, so I granted it reader rights on the resource group and wrote some simple Python code to query for a list of resources in the in a resource group. I was able to successfully obtain the access token and got back a list of resources.

# Import standard libraries
import os
import sys
import logging
import requests
import json

# Create a logging mechanism
def enable_logging():
    stdout_handler = logging.StreamHandler(sys.stdout)
    handlers = [stdout_handler]
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers = handlers
    )

def get_access_token(resource, api_version="2020-06-01"):
    logging.debug('Attempting to obtain access token...')
    try:
        response = requests.get(
            url = os.getenv('IDENTITY_ENDPOINT'),
            params = {
                'api-version': api_version,
                'resource': resource
            },
            headers = {
                'Metadata': 'True'
            }
        )
        
        secret_file = response.headers['Www-Authenticate'].split('Basic realm=')[1]
        with open(secret_file, 'r') as file:
            key = file.read()
        
        response = requests.get(
            url = os.getenv('IDENTITY_ENDPOINT'),
            params = {
                'api-version': api_version,
                'resource': resource
            },
            headers = {
                'Metadata': 'True',
                'Authorization': f"Basic {key}"
            }
        )
        logging.debug("Access token obtain successfully")
        return json.loads(response.text)['access_token']

    except Exception:
        logging.error('Unable to obtain access token. Error was: ', exc_info=True)

def main():
    try:

        # Setup variables
        API_VERSION = '2021-04-01'

        # Enable logging
        enable_logging()

        # Obtain a credential from the system-assigned managed identity
        token = get_access_token(resource='https://management.azure.com/')

        # Setup query parameters
        params = {
            'api-version': API_VERSION
        }

        # Setup header
        header = {
            'Content-Type': 'application/json',
            'Authorization': f"Bearer {token}"
        }

        # Make call to ARM API
        response = requests.get(url='https://management.azure.com/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/rg-onpremlab/resources',headers=header, params=params)
        print(response.text)

    except Exception:
        logging.error('Execution error: ', exc_info=True)

if __name__ == "__main__":
    main()

This indicated to me that the additional challenge introduced with the IDMS running on Arc requires the process to obtain a secret generated by the IDMS that is placed in the %PROGRAMDATA\AzureConnectedMachineAgent\Tokens directory. This directory is locked down for access to the security principal the himds service runs as, SYSTEM, the Administrators group, and most importantly the Hybrid agent extension applications group. After the secret is obtain from the file in this directory, it is used as a secret to perform basic authentication with the IDMS service to obtain the access token. As the official documentation states, this is the group you would need to add the security principal running any processes you wanted to use the local IDMS. Question 1 had now been answered.

Tokens directory permissions

That left me with question 3. Based on what I learned so far, an attacker who compromised the machine or a security principal with sufficient permissions on the machine to access the %PROGRAMDATA%\AzureConnectedMachineAgent\ subdirectories could obtain the configuration of the agent and the encrypted certificate. Compromise could also come in the form of compromise of a process that was running as a security principal which was a member of the Hybrid agent extension applications group. Even with the data collected from the agentconfig.json file and the certificate file, the attacker would still be short of the symmetric key used to unwrap the certificate to make it available for use.

Here I took a gamble and tested the theory Armen and I had that the symmetric key was stored in the himds executable. I copied the agentconfig.json file and certificate file and move them to another machine. To eliminate the possibility of the configuration and credentials file being specific to the Windows operating system, I spun up an Ubuntu VM in my Hyper-V cluster. I then installed and configured the agent with a completely separate Azure AD tenant.

Once that was complete, I moved the agentconfig.json and myCert file from the existing Windows machine to the new Ubuntu VM and replaced the existing files. Re-running the same Python script (substituting out the variables for the actual URIs listed in the technical overview) I received back the same listing of resources from the resource group. This proved that if you’re able to get access to both the agentconfig.json and myCert files on an Azure Arc-enabled machine, you can move them to any other machine (regardless of operating system) and re-use them to impersonate source machine and exercise any permissions its been granted access.

Let’s sum up the findings:

  1. The agentconfig.json file contains sensitive information including the tenant id, subscription id, and client id of the service principal associated with the Azure Arc-enabled machine.
  2. The credential for the service principal is stored as a certificate file in the myCert file in the %PROGRAMDATA%\AzureConnectedMachineAgent\certs directory. This file is probably wrapped with some type of symmetric encryption.
  3. The challenge the documentation speaks to involves the creation of a password by the HIMDS process which is then stored in the %PROGRAMDATA%\AzureConnectedMachineAgent\tokens directory. This directory is locked down to privileged users. The password contained in the file created by the HIMDS process is then passed back in a request to the IDMS service running on the machine using the basic HTTP authentication where it is validated and an access token is returned. This is used a method to prove the process requesting the token is privileged.
  4. The agentconfig.json and myCert file can be moved from one Arc-enabled machine to another to be used to impersonate the source machine. This provided further evidence that the symmetric key used to encrypt the certificate file (myCert) is hardcoded into the executable.

It should come as no surprise to you follow old folks that the resulting findings stress the importance traditional security controls. A few recommendations I’d make are as follows:

  • Least Privilege
    • If you’re going to use the managed identity available to an Arc-enabled server for applications running on that machine, ensure you’re granting that managed identity only the permissions it requires and nothing beyond it.
    • Limit the human and non-human actors who have administrator privileges on Azure Arc-enabled machines.
    • Tightly control which security principals are members of the Hybrid agent extension applications group.
  • Logging and monitor
    • Log managed identity sign-ins and monitor for suspicious activity. Until Conditional Access is extended to service principals, it can’t be leveraged to further restrict where these security principals can be used from.
    • Log and monitor the security events on the Azure Arc-enabled servers
  • Patching and Updating
  • Identity and Access Management
    • Create a process to review the use case and security risks of applications that want to leverage this functionality.
    • Perform access reviews to validate any additional permissions granted to these managed identities are still warranted, and if not, remove them.

Nothing in the above is new or fancy but all things you should be doing today. The biggest take away you need here is by extending the Azure management plane you get great new features but you also introduce a potential new risk where a compromised Azure Arc-enabled machine outside of Azure could be used to impact the security of your Azure implementation. If you’re using service principals today (or hell, if you’re simply having users access Azure from their desktops or laptops) you already have this risk, but it’s always worthwhile understanding the different vectors available to attackers.

At the end of the day I love this feature. It is leagues better than how most organizations are handling service principal usage outside of Azure today where the credentials are often hardcoded into code into code, dumped into a file in an unsecured directory, or never rotated for years. There are tradeoffs, but as I said earlier, none of the mitigations are things you shouldn’t already be doing today. Outside of this capability, there are a lot of benefits to Azure Arc and I’ll be very interested to see how Microsoft grows this offering and extends it beyond.

Thanks!

Writing a Custom Azure Policy

Writing a Custom Azure Policy

Hi folks! It’s been a while since my last post. This was due to spending a significant amount of cycles building out a new deployable lab environment for Azure that is a simplified version of the Enterprise-Scale Landing Zone. You can check it out the fruits of that labor here.

Recently I had two customers adopt the encryption at host feature of Azure VMs (virtual machines). The quick and dirty on this feature is it exists to mitigate the threat of an attacker accessing the data on the VHDs (virtual hard disks) belonging to your VM in the event that an attacker has compromised a physical host in an Azure datacenter which is running your VM. The VHDs backing your VM are temporarily cached on a physical host when the VM is running. Encryption at host encrypts the cached VHDs for the time they exist cached on that physical host. For more detail check out my peer and friend Sebastian Hooker’s blog post on the topic.

Both customers asked me if there was a simple and easy way to report on which VMs were enabled the feature and which were not. This sounded like the perfect use case for Azure Policy, so down that road I went and hence this blog post. I dove neck deep into Azure Policy about a year and a half ago when it was a newer addition to Azure and had written up a tips and tricks post. I was curious as to how relevant the content was given the state of the service today. Upon review of the content, I found many of the tips still relevant even though the technical means to use them had changed (for the better). Instead of walking through each tip in a generic way, I thought it would be more valuable to walk through the process I took to develop an Azure Policy for this use case.

Step 1 – Understand the properties of the Azure Resource

Before I could begin writing the policy I needed to understand what the data structure returned by ARM (Azure Resource Manager) for a VM enabled with encryption at host looks like. This meant I needed to create a VM and enable it for encryption at host using the process documented here. Once VM was created and configured, I then broke out Azure CLI and requested the properties of the virtual machine using the command below.

az vm show --name vmworkload1 --resource-group RG-WORKLOAD-36FCD6EC

This gave me back the properties of the VM which I then quickly searched through to find the block below indicating encryption at host was enabled for the VM.

VM properties with encryption at host enabled

When crafting an Azure Policy, always review the resource properties when the feature or property you want to check is both enabled and disabled.

Running the same command as above for a VM without encryption at host enabled yielded the following result.

VM properties with encryption at host not enabled

Now if I had assumed the encryptionAtHost property had been set to false for VMs without the feature enabled and had written a searching for VMs with the encryptionAtHost property equal to False, my policy would not have yielded the correct result. I’d want to use conditional exists instead of equals. This is why it’s critical to always review a sample resource both with and without the feature enabled.

Step 2 – Determine if Azure Policy can evaluate this property

Wonderful, I know my property, I understand the structure, and I know what it looks like when the feature is enabled or disabled. Now I need to determine if Azure Policy can evaluate it. Azure Policy accesses the properties of resources using something something called an alias. A given alias maps back to the path of a specific property of a resource for a given API version. More than likely there is some type of logic behind an alias which makes an API call to get the value of a specific property for a given resource.

Now for the bad news, there isn’t always an alias for the property you want to evaluate. The good news is the number of aliases has dramatically increased since I last looked at them over a year ago. To validate whether the property you wish to validate has an alias you can use the methods outlined in this link. For my purposes I used the AZ CLI command and grepped for encryptionAtHost.

az provider show --namespace Microsoft.Compute --expand "resourceTypes/aliases" --query "resourceTypes[].aliases[].name" | grep encryptionAtHost

The return value displayed three separate aliases. One alias for the VM resource type and two aliases for the VMSS (virtual machine scale set) type.

Aliases for encryption at host feature

Seeing these aliases validated that Azure Policy has the ability to query the property for both VM and VMSS resource types.

Always validate the property you want Azure Policy to validate has an alias present before spending cycles writing the policy.

Next up I needed to write the policy.

Step 3 – Writing the Azure Policy

The Azure Policy language isn’t the most initiative unfortunately. Writing complex Azure Policy is very similar to writing good AWS IAM policies in that it takes lots of practice and testing. I won’t spend cycles describing the Azure Policy structure as it’s very well documented here. However, I do have one link I use as a refresher whenever I return to writing policies. It does a great job describing the differences between the conditions.

You can write your policy using any editor that you wish, but Visual Studio Code has a nice add-in for Azure Policy. The add in can be used to examine the Azure Policies associated with a given Azure Subscription or optionally to view the aliases available for a given property (I didn’t find this to be as reliable as using querying via CLI).

Now that I understood the data structure of the properties and validated that the property had an alias, writing the policy wasn’t a big challenge.

{
    "mode": "All",
    "parameters": {},
    "displayName": "Audit - Azure Virtual Machines should have encryption at host enabled",
    "description": "Ensure end-to-end encryption of a Virtual Machines Managed Disks with encryption at host: https://docs.microsoft.com/en-us/azure/virtual-machines/disk-encryption#encryption-at-host---end-to-end-encryption-for-your-vm-data",
    "policyRule": {
        "if": {
            "anyOf": [
                {
                    "allOf": [
                        {
                            "field": "type",
                            "equals": "Microsoft.Compute/virtualMachines"
                        },
                        {
                            "field": "Microsoft.Compute/virtualMachines/securityProfile.encryptionAtHost",
                            "exists": false
                        }
                    ]
                },
                {
                    "allOf": [
                        {
                            "field": "type",
                            "equals": "Microsoft.Compute/virtualMachineScaleSets"
                        },
                        {
                            "field": "Microsoft.Compute/virtualMachineScaleSets/virtualMachineProfile.securityProfile.encryptionAtHost",
                            "exists": false
                        }
                    ]
                },
                {
                    "allOf": [
                        {
                            "field": "type",
                            "equals": "Microsoft.Compute/virtualMachineScaleSets/virtualMachines"
                        },
                        {
                            "field": "Microsoft.Compute/virtualMachineScaleSets/virtualmachines/securityProfile.encryptionAtHost",
                            "exists": false
                        }
                    ]
                }
            ]
        },
        "then": {
            "effect": "audit"
        }
    }
}

Let me walk through the relevant sections of the policy where I set the value to something for a reason. For the mode section I chose to use the “Full” mode since Microsoft recommends setting this to full unless you’re evaluating tags or locations. In the parameters section I left it blank since there was no relevant parameters that I wanted to add to the policy. You’ll often use parameters when you want to give the person assigning the policy the ability to determine the effect. Other use cases could be allowing the user to specify a list of tags you want applied or a list of allowed VM SKUs.

Finally we have the policyRule section which is the guts of the policy. In the policy I’ve written I have three rules which each include two conditions. The policy will fire the effect of Audit when any of the three rules evaluates as true. For a rule to evaluate as true both conditions must be true. For example, in the rule below I’m checking the resource type to validate it’s a virtual machine and then I’m using the encryption at host alias to check whether or not the property exists on the resource. If the resource is a virtual machine and the encryptionAtHost property doesn’t exist, the rule evaluates as true and the Azure Policy engine marks the resource as non-compliant.

Step 4 – Create the policy definition and assign the policy

Now that I’ve written my policy in my favorite editor, I’m ready to create the definition. Similar to Azure RBAC roles (role based access control) Azure Policies uses the concept of a definition and an assignment. Definitions can be created via the Portal, CLI, PowerShell, ARM templates, or ARM REST API like any Azure resource. For the purposes of simplicity, I’ll be creating the definition and assigning the policy through the Azure Portal.

Before I create the definition I need to plan where I create the definitions. Definitions can be created at the Management Group or Subscription scope. Creating a definition at the Management Group allows the policy to be assigned to the Management Group, child subscriptions of the Management Group, child resource groups of the subscriptions, or child resources of a resource group. I recommend creating the definitions as part of an initiative and create the definitions for the initiatives at a Management Group scope to minimize operational overhead. For this demonstration I’ll be creating a standalone definition at the subscription scope.

To do this you’ll want to search for Azure Policy in the Azure Portal. Once you’re in the Azure Policy blade you’ll want to Definitions under the Authoring section seen below. Click on the + Policy definition link.

On the next screen you’ll define some metadata about the policy and provide the policy logic. Make sure to use a useful display name and description when you’re doing this in a real environment.

Next we need to add our policy logic which you can copy and paste from your editor. The engine will validate that the policy is properly formatted. Once complete click the Save button as seen below.

Now that you’ve created the policy definition you can create the assignment. I’m not a fan of recreating documentation, so use the steps at this link to create the assignment.

Step 5 – Test the policy and review the results

Policy evaluations are performed under the following circumstances. When I first started using policy a year ago, triggering a manual evaluation required an API call. Thankfully Microsoft incorporated this functionality into CLI and PowerShell. In CLI you can use the command below to trigger a manual policy evaluation.

az policy state trigger-scan

Take note that after adding a new policy assignment, I’ve found that running a scan immediately after adding the assignment results in the new assignment not being evaluated. Since I’m impatient I didn’t test how long I needed to wait and instead ran another scan after the first one. Each scan can take up 10 minutes or longer depending on the number of resources being evaluated so prepare for some pain when doing your testing.

Once evaluation is complete I can go to the Azure Policy blade and the Overview section to see a summary of compliance of the resources.

Clicking into the policy shows me the details of the compliant and non-compliant resources. As expected the VMs, VMSS, and VMSS instances not enabled for encryption at host are reporting as non-compliant. The ones set for host-based encryption report to as compliant.

That’s all there is to it!

So to summarize the tips from the post:

  1. Before you create a new policy check to see if it already exists to save yourself some time. You can check the built-in policies in the Portal, the Microsoft samples, or the community repositories. Alternatively, take a crack it yourself then check your logic versus the logic someone else came up with to practice your policy skills.
  2. Always review the resource properties of a resource instance with that you would want to report on and one you wouldn’t want to report on. This will help make sure you make good use of equals vs exists.
  3. Validate that the properties you want to check are exposed to the Azure Policy engine. To do this check the aliases using CLI, PowerShell, or the Visual Studio add-in.
  4. Create your policy definitions at the management group scope to minimize operational overhead. Beware of tenant and subscription limits for Azure Policy.
  5. Perform significant testing on your policy to validate the policy is reporting on the resources you expect it to.
  6. When you’re new to Azure it’s fine to use the built-in policies directly. However, as your organization matures move away from using them directly and instead create a custom policy with the built-in policy logic. This will help to ensure the logic of your policies only change when you change it.

I hope this post helps you in your Azure Policy journey. The policy in this post and some others I’ve written in the past can be found in this repo. Don’t get discouraged if you struggle at first. It takes time and practice to whip up policies and even then some are a real challenge.

See you next post!

What If… Volume 2

Hi there folks!

I’ve been busy lately buried in learning and practicing Kubernetes in preparation for the Certified Kubernetes Administrator exam. Tonight I’m taking a break to bring you another entry into the “What If” series I started a few months back.

Let’s get right to it.

What if I need to access a Private Endpoint in a subscription associated with a different Azure AD tenant and I have an existing Azure Private DNS Zone already?

I’ve been helping a good friend who recently joined Microsoft to support his customer as he gets up to speed on the Azure platform. This customer consists of two very large organizations which have a high degree of independence. Each of these organizations have their own Azure AD tenant and their own Azure footprint. One organization is further along in their cloud journey than the other.

Organization A (new to Azure) needed to consume some data that existed in an Azure SQL database in an Azure subscription associated with Organization B’s tenant. Both organizations have strict security and compliance requirements so they are heavy users of Azure PrivateLink Endpoints. A site-to-site VPN (virtual private network) connection was established between the two organizations to facilitate network communication between the Azure environments.

Customer Environment

The customer environment looked similar to the above where a machine on-premises in Organization A needed to access the Azure SQL database in Organization B. If you look closely, you probably see the problem already. From a DNS perspective, we have two Azure Private DNS Zones for privatelink.database.windows.net. This means we have two authorities for the same zone.

My peer and I went back on forth with a few different solutions. One solution seemed obvious in that organization A would manually create an A record in their Azure Private DNS zone pointing to IP of the PrivateLink Endpoint in Organization B. Since the organizations had connectivity between the two environments, this would technically work. The challenge with this pattern is it would introduce a potential bottleneck depending on the size of the VPN pipe. It could also lead to egress costs for Organization A depending on how the VPN connection was implemented.

The other option we came up with was to create a Private Endpoint in Organization A’s Azure subscription which would be associated with the Azure SQL instance running in Organization B’s Azure subscription. This would avoid any egress costs, we wouldn’t be introducing a potential bottleneck, and we’d avoid the additional operational head of having to manually manage the A record in Organization A’s Azure Private DNS Zone. Neither of us had done this before and while it seemed to be possible based on Microsoft’s documentation, the how was a bit lacking when talking PaaS services.

To test this I used two separate personal tenants I keep to test scenarios that aren’t feasible to test with internal resources. My goal was to build an architecture like the below.

Target architecture

So was it possible? Why yes it was, and an added bonus I’m going to tell you how to do it.

When you create a Private Endpoint through the Azure Portal, there is a Connection Method radio button seen below. If you’re creating the Private Endpoint for a resource within the existing tenant you can choose the Connect to an Azure resource in my directory option and you get a handy guided selection tool. If you want to connect to a resource outside your tenant, you instead have to select the Connect to an Azure resource by resource ID or alias. In this field you would end the full resource ID of the resource you’re creating the Private Endpoint for, which in this case is the Azure SQL server resource id. You’ll be prompted to enter the sub-resource which for Azure SQL is SqlServer. Proceed to create the Private Endpoint.

Private Endpoint Creation

After the Private Endpoint has been created you’ll observe it has a Connection status of Pending. This is part of the approval workflow where someone with control over the resource in the destination tenant needs to approve of the connection to the Azure SQL server.

Private Endpoint in pending status

If you jump over to the other resource in the target tenant and select the Private endpoint connections menu option you’ll see there is a pending connection that needs approval along with a message from the requestor.

Private Endpoint request

Select the endpoint to approve and click the approve button. At that point the Private Endpoint in the requestor tenant and you’ll see it has been approved and is ready for use.

This was a fun little problem to work through. I was always under the assumption this would work, the documentation said it would work, but I’m a trust but verify type of person so I wanted to see and experience it for myself.

I hope you enjoyed the post and learned something new. Now back to practicing Kubernetes labs!

Interesting behaviors with Private Endpoints

Interesting behaviors with Private Endpoints

Update September 2022 – The route summarization feature officially went generally available! This feature allows you to summarize a single address block and route it to your NVA for inspection instead of having to do /32s for each private endpoint. Note that SNAT is still a requirement to ensure symmetric traffic flow.

Hi folks!

Working for and with organizations in highly regulated industries like federal and state governments and commercial banks often necessitates diving REALLY deep into products and technologies. This means peeling back the layers of the onion most people do not. The reason this pops up is because these organizations tend to have extremely complex environments due the length of time the organization has existed and the strict laws and regulations they must abide by. This is probably the reason why I’ve always gravitated towards these industries.

I recently ran into an interesting use case where that willingness to dive deep was needed.

A customer I was working with was wrapping up its Azure landing zone deployment and was beginning to deploy its initial workloads. A number of these workloads used Microsoft Azure PaaS (platform-as-a-service) services such as Azure Storage and Azure Key Vault. The customer had made the wise choice to consume the services through Azure Private Endpoints. I’m not going to go into detail on the basics of Azure Private Endpoints. There is plenty of official Microsoft documentation that can cover the basics and give you the marketing pitch. You can check out my pasts posts on the topic such as my series on Azure Private DNS and Azure Private Endpoints.

This particular customer chose to use them to consume the services over a private connection from both within Azure and on-premises as well as to mitigate the risk of data exfiltration that exists when egressing the traffic to Internet public endpoints or using Azure Service Endpoints. One of the additional requirements the customer had as to mediate the traffic to Azure Private Endpoints using a security appliance. The security appliance was acting as a firewall to control traffic to the Private Endpoints as well to perform deep packet inspection sometime in the future. This is the requirement that drove me down into the weeds of Private Endpoints and lead to a lot of interesting observations about the behaviors of network traffic flowing to and back from Private Endpoints. Those are the observations I’ll be sharing today.

For this lab, I’ll be using a slightly modified version of my simple hub and spoke lab. I’ve modified and added the following items:

  • Virtual machine in hub runs Microsoft Windows DNS and is configured to forward all DNS traffic to Azure DNS (168.63.129.16)
  • Virtual machine in spoke is configured to use virtual machine in hub as a DNS server
  • Removed the route table from the spoke data subnet
  • Azure Private DNS Zone hosting the privatelink.blob.core.windows.net namespace
  • Azure Storage Account named mftesting hosting some sample objects in blob storage
  • Private Endpoint for the mftesting storage account blob storage placed in the spoke data subnet
Lab environment

The first interesting observation I made was that there was a /32 route for the Private Endpoint. While this is documented, I had never noticed it. In fact most of my peers I ran this by were never aware of it either, largely because the only way you would see it is if you enumerated effective routes for a VM and looked closely for it. Below I’ve included a screenshot of the effective routes on the VM in the spoke Virtual Network where the Private Endpoint was provisioned.

Effective routes on spoke VM

Notice the next hop type of InterfaceEndpoint. I was unable to find the next hop type of InterfaceEndpoint documented in public documentation, but it is indeed related to Private Endpoints. The magic behind that next hop type isn’t something that Microsoft documents publicly.

Now this route is interesting for a few reasons. It doesn’t just propagate to all of the route tables of subnets within the Virtual Network, it also propagates to all of the route tables in directly peered Virtual Networks. In the hub and spoke architecture that is recommended for Microsoft Azure, this means that every Private Endpoint you create in a spoke Virtual Network is propagated to as a system route to route tables of each subnet in the hub Virtual Network. Below you can see a screen of the VM running in the hub Virtual Network.

Effective routes on hub VM

This can make things complicated if you have a requirement such as the customer I was working with where the customer wants to control network traffic to the Private Endpoint. The only way to do that completely is to create a /32 UDRs (user defined routes) in every route table in both the hub and spoke. With a limit of 400 UDRs per route table, you can quickly see how this may break down at scale.

There is another interesting thing about this route. Recall from effective routes for the spoke VM, that there is a /32 system route for the Private Endpoint. Since this is the most specific route, all traffic should be routed directly to the Private Endpoint right? Let’s check that out. Here I ran a port scan against the Private Endpoint using nmap using the ICMP, UDP, and TCP protocols. I then opened the Log Analytics Workspace and ran a query across the Azure Firewall logs for any traffic to the Private Endpoint from the VM and lo and behold, there is the ICMP and UDP traffic nmap generated.

Captured UDP and ICMP traffic

Yes folks that /32 route is protocol aware and will only apply to TCP traffic. UDP and ICMP traffic will not be affected. Software defined networking is grand isn’t it? 🙂

You may be asking why the hell I decided to test this particular piece. The reason I followed this breadcrumb was my customer had setup a UDR to route traffic from the VM to an NVA in the hub and attempted to send an ICMP Ping to the Private Endpoint. In reviewing their firewall logs they saw only the ICMP traffic. This finding was what drove me to test all three protocols and make the observation that the route only affects TCP traffic.

Microsoft’s public documentation mentions that Private Endpoints only support TCP at this time, but the documentation does not specify that this system route does not apply to UDP and ICMP traffic. This can result in confusion such as it did for this customer.

So how did we resolve this for my customer? Well in a very odd coincidence, a wonderful person over at Microsoft recently published some patterns on how to approach this problem. You can (and should) read the documentation for the full details, but I’ll cover some of the highlights.

There are four patterns that are offered up. Scenario 3 is not applicable for any enterprise customer given that those customers will be using a hub and spoke pattern. Scenario 1 may work but in my opinion is going to architect you into a corner over the long term so I would avoid it if it were me. That leaves us with Scenario 2 and Scenario 4.

Scenario 2 is one I want to touch on first. Now if you have a significant background in networking, this scenario will leave you scratching your head a bit.

Microsoft Documentation Scenario 2

Notice how a UDR is applied to the subnet with the VM which will route traffic to Azure Firewall however, there is no corresponding UDR applied to the Private Endpoint. Now this makes sense since the Private Endpoint would ignore the UDR anyway since they don’t support UDRs at this time. Now you old networking geeks probably see the problem here. If the packet from the VM has to travel from A (the VM) to B (stateful firewall) to C (the Private Endpoint) the stateful firewall will make a note of that connection in its cache and be expecting packets coming back from the Private Endpoint representing the return traffic. The problem here is the Private Endpoint doesn’t know that it needs to take the C (Private Endpoint) to B (stateful firewall) to A (VM) because it isn’t aware of that route and you’d have an asymmetric routing situation.

If you’re like me, you’d assume you’d need to SNAT in this scenario. Oddly enough, due the magic of software defined routing, you do not. This struck me as very odd because in scenario 3 where everything is in the same Virtual Network you do need to SNAT. I’m not sure why this is, but sometimes accepting magic is part of living in a software defined world.

Finally, we come to scenario 4. This is a common scenario for most customers because who doesn’t want to access Azure PaaS services over an ExpressRoute connection vs an Internet connection? For this scenario, you again need to SNAT. So honestly, I’d just SNAT for both scenario 2 and 4 to make maintain consistency. I have successfully tested scenario 2 with SNAT so it does indeed work as you expect it would.

Well folks I hope you found this information helpful. While much of it is mentioned in public documentation, it lacks the depth that those of us working in complex environments need and those of us who like to geek out a bit want.

See you next post!