DNS in Microsoft Azure – Part 1

Hi everyone,

In this series of posts I’m going to talk about a technology that, while old, still provides a critical foundational service.  Yes folks, we’re going to cover the Domain Name System (DNS).  Specifically, we’re going to look at the options for private DNS in Microsoft Azure and the positives and negatives of each pattern.  I’m going into this assuming you have a basic knowledge of DNS and understand namespaces, the various record types, forward and reverse lookup zones, recursive and iterative queries, DNS forwarding and conditional forwarding, and other core DNS concepts.  If any of those are unfamiliar to you, take some time to review the basics and then come back to this post.

Before we jump into the DNS options in Azure, I first want to cover the 168.63.129.16 address.  If you’ve ever done anything even basic in Azure, you’ve probably run into this address or used it without knowing it.  This public IP address is owned by Microsoft and is presented as a virtual IP address that serves as a communication channel to the host node for a number of platform resources.  It enables the virtual machine (VM) agent to communicate the VM’s ready state and health state, allows the VM to obtain an IP address via DHCP, and, you guessed it, lets the VM leverage Azure DNS services.  The address is static and is the same for every VNet you create in every Azure region.  Fun fact, some geolocation services will report this IP as being based out of Hong Kong and I’m sure you can imagine how that works when something like a WAF is in place with regional IP restrictions.  Fun times. 🙂

Traffic is routed to and from this virtual IP address through the subnet gateway.  If you run a route print on a Windows machine, you can see this route defined in the routing table of the VM.

Output of route print on Azure VM

The IP address is also defined in the VirtualNetwork service tag, meaning the default rules within a network security group (NSG) allow this traffic to and from the VM.  Given the criticality of the functions this IP serves, Microsoft recommends you allow inbound and outbound communication with it (it’s a requirement for using any of the Azure DNS services we’ll discuss in these posts).

Now that you understand what the 168.63.129.16 virtual IP address is, let’s first cover the very basics of DNS in Azure. You can configure Azure’s DHCP service to push a custom set of DNS servers to Azure VMs or optionally leave the default which is for VMs to use Azure’s DNS services (through the 168.63.129.16 virtual IP address).  This can be configured at the VNet level and then inherited by all virtual network interfaces (VNIs) associated with the VNet, or optionally configured directly on the VNI associated with the VM.

Configure DNS on VNet
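If you’d rather script the change than click through the portal, a minimal sketch using the Az PowerShell module might look like the following.  The VNet name, resource group, and DNS server IP are placeholders for your own values:

$vnet = Get-AzVirtualNetwork -Name 'vnet1' -ResourceGroupName 'rg-dns'

# Replace the DHCP-delivered DNS servers for every network interface in the VNet
$vnet.DhcpOptions.DnsServers = @('10.101.0.10')
$vnet | Set-AzVirtualNetwork

# Clearing the list reverts the VNet to the default Azure-provided DNS (168.63.129.16);
# VMs pick up the change after a reboot or an ipconfig /renew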

This brings us to the first option for DNS resolution in Azure, Azure-provided name resolution.  Each time you spin up a virtual network, Azure assigns it a unique private DNS namespace using the format <randomly generated>.internal.cloudapp.net.  This namespace is pushed to the machine via DHCP Option 15, thus each VM has a fully qualified domain name of <vm_host_name>.<randomly generated>.internal.cloudapp.net and the VMs in the VNet can resolve one another’s IP addresses.

Let’s look at an example with a single VNet.  I’ve created a single VNet named vnet1.  I’ve assigned it the CIDR block of 10.101.0.0/16 and created a single subnet assigned the 10.101.0.0/24 block.  Two Windows Server 2016 VMs have been created named azuredns and azuredns1 with the IP addresses 10.101.0.4 and 10.101.0.5.  Azure has assigned a namespace of r0b5mqxog0hu5nbrf150v3iuuh.bx.internal.cloudapp.net to the VNet.  Note the DHCP Server and DNS Server settings in the ipconfig output of the azuredns VM shown below.

IPConfig output of Azure VM

If we ping azuredns1 from azuredns, we can see in the Wireshark capture below that prior to executing the ping, azuredns performs a DNS query to the 168.63.129.16 VIP and gets back a query response with the IP address of azuredns1.

Wireshark packet capture of DNS query

The resolution process is very simple as seen in the diagram below.

DNS Resolution within single VNet
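If you want to reproduce that lookup yourself without firing up Wireshark, Resolve-DnsName can query the virtual IP directly.  A quick sketch using the host names from the lab above:

# Ask the Azure-provided DNS service for the FQDN registered in the VNet's namespace
Resolve-DnsName -Name 'azuredns1.r0b5mqxog0hu5nbrf150v3iuuh.bx.internal.cloudapp.net' -Server 168.63.129.16

# From a VM in the VNet, the short name resolves via the DHCP-delivered DNS settings and suffix
Resolve-DnsName -Name 'azuredns1'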

Well that’s all well and good for very basic DNS resolution, but who the heck has a single VNet in anything but a test environment?  So can we expand Azure-provided DNS to multiple VNets?  The answer is yes, but it’s ugly.  Recall that each VNet has its own private DNS namespace.  The only way to resolve names contained within that namespace is for a VM in that VNet to send the query to the 168.63.129.16 address.  Yes folks, this means you would need to drop a DNS server in each VNet in order for VMs in one VNet to resolve the Azure-provided DNS host names assigned to VMs in another VNet, as illustrated in the diagram below.

Multiple VNet resolution

You can see that as the number of VNets increases, the scalability of this solution quickly breaks down.  Take note that if you wanted to resolve these host names from on-premises, you could use a similar conditional forwarder pattern.
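To make the pattern a bit more concrete, here’s a hedged sketch of the DNS server configuration involved.  The zone name (the other VNet’s Azure-assigned namespace) and the IP of the DNS server sitting in that VNet are placeholders:

# On the DNS server in VNet A: forward queries for VNet B's namespace to the DNS server in VNet B
Add-DnsServerConditionalForwarderZone -Name 'vnetb-guid.bx.internal.cloudapp.net' -MasterServers 10.102.0.4

# On the DNS server in VNet B: forward anything it can't answer to the Azure virtual IP
Set-DnsServerForwarder -IPAddress 168.63.129.16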

Let’s sum up the positives and negatives of Azure-provided DNS.

  • Positives
    • No need to provision your own DNS servers and worry about high availability or scalability
    • DNS service provided by Azure automatically scales
    • VMs within a VNet can resolve each other’s IP addresses out of the box
  • Negatives
    • Solution doesn’t scale with multiple VNets
    • You’re stuck with the namespace assigned to the VNet
    • WINS and NetBIOS are not supported
    • Only A records that are automatically registered by the service are supported (no manual registration of records)
    • No reverse DNS support
    • No query logging

As you can see from the above, the negatives far outweigh the positives.  Personally, I see Azure-provided DNS only being useful for bare bones test environments with a single VNet.  If anyone has any other scenarios where it comes in handy, I’d love to hear them.

In my next post I’ll cover Azure’s new offering in the DNS space, Azure Private DNS Zones.  I’ll walk through how it works and how we can combine it with BYO DNS to create some pretty neat patterns.

See you then!

Capturing Azure Management Group Activity Logs Using Azure Automation – Part 2

Welcome back fellow geeks!

This post will be the second post in a series covering how to use Azure Automation to capture Azure Management Group Activity Logs.  In the first post I walked through what management groups are and the problems that they solve.  The key takeaway of that post is that management groups have their own Activity Logs and (at this time) they’re only accessible from within the Portal and over the Azure REST API.  Given that management groups are where we’re applying our Azure Policy for governance and compliance and our access controls via Azure RBAC, the Activity Logs are pretty critical.  So what is a geek to do?

In this post I’ll cover a solution I put together to solve the problem.  It uses an Azure Automation PowerShell Runbook to iterate through the management groups within an Azure Active Directory tenant, write the logs to Azure Storage, and optionally deliver the logs to Azure Monitor or Azure Event Hubs.  The architecture is pictured below.

Capture.PNG

If you’re not familiar with Azure Automation, it’s a service that provides a number of key capabilities within Azure such as configuration management, update management, and process automation.  If you’re coming from AWS, I’d compare it to AWS Systems Manager.  For the purposes of this series of posts I’m going to focus on the process automation capability of the service delivered through Runbooks.  I’m not going to go too in-depth into Azure Automation, but I’ll provide a brief overview of the service features and tweaks relevant to the solution.

Runbooks are modules of code that can be strung together to perform a series of tasks such as performing maintenance on a collection of VMs.  The modules can be authored using either PowerShell or Python.  At this time only Python 2 is supported, which makes me a sad panda.  Given that Python 2 enters end of life in two months, I’d recommend doing anything Python related in Azure Functions.  I could devote an entire blog post complaining about the lack of Python 3 in the year 2019, but I’ll spare you.  You’re going to want to author your Runbooks in PowerShell until/if Python 3 is supported in the future.

The Azure Automation account acts as a logical container for the Runbooks created within it.  An Azure Automation Account can be provided with a RunAs account, which is simply a service principal in Azure Active Directory.   The service principal is configured with a certificate credential which is used by the Automation Account to authenticate to Azure AD and access Azure resources within the tenant.  Any Runbooks you create within the Automation account can assume the identity to execute tasks across your Azure resources.

You can automatically provision the RunAs account when the Azure Automation Account is provisioned, just be aware that the service principal will be granted the Contributor role on the Azure Subscription.  This is probably going to be way more permissions than are needed so I’d recommend removing that role assignment, creating a custom RBAC role, and assigning it at the appropriate scope.

Automation Accounts have a number of assets which are relevant for Runbooks.  These include variables, connections, credentials, and certificates.  The links I provided will give you detailed information on these assets, so I’ll summarize the content relevant to the solution.  Variables come in a variety of types including strings and integers and can optionally be encrypted.  For this solution I use encrypted variables to store the Event Hub connection string, Log Analytics Workspace Id, and Log Analytics Workspace Key.  Connections contain information required to connect to an external service or application.  The only connection asset used with this solution is the AzureServicePrincipal, which is used by the RunAs account.  You can retrieve the connection to get information such as the Azure AD tenant Id and application id (client id in the OAuth world).  Lastly, we have the certificate asset, which, as the name describes, can be used to securely store a certificate that is used for authentication.  This solution uses the AzureRunAsCertificate certificate asset, which contains the certificate used to authenticate the Automation Account RunAs account.
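To give you a feel for how the connection and certificate assets come together inside a Runbook, here’s a minimal sketch of the typical authentication block.  It assumes the default AzureRunAsConnection asset and the Az module; the variable names are my own:

# Pull the RunAs connection asset and authenticate as the service principal
$connection = Get-AutomationConnection -Name 'AzureRunAsConnection'

Connect-AzAccount -ServicePrincipal `
    -TenantId $connection.TenantId `
    -ApplicationId $connection.ApplicationId `
    -CertificateThumbprint $connection.CertificateThumbprint

# Encrypted variable assets are retrieved the same way as unencrypted ones
$workspaceId = Get-AutomationVariable -Name 'LogAnalyticsWorkspaceId'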

Each Automation Account comes with a predefined set of PowerShell modules and .NET libraries.  You can add additional modules and libraries by importing them to the Automation Account.  For this solution I added a number of .NET libraries including the ADAL and some libraries required to communicate with Event Hubs.  While PowerShell does a wonderful job of handling things at the management plane of Azure, it is severely lacking in the data plane requiring you to fall back on incorporating .NET code into your PowerShell script.

The above (including the links) should give you the bare minimum you need to understand to use this solution.  Let’s deep dive into the code.  Since this is a fairly lengthy script I’m not going to paste every line of code.  Instead I’m going to call out key sections of code that were particularly relevant or interesting to write.

The first function in the script is called Get-AdalToken and uses the .NET ADAL library to retrieve a token from Azure AD.  When I code in Python I typically use the MSAL library since I find it to be a bit more slick, but found the .NET version too cumbersome and difficult to use in PowerShell.  If you’ve ever used .NET libraries in your PowerShell scripts, you know where I’m coming from.

The token retrieved by the function is used for calls to the Azure Management REST API.  The reason I went with ADAL versus pulling the access token from a session created using the Add-AzAccount method as demonstrated here is that I wanted code I could reuse for other purposes outside of the Azure REST API.

Once the token is retrieved it is stored in a variable for later use in the script.

adal
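If the screenshot is hard to read, the gist of using ADAL from PowerShell looks roughly like the sketch below.  This is my own simplified approximation using a client secret rather than the certificate credential the RunAs account actually uses, so treat it as illustrative only:

# Load the ADAL library and request a token for the Azure Management API
Add-Type -Path '.\Microsoft.IdentityModel.Clients.ActiveDirectory.dll'

$authority = "https://login.microsoftonline.com/$tenantId"
$resource  = 'https://management.azure.com/'

$authContext = New-Object Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext($authority)
$clientCred  = New-Object Microsoft.IdentityModel.Clients.ActiveDirectory.ClientCredential($clientId, $clientSecret)

# AcquireTokenAsync returns a Task, so block until it completes and grab the access token
$token = $authContext.AcquireTokenAsync($resource, $clientCred).GetAwaiter().GetResult()
$accessToken = $token.AccessToken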

Next up we have the Get-AllManagementGroups function.  This function calls the Azure REST API to get a full listing of management groups.  Oddly enough there is an AzureRM cmdlet included in the AzureRm.Resources module that comes preinstalled with every new Automation Account.  However, even after updating the modules within the account (this link tells you how to do this and I highly recommend doing it whenever you create a new automation account) the cmdlet only ever reported back the tenant root group.  This occurred even when following the instructions to spit back all Management Groups.  I chalked it up to there being an issue with the cmdlet or user error on my part.  Either way, it was simple enough to whip up a call to the REST API.

Following the Get-AllManagementGroups function we have the Get-ManagementGroupActivityLog function.  Let me tell you folks, this one was an absolute pain to write.  According to this Azure feedback thread these logs have been accessible over the API since back in March of this year, but the REST API reference documentation doesn’t look to have been updated to reflect this.  I’m going to save you all a ton of headaches and hours of experimentation and searching the web.  When you want to get Activity Logs over the REST API you are going to use the following endpoint:


https://management.azure.com/providers/Microsoft.Management/managementGroups/mgmtGroupId/providers/microsoft.insights/eventtypes/management/values

The mgmtGroupId variable would be the name of your management group.  If your management group is named production then the value in that URL would be production.  Additionally, you’ll want to pass query parameters of api-version set to 2017-03-01-preview and a $filter query parameter constructed in the same way you would to query a subscription Activity Log.

activitylogquery.PNG
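Putting the endpoint, api-version, and filter together, the call ends up looking something like the hedged sketch below.  The management group name, date range, and access token variable are placeholders:

# Pull a management group's Activity Log over the REST API
$mgmtGroupId = 'production'
$filter = [uri]::EscapeDataString("eventTimestamp ge '2019-09-01T00:00:00Z' and eventTimestamp le '2019-10-01T00:00:00Z'")

$uri = "https://management.azure.com/providers/Microsoft.Management/managementGroups/$mgmtGroupId" +
       "/providers/microsoft.insights/eventtypes/management/values" +
       "?api-version=2017-03-01-preview&`$filter=$filter"

$response = Invoke-RestMethod -Uri $uri -Method Get -Headers @{ Authorization = "Bearer $accessToken" }
$response.value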

The SendTo-Storage function sends the Activity Log for each Management Group as a separate blob to Azure Storage.  The format of the Activity Log is raw JSON.

The SendTo-Workspace function sends the log data to Azure Monitor (really a Log Analytics Workspace) via the HTTP Data Collector API.  The product team was wonderful enough to include sample PowerShell code that made writing that function a breeze.

I did run into some weirdness with this function which was caused by the maximum size of an output stream in Runbooks, which is 1MB.  When I pulled the Activity Log for 90 days, the entirety of the log was well over 1MB, so it would cause the Runbook to fail three times and suspend.  Debugging this was a pain because the Runbook doesn’t report the error in an obvious way.  I got around this by collecting the log entries into a group and sending them in roughly 200KB batches.  I also added some error checking and retry handling in case the calls got throttled.
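For what it’s worth, the batching logic boils down to something like the sketch below.  Post-LogAnalyticsData stands in for the helper function from Microsoft’s HTTP Data Collector sample code, and the 200KB threshold is simply the value that worked for me:

# Batch the Activity Log entries so each POST to the Data Collector API stays safely under the limits
$batch = @()
foreach ($entry in $logEntries) {
    $batch += $entry
    $json = $batch | ConvertTo-Json -Depth 20
    if ([System.Text.Encoding]::UTF8.GetByteCount($json) -gt 200KB) {
        # Post-LogAnalyticsData is the helper from the HTTP Data Collector API sample code
        Post-LogAnalyticsData -customerId $workspaceId -sharedKey $workspaceKey `
            -body ([System.Text.Encoding]::UTF8.GetBytes($json)) -logType 'MgmtGroupActivityLog'
        $batch = @()
    }
}

# Send whatever is left over after the loop
if ($batch.Count -gt 0) {
    $json = $batch | ConvertTo-Json -Depth 20
    Post-LogAnalyticsData -customerId $workspaceId -sharedKey $workspaceKey `
        -body ([System.Text.Encoding]::UTF8.GetBytes($json)) -logType 'MgmtGroupActivityLog'
}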

The final function is named SendTo-EventHub and delivers the logs to an Event Hub.  I couldn’t find any PowerShell cmdlets that could be used to send data to Event Hub.  This forced me to fall back to the .NET libraries.  In the end I got it working and got them streaming, but I’m sure someone more skilled in .NET than me (which isn’t difficult to be) could optimize and improve that code.

The main chunk of the solution strings everything together.  By default the solution writes the logs to Azure blob storage.  You can optionally deliver the data to Azure Monitor and Azure Event Hubs.

Well folks, that brings us to the end of this post and series.  While I’m sure the product team is quickly working on out-of-the-box integration, I learned a ton about Azure Automation and Runbooks working on this effort.  Runbooks are a wonderful tool if you’re a classic infrastructure / security tech new to the whole coding thing.  It’s a very simple and straightforward user experience for that audience and a good stepping stone into the coding world versus jumping directly into Azure Functions.

I’ve posted the solution up on my GitHub.  For those folks without GitHub, I’ve put a static copy of the solution up on this website at this link.  Take it, test it, play with it, build upon it, and experiment with it.

Capturing Azure Management Group Activity Logs Using Azure Automation – Part 1

Hello again fellow geeks!

Over the past few months I’ve been working with a customer who is just beginning their journey into the cloud.  We’ve had a ton of great conversations around security, governance, and operationalizing Microsoft Azure.  We recently finalized the RACI and identified the controls required by both their internal security policy and their industry compliance requirements.  With those two items complete, we put together our Azure RBAC model and narrowed down the Azure Policies we needed to put in place to satisfy our compliance controls.

After a lot of discussion about the customer’s organization, its geographical locations, business unit makeup, and how its developers and central IT operate, we came up with a subscription model.  This customer had decided on an Azure subscription model where each workload would exist in its own subscription.  Further, each workload’s production and non-production environment would be segmented in different subscriptions.  Keeping each workload in a different subscription ensures no workload will compete for resources with other workloads and hit any subscription limits.  Additionally, it allowed the customer to very easily track the costs associated with each workload.

Now why did we use separate production and non-production subscriptions for each workload?  One reason is to address the same risk as above where a non-production workload could potentially consume all resources within a subscription impacting a production workload.  The other more critical reason is it makes it easier for us to apply different governance and access controls on production workloads vs non-production workloads.  The way we do this is through the usage of Azure Management Groups.

Management Groups were introduced into general availability back in late 2018 to help address the challenges organizations were having operating subscriptions at scale.  They provide a hierarchical method to apply governance and access controls across a collection of subscriptions.  For those of you familiar with AWS, Management Groups are somewhat similar to AWS Organizations and Organizational Units.  For my fellow Windows AD peeps, you can think of Management Groups somewhat like the container and organizational unit hierarchy in an Active Directory domain, where you apply access control entries and group policy at higher levels of the OU hierarchy and they are enforced and inherited down to the children.  Management Groups work in a similar manner in that the Azure RBAC definitions and assignments and the Azure Policy you assign to parent Management Groups are inherited down into the children.

Every Azure AD tenant starts with a top-level management group called the tenant root group.  Additional management groups created within the tenant are children of the tenant root group, up to a maximum of 10,000 management groups and up to six levels of depth.  Any RBAC assignment or Azure Policy assigned to the tenant root group applies to all child management groups in the tenant.  It’s important to understand that Management Groups are a resource within the Azure AD tenant and not a resource of an Azure subscription.  This will matter for reasons we’ll see later.
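As a quick illustration of working with the hierarchy, here’s a hedged sketch using the Az PowerShell module to create a child management group and scope a role assignment to it.  The names and object ID are placeholders, and parameter names have shifted slightly across module versions, so check Get-Help on your version:

# Create a management group (it lands under the tenant root group unless a parent is specified)
New-AzManagementGroup -GroupName 'production' -DisplayName 'Production'

# Assign an RBAC role at the management group scope; it inherits down to child groups and subscriptions
New-AzRoleAssignment -ObjectId '<aad-group-object-id>' `
    -RoleDefinitionName 'Reader' `
    -Scope '/providers/Microsoft.Management/managementGroups/production'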

The tenant root management group can only be administered by a Global Admin by default, and even this requires a configuration change in the tenant.  The method is described here; what it does is place the global administrator performing the action in the User Access Administrator RBAC role at the root scope.  Once that is complete, the name of the root management group can be changed, role assignments created, or policy assigned.

Administering Tenant Root Group

Now there is one aspect of Management Groups that is a bit funky.  If you’re very observant you probably noticed the menu option below.

Screen Shot 2019-10-17 at 9.59.59 PM.png

That’s right folks, Management Groups have their own Activity Log.  Every action you perform at the management group scope, such as creating an Azure RBAC role assignment or assigning or un-assigning an Azure Policy, is captured in this Activity Log.  As of today, the only way to access these logs is to view them through the portal or through the Azure REST API.  Unlike the Activity Logs associated with a subscription, there isn’t native integration with Event Hubs or Azure Storage.  Don’t be fooled by the Export To Event Hub link seen in the screenshot below; this will simply send you to the standard menu where you would configure subscription Activity Logs to be exported.

Screen Shot 2019-10-17 at 10.34.19 PM

Now you could log into the GUI every day and export the logs to a CSV (yes that does work with Management Groups) but that simply isn’t scalable and also prevents you from proactively monitoring the logs.  So how do we deal with this gap while the product team works on incorporating the feature?  This will be the challenge we address in this series.

Over the next few posts I’ll walk through the solution I put together using Azure Automation Runbooks to capture these Activity Logs and send them to Azure Storage for retention and an Azure Log Analytics Workspace for analysis and monitoring using Azure Monitor.

Continue the series in my second post.

Tips and Tricks for Writing Azure Policy

Hello geeks!

Over the past few weeks I’ve been working with a customer who has adopted the CIS (Center for Internet Security) controls framework.  CIS publishes a set of best practices and configurations, called benchmarks, for commonly used systems.  As you would expect, there is a set of benchmarks for Microsoft Azure.  Implementing, enforcing, and auditing for compliance with the benchmarks can be a challenge.  Thankfully, this is where Azure Policy comes to the rescue.

Azure Policy works by evaluating the properties of resources (management plane only right now, minus a few exceptions) either during deployment or for resources that have already been deployed.  This means you can stop a user from deploying a non-compliant resource rather than addressing it after the fact.  This feature is especially valuable for organizations that haven’t reached that very mature level of DevOps where all infrastructure is codified and pushed through a CI/CD pipeline that performs validation tests before deployment.

Policies are created in JSON format and contain five elements.  For the purposes of this blog post, I’ll be focusing on the policy rule element.  The other elements are straightforward and described fully in the official documentation.  The policy rule contains two sub-elements, a logical evaluation and an effect.  The logical evaluation uses simple if-then logic.  The if block contains one or more conditions with optional logical operators.  The if block will be where you spend much of your time (and more than likely frustration).

I would liken the challenge of learning how to construct working Azure Policy to the challenge of writing good AWS IAM Policies.  The initial learning curve is high, but once you get the hang of it, you can craft works of art.  Unfortunately, unlike AWS IAM Policy, there are some odd quirks with Azure Policy right now that are either under-documented or not documented at all.  Additionally, given how much newer Azure Policy is, there aren’t a ton of examples to draw from online for more complicated policies.

This brings us to the purpose of this blog.  While being very very very far from an expert (more like barely passable) on Azure Policy, I have learned some valuable lessons over the past few weeks of struggling through writing custom policies.  These are the lessons I want to pass on in hopes they’ll make your journey a bit easier.

    • Just because a resource alias exists, it doesn’t mean you can use it in a policy
      When you are crafting your conditions you’ll use fields which map to properties of Azure resources and describe their state.  There is a selection of fields that are supported, but the one you’ll probably use most often is the property alias.  You can pull a listing of property aliases using PowerShell, the CLI, or the REST API (see the sketch after this list).  Be prepared to format the output because some namespaces have a ton of properties; I threw together a Python solution to pull the namespaces into a more consumable format.  If you are using an alias that is listed but your policy fails to do what you want it to do, it could be that while the alias exists, it’s not accessible by Policy during an evaluation.  If the property belongs to a namespace that contains a property that is sensitive (like a secret), it will more than likely not be accessible by Policy and hence won’t be caught.  The general rule I follow is that if the namespace’s properties aren’t accessible with the Reader Azure RBAC role, policy evaluations won’t pick them up.  A good example of this is the authsettings namespace under Microsoft.Web/sites/config.  Say for example you wanted to check whether a Web App was using Facebook as an identity provider; you wouldn’t be able to use policy to check whether or not facebookAppId was populated.
    • Resource Explorer, Azure ARM Template Reference, and Azure REST API Reference are your friends, use them
      When you’re putting together a new policy make sure to use Azure Resource Explorer, the Azure ARM Template Reference, and the Azure REST API Reference.  The ARM Template Reference is a great tool to use when you are crafting a new policy because it will give you an idea of the schema of the resource you’ll be evaluating.  The Azure REST API Reference is useful when the description of a property is less than stellar in the ARM Template Reference (happens a lot).  Finally, the Azure Resource Explorer is an absolute must when troubleshooting a policy.  A peer and I ran into a quirk when authoring a policy to evaluate the runtime of an Azure Web App.  In this instance Azure Web Apps running PHP on Windows were populating the PHP runtime in the phpVersion property while Linux was populating it in the linuxFxVersion property.  This meant we had to include additional logic in the policy to detect the runtimes based on the OS.  Without using Resource Explorer we would never have figured that out.
    • Use on-demand evaluations when building new policies
      Azure Policy evaluations are triggered based upon the set of events described in this link.  The short of it is unless you want to wait 30 minutes after modifying or assigning a new policy, you’ll want to trigger an on-demand evaluation.  At this time this can only be done with a call to an Azure REST API endpoint; I’m unaware of a built-in method to do this with Azure CLI or PowerShell (see the sketch after this list).  Since I have a lot of love for my fellow geeks, I put together a Python solution you can use to trigger an evaluation.  Evaluations take anywhere between 5-10 minutes.  It seems like this takes longer the more policies you have, but that could simply be in my head.
    • RTFM.
      Seriously, read the public documentation.  Don’t jump into this service without spending an hour reading the documentation.  You’ll waste hours and hours of time smashing your head against the keyboard.  Specifically, read through this page to understand how processing across arrays works.  When you first start playing with Azure Policy, you’ll come across policies with double-negatives that will confuse the hell out of you.  Read that link and walk through policies like this one.  You can thank me later.
    • Explore the samples and experiment with them.
      Microsoft has published a fair amount of sample policies in the Azure Policy repo, the built-in policies and initiatives included in the Azure Portal, and the policy samples in the documentation.  I’ve thrown together a few myself and am working on others, so feel free to use them as you please.
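To make the first and third items above a bit more concrete, here’s a hedged sketch showing one way to list aliases for a namespace with the Az module and how to kick off an on-demand evaluation over the REST API.  The subscription ID and access token are placeholders, and the api-version shown is the preview version that was current when I wrote this:

# List the property aliases available under the Microsoft.Web namespace
Get-AzPolicyAlias -NamespaceMatch 'Microsoft.Web' |
    Select-Object -ExpandProperty Aliases |
    Select-Object Name

# Trigger an on-demand policy compliance evaluation for a subscription
$subscriptionId = '<subscription-id>'
$uri = "https://management.azure.com/subscriptions/$subscriptionId/providers/" +
       "Microsoft.PolicyInsights/policyStates/latest/triggerEvaluation?api-version=2018-07-01-preview"

Invoke-RestMethod -Uri $uri -Method Post -Headers @{ Authorization = "Bearer $accessToken" }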

Hope the above helps some of you on your journey to learning Azure Policy.  It’s a tool with a ton of potential and will no doubt improve over time.  One of the best ways to help it evolve is to contribute.  If you have some kick ass policies, submit them to get them published to the Azure Policy repo and to give back to the wider community.

Have a great week folks!

Debugging Azure SDK for Python Using Fiddler

Hi there folks.  Recently I was experimenting with the Azure Python SDK while writing a solution to pull information about Azure resources within a subscription.  A function within the solution was used to pull a list of virtual machines in a given Azure subscription.  While writing the function, I recalled that I hadn’t yet had experience handling paged results from the Azure REST API, which is the underlying API being used by the SDK.

I hopped over to the public documentation to see how the API handles paging.  Come to find out the Azure REST API handles paging in a similar way as the Microsoft Graph API by returning a nextLink property which contains a reference used to retrieve the next page of results.  The Azure REST API will typically return paged results for operations such as list when the items being returned exceed 1,000 items (note this can vary depending on the method called).

So great, I knew how paging was used.  The next question was how the SDK would handle paged results.  Would it be my responsibility or would it be handled by the SDK itself?

If you have experience with AWS’s Boto3 SDK for Python (absolutely stellar SDK by the way) and you’ve worked in large environments, you are probably familiar with the paginator subclass.  Paginators exist for most of the AWS service classes such as IAM and S3.  Here is an example of a code snippet from a solution I wrote to report on AWS access keys.

import boto3
from datetime import datetime

def query_iam_users():
    todaydate = (datetime.now()).strftime("%Y-%m-%d")
    users = []
    client = boto3.client('iam')

    paginator = client.get_paginator('list_users')
    response_iterator = paginator.paginate()
    for page in response_iterator:
        for user in page['Users']:
            # parse_arn is a helper defined elsewhere in the solution that pulls the account number from the ARN
            user_rec = {'loggedDate': todaydate, 'username': user['UserName'], 'account_number': parse_arn(user['Arn'])}
            users.append(user_rec)
    return users

Paginators make handling paged results a breeze and allow for extensive flexibility in controlling how paging is handled by the underlying AWS API.

Circling back to the Azure SDK for Python, my next step was to hop over to the SDK public documentation.  Navigating the documentation for the Azure SDK (at least for the Python SDK, I can’t speak for the other languages) is a bit challenging.  There are a ton of excellent code samples, but if you want to get down and dirty and create something new you’re going to have to dig around a bit to find what you need.  To pull a listing of virtual machines, I would be using the list_all method in the VirtualMachinesOperations class.  Unfortunately I couldn’t find any reference in the documentation to how paging is handled with the method or class.

So where to now?  Well, the next step was the public Github repo for the SDK.  After poking around the repo I located the documentation for the VirtualMachinesOperations class.  Searching the class definition, I was able to locate the code for the list_all() method.  Right at the top of the definition was this comment:

Use the nextLink property in the response to get the next page of virtual
machines.

Sounds like handling paging is on you, right?  Not so fast.  Digging further into the method I came across the function below.  It looks like the method is handling paging itself, relieving the consumer of the SDK of the overhead of writing additional code.

        def internal_paging(next_link=None):
            request = prepare_request(next_link)

            response = self._client.send(request, stream=False, **operation_config)

            if response.status_code not in [200]:
                exp = CloudError(response)
                exp.request_id = response.headers.get('x-ms-request-id')
                raise exp

            return response

I wanted to validate the behavior but unfortunately I couldn’t find any documentation on how to control the page size within the Azure REST API.  I wasn’t about to create 1,001 virtual machines, so instead I decided to use another class and method in the SDK.  So what type of service would return a hell of a lot of items?  Logging of course!  This meant using the list method of the ActivityLogsOperations class, which is a subclass of the module for Azure Monitor and is used to pull log entries from the Azure Activity Log.  Before I experimented with the class, I hopped back over to Github and pulled up the source code for the class.  Lo and behold, there is an internal_paging function within the list method that looks very similar to the one for the list_all method for VMs.

        def internal_paging(next_link=None):
            request = prepare_request(next_link)

            response = self._client.send(request, stream=False, **operation_config)

            if response.status_code not in [200]:
                raise models.ErrorResponseException(self._deserialize, response)

            return response

Awesome, so I have a method that will likely create paged results, but how do I validate that it is creating paged results and that the SDK is handling them?  For that I broke out one of my favorite tools, Telerik’s Fiddler.

There are plenty of guides on Fiddler out there so I’m going to skip the basics of how to install it and get it running.  Since the calls from the SDK are over HTTPS I needed to configure Fiddler to intercept secure web traffic.  Once Fiddler was up and running I popped open Visual Studio Code, setup a new workspace, configured a Python virtual environment, and threw together the lines of code below to get the Activity Logs.

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.monitor import MonitorManagementClient

TENANT_ID = 'mytenant.com'
CLIENT = 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX'
KEY = 'XXXXXX'
SUBSCRIPTION = 'XXXXXX-XXXX-XXXX-XXXX-XXXXXXXX'

credentials = ServicePrincipalCredentials(
    client_id = CLIENT,
    secret = KEY,
    tenant = TENANT_ID
)
client = MonitorManagementClient(
    credentials = credentials,
    subscription_id = SUBSCRIPTION
)

log = client.activity_logs.list(
    filter="eventTimestamp ge '2019-08-01T00:00:00.0000000Z' and eventTimestamp le '2019-08-24T00:00:00.0000000Z'"
)

for entry in log:
    print(entry)

Let me walk through the code quickly.  To make the call I used an Azure AD Service Principal I had setup that was granted Reader permissions over the Azure subscription I was querying.  After obtaining an access token for the service principal, I setup a MonitorManagementClient that was associated with the Azure subscription and dumped the contents of the Activity Log for the past 20ish days.  Finally I incremented through the results to print out each log entry.

When I ran the code in Visual Studio Code an exception was thrown stating there was a certificate verification error.

requests.exceptions.SSLError: HTTPSConnectionPool(host='login.microsoftonline.com', port=443): Max retries exceeded with url: /mytenant.com/oauth2/token (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)')))

The exception is being thrown by the Python requests module which is being used underneath the covers by the SDK.  The module performs certificate validation by default.  The reason certificate verification is failing is that Fiddler uses a self-signed certificate when configured to intercept secure traffic while it’s being used as a proxy.  This allows it to decrypt secure web traffic sent by the client.

Python doesn’t use the Computer or User Windows certificate store so even after you trust the self-signed certificate created by Fiddler, certificate validation still fails.  Like most cross platform solutions it uses its own certificate store which has to be managed separately as described in this Stack Overflow article.  You should use the method described in the article for any production level code where you may be running into this error, such as when going through a corporate web proxy.

For the purposes of testing you can also pass the parameter verify with the value of False as seen below.  I can’t stress this enough, be smart and do not bypass certificate validation outside of a lab environment scenario.

requests.get('https://somewebsite.org', verify=False)

So this is all well and good when you’re using the requests module directly, but what if you’re using the Azure SDK?  To do it within the SDK we have to pass extra parameters called kwargs which the SDK refers to as an Operation config.  The additional parameters passed will be passed downstream to the methods such as the methods used by the requests module.

Here I modified the earlier code to tell the requests methods to ignore certificate validation for the calls to obtain the access token and call the list method.

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.monitor import MonitorManagementClient

TENANT_ID = 'mytenant.com'
CLIENT = 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX'
KEY = 'XXXXXX'
SUBSCRIPTION = 'XXXXXX-XXXX-XXXX-XXXX-XXXXXXXX'

credentials = ServicePrincipalCredentials(
    client_id = CLIENT,
    secret = KEY,
    tenant = TENANT_ID,
    verify = False
)
client = MonitorManagementClient(
    credentials = credentials,
    subscription_id = SUBSCRIPTION,
    verify = False
)

log = client.activity_logs.list(
    filter="eventTimestamp ge '2019-08-01T00:00:00.0000000Z' and eventTimestamp le '2019-08-24T00:00:00.0000000Z'",
    verify = False
)

for entry in log:
    print(entry)

After the modifications the code ran successfully and I was able to verify that the SDK was handling paging for me.

fiddler.png

Let’s sum up what we learned:

  • When using an Azure SDK leverage the Azure REST API reference to better understand the calls the SDK is making
  • Use Fiddler to analyze and debug issues with the Azure SDK
  • Never turn off certificate verification in a production environment and instead validate the certificate verification error is legitimate and if so add the certificate to the trusted store
  • In lab environments, certificate verification can be disabled by passing an additional parameter of verify=False with the SDK method

Hope that helps folks.  See you next time!

Deep Dive into Azure Managed Identities – Part 2

Welcome back fellow geeks for the second installment in my series on Azure Managed Identities.  In the first post I covered the business problem and the risks Managed Identities address, and in this post I’ll be covering how managed identities are represented in Azure.

Let’s start by walking through the components that make managed identities possible.

The foundational component of any identity is the data store in which the identity lives.  In the case of managed identities, like much of the rest of the identity data for the Microsoft cloud, that data store is Azure Active Directory.  For those of you coming from a traditional on-premises environment who have had experience with directories such as Active Directory or one of the many flavors of LDAP, Azure Active Directory (Azure AD) is an Identity-as-a-Service offering which includes a directory component we can think of as a next generation directory.  This means it’s designed to be highly scalable, available, and resilient and is provided to you in an “as a service” model where a simple management layer sits in front of all the complexities of the compute, network, and storage infrastructure that makes up the directory.  There are a whole bunch of other cool features such as modern authentication, contextual authorization, adaptive authentication, and behavioral analytics that come along with the solution, so check out the official documentation to learn about those capabilities.  If you want to nerd out on the design of that infrastructure you can check out this whitepaper and this article.

It’s worthwhile to take a moment to cover Azure AD’s relationship to Azure.  Every resource in Azure is associated with an Azure subscription.  An Azure subscription acts as a legal and payment agreement (think type of Azure subscription, pay-as-you-go, Visual Studio, CSP, etc), boundary of scale (think limits to resources you can create in a subscription), and administrative boundary.  Each Azure subscription is associated with a single instance of Azure AD.  Azure AD acts as the security boundary for an organization’s space in Azure and serves as the identity backend for the Azure subscription.  You’ll often hear it referred to as “your tenant” (if you’re not familiar with the general cloud concept of tenancy check out this CSA article).

Azure AD stores lots of different object types including users, groups, and devices.  The object type we are interested in for the purposes of managed identity is the service principal.  Service principals act as the security principals for non-humans (such as applications or Azure resources like a VM) in Azure AD.  These service principals are then granted access by being assigned permissions to Azure resources such as an instance of Azure Key Vault or an Azure Storage account.  Service principals are used for a number of purposes beyond just Managed Identities, such as identities for custom developed applications or third-party applications.

Given that the service principals can be used for different purposes, it only makes sense that the service principal object type includes an attribute called the serviceprincipaltype.  For example, a third-party or custom developed application that is registered with Azure AD uses the service principal type of Application while a managed identity has the value set to ManagedIdentity.  Let’s take a look at an example of the serviceprincipaltypes in a tenant.

In my Geek In The Weeds tenant I’ve created a few application identities by registering the applications and I’ve created a few managed identities.  Everything else within the tenant is default out of the box.  To list the service principals in the directory I used the AzureAD PowerShell module.  The cmdlet that can be used to list out the service principals is Get-AzureADServicePrincipal.  By default the cmdlet will only return the first 100 results, so you need to set the All parameter to true.  Every application, whether it’s Exchange Online or Power BI, needs an identity in your tenant to interact with it and the resources you create that are associated with the tenant.  Here are the serviceprincipaltypes in my Geek In The Weeds tenant.

serviceprincipaltype.PNG
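If you want to produce a similar breakdown yourself, a command along these lines will summarize the service principals by type:

# Summarize the service principals in the tenant by type
Get-AzureADServicePrincipal -All $true |
    Group-Object -Property ServicePrincipalType |
    Select-Object Name, Count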

Now we know the security principal used by a Managed Identity is stored in Azure AD and is represented by a service principal object.  We also know that service principal objects have different types depending on how they’re being used and the type that represents a managed identity has a type of ManagedIdentity.  If we want to know what managed identities exist in our directory, we can use this information to pull a list using the Get-AzureADServicePrincipal.

We’re not done yet!  Managed Identities also come in multiple flavors, either system-assigned or user-assigned.  System-assigned managed identities are the cooler of the two in that they share the lifecycle of the resource they’re used by.  For example, a system-assigned managed identity can be created when an Azure Function is created, and that identity will be deleted once the Azure Function is deleted.  This presents a great option for mitigating the challenge of identity lifecycle management.  With Microsoft handling the lifecycle of these identities, each resource can have its own identity, making it easier to troubleshoot issues with the identity, avoid potential outages caused by modifying the identity, adhere to least privilege by giving the identity only the permissions the resource requires, and cut back on support requests by developers to info sec for the creation of identities.

Sometimes it may be desirable to share a managed identity amongst multiple Azure resources such as an application running on multiple Azure VMs.  This use case calls for the other type of managed identity, user-assigned.  These identities do not share the lifecycle of the resources using them.

Let’s take a look at the differences between a service principal object for a user-assigned vs a system-assigned managed identity.  Here I ran another Get-AzureADServicePrincipal and limited the results to serviceprincipaltype of ManagedIdentity.
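The command would look something like this:

# List only the service principals that represent managed identities
Get-AzureADServicePrincipal -All $true |
    Where-Object ServicePrincipalType -eq 'ManagedIdentity'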

ObjectId                           : a3e9d372-242e-424b-b97a-135116995d4b
ObjectType                         : ServicePrincipal
AccountEnabled                     : True
AlternativeNames                   : {isExplicit=False, /subscriptions//resourcegroups/managedidentity/providers/Microsoft.Compute/virtualMachines/systemmis}
AppId                              : b7fa9389-XXXX
AppRoleAssignmentRequired          : False
DisplayName                        : systemmis
KeyCredentials                     : {class KeyCredential {
                                       CustomKeyIdentifier: System.Byte[]
                                       EndDate: 11/11/2019 12:39:00 AM
                                       KeyId: f8e439a8-071b-45e0-9f8e-ac10b058a5fb
                                       StartDate: 8/13/2019 12:39:00 AM
                                       Type: AsymmetricX509Cert
                                       Usage: Verify
                                       Value:
                                     }
                                     }
ServicePrincipalNames              : {b7fa9389-XXXX, https://identity.azure.net/XXXX}
ServicePrincipalType               : ManagedIdentity
------------------------------------------------
ObjectId                           : ac960ac7-ca03-4ac0-a7b8-d458635b293b
ObjectType                         : ServicePrincipal
AccountEnabled                     : True
AlternativeNames                   : {isExplicit=True,
                                     /subscriptions//resourcegroups/managedidentity/providers/Microsoft.ManagedIdentity/userAssignedIdentities/testing1234}
AppId                              : fff84e09-XXXX
AppRoleAssignmentRequired          : False
AppRoles                           : {}
DisplayName                        : testing1234
KeyCredentials                     : {class KeyCredential {
                                       CustomKeyIdentifier: System.Byte[]
                                       EndDate: 11/7/2019 1:49:00 AM
                                       KeyId: b3c1808d-6778-4004-b23f-4d339ed0a91f
                                       StartDate: 8/9/2019 1:49:00 AM
                                       Type: AsymmetricX509Cert
                                       Usage: Verify
                                       Value:
                                     }
                                     }
ServicePrincipalNames              : {fff84e09-XXXX, https://identity.azure.net/XXXX}
ServicePrincipalType               : ManagedIdentity


In the above results we can see that the main difference between the user-assigned (testing1234) and system-assigned (systemmis) identities is within the AlternativeNames property.  The system-assigned identity has isExplicit set to False and another value of /subscriptions//resourcegroups/managedidentity/providers/Microsoft.Compute/virtualMachines/systemmis.  Notice the last portion of that path specifies this identity is being used by a virtual machine named systemmis.  The user-assigned identity has isExplicit set to True and another value of /subscriptions//resourcegroups/managedidentity/providers/Microsoft.ManagedIdentity/userAssignedIdentities/testing1234.  Here we can see the identity is an “explicit” managed identity and is not directly linked to an Azure resource.

This difference gives us the ability to quickly report on the number of system-assigned and user-assigned managed identities in a tenant by using the following command.

Get-AzureADServicePrincipal -All $True | Where-Object AlternativeNames -like "isExplicit=True*"

True would give us user-assigned and False would give us system-assigned.  Neat right?

Let’s summarize what we’ve learned:

  • An object in Azure Active Directory is created for each managed identity and represents its security principal
  • The type of object created is a service principal
  • There are multiple service principal types and the one used by a Managed Identity is called ManagedIdentity
  • There are two types of managed identities, user-assigned and system-assigned
  • System-assigned managed identities share the lifecycle of the resource they are associated with while user-assigned managed identities are created separately from the resource, do not share the resource lifecycle, and can be used across multiple resources
  • The object representing a user-assigned managed identity has a unique value of isExplicit=True for the AlternativeNames property while a system-assigned managed identity has that value of isExplicit=False.

That’s it for this post folks.  In the next post I’ll walk through the process of creating a managed identity for an Azure VM and will demonstrate with a bit of Python code how we can use the managed identity to access a secret stored in Azure Key Vault.

See you next post!

Deep Dive into Azure Managed Identities – Part 1

“I love the overhead of password management” said no one ever.

Password management is hard.  It’s even harder when you’re managing the credentials for non-humans, such as those used by an application.  Back in the olden days when a developer needed a way to access an enterprise database or file share, they’d put in a request with the help desk or information security to have an account (often referred to as a service account) provisioned in Windows Active Directory, an LDAP, or a SQL database.  The request would go through a business approval and some support person would create the account, set the password, and email the information to the developer.  This process came with a number of risks:

  • Risk of compromise of the account
  • Risk of abuse of the account
  • Risk of a significant outage

These risks arise due to the following gaps in the process:

  • Multiple parties knowing the password (the party who provisions the account and the developer)
  • The password for the account being communicated to the developer unencrypted such as plain text in an email
    • The password not being changed after it is initially set due to the inability or difficulty of changing the password
  • The password not being regularly rotated due to concerns over application outages
  • The password being shared with other developers and the account then being used across multiple applications without the dependency being documented

Organizations tried to mitigate the risk of compromise by performing such actions as requiring a long and complex password, delivering the password in an encrypted format such as an encrypted Microsoft Office document, instituting policy requiring the password to be changed (exceptions to this one are frequent due to outage concerns), implementing password vaulting and management such as CyberArk Enterprise Password Vault or Hashicorp Vault, and instituting behavioral monitoring solutions to check for abuse.  Password rotation and monitoring are some of the more effective mitigations but can also be extremely challenging and costly to institute at scale even with a vaulting and management solution.  Even then, there are always exceptions for systems with legacy applications which are not compatible (sadly these are often some of the more critical systems).

When the public cloud came around the credential management challenge for application accounts exploded due to the most favored traits of a public cloud which include on-demand self-service and rapid elasticity and scalability.  The challenge that was a few hundred application identities has grown quickly into thousands of applications and especially containers and serverless functions such as AWS Lambda and Azure Functions.  Beyond the volume of applications, the public cloud also changes the traditional security boundary due to its broad network access trait.  Instead of the cozy feeling multiple firewalls gave you, you now have developers using cloud services such as storage or databases which are directly administered via the cloud management plane which is exposed directly to the Internet.  It doesn’t stop here folks, you also have developers heavily using SaaS-based version control solutions to store the code which may have credentials hardcoded into it potentially publicly exposing those credentials.

Thankfully the public cloud providers have heard the cries of us security folk and have been working hard to help address the problem.  One method in use is the creation of security principals which are designed around the use of temporary credentials.  This way there are no long standing credentials to share, compromise, or abuse.  Amazon has robust use of this concept in AWS using IAM Roles.  Instead of hardcoding a set of IAM User credentials in a Lambda or an application running on an EC2 instance, a role can be created with the necessary permissions required for the application and be assumed by either the Lambda service or EC2 instance.

For this series of posts I’m going to be focusing on one of Microsoft Azure’s solutions to this problem, which is called Managed Identities.  For you folks who are more familiar with AWS, Managed Identities conceptually work the same way as IAM Roles.  A security principal is created, permissions are granted, and the identity is assumed by a resource such as an Azure Web App or an Azure VM.  There are some features that differ from IAM Roles that add to the appeal of Managed Identities, such as associating the lifecycle of the Managed Identity to the resource such that when the resource is created, the managed identity is created, and when the resource is destroyed, the identity is destroyed.

In this series of posts I’ll be demonstrating how Managed Identities are created, how they are used, and how they differ (sometimes for the better and sometimes not) from AWS IAM Roles.  Hope you enjoy the series and expect the next entry early next week.

See you soon fellow geek!