Network Security Perimeters – NSPs for Troubleshooting

This is part of my series on Network Security Perimeters:

  1. Network Security Perimeters – The Problem They Solve
  2. Network Security Perimeters – NSP Components
  3. Network Security Perimeters – NSPs in Action – Key Vault Example
  4. Network Security Perimeters – NSPs in Action – AI Workload Example
  5. Network Security Perimeters – NSPs for Troubleshooting

Hello folks! The past 3 months have been completely INSANE with customer work, building demos, experimentation and learning. It’s been awesome, but goddamn has it been grueling. I’m back to finally close out my series on Network Security Perimeters. In my past posts I’ve covered NSPs at a conceptual level, the components that make them up, and two separate examples. Today I’m going to cover my favorite use case for NSPs: using them for troubleshooting.

The Setup

At a very high level, most enterprises have an architecture similar to what you see below. In this architecture, network boundaries are set up around endpoints and services. These boundaries are erected to separate hardware and services based on the data stored in that environment, the security controls enforced in that environment, whether that environment has devices connected to the Internet, what type of trust level the humans and non-humans running in those environments have, and other similar variables. For the purposes of this post, we’re gonna keep it simple and stick to LAN (trusted) and DMZ (untrusted). The DMZ is where devices connected to the Internet live and the LAN is where devices restricted to the private network live.

Very basic network architecture

Environments like these typically restrict access to the Internet through a firewall or an appliance/service that performs a forward web proxy function. This allows enterprises to control what traffic leaves their private network through traditional 5-tuple means, use deep packet inspection to inspect and control traffic at layer 7, and control access to specific websites based on the user or endpoint identity. Within the LAN there is typically a private DNS service that provides name resolution for internal domains. The DMZ either has separate dedicated DNS infrastructure for DNS caching and limited conditional forwarding or it utilizes a third-party hosted DNS service like Cloudflare. The key takeaway from the DNS resolution piece is that the machines within the LAN and DMZ use different DNS services, and typically the DMZ is limited or completely incapable of resolving domains hosted on the internal DNS servers.

You’re likely thinking, “Cool Matt, thanks for the cloud 101. WTF is your point?” The place where this type of setup really bites customers is when consuming Private Endpoints. I can’t tell you how many times over the years I’ve been asked to help a customer struggling with Private Endpoints only to find out the problem is DNS related to this infrastructure. The solution is typically a simple proxy bypass, but getting to that resolution often takes hours of troubleshooting. I’m going to show you how NSPs can make troubleshooting this problem way easier.

The Problem

Before we dive into the NSP piece, let’s look at how the problem described above manifests. Take an organization that integrates with Azure using a high-level architecture like the one pictured below.

Same architecture as above but now with Azure connectivity

Here we have an organization that has connectivity to Azure configured through an ExpressRoute connection with an S2S VPN as a fallback (I hate this fallback method, but that’s a blog for another day). In an ideal world, VM1 hits Service PaaS 1 on its Private Endpoint with the traffic being sent through the VPN or ExpressRoute connection. To do that, the DNS server in the LAN must resolve the service’s FQDN to the private IP address of the Private Endpoint deployed for Service PaaS 1 (check out my series on DNS if you’re unfamiliar with how that works). Let’s assume the enterprise has properly configured DNS so VM1 resolves to the correct IP address.

Now it’s Monday morning and you’re on the team that manages Azure. You get a call from a user complaining they can’t connect to the Private Endpoint for their Azure Storage account for blob access, complete with the typical “Azure is broken!”. You hop on a call and ask the user to do some nslookups from their endpoint, which return the correct private IP address. You even have the user run a curl against the FQDN and still get back the correct IP address.

This is typically the scenario that at some point gets escalated beyond support, and someone from the account team says, “Hey Matt, can you look at this?” I’ll hop on a call, get the lowdown on what the customer is doing, double-check what the customer checked, maybe take a glance at routing and Network Security Groups, and then jump to what is almost always the problem in these scenarios. The next question out of my mouth is, “Do you have a proxy?”

So why does this matter? This question is important whenever we consider HTTP/HTTPS traffic because, as I covered earlier, that traffic is almost always sent through an appliance or service performing the architectural function of a forward web proxy before it egresses to the Internet. This could be a service hosted within the organization’s boundary or it could be a third-party service like Zscaler. Where it’s hosted isn’t super important (though it can play a role in DNS); the key things to understand are whether the enterprise is using one and whether the application being used to access the Azure PaaS service is going through it.

When HTTP/HTTPS traffic is configured to be proxied, the connection from the endpoint is made to the proxy service. The proxy service examines the endpoint’s request, executes any controls configured, and initiates its own connection to the destination service. This last piece is what we care about, because to make that connection the proxy service needs to do its own DNS query, which means it will use the DNS servers configured on the proxy. This is where things usually break because, as I covered earlier, that DNS service typically can’t resolve internal domains, which include the privatelink domains used by Azure PaaS services.

DNS resolution when using proxy

When you did curl (without specifying proxy settings) or nslookup, the machine was hitting the internal DNS service, which does have the ability to resolve privatelink domains, giving you the false sense that everything looks good from a DNS perspective. The resolution to this problem is to work with the proxy team to put in the appropriate proxy bypass so the endpoint connects directly to the Azure PaaS service (thus using the endpoint’s own DNS configuration).
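
If you want to see the discrepancy for yourself, a quick test like the sketch below compares what the endpoint resolves locally with what happens when the same request is forced through the proxy. The FQDN and proxy address are hypothetical placeholders.

```python
# A minimal sketch to reproduce the symptom. The FQDN and proxy address
# are hypothetical placeholders -- substitute your own values.
import socket

import requests

fqdn = "mystorage.blob.core.windows.net"  # hypothetical storage account

# Local resolution uses the endpoint's configured (internal) DNS and
# should return the Private Endpoint's private IP.
print("local resolution:", socket.gethostbyname(fqdn))

# A direct request also resolves the name locally, so it takes the
# private path and should work.
direct = requests.get(f"https://{fqdn}/?comp=list", timeout=10)
print("direct:", direct.status_code)

# A proxied request hands the FQDN to the proxy, which does its own DNS
# lookup against *its* DNS servers -- the ones that typically can't
# resolve the privatelink zone. This is the call that breaks or lands on
# the public endpoint.
proxies = {"https": "http://proxy.contoso.com:8080"}  # hypothetical proxy
proxied = requests.get(f"https://{fqdn}/?comp=list", proxies=proxies, timeout=10)
print("proxied:", proxied.status_code)
```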

All of this sounds simple, right? The reality is getting to the point of identifying the proxy as the issue tends to take hours, if not days, and involves tons of people from a wide array of teams across the enterprise. That means a lot of money spent diagnosing and resolving the issue.

What if there was an easier way? In comes NSPs.

NSPs to the rescue!

If you were a good Azure citizen, you would have wrapped this service in an NSP (assuming that service has been onboarded by Microsoft to NSPs) and turned on logging as I covered in my prior posts. If you had done that, you could have leveraged the NSP logs (the NSPAccessLogs table) to identify incoming traffic being blocked by the NSP. Below is an example of what the log entry looks like.

NSP Log Sample

In the above log entry I get detail as to the operation the user attempted to perform (in this instance, listing the keys in a Key Vault), the effect of the NSP (traffic is denied), and the category of traffic (Public). If we go back to the troubleshooting steps from earlier, I may have been able to identify this problem WAY earlier and with far fewer people involved, even if the Azure resource platform logs obfuscated the full IP address or didn’t list it at all. I can’t stress enough the value this presents, especially having a standardized log format. The number of hours my customers could have saved by enabling and using these logs makes me sad.
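
If you’d rather script the hunt than click through the portal, here’s a rough sketch of pulling recent public denies out of the NSPAccessLogs table with the azure-monitor-query Python library. The workspace ID is a placeholder, and the column names other than TrafficType are assumptions based on the entries shown above, so verify them against your table’s schema.

```python
# Sketch: pull recent denied public traffic from NSPAccessLogs.
# Column names other than TrafficType are assumptions -- verify against
# your workspace's schema.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

query = """
NSPAccessLogs
| where TrafficType == 'Public'
| where Category contains 'Denied'
| project TimeGenerated, OperationName, SourceIpAddress, Category
| order by TimeGenerated desc
"""

result = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(days=1),
)
for row in result.tables[0].rows:
    print(row)
```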

Even more uses!

Beyond troubleshooting the proxy problem, there are many scenarios where this comes in super handy. Another example is PaaS-to-PaaS traffic. Oftentimes the documentation around when one PaaS talks to another is not clear. It may not be obvious that one PaaS is trying to communicate with another over the Microsoft public backbone. This is another area where NSPs can help because this traffic can also be logged and used for troubleshooting.

Troubleshooting PaaS to PaaS

We’re not done yet! Some Azure compute services can integrate into a customer virtual network using a combination of Private Endpoints for inbound traffic and regional VNet integration for outbound access (traffic initiated by the PaaS and destined for customer endpoints or endpoints in the public IP space). Good examples are Azure App Services and the new API Management v2 SKUs. Oftentimes I’ll work with customers who think they enabled this correctly, but they only enabled Private Endpoints and missed configuring regional VNet integration (causing outbound traffic to leave the Microsoft public backbone and hit the PaaS over public IPs), or they misconfigured DNS for the virtual network the PaaS service has been integrated with. NSP logs can help here as well.

NSPs helping to diagnose regional VNet integration issues

Summing it up

Here are the key takeaways from this post:

  1. Enable NSPs wherever they are supported. If you’re not comfortable enforcing them, at least turn them on for the logging.
  2. Don’t forget NSPs capture both inbound AND outbound traffic. You’d be amazed how many Azure PaaS services (service-based) can make outbound network calls that you probably aren’t tracking or controlling.
  3. Like platform logs, NSP logs are not simply a security tool. Don’t lock them away from operations behind a SIEM. Make them available to both security and operations so everyone can benefit.

NSPs are more than just a tool to block and gain visibility into inbound and outbound traffic for security purposes; they’re also an important tool in your toolbox to help with day-to-day operational headaches. If you’re not using NSPs for supported services today, you should be. There is absolutely zero reason not to, and your late-night troubleshooting sessions will only consume 1 Mountain Dew vs 10!

That’s it for me. Off to snowblow!

Network Security Perimeters – NSPs in Action – Key Vault Example

This is part of my series on Network Security Perimeters:

  1. Network Security Perimeters – The Problem They Solve
  2. Network Security Perimeters – NSP Components
  3. Network Security Perimeters – NSPs in Action – Key Vault Example
  4. Network Security Perimeters – NSPs in Action – AI Workload Example
  5. Network Security Perimeters – NSPs for Troubleshooting

Welcome back to my third post in my NSP (Network Security Perimeter) series. In this post I’m going to start covering some practical use cases for NSPs and demonstrating how they work. I was going to group these use cases in a single post, but it would have been insanely long (and I’m lazy). Instead, I’ll be covering one per post. These use cases are likely scenarios you’ve run into and do a good job demonstrating the actual functionality.

A Quick Review

In my first post I broke PaaS services into compute-based PaaS and service-based PaaS. NSPs are focused on solving problems for service-based PaaS. These problems include a lack of outbound network controls to mitigate data exfiltration, inconsistent offerings for inbound network controls across PaaS services, scalability issues with inbound IP whitelisting, difficulty configuring and managing these controls at scale, and inconsistent log quality across services even for simple fields you’d expect in a log, like the calling IP address.

Compute-based PaaS vs Service-based PaaS

My second post walked through the components that make up a Network Security Perimeter and their relationships to each other. I walked through each of the key components including the Network Security Perimeter, profiles, access policies, and resource associations. If you haven’t read that post, you need to read it before you tackle this one. My focus in this post will be where those resources are used and will assume you grasp their function and relationships.

Network Security Perimeter components and their relationships

With that refresher out of the way, let’s get to the good stuff.

Use Case 1: Securing Azure Key Vaults

Azure Key Vault is Microsoft’s native PaaS offering for secure key, secret, and certificate storage. Secrets sometimes need to be accessed by Microsoft SaaS (software-as-a-service) offerings and compute-based PaaS that do not support virtual network injection or a managed virtual network concept, such as some use cases for Power BI. Vaults are also accessed by 3rd-party products running outside of Azure or in another CSP (cloud service provider). There are also use cases where a customer may be new to Azure and doesn’t yet have the necessary hybrid connectivity and Private DNS configuration to support the usage of Private Endpoints. In these scenarios, Private Endpoints are not an option and the traffic needs to come in through the public endpoint of the vault. Here is our use case for NSPs.

Services accessing Key Vault for secrets

Historically, folks would try to solve this with IP whitelisting on the Key Vault service firewall. As I covered in my first post, this is a localized configuration to the resource and can be mucked with by the resource owner unless permissions are properly scoped or an Azure Policy is used to enforce a specific configuration. This makes it difficult to put network control in the hands of security while leaving the rest of the configuration of the resource to the resource owner. Another issue that sometimes pops up with this pattern is hitting the maximum of 400 prefixes for the service firewall rules.

NSPs provide us with a few advantages here:

  1. Network security can be controlled by the NSP and that NSP can be controlled by the security team while leaving the rest of the resource configuration to the resource owner.
  2. You can allow more than 400 inbound IP rules (up to 500 in my testing). Sometimes a few extra IP prefixes beyond what the service firewall allowed is all you need.

In this type of use case, we could do something like the below. Here we have a single Network Security Perimeter for our two categories of Key Vaults for a product line. Category 1 Key Vaults need to be accessed by 3rd-party SaaS and applications that do not have the necessary network path to use Private Endpoints. Category 2 Key Vaults are used by internally-facing applications within Azure, and those Key Vaults need to be restricted to a Private Endpoint.

Public Key Vault

For this scenario we can build something like the image above, where we have a single NSP with two profiles. One profile will be used by our “public” Key Vaults. This profile will be associated with an access rule that allows a set of trusted IP addresses from a 3rd-party SaaS solution. The other profile will have no associated access rules, thus blocking all access to the Key Vault over the public endpoint. Both resource associations will be set to enforced to ensure the NSP rules override the local service firewall rules.

Let’s take a look at this in action.

For this scenario, I have an NSP design set up exactly as above. The access rule applied to my public vault contains the IP address of my machine, as seen below:

Profile access rule for publicly-facing Key Vault
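
If you prefer to script this rather than click through the portal, creating that access rule looks roughly like the sketch below as an ARM REST call. The resource path, property names, and api-version are assumptions from the preview documentation I was working from, so double-check them against the current Microsoft.Network reference.

```python
# Sketch: create an inbound IP access rule on an NSP profile via the ARM
# REST API. Path shape and api-version are assumptions -- verify first.
import requests
from azure.identity import DefaultAzureCredential

sub, rg, nsp = "<subscription-id>", "<resource-group>", "<nsp-name>"
token = DefaultAzureCredential().get_token(
    "https://management.azure.com/.default"
).token

url = (
    f"https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}"
    f"/providers/Microsoft.Network/networkSecurityPerimeters/{nsp}"
    "/profiles/public-profile/accessRules/allow-trusted-ips"
    "?api-version=2023-08-01-preview"  # assumption; may differ post-GA
)
body = {
    "properties": {
        "direction": "Inbound",
        "addressPrefixes": ["203.0.113.10/32"],  # placeholder trusted IP
    }
}
resp = requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
print(resp.status_code, resp.text)
```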

At this point my vault isn’t associated with the NSP yet and it has been configured to allow all public network access. Attempts at accessing the vault from a random public IP succeed, as would be expected.

Successful retrieval of secret from untrusted IP address prior to NSP association

Next I associate the vault to the NSP and set it to enforced mode. By default it will be configured in Transition mode (see my second post for details on this), which means it will log whether traffic would be allowed or denied but won’t block anything. Since I want the NSP to override the local service firewall, I’m going to set it to enforced.
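
The association itself can be sketched the same way against the resourceAssociations endpoint. Again, the property names, accessMode values, and api-version are assumptions to verify against the current docs.

```python
# Sketch: associate the vault with an NSP profile in enforced mode.
# Property names, accessMode values, and api-version are assumptions.
import requests
from azure.identity import DefaultAzureCredential

sub, rg, nsp = "<subscription-id>", "<resource-group>", "<nsp-name>"
token = DefaultAzureCredential().get_token(
    "https://management.azure.com/.default"
).token

url = (
    f"https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}"
    f"/providers/Microsoft.Network/networkSecurityPerimeters/{nsp}"
    "/resourceAssociations/kv-public-assoc"
    "?api-version=2023-08-01-preview"  # assumption
)
body = {
    "properties": {
        "privateLinkResource": {"id": "<key-vault-resource-id>"},
        "profile": {"id": "<nsp-profile-resource-id>"},
        # 'Enforced' makes the NSP override the service firewall; the
        # default is the Transition/learning mode described above.
        "accessMode": "Enforced",
    }
}
resp = requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
print(resp.status_code, resp.text)
```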

When I try to pull the secret from the vault using a machine with the trusted public IP listed in the access rule associated to the profile, I’m able to get the secret.

Successful call from trusted IP listed in NSP profile access rule
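
For reference, the call I’m making is nothing fancy; a minimal version with the Python SDK looks like this (vault and secret names are hypothetical):

```python
# Minimal sketch of the secret retrieval; vault and secret names are
# hypothetical.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://mykeyvault.vault.azure.net",  # hypothetical vault
    credential=DefaultAzureCredential(),
)
print(client.get_secret("demo-secret").value)  # hypothetical secret name
```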

If I attempt to access the secret from an untrusted IP, even with the service firewall on the vault configured to allow all public network access, I’m rejected with the message below.

Denied call from an untrusted IP due to NSP

Review of the logs (NSPAccessLogs table) shows that the successful call was allowed by the access rule I put in place and the denied call triggered the DefaultDenyAll rule.

Now what about my private vault? Let’s take a look at that one next.

Private Key Vault

For this scenario I’m going to use the second profile in the NSP. This profile doesn’t have any associated access rules, which effectively blocks all traffic to the public endpoint originating from outside the NSP. My goal is to make this vault accessible only from a Private Endpoint.

First, I associate the resource to the NSP profile and configure it in enforced mode.

Private Key Vault associated to profile in enforced mode

This is another vault where I’ve configured the service firewall to allow all public network access. Attempting to access the resource throws the message below indicating the NSP is blocking access.

Denied call from a public IP when NSP denies all public access

I’ve created a Private Endpoint for this vault as well. As I covered earlier in this series, NSPs are focused on public access and do not limit Private Endpoint access, so that means the NSP doesn’t log access from a Private Endpoint, right? Wrong! A neat feature of NSP-wrapped resources is that the NSP will allow the traffic and log it, as seen below.

NSP log entry showing access through Private Endpoint

In the above log entry you’ll see the traffic is allowed through the DefaultAllowAll rule and the TrafficType is set to Private because it’s coming in through a Private Endpoint. Interestingly enough, you also get the operation that was being performed in the request. I could have sworn these logs used to include the specific Private Endpoint resource ID the traffic ingressed from, but perhaps I imagined that or it was removed when the service graduated to GA (generally available).

Summing it up

In this post I gave an overview of a simple use case that many folks may have today. You could easily sub out Key Vault for any of the other supported PaaS services that have a similar public endpoint access model and the setup will be the same. Here are some key takeaways:

  1. NSPs allow you to enforce public access network controls regardless how the resource owner configures the service firewall on the resource.
  2. Profiles seem to support a maximum of 500 IP prefixes for inbound and 500 for outbound. This is more than the 400 available in the service firewall. This is based on my testing, and I have no idea if it’s a soft or hard limit.
  3. NSPs provide a standardized log format for network access. No more looking at 30 different log schemas across different resources, half of which don’t contain network information or where someone drank too much tequila and decided to mask an octet of the IP. Additionally, they will log network access attempts through Private Endpoints.

In my next post I’ll cover a use case where we have two resources in the same NSP that communicate with each other.

See you next post!

Logging in Azure OpenAI Service

This is part of my series on GenAI Services in Azure:

  1. Azure OpenAI Service – Infra and Security Stuff
  2. Azure OpenAI Service – Authentication
  3. Azure OpenAI Service – Authorization
  4. Azure OpenAI Service – Logging
  5. Azure OpenAI Service – Azure API Management and Entra ID
  6. Azure OpenAI Service – Granular Chargebacks
  7. Azure OpenAI Service – Load Balancing
  8. Azure OpenAI Service – Blocking API Key Access
  9. Azure OpenAI Service – Securing Azure OpenAI Studio
  10. Azure OpenAI Service – Challenge of Logging Streaming ChatCompletions
  11. Azure OpenAI Service – How To Get Insights By Collecting Logging Data
  12. Azure OpenAI Service – How To Handle Rate Limiting
  13. Azure OpenAI Service – Tracking Token Usage with APIM
  14. Azure AI Studio – Chat Playground and APIM
  15. Azure OpenAI Service – Streaming ChatCompletions and Token Consumption Tracking
  16. Azure OpenAI Service – Load Testing

Updates:

  • 1/18/2024 – Logs now include Entra ID security principal objectid property in RequestResponse log

Welcome back fellow geeks.

Over the past few weeks I’ve done a series of posts on the Azure OpenAI Service covering some of the security features of the service. In my first post I gave an overview of what security controls Microsoft makes available for customers to configure to secure their instance of the service. In the second and third posts I did deep dives into the authentication and authorization capabilities of the service. Tonight I’m going to cover the logging capabilities of the service.

Let’s jump right in!

The Azure OpenAI Service emits both logs and metrics. For the purposes of this post I’ll be covering logs. I’ll cover the metrics and monitoring of the service in another post if there is community interest. Logs emitted by the service have been integrated with the diagnostic settings feature. For those unfamiliar with the diagnostic settings feature, it provides a very simple way to deliver logs and metrics emitted by an Azure service to an Azure Storage Account, a Log Analytics Workspace, or an Event Hub (a common use case for passing them on to a SIEM like Splunk). In the image below, you can see I’m sending all of the logs and metrics emitted from the service to a Log Analytics Workspace.

Diagnostic Settings

In the image above you can see that the Azure OpenAI Service emits three types of logs: audit logs, request and response logs, and trace logs. As of the date of this blog, all of these logs are sent to the AzureDiagnostics table if you opt to send them to a Log Analytics Workspace, so dust off your Kusto skills.
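
If you’re sending these to a Log Analytics Workspace, something like the sketch below (using the azure-monitor-query library) gives a quick breakdown of what’s landing in each log type. The ResourceProvider filter value is an assumption based on Azure OpenAI sitting under Cognitive Services, so verify it against your own entries.

```python
# Sketch: break down Azure OpenAI entries in the shared AzureDiagnostics
# table by log type. The ResourceProvider value is an assumption.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

query = """
AzureDiagnostics
| where ResourceProvider == 'MICROSOFT.COGNITIVESERVICES'
| summarize count() by Category, OperationName
"""

result = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(days=7),
)
for row in result.tables[0].rows:
    print(row)
```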

Let’s first take a look at the audit logs, because I know that’s where your security-focused eyes darted to. I want to remind you this is a very new service and lots of improvements are coming. Yeah, I did that. I pulled a sales dude move. Seriously though, the audit logging is very limited and likely not what you’d hope for as of the date of this blog. The only events that seem to be logged to the audit log for the service are when a ListKeys operation is performed. This operation means a security principal accessed the API keys. The API keys are used to authenticate to the data plane of the service and do not allow for granular authorization at that plane. Check out my last two posts on authentication and authorization if that sentence doesn’t make sense. Unfortunately, the identity that accessed the API key isn’t listed in the log entry, which makes it pretty useless in its current state. Below is a sample entry.

Azure OpenAI Service Audit Log Entry Example

Making this even more useless, this operation is also logged in the Azure Activity Log. The log entry within the Activity Log does include the security principal that performed the action so you’ll want to watch for that activity there. I imagine over time the audit log will be improved to capture more operations and associate those operations to a security principal.

Activity Log Entry Showing List Keys Operation

Next I’m going to cover the request and response logs. This log set is really interesting because your expectation is likely the same as mine was: that these would include details around prompts sent to the models and information on the response, such as the number of tokens consumed. While it does capture operations around requests for things such as completions or summarizations, it also captures a ton of other events that would likely be better suited for the audit log. Additionally, the data it captures about these actions is extremely limited.

Let’s take a look at a log entry where I requested the model complete a sentence for me. In my code I’m calling the API using an Azure AD service principal, NOT an API key, with the shattered hope that the log entry would capture the service principal I’m using.
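
For context, here’s roughly how that call authenticates, sketched with the current openai Python library rather than the older SDK version I was using at the time. The endpoint, deployment name, and api-version are placeholders/assumptions.

```python
# Sketch: authenticate to Azure OpenAI with a service principal instead of
# an API key, using the current openai library. Endpoint, deployment name,
# and api-version are placeholders/assumptions.
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),  # picks up the service principal's env vars
    "https://cognitiveservices.azure.com/.default",
)
client = AzureOpenAI(
    azure_endpoint="https://myaoai.openai.azure.com",  # placeholder
    azure_ad_token_provider=token_provider,
    api_version="2024-02-01",  # assumption
)
resp = client.chat.completions.create(
    model="my-deployment",  # deployment name, placeholder
    messages=[{"role": "user", "content": "Finish this sentence for me."}],
)
print(resp.choices[0].message.content)
```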

1/18/2024 – The Entra ID object id is now included in the RequestResponse log entry! Hooray!

Prompt and Response Log Entry

In the above log entry we don’t get any information to correlate the operation back to an entity, even when using Azure AD authentication. All we can see is that the completion action occurred at a specific time and resulted in a success status code. You’ll also see there is a CallerIPAddress field. This includes the first three octets of the IP address that called the service, but not the last octet. It’s kinda weird that it’s being masked like this, but I guess that’s better than nothing? (Not really, but hey, it’s a new service.)

Before you ask, no, the content of the prompts and responses are not logged in any of these logs.

There is one additional field of relevance I couldn’t fit within the above screenshot, and that’s properties_s. The only truly useful information in it is the total response time the service took to return an answer to the user. I hoped this would have had some information around tokens used, but sadly it does not.

properties_s field of a Prompt and Response Log Entry

Besides prompts and responses, this log seems to capture other data plane operations. This includes everything from activities users have performed around uploading files to the service to train fine-tuned models, to activities around fine-tuned models themselves (listing, creation, deletion), creation of embeddings, and management of models deployed to the service. Most of these operations should be in the audit log in my opinion. I’m not sure why they’re included in this log, but they are. And no, none of these operations include details as to who performed them beyond the first three octets of the IP address.

Lastly, there is the trace log. I have no idea what’s logged in there because I have yet to generate any trace log data. If you know what gets logged in there, let me know in the comments.

So yes folks, there are some serious gaps in the logging for the service today. However, the service is new and the underlying technology is still pretty new as well, so we can’t expect perfection out of the gates. My advice to customers has been to build the logging they need into whatever application is fronting user access and to lock the service down from an authorization perspective so that the only access to the service comes through that application.

My peer Jake Wang has come up with a creative solution to address some of the logging gaps in the service by placing an API Management instance in front of it. With this design, anything communicating with the Azure OpenAI Service instance has to go through APIM. Within APIM you can do whatever fancy logging you want, toss in additional throttling for specific user requests, and do lots of other cool stuff. It’s a great workaround while the Product Group improves the native logging. Check out my recent post for some of the gotchas of APIM logging for the Azure OpenAI Service.
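
To give a feel for the pattern, pointing a client at APIM instead of the service directly looks roughly like the sketch below. The gateway URL and deployment name are placeholders, and I’m assuming APIM’s default subscription key header; your policy may expect something different.

```python
# Sketch: call the Azure OpenAI Service through an APIM front end. Gateway
# URL and deployment are placeholders; assumes APIM's default subscription
# key header, which your policy may override.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://myapim.azure-api.net/openai-demo",  # placeholder
    api_key="not-used",  # APIM does the real auth; the client still wants a value
    api_version="2024-02-01",  # assumption
    default_headers={"Ocp-Apim-Subscription-Key": "<apim-subscription-key>"},
)
resp = client.chat.completions.create(
    model="my-deployment",  # deployment behind APIM, placeholder
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```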

If you have a different API gateway like MuleSoft, you could use this same pattern with it instead of APIM.

Well folks that wraps things up. I hope you got some value out of this post and I’d encourage you to make your voices heard by submitting feedback to the product group on how you’d like to see the logging improved for the service.

Thanks for reading!

AWS Managed Microsoft AD Deep Dive Part 1 – Overview

Welcome back my fellow geeks!

Earlier this year I did a deep dive into Microsoft’s managed Active Directory service, Microsoft Azure Active Directory Domain Services (AAD DS). What I found was a service in its infancy showing some promise, but very far from being enterprise-ready. I thought it would be fun to look at Amazon’s (which I’ll refer to as Amazon Web Services (AWS) for the rest of the entries in this series) take on a managed Microsoft Active Directory (or, as Microsoft refers to it these days, Windows Active Directory).

Unless your organization popped up in the last year or two and went the whole serverless route, you are still managing operating systems that require centralized authentication, authorization, and configuration management. You also more than likely have a ton of legacy/classic on-premises applications that require legacy protocols such as Kerberos and LDAP. Your organization is likely using Windows Active Directory (Windows AD) to provide these capabilities along with Windows AD’s basic domain name system (DNS) service and centralized identity data store.

It’s unrealistic to assume you’re going to shed all those legacy applications prior to beginning your journey into the public cloud. I mean heck, shedding the ownership of data centers alone can be a huge cost driver. Organizations are then faced with the challenge of how to do Windows AD in the public cloud. Is it best to extend an existing on-premises forest into the public cloud? What about creating a resource forest with a trust? Or maybe even a completely new forest with no trust? Each of these options has positives and negatives that need to be evaluated against organizational requirements across the business, technical, and legal arenas.

Whatever choice you make, it means additional infrastructure in the form of more domain controllers. Anyone who has managed Windows AD in an enterprise knows how much overhead managing domain controllers can introduce. Let me clarify: by managing Windows AD, I do not mean opening Active Directory Users and Computers (ADUC) and creating user accounts and groups. I’m talking about examining Performance Monitor AD counters and LDAP debug logs to properly size domain controllers, configuring security controls to comply with PCI and HIPAA requirements or align with DISA STIGs, managing updates and patches, and troubleshooting the challenges those bring, which requires extensive knowledge of how Active Directory works. In this day and age, IT staff need to be less focused on overhead such as this and more focused on working closely with business units to drive and execute upon business strategy. That, folks, is where managed services shine.

AWS offers an extensive catalog of managed services and Windows AD is no exception. Included within the AWS Directory Service offerings is a powerful offering named Amazon Web Services Directory Service for Microsoft Active Directory, or more succinctly, AWS Managed Microsoft AD. It provides all the wonderful capabilities of Windows AD without all of the operational overhead. An interesting fact is that the service has been around since December 2015, in comparison to Microsoft’s AAD DS, which only went into public preview in Q3 2017. This head start has done AWS a lot of favors and, in this engineer’s opinion, has established AWS Managed Microsoft AD as the superior managed Windows AD service over Microsoft’s AAD DS. We’ll see why as the series progresses.

Over the course of this series I’ll be performing a similar analysis to the one I did in my series on Microsoft AAD DS. I’ll also be examining the many additional capabilities AWS Managed Microsoft AD provides and demoing some of them in action. My goal is that by the end of this series you understand the technical limitations that come with the significant business benefits of leveraging a managed service.

See you next post!