Logging in Azure OpenAI Service

Welcome back fellow geeks.

Over the past few weeks I’ve done a series of posts on the Azure OpenAI Service covering some of the security features of the service. In my first post I gave an overview of what security controls Microsoft makes available for customers to configure to secure their instance of the service. In the second and third posts I did deep dives into the authentication and authorization capabilities of the service. Tonight I’m going to cover the logging capabilities of the service.

Let’s jump right in!

The Azure OpenAI Service emits both logs and metrics. For the purposes of this post I’ll be covering logs. I’ll cover the metrics and monitoring of the service in another post if there is a community interest. Logs emitted by the service have been integrated with the diagnostic setting feature. For those unfamiliar with the diagnostic settings feature, it provides a very simple way to deliver logs and metrics emitted by an Azure service to an Azure Storage Account, Log Analytics Workspace, or to an Event Hub (common use case for passing on to a SIEM like Splunk). In the image below, you can see I’m sending all of the logs and metrics emitted from the service to a Log Analytics Workspace.

Diagnostic Settings

In the image above you can see that the Azure OpenAI Service emits three types of logs which include audit logs, request and response logs, and trace logs. As of the date of this blog all of these logs are sent to the AzureDiagnostics table if you opt to send this logs to a Log Analytics Workspace, so dust off your Kusto skills.

Let’s first take a look at the audit logs, because I know that’s where your security focused eyes darted to. I want to remind you this is a very new service and lots of improvements are coming. Yeah, I did that. I pulled a sales dude move. Seriously though, the audit logging is very limited and likely not what you’d hope for as of the date of this blog. The only events that seem to be logged to the Audit Log for the service are when a ListKeys operation is performed. The operation means a security principal accessed the API Keys. The API keys are used to authenticate to the data plane of the service and do not allow for granular authorization at that plane. Check out my last two posts on authentication and authorization if that sentence doesn’t make sense. Unfortunately, the identity that accessed the API key isn’t listed in the log entry which makes it pretty useless in its current state. Below is a sample entry.

Azure OpenAI Service Audit Log Entry Example

Making this even more useless, this operation is also logged in the Azure Activity Log. The log entry within the Activity Log does include the security principal that performed the action so you’ll want to watch for that activity there. I imagine over time the audit log will be improved to capture more operations and associate those operations to a security principal.

Activity Log Entry Showing List Keys Operation

Next I’m going to cover the Request and Response Logs. This log set is really interesting because likely your expectation is the same as mine was that these would include details around prompts sent to the models and information on the response such as the number of tokens consumed. While it does operations around requests for things such a completions or summarizations, it also captures a ton of other events that would likely be more suited for the audit log. Additionally, the data it captures about these actions is extremely limited.

Let’s take a look at a log entry where I requested the model complete a sentence for me. In my code I’m calling the API using an Azure AD service principal NOT an API key with the shattered hope that the log entry would capture the service principal I’m using.

Prompt and Response Log Entry

In the above log entry we don’t get any information to correlate the operation back an entity even when using Azure AD authentication. All we can see is the completion action occurred at a specific time and resulted in a success status code. You’ll also see there is a CallerIPAddress field. This will include the first three octets of the IP address called the service but not the last octet. Kinda weird it’s being masked like this, but I guess that’s better than nothing? (Not really, but hey it’s a new service).

Before you ask, no, the content of the prompts and responses are not logged in any of these logs.

There is one additional field of relevance I couldn’t fit within the above screenshot and that’s properties_s. The only real useful information on this is total response time the service took to return an answer to the user. I hoped this would have had some information around tokens used, but sadly it does not.

properties_s field of a Prompt and Response Log Entry

Besides prompts and responses, this log seems to capture other data plane operations. This includes everything from activities users have performed around uploading files to the service to train fine-tuned models, activities around fine-tuned models (listing, creation, deletion), creation of embeddings, and management of models deployed to the service. Most of these operations should be in the Audit Log in my opinion. I’m not sure why they’re included in this log, but they are. No, none of these operations include details as to who performed these actions beyond the first three octets of the IP address.

Lastly, there is the trace log. I have no idea what’s logged in there because I have yet generate any trace log data. If you know what gets logged in there, let me know in the comments.

So yes folks, there are some serious gaps in the logging for the service today. However, the service is new and the underlining technology is still pretty new as well so we can’t expect perfection out of the gates. My advice to customers has been to build the logging they need into whatever application is fronting the user access and to lock the service down from an authorization perspective so that the only access to the service comes through that application.

My peer Jake Wang has come up with a creative solution to address some of the logging gaps in the service by placing an API Management instance in front of it. With this design anything communicating with the Azure OpenAI Service instance has to go through APIM. Within APIM you can do whatever fancy logging you want to do, toss in some additional throttling to specific user requests, and lots of other cool stuff. It’s a great workaround while the Product Group improves the native logging. If you have a different API Gateway like Mulesoft you could use this same pattern with that instead of APIM.

Well folks that wraps things up. I hope you got some value out of this post and I’d encourage you to make your voices heard by submitting feedback to the product group on how you’d like to see the logging improved for the service.

Thanks for reading!

Capturing Azure Management Group Activity Logs Using Azure Automation – Part 2

Welcome back fellow geeks!

This post will be the second post in a series covering how to use Azure Automation to capture Azure Management Group Activity Logs.  In the first post I walked through what management groups are and the problems that they solve.  The key takeaway of that post is that management groups have their own Activity Logs and (at this time) they’re only accessible from within the Portal and over the Azure REST API.  Given that management groups are where we’re applying our Azure Policy for governance and compliance and our access controls via Azure RBAC, the Activity Logs are pretty critical.  So what is a geek to do?

In this post I’ll cover a solution I put together to solve the problem.  It uses an Azure Automation PowerShell Runbook to iterate through the management groups within an Azure Active Directory tenant, write the logs to Azure Storage, and optionally deliver the logs to Azure Monitor or Azure EventHubs.  The architecture is pictured below.

Capture.PNG

If you’re not familiar with Azure Automation it’s a service that provides a number of key capabilities within Azure such as configuration management, update management, and process automation.  If you’re coming from AWS, I’d compare it to a service somewhat similar to AWS Systems Manager.  For the purposes of this series of posts I’m going to focus on the process automation capability of the service delivered through Runbooks.  I’m not going to go too in-depth into Azure Automation, but I’ll provide a brief overview of the service features and tweaks relevant to the solution.

Runbooks are modules of code that can be strung together to perform a series of tasks such as performing maintenance on a collection of VMs.  The modules can be authored using either PowerShell or Python.  At this time only Python 2 is supported which makes me a sad panda.  Given that Python 2 enters end of life in two months, I’d recommend doing anything Python related in Azure Functions.  I could devote an entire blog post complaining about the lack of Python 3 in the year 2019, but I’ll spare you.  You’re going to want to author your Runbooks in PowerShell until/if Python 3 is supported is supported in the future.

The Azure Automation account acts as a logical container for the Runbooks created within it.  An Azure Automation Account can be provided with a RunAs account, which is simply a service principal in Azure Active Directory.   The service principal is configured with a certificate credential which is used by the Automation Account to authenticate to Azure AD and access Azure resources within the tenant.  Any Runbooks you create within the Automation account can assume the identity to execute tasks across your Azure resources.

You can automatically provision the RunAs account when the Azure Automation Account is provisioned, just be aware that the service principal will be granted the Contributor role on the Azure Subscription.  This is probably going to be way more permissions than are needed so I’d recommend removing that role assignment, creating a custom RBAC role, and assigning it at the appropriate scope.

Automation Accounts have a number of assets which are relevant for Runbooks.  These include variables, connections, credentials, and certificates.  The links I provided will give you detailed information on these assets, so I’ll summarize the relevant content to the solution.  Variables can come in a variety of types including strings and integers and can also optionally be encrypted.  For this solution I use encrypted variables to store the Event Hub connection string, Log Analytics Workspace Id, and Log Analytics Workspace Key.  Connections contain information required to connect to an external service or application.  The only connection asset used with this solution is the AzureServicePrincipal which is used by the RunAs account.  You can retrieve the  connection to get information such as the Azure AD tenant Id and application id (client id in the OAuth world).  Lastly, we have the certificate asset, which as the name describes, can be used to securely store a certificate that is used for authentication.  This solution uses the AzureRunAsCertificate certificate which contains the certificate asset used to authenticate the Automation Account RunAs account.

Each Automation Account comes with a predefined set of PowerShell modules and .NET libraries.  You can add additional modules and libraries by importing them to the Automation Account.  For this solution I added a number of .NET libraries including the ADAL and some libraries required to communicate with Event Hubs.  While PowerShell does a wonderful job of handling things at the management plane of Azure, it is severely lacking in the data plane requiring you to fall back on incorporating .NET code into your PowerShell script.

The above (including the links) should give you the bare minimum you need to understand to use this solution.  Let’s deep dive into the code.  Since this is a fairly lengthy script I’m not going to paste every line of code.  Instead I’m going to call out key sections of code that were particularly relevant or interesting to write.

The first function in the script is called Get-AdalToken and uses the .NET ADAL library to retrieve a token from Azure AD.  When I code in Python I typically use the MSAL library since I find it to be a bit more slick, but found the .NET version too cumbersome and difficult to use in in PowerShell.  If you’ve ever used .NET libraries in your PowerShell scripts, you know where I’m coming from.

The token retrieved by the function is used for calls to the Azure Management REST API.  The reason I went with ADAL vs pulling the access token from a session created using Add-AzAccount method as demonstrated here is I wanted code I could reuse for other purposes outside of the Azure REST API.

Once the token is retrieved it is stored in a variable for later use in the script.

adal

Next up we have the Get-AllManagementGroups function.  This function calls the Azure REST API to get a full listing of management groups.  Oddly enough there is an AzureRM cmdlet included in the AzureRm.Resources module that comes preinstalled with every new Automation Account.  However, even after updating the modules within the account (this link tells you how to do this and I highly recommend doing it whenever you create a new automation account) the cmdlet only ever reported back the tenant root group.  This occurred even when following the instructions to spit back all Management Groups.  I chalked it up to there being an issue with the cmdlet or user error on my part.  Either way, it was simple enough to whip up a call to the REST API.

Following the Get-AllManagementGroups function we have the Get-ManagementGroupActivityLog function.  Let me tell you folks, this one was an absolute pain to write.  According to this Azure feedback thread these logs have been accessible over the API since back in March of this year, but the REST API reference documentation doesn’t look to have been updated to reflect this.  I’m going to save you all a ton of headaches and hours of experimentation and searching the web.  When you want to get Activity Logs over the REST API you are going to use the following endpoint:


https://management.azure.com/providers/Microsoft.Management/

managementGroups/mgmtGroupId/providers/microsoft.insights/
eventtypes/management/values

The mgmtGroupId variable would be the name of your management group.  If your management group is named production then the value in that URL would be production.  Additionally, you’ll want to pass query parameters of api-version set to 2017-03-01-preview and a $filter query parameter constructed in the same way you would to query a subscription Activity Log.

activitylogquery.PNG

The SendTo-Storage function sends the Activity Log for each Management Group as a separate blob to Azure Storage.  The format of the Activity Log is raw JSON.

The SentTo-Workspace function sends the log data to Azure Monitor (really a Log Analytics Workspace) via the HTTP Data Collector API.  The product team was wonderful enough to include sample PowerShell code that made writing that function a breeze.

I did run into some weirdness with this function which was caused by the maximum size of an output stream in Runbooks which is 1MB.  When I pulled the Activity Log for 90 days, the entirety of the log was well over 1MB so it would cause the Runbook to fail three times and suspend.  Debugging this was a pain because the Runbook doesn’t report the error in an obvious way.  I got around this by collecting the log entries into a group and sending them at 200KB intervals.    Additionally, I also added some error checking and retry handling if it got throttled.

The final function is named SendTo-EventHub and delivers the logs to an Event Hub.  I couldn’t find any PowerShell cmdlets that could be used to send data to Event Hub.  This forced me to fall back to the .NET libraries.  In the end I got it working and got them streaming, but I’m sure someone more skilled in .NET than me (which isn’t difficult to be) could optimize and improve that code.

The main chunk of the solution strings everything together.  By default the solution writes the logs to Azure blob storage.  You can optionally deliver the data to Azure Monitor and Azure Event Hubs.

Well folks that brings us to the end of this post and series.  While I’m sure the product team is quickly coming out with this out of box integration, I learned a ton about Azure Automation and Runbooks working on this effort.  Runbooks are a wonderful tool if you’re a classic infrastructure / security tech new to the whole coding thing.  It’s a very simple and straightforward user experience for that audience and a good stepping stone into the coding world vs jumping directly into Azure Functions.

I’ve posted the solution up onto my Github.  For those folks without Github, I’ve put a static copy of the solution up on this website at this link.  Take it, test it, play with it, build upon it, and experiment with it.

Capturing Azure Management Group Activity Logs Using Azure Automation – Part 1

Capturing Azure Management Group Activity Logs Using Azure Automation – Part 1

Hello again fellow geeks!

Over the past few months I’ve been working with a customer who is just beginning their journey into the cloud.  We’ve had a ton of great conversations around security, governance, and operationalizing Microsoft Azure.  We recently finalized the RACI and identified the controls required by both their internal security policy and their industry compliance requirements.  With those two items complete, we put together our Azure RBAC model and narrowed down the Azure Policies we needed to put in place to satisfy our compliance controls.

After a lot of discussion about the customer’s organization, its geographical locations, business unit makeup, and how its developers and central IT operate, we came up with a subscription model.  This customer had decided on an Azure subscription model where each workload would exist in its own subscription.  Further, each workload’s production and non-production environment would be segmented in different subscriptions.  Keeping each workload in a different subscription ensures no workload will compete for resources with other workloads and hit any subscription limits.  Additionally, it allowed the customer to very easily track the costs associated with each workload.

Now why did we use separate production and non-production subscriptions for each workload?  One reason is to address the same risk as above where a non-production workload could potentially consume all resources within a subscription impacting a production workload.  The other more critical reason is it makes it easier for us to apply different governance and access controls on production workloads vs non-production workloads.  The way we do this is through the usage of Azure Management Groups.

Management Groups were introduced into general availability back in late 2018 to help address the challenges organizations were having operating subscriptions at scale.  They provided a hierarchal method to apply governance and access controls across a collection of subscriptions.  For those of you familiar with AWS, Management Groups are somewhat similar to AWS Organizations and Organizational Units.  For my fellow Windows AD peeps, you can think of Management Groups somewhat like the Active Directory container and organizational unit hierarchy in an Active Directory domain where you apply different access control entries and group policy at high levels in the OU hierarchy that is then enforced and inherited down to the children.  Management Groups work in a similar manner in that the Azure RBAC definitions and assignments and Azure Policy you assign to the parent Management Groups are inherited down into the children.

Every Azure AD tenant starts with a top-level management group called the tenant root group.  Additional management groups created within the tenant are children of the group up to a maximum of 10,000 management groups and up to six levels of depth.  Any RBAC assignment or Azure Policy assigned to the tenant root group applies to all children management group in the tenant.  It’s important to understand that Management Groups are a resource within the Azure AD tenant and not a resource of an Azure subscription.  This will matter for reasons we’ll see later.

The tenant root management group can only be administered by a Global Admin by default and even this requires a configuration change in the tenant.  The method is describe here and what it does is places the global administrator performing the action in the User Access Administrator RBAC role at the root of scope.  Once that is complete, the name of the root management group could be changed, role assignments created, or policy assigned.

Screen Shot 2019-10-17 at 9.59.59 PM

Administering Tenant Root Group

Now there is one aspect of Management Groups that is a bit funky.  If you’re very observant you probably noticed the menu option below.

Screen Shot 2019-10-17 at 9.59.59 PM.png

That’s right folks, Management Groups have their own Activity Log.  Every action you perform at the management group scope such creating an Azure RBAC role assignment or assigning or un-assigning an Azure Policy is captured in this Activity Log.  Now as of today, the only way to access these logs is viewing them through the portal or through the Azure REST API.  Unlike the Activity Logs associated with a subscription, there isn’t native integration with Event Hubs or Azure Storage.  Don’t be fooled by the Export To Event Hub link seen in the screenshot below, this will simply send you to the standard menu where you would configure subscription Activity Logs to be exported.

Screen Shot 2019-10-17 at 10.34.19 PM

Now you could log into the GUI every day and export the logs to a CSV (yes that does work with Management Groups) but that simply isn’t scalable and also prevents you from proactively monitoring the logs.  So how do we deal with this gap while the product team works on incorporating the feature?  This will be the challenge we address in this series.

Over the next few posts I’ll walk through the solution I put together using Azure Automation Runbooks to capture these Activity Logs and send them to Azure Storage for retention and an Azure Log Analytics Workspace for analysis and monitoring using Azure Monitor.

Continue the series in my second post.