Blocking API Key Access in Azure OpenAI Service

Posted on July 1, 2023 by mattfeltonma

This is part of my series on GenAI Services in Azure:

Hello folks! I’m back again with another post on the Azure OpenAI Service. I’ve been working with a number of Microsoft customers in regulated industries helping to get the service up and running in their environments. A question that frequently comes up in this conversations is “How do I prevent usage of the API keys?”. Today, I’m going to cover this topic.

I’ve covered authentication in the AOAI (Azure OpenAI Service) in a past post so read that if you need the gory details. For the purposes of this post, you need to understand that AOIA supports both API keys and AAD (Azure Active Directory) authentication. This dual support is similar to other Azure PaaS (platform-as-a-service) offerings such as Azure Storage, Azure CosmosDB, and Azure Search. When the AOAI instance is created, two API keys are generated which provide full permissions at the data plane. If you’re unfamiliar with the data plane versus management plane, check out my post on authorization.

Given the API keys provide full permissions at the data plane monitoring and controlling their access is critical. As seen in my logging post monitoring the usage of these keys is no simple task since the built-in logging is minimal today. You could use a custom APIM (Azure API Management) policy to include a portion of the API key to track its usage if you’re using the advanced logging pattern, but you still don’t have any ability to restrict what the person/application can do within the data plane like you can when using AAD authentication and authorization. You should prefer AAD authentication and authorization where possible and tightly control API key usage.

In my authorization and logging posts I covered how to control and track who gets access to the API keys. I’ve also covered how APIM can be placed in front of an AOAI instance to enforce AAD authentication. If you block network access to the AOAI service to anything but APIM (such as using a Private Endpoint and Network Security Group) you force the usage of APIM which forces the use of AAD authentication preventing API keys from being used.

**Azure OpenAI Service and Azure API Management Pattern**

The major consideration of the pattern above is it breaks the Azure OpenAI Studio as of today (this may change in the future). The Azure OpenAI Studio is an GUI-based application available within the Azure Portal which allows for simple point-and-click actions within the AOAI data plane. This includes actions such as deploying models and sending prompts to a model through a GUI interface. While all this is available via API calls, you will likely have a user base that wants access a simple GUI to perform these types of actions without having to code to them. To work around this limitation you have to open up network access from the user’s endpoint to the AOAI instance. Opening up these network flows allows the user to bypass APIM which means the user could use an API key to make calls to the AOAI service. So what to do?

In every solution in tech (and life) there is a screwdriver and a hammer. While the screwdriver is the optimal way to go, sometimes you need the hammer. With AOAI the hammer solution is to block usage of API key-based authentication at the AOAI instance level. Since AOAI exists under the Azure Cognitive Services framework, it benefits from a poorly documented property called disableLocalAuth. Setting this property to true blocks the API key-based authentication completely. This property can be set at creation or after the AOAI instance has been deployed. You can set it via PowerShell or via a REST call. Below is code demonstrating how to set it using a call to the Azure REST API.

body=$(cat <<EOF
{
    "properties" : {
        "disableLocalAuth": true
    }
}

az rest --method patch --uri "https://management.azure.com/subscriptions/SUBSCRIPTION_ID/resourceGroups/RESOURCE_GROUP/providers/Microsoft.CognitiveServices/accounts/AOAI_INSTANCE_NAME?api-version=2021-10-01" --body $body

The AOIA instance will take about 2-5 minutes to update. Once the instance finishes updating, all calls to it using API key-based authentication will receive an error such as seen below when using the OpenAI Python SDK.

You can re-enable the usage of API keys by setting the property back to false. Doing this will update the AOAI resource again (around 2-5 minutes) and the instance will begin accepting API keys. Take note that turning the setting off and then back on again WILL cycle the API keys so don’t go testing this if you have applications in production using API keys today.

Mission accomplished right? The user or application can only access the AOAI instance using AAD authentication which enforced granular Azure RBAC authroization. Heck, there is even an Azure Policy available you can use to audit whether AOAI instances have had this property set.

There is a major consideration with the above method. While you’ve blocked access to the API keys, you’re still created a way to circumvent APIM. This means you lose out on the advanced logging provided by APIM and you’ll have to live with the native logging. You’ll need to determine whether that risk is acceptable to your organization.

My suggestion would be to use this control in combination with strict authorization and network controls. There should be a very limited set of users with permissions directly on the AOAI resource and the direct network access to the resource should be tightly controlled. The network control could be accomplished by creating a shared jump host users that require this access could use. Key thing is you treat access to the Azure OpenAI Studio as an exception versus the norm. I’d imagine Microsoft will evolve the Azure OpenAI Studio deployment options over time and address the gaps in native logging. For today, this provides a reasonable compromise.

I did encounter one “quirk” with this option that is worth noting. The account I used to lab this all out had the Owner role assignment at the subscription level. With this account I was able to do whatever I wanted within the AOAI data layer when disableLocalAuth was set to false. When I set disableLocalAuth to true I was unable to make data plane calls (such as deploying new models). When I granted my user one of the data plane roles (such as Azure Cognitive Service OpenAI Contributor) I was able to perform data plane operations once again. It seems like setting this property to true enforces a rule which requires being granted specific data plane-level permissions. Make sure you understand this before you modify the property.

Well folks that concludes this blog post. Here are your key takeaways:

API Key-based authentication can be blocked at the AOIA instance by setting the disableLocalAuth property to true. This setting can be set at deployment or post deployment and takes 2-5 minutes to take effect. Switching the value of this property from true to false will regenerate the API keys for the instance.
The Azure OpenAI Studio requires the user’s endpoint have direct network access to the AOAI instance. This is because it uses the user’s endpoint to make specific API calls to the data plane. You can look at this yourself using debug mode in your browser or a local proxy like Fiddler. Direct network access to the AOAI instance means you will only have the information located in the native logs for the activities the user performs.
Setting disableLocalAuth to true enforces a requirement to have specific data plane-level permissions. Owner on the subscription or resource group is not sufficient. Ensure you pre-provision your users or applications who require access to the AOAI instance with the built-in Azure RBAC roles such as Azure Cognitive Services OpenAI User or a custom role with equivalent permissions prior to setting the option to true.

Thanks folks and have a great weekend!

Logging in Azure OpenAI Service

Posted on April 13, 2023 by mattfeltonma

This is part of my series on GenAI Services in Azure:

Updates:

1/18/2024 – Logs now include Entra ID security principal objectid property in RequestResponse log

Welcome back fellow geeks.

Over the past few weeks I’ve done a series of posts on the Azure OpenAI Service covering some of the security features of the service. In my first post I gave an overview of what security controls Microsoft makes available for customers to configure to secure their instance of the service. In the second and third posts I did deep dives into the authentication and authorization capabilities of the service. Tonight I’m going to cover the logging capabilities of the service.

Let’s jump right in!

The Azure OpenAI Service emits both logs and metrics. For the purposes of this post I’ll be covering logs. I’ll cover the metrics and monitoring of the service in another post if there is a community interest. Logs emitted by the service have been integrated with the diagnostic setting feature. For those unfamiliar with the diagnostic settings feature, it provides a very simple way to deliver logs and metrics emitted by an Azure service to an Azure Storage Account, Log Analytics Workspace, or to an Event Hub (common use case for passing on to a SIEM like Splunk). In the image below, you can see I’m sending all of the logs and metrics emitted from the service to a Log Analytics Workspace.

In the image above you can see that the Azure OpenAI Service emits three types of logs which include audit logs, request and response logs, and trace logs. As of the date of this blog all of these logs are sent to the AzureDiagnostics table if you opt to send this logs to a Log Analytics Workspace, so dust off your Kusto skills.

Let’s first take a look at the audit logs, because I know that’s where your security focused eyes darted to. I want to remind you this is a very new service and lots of improvements are coming. Yeah, I did that. I pulled a sales dude move. Seriously though, the audit logging is very limited and likely not what you’d hope for as of the date of this blog. The only events that seem to be logged to the Audit Log for the service are when a ListKeys operation is performed. The operation means a security principal accessed the API Keys. The API keys are used to authenticate to the data plane of the service and do not allow for granular authorization at that plane. Check out my last two posts on authentication and authorization if that sentence doesn’t make sense. Unfortunately, the identity that accessed the API key isn’t listed in the log entry which makes it pretty useless in its current state. Below is a sample entry.

**Azure OpenAI Service Audit Log Entry Example**

Making this even more useless, this operation is also logged in the Azure Activity Log. The log entry within the Activity Log does include the security principal that performed the action so you’ll want to watch for that activity there. I imagine over time the audit log will be improved to capture more operations and associate those operations to a security principal.

**Activity Log Entry Showing List Keys Operation**

Next I’m going to cover the Request and Response Logs. This log set is really interesting because likely your expectation is the same as mine was that these would include details around prompts sent to the models and information on the response such as the number of tokens consumed. While it does operations around requests for things such a completions or summarizations, it also captures a ton of other events that would likely be more suited for the audit log. Additionally, the data it captures about these actions is extremely limited.

Let’s take a look at a log entry where I requested the model complete a sentence for me. In my code I’m calling the API using an Azure AD service principal NOT an API key with the shattered hope that the log entry would capture the service principal I’m using.

1/18/2024 – The Entra ID object id is now included in the RequestResponse log entry! Hooray!

In the above log entry we don’t get any information to correlate the operation back an entity even when using Azure AD authentication. All we can see is the completion action occurred at a specific time and resulted in a success status code. You’ll also see there is a CallerIPAddress field. This will include the first three octets of the IP address called the service but not the last octet. Kinda weird it’s being masked like this, but I guess that’s better than nothing? (Not really, but hey it’s a new service).

Before you ask, no, the content of the prompts and responses are not logged in any of these logs.

There is one additional field of relevance I couldn’t fit within the above screenshot and that’s properties_s. The only real useful information on this is total response time the service took to return an answer to the user. I hoped this would have had some information around tokens used, but sadly it does not.

**properties_s field of a Prompt and Response Log Entry**

Besides prompts and responses, this log seems to capture other data plane operations. This includes everything from activities users have performed around uploading files to the service to train fine-tuned models, activities around fine-tuned models (listing, creation, deletion), creation of embeddings, and management of models deployed to the service. Most of these operations should be in the Audit Log in my opinion. I’m not sure why they’re included in this log, but they are. No, none of these operations include details as to who performed these actions beyond the first three octets of the IP address.

Lastly, there is the trace log. I have no idea what’s logged in there because I have yet generate any trace log data. If you know what gets logged in there, let me know in the comments.

So yes folks, there are some serious gaps in the logging for the service today. However, the service is new and the underlining technology is still pretty new as well so we can’t expect perfection out of the gates. My advice to customers has been to build the logging they need into whatever application is fronting the user access and to lock the service down from an authorization perspective so that the only access to the service comes through that application.

My peer Jake Wang has come up with a creative solution to address some of the logging gaps in the service by placing an API Management instance in front of it. With this design anything communicating with the Azure OpenAI Service instance has to go through APIM. Within APIM you can do whatever fancy logging you want to do, toss in some additional throttling to specific user requests, and lots of other cool stuff. It’s a great workaround while the Product Group improves the native logging. Check out my recent post for some of the gotchas of APIM logging for the Azure OpenAI Service.

If you have a different API Gateway like Mulesoft you could use this same pattern with that instead of APIM.

Well folks that wraps things up. I hope you got some value out of this post and I’d encourage you to make your voices heard by submitting feedback to the product group on how you’d like to see the logging improved for the service.

Thanks for reading!

Authorization in Azure OpenAI Service

Posted on April 6, 2023 by mattfeltonma

This is part of my series on GenAI Services in Azure:

Hello folks!

The fun with the new Azure OpenAI Service continues! I’ve been lucky enough to have been tapped to help a number of Microsoft financial services customers with getting the Azure OpenAI Service in place with the appropriate infrastructure and security controls. In the process, I get to learn from a ton of smart people in the AI space. It’s truly been one of the highlights of my 20-year career.

Over the past few weeks I’ve been posting about what I’ve learned, and today I’m going to continue with that. In my first post on the service I gave a high level overview of the security controls Microsoft makes available to the customer to secure their instance of Azure OpenAI Service. In my second post I dove deep into how the service handles authentication and how Azure Active Directory (Azure AD) can be used to improve over the built-in API key-based authentication. Today I’m going to cover authorization and demonstrate how using Azure AD authentication lets you take advantage of granular authorization with Azure RBAC.

Let’s dig in!

As I covered in my last post, the Azure OpenAI Service has both a management plane and data plane. Each plane supports different types of authentication (process of verifying the identity of a user, process, or device, often as a prerequisite to allowing access to resources in an information system) and authorization (The right or a permission that is granted to a system entity to access a system resource). Operations such as swapping to a customer-managed key, enabling a private endpoint, or assigning a managed identity to the service occur within the management plane. Activities such as uploading training data or issuing a prompt to a model occur at the data plane. Each plane uses a different API endpoint. The image below will help you visualize the different planes.

**Azure OpenAI Service Management and Data Planes**

As illustrated above, authorization within the management plane is handled using Azure RBAC because authentication to that plane requires Azure AD-based authentication. Here we can limit the operations occurring at the management plane a security principal (user, service principal, managed identity, Azure Active Directory group (local or synchronized from on-premises) can perform by using Azure RBAC. For those of you coming from the AWS world, and where the Azure OpenAI Service may be your first venture into Azure, Azure RBAC is Azure’s authorization solution. It’s similar to an AWS IAM Policy. Let’s take a look at a built-in RBAC role that a customer might grant a data scientist who will be using the Azure OpenAI Service.

{
    "id": "/subscriptions/de90ea7d-a9c3-4957-8c96-XXXXXXXXXXXX/providers/Microsoft.Authorization/roleDefinitions/a97b65f3-24c7-4388-baec-2e87135dc908",
    "properties": {
        "roleName": "Cognitive Services User",
        "description": "Lets you read and list keys of Cognitive Services.",
        "assignableScopes": [
            "/"
        ],
        "permissions": [
            {
                "actions": [
                    "Microsoft.CognitiveServices/*/read",
                    "Microsoft.CognitiveServices/accounts/listkeys/action",
                    "Microsoft.Insights/alertRules/read",
                    "Microsoft.Insights/diagnosticSettings/read",
                    "Microsoft.Insights/logDefinitions/read",
                    "Microsoft.Insights/metricdefinitions/read",
                    "Microsoft.Insights/metrics/read",
                    "Microsoft.ResourceHealth/availabilityStatuses/read",
                    "Microsoft.Resources/deployments/operations/read",
                    "Microsoft.Resources/subscriptions/operationresults/read",
                    "Microsoft.Resources/subscriptions/read",
                    "Microsoft.Resources/subscriptions/resourceGroups/read",
                    "Microsoft.Support/*"
                ],
                "notActions": [],
                "dataActions": [
                    "Microsoft.CognitiveServices/*"
                ],
                "notDataActions": []
            }
        ]
    }
}

Let’s briefly walkthrough each property. The id property is the unique resource name assigned to this role definition. Next up we have the name property and description properties which need no explanations. The assignableScopes property determines at which scope an RBAC role can be assigned. Typical scopes include management groups, subscriptions, resource groups, and resources. Built-in roles will always have an assignable scope of “/” which denotes the RBAC role can be assigned to any management group, subscription, resource group, or role.

I’ll spend a bit of time on the permissions property. The permissions property contains a few different child properties including actions, notActions, dataActions, and notDataActions. The actions property lists the management plane operations allowed by the role while the dataActions lists the data plane operations allowed by the role. The notActions and notDataActions are interesting in that they are used to strip permissions out of the actions or dataActions. For example, say you granted a user full data plane operations to an Azure Key Vault but didn’t want them to have the ability to delete keys. You could to this by giving the user the dataAction of Microsoft.KeyVaults/* and notDataAction of Microsoft.KeyVaults/keys/purge/action. Take note this is NOT an explicit deny. If the user gets this permission in another way through assignment of a different RBAC role the user will be able to perform the action. At this time, Azure does not have a generally available feature that allows for an explicit deny like AWS IAM and what does exist in preview has an extremely narrow scope such that it isn’t very useful.

When you’re ready to assign a role to a security principal (user, service principal, managed identity, Azure Active Directory group (local or synchronized from on-premises) you create what is called a role assignment. A role assignment associates an Azure RBAC Role Definition to a security principal and scope. For example, in the below image I’ve created an RBAC Role Assignment for the Cognitive Services User Role at the resource group scope for the user Carl Carlson. This grants Carl the permission to perform the operations listed in the role definition above to any resource within the resource group, including the Azure OpenAI Resource.

Scroll back and take a look at the role definition, notice any risky permission? If you noticed the permission Microsoft.CognitiveServices/accounts/listkeys/action (remember that the Azure OpenAI Service falls under the Cognitive Services umbrella), grab yourself a cookie. As I’ve covered previously, every instance of the Azure OpenAI Service comes with two API keys. These API keys allow for authentication to the instance at the data plane level, can’t be limited in what they can do, and are very difficult to ever track back to who used them. You will want to very tightly control access to those API keys so be wary of who you give this role out to and may want to instead create a similar custom role but without this permission.

The are two other roles which are specific two the Azure OpenAI Service are the Cognitive Services OpenAI Contributor and Cognitive Services OpenAI User. Let’s look at the contributor role first.

{
    "id": "/providers/Microsoft.Authorization/roleDefinitions/a001fd3d-188f-4b5d-821b-XXXXXXXXXXXX",
    "properties": {
        "roleName": "Cognitive Services OpenAI Contributor",
        "description": "Full access including the ability to fine-tune, deploy and generate text",
        "assignableScopes": [
            "/"
        ],
        "permissions": [
            {
                "actions": [
                    "Microsoft.CognitiveServices/*/read",
                    "Microsoft.Authorization/roleAssignments/read",
                    "Microsoft.Authorization/roleDefinitions/read"
                ],
                "notActions": [],
                "dataActions": [
                    "Microsoft.CognitiveServices/accounts/OpenAI/*"
                ],
                "notDataActions": []
            }
        ]
    }
}

The big difference here is this role doesn’t grant much at the management plane. While this role may seem appealing to give to a data scientist because it doesn’t allow access to the API keys, it also doesn’t allow access to the instance metrics. I’ll talk about this more when I do a post on logging and monitoring in the service, but access to the metrics are important for the data scientists. These metrics allow them to see how much volume they’re doing with the service which can help them estimate costs and avoid hitting API limits.

Under the dataActions you can see this role allows all data plane operations. These operations include uploading training data for the creation of fine-tuned models. If you don’t want your users to have this access, then you can either strip the permissions Microsoft.CognitiveServices/accounts/OpenAI/files/import/action or grant the user the next role I’ll talk about.

One interesting thing to note is that while this role grants all data actions, which include data plane permissions around deployments, users with this role cannot deploy models to the instance. An error will be thrown that the user does not have the Microsoft.CognitiveServices/accounts/deployments/write permission. I’m not sure if this by design, but if anyone has a workaround for it, let me know in the comments. It would seem like if you want the user to deploy a model, you’ll need to model a custom role after this role and add that permissions.

The last role I’m going to cover is the Cognitive Services OpenAI User role. Let’s look at the permissions for this one.

{
    "id": "/providers/Microsoft.Authorization/roleDefinitions/5e0bd9bd-7b93-4f28-af87-XXXXXXXXXXXX",
    "properties": {
        "roleName": "Cognitive Services OpenAI User",
        "description": "Ability to view files, models, deployments. Readers can't make any changes They can inference",
        "assignableScopes": [
            "/"
        ],
        "permissions": [
            {
                "actions": [
                    "Microsoft.CognitiveServices/*/read",
                    "Microsoft.Authorization/roleAssignments/read",
                    "Microsoft.Authorization/roleDefinitions/read"
                ],
                "notActions": [],
                "dataActions": [
                    "Microsoft.CognitiveServices/accounts/OpenAI/*/read",
                    "Microsoft.CognitiveServices/accounts/OpenAI/engines/completions/action",
                    "Microsoft.CognitiveServices/accounts/OpenAI/engines/search/action",
                    "Microsoft.CognitiveServices/accounts/OpenAI/engines/generate/action",
                    "Microsoft.CognitiveServices/accounts/OpenAI/engines/completions/write",
                    "Microsoft.CognitiveServices/accounts/OpenAI/deployments/search/action",
                    "Microsoft.CognitiveServices/accounts/OpenAI/deployments/completions/action",
                    "Microsoft.CognitiveServices/accounts/OpenAI/deployments/embeddings/action",
                    "Microsoft.CognitiveServices/accounts/OpenAI/deployments/completions/write"
                ],
                "notDataActions": []
            }
        ]
    }
}

Like the contributor role, this role is very limited with management plane permissions. At the data plane level, this role really allows for issuing prompts and not much else. This role is great a non-human application role assigned via service principal or managed identity. It will allow the application to issue prompts and not much else. You don’t have to worry about a user exploiting this role to access training data you may have uploaded or making any modification to the Azure OpenAI Service instance.

Well folks that wraps this up. Let’s sum up what we’ve learned:

The Azure OpenAI Service supports fine-grained authorization through Azure RBAC at both the management plane and data plane when the security principal is authenticated through Azure AD.
Avoid using API keys where possible and leverage Azure RBAC for authorization. You can make it much more fine-grained, layer in the controls provided by Azure AD on top of it, and associate the usage of the service back to user (kinda as we’ll see in my post on logging).
Tightly control access to the API keys. I’d recommend any role you give to a data scientist or an application that you strip out the listkeys permissions.
I’d recommend creating a custom role for human users modeled after the Cognitive Services User role but without the listkeys permission. This will grant the user access to the full data plane and allow access to management plane pieces such as metrics. You can optionally be granular with your dataActions and leave out the files permissions to prevent human users from uploading training data.
I’d recommend using the built-in Cognitive Services OpenAI User role for service principals and managed identities assigned to applications. It grants only the permissions these applications are likely going to need and nothing more.
I’d avoid using notActions and notDataActions since it’s not an explicit deny and it’s very difficult to determine an effective user’s access in Azure without another tool like Entra Permissions Management.

Well folks, I hope this post has helped you better understand authorization in the service and how you could potentially craft it to align with least privilege.

Next post up will be around logging.

Have a great night!

Azure OpenAI Service – Infra and Security Stuff

Posted on March 19, 2023 by mattfeltonma

This is part of my series on GenAI Services in Azure:

Welcome back fellow geeks!

The past few months have been crazy busy. My customer load has doubled and customers who went into hibernation for holidays have decided to wake up in full force. With that new demand comes interesting new use cases and blog topics.

Unless you’ve been living under a rock, you’re well aware of the insane amount of innovation and technical developments in the AI space. It seems every day there’s 10 articles on OpenAI’s models (hilarious South Park episode on ChatGPT recently). Microsoft decided to dive straight into the deep end and formed a partnership with OpenAI. Out of this partnership came the Azure OpenAI Service which runs OpenAI models like ChatGPT on Azure infrastructure. As you can imagine, this offering has big appeal to new and existing Azure customers.

Given the demand I was seeing within my own customers, I decided to take a look at the security controls (or infra/security stuff as one of my data counterparts calls it) available within the service. Before jumping into the service, I did some basic experimentation with the OpenAI’s own service using this wonderful tutorial by the Part Time Larry. I found his step-by-step walkthrough of some of the sample code to be absolutely stellar in understanding just how simple it is to interact with the service.

With a very basic (and I do stress basic) understanding of how to interact with OpenAI’s API, I decided upon a use case. The use case I decided upon was to use the summarization feature he davinci GPT-3 model to summarize the NIST document on Zero Trust. I was interested in which key points it would extract from document and whether those would align with what I drew from the document after reading through it fully (re-reading the doc is still in my todo list!).

Before I could do any of the cool stuff I had to get onboarded to the service. At this time, customers must request their subscriptions be onboarded into the service using the process described in Microsoft’s public documentation. While I waited for my subscription to be onboarded, I read through the public documentation with a focus on the “infra/security” stuff. Like most of the data services in Azure, the information on the levers customers can pull around security controls like network, encryption-at-rest, and identity were very high level and not very useful. Lots of mentions of words, but no real explanation of those features would “look” when enabled in the service. There is also the matter of how Microsoft is handling and securing the data the customer data for the service.

Like every cloud provider, Microsoft operates within the shared responsibility model where Microsoft is responsible for the security of the cloud and you, the customer, are responsible for security within the cloud. Simply put, there are controls Microsoft manages behind the scenes and there are controls Microsoft puts in the customer’s hands and it’s on the customer to enable those controls. Microsoft describes how the data is processed and secured for the Azure OpenAI Service in the public documentation. Customers should additionally review the Microsoft Products and Services Data Protection Addendum and specific product terms. Another great resource to review is documentation within the Microsoft Services Trust Portal. In the Trust Portal you can find all the compliance-related documentation such as the SOC-2 Type II which will provide detail as to Microsoft’s processes and controls it uses to protect data. For a much deeper dive, you can review the FedRAMP SSP (System Security Plan). I typically find myself scanning through the SOC2 first and then very often diving deeper by reading through the relevant sections in the FedRAMP SSP. I’ll let you read through and consume the documentation above (and you should be doing that for every service you consume). For the purposes of this blog post, I’m going to look at the “security within the cloud”.

I’m a big fan of taking a step back and looking at things from a high level architectural view. After reading through documentation, I envisioned the following Azure components being they key components required in any implementation of the service within a regulated industry.

Let’s walk through each of these components.

The first component is the Azure OpenAI Service instance which is service under the Cognitive Services umbrella. Azure Cognitive Services includes existing services like speech-to-text, image analysis, and the like. This was a great idea by Microsoft because it would allow the Product Group (PG) managing the Azure OpenAI Service to leverage existing architectural standards already adopted for other services under the Cognitive Services umbrella.

The next component is the Azure Key Vault instance. Within an instance of Azure OpenAI Service there are three types of data that could be stored within a customer’s instance of the service. I say could because this data is only stored if you choose to use specific features and capabilities of the service. This data includes training data you may provide to fine-tune models, the fine-tuned models themselves, and prompts and completions. Training data is only stored if you opt to train your own fine-tuned models and the training data can be removed as soon as you finish training your fine-tuned model. From talking to my much smarter peers, there is a very low percentage of customers that will need to create fine-tuned models. I’ve heard as low as 1% of customers will need to do this since the included models are already trained very effectively. Prompts and completions are by default stored for 30 days for human evaluation to ensure the models are being used in an appropriate way. Customers have the option to opt out of the content filtering using the process outlined in this piece of public documentation. If they opt out, this data is never stored.

If the customer opts to use a feature that creates this data, then the data is encrypted-at-rest by default with Microsoft-managed keys when stored within the Microsoft-managed boundary. This means that Microsoft manages the authorization and rotation of the keys. Many regulated customers have regulatory requirements or internal policies that require the customer to manage authorization and rotation of any keys used to encrypt data in their environment. For that reason, cloud providers such as Microsoft provide the option to use CMKs (Customer Managed Keys). In Azure, these CMKs are stored within an Azure Key Vault instance within a customer’s subscription and the customer controls authorization and access to the keys.

The Azure OpenAI Service supports the use of CMKs to protect at least two out of three of these sets of data. The documentation is unclear as to whether the prompts and completions can be encrypted with CMKs. If you happen to know, let me know in the comments. Take note that for now you need to request access to get your subscription approved for CMKs with the Azure OpenAI Service.

Next up we have virtual networks, private endpoints and Azure Private DNS. Like the rest of the services in the Cognitive Services umbrella, the OpenAI service supports private endpoints as a means to lock down network access to your private IP space. The DNS namespace for the service is privatelink.openai.azure.com. Best practice would have you hosting this zone in Azure Private DNS which we’ll see later on when I share a sample architecture. It is worth noting that the Azure Open AI Service also supports what I refer to as the service firewall. This allows you to limit access to the service to a specific set of public IPs (such as your enterprise’s forward web proxy) or to a specific virtual network via a Service Endpoint.

Next, we have Azure Storage. If you choose to build a fine tuned model training data can be uploaded to an Azure Storage Account within the customer’s subscription. The customer’s instance of the Azure OpenAI Service can then retrieve the data using a method I will explain later in this post.

We then have managed identities and Azure RBAC. For the service, managed identities are used to access the CMKs stored in the customer Key Vault instance. Azure RBAC will be used to control access to the Azure OpenAI Services instance and keys used to call the service APIs.Stepping back and looking at the components above and how they fit together to provide security controls across identity, network, encryption, and encryption, I see it like the below.

For the Azure OpenAI Service instance running the models, you lock down the service using Azure RBAC. Authentication to the service is supported through a set of API keys which you will need to manage rotation of. Optionally, (I haven’t tested this myself), you can use Azure AD authentication to obtain a bearer token to authenticate to the service. You secure network access by restricting access to the service using private endpoints. Data is optionally encrypted with CMKs stored in a customer-managed Key Vault instance to enable the customer to control access to the keys, rotate keys, and audit usage of those keys. The Azure OpenAI Service also offers logs and metrics which can be delivered to Azure Storage, a Log Analytics Workspace, or an Event Hub via the diagnostics settings configured on the instance. The security specific logs you’ll be interested in are the audit logs and potentially the prompt and completion logs.

The Azure Key Vault instance used when customers opt to use CMKs can have access to the keys controlled using Azure RBAC (when using a Key Vault instance enabled for Azure RBAC vault policies) and managed identities. The Azure OpenAI Service instance will access the CMK using the managed identity assigned to the service. Take note that as of today, you cannot use the Key Vault service firewall to restrict network access. Azure Cognitive Services is not considered a Trusted Azure Service for Key Vault and thus can’t be allowed network access when the service firewall is enabled.

If the customer chooses to store training data in an Azure Storage Account before uploading to the service, the account can be secured for user access with Azure RBAC or SAS tokens. Since SAS tokens are a nightmare to manage for humans, you’ll want to control access to the data for humans using Azure RBAC. The Azure OpenAI Service itself does not support the use of a managed identity for access of Azure Storage today. This means you’ll need to secure the data using a SAS token for non-human access of the data during upload. Since the Azure OpenAI Service does not yet support a managed identity for access to Azure Storage, it cannot take advantage of the service instance authorization rules. Allowing just the trusted services for Azure Storage doesn’t seem to work either in my testing. This means that you’ll need to allow all public network access to the storage account. Your means to secure that data will be SAS tokens largely for the access coming from the Azure OpenAI Service. Not ideal, but hey, the service is very new.

So putting everything together than we’ve learned, what could this look like architecturally?

**Azure OpenAI Service Sample Architecture**

Above is an example architecture that is common in regulated organizations that have adopted Azure VWAN. In this pattern, all service instances related to the deployment would be placed in a dedicated workload subscription as indicated by the orange outline. This includes the virtual network containing the Azure OpenAI Service private endpoint, the Azure OpenAI Service instance, user-assigned managed identity used by the Azure OpenAI Service instance, the workload key vault containing the CMK used to encrypt the data held by the Azure OpenAI Service, and the Azure Storage Account used to stage training data to be uploaded to the service.

The Azure OpenAI Service would have its network access secured to the private endpoint. Both the Azure Key Vault instance and Storage Account would have their network access open to public networks. Access to the data for Azure Key Vault would be secured with Azure AD authentication and Azure RBAC vault policies for authorization. The Azure Storage account would use Azure AD authentication and Azure RBAC to control access for human users and SAS tokens to control access from the Azure OpenAI Service instance.

Lastly, although not listed in the images, it should go without saying that Azure Policy should be put in place to ensure all of the resources look the way you and your security team has decided the resources need to look.

As the service grows and matures, I expect some of these gaps in network controls to be addressed through support for managed identities to access storage accounts and the addition of the service to Azure Key Vault’s trusted services. I also wouldn’t be surprised to see some type of VNet-injection or VNet-integration to be introduced similar to what is available in Azure Machine Learning.

Well folks, I hope this helped you infra and security folks do your “infra/security stuff” for the day and you now better understand some of the levers and switches you have available to you to secure the service. As I progress in my learning of the service and AI in general, I plan on adding some posts which will walk through the implementation in action doing a deeper dive how this architecture looks when implemented. I have it running in my demo environment, but time is a very limited thing these days.

Thanks folks and I hope your journey into AI has been as fun as mine has been so far!

Check out my other posts on this service:

Journey Of The Geek

The chronicles of a Bostonian tech geek navigating through life and technology

Tag Archives: microsoft azure

Azure OpenAI Service – Infra and Security Stuff