Welcome back fellow geeks!
The past few months have been crazy busy. My customer load has doubled, and customers who went into hibernation for the holidays have decided to wake up in full force. With that new demand come interesting new use cases and blog topics.
Unless you’ve been living under a rock, you’re well aware of the insane amount of innovation and technical development in the AI space. It seems every day there are ten new articles on OpenAI’s models (there was a hilarious South Park episode on ChatGPT recently). Microsoft decided to dive straight into the deep end and formed a partnership with OpenAI. Out of this partnership came the Azure OpenAI Service, which runs OpenAI models like ChatGPT on Azure infrastructure. As you can imagine, this offering has big appeal to new and existing Azure customers.
Given the demand I was seeing within my own customers, I decided to take a look at the security controls (or “infra/security stuff” as one of my data counterparts calls it) available within the service. Before jumping into the service, I did some basic experimentation with OpenAI’s own service using this wonderful tutorial by Part Time Larry. I found his step-by-step walkthrough of some of the sample code to be absolutely stellar in understanding just how simple it is to interact with the service.
With a very basic (and I do stress basic) understanding of how to interact with OpenAI’s API, I decided upon a use case: using the summarization capabilities of the davinci GPT-3 model to summarize the NIST document on Zero Trust. I was interested in which key points it would extract from the document and whether those would align with what I drew from the document after reading through it fully (re-reading the doc is still on my todo list!).
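To give you an idea of just how simple that interaction is, here’s a minimal sketch of the summarization experiment using the openai Python package (the pre-1.0 API that was current at the time of writing). The model name, prompt wording, and file name are my own choices rather than anything prescribed by OpenAI, and a document the size of the NIST publication would need to be split into chunks to fit within the model’s token limit.

```python
# Minimal sketch: summarize an excerpt of NIST SP 800-207 (Zero Trust)
# using OpenAI's completions API via the openai Python package (pre-1.0).
# Model name, prompt wording, and file name are my own choices.
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# An excerpt small enough to fit in the model's token limit; a full
# document would need to be chunked and summarized in pieces.
with open("nist-sp-800-207-excerpt.txt") as f:
    document_text = f.read()

response = openai.Completion.create(
    model="text-davinci-003",  # davinci-family GPT-3 model
    prompt=f"Summarize the key points of the following text:\n\n{document_text}\n\nSummary:",
    max_tokens=300,
    temperature=0.2,  # keep the summary focused rather than creative
)

print(response["choices"][0]["text"].strip())
```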
Before I could do any of the cool stuff, I had to get onboarded to the service. At this time, customers must request their subscriptions be onboarded into the service using the process described in Microsoft’s public documentation. While I waited for my subscription to be onboarded, I read through the public documentation with a focus on the “infra/security” stuff. Like most of the data services in Azure, the information on the levers customers can pull around security controls like network, encryption-at-rest, and identity was very high level and not very useful. Lots of mentions of words, but no real explanation of how those features would “look” when enabled in the service. There is also the matter of how Microsoft handles and secures the customer data processed by the service.
Like every cloud provider, Microsoft operates within the shared responsibility model, where Microsoft is responsible for the security of the cloud and you, the customer, are responsible for security within the cloud. Simply put, there are controls Microsoft manages behind the scenes, and there are controls Microsoft puts in the customer’s hands, and it’s on the customer to enable those controls. Microsoft describes how data is processed and secured for the Azure OpenAI Service in the public documentation. Customers should additionally review the Microsoft Products and Services Data Protection Addendum and the specific product terms. Another great resource is the documentation within the Microsoft Services Trust Portal. In the Trust Portal you can find all the compliance-related documentation, such as the SOC 2 Type II report, which details the processes and controls Microsoft uses to protect data. For a much deeper dive, you can review the FedRAMP SSP (System Security Plan). I typically find myself scanning through the SOC 2 first and then very often diving deeper by reading through the relevant sections of the FedRAMP SSP. I’ll let you read through and consume the documentation above (and you should be doing that for every service you consume). For the purposes of this blog post, I’m going to look at the “security within the cloud”.
I’m a big fan of taking a step back and looking at things from a high-level architectural view. After reading through the documentation, I envisioned the following Azure components as the key components required in any implementation of the service within a regulated industry.
Let’s walk through each of these components.
The first component is the Azure OpenAI Service instance, which is a service under the Cognitive Services umbrella. Azure Cognitive Services includes existing services like speech-to-text, image analysis, and the like. This was a great idea by Microsoft because it allows the Product Group (PG) managing the Azure OpenAI Service to leverage existing architectural standards already adopted for other services under the Cognitive Services umbrella.
The next component is the Azure Key Vault instance. Within a customer’s instance of the Azure OpenAI Service there are three types of data that could be stored. I say could because this data is only stored if you choose to use specific features and capabilities of the service. This data includes training data you may provide to fine-tune models, the fine-tuned models themselves, and prompts and completions. Training data is only stored if you opt to train your own fine-tuned models, and it can be removed as soon as you finish training your fine-tuned model. From talking to my much smarter peers, a very low percentage of customers will need to create fine-tuned models. I’ve heard as low as 1% of customers will need to do this since the included models are already trained very effectively. Prompts and completions are stored by default for 30 days for human evaluation to ensure the models are not being used in an inappropriate way. Customers have the option to opt out of this logging and human review using the process outlined in this piece of public documentation. If they opt out, this data is never stored.
If the customer opts to use a feature that creates this data, then the data is encrypted at rest by default with Microsoft-managed keys when stored within the Microsoft-managed boundary. This means that Microsoft manages the authorization and rotation of the keys. Many regulated customers have regulatory requirements or internal policies that require the customer to manage authorization and rotation of any keys used to encrypt data in their environment. For that reason, cloud providers such as Microsoft provide the option to use CMKs (Customer Managed Keys). In Azure, these CMKs are stored within an Azure Key Vault instance within a customer’s subscription, and the customer controls authorization and access to the keys.
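If you’re wondering what standing up one of those keys looks like, below is a minimal sketch using the azure-keyvault-keys SDK. The vault URL and key name are hypothetical, and the RSA 2048 choice is my assumption based on what Azure services typically expect for CMKs.

```python
# Sketch: creating a CMK in a customer-owned Key Vault instance with the
# azure-keyvault-keys SDK. Vault URL and key name are hypothetical, and
# this assumes your identity has key-create permissions on the vault.
from azure.identity import DefaultAzureCredential
from azure.keyvault.keys import KeyClient

credential = DefaultAzureCredential()
key_client = KeyClient(
    vault_url="https://my-workload-kv.vault.azure.net",  # hypothetical vault
    credential=credential,
)

# Azure services consuming CMKs generally expect an RSA key.
key = key_client.create_rsa_key("openai-cmk", size=2048)
print(key.id)  # the full key identifier you point the service at
```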
The Azure OpenAI Service supports the use of CMKs to protect at least two out of three of these sets of data. The documentation is unclear as to whether the prompts and completions can be encrypted with CMKs. If you happen to know, let me know in the comments. Take note that for now you need to request access to get your subscription approved for CMKs with the Azure OpenAI Service.
Next up we have virtual networks, private endpoints, and Azure Private DNS. Like the rest of the services under the Cognitive Services umbrella, the Azure OpenAI Service supports private endpoints as a means to lock down network access to your private IP space. The DNS namespace for the service is privatelink.openai.azure.com. Best practice would have you hosting this zone in Azure Private DNS, which we’ll see later on when I share a sample architecture. It’s worth noting that the Azure OpenAI Service also supports what I refer to as the service firewall. This allows you to limit access to the service to a specific set of public IPs (such as your enterprise’s forward web proxy) or to a specific virtual network via a Service Endpoint.
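Once the private endpoint and the privatelink.openai.azure.com zone are in place, a quick way to validate the plumbing from inside the virtual network is to check that the service’s FQDN resolves to a private IP. A small sketch (the instance name is hypothetical):

```python
# Sanity check from inside the VNet: the service FQDN should resolve
# through privatelink.openai.azure.com to a private (RFC 1918) address.
import ipaddress
import socket

fqdn = "my-openai-instance.openai.azure.com"  # hypothetical instance

for info in socket.getaddrinfo(fqdn, 443, proto=socket.IPPROTO_TCP):
    ip = ipaddress.ip_address(info[4][0])
    print(f"{fqdn} -> {ip} ({'private' if ip.is_private else 'PUBLIC'})")
```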
Next, we have Azure Storage. If you choose to build a fine-tuned model, training data can be uploaded to an Azure Storage Account within the customer’s subscription. The customer’s instance of the Azure OpenAI Service can then retrieve the data using a method I will explain later in this post.
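Staging the training data in that storage account is straightforward. Here’s a minimal sketch using the azure-storage-blob SDK; the account, container, and file names are hypothetical, and it assumes your identity holds a data-plane role like Storage Blob Data Contributor on the account.

```python
# Sketch: staging fine-tuning training data in a customer storage account.
# Account, container, and blob names are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient

blob = BlobClient(
    account_url="https://myworkloadsa.blob.core.windows.net",
    container_name="training-data",
    blob_name="fine-tune-examples.jsonl",
    credential=DefaultAzureCredential(),  # human access via Azure RBAC
)

with open("fine-tune-examples.jsonl", "rb") as data:
    blob.upload_blob(data, overwrite=True)
```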
We then have managed identities and Azure RBAC. For the service, a managed identity is used to access the CMKs stored in the customer’s Key Vault instance. Azure RBAC is used to control access to the Azure OpenAI Service instance and the keys used to call the service APIs.

Stepping back and looking at the components above and how they fit together to provide security controls across identity, network, encryption, and logging, I see it like the below.
For the Azure OpenAI Service instance running the models, you lock down the service using Azure RBAC. Authentication to the service is supported through a set of API keys, which you will need to manage rotation of. Optionally (I haven’t tested this myself), you can use Azure AD authentication to obtain a bearer token to authenticate to the service; both options are sketched below. You secure network access by restricting access to the service using private endpoints. Data is optionally encrypted with CMKs stored in a customer-managed Key Vault instance, which enables the customer to control access to the keys, rotate keys, and audit usage of those keys. The Azure OpenAI Service also offers logs and metrics, which can be delivered to Azure Storage, a Log Analytics Workspace, or an Event Hub via the diagnostic settings configured on the instance. The security-specific logs you’ll be interested in are the audit logs and potentially the prompt and completion logs.
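To make those two authentication options concrete, here’s a sketch of calling the completions endpoint over REST with either an API key or an Azure AD bearer token. The endpoint, deployment name, and api-version are assumptions based on the documentation current at the time of writing, and as noted above, I haven’t validated the Azure AD path myself.

```python
# Sketch: the two authentication options for the service's REST API.
# Endpoint, deployment name, and api-version are assumptions.
import os

import requests
from azure.identity import DefaultAzureCredential

endpoint = "https://my-openai-instance.openai.azure.com"  # hypothetical
url = f"{endpoint}/openai/deployments/my-davinci-deployment/completions"
params = {"api-version": "2022-12-01"}
body = {"prompt": "Summarize Zero Trust in one sentence.", "max_tokens": 100}

# Option 1: API key (you own rotation of these keys).
headers_key = {"api-key": os.environ["AZURE_OPENAI_KEY"]}

# Option 2: Azure AD bearer token scoped to Cognitive Services
# (untested by me, per the above).
token = DefaultAzureCredential().get_token(
    "https://cognitiveservices.azure.com/.default"
)
headers_aad = {"Authorization": f"Bearer {token.token}"}

response = requests.post(url, params=params, headers=headers_aad, json=body)
print(response.json()["choices"][0]["text"])
```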
The Azure Key Vault instance used when customers opt to use CMKs can have access to the keys controlled using Azure RBAC (when using a Key Vault instance enabled for the Azure RBAC permission model instead of vault access policies) and managed identities. The Azure OpenAI Service instance will access the CMK using the managed identity assigned to the service. Take note that as of today, you cannot use the Key Vault service firewall to restrict network access. Azure Cognitive Services is not considered a Trusted Azure Service for Key Vault and thus can’t be allowed network access when the service firewall is enabled.
If the customer chooses to store training data in an Azure Storage Account before uploading it to the service, the account can be secured for user access with Azure RBAC or SAS tokens. Since SAS tokens are a nightmare to manage for humans, you’ll want to control human access to the data using Azure RBAC. The Azure OpenAI Service itself does not support the use of a managed identity to access Azure Storage today, which means you’ll need to secure the data with a SAS token for non-human access during upload. Because the service doesn’t yet support a managed identity here, it also can’t take advantage of the storage account’s resource instance network rules, and allowing just the trusted services for Azure Storage didn’t work either in my testing. This means you’ll need to allow public network access to the storage account and rely largely on SAS tokens to secure the data being accessed by the Azure OpenAI Service. Not ideal, but hey, the service is very new.
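For that non-human access, you’d mint a short-lived, read-only SAS for the training data blob and hand the resulting URL to the service. A sketch (names are hypothetical; generate_blob_sas requires one of the storage account keys):

```python
# Sketch: minting a short-lived, read-only SAS for the training data blob.
# Account, container, and blob names are hypothetical.
from datetime import datetime, timedelta

from azure.storage.blob import BlobSasPermissions, generate_blob_sas

sas = generate_blob_sas(
    account_name="myworkloadsa",
    container_name="training-data",
    blob_name="fine-tune-examples.jsonl",
    account_key="<storage-account-key>",            # one of the account keys
    permission=BlobSasPermissions(read=True),       # read-only
    expiry=datetime.utcnow() + timedelta(hours=1),  # keep the window short
)

blob_url = (
    "https://myworkloadsa.blob.core.windows.net/"
    f"training-data/fine-tune-examples.jsonl?{sas}"
)
print(blob_url)  # hand this URL to the Azure OpenAI Service
```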
So, putting together everything we’ve learned, what could this look like architecturally?
Above is an example architecture that is common in regulated organizations that have adopted Azure VWAN. In this pattern, all service instances related to the deployment would be placed in a dedicated workload subscription, as indicated by the orange outline. This includes the virtual network containing the Azure OpenAI Service private endpoint, the Azure OpenAI Service instance, the user-assigned managed identity used by the Azure OpenAI Service instance, the workload Key Vault containing the CMK used to encrypt the data held by the Azure OpenAI Service, and the Azure Storage Account used to stage training data to be uploaded to the service.
The Azure OpenAI Service would have its network access restricted to the private endpoint. Both the Azure Key Vault instance and the Storage Account would have their network access open to public networks, for the reasons covered earlier. Access to Azure Key Vault would be secured with Azure AD authentication and Azure RBAC for authorization. The Azure Storage Account would use Azure AD authentication and Azure RBAC to control access for human users and SAS tokens to control access from the Azure OpenAI Service instance.
Lastly, although not listed in the images, it should go without saying that Azure Policy should be put in place to ensure all of the resources look the way you and your security team have decided they need to look.
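As one example of what that could look like, below is the rule body (expressed as a Python dict for readability) of a policy that audits Cognitive Services accounts left open to public networks. The publicNetworkAccess alias is an assumption on my part; validate it against the policy alias reference before relying on it.

```python
# Sketch: an Azure Policy rule body auditing Cognitive Services accounts
# with public network access enabled. The alias is an assumption; verify
# it before use.
policy_rule = {
    "if": {
        "allOf": [
            {"field": "type", "equals": "Microsoft.CognitiveServices/accounts"},
            {
                "field": "Microsoft.CognitiveServices/accounts/publicNetworkAccess",
                "notEquals": "Disabled",
            },
        ]
    },
    "then": {"effect": "audit"},
}
```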
As the service grows and matures, I expect some of these gaps in network controls to be addressed through support for managed identities to access storage accounts and the addition of the service to Azure Key Vault’s trusted services list. I also wouldn’t be surprised to see some type of VNet injection or VNet integration introduced, similar to what is available in Azure Machine Learning.
Well folks, I hope this helped you infra and security folks do your “infra/security stuff” for the day and that you now better understand some of the levers and switches you have available to secure the service. As I progress in my learning of the service and AI in general, I plan on adding posts that walk through the implementation in action and take a deeper dive into how this architecture looks when implemented. I have it running in my demo environment, but time is a very limited thing these days.
Thanks folks and I hope your journey into AI has been as fun as mine has been so far!
Check out my other posts on this service: