Hello again my fellow geeks.
Welcome to part two of my series on visualizing Office 365 security logs. In my last post I walked through the process of getting the sign-in and security logs and provided a link to some Lambdas I put together to automate pulling them down from Microsoft Graph. Recall that the Lambda stores the files in raw format (with a small bit of transformation on the time stamps) in Amazon S3 (Simple Storage Service). For this demonstration I modified the Lambda's parameters to download the last 30 days of sign-in logs and store them in an S3 bucket I use for blog demos.
When the logs are pulled from Microsoft Graph they come down in JSON (JavaScript Object Notation) format. Love it or hate it, JSON is the common standard for exchanging information these days. The schema for the JSON representation of the sign-in logs is fairly complex and deeply nested because there is a ton of great information in there. Thankfully, Microsoft has done a wonderful job of documenting the schema. Now that we have the logs and the schema, we can start working with the data.
When I first started this effort I put together a Python function that transformed the files into a CSV using pipe delimiters. As soon as I finished the function I wondered if there was an alternative way to handle it. In comes Amazon Athena to the rescue with its OpenX JSON SerDe library. After reading through a few blogs (great AWS blog here), Stack Overflow posts, and the official AWS documentation, I was ready to put something together myself. After some trial and error I had a working DDL (Data Definition Language) statement for the data structure. I've made the DDLs available on GitHub.
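To give you a sense of what that looks like, here is a trimmed-down sketch of the kind of DDL Athena wants for this data. The full DDLs on GitHub cover the complete schema; the S3 location below is a placeholder you would swap for your own bucket, and the types shown for this handful of fields are my best guesses rather than the definitive schema.

CREATE EXTERNAL TABLE tbl_signin_demo (
  -- Each file the Lambda writes is one Graph response: a single "value" array of sign-in records
  value array<struct<
    id:string,
    createddatetime:string,
    userprincipalname:string,
    userdisplayname:string,
    appdisplayname:string,
    ipaddress:string,
    clientappused:string,
    conditionalaccessstatus:string,
    devicedetail:struct<browser:string, deviceid:string, iscompliant:boolean, ismanaged:boolean, operatingsystem:string, trusttype:string>,
    location:struct<city:string, state:string, countryorregion:string, geocoordinates:struct<altitude:double, latitude:double, longitude:double>>,
    status:struct<errorcode:int, failurereason:string, additionaldetails:string>
  >>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-demo-bucket/signinlogs/';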
Once I had the schema defined, I created the table in Athena. The official AWS documentation does a fine job of explaining the few clicks it takes to create a table, so I won't re-create that here. The DDLs I've provided above will make it a quick and painless process for you.
Let’s review what we’ve done so far. We’ve set up a recurring job that pulls the sign-in and audit logs via the API and dumps all that juicy data into cheap object storage, to which we can also apply lifecycle policies. We’ve then defined the schema for the data and made it available via standard SQL queries. All without provisioning a server, and for pennies on the dollar. Not too shabby!
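If you want to confirm everything is wired up before moving on, a quick sanity check in the Athena console might look something like this (using the database and table names you'll see me use below; yours will differ):

-- Each row in the table is one JSON file the Lambda wrote; each file holds an array of sign-in records
SELECT COUNT(*) AS log_files,
       SUM(cardinality(value)) AS signin_records
FROM "azuread"."tbl_signin_demo";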
At this point you can use your analytics tool of choice, whether it be QuickSight, Tableau, Power BI, or any of the other tools that have flooded the market over the past few years. Since I don’t make any revenue from these blog posts, I like to go the cheap and easy route of using Amazon QuickSight.
After completing the initial setup of QuickSight I was ready to go. The next step was to create a new data set. For that I clicked the Manage Data button and selected New Data Set.
On the Create a Data Set screen I selected the Athena option and created a name for the data source.
From there I selected the database in Athena, which for me was named azuread. The tables within the database then populated and I chose tbl_signin_demo, which points to the test S3 bucket I mentioned previously.
Due to the complexity of the data structure I opted to use a custom SQL query. There is no reason why you couldn’t create the table I’m about to create in Athena and connect to that table instead, to make it more consumable for a wider array of users. It’s really up to you, and I honestly don’t know what the appropriate “big data” way of doing it is. Either way, those of you with real SQL skills may want to look away from this query lest you experience a Raiders of the Lost Ark moment.
You were warned.
SELECT
  records.id,
  records.createddatetime,
  records.userprincipalname,
  records.userDisplayName,
  records.userid,
  records.appid,
  records.appdisplayname,
  records.ipaddress,
  records.clientappused,
  records.mfadetail.authdetail AS mfadetail_authdetail,
  records.mfadetail.authmethod AS mfadetail_authmethod,
  records.correlationid,
  records.conditionalaccessstatus,
  records.appliedconditionalaccesspolicy.displayname AS cap_displayname,
  array_join(records.appliedconditionalaccesspolicy.enforcedgrantcontrols, ' ') AS cap_enforcedgrantcontrols,
  array_join(records.appliedconditionalaccesspolicy.enforcedsessioncontrols, ' ') AS cap_enforcedsessioncontrols,
  records.appliedconditionalaccesspolicy.id AS cap_id,
  records.appliedconditionalaccesspolicy.result AS cap_result,
  records.originalrequestid,
  records.isinteractive,
  records.tokenissuername,
  records.tokenissuertype,
  records.devicedetail.browser AS device_browser,
  records.devicedetail.deviceid AS device_id,
  records.devicedetail.iscompliant AS device_iscompliant,
  records.devicedetail.ismanaged AS device_ismanaged,
  records.devicedetail.operatingsystem AS device_os,
  records.devicedetail.trusttype AS device_trusttype,
  records.location.city AS location_city,
  records.location.countryorregion AS location_countryorregion,
  records.location.geocoordinates.altitude,
  records.location.geocoordinates.latitude,
  records.location.geocoordinates.longitude,
  records.location.state AS location_state,
  records.riskdetail,
  records.risklevelaggregated,
  records.risklevelduringsignin,
  records.riskstate,
  records.riskeventtypes,
  records.resourcedisplayname,
  records.resourceid,
  records.authenticationmethodsused,
  records.status.additionaldetails,
  records.status.errorcode,
  records.status.failurereason
FROM "azuread"."tbl_signin_demo"
CROSS JOIN (UNNEST(value) AS t(records))
This query will de-nest the data and give you a detailed (possibly extremely large depending on how much data you are storing) parsed table. I was now ready to create some data visualizations.
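If you would rather go the route I mentioned earlier of materializing this flattened data as its own table in Athena instead of pasting the query into QuickSight, a CTAS (Create Table As Select) statement along these lines should do it. Treat this as a sketch: the new table name and S3 location are placeholders of my own invention, and writing the results out as Parquet is just one reasonable choice.

CREATE TABLE azuread.tbl_signin_flat
WITH (
  format = 'PARQUET',
  external_location = 's3://your-demo-bucket/signin-flat/'
) AS
SELECT
  records.id,
  records.createddatetime,
  records.userprincipalname,
  -- ...paste in the rest of the SELECT list from the query above...
  records.status.errorcode,
  records.status.failurereason
FROM "azuread"."tbl_signin_demo"
CROSS JOIN (UNNEST(value) AS t(records));

Connecting QuickSight (or any other tool) to that table is then a plain SELECT, and the folks consuming it never have to care how the nesting got flattened.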
The first visual I made was a geospatial visual using the location data included in the logs, filtered to failed logins. Not surprisingly, our friends in China have shown a real interest in my wife’s and my Office 365 accounts.
Next up I was interested in seeing if there were any patterns in the frequency of the failed logins. For that I created a simple line chart showing the number of failed logins per user account in my tenant. Interestingly enough the new year meant back to work for more than just you and me.
Like I mentioned earlier, Microsoft provides a ton of great detail in the sign-in logs. Beyond just location, they also provide reasons for login failures. I next created a stacked bar chart to show the different reasons for failed logins by user. I found the blocked sign-ins by malicious IPs interesting. It’s nice to know that is being tracked and taken care of.
Failed logins are great, but the other thing I was interested in was successful logins and user behavior. For this I created a vertical stacked bar chart that displays successful logins by user by device operating system (yet more great data captured in the logs). You can tell from the bar on the right that my wife is a fan of her Mac!
As I gather more data I plan on creating some more visuals, but this was a great start. The geospatial one is my favorite. If you have access to a larger data set with a diverse set of users, your data should prove fascinating. Definitely share any graphs or interesting data points you end up putting together if you opt to do some of this analysis yourself. I’d love some new ideas!
That will wrap up this series. As you’ve seen, the modern tool sets available to you can do some amazing things for cheap, without forcing you to maintain the infrastructure behind them. Vendors are also doing a wonderful job of providing a metric ton of data in their logs. If you take the initiative to understand the product and the data, you can glean some powerful information that has both security and business value. Even better, you can create some simple visuals to communicate that data to a wide variety of audiences, making it that much more valuable.
Have a great weekend!