ABCs of UEBA: M is for Machine Learning

If log data is the lifeblood of User and Entity Behavior Analytics (UEBA), then Machine Learning (ML) is the brain. Machine learning algorithms ingest data feeds and turn raw data into risk-prioritized intelligence. That is the value of ML in the world of behavior analytics. And it happens in real time, on big data, across all users and entities in the network.

Machine Learning in Action: How Does it all Work?

How does machine learning work as it relates to UEBA? It’s essentially a process that involves the following four phases:

  1. Collect user and entity event data
  2. Perform statistical analysis on the data to figure out which fields are usable
  3. Create variables/features that can capture the information encoded in the data
  4. Apply candidate algorithms to determine which one best fits the data
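The four phases above can be sketched in a few lines of Python. This is a hypothetical, simplified illustration: the event records, field names, and the trivial "model" in phase 4 are all invented for the example, not taken from any real UEBA product.

```python
# Illustrative sketch of the four UEBA machine-learning phases.
# Event records and field names are made up for this example.

from collections import Counter

# Phase 1: collect user and entity event data (here, fake login events)
events = [
    {"user": "alice", "host": "db01", "hour": 9},
    {"user": "alice", "host": "db01", "hour": 10},
    {"user": "bob",   "host": "web02", "hour": 14},
    {"user": "alice", "host": "hr-srv", "hour": 3},
]

# Phase 2: statistical analysis -- which fields carry usable signal?
# A field with only one distinct value carries no information.
usable = [f for f in events[0]
          if len({e[f] for e in events}) > 1]

# Phase 3: feature engineering -- encode each event as numeric features
host_counts = Counter(e["host"] for e in events)
features = [(e["hour"], host_counts[e["host"]]) for e in events]

# Phase 4: fit a model; a trivial "rare host at an odd hour" rule
# stands in here for the real algorithm-selection step.
flags = [hour < 6 and count == 1 for hour, count in features]

print(usable)   # all three fields vary
print(flags)    # only alice's 3 a.m. login to hr-srv is flagged
```

A real platform would run these phases continuously over streaming log data rather than a static list, but the shape of the pipeline is the same.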

Types of Machine Learning Algorithms

There are two main classes of machine learning algorithms: unsupervised learning algorithms and supervised learning algorithms. Within each class there are a number of algorithms. There is no “best” algorithm. It all depends on the data and your goal. Let’s look at a few that apply to UEBA. This is a non-exhaustive list:

  1. Unsupervised Learning Algorithms – characterized by clustering and groups
    • K-means
    • Hierarchical Clustering
    • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
    • Local Outlier Factor
    • One-Class SVM
  2. Supervised Learning Algorithms – characterized by tags (labels)
    • Linear Regression: fits a line
    • Logistic Regression: fits a curve
    • Decision Trees
    • Neural Networks: weights continually adjusted during training
    • Naïve Bayes

Supervised vs Unsupervised: Which Algorithm to use?

When should you use a supervised learning algorithm versus an unsupervised learning algorithm? When your dataset has markers or tags, a supervised learning algorithm is the easier choice, because the tags tell the algorithm exactly what it should learn to fit.

Take, for example, a credit card statement. How do you figure out which charges are fraudulent? If the charges are tagged, then you can use a supervised learning algorithm. Fraudulent charges would be tagged as "1" and legitimate charges as "0". In this case, the supervised learning algorithm learns to distinguish between fraud and non-fraud. If the data is not tagged, however, you would need to use an unsupervised learning algorithm to identify the fraudulent transactions.
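The tagged credit card scenario can be sketched with a tiny logistic regression trained by gradient descent. This is a minimal, self-contained sketch, not production fraud detection: the transactions, features (amount and a foreign-merchant flag), and labels are all invented for the example.

```python
import math

# Tagged transactions: (amount, is_foreign); label 1 = fraud, 0 = legitimate.
# All values here are fabricated for illustration.
X = [(12.0, 0), (25.0, 0), (900.0, 1), (15.0, 0), (850.0, 1), (30.0, 0)]
y = [0, 0, 1, 0, 1, 0]

# Scale amounts to keep gradient descent numerically stable
X = [(amount / 1000.0, foreign) for amount, foreign in X]

w = [0.0, 0.0]  # one weight per feature
b = 0.0
lr = 1.0

def predict(x):
    """Probability that transaction x is fraud (sigmoid of a linear score)."""
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-z))

# Train: nudge weights against the prediction error for each tagged example
for _ in range(2000):
    for xi, yi in zip(X, y):
        err = predict(xi) - yi
        w[0] -= lr * err * xi[0]
        w[1] -= lr * err * xi[1]
        b -= lr * err

# A large foreign charge scores high; a small domestic one scores low
print(predict((0.9, 1)) > 0.5)   # True
print(predict((0.02, 0)) > 0.5)  # False
```

The tags are what make this possible: the error term `predict(xi) - yi` only exists because each transaction carries a known label to compare against.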

Think of it like this: when you are training a model to identify cat and dog images, initially the algorithm does not know the difference between a cat and a dog until you tell it. A tag is a label. It gives the algorithm a goal to adjust parameters around. And labels help supervised learning algorithms find patterns in tagged data.

When you label user and entity behaviors, supervised learning algorithms learn how to distinguish between good and malicious behavior. Generally, there are multiple supervised learning algorithms people use. The most common in use today are linear regression, logistic regression, and deep neural networks. These are the typical supervised machine learning algorithms used with UEBA platforms, listed here from the simplest to the most complex.

Unsupervised learning algorithms are characterized by clustering and groups. You can run an unsupervised learning algorithm to "learn" which data points are similar and which ones are not. For example, let's say someone is breaking into a machine. We quite literally don't know what is going on, so we look at the machine logs. We start sampling the data and breaking it down into frequency counts, histograms and time series to see where the averages are. When you pass data through an unsupervised learning algorithm (for example, K-means), it clusters similar data points together. Any outlier points will typically wind up in the smallest cluster. And that is where we find outlier behavior.
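Here is a minimal sketch of that idea: a one-dimensional K-means implemented from scratch, run over made-up daily event counts for a machine. The data, the cluster count, and the "smallest cluster = outliers" heuristic are all illustrative assumptions, not a production algorithm.

```python
# Minimal 1-D K-means sketch (illustrative, not production code) showing
# how an outlier tends to land in the smallest cluster.

def kmeans_1d(points, k, iters=20):
    # Spread the initial centers across the sorted data
    step = max(1, len(points) // k)
    centers = sorted(points)[::step][:k]
    clusters = []
    for _ in range(iters):
        # Assign each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return clusters

# Fabricated daily event counts for a machine; one day is wildly anomalous
counts = [98, 101, 99, 103, 100, 97, 500]
clusters = kmeans_1d(counts, k=2)
outliers = min(clusters, key=len)
print(outliers)  # the anomalous day ends up alone in the smallest cluster
```

The normal days settle into one large cluster around 100, while the anomalous day forms its own tiny cluster, exactly the pattern described above.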

Let’s look at another example: say you want to cluster children in a classroom by height. Children with similar heights will be clustered into the same group. You’ll always find an odd person who is either very tall or very short, and these individuals will stand out and form their own clusters. They will be tagged as outliers since those clusters are so small. You can use this mechanism to tag other similar data.

Machine Learning Monitors User and Entity Behaviors

Machine learning algorithms are used to create models. Gurucul uses machine learning models to monitor user and entity behavior at scale. Take SSH logs. If you analyze SSH logs using a clustering algorithm, you will likely see the same user logging into the same machine or group of machines at approximately the same time(s) every day. However, if this user suddenly logs into a different machine, the clustering algorithm will place this new machine into its own cluster – as an outlier. This behavior is far from the normal behavior exhibited by the user, which is an example of how ML identifies anomalous behavior. The real question is: how risky is this behavior? To ascertain whether anomalous behavior is malicious, we look at additional context. What else is this user doing? What are users in his peer group doing?
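The SSH example boils down to building a per-user baseline of target hosts and flagging logins that fall outside it. The sketch below uses invented usernames, hostnames, and a deliberately naive set-membership baseline; a real model would also weigh time of day, frequency, and peer-group behavior.

```python
# Sketch: per-user baseline of SSH target hosts from historical logs,
# then flag logins outside that baseline. Users and hosts are hypothetical.

from collections import defaultdict

history = [
    ("alice", "app01"), ("alice", "app02"), ("alice", "app01"),
    ("bob", "db01"), ("bob", "db01"),
]

baseline = defaultdict(set)
for user, host in history:
    baseline[user].add(host)

def is_outlier(user, host):
    """A login to a host this user has never touched is an outlier."""
    return host not in baseline[user]

print(is_outlier("alice", "app02"))   # False: within her normal cluster
print(is_outlier("alice", "hr-srv"))  # True: new machine, flag for review
```

Note that the flag alone does not answer the "how risky is this?" question; it only surfaces the anomaly so additional context can be gathered.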

When we look at user and entity activity with machine learning models, we use multiple algorithms to get to the truth. For example, Time Series Analysis refers to regression algorithms that use time dependencies within the data. Is the user operating out of bounds – in a certain hour or range of time? Combine that information with results from a K-means algorithm that looks at groups of machines being targeted, and you have context for determining that the user is working off hours, on a system update for example.
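A simple version of the "out of bounds in time" check is to baseline a user's login hours and flag hours that deviate sharply from that baseline. The login hours and the three-sigma threshold below are illustrative assumptions standing in for a real time series model.

```python
# Sketch: flag login hours far outside a user's historical pattern.
# Hours and threshold are fabricated for illustration.

import statistics

# This user normally logs in during business hours
login_hours = [9, 10, 9, 11, 10, 9, 10]

mu = statistics.mean(login_hours)
sigma = statistics.stdev(login_hours)

def out_of_bounds(hour, z=3.0):
    """Flag hours more than z standard deviations from the user's norm.
    A floor on sigma avoids false alarms from overly tight baselines."""
    return abs(hour - mu) > z * max(sigma, 1.0)

print(out_of_bounds(10))  # False: a normal morning login
print(out_of_bounds(3))   # True: a 3 a.m. login is out of bounds
```

Combining this signal with a clustering result over the machines targeted, as described above, is what turns an isolated anomaly into interpretable context.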

How Machine Learning Predicts and Detects Insider Threats

Predicting, detecting, and stopping insider threats is a key UEBA use case. Here is where the machine learning rubber meets the road, as they say. Given the appropriate log data, machine learning algorithms can detect when outsiders gain unauthorized access, which is an account compromise insider threat.

How do machine learning models predict malicious insiders? One example is using ML models to perform sentiment analysis on email logs. Sentiment analysis data mines emails to see if someone is going to go off the deep end, so you can stop them before they do. You cannot scan the content of the emails due to data privacy laws, but you can scan the email subject lines, attachments, sender and recipient details. This usually gives you enough information to see what’s going on.
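A toy version of that subject-line analysis is keyword scoring: score each subject against a watch list of hostile terms and flag the high scorers. The keyword list, subjects, and threshold below are invented for illustration; real sentiment models are statistical, not simple keyword matches, but they respect the same constraint of never reading message bodies.

```python
# Toy sentiment check over email subject lines only (no body content,
# consistent with the privacy constraint). All data here is fabricated.

NEGATIVE = {"quit", "hate", "lawsuit", "revenge", "fired"}

subjects = [
    "Q3 roadmap review",
    "I hate this place and I want revenge",
    "Lunch on Friday?",
]

def sentiment_score(subject):
    """Count distinct watch-list words appearing in the subject line."""
    words = set(subject.lower().split())
    return len(words & NEGATIVE)

# Require two or more hits to reduce noise from a single strong word
flagged = [s for s in subjects if sentiment_score(s) >= 2]
print(flagged)  # only the hostile subject line is flagged
```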

In one use case with a customer, we were doing sender and recipient checks on pairs. All the outliers that jumped out were users who were emailing source code to outside accounts. To add to the context, we also looked at source code repository logs and discovered that a particular user had been downloading complete source code trees, which was very unusual behavior. As you can imagine, that user was quietly escorted from the premises.
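The sender-recipient pair check can be sketched as a frequency count: pairs that occur often are routine correspondence, while pairs seen only once are candidates for review. The senders, recipients, and the "seen once" heuristic below are invented for the example.

```python
# Sketch: surface rare sender-recipient pairs from email logs.
# All (sender, recipient) pairs here are hypothetical.

from collections import Counter

pairs = [
    ("dev1", "dev2"), ("dev1", "dev2"), ("dev1", "dev2"),
    ("dev2", "dev1"), ("dev2", "dev1"),
    ("dev3", "freemail-account"),
]

counts = Counter(pairs)

# Pairs seen only once are candidate outliers worth investigating
rare = [pair for pair, n in counts.items() if n == 1]
print(rare)  # the one-off email to an outside account stands out
```

As in the story above, the rare pair on its own is only a lead; correlating it with other log sources (such as repository downloads) is what builds the case.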

The greater the potential damage, the more critical it is to employ ML to predict and prevent that damage. In the technology sector, for example, senior managers are notorious for taking intellectual property with them when they move to a competitor. ML takes politics out of the equation and flattens employment hierarchies. If you’re an insider threat, Gurucul UEBA will detect your bad behavior, whether you’re a CEO or an administrator. It’s just data science to us.

How Machine Learning Detects Unknown Threats

We talked about the misuse of personnel badges in our previous blog, “ABCs of UEBA: L is for Logs”. That’s one example of how machine learning can detect unknown threats. In another case, we found multiple logins on the same computer from different people in different parts of the company. It turns out that an employee was sharing her login with her manager. Was that an anomaly? Yes. But was it an actual threat? Unclear. However, we detected the anomaly with ML and the behavior was certainly unknown to our customer.

In another case, there was some unusual activity in the system logs and our customer could not figure out what was going on. We ran analytics on the logs and found that someone was logging into multiple accounts using the same cell phone, but from different locations. One was in New York and the other was in Boston. We saw the spike and reported it. It was a complete unknown. It may have been a cloned cell phone, but whatever it was, it wasn’t acceptable behavior. They closed the account and that was that!

Learn more about how Gurucul’s behavior based security analytics implements machine learning models for advanced threat detection and prevention by reading this blog post.

Customize Machine Learning Models with Gurucul STUDIO™

Sometimes, out-of-the-box machine learning models will not yield the most effective results. In such cases, it’s a huge benefit to leverage a UEBA platform that gives you the ability to customize ML models or build your own. Gurucul STUDIO enables you to create custom ML models without coding and with minimal data science knowledge. Gurucul STUDIO provides a step-by-step graphical interface to select attributes, train models, create baselines, set prediction thresholds and define feedback loops. It supports an open choice for big data and a flex data connector to ingest any on-premises or cloud data source for desired attributes. It also provides an analytics Software Development Kit (SDK) that allows you to build models outside the platform (in Python, Java, whatever you like) and import them into Gurucul UEBA. Contact us to see a demo or for more information on our UEBA platform.
