[vc_row][vc_column][vc_column_text]In the world of cyber security, threats fall into two groups: known and unknown. Known threats are ones that have already been identified and for which remediation methods exist. Unknown threats are the ones you don’t yet know about, such as zero-day threats. These are the most difficult threats to detect unless you have a mature User and Entity Behavior Analytics (UEBA) solution. One of the most valuable capabilities of a UEBA solution is its ability to predict, detect, and stop both known and unknown threats.
Cyberattack types run the gamut: Trojan horses, adware, computer worms, botnets, DoS and DDoS attacks, phishing, rootkits, SQL injection attacks, man-in-the-middle attacks and more. Known threats are easy to detect because you know what to look for. Granted, new threats are constantly popping up, but once they are discovered you can add them to your repertoire.
The most common way to detect a known threat is to write a rule that filters for it, since you already know what you’re looking for. This is how SIEMs work: you write rules to look for specific threats, which means the rules have to be updated and maintained. It’s slow, ineffective and laborious.
You can certainly write rules with machine learning models. That is one approach: to create a machine learning model that is a rule or a set of rules. It can filter out known threats, for example a Trojan or a botnet, by looking for specific keywords in the dataset.
The difference between a SIEM and machine learning is that you can collapse hundreds of rules into a set of features from a machine learning model. With SIEMs, you’re looking at writing SQL queries that are essentially “if-then” statements. If I see this data, then it’s a botnet. And these SQL queries are executed in a sequence, not all at once. In contrast, a machine learning model can track probabilities on all behaviors simultaneously. Machine learning collapses all the “if-then” rules into a mathematical function.
The second approach to finding known threats is algorithmic. You create a machine learning model where the known threats are tagged as bad behavior, essentially as malware, and the model learns the difference between bad behavior and benign behavior.
Let’s look at an example. Let’s say you want to train a machine learning model to distinguish between a dog and a cat. You have a dataset made up of dog images and cat images. You map the data and train the model to learn which of the images are dogs, and which are cats. This is a machine learning approach to distinguishing dogs from cats. Let’s now say that cats are malware. Then the model will be able to distinguish between malware (cats) and non-malware (dogs).
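To make the dog-versus-cat idea concrete, here is a minimal sketch of a model that learns from labeled examples. It assumes a nearest-centroid classifier as a stand-in for whatever model a real product would use, and the two features (failed logins per hour, megabytes uploaded) and all of the sample values are invented for illustration.

```python
from statistics import mean

# Hypothetical labeled training data: each sample is
# (failed_logins_per_hour, megabytes_uploaded). Invented for illustration.
benign = [(1, 5), (0, 8), (2, 6), (1, 4)]
malicious = [(30, 220), (25, 180), (40, 260), (35, 200)]

def centroid(samples):
    """Average each feature across the samples of one class."""
    return tuple(mean(column) for column in zip(*samples))

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# "Training" here is just summarizing each labeled class by its centroid.
centroids = {"benign": centroid(benign), "malicious": centroid(malicious)}

def classify(sample):
    """Return the label whose centroid is closest to the sample."""
    return min(centroids, key=lambda label: squared_distance(sample, centroids[label]))

print(classify((2, 7)))      # a sample close to the benign examples
print(classify((28, 210)))   # a sample close to the malicious examples
```

The labels do the work here: the model never sees a hand-written rule, only examples tagged as benign or malicious.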
Rules are very effective when you want to cherry-pick data. Let’s say you want to look for attacks after midnight. Your rule will automatically pick out everything after midnight. But what happens if the attacker comes in at 11:59 pm? Your rule is going to miss that attacker. If you use a machine learning model, however, it can detect slight variations because it’s working with probabilities in the data, not absolute hard numbers. The machine learning model would detect the attacker coming in at 11:59 pm where the rule would not.
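The midnight example can be sketched in a few lines. This is an illustration under stated assumptions, not a production model: the login history is invented, the 05:00 cutoff for the rule is arbitrary, and a simple standard-deviation score stands in for the probabilities a trained model would produce.

```python
from statistics import mean, stdev

# Hypothetical history of one user's login times, in minutes since midnight
# (roughly 08:30 to 18:00). Invented for illustration.
history = [510, 525, 540, 555, 600, 720, 840, 930, 1020, 1080]

def rule_flags(minute):
    """Hard rule: flag logins from midnight up to 05:00 (assumed cutoff)."""
    return 0 <= minute < 300

mu, sigma = mean(history), stdev(history)

def anomaly_score(minute):
    """Soft score: distance from the usual login time, in standard deviations.

    Simplification: time is treated as a straight line, not a 24-hour circle.
    """
    return abs(minute - mu) / sigma

def model_flags(minute, threshold=2.5):
    """Flag logins that are unusually far from this user's normal hours."""
    return anomaly_score(minute) > threshold

for label, minute in [("11:59 pm", 1439), ("12:01 am", 1), ("10:00 am", 600)]:
    print(label, "rule:", rule_flags(minute), "model:", model_flags(minute))
```

The rule catches 12:01 am and misses 11:59 pm, because the boundary is absolute. The score treats both as far outside this user's normal hours, because it measures how unusual a login is instead of which side of a line it falls on.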
Think of it this way: In a SQL query you’re going to select columns A and B and there’s a “where” clause, for example “where column A is less than some value, and column B is between some values.” If you have only two columns, with a few distinct values in them, then your “and” and “or” logic becomes very easy. If A=n and B<t or B>x, then… that’s not hard to do. But when you start adding more variants to the data, your SQL query becomes more complicated. Now let’s say you add another column of data. It’s even worse. Now add 3 or 4 more columns. Your SQL script is unreadable. It’s virtually impossible to manage. Rules don’t work at scale.
Here’s an analogy: you’re conducting a census to count the number of people named “James”. If you’re looking at one building, you can do a manual count of people named “James” in that one building. But now let’s say there are 50,000 buildings. You’re going to run into a problem manually counting people named “James” in 50,000 buildings. You’re better off taking a small sample and approximating. That’s what machine learning does: it’s essentially using statistics at the core.
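The census analogy translates directly into code. The sketch below builds a made-up population of 50,000 buildings, counts every “James” the slow way, and then estimates the same number from a random sample of 500 buildings. The name list, building sizes and sample size are all assumptions chosen for illustration.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 50,000 buildings, each a list of resident names.
NAMES = ["James", "Maria", "Wei", "Aisha", "Olga", "Raj", "Lena", "Omar"]
buildings = [
    random.choices(NAMES, k=random.randint(20, 60))
    for _ in range(50_000)
]

def count_james(building):
    """Count residents named James in a single building."""
    return sum(1 for name in building if name == "James")

# Exact answer: visit every building.
exact = sum(count_james(b) for b in buildings)

# Estimate: visit a random sample of 500 buildings and scale up.
sample = random.sample(buildings, 500)
estimate = sum(count_james(b) for b in sample) / len(sample) * len(buildings)

relative_error = abs(estimate - exact) / exact
print(f"exact={exact}, estimate={estimate:.0f}, error={relative_error:.1%}")
```

The estimate visits 1% of the buildings, and the relative error it prints shows how much accuracy that shortcut costs on this synthetic data.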
So, how would machine learning detect phishing attacks? An email analytics machine learning model would look at URLs within emails to identify bad URLs indicative of phishing attacks. A NetFlow machine learning model would look at NetFlow and packet data to identify botnet and Denial of Service attacks. A permission grouping model would detect insider threats as permissions and admin rights start to change. Machine learning models are tailored to detect specific threats. Here are some examples of Gurucul’s machine learning models.
Detecting unknown threats is where a mature UEBA platform really shines. And ironically, it detects unknown threats the same way it detects known threats.
A rule is specifically looking for a known threat. With machine learning, we invert the problem. We train the model so that its baseline is all the good behavior. Anytime a fluctuation happens, that is an anomalous signal and the model will flag it. Any sufficient deviation from the known patterns is going to get flagged as a threat.
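Inverting the problem can be sketched as follows. Instead of listing bad behaviors, this example records only normal activity and flags anything that was rare or absent in that record. It assumes a simple frequency baseline in place of a real trained model, and the users, actions and threshold are hypothetical.

```python
from collections import Counter

# Hypothetical baseline of normal activity: (user, action) pairs observed
# during a period assumed to be free of attacks. Invented for illustration.
baseline_events = (
    [("alice", "read_report")] * 40
    + [("alice", "send_email")] * 55
    + [("bob", "deploy_build")] * 30
    + [("bob", "read_report")] * 5
)

counts = Counter(baseline_events)
total = sum(counts.values())

def probability(event):
    """Smoothed share of baseline activity matching this event.

    Add-one smoothing, with one extra bucket reserved for unseen events,
    gives never-seen behavior a small but nonzero probability.
    """
    return (counts[event] + 1) / (total + len(counts) + 1)

def is_anomalous(event, threshold=0.02):
    """Flag events that were rare or absent in the baseline."""
    return probability(event) < threshold

print(is_anomalous(("alice", "send_email")))     # common in the baseline
print(is_anomalous(("bob", "export_database")))  # never seen in the baseline
```

Nothing in this code names a specific attack. A behavior gets flagged only because it does not look like what came before, which is why the same mechanism can surface threats nobody has catalogued yet.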
Here’s an analogy. Let’s say you go to the doctor and are hooked up to an EKG machine. The machine records the electrical activity of your heart and compares it against set parameters. If your heartbeat starts spiking, then you know that something is wrong.
Similarly, when we create a machine learning model on a dataset, it learns all the patterns that are normal. When unseen data that represents an unknown threat passes through, it creates a spike that is easily detected. The key is being able to distinguish between unusual behavior and true threats. That’s where context comes into play. The more context-rich data you have, the more effective your UEBA solution will be at detecting and stopping unknown threats.
An advanced UEBA like Gurucul’s UEBA detects known and unknown threats using machine learning models on big data.
[/vc_column_text][/vc_column][/vc_row]