Logs are the lifeblood of User and Entity Behavior Analytics platforms. The more relevant log data ingested, the better the efficacy of the analytics. Take vegetable soup as an analogy. If you make it with zucchini and carrots only, it’s not so tasty. But if you add peas, corn, green beans, leeks, celery, tomatoes, potatoes, bay leaves, olive oil, lemon juice and thyme – then you have yourself a delicious meal. It’s the same with UEBA and logs. You need a variety of log sources to produce appealing analytics.
Logs Add Context
Log sources provide the context machine learning algorithms need to generate intelligent output. Connecting the dots across the various log sources is how UEBA delivers value. As a simple example, let’s say Jerry Brown has to use a badge scanner to gain entry to his office:
- At 8:50am, the badge scanning system logs Jerry as swiping his badge to enter the office.
- Also at 8:50am, the VPN logs show Jerry Brown logging into the corporate network from a remote location.
- Jerry can’t be in the office and remotely logging into the network at the same time.
- Aggregating the log data from both systems is the only way to surface this anomalous VPN activity.
If the physical badging system logs weren’t available, this account compromise event might have gone unnoticed.
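Here is a minimal sketch of that correlation in Python. The event dictionaries, field names, user IDs, dates and the 10-minute window are all assumptions made up for illustration; real badge and VPN logs have their own formats, and this is not how any particular UEBA product implements the check.

```python
from datetime import datetime, timedelta

# Hypothetical, simplified events. Real badge and VPN logs look different.
badge_events = [
    {"user": "jbrown", "time": datetime(2024, 1, 8, 8, 50), "action": "enter_office"},
]
vpn_events = [
    {"user": "jbrown", "time": datetime(2024, 1, 8, 8, 50), "source": "remote"},
    {"user": "asmith", "time": datetime(2024, 1, 8, 9, 15), "source": "remote"},
]


def conflicting_logins(badge_events, vpn_events, window=timedelta(minutes=10)):
    """Return (badge, vpn) pairs where the same user badges into the office
    and logs in remotely over VPN within `window` of each other."""
    conflicts = []
    for badge in badge_events:
        for vpn in vpn_events:
            same_user = badge["user"] == vpn["user"]
            close_in_time = abs(badge["time"] - vpn["time"]) <= window
            if same_user and vpn["source"] == "remote" and close_in_time:
                conflicts.append((badge, vpn))
    return conflicts


conflicts = conflicting_logins(badge_events, vpn_events)
for badge, vpn in conflicts:
    print(f"{badge['user']}: badge swipe and remote VPN login within 10 minutes")
```

Neither log is suspicious on its own. The anomaly only appears once the two sources are joined on user and time.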
More Data Beats Better Algorithms
Anand Rajaraman taught a popular course on data mining at Stanford University. He wrote a blog post on his Datawocky site called “More data usually beats better algorithms”. It makes the point that in data science, data is what matters most. His students entered a Netflix Challenge, where they tried their hand at building a better movie recommendation algorithm than the one developed by Netflix. Team A created a sophisticated algorithm using Netflix data. Team B used a simple algorithm, but added movie genre data from the IMDb database. Guess what? Team B got much better results! It is a single example, but it illustrates how adding a relevant data source can matter more than refining the algorithm.
Logs Tell a Story
Logs should capture the following data:
- The event that occurred
- The process that triggered the event
- Any additional data, if relevant. Additional data might include a description of the event. The more descriptive the better.
If fields are missing from your log data, the behavior models that depend on them won’t work.
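One way to catch this early is to check records for empty fields before they reach the models. The sketch below assumes each log record has already been parsed into a dictionary; the field names `event`, `process` and `details` are hypothetical stand-ins for the three items listed above.

```python
# Assumed field names for illustration; "details" is treated as optional.
REQUIRED_FIELDS = ("event", "process")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a log record."""
    return [field for field in required if not record.get(field)]


records = [
    {"event": "login", "process": "sshd", "details": "Accepted publickey for jbrown"},
    {"event": "login", "process": "", "details": ""},
]

# Pair each incomplete record's index with the fields it is missing.
incomplete = [
    (index, missing_fields(record))
    for index, record in enumerate(records)
    if missing_fields(record)
]
print(incomplete)
```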
Let’s look at system-level logs. Operating system logs track events like SSH logins and logouts, and they show when users move from one machine to another. System logs also cover the applications users access, like Git, a version control system used by software developers. The logs will tell you what time a user logs into and out of Git, and from what IP address. This information makes it easier to track who is accessing Git from which machines.
If you have malware on your machine, and it is automatically trying to log in to other machines on the network, a mature UEBA solution would recognize it as a brute force attack. How? The UEBA platform would look at the login times and notice an unusual number of login attempts against multiple machines within a very short time span. We ran analytics on medical device logs for one of our customers and determined that a medical device had malware on it. It was making short burst transmissions across the network, essentially hopping across networks. The logs showed login attempts from machine to machine in succession. Looking at the log data, specifically the time delta between login attempts, we found that the attempts came faster than a human being could type.
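The time-delta idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `(timestamp, source, target)` tuple layout, the device and host names, the one-second gap and the three-host minimum. It is a simple rule, not the analytics we ran for the customer.

```python
from datetime import datetime, timedelta


def rapid_login_bursts(attempts, max_gap=timedelta(seconds=1), min_hosts=3):
    """attempts: list of (timestamp, source, target_host) tuples.

    Return the sources that have a run of consecutive login attempts, each
    within `max_gap` of the previous one, reaching at least `min_hosts`
    distinct target hosts.
    """
    by_source = {}
    for ts, source, target in sorted(attempts):
        by_source.setdefault(source, []).append((ts, target))

    flagged = set()
    for source, events in by_source.items():
        run_hosts = {events[0][1]}
        for (prev_ts, _), (ts, target) in zip(events, events[1:]):
            if ts - prev_ts <= max_gap:
                run_hosts.add(target)   # still inside the same burst
            else:
                run_hosts = {target}    # gap too long, start a new run
            if len(run_hosts) >= min_hosts:
                flagged.add(source)
    return flagged


t0 = datetime(2024, 1, 8, 2, 0, 0)  # made-up start time
attempts = [
    (t0, "device-17", "host-a"),
    (t0 + timedelta(milliseconds=200), "device-17", "host-b"),
    (t0 + timedelta(milliseconds=400), "device-17", "host-c"),
    (t0, "laptop-04", "host-a"),
    (t0 + timedelta(minutes=5), "laptop-04", "host-b"),
    (t0 + timedelta(minutes=12), "laptop-04", "host-c"),
]
print(rapid_login_bursts(attempts))
```

Both sources touch the same three hosts. Only the one doing it at sub-second intervals gets flagged.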
At another customer we were looking at Lenel system logs, which are physical badge scan logs for building access. The logs were odd in that there was a sequence of logins and logouts, with a particular badge getting swiped, that did not add up. One of our data scientists walked from one building to the next, and back, to measure how long the trip took. He found that it wasn’t possible to swipe a badge in one building and then again in a second building within the time indicated in the logs. There just wasn’t enough time for an individual to get from one location to the next. It turned out this individual was sharing his badge with others. He had multiple badges in operation, which was a violation of security policy to say the least!
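That walking test translates directly into a rule. The sketch below assumes badge swipes parsed into `(timestamp, badge_id, building)` tuples and a hypothetical table of minimum walking times between buildings. The six-minute figure and the badge and building names are invented for the example, not taken from the customer engagement.

```python
from datetime import datetime, timedelta

# Assumed: minimum walking time between each pair of buildings, measured on foot.
MIN_WALK = {frozenset({"building-1", "building-2"}): timedelta(minutes=6)}


def impossible_swipes(swipes, min_walk=MIN_WALK):
    """swipes: list of (timestamp, badge_id, building) tuples.

    Return (badge, from_building, to_building, elapsed) for consecutive swipes
    of the same badge in different buildings that are closer together in time
    than the minimum walking time between those buildings.
    """
    last_seen = {}
    findings = []
    for ts, badge, building in sorted(swipes):
        if badge in last_seen:
            prev_ts, prev_building = last_seen[badge]
            needed = min_walk.get(frozenset({prev_building, building}))
            too_fast = needed is not None and ts - prev_ts < needed
            if prev_building != building and too_fast:
                findings.append((badge, prev_building, building, ts - prev_ts))
        last_seen[badge] = (ts, building)
    return findings


swipes = [
    (datetime(2024, 1, 8, 9, 0), "1001", "building-1"),
    (datetime(2024, 1, 8, 9, 2), "1001", "building-2"),   # 2 minutes later
    (datetime(2024, 1, 8, 9, 0), "1002", "building-1"),
    (datetime(2024, 1, 8, 9, 10), "1002", "building-2"),  # 10 minutes later
]
findings = impossible_swipes(swipes)
for badge, src, dst, elapsed in findings:
    print(f"badge {badge}: {src} -> {dst} in {elapsed}")
```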
Logs tell a story: a specific type of event occurred at a particular date and time. Depending on the fields in the log and whether they are correctly filled in, you will have traces that somebody was there. It could be a person, or it could be automated malware. Either way, the logs tell the story.
Not All Log Data is Created Equal
Not all log data is going to yield meaningful results for behavior analytics. An example of an inconsequential log source for UEBA is Microsoft Office application logs. Microsoft Office keeps logs on its applications, tracking events such as an application crashing on an individual’s machine. That data is not useful for UEBA, since it pertains only to that single user. It does not tell you what is going on across the network.
An example of a useful application log would be Subversion logs. Apache Subversion is another software versioning and revision control system for software developers. The logs show people checking files in and out of the repository all day long, and they contain the filenames of those files. The filenames themselves are meaningless to the UEBA platform because there can be thousands of files. If you look at the logs in the context of millions of files, the filenames get lost in the noise. However, if you look at the higher-level data, which is the number of root project directories being checked out, you can start to see patterns.
In our client’s case, we were able to identify one person who was checking out thousands of projects (i.e., groups of files) that he had no right to check out. We originally thought it was an automated process checking out all these projects, but the client discovered it was someone who was no longer with the company. At the project level you can see that this individual is not only checking out source code projects, he’s checking out semiconductor designs and other intellectual property he shouldn’t be downloading. Now you’ve uncovered an insider threat with behavior-based security analytics.
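Rolling filenames up to root projects is easy to sketch. This example assumes the checkout log has already been parsed into `(user, path)` pairs, and that the first path component is the root project. The users, paths and fixed threshold are hypothetical; a threshold this simple is only a stand-in for comparing each user against a learned baseline.

```python
from collections import defaultdict


def projects_per_user(checkouts):
    """checkouts: iterable of (user, path) pairs from a checkout log.

    Return {user: set of root project names}, treating the first path
    component as the root project.
    """
    projects = defaultdict(set)
    for user, path in checkouts:
        root = path.strip("/").split("/", 1)[0]
        projects[user].add(root)
    return projects


def unusual_users(checkouts, threshold):
    """Return users who checked out files from more than `threshold` root projects."""
    per_user = projects_per_user(checkouts)
    return sorted(user for user, roots in per_user.items() if len(roots) > threshold)


checkouts = [
    ("alice", "/billing/src/invoice.c"),
    ("alice", "/billing/src/ledger.c"),
    ("bob", "/billing/src/invoice.c"),
    ("bob", "/chip-design/layout/core.gds"),
    ("bob", "/firmware/boot/init.c"),
    ("bob", "/payroll/reports/q1.sql"),
]
print(unusual_users(checkouts, threshold=2))
```

Counting individual files would not separate these two users nearly as well as counting the distinct root projects they touch.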
The most critical element in ingesting and analyzing log data is that the time stamps are accurate across all systems. Otherwise you’ll have a mishmash of event data, and you won’t know which time stamps to trust or in what order events really happened. Typically, servers record time in UTC (Coordinated Universal Time) so that time stamps are directly comparable regardless of the user’s time zone.
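When a source does log local time, converting to UTC at ingestion keeps events comparable. Here is a minimal sketch using Python’s standard `datetime` module. It assumes each time stamp is an ISO 8601 string with an explicit UTC offset; the source names and times are made up.

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Parse an ISO 8601 time stamp with an explicit UTC offset and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Without an offset we cannot know which instant the log entry refers to.
        raise ValueError(f"time stamp has no UTC offset: {timestamp!r}")
    return parsed.astimezone(timezone.utc)


events = [
    ("vpn", "2024-01-08T11:50:00-05:00"),
    ("badge", "2024-01-08T08:50:00-08:00"),
]
normalized = [(source, to_utc(ts)) for source, ts in events]
for source, ts in normalized:
    print(source, ts.isoformat())
```

The two local times look three hours apart, but they refer to the same instant once converted. Rejecting time stamps with no offset is a deliberate choice in this sketch, since guessing the zone is how the mishmash starts.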
Identify Patterns in Log Data with Gurucul UEBA
Gurucul UEBA will make sense of your log data so you can respond to critical threats in real time. Gurucul UEBA detects changes in behavior patterns by ingesting log data from a huge variety of sources. Gurucul has over 300 out-of-the-box connectors to log sources, from SIEMs to IGA platforms to firewalls to proprietary applications. Contact us to learn more about our extensive log ingestion capabilities for user and entity behavior analytics.