ABCs of UEBA: D is for Data | Gurucul Risk Analytics

The most effective User and Entity Behavior Analytics (UEBA) solutions leverage big data. Big data refers to data sets so large and complex that traditional data processing software cannot handle them effectively because of their volume, velocity and variety. Gartner’s Doug Laney documented those three V’s in 2001 as part of big data’s original definition. Other experts have since added three more, and Gurucul has expanded the V’s of Big Data list further.

Know the Eight V’s of Big Data

The eight V’s of big data are:

Volume – The quantity of generated and stored data. The size of the data determines the value and potential insight and whether it can be considered big data or not.
Velocity – The speed at which data is generated and processed to meet real-time availability requirements, as well as the demands and challenges that might impede access to it for efficient use and analysis.
Variety – The type and nature of the data, both structured and unstructured. A wider range gives analysts more critical context to draw on when producing useful insights.
Veracity – The quality of raw or refined captured data can vary greatly, affecting accurate analysis.
Variability – Inconsistency of the data set can hamper processes to handle and manage it.
Value – What benefit data delivers by virtue of comprehensive control of big data’s massive volume.
Venue – The scotoma or blind spots in a security perimeter that come with separate and unintegrated silos of data, and are sought after by hackers.
Vector – The channels by which data flows and is ingested into data lakes and elsewhere, as well as its effectiveness and cost.

Data volume, variety and velocity – the original V’s of Big Data – have changed the game, and with that change has come a seismic expansion of the threat plane. The volume of security data UEBA solutions must contend with is mind blowing. A user might have four or five devices, most of which are mobile and typically exist outside the organization’s physical environment. This represents a flood of access and activity security data that UEBA solutions have to monitor. In the near future, a user might have ten or more devices. IoT and other emerging technologies will increase data variety exponentially. Whatever that volume and variety is, advanced UEBA solutions must account for all of it. Now, do you see why we expanded from the three V’s to the eight V’s of Big Data?

Implement Fast Data

The reality of perpetually expanding data makes it clear that managing security data manually in a timely manner is impossible. Fast data is necessary: active security data that reflects immediate status, with no limit on scalability or availability for ongoing use.

Fast data is the rapid mining and analysis of structured and unstructured big data, reducing it to smaller data sets in real or near-real time to deliver timely, actionable intelligence. With this in mind, the criticality of Venue and Vector (as mentioned in the eight V’s of Big Data) comes into sharper focus. For Venue, when all the critical data resides and is maintained in an agnostic data lake, that’s clearly beneficial. Yet when it is isolated and cordoned off into silos, that’s where security problems can be hidden.
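To make the idea of reducing big data to a smaller, timely data set more concrete, here is a minimal sketch in Python. It assumes events arrive as simple dictionaries with hypothetical "user" and "time" fields, and it keeps only a per-user count for the most recent five minutes. The function name, field names and window size are illustrative assumptions, not part of any UEBA product.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarize_recent(events, now, window=timedelta(minutes=5)):
    """Reduce raw events to per-user counts for the most recent window."""
    cutoff = now - window
    # Keep only events inside the window, then count them per user.
    return Counter(e["user"] for e in events if cutoff <= e["time"] <= now)


# Hypothetical sample events; a real feed would be far larger.
now = datetime(2024, 1, 1, 12, 0)
events = [
    {"user": "alice", "time": now - timedelta(minutes=1), "action": "login"},
    {"user": "alice", "time": now - timedelta(minutes=2), "action": "download"},
    {"user": "bob", "time": now - timedelta(minutes=30), "action": "login"},
]

summary = summarize_recent(events, now)
```

The older event falls outside the window, so only the recent activity survives in the summary. That small, current view is the kind of reduced data set the fast data description has in mind.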

Too often, big data is set aside in separate silos, controlled by different groups within an organization. Relating to data flow, or lack of it, Vector underscores the movement and direction of big data into a centralized data lake. However, if the data is not in alignment with an all-inclusive ingestion process, from all silos, it cannot provide the critical contextual value its potential represents. The siloed nature of data is a problem that will leave unknown unknowns unaddressed across a serious and expanding access and activity threat plane, if not resolved by a mature UEBA solution.

Ingest Critical UEBA Data Sources

UEBA draws from a broad range of data sources to provide holistic monitoring and behavior analytics for a risk-based approach. That data comes from a variety of security solutions, including the sources below.

Security Information and Event Management (SIEM). A SIEM solution’s primary function is to aggregate the data relevant to monitoring and managing privileges of users and services, including directory services, system-configuration changes and log audits. SIEMs gather, analyze and present information from network and security devices, IAM applications, vulnerability management and policy compliance tools, operating systems, database and application logs, as well as external threat data.

Identity and Access Management (IAM) and Privileged Access Management (PAM). IAM manages an individual’s proper and timely access to approved resources within an organization. Managed entities in IAM include users, hardware, network resources and applications. PAM specifically deals with the challenge of privileged access management within an organization. Privileged users hold the keys to the kingdom, and any compromise of their credentials can be catastrophic for an organization.

Data Loss Prevention (DLP). DLP solutions either provide alerts on, or prevent, potential data breaches and exfiltration transmissions by monitoring, detecting and blocking sensitive data, while the data is: in-use (endpoint actions), in-motion (network traffic), and at-rest (data storage). DLP also deals with data leakage incidents of sensitive data. Sensitive data includes private or company information, intellectual property (IP), financial or patient information, credit card data and other information.

Active Directory (AD). Active Directory Domain Services (AD DS) authenticates and authorizes all users and computers in a Windows domain network. Domain controllers assign and enforce security policies for all computers and software.

Endpoint Detection and Response (EDR). EDR software employs advanced threat detection technology on endpoints (computers), focusing on detecting suspicious activity on hosts and network PCs. EDR software is reactive in detecting and stopping threats (malware, viruses, zero-day attacks and advanced persistent threats).

Secure Web Gateways and Secure Email Gateways (SWG/SEG). A SWG is traditionally an appliance-based secure Web gateway that uses real-time code analysis technology, URL filtering and antivirus scanning to prevent malware and Web-based threats. A SEG is an email security solution that protects against spam and data leakage. It also provides reporting, analyzes inbound and outbound content, and assists with policy control.

Cloud access security brokers (CASB). A CASB is a technology solution that arbitrates data between in-house IT architectures and cloud vendor environments. Its capabilities traditionally include the ability to encrypt or manage data, so it is more secure in a cloud environment. A CASB resides between internal and external systems for securing outbound data. CASB solutions often provide features such as auditing, data loss prevention, encryption and monitoring. CASBs help protect enterprise systems against cyber-threats with features such as malware prevention and data security that render data streams unreadable by outside parties.

NetFlow. NetFlow is a network protocol for monitoring and collecting network traffic flow data produced by NetFlow-enabled routers and switches. NetFlow-enabled routers export traffic statistics that are gathered by a NetFlow collector (either a hardware appliance or software application) which performs traffic analysis determining direction and volume.
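As a rough illustration of the direction and volume analysis a collector performs, the sketch below totals bytes per traffic direction. It does not parse the NetFlow wire format. It assumes flow records have already been decoded into dictionaries with hypothetical "src", "dst" and "bytes" fields, and it assumes 10.0.0.0/8 is the internal network.

```python
import ipaddress
from collections import defaultdict

# Assumption for this example: everything in 10.0.0.0/8 is internal.
INTERNAL = ipaddress.ip_network("10.0.0.0/8")


def direction(src, dst):
    """Classify a flow by where its endpoints sit relative to INTERNAL."""
    src_in = ipaddress.ip_address(src) in INTERNAL
    dst_in = ipaddress.ip_address(dst) in INTERNAL
    if src_in and not dst_in:
        return "outbound"
    if dst_in and not src_in:
        return "inbound"
    return "internal" if src_in else "external"


def volume_by_direction(flows):
    """Sum the byte counts of decoded flow records per direction."""
    totals = defaultdict(int)
    for flow in flows:
        totals[direction(flow["src"], flow["dst"])] += flow["bytes"]
    return dict(totals)


# Hypothetical decoded flow records.
flows = [
    {"src": "10.0.0.5", "dst": "93.184.216.34", "bytes": 1200},
    {"src": "93.184.216.34", "dst": "10.0.0.5", "bytes": 5400},
    {"src": "10.0.0.5", "dst": "10.0.0.9", "bytes": 300},
    {"src": "10.0.0.7", "dst": "93.184.216.34", "bytes": 800},
]

totals = volume_by_direction(flows)
```

A UEBA platform could compare per-direction totals like these against a baseline, for example to flag an unusual rise in outbound volume.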

Additional security data sources. Added to the solutions above, UEBA solutions also monitor visibility of commercial business applications and databases, HR information, social media, as well as Dynamic Host Configuration Protocol (DHCP), SaaS/IaaS solutions and document files. In addition, different organizations have different security needs, based on size, organizational complexity and business model. Other sources of data might be a requirement as well.

Leverage Big Data Technologies – Data Lakes

Commonly called data lakes, these are not to be confused with data warehouses, which by nature have their data organized for a particular purpose, traditionally modeled for reporting. A data lake is more like a large body of water in its natural state. It is where all data streams flowing into it are unfiltered, unprocessed from source systems. This raw and untransformed, unstructured data is critical. The more data in a data lake, the more potential for UEBA to extract knowledge deeply via machine learning. The more data you feed to UEBA, the smarter it gets. UEBA transforms and processes the raw data to reveal meaningful and predictive patterns and to extract insights as needed.

Data lakes utilize cheap and readily available storage. They make scaling to terabytes or petabytes far more economical than traditional data warehouses. This is possible by virtue of a data lake’s capability of allocating various virtual storage nodes. These are not tied to a single server or a single location and serve the purpose for expanding storage as necessary. This represents a tightly controlled cost, since customers only pay for the storage space they need, when they need it.

Data lakes are data agnostic, storing all non-traditional and traditional data. This is regardless of source and structure (structured or unstructured), in its raw form, until required for use. Unlike data warehouses, with their rigid designs, data lakes support deep analysis of factors that emerge over time. This is where supporting data for analysis is natively available within the larger set of data.
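The contrast with a rigid warehouse design can be sketched in a few lines of Python. In this assumed example, the lake is just a list of raw strings stored as they arrived, and structure is applied only when a question is asked. The record contents and the read_with_schema helper are hypothetical.

```python
import json

# Raw entries kept exactly as received: two JSON records and one free-text line.
raw_lake = [
    '{"user": "alice", "event": "login", "src_ip": "10.0.0.5"}',
    '{"user": "bob", "event": "file_copy", "bytes": 1048576}',
    "free-text note: badge reader offline at door 3",
]


def read_with_schema(lake, fields):
    """Apply a schema at read time: return the requested fields from every
    JSON record that contains all of them."""
    rows = []
    for line in lake:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured entries stay in the lake, outside this view
        if isinstance(record, dict) and all(f in record for f in fields):
            rows.append({f: record[f] for f in fields})
    return rows


activity = read_with_schema(raw_lake, ["user", "event"])
transfers = read_with_schema(raw_lake, ["user", "bytes"])
```

Nothing is discarded at write time, so a new question that emerges later, such as the transfers view here, can still be answered from the same raw data.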

Below are the most common data lakes leveraged by UEBA platforms.

Apache Hadoop. The original data lake, typically referred to simply as Hadoop, is considered the foundational player among data lake offerings. It provides broad, agnostic data lake repository capabilities with the Hadoop Distributed File System (HDFS). Hadoop’s popularity stems from the fact that it is a collection of open source projects, which means that development occurs at a rapid pace. Hadoop’s reliance on open source software and commodity hardware makes it compelling from both a cost and a features perspective. Apache Hadoop MapReduce is a framework for processing large data sets in parallel across a Hadoop cluster. Considered a standard for data lakes by many, Hadoop has the widest adoption within the industry.
MapR. An Apache Hadoop distributor, MapR offers its own version of HDFS, a database management system, a set of data management tools, and related software. Combining analytics in real-time with operational applications, its technology runs on both commodity hardware and public cloud computing services. High profile adoptions have increased MapR’s prominence in the market. Amazon chose MapR to deliver an upgraded version of their Elastic Map Reduce (EMR) service. Google selected MapR as a technology partner where MapR broke the minute sort speed record on Google’s compute platform. MapR has a number of supporting solutions, including MapR-DB, its NoSQL document database management system.
Cloudera/Hortonworks. Cloudera and Hortonworks, two of the biggest players in the Hadoop space, merged in January of 2019. While Hadoop is open source and freely available, Cloudera and Hortonworks abstract away most of the infrastructure. They focus on slightly different markets, with Hortonworks going after a more technical user with a pure open-source approach, while Cloudera offers proprietary tools.
The Elastic Stack (previously known as: The ELK Stack and Elastic ELK). Originating from an entirely different technology framework than Hadoop, the Elastic Stack is comprised of Elasticsearch, Logstash, and Kibana, which are platforms under the Elastic umbrella that were developed to integrate and work with each other efficiently. Each component is a separate solution powered by the open-source vendor Elastic. Elasticsearch delivers search and distributed computing capabilities, while Logstash normalizes a wide range of time series data, and Kibana is a visualization tool, all of which collectively deliver a holistic analytics tool. Because of its ease of use and simplicity, the Elastic Stack is growing in popularity.
Microsoft Azure’s HDInsight. Built to the open HDFS standard, HDInsight is a fully managed Cloud Hadoop offering that provides optimized open source analytic clusters for Spark, Hive, MapReduce, HBase, Storm, Kafka, and R-Server. Designed for redundancy and high availability, providing head node replication, data geo-replication, and built-in standby NameNode, HDInsight offers resilience to critical failures not delivered in standard Hadoop solutions.
Amazon’s Data Lake Solution. Nested within the broad array of AWS Cloud offerings, and more specifically in Amazon Simple Storage Service (S3), the company announced their data lake solution in late 2016. The AWS Cloud solution suite includes managed services that help ingest, store, find, process, and analyze both structured and unstructured data, and it is integrated with Hadoop and its MapReduce model for large-scale data processing. While data is stored in S3, the metadata is stored in both Amazon DynamoDB and Amazon Elasticsearch Service (Amazon ES). Storing data in S3 provides durable, secure storage in any format. Data in S3 integrates with other services, such as Amazon Redshift for data warehousing, Amazon QuickSight for data visualization, or Amazon Machine Learning to build machine learning models. In addition, the Amazon Data Lake solution integrates with third-party tools so that customers can provision the right tool for their needs.
Other Data Lake Vendors. Competition is relentless in this space. Additional data lake vendors to watch include: HVR, Snowflake and Zaloni.
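The MapReduce model named in the Apache Hadoop entry above splits a job into a map step that emits key-value pairs, a shuffle that groups values by key, and a reduce step that combines each group. The sketch below shows that flow in plain, single-process Python. It is an illustration of the model only, not the Hadoop API, and the "user,action" record format and function names are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (user, 1) pair for each 'user,action' record."""
    user, _action = record.split(",")
    yield (user, 1)


def shuffle(pairs):
    """Shuffle: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


# Hypothetical activity records.
records = ["alice,login", "bob,login", "alice,download", "alice,logout"]

pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

On a real cluster the map and reduce steps run in parallel across many nodes, which is what lets the same pattern scale to very large data sets.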

Choose Open Choice of Big Data

Be very careful when selecting a UEBA solution. Not all big data based UEBA offerings are equal. The problem with some UEBA vendors is that they customize their data lake backend. This means that even if you own a data lake of the same flavor, you’ll have to buy theirs, too. That’s not ideal. We recommend choosing a UEBA vendor like Gurucul that leverages the eight V’s of big data. Additionally, Gurucul offers open choice of big data. As a visionary, Gurucul decided from the very beginning not to be reliant on any one big data platform. We made this decision because we knew that our customers’ underlying data layer could change at any time, and we want to be able to support any data lake. We can set our UEBA right on top of your data lake. If you don’t have a data lake, we’ll give you Hadoop for free. It’s a game changer.

Get Ready for Megascale Data

In conclusion, data growth is still exploding beyond the eight V’s of Big Data. We’re only at the tip of a continually morphing and growing iceberg. Even today we’re seeing the ever-changing exponential aspect of data. In the past, data growth was calculated on a per-human basis. Now that IoT, a plethora of applications, ubiquitous 24/7 global access and other factors have entered the equation, it no longer correlates with those per-human calculations. There is a multiplication factor: where one percent once represented one IP address or one machine ID, one percent might now be associated with 1000 machine IDs.

This only underscores the fact that CISOs must master the eight V’s of big data. And they must recognize that we’re just at the initial phase of big data. There will be so much more in the near future, expanding by leaps and bounds. In fact, industry experts may need to find a new term for it, something like Megascale Data. You’d better make sure your UEBA platform can handle Megascale Data! Plan ahead.
