Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and information privacy. The term “big data” can also refer to the use of predictive analytics, User and Entity Behavior Analytics (UEBA), or other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. Big data can be structured, semi-structured and unstructured data that has the potential to be mined for information.
Discussions about big data traditionally include data lakes. Data lakes support storing data in its original or exact format. The goal is to offer a raw or unrefined view of data to data scientists and analysts for discovery and analytics.
Big data differs from a relational database. Relational databases have been around since the early 1970s. A relational database is a collection of data items organized as a set of formally described tables with unique index keys. Data can be accessed or reassembled in many different ways without having to reorganize the database tables, often through queries that use Boolean logic.
Relational database technology struggles to manage multiple, continuous streams of data and to scale to a high volume of data. Nor is it designed to modify incoming data in real time.
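To make the relational model concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column names, and rows are hypothetical and chosen only for illustration. It shows a table with a unique key, a query that filters with Boolean logic, and a second query that reassembles the same data without reorganizing the table.

```python
import sqlite3

# In-memory database; the schema and rows below are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " id INTEGER PRIMARY KEY,"      # unique index key for each row
    " username TEXT NOT NULL,"
    " action TEXT NOT NULL,"
    " bytes_sent INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO events (username, action, bytes_sent) VALUES (?, ?, ?)",
    [
        ("alice", "login", 120),
        ("bob", "upload", 5000),
        ("alice", "upload", 9000),
        ("carol", "login", 80),
    ],
)

# Boolean logic (AND / OR) in the WHERE clause selects rows
# without reorganizing the underlying table.
rows = conn.execute(
    "SELECT username, action FROM events"
    " WHERE action = 'upload' AND (bytes_sent > 8000 OR username = 'bob')"
    " ORDER BY id"
).fetchall()

# The same table answers a different question with a different query.
totals = conn.execute(
    "SELECT username, SUM(bytes_sent) FROM events"
    " GROUP BY username ORDER BY username"
).fetchall()

print(rows)
print(totals)
```

This works well for a few rows on one machine. The scaling and streaming limits discussed next appear when the volume and arrival rate grow far beyond that.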
Big data technologies have made it technically and economically feasible to collect and store larger datasets and to analyze them in order to uncover new insights. In most cases, big data processing involves a common data flow – from the collection of raw data to the consumption of actionable information.
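The flow from raw data to actionable information can be sketched as a chain of small stages. The following is a toy illustration only, assuming hypothetical JSON records with "user" and "bytes" fields. Real big data pipelines distribute these stages across many machines, but the shape is similar: collect, transform in flight, then summarize for consumption.

```python
import json
from collections import defaultdict

def collect(raw_lines):
    """Collection: parse raw JSON records, skipping malformed ones."""
    for line in raw_lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

def enrich(records):
    """Processing: modify each record as it streams through."""
    for record in records:
        record["kb"] = record["bytes"] / 1024
        yield record

def summarize(records):
    """Consumption: aggregate into per-user totals (in kilobytes)."""
    totals = defaultdict(float)
    for record in records:
        totals[record["user"]] += record["kb"]
    return dict(totals)

# Hypothetical raw input, including one malformed line.
raw = [
    '{"user": "alice", "bytes": 2048}',
    "not json",
    '{"user": "bob", "bytes": 1024}',
    '{"user": "alice", "bytes": 1024}',
]

report = summarize(enrich(collect(raw)))
print(report)
```

Because each stage is a generator, records are processed one at a time as they arrive rather than being loaded and reorganized in bulk.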
A set of specific attributes defines big data. They are frequently called the four V’s: volume, variety, velocity, and veracity. A fifth, variability, is sometimes added. The concept of big data gained momentum in the early 2000s when Doug Laney articulated the now-mainstream definition of big data as the three V’s (Pop quiz – What were the original three? Answer is at the end*).
Source: The FOUR V’s of Big Data, IBM
Volume – The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.
Variety – The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.
Velocity – In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
Veracity – The quality of captured data can vary greatly, affecting accurate analysis.
Variability – Inconsistency of the data set can hamper processes to handle and manage it.
Big data analysis uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data. These data sets often have low information density, yet they can reveal relationships and dependencies, or support predictions of outcomes and behaviors that may not have been possible without this data.
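As a small illustration of inferring a law from data, the sketch below fits a straight line by ordinary least squares and uses it to predict the next value. The observations are made up for the example, and a linear fit is only the simplest case of the regressions mentioned above.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical observations: hour of the day vs. number of logins.
hours = [1, 2, 3, 4, 5]
logins = [12, 19, 31, 42, 48]

slope, intercept = fit_line(hours, logins)

# Use the inferred relationship to predict an unseen point (hour 6).
predicted = slope * 6 + intercept
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

With only five points the fit means little. The value of large data sets is that relationships like this can be estimated with enough observations to be useful.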
You can’t have a conversation about big data for very long without running into the elephant in the room, Hadoop.
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. (Gurucul GRA is built upon big data infrastructure [Hadoop] with a flexible metadata end. This allows customers an open choice for big data [for example, the customer’s native Hadoop deployment, Cloudera, Hortonworks, MapR and Elastic/ELK]).
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks.
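HDFS achieves this by splitting each file into large blocks and keeping several copies of every block on different servers. The sketch below is not the HDFS API. It is a simplified model, assuming a 128 MB block size and three replicas (common defaults) and a round-robin placement rule. Real HDFS placement is rack-aware and more involved, and the node names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default (assumed here)
REPLICATION = 3                 # three copies per block, a common default

def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, replica_nodes) tuples.

    Round-robin placement is a simplification for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    plan = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        replicas = [nodes[(index + r) % len(nodes)] for r in range(replication)]
        plan.append((index, length, replicas))
        offset += length
        index += 1
    return plan

# Hypothetical four-node cluster storing a 300 MB file.
nodes = ["node-a", "node-b", "node-c", "node-d"]
plan = plan_blocks(300 * 1024 * 1024, nodes)

for index, length, replicas in plan:
    print(index, length // (1024 * 1024), "MB", replicas)
```

Because every block lives on more than one node, the loss of a single server does not lose data, and different servers can read different blocks of the same file in parallel.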
Some recommended tools companies use with big data are Hadoop, Cloudera, MongoDB, Hive, Spark, and Tableau (source: “Top 6 Big Data Tools to Master in 2017”).
The benefits associated with using big data include:
Big data is –
Big data solutions offer cloud hosting, highly indexed and optimized data structures, automatic archival and extraction capabilities, and reporting interfaces designed to provide more accurate analyses that enable businesses to make better decisions.
- Descriptive analytics help users answer the question: “What happened and why?”
- Predictive analytics help users estimate the probability of a given event in the future.
- Prescriptive analytics provide specific recommendations to the user.
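The three kinds of analytics can be contrasted on a toy data set. Everything in this sketch is hypothetical: the history, the frequency-based estimate, and the 0.15 threshold are assumptions made for illustration, not a recommended method.

```python
from collections import Counter

# Hypothetical history of daily outcomes for one server.
history = ["ok", "ok", "outage", "ok", "ok", "ok", "outage", "ok", "ok", "ok"]

# Descriptive: summarize what happened.
counts = Counter(history)

# Predictive: estimate the probability of an outage from past frequency.
p_outage = counts["outage"] / len(history)

# Prescriptive: turn the estimate into a specific recommendation.
def recommend(probability, threshold=0.15):
    """Return a recommendation; the threshold is an arbitrary assumption."""
    return "add redundancy" if probability >= threshold else "no action needed"

advice = recommend(p_outage)
print(dict(counts), p_outage, advice)
```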
To learn more:
If you would like the big picture for big data, then check out ‘The Human Face of Big Data’, a PBS special that explores the digital exhaust we create and the internet as a big sensor.
A new book by Gurucul CEO Saryu Nayyar, “Borderless Behavior Analytics – Who’s Inside? What’re They Doing?”, provides insights on the cybersecurity landscape and how defenses are evolving. It includes seven chapters of valuable insights from seven leading CISOs and CIOs.
* Answer to the quiz – Volume, Velocity, and Variety.