Rapid changes are occurring within the big data analysis space, and a lot is being written about them. These changes are not spontaneous; they have unfolded over time and can be traced back, much like a biological evolution, to the early days of computers, laptops, mobiles and other such devices. One can clearly see a similar progression in the recent evolution of business requirements related to big data. Organizations that once had a limited number of analytics users and use cases have become far more data driven and data dependent in recent years. As a result, they now need to put the ability to analyze data into the hands of more business analysts, and those analysts need immediate access to the data, with the latencies in between removed, so that results can be produced rapidly and iterated on often.
Big data discovery, powered by Apache Spark, has emerged as a new technology that can address these diverse requirements. So let's step back in time to make sense of these evolutionary steps.
The era begins with the enterprise data warehouse (EDW) and the set of business intelligence (BI) tools built around it. The EDW is simply a curated environment containing a critical subset of enterprise data, and that subset was constrained by the limitations of the EDW infrastructure and the requirements of the BI tools. During this era, analysts needed to know in advance what questions they wanted to ask, and finding the answers required the participation of a whole team and could take months. The supporting workflow for this kind of data processing was extract, transform and load (ETL), which requires programmers, data warehouse architects, BI architects and administrators. And since the warehouse held only a subset of enterprise data, often 10 per cent or even less of it was available at the time of processing. Because of the complexity of the process and the long lead times involved, iterating was difficult and rarely viable.
The more recent development in this field is the birth of the data lake, which is an altogether different concept. First enabled by Hadoop, it offers tremendous cost savings, in some cases up to 1,000 times, and data lakes can serve as a common repository for all of an enterprise's data rather than just a subset. Thanks to three major factors:
Hadoop
Availability of inexpensive servers
Affordable cloud deployments
Together, these factors let data lakes easily solve the storage limitations of the EDW era. But data lakes alone do little to address the complexity and latency issues encountered in the data warehouse model.
In the absence of the carefully curated data found in an EDW, Hadoop requires additional preparation before the data can be analyzed.
In order to run SQL queries on the data, administrators need to organize it in SQL-on-Hadoop systems, which adds complexity to the process and does not support discovery and iteration.
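To make that preparation step concrete, here is a minimal sketch in PySpark. The file path, column names and schema are hypothetical; the point is that someone has to declare structure and register the data before a single SQL query can run against it.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.appName("prepare-lake-data").getOrCreate()

# Raw files in the lake carry no declared structure, so a schema
# must be defined up front before SQL can be used against them.
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("order_date", DateType()),
    StructField("amount", DoubleType()),
])

# Hypothetical path to raw order data landed in the lake.
orders = spark.read.csv("hdfs:///lake/raw/orders/", schema=schema, header=True)

# Register the data as a table so analysts can query it with SQL.
orders.createOrReplaceTempView("orders")

spark.sql(
    "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id"
).show()
```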
For a workflow to truly deliver the advantages of a Hadoop-based data lake, the business needs a set of tools that opens the lake to everyone in the organization who needs it, while also reducing the dependency on specialized resources. This is possible with Spark, under which a data lake can truly become a data discovery environment. Spark possesses advanced data analytics capabilities and can analyze data across the enterprise, greatly reducing the need for complex and time-consuming MapReduce programming.
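As a rough illustration of that iterative style, the sketch below uses Spark's DataFrame API to explore data sitting directly in the lake. The Parquet path and column names are assumptions, but the pattern is the same for any dataset: load, filter, aggregate, inspect, and refine the question in seconds, rather than writing a new MapReduce job for each step.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-discovery").getOrCreate()

# Read directly from files in the data lake -- no upfront ETL pipeline.
events = spark.read.parquet("s3a://example-lake/clickstream/")

# First pass: which product categories drive the most sessions?
by_category = (events
    .filter(F.col("event_type") == "page_view")
    .groupBy("product_category")
    .agg(F.countDistinct("session_id").alias("sessions"))
    .orderBy(F.desc("sessions")))
by_category.show(10)

# Iterate immediately: narrow the same question to mobile traffic only.
by_category_mobile = (events
    .filter((F.col("event_type") == "page_view") & (F.col("device") == "mobile"))
    .groupBy("product_category")
    .agg(F.countDistinct("session_id").alias("sessions"))
    .orderBy(F.desc("sessions")))
by_category_mobile.show(10)
```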
Big data discovery is only a beginning, opening the door to a more diverse and powerful set of capabilities that let organizations process their data lakes and make the most of them.