Managing and analyzing big data is no child's play; the data itself exists in so many forms that simply collecting and analyzing it has become a major challenge. To tackle the challenges companies and teams face day to day when analyzing big data, developers have created a new set of open-source technologies built around Hadoop. Since its birth, the Apache Hadoop project at the Apache Software Foundation has grown considerably and added many new members to its family; what began as a single piece of software has evolved into an entire ecosystem.
Spark, Hive, HBase and Storm are some of the options companies are using. These technologies enable them to deal with massive amounts of data, often in real time, and they keep improving how companies work with big data on a day-to-day basis.
There are many projects in the Hadoop ecosystem, and below we take a look at some of the most significant ones.
Apache Hadoop is the flagship technology that became the centre of gravity for the entire ecosystem. It began as a side project at Yahoo, where developers needed a way to store and process the large amounts of data they were gathering from their new search engine. The technology was eventually contributed to the Apache Software Foundation.
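To make "store and process" concrete, here is a minimal word-count sketch in Hadoop's MapReduce style, written as two Hadoop Streaming scripts in Python (the file names, sample invocation, and jar path are illustrative; the streaming jar's location varies by installation):

    #!/usr/bin/env python3
    # mapper.py -- emits "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- sums the per-word counts produced by the mappers
    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        counts[word] += int(count)
    for word, total in counts.items():
        print(f"{word}\t{total}")

Hadoop's streaming jar runs these scripts over files stored in HDFS, along the lines of: hadoop jar hadoop-streaming.jar -input /data -output /out -mapper mapper.py -reducer reducer.py.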
Apache Hive was originally developed by Facebook and later contributed to the Apache Software Foundation. It is a data warehouse infrastructure built on top of Hadoop, and its main job is to provide services such as data summarization, querying and analysis.
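As a minimal sketch of what querying Hive can look like from Python, using the PyHive client and assuming a HiveServer2 instance on localhost and a hypothetical page_views table (both are assumptions for illustration):

    from pyhive import hive  # pip install pyhive

    # Assumed: HiveServer2 listening on localhost:10000 (illustrative only).
    conn = hive.Connection(host="localhost", port=10000, username="demo")
    cursor = conn.cursor()

    # HiveQL looks like SQL; Hive compiles it into jobs over data in HDFS.
    # `page_views` is a hypothetical table used purely for illustration.
    cursor.execute("""
        SELECT country, COUNT(*) AS views
        FROM page_views
        GROUP BY country
        ORDER BY views DESC
        LIMIT 10
    """)
    for country, views in cursor.fetchall():
        print(country, views)

    cursor.close()
    conn.close()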
Apache HBase was born at a company named Powerset, which was later acquired by Microsoft. Its original goal was to process large amounts of data for natural language processing. At its core, it is a non-relational, distributed database modelled on Google's Bigtable. It joined the Apache family as a top-level project in 2010.
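A minimal sketch of HBase's row-key and column-family data model from Python, using the third-party happybase client and assuming an HBase Thrift server on localhost with a hypothetical 'users' table:

    import happybase  # pip install happybase; talks to HBase via its Thrift gateway

    # Assumed: HBase Thrift server on localhost:9090 and a table named 'users'
    # with column family 'info' -- both hypothetical, for illustration.
    connection = happybase.Connection("localhost", port=9090)
    table = connection.table("users")

    # HBase stores raw bytes addressed by (row key, column family:qualifier).
    table.put(b"user#1001", {b"info:name": b"Ada", b"info:city": b"London"})

    row = table.row(b"user#1001")
    print(row[b"info:name"])  # b'Ada'

    connection.close()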
Apache Spark is the new rising star of the Apache ecosystem. Developed at UC Berkeley, it is a fast alternative to Hadoop's MapReduce technology and, depending on the application, can be up to 100 times faster. Spark's developers continue to support the Apache Software Foundation project and also offer a commercial service known as Spark-as-a-Service.
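As a quick taste of Spark's programming model, here is a minimal PySpark word count run against a local session (the sample lines are made up for illustration):

    from pyspark.sql import SparkSession  # pip install pyspark

    # Local session; in production this would point at a cluster instead.
    spark = (SparkSession.builder
             .appName("wordcount-sketch")
             .master("local[*]")
             .getOrCreate())

    lines = spark.sparkContext.parallelize([
        "big data is not child's play",
        "spark makes big data faster",
    ])

    # The same map/reduce idea as Hadoop Streaming, but expressed in-process
    # and kept in memory between stages -- the source of Spark's speed-up.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    print(counts.collect())
    spark.stop()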
Apache Kafka began as a LinkedIn project, developed as a messaging system for the real-time data generated and processed by the company's career website and platform. It was eventually donated to open source in 2011.
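A minimal sketch of Kafka's publish/subscribe model, using the third-party kafka-python client and assuming a broker on localhost:9092 and a topic named 'events' (both illustrative):

    from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

    # Producer side: append a message to the 'events' topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b"user 42 viewed job posting 7")
    producer.flush()

    # Consumer side: read messages back from the beginning of the topic.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating after 5s of silence
    )
    for message in consumer:
        print(message.topic, message.offset, message.value)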
Apache Storm is a real-time computation system that makes it easy to reliably process unbounded streams of data. It is sometimes described as an alternative to Spark. The company that originally developed it, BackType, was acquired by Twitter in 2011.
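Storm topologies are typically written in Java, but as a rough sketch of its processing model, here is a word-counting bolt using the third-party streamparse Python library (the topology wiring around it is omitted; tuple contents are assumed to be single words for illustration):

    from collections import Counter
    from streamparse import Bolt  # pip install streamparse

    class WordCountBolt(Bolt):
        """Counts words arriving on an unbounded stream, one tuple at a time."""

        def initialize(self, conf, ctx):
            self.counts = Counter()

        def process(self, tup):
            word = tup.values[0]          # each incoming tuple carries one word
            self.counts[word] += 1
            self.emit([word, self.counts[word]])  # pass running count downstream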
Apache NiFi, short for NiagaraFiles, is a technology developed by the US National Security Agency (NSA). Its main role is to automate the flow of data between systems, and it is operated through a web-based interface. Reflecting its NSA origins, it supports SSL, SSH, HTTPS and role-based authentication and authorization.
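Flows are normally built in NiFi's web UI, but the same interface is backed by a REST API. A minimal sketch of polling it from Python, assuming an unsecured NiFi instance on localhost:8080 (a secured deployment would require HTTPS and credentials):

    import requests

    # Assumed: unsecured local NiFi instance, purely for illustration.
    NIFI = "http://localhost:8080/nifi-api"

    # system-diagnostics reports heap, storage and thread usage for the node.
    resp = requests.get(f"{NIFI}/system-diagnostics", timeout=10)
    resp.raise_for_status()
    print(resp.json())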
Apache Flink is a distributed data analysis engine used to process both batch and streaming data.
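A minimal PyFlink DataStream sketch, run locally (a bounded collection stands in for an unbounded stream here):

    from pyflink.datastream import StreamExecutionEnvironment  # pip install apache-flink

    # Local environment; a real deployment would submit this job to a cluster.
    env = StreamExecutionEnvironment.get_execution_environment()

    # A small in-memory collection plays the role of a streaming source.
    stream = env.from_collection([1, 2, 3, 4, 5])
    stream.map(lambda x: x * x).print()

    env.execute("square-numbers-sketch")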
Apache Arrow is a technology developed by the company Dremio, the same company that contributed to the Apache Drill project. In fact, Arrow is based on code from Apache Drill.
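A small pyarrow example showing the columnar in-memory table format Arrow is built around (the sample data is made up):

    import pyarrow as pa  # pip install pyarrow

    # Arrow lays data out column by column in a standard in-memory format,
    # so different engines can share it without copying or re-serializing.
    table = pa.table({
        "city": ["Pune", "Delhi", "Mumbai"],
        "views": [120, 340, 560],
    })

    print(table.schema)
    print(table.column("views").to_pylist())  # [120, 340, 560]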
These were some highlights of the Apache Hadoop ecosystem. But this is not all: work is ongoing on many other projects as well, and documentation for them is available on the Apache Software Foundation website.