Chances are, the software you will hear about most often while navigating the Big Data world is Hadoop.
Hadoop is an important and widely used computational platform in its own right, and it is also the base for other software, technologies, and applications built on top of it.
Lots and lots of important technologies and software, across different domains and businesses, are built on Hadoop.
Hadoop is complex, and its computational framework, MapReduce, is complex too, but the good news is that there is a good chance you can use the technologies built on Hadoop without having to learn or deal with Hadoop itself at all.
Even so, understanding the basic concepts of Hadoop will give you a boost in the Big Data world, because you will encounter its concepts in other technologies and software.

Introduction

Hadoop is an open-source distributed computing platform that helps process huge amounts of data (which may reach trillions of bytes in size).
To put it simply, Hadoop divides the huge data into smaller pieces (called blocks in HDFS), distributes those pieces across different machines, and instructs the machines to process the data, so each machine processes only a small amount of it.
No matter how small a machine is, Hadoop can add its computational power to one huge pool of computing power. Hadoop is designed to run on what is called commodity hardware, meaning cheap, affordable machines: it divides a huge amount of data among these machines and asks each one to process its share.
This is why Hadoop is described as a “distributed computing platform”: it distributes the data among many machines, and it distributes the computation over those machines as well.
In this post I am going to describe the two basic concepts that make up Hadoop: how it distributes data, and how it divides the processing of that data among different machines.
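Before looking at Hadoop's actual components, here is a minimal plain-Java sketch (no Hadoop involved) of that divide-and-distribute idea: split a dataset into smaller pieces and let several workers, here just threads standing in for machines, each process a small piece in parallel. The dataset, shard count, and task (summing numbers) are made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

// Not Hadoop: a toy that mimics dividing "huge" data into shards
// and distributing the processing, with threads playing the machines.
public class DivideAndDistribute {
    public static void main(String[] args) throws Exception {
        // Pretend this list is huge data (here: the numbers 1..1,000,000).
        List<Long> data = LongStream.rangeClosed(1, 1_000_000)
                                    .boxed().collect(Collectors.toList());

        int shards = 4; // in Hadoop, each shard would live on a different machine
        int shardSize = data.size() / shards;

        ExecutorService pool = Executors.newFixedThreadPool(shards);
        List<Future<Long>> partials = new ArrayList<>();
        for (int i = 0; i < shards; i++) {
            List<Long> shard = data.subList(i * shardSize, (i + 1) * shardSize);
            // Each "machine" processes only its own small piece.
            partials.add(pool.submit(
                    () -> shard.stream().mapToLong(Long::longValue).sum()));
        }

        long total = 0;
        for (Future<Long> f : partials) total += f.get(); // combine partial results
        pool.shutdown();
        System.out.println("total = " + total); // 500000500000
    }
}
```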

Hadoop Components

Hadoop borrowed two important technologies from Google and, using the Java language, introduced them to the open-source world.
Those technologies are:

  • HDFS: Hadoop Distributed File System.
  • MapReduce Framework.

In addition, Hadoop version 2.0 introduced another component:

  • YARN (Yet Another Resource Negotiator): a framework that manages cluster resources and schedules data-processing tasks.


The two most important technologies in Hadoop are HDFS and MapReduce.


HDFS: Hadoop Distributed File System

As we mentioned, the idea of Hadoop is simple: divide the huge amount of data into smaller pieces and send those pieces to other machines.
HDFS is the part of Hadoop responsible for distributing the data among many machines.
It borrows the idea of the Google File System (GFS) and implements it in the Java language.
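To give a feel for how a client talks to HDFS, here is a minimal sketch using Hadoop's Java FileSystem API. The file path is hypothetical, and the sketch assumes the cluster's configuration (core-site.xml) is on the classpath; HDFS itself takes care of splitting the file into blocks and replicating them across machines.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and friends from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt"); // hypothetical path

        // Write: HDFS splits the file into blocks and replicates them for us.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back: the client reassembles the blocks from the datanodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```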


MapReduce

As Hadoop divides the data and stores it in smaller parts using HDFS, it then needs to process those parts, and this is where MapReduce comes into the picture.
MapReduce is another idea from Google: a programming model that processes big data using a parallel, distributed algorithm on a cluster of machines. The model has two phases: a Map phase that turns each input record into key/value pairs, and a Reduce phase that aggregates all values sharing the same key.
It is not easy at all to think the MapReduce way; it takes lots of practice, training, and time to make that mental shift.
But once you manage to express your processing problem as a MapReduce algorithm, you can use Hadoop to process your large data.
I am not going to dive into the details of MapReduce, because whole books have been written about it; instead, I will jump to the easy part.
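Still, to make the model a bit more concrete before moving on, here is a minimal sketch of the classic word-count program, the "hello world" of MapReduce, using Hadoop's Java API; the input and output paths are hypothetical.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for every word in a line, emit the pair (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for the same word arrive together; sum them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-sum on each mapper's machine
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input/books"));        // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount")); // hypothetical
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The part Hadoop handles for you, shuffling all pairs with the same word to the same reducer, is exactly the hard distributed-systems work that would otherwise dominate the program.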

Hadoop Ecosystem

As I mentioned before, lots of open-source software in different domains is built on top of Hadoop. Here I am going to list just some of it, divided by domain:


NoSQL Databases


  1. Apache HBase: Inspired by Google BigTable, it is a column-oriented database (see the sketch after this list).
  2. Apache Cassandra: Another NoSQL database; it combines the distributed design of Amazon's Dynamo with the column-oriented data model of Google's BigTable.
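To illustrate the column-oriented model behind HBase, here is a minimal sketch using HBase's Java client. It assumes a reachable cluster (configured via hbase-site.xml on the classpath) and a pre-created "users" table with an "info" column family; both are hypothetical.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
    public static void main(String[] args) throws Exception {
        // Cluster settings come from hbase-site.xml on the classpath.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table users = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Column-oriented model: values live in columns grouped under a family ("info").
            Put put = new Put(Bytes.toBytes("user-1")); // row key
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            users.put(put);

            // Read a single cell back by row key, column family, and qualifier.
            Result row = users.get(new Get(Bytes.toBytes("user-1")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```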


Document-based Database


  1. MongoDB-Hadoop: Although MongoDB is separate software from Hadoop, the MongoDB-Hadoop connector integrates the two worlds, so MongoDB can handle huge data.
  2. RethinkDB: A JSON-based database that uses Hadoop for storage.


NewSQL Databases


  1. Akiban Server: A NewSQL database that combines SQL and NoSQL approaches; it is written in Java and can be distributed on top of Hadoop.
  2. Haeinsa: Haeinsa adds multi-row, multi-table transactions on top of HBase, which we mentioned above.


SQL-based Databases


  1. Apache Hive: A data warehouse that gives Hadoop a SQL interface. Its dialect is not fully SQL-92 compliant, but it is very close (see the JDBC sketch after this list).
  2. Facebook Presto: An open-source distributed SQL query engine from Facebook, which says it is 10 times faster than Hive.
  3. eBay Kylin: Another open-source engine with a SQL interface, including OLAP support.
  4. Apache Phoenix: A SQL wrapper with a JDBC driver for HBase.
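As an example of how "SQL on Hadoop" looks from the application side, here is a minimal sketch that queries Hive through its JDBC driver. The host, credentials, and the "words" table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Older driver versions need explicit registration.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2's JDBC endpoint; host, port, and database are hypothetical.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS n FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("n"));
            }
        }
    }
}
```

Behind the scenes, Hive compiles such a query into distributed jobs on the cluster (classically MapReduce), so the application never touches MapReduce directly.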


Distributed Computing


  1. Apache Spark: A large-scale data-analytics engine that can run on top of Hadoop's HDFS. It provides an easier alternative to MapReduce for distributed computing (see the sketch after this list).
  2. Apache Pig: Another alternative to MapReduce, this time using a simple scripting language called Pig Latin that is translated into MapReduce jobs.
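To show why Spark is considered the easier alternative, here is the same word-count problem as the MapReduce example above, this time as a minimal sketch using Spark's Java API. The HDFS paths are hypothetical, and the sketch assumes it is launched with spark-submit, which supplies the cluster settings.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Master URL and other settings are provided by spark-submit.
        SparkConf conf = new SparkConf().setAppName("wordcount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///input/books"); // hypothetical
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///output/wordcount"); // hypothetical
        }
    }
}
```

The whole map-shuffle-reduce pipeline from the earlier example collapses into three chained calls, which is a large part of Spark's appeal.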


And more

Beyond the software listed above, Hadoop can be integrated with many other tools, providing them with the ability to distribute and process huge amounts of data.

Conclusion

As you can see, lots of open-source software is built on Hadoop, and much more can be integrated with it, so you are not required to do complex MapReduce programming yourself.

List of posts

This post is the third part of a series of posts related to big data. The other posts are:

  1. Introduction to big data.
  2. Streaming real time data.
  3. Hadoop & MapReduce (this post).
  4. ElasticSearch (coming).
  5. Spark and data analysis with Python (coming).
  6. Hive and data processing (coming).
  7. Practical example.