Are you looking to learn Hadoop? If yes, then you have landed on the right page. This tutorial will take you from the basics to the advanced level of Hadoop in a very simplified manner. So, without wasting much time, let's plunge into the details of Hadoop along with suitable practical scenarios.
In this Hadoop tutorial, we are going to cover the following topics:
Before starting the technical part of this Hadoop tutorial, let's look at the exciting story of how Hadoop came into existence. Hadoop was started by two people, Doug Cutting and Mike Cafarella, who were on a mission to build a search engine capable of indexing one billion pages. Their research showed that such a system would require hardware costing around half a million dollars, plus a monthly running cost of $30,000, which was a considerable capital expenditure for them. Moreover, they soon realized it would be tough for their architecture to support one billion web pages at all.
In 2003 they read a paper about the Google File System, known as 'GFS', which was used in Google's production systems. GFS was exactly what they had been looking for, and it became the solution to their problem of storing the vast amounts of data generated in the process of web crawling and indexing. Later, in 2004, Google introduced one more invention, MapReduce, to the technical world. These two inventions from Google led to the origin of the software called "Hadoop."
Doug Cutting said the following about Google's contribution to the development of the Hadoop framework:
“Google is living a few years in the future and sending the rest of us messages.”
This brief history should give you an idea of what Hadoop is and how powerful it is.
The world is transforming at a rapid pace with technological advancements in every sector, and we can get the best out of everything. In the old days we had landlines, but now we have moved to smartphones that can perform almost all the work of a PC. Similarly, back in the '90s we used to store data on floppy drives, which were later replaced by hard disks because of their limited storage and processing capability. Now we can store terabytes of data in the cloud without worrying about storage limitations.
Have you heard about IoT? It has become a disruptive technology across industries. IoT connects physical devices to the internet (or to each other) so that they can perform certain tasks without human intervention. A good example of IoT is the smart air conditioner: connected to the internet, it can adjust the temperature inside a room by monitoring the temperature outside. From this, you can imagine how much data is generated by the massive number of IoT devices worldwide and how much they contribute to big data.
Another main contributor to big data is social media. Social media is one of the biggest drivers of the evolution of big data, and it provides information about population behavior. The following image will give you a clear idea of the amount of data generated by social media every minute.
Apart from the rate at which data is generated, a second challenge comes from unstructured or unorganized data sets, which make processing a problem.
Let's take a restaurant as an example to get a better understanding of the problems related to big data and how Hadoop solves them.
John is a businessman who opened a small restaurant. Initially, he had one chef and one food shelf; the restaurant received two orders per hour, and this setup was enough to handle them.
Here we compare the restaurant scenario with a traditional system, where data was generated at a consistent rate and our storage system (an RDBMS, for example) was capable of processing it, just like John's chef. In this scenario, the chef corresponds to traditional processing and the food shelf corresponds to data storage, as shown in the above image.
After a few months, John decided to expand his business: he started taking online orders and added a few more cuisines to the menu to serve a larger number of people. Orders rose to an alarming ten per hour, and it became tough for a single chef to handle the extra work. Aware of the situation, John started to think about the measures he needed to take.
The same thing happened with big data: suddenly, data started being generated at a rapid rate because of growth factors such as social media. The traditional system, just like the chef in John's restaurant, was unable to cope with the situation. Thus, a different kind of solution was needed to tackle the problem.
After a lot of thought, John came up with the idea of increasing the number of chefs to four, and for a while everything functioned well and they could handle the orders. But soon this solution led to another problem: the food shelf. With four chefs sharing a single shelf, it became a hurdle for them. John had to think again to put a stop to the situation.
In the same fashion, many processing units were installed to handle the processing of vast volumes of data (just as John hired extra chefs to manage the orders). But even then, adding processing units did not solve the problem. The real bottleneck was the central data storage unit: the performance of all the processing units was driven by the primary storage, and if it was not efficient, the entire system was affected. Hence, there was a storage problem to resolve.
John came up with another idea: divide the chefs into two categories, junior and senior, and assign each junior chef his own food shelf. Each senior chef was assigned two junior chefs. Of the two junior chefs, one would prepare the meat and the other the sauce; both ingredients would then be passed to the senior chef, who would combine them to prepare the final order.
Hadoop works in a similar way to John's restaurant. Just as the food shelves were distributed in John's restaurant, data in Hadoop is stored in a distributed manner, with replication, to provide fault tolerance. For parallel processing, the data is first processed by the slave nodes, where it is stored for a while to obtain intermediate results; these intermediate results are then merged by the master node to produce the final result.
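The slave-and-master flow above can be sketched in a few lines of plain Python. This is a toy simulation of the idea, not real Hadoop code: each "slave" counts words in its own chunk (the intermediate result), and the "master" merges the partial counts into the final answer. The function names are illustrative only.

```python
from collections import Counter

def map_on_slave(chunk):
    """Each slave node produces an intermediate word count for its chunk."""
    return Counter(chunk.split())

def reduce_on_master(partial_counts):
    """The master node merges the intermediate results into the final result."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

chunks = ["big data big problems", "big data big solutions"]
partials = [map_on_slave(c) for c in chunks]  # done in parallel on slaves
result = reduce_on_master(partials)           # merged on the master
print(result["big"])  # 4
```

In a real cluster the chunks live on different machines and the merge happens over the network, but the division of labour is the same.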
The above analogy should have given you a fair idea of how big data is a problem statement and how Hadoop solves it. As we saw in the scenario above, there are three significant challenges with big data.
Storing vast amounts of data in a traditional storage system is not possible. The reason is obvious: storage capacity is limited, but data is being generated at a rapid rate.
Data volume is one problem, but there is another problem associated with storage: heterogeneous data. The data is not only colossal, it also arrives in different formats: structured, semi-structured, and unstructured. So you have to make sure you have a system capable of storing this diversified data, which is generated from different sources and in different formats.
As big data consists of a huge number of data sets, it is tough to process the data in a short span of time.
To overcome both the storage and the processing issues, two components were created in Hadoop: HDFS and YARN. HDFS stands for Hadoop Distributed File System; it resolves the storage problem by storing data in a distributed manner, and it is easily scalable. YARN stands for Yet Another Resource Negotiator, and it is designed to reduce processing time drastically. Let's move ahead and understand what Hadoop is.
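To make the HDFS idea concrete, here is a small Python sketch of how a file is split into fixed-size blocks and each block is placed on several data nodes. The block size, node names, and round-robin placement are toy values for illustration; real HDFS uses 128 MB blocks, a default replication factor of 3, and a rack-aware placement policy.

```python
BLOCK_SIZE = 4        # bytes -- toy value; real HDFS default is 128 MB
REPLICATION = 3       # HDFS default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]  # illustrative node names

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (simple round robin)."""
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hadoop")   # 12 bytes -> 3 blocks
placement = place_blocks(blocks, DATANODES)
print(len(blocks))    # 3
print(placement[0])   # ['dn1', 'dn2', 'dn3']
```

Because every block lives on several nodes, losing one machine never means losing data, and different blocks of the same file can be processed in parallel.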
Hadoop is an open-source software framework designed to store colossal volumes of data sets in a distributed manner on large clusters of commodity hardware. Hadoop is based on a paper released by Google on MapReduce and applies concepts of functional programming. Hadoop was developed in the Java programming language and was designed by Doug Cutting and Michael J. Cafarella.
When machines work in tandem, if one device fails, another device is ready to take over its responsibility and perform its functions without any interruption. Hadoop is designed with this inbuilt fault tolerance, which makes it highly reliable.
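The takeover behaviour can be illustrated with a tiny sketch (again, a simulation, not real Hadoop code): a block is replicated on several nodes, so a read succeeds as long as at least one replica's node is still alive. The node names and the `alive` set here are made up.

```python
# Replica locations for each block (illustrative).
replicas = {"block-0": ["dn1", "dn2", "dn3"]}
alive = {"dn2", "dn3"}  # suppose dn1 has just failed

def read_block(block_id):
    """Try each replica in turn; fail only if every replica's node is down."""
    for node in replicas[block_id]:
        if node in alive:
            return f"read {block_id} from {node}"
    raise IOError(f"all replicas of {block_id} are down")

print(read_block("block-0"))  # read block-0 from dn2
```

Real HDFS goes further: the NameNode notices the dead node's missing heartbeats and re-replicates its blocks elsewhere, restoring the replication factor automatically.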
Hadoop can operate on standard commodity hardware (like your PC or laptop). For instance, in a mini Hadoop cluster, all data nodes need only a standard configuration: a 5-10 terabyte hard disk, a Xeon processor, and 8-16 GB of RAM is enough. So Hadoop is very economical and easy to run on regular PCs or laptops. More importantly, Hadoop is open-source software, so you need not pay costs like licensing.
Hadoop has an inbuilt capacity to integrate with cloud computing technology. In particular, when Hadoop is installed on a cloud platform, you need not worry about the storage problem: you can provision systems and hardware according to your requirements.
Hadoop is very flexible when it comes to dealing with different kinds of data: it can process unstructured, semi-structured, and structured data alike.
These are the four features that make Hadoop the best solution for big data challenges. Let's move forward and learn about the core components of Hadoop.
While setting up a Hadoop cluster, you will be offered many services to choose from, but two of them are mandatory: HDFS (storage) and YARN (processing). Let's look at these two in more detail.
The main components of HDFS are the NameNode and the DataNode. First, let's discuss the NameNode.
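The division of labour between the two can be modelled in a short sketch, assuming a simplified picture: the NameNode keeps only metadata (which blocks make up a file and which DataNodes hold them), while the DataNodes store the actual block contents. All names and the class shape here are hypothetical, for illustration only.

```python
class NameNode:
    """Toy model: the NameNode stores metadata only, never file contents."""

    def __init__(self):
        self.file_to_blocks = {}   # file name -> list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def add_file(self, name, blocks, locations):
        self.file_to_blocks[name] = blocks
        self.block_locations.update(locations)

    def locate(self, name):
        """Return (block id, DataNodes) pairs; a client then reads
        the actual bytes directly from those DataNodes."""
        return [(b, self.block_locations[b]) for b in self.file_to_blocks[name]]

nn = NameNode()
nn.add_file("logs.txt", ["b0", "b1"],
            {"b0": ["dn1", "dn2"], "b1": ["dn2", "dn3"]})
print(nn.locate("logs.txt"))
# [('b0', ['dn1', 'dn2']), ('b1', ['dn2', 'dn3'])]
```

This is why the NameNode is called the master of HDFS: clients ask it where data lives, but the heavy traffic of actual reads and writes flows to and from the DataNodes.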
YARN consists of two essential components: the ResourceManager and the NodeManager.
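Their relationship can be sketched with another toy model, under simplified assumptions: each NodeManager reports its capacity to the ResourceManager, which then grants containers from whichever node still has room. The class, numbers, and node names are illustrative only; real YARN scheduling (capacity queues, locality, vcores) is far richer.

```python
class ResourceManager:
    """Toy model: tracks free memory per node and hands out containers."""

    def __init__(self):
        self.free_memory = {}  # node name -> free memory in MB

    def register_node(self, node, memory_mb):
        # In real YARN, each NodeManager reports its capacity via heartbeats.
        self.free_memory[node] = memory_mb

    def allocate(self, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node, free in self.free_memory.items():
            if free >= memory_mb:
                self.free_memory[node] -= memory_mb
                return node
        return None  # cluster has no room for this request

rm = ResourceManager()
rm.register_node("nm1", 1024)
rm.register_node("nm2", 2048)
print(rm.allocate(1536))  # nm2
print(rm.allocate(1024))  # nm1
print(rm.allocate(2048))  # None -- no node has that much left
```

The key point the sketch captures: the ResourceManager holds the cluster-wide view and makes the decisions, while NodeManagers only report capacity and run the containers they are assigned.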
The above explanations and examples should have given you a brief idea of big data: how it is generated, the problems related to it, and how Hadoop helps solve these problems.
Happy learning! I will come up with a new post soon.