
Hadoop 2.0 - Understanding HDFS And YARN



Susan May
10th Aug, 2016

What is Hadoop 2.0?

Hadoop 2.0 is the second iteration of the Hadoop framework for distributed data processing. It can be considered a generational shift in the architecture of Apache Hadoop. Hadoop 2.0 has established itself as the dominant big data analysis platform in the Hadoop ecosystem. It moves beyond Hadoop 1's more restrictive processing model of batch-oriented MapReduce jobs and supports more interactive and specialized processing models.

HDFS federation and the YARN resource manager are two of the most important features introduced in Hadoop 2.0.

HDFS federation:

HDFS federation uses multiple independent NameNodes, each managing its own namespace, to scale the name service horizontally. The NameNodes are independent and do not require any coordination with each other. All NameNodes use the DataNodes as common storage for blocks, and each DataNode in the cluster registers with all the NameNodes. DataNodes send periodic heartbeats and block reports to the NameNodes and handle commands from them.
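The relationship between NameNodes and DataNodes described above can be sketched as a small in-memory model. This is only an illustration, not HDFS code: the class names, method names, namespace paths, and block IDs are all made up for the example, and heartbeats here never return any commands.

```python
from dataclasses import dataclass, field


@dataclass
class NameNode:
    """Toy NameNode: owns one namespace and knows nothing about its peers."""
    namespace: str
    live_datanodes: set = field(default_factory=set)
    block_map: dict = field(default_factory=dict)  # datanode id -> set of block ids

    def register(self, datanode_id):
        self.live_datanodes.add(datanode_id)

    def heartbeat(self, datanode_id):
        # A real NameNode may reply with commands; this sketch issues none.
        return []

    def block_report(self, datanode_id, blocks):
        self.block_map[datanode_id] = set(blocks)


@dataclass
class DataNode:
    """Toy DataNode: shared block storage for every NameNode in the cluster."""
    node_id: str
    blocks: dict = field(default_factory=dict)  # namespace -> set of block ids

    def register_with(self, namenodes):
        # Each DataNode registers with all NameNodes, not just one.
        for nn in namenodes:
            nn.register(self.node_id)

    def report_to(self, namenodes):
        for nn in namenodes:
            nn.heartbeat(self.node_id)
            # Each NameNode only hears about blocks of its own namespace.
            nn.block_report(self.node_id, self.blocks.get(nn.namespace, set()))


# Hypothetical cluster: two namespaces, two DataNodes.
namenodes = [NameNode("/users"), NameNode("/logs")]
datanodes = [DataNode("dn1"), DataNode("dn2")]
datanodes[0].blocks = {"/users": {"blk_1"}, "/logs": {"blk_7"}}
datanodes[1].blocks = {"/users": {"blk_2"}}

for dn in datanodes:
    dn.register_with(namenodes)
    dn.report_to(namenodes)
```

Note that the two NameNode objects never call each other: every interaction goes through the DataNodes, which is the property that lets the name service scale by adding NameNodes.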

 

YARN (Yet Another Resource Negotiator):

YARN is the new component added in Hadoop 2.0, and it sits between HDFS and MapReduce. YARN allows multiple applications to run on the same platform and is responsible for resource management across the Hadoop cluster. In Hadoop 1.0, MapReduce performed both cluster resource management and data processing; in Hadoop 2.0, YARN has taken over cluster resource management from MapReduce. YARN has a centralized ResourceManager component that manages the cluster's resources and allocates them to applications.

Hadoop 1.0 vs Hadoop 2.0


Let's look at some important differences between Hadoop 1.0 and Hadoop 2.0:

Hadoop 1.0

 - Limited to about 4000 nodes per cluster

 - Has only one namespace for managing HDFS

 - Map and Reduce slots are static

 - Can run only MapReduce jobs

Hadoop 2.0

 - Potentially up to 10000 nodes per cluster

 - Supports multiple namespaces for managing HDFS

 - Efficient cluster utilization (YARN)

 - Any app can integrate with Hadoop
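To make the "static slots" versus "efficient cluster utilization" contrast concrete, here is a small arithmetic sketch. The slot counts and the workload are hypothetical, and it assumes all containers are the same size, whereas real YARN containers are sized by memory and CPU.

```python
def static_slot_tasks(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Hadoop 1.0 style: a map task may only use a map slot, and likewise for reduce."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def shared_pool_tasks(total_containers, pending_maps, pending_reduces):
    """YARN style (simplified): any pending task may use any free container."""
    return min(total_containers, pending_maps + pending_reduces)


# Hypothetical node: 6 map slots + 4 reduce slots, or 10 equal-sized containers.
# Workload: a map-heavy phase with 10 pending map tasks and no reduce tasks yet.
static_running = static_slot_tasks(6, 4, pending_maps=10, pending_reduces=0)
shared_running = shared_pool_tasks(10, pending_maps=10, pending_reduces=0)

print(static_running, shared_running)  # prints "6 10"
```

With static slots, the four reduce slots sit idle during the map-heavy phase; with a shared pool, the same capacity can be handed to whichever tasks are waiting.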

YARN: NextGen Hadoop MapReduce Architecture

With YARN, applications run “in” Hadoop instead of “on” Hadoop. The YARN architecture (also called MR2) works by splitting the two major responsibilities of the JobTracker and TaskTracker, resource management and job scheduling/monitoring, into separate entities. Hadoop 2.0 replaces the JobTracker and TaskTracker with three components: ResourceManager, NodeManager, and ApplicationMaster.


The ResourceManager and the NodeManagers together form the data-computation framework.

 - ResourceManager acts as the scheduler and allocates resources among all the applications in the system.

 - NodeManager runs on each node in the cluster and takes direction from the ResourceManager. It manages the resources available on its own node.

 - ApplicationMaster, a framework-specific library, is responsible for running a specific YARN job. It negotiates resources from the ResourceManager and works with the NodeManagers to execute and monitor containers.

Containers play an important role in data processing. The ApplicationMaster coordinates the job, while the actual processing runs inside containers. A container grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
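The division of labour between the three components and containers can be sketched with a toy allocator. This is a simplified model, not the YARN API: the class and method names, the host names, and the first-fit placement rule are assumptions made for illustration. In real YARN the ApplicationMaster contacts the NodeManager to start a granted container; the sketch collapses that step into the ResourceManager's allocation.

```python
from dataclasses import dataclass


@dataclass
class Container:
    """A grant of resources on one specific host."""
    host: str
    memory_mb: int
    vcores: int


class NodeManager:
    """Tracks the free resources of a single node."""
    def __init__(self, host, memory_mb, vcores):
        self.host = host
        self.free_memory_mb = memory_mb
        self.free_vcores = vcores

    def can_fit(self, memory_mb, vcores):
        return self.free_memory_mb >= memory_mb and self.free_vcores >= vcores

    def reserve(self, memory_mb, vcores):
        # Deduct the resources and hand back a container bound to this host.
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        return Container(self.host, memory_mb, vcores)


class ResourceManager:
    """Cluster-wide scheduler: picks a node for each container request."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, memory_mb, vcores):
        # First-fit placement, chosen only for brevity (an assumption).
        for nm in self.node_managers:
            if nm.can_fit(memory_mb, vcores):
                return nm.reserve(memory_mb, vcores)
        return None  # no node has room right now


class ApplicationMaster:
    """Per-application coordinator: negotiates containers for its tasks."""
    def __init__(self, resource_manager):
        self.rm = resource_manager
        self.containers = []

    def run_tasks(self, task_count, memory_mb, vcores):
        for _ in range(task_count):
            container = self.rm.allocate(memory_mb, vcores)
            if container is None:
                break  # cluster is full; remaining tasks would have to wait
            self.containers.append(container)
        return len(self.containers)


# Hypothetical two-node cluster and one application asking for four containers.
rm = ResourceManager([NodeManager("node1", 4096, 4), NodeManager("node2", 2048, 2)])
am = ApplicationMaster(rm)
granted = am.run_tasks(4, memory_mb=2048, vcores=2)
```

In this run the cluster only has room for three containers of that size (two on node1, one on node2), so the fourth request is not granted.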


Susan May

Writer, Developer, Explorer

Susan is a gamer, internet scholar and an entrepreneur, specialising in Big Data, Hadoop, Web Development and many other technologies. She is the author of several articles published on Zeolearn and KnowledgeHut blogs. She has gained a lot of experience by working as a freelancer and is now working as a trainer. As a developer, she has spoken at various international tech conferences around the globe about Big Data.


Website : https://www.zeolearn.com
