What is Hadoop 2.0?
Hadoop 2.0 is the second major version of the Apache Hadoop framework for distributed data processing, and it represents a generational shift in Hadoop's architecture. It has established itself as the dominant big data analysis platform in the Hadoop ecosystem. Where Hadoop 1 was restricted to batch-oriented MapReduce jobs, Hadoop 2.0 also supports more interactive and specialized processing models.
Two of the most important features introduced in Hadoop 2.0 are HDFS federation and the YARN resource manager.
HDFS federation:
HDFS federation uses multiple independent NameNodes, each managing its own namespace, to scale the name service horizontally. The NameNodes are independent and do not require any coordination with each other. All NameNodes use the DataNodes as common storage for blocks, and each DataNode in the cluster registers with all the NameNodes. DataNodes send periodic heartbeats and block reports to the NameNodes and handle the commands they receive from them.
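The registration and reporting flow described above can be sketched as a toy model. This is only an illustration and not the HDFS API: the class names, method names, namespace paths, and block IDs are all hypothetical, and the sketch assumes each NameNode is told only about the blocks that belong to its own namespace.

```python
from dataclasses import dataclass, field


@dataclass
class NameNode:
    """Toy NameNode: owns one namespace and tracks the DataNodes it knows about."""
    namespace: str
    registered: set = field(default_factory=set)
    block_reports: dict = field(default_factory=dict)  # datanode id -> block ids

    def register(self, datanode_id):
        self.registered.add(datanode_id)

    def heartbeat(self, datanode_id):
        # Here a heartbeat only confirms that the sender is registered.
        return datanode_id in self.registered

    def receive_block_report(self, datanode_id, block_ids):
        self.block_reports[datanode_id] = sorted(block_ids)


@dataclass
class DataNode:
    """Toy DataNode: common block storage shared by every NameNode."""
    node_id: str
    blocks: dict = field(default_factory=dict)  # namespace -> set of block ids

    def register_with_all(self, namenodes):
        for nn in namenodes:
            nn.register(self.node_id)

    def store_block(self, namespace, block_id):
        self.blocks.setdefault(namespace, set()).add(block_id)

    def report_to_all(self, namenodes):
        # Assumption: each NameNode hears only about blocks in its own namespace.
        for nn in namenodes:
            nn.heartbeat(self.node_id)
            nn.receive_block_report(self.node_id, self.blocks.get(nn.namespace, set()))


# Two independent NameNodes that never talk to each other.
namenodes = [NameNode("/users"), NameNode("/logs")]
datanodes = [DataNode("dn1"), DataNode("dn2")]
for dn in datanodes:
    dn.register_with_all(namenodes)

datanodes[0].store_block("/users", "blk_1")
datanodes[0].store_block("/logs", "blk_2")
datanodes[1].store_block("/logs", "blk_3")
for dn in datanodes:
    dn.report_to_all(namenodes)
```

Note that the two NameNode objects never reference each other. The only thing they share is the set of DataNodes, which mirrors the "no coordination" property of federation.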
YARN (Yet Another Resource Negotiator):
YARN is a new component in Hadoop 2.0 that sits between HDFS and MapReduce. It allows multiple applications to run on the same platform and is responsible for resource management across the Hadoop cluster. In Hadoop 1.0, MapReduce performed both cluster resource management and data processing; in Hadoop 2.0, YARN takes over cluster resource management from MapReduce. YARN has a centralized ResourceManager component that manages resources and allocates them to applications.
Hadoop 1.0 vs Hadoop 2.0
Let’s check out some important differences between Hadoop 1.0 and Hadoop 2.0:
Hadoop 1.0:
- Limited to 4000 nodes per cluster
- Only has one namespace for managing HDFS
- Map and Reduce slots are static
- Runs only MapReduce jobs
Hadoop 2.0:
- Potentially up to 10000 nodes per cluster
- Supports multiple namespaces for managing HDFS
- Efficient cluster utilization (YARN)
- Any app can integrate with Hadoop
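To see why static slots can waste capacity, here is a small worked example. It is a simplified sketch and not a model of the real schedulers: the function names and the numbers (a node with 6 map slots and 2 reduce slots, 8 pending map tasks) are made up for illustration, and every task is assumed to need exactly one unit of capacity.

```python
def running_with_static_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Hadoop 1.0 style: a map task may only use a map slot, a reduce task a reduce slot."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def running_with_shared_pool(total_units, pending_maps, pending_reduces):
    """YARN style (simplified): any pending task may take any free unit of capacity."""
    return min(total_units, pending_maps + pending_reduces)


# Hypothetical node during a map-heavy phase: 8 map tasks waiting, no reduce tasks.
static = running_with_static_slots(6, 2, pending_maps=8, pending_reduces=0)
shared = running_with_shared_pool(8, pending_maps=8, pending_reduces=0)
print(static, shared)  # 6 8
```

With static slots, the two reduce slots sit idle while two map tasks wait. With a shared pool, all eight tasks run at once.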
YARN: NextGen Hadoop MapReduce Architecture
With YARN, applications run “in” Hadoop instead of “on” Hadoop. The YARN architecture (also called MR2) splits the major responsibilities of the JobTracker and TaskTracker into separate entities. Hadoop 2.0 replaces the JobTracker and TaskTracker with three components: ResourceManager, NodeManager, and ApplicationMaster.
Together, the ResourceManager and NodeManager form the data-computation framework.
- ResourceManager acts as the scheduler and allocates resources among all the applications in the system.
- NodeManager runs on each node in the cluster and takes direction from the ResourceManager. It manages the resources available on its node.
- ApplicationMaster, a framework-specific library, is responsible for running a specific YARN job: it negotiates resources from the ResourceManager and works with the NodeManagers to execute and monitor containers.
Containers play an important role in data processing. The ApplicationMaster coordinates the job and hands the work to containers, where the actual processing takes place. A container grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
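The interaction between the three components and containers can be sketched as a small simulation. This is a minimal illustration and not the YARN API: the classes only borrow the component names, the method names and the first-fit placement are assumptions, and the host names and resource sizes are made up. In particular, the ResourceManager here reserves capacity on a node directly, which glosses over the step where the ApplicationMaster works with the NodeManager to start the container.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    """A grant to use a fixed amount of memory and CPU on one specific host."""
    host: str
    memory_mb: int
    vcores: int


class NodeManager:
    """Tracks the free resources of a single node."""

    def __init__(self, host, memory_mb, vcores):
        self.host = host
        self.free_memory_mb = memory_mb
        self.free_vcores = vcores

    def can_fit(self, memory_mb, vcores):
        return self.free_memory_mb >= memory_mb and self.free_vcores >= vcores

    def reserve(self, memory_mb, vcores):
        # Deduct the resources and hand back a container bound to this host.
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        return Container(self.host, memory_mb, vcores)


class ResourceManager:
    """Cluster-wide scheduler; here a naive first-fit over the known nodes."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, memory_mb, vcores):
        for nm in self.node_managers:
            if nm.can_fit(memory_mb, vcores):
                return nm.reserve(memory_mb, vcores)
        return None  # no node has enough free capacity right now


class ApplicationMaster:
    """Per-application component that negotiates containers for its tasks."""

    def __init__(self, resource_manager):
        self.rm = resource_manager
        self.containers = []

    def request(self, count, memory_mb, vcores):
        for _ in range(count):
            container = self.rm.allocate(memory_mb, vcores)
            if container is None:
                break  # stop asking once the cluster is full
            self.containers.append(container)
        return self.containers


rm = ResourceManager([NodeManager("node1", 4096, 4), NodeManager("node2", 2048, 2)])
am = ApplicationMaster(rm)
granted = am.request(count=4, memory_mb=2048, vcores=2)
print([c.host for c in granted])  # ['node1', 'node1', 'node2']
```

The application asks for four containers but receives only three, because the two hypothetical nodes together have room for three containers of that size. Each granted container is tied to one host and one fixed resource amount.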