MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Traditional enterprise systems normally use a centralized server to store and process data. This model is not suitable for processing huge, growing volumes of data, which standard database servers cannot accommodate. When many files are processed simultaneously, the centralized server becomes a bottleneck.
Google solved this problem with a programming model called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. The results are then collected in one place and integrated to form the result dataset, which makes large-scale data processing much easier.
The various functions of MapReduce are as follows:
Shuffling and sorting take place after a map task completes. Shuffle is the process in which the system sorts the key-value output of the map tasks and transfers it to the reducers, so the input to every reducer is sorted by key. Sorting helps a reducer easily tell when a new group begins: a new reduce call starts whenever the next key in the sorted input differs from the previous one. This saves time for the reducer.
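The shuffle and sort step described above can be sketched in plain Python. This is a toy, single-process simulation and not Hadoop code; the function name shuffle_and_sort and the sample data are assumptions made for illustration, and a single reducer is assumed so that no partitioning is needed.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(map_outputs):
    """Merge the (key, value) lists of all mappers, sort by key, group values per key."""
    merged = [pair for output in map_outputs for pair in output]
    merged.sort(key=itemgetter(0))  # sort on the key only
    # Because the pairs are sorted, a change of key marks the start of a new group.
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# Hypothetical output of two map tasks.
mapper_1 = [("pig", 1), ("hadoop", 1)]
mapper_2 = [("hadoop", 1), ("hive", 1)]

grouped = shuffle_and_sort([mapper_1, mapper_2])
for key, values in grouped:
    print(key, values)
```

Each (key, values) pair in grouped corresponds to one call of the reduce function.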
An InputSplit is a logical partition of the input data in Hadoop. Each split is processed by exactly one Mapper. If a large amount of data were divided into many small physical partitions, processing it would require a very large number of mapper executions. To avoid this, the data is partitioned logically into larger pieces. The default size was originally 64 MB; nowadays 128 MB blocks are used in Hadoop. This larger size makes the data easier to replicate and process.
The most common InputFormats are:
The user of the MapReduce framework needs to specify the following configurations:
In Hadoop, a large amount of intermediate data is generated in the map phase. Transferring all of it from the mappers to the reducers takes up a lot of network resources. As a solution, a Combiner can be used with the mappers. It acts as a mini-reducer, i.e. it processes the output of a Mapper and performs local aggregation before passing the data to the reducer, which reduces the load on the reducer. The Combiner and the Reducer often use the same code; the difference is that a combiner runs alongside each mapper.
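A minimal sketch of what a combiner saves, written as a plain Python simulation rather than Hadoop code. The helper names map_words and combine are hypothetical; the point is only that local aggregation shrinks the number of records that leave the mapper.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the values per key locally, on the mapper's side."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


raw = map_words("to be or not to be")
combined = combine(raw)

print(len(raw), "records without a combiner")
print(len(combined), "records with a combiner:", combined)
```

Summation works as a combiner because it is associative and commutative, so the reducer gets the same totals whether or not the combiner ran.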
Any data type used for a value field in a mapper or reducer must implement the org.apache.hadoop.io.Writable interface so that the field can be serialized and deserialized.
By default, key fields must be comparable with each other, so they must implement Hadoop's org.apache.hadoop.io.WritableComparable interface, which in turn extends Hadoop's Writable interface and the java.lang.Comparable interface.
IdentityMapper is the default Mapper class in Hadoop. It is used when no other mapper class is specified for the job.
IdentityReducer is the default Reducer class in Hadoop. This reducer is executed when no reducer class is defined in the MapReduce job. It merely passes the input key-value pairs through to the output directory.
The distributed cache is used to distribute large read-only files that map/reduce jobs need to the cluster. The framework copies the necessary files from a URL onto a slave node before any tasks for the job are executed on that node. The files are copied only once per job, so the application should not modify them.
DistributedCache is an important feature provided by the MapReduce framework. It is used to share files across all nodes in the Hadoop cluster. The files could be executable JAR files or simple properties files.
By default, Hadoop can run 2 mappers and 2 reducers on one data node, because each node has 2 map slots and 2 reduce slots. These default values can be changed in mapred-site.xml in the conf directory.
Since this framework supports chained operations, in which the output of one MapReduce job serves as the input to another, job controls are needed to govern these complex operations.
The various job control options are:
Job.submit(): submits the job to the cluster and returns immediately
Job.waitForCompletion(boolean): submits the job to the cluster and waits for its completion
If the commodity hardware has limited storage space, the split size can be changed by writing a custom splitter. Hadoop supports this kind of customization, and it can be invoked from the main method.
Yes, MapReduce jobs can be written in many programming languages, including Java, R, C++, and scripting languages such as Python and PHP. Any language that can read from stdin, write to stdout, and parse tab and newline characters should work. Hadoop Streaming (a Hadoop utility) allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer.
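As an illustration of the stdin/stdout contract, here is a sketch of a word-count mapper and reducer that could be used as a streaming script. The function names and the "map"/"reduce" command-line argument are assumptions made for this example, not part of Hadoop Streaming itself. The reducer relies on its input arriving sorted by key, with key and value separated by a tab.

```python
import sys


def run_mapper(lines):
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def run_reducer(lines):
    """Sum the counts per word; assumes the input lines are sorted by key."""
    current, total = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"


# Hypothetical entry point: pass "map" or "reduce" as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    stage = run_mapper if sys.argv[1] == "map" else run_reducer
    for out in stage(sys.stdin):
        print(out)
```

The two stages can be tried locally without a cluster by sorting the mapper output before handing it to the reducer, which mimics the shuffle and sort phase.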
A storage node is a machine where the file system resides and stores data for further processing. A compute node is a machine where the actual business logic is executed.
A MapReduce job is a unit of work that a client wants performed. It consists of the input data, the MapReduce program in a JAR file, and configuration settings in XML files. Hadoop runs the job by dividing it into tasks with the help of the JobTracker.
An HDFS block divides data physically, whereas an InputSplit in MapReduce divides the input files logically for processing. The InputSplit also controls the number of mappers, and its size can be defined by the user. The HDFS block size, in contrast, is set to 64 MB by default, so 1 GB of data is stored as 1 GB / 64 MB = 16 blocks. If the user does not define an input split size, the split size defaults to the HDFS block size, which gives 16 splits for the same data.
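The arithmetic above can be checked with a short sketch. The split_size function follows the max(min, min(max, block)) rule commonly described for Hadoop's FileInputFormat; treat that rule, the parameter names, and the plain ceiling division as simplifying assumptions rather than an exact copy of Hadoop's behaviour.

```python
MB = 1024 * 1024
GB = 1024 * MB


def split_size(block_size, min_size=1, max_size=None):
    """Assumed rule: max(min_size, min(max_size, block_size)); no upper cap if max_size is None."""
    capped = block_size if max_size is None else min(max_size, block_size)
    return max(min_size, capped)


def num_splits(file_size, size):
    """Number of splits (and therefore mappers), using ceiling division."""
    return -(-file_size // size)


# 1 GB of input with the default 64 MB block size: 16 splits.
print(num_splits(GB, split_size(64 * MB)))

# Same input with 128 MB blocks: 8 splits.
print(num_splits(GB, split_size(128 * MB)))

# A user-defined minimum split size of 256 MB overrides the block size: 4 splits.
print(num_splits(GB, split_size(64 * MB, min_size=256 * MB)))
```

Raising the minimum split size lowers the number of mappers, while lowering the maximum split size below the block size raises it.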
The following are the key concepts of how MapReduce works in Hadoop:
MapReduce works exclusively with <key, value> pairs. It views the job inputs as a set
of <key, value> pairs and outputs a set of <key, value> pairs as well. The map and
reduce functions in a MapReduce program have the following general form:
Map: (k1, v1) => list(k2, v2)
Reduce: (k2, list(v2)) => list(k3, v3)
Normally, the map input key and value types (k1 and v1) differ from the map output
types (k2 and v2). The reduce input types must match the output types of the map
process. The reduce output types (k3 and v3) may be different again.
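The general form above can be made concrete with a word count written as a single-process Python simulation. The names map_fn, reduce_fn, and run_job are hypothetical, and the byte-offset keys imitate what a text input format would supply; nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Map: (k1, v1) = (byte offset, line text) => list of (k2, v2) = (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: (k2, list(v2)) = (word, [1, 1, ...]) => list of (k3, v3) = (word, total)."""
    return [(word, sum(counts))]


def run_job(records):
    """Run map, group the intermediate pairs by key, then run reduce in key order."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


records = [(0, "Hadoop reads data"), (18, "hadoop writes data")]
result = run_job(records)
print(result)
```

Here k1 is an integer and v1 a line of text, while k2 and k3 are words and v2 and v3 are counts, which shows the map input types differing from the map output types.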
By default, the block size is 64 MB, but to process the data, the JobTracker splits it. Hadoop architects use the following formulas to determine the split size.
By default, split size = block size.
The number of splits always equals the number of mappers.
Applying the above formula:
Pig is a data flow language; its main job is to manage the flow of data from an input source to an output store. As part of managing this data flow, it moves data along by feeding it to process 1, then taking the output and feeding it to process 2.
The core features of Pig are that it prevents the execution of subsequent stages if a previous stage fails, manages temporary storage of data, and, most importantly, compresses and rearranges processing steps for faster processing. While this could be done for any kind of processing task, Pig is written specifically for managing the data flow of MapReduce-type jobs. Most, if not all, jobs in Pig are MapReduce jobs or data movement jobs. Pig allows custom functions to be added for use in processing; built-in ones include ordering, grouping, distinct, and count.
MapReduce, on the other hand, is a data processing paradigm. It is a framework in which application developers write code so that it scales easily to petabyte-scale workloads. This creates a separation between the developer who writes the application and the developer who scales it. Not all applications can be migrated to MapReduce, but a good few can, ranging from complex ones such as k-means to simple ones such as counting unique values in a dataset.
Hadoop reads data in parallel so that it can retrieve data faster. Writes, however, are sequential rather than parallel, because with parallel writes one node could overwrite data written by another, and no node would know where the next chunk of data belongs. Parallel processes are independent, so there is no relation between two nodes. For example, when 100 MB of data is written, it is divided into one 64 MB block and one 36 MB block; if the blocks were written in parallel, the first block would not know where the remaining data is. So Hadoop reads in parallel and writes sequentially.
Local aggregation (combining of key/value pairs) is done inside the mapper.
The map method does not emit key/value pairs; it only updates an internal data structure. The close method combines and preprocesses all stored data and emits the final key/value pairs. The internal data structure is initialized in the init method.
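The init/map/close life cycle described above can be sketched as a small Python class. This is an illustrative simulation, not a Hadoop Mapper; the class name InMapperCombiner and the word-count logic are assumptions chosen to keep the example short.

```python
from collections import defaultdict


class InMapperCombiner:
    """Toy mapper that aggregates locally instead of emitting one pair per word."""

    def init(self):
        # The internal data structure is created once, before any input is seen.
        self.totals = defaultdict(int)

    def map(self, key, line):
        # Nothing is emitted here; only the internal state is updated.
        for word in line.split():
            self.totals[word] += 1

    def close(self):
        # All stored data is combined and emitted once, at the end of the task.
        return sorted(self.totals.items())


mapper = InMapperCombiner()
mapper.init()
for offset, line in enumerate(["a b a", "b c"]):
    mapper.map(offset, line)
emitted = mapper.close()
print(emitted)
```

The trade-off of this pattern is memory: the internal structure must hold every distinct key the mapper sees until close is called.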
The first step in scheduling a job is that the JobTracker communicates with the NameNode to identify the data location and submits the work to a TaskTracker node. The TaskTracker then plays a major role, as it notifies the JobTracker of any task failure. It also sends heartbeats, which reassure the JobTracker that it is still alive. The JobTracker then decides what to do: it may resubmit the job, mark a specific record as unreliable, or blacklist the TaskTracker.
Either a combiner or a mapper combines key/value pairs that share the same key. They may also do some additional preprocessing of the combined values. Only key/value pairs produced by the same mapper are combined.
Key/value pairs created by map tasks are transferred between nodes during the shuffle and sort phase. Local aggregation reduces the amount of data to be transferred.
If the distribution of values over keys is skewed, data preprocessing in the combiner helps to eliminate reduce stragglers.
There are five separate daemon processes on a Hadoop system. Each daemon process runs in its own JVM. Of the five, three run on the master node and two run on the slave nodes.
The daemon processes are as follows:
The MapReduce framework consists of a single JobTracker per cluster and one TaskTracker per node. A cluster usually has multiple nodes, so each cluster has a single JobTracker and multiple TaskTrackers. The JobTracker schedules jobs and monitors the TaskTrackers. If a TaskTracker fails to execute its tasks, the JobTracker tries to re-execute the failed tasks.
The TaskTracker follows the JobTracker's instructions and executes the tasks. As a slave node, it reports job status to the master JobTracker in the form of heartbeats.
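The heartbeat and re-execution behaviour can be pictured with a toy model. Everything in this sketch is hypothetical: the function names, the tracker and task names, the timeout value, and the round-robin reassignment are assumptions for illustration and do not reflect how the JobTracker is actually implemented.

```python
def find_lost_trackers(last_heartbeat, now, timeout):
    """Return the trackers whose last heartbeat is older than the timeout."""
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > timeout
    )


def reassign_tasks(assignments, lost, healthy):
    """Move every task that sits on a lost tracker to a healthy one (round-robin)."""
    if not healthy:
        raise RuntimeError("no healthy TaskTracker available")
    result = dict(assignments)
    moved = 0
    for task in sorted(assignments):
        if assignments[task] in lost:
            result[task] = healthy[moved % len(healthy)]
            moved += 1
    return result


# Hypothetical state: time of the last heartbeat per tracker, in seconds.
last_heartbeat = {"tt1": 100.0, "tt2": 40.0, "tt3": 98.0}
assignments = {"map-0": "tt1", "map-1": "tt2", "map-2": "tt2"}

lost = find_lost_trackers(last_heartbeat, now=105.0, timeout=30.0)
healthy = sorted(t for t in last_heartbeat if t not in lost)
new_assignments = reassign_tasks(assignments, lost, healthy)
print(lost, new_assignments)
```

In this model a tracker that stops sending heartbeats is treated as failed, and only the tasks that were running on it are scheduled again.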