Prepare better with the best interview questions and answers, and walk away with top interview tips. These interview questions and answers will boost your core interview skills and help you perform better. Be smarter with every interview.
An operation is a method, which can be applied on RDD to accomplish certain task. There are two types of RDD operations: Action and Transformation.
Transformations are kind of operations which transforms the RDD data from one form to another. And when you apply this operation on any RDD, you will get a new RDD with transformed data (RDDs in Spark are immutable). Operations like map, filter, flatMap are transformations.
Transformations create RDDs from each other, but when we want to work with the actual dataset, at that point action is performed. When any action is triggered, new RDD is not formed (unlike transformation). Thus, actions are RDD operations that will give non-RDD values. The output of actions are stored to drivers or to the external storage system.
Explanation: Lazy evaluation means that the Spark execution will not start until an action is triggered. In Spark, lazy evaluation comes into picture when Spark transformations occur.
Transformations are lazy in nature. When some operation in RDD is called, it does not execute immediately. Spark maintains the record of all operations being called through Directed Acyclic Graph. Spark RDD can be thought as the data, that we built up through transformation. Since transformations are lazy in nature, so we can execute operation any time by calling an action on data. Thus, in lazy evaluation data is not loaded until it is necessary.
Explanation: DAG refers to Directed Acyclic Graph. Here, the word “directed” means that iit is a finite directed graph with no directed cycles. There are finite numbers of vertices and edges, where each edge is directed from one vertex to another. It contains sequence of vertices such that every edge is directed from earlier to later in the sequence.
Explanation: Once any action is performed on an RDD, Spark context gives the program to driver. The driver creates the directed acyclic graph or execution plan (job) for the program. Once the DAG is created, the driver divides it into a number of stages. These stages are then divided into number of smaller tasks and all these tasks are given to the executors for execution.
The Spark driver is accountable for converting a user program into units of physical execution called tasks. All Spark programs follow the same structure at a high level. They create RDDs from some input, derive new RDDs from those using transformations, and perform actions to collect or save data. A Spark program implicitly creates a logical directed acyclic graph of operations.When the driver runs, it converts this logical graph into a physical execution plan i.e tasks.
Explanation: Shuffling is a process of redistributing data across partitions (aka repartitioning) that may or may not cause moving data across JVM processes or even over the wire (between executors on separate machines). It is the process of data transfer between stages.
Explanation: There are two types of “Deploy modes” in spark: Client Mode and Cluster Mode. If the driver component of spark job runs on the machine from which job is submitted, in that case deploy mode is “client mode” and if the “driver” component of spark job will not run on the local machine from which job is submitted then the deploy mode is basically “cluster mode”. Here spark job will launch “driver” component inside the cluster
Spark need not be installed when running a job under YARN or Mesos because Spark can execute on top of YARN or Mesos clusters without affecting any change to the cluster.
The Spark UI is available on port 4040 of the driver node. If you are running in local mode, this will be http://localhost:4040. The Spark UI displays information on the state of your Spark jobs, its environment, and cluster state. It’s very useful, especially for tuning and debugging.
A DataFrame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. Due to it tabular format, a DataFrame has additional metadata, which allows Spark to run certain optimizations on the finalized query.
An RDD is a Resilient Distributed Dataset that is more of a blackbox of data that cannot be optimized as the operations that can be performed against it, are not as constrained.
However, one can go from a DataFrame to an RDD via its rdd method. Similarly, from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
Receivers are special entities in Spark Streaming that consume data from various data sources and move them to Apache Spark. Receivers are usually created by streaming contexts as long running tasks on various executors and scheduled to operate in a round robin manner with each receiver taking a single core.
Sliding Window controls transmission of data packets between various computer networks. Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
DStreams, or discretized streams, are high-level abstractions provided in Spark Streaming that represents a continuous stream of data. DStreams can be either created from input sources such as Kafka, Flume or Kinesis; or by applying high-level operations on existing DStreams.
Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval.
The VectorAssembler is a tool that is used in nearly every single pipeline API you generate. It helps concatenate all your features into one big vector you can then pass into an estimator. It’s used typically in the last step of a machine learning pipeline and takes as input a number of columns of Boolean, Double, or Vector. This is particularly helpful if you’re going to perform a number of manipulations using a variety of transformers and need to gather all of those results together.
Caching is an optimization techniques for iterative and interactive Spark computations. They help saving interim partial results so they can be reused in subsequent stages. This helps to speed up applications that access the same RDD multiple times. An RDD that is not cached, nor checkpointed, is re-evaluated again each time an action is invoked on that RDD.
There are two function calls for caching an RDD: cache() and persist(level: StorageLevel). The difference among them is that cache() will cache the RDD into memory, whereas persist(level) can cache in memory, on disk, or off-heap memory according to the caching strategy specified by level. persist() without an argument is equivalent with cache().
Spark stores data in-memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance whereas Spark uses different data storage model, RDD. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition.This removes the need for replication to achieve fault tolerance.
There are two types transformation on DStream:
1) Stateless transformation : In stateless transformation, the processing of each batch does not depend on the data of its previous batches. Each stateless transformation applies separately to each RDD.
Examples: map(), flatMap(), filter(), repartition(), reduceByKey(), groupByKey().
2) Stateful transformation : stateful transformation use data or intermediate results from previous batches to compute the result of the current batch. The stateful transformations on the other hand allow us combining data across time.
Example:updateStateByKey and mapWithState
Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster. Whenever Spark ships a code to every executor the variables become local to that executor and its updated value may not relay back to the driver. To avoid this problem we need to make an accumulator such that all the updates to such variable in every executor is relayed back to the driver.
CreateorReplaceTempView is used when you want to store the table for a particular spark session and CreateGlobalTempView is used when you want to share the temp table across multiple spark sessions.
On the map side, each map task in Spark writes out a shuffle file (os disk buffer) for every reducer – which corresponds to a logical block in Spark. These files are not intermediary in the sense that Spark does not merge them into larger partitioned ones. Since scheduling overhead in Spark is lesser, the number of mappers and reducers is far higher than in Hadoop. Thus, shipping map-reduce files to the respective reducers could result in significant overheads.
On the reduce side, Spark requires all shuffled data to fit into memory of the corresponding reducer task, on the contrary of Hadoop that had an option to spill this over to disk. This would of course happen only in cases where the reducer task demands all shuffled data for a GroupByKey or a ReduceByKey operation, for instance. Spark throws an out-of-memory exception in this case.
A typed dataset is a dataset that is first derived from the base DataSet class and then uses information from the Dataset Designer, to generate a new, strongly typed dataset class
An untyped dataset, in contrast, has no corresponding built-in schema. An untyped dataset contains tables, columns, and so on—but those are exposed only as collections
Taking user code and converting it into a logical plan is the first phase of execution. The logical plan only converts the user’s set of expressions into the most optimized version. It does this by converting the user code into an unresolved logical plan. This logical plan is unresolved because although your code might be valid, the tables or columns that it refers to may or may not exist. Spark uses a repository of all table, catalog and DataFrame information to resolve columns and tables in the analyzer. The analyzer may reject the unresolved logical plan if the required table or column name does not exist in the catalog. If the analyzer is able to resolve it, the result is passed through the Catalyst Optimizer. The Packages can extend Catalyst to include their own rules for domain specific optimizations.
After successfully creating optimized logical plan, Spark begins the physical planning process. The physical plan, called Spark plan, specifies how the logical plan will execute on the cluster by generating different physical execution strategies and comparing them through a cost model. An example of the cost comparison could be choosing how to perform a given join by looking at the physical attributes of a given table. A series of RDDs and transformations is the results of Physical Planning. This result is why it is sometimes said that Spark is referred to as a compiler—it takes queries in DataFrames, Datasets, and SQL and compiles them into RDD transformations.
Cost-Based Optimization (aka Cost-Based Query Optimization or CBO Optimizer) is an optimization technique in Spark SQL that uses table statistics to determine the most efficient query execution plan of a structured query (given the logical query plan). The Cost-based optimization is disabled by default. Spark SQL uses spark.sql.cbo.enabled configuration property to control whether the CBO should be enabled and used for query optimization or not. Cost-Based Optimization uses logical optimization rules (e.g. CostBasedJoinReorder) to optimize the logical plan of a structured query based on statistics.
We first use ANALYZE TABLE COMPUTE STATISTICS SQL command to compute table statistics and then with DESCRIBE EXTENDED SQL command, inspect the statistics.Logical operators have statistics support that is used for query planning.
Prior to spark 2.0.0 SparkContext was used as a channel to access all spark functionality. The spark driver program uses spark context to connect to the cluster through a resource manager.
sparkConf is required to create the spark context object, which stores configuration parameter like appName (to identify your spark driver), application, number of core and memory size of executor running on worker node In order to use APIs of SQL,HIVE , and Streaming, separate contexts need to be created.
SPARK 2.0.0 onwards, SparkSession provides a single point of entry to interact with underlying Spark functionality and allows programming Spark with Dataframe and Dataset APIs. All the functionality available with sparkContext are also available in sparkSession. In order to use APIs of SQL, HIVE, and Streaming, no need to create separate contexts as sparkSession includes all the APIs.
At high level, when any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.
The DAG scheduler divides operators into stages of tasks. A stage is comprised of tasks based on partitions of the input data. DAG allows the user to dive into the stage and expand on detail on any stage. In the stage view, the details of all RDDs belonging to that stage are expanded. The Scheduler splits the Spark RDD into stages based on various transformation applied. Each stage is comprised of tasks, based on the partitions of the RDD, which will perform same computation in parallel.
There are two types of transformations:
Narrow transformation — In Narrow transformation, all the elements that are required to compute the records in single partition live in the single partition of parent RDD. A limited subset of partition is used to calculate the result. Narrow transformations are the result of map(), filter().
Wide transformation — In wide transformation, all the elements that are required to compute the records in the single partition may live in many partitions of parent RDD. The partition may live in many partitions of parent RDD. Wide transformations are the result of groupbyKey and reducebyKey.
To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in your cluster. A DataFrame’s partitions represent how the data is physically distributed across the cluster of machines during execution. If you have one partition, Spark will have a parallelism of only one, even if you have thousands of executors. If you have many partitions but only one executor, Spark will still have a parallelism of only one because there is only one computation resource. With DataFrames you do not manipulate partitions manually or individually. You simply specify high-level transformations of data in the physical partitions, and Spark determines how this work will actually execute on the cluster
The coalesce operation effectively collapses partitions on the same worker in order to avoid a shuffle of the data when repartitioning.
The repartition operation allows you to repartition your data up or down but performs a shuffle across nodes in the process. Increasing the number of partitions can increase the level of parallelism when operating in map- and filter-type operations.
The sole goal of custom partitioning is to even out the distribution of your data across the cluster so that you can work around problems like data skew. To perform custom partitioning you need to implement your own class that extends Partitioner.
The Pipelines API encapsulates a simple, tidy view of these machine learning tasks: at each stage, data is turned into other data, and eventually turned into a model, which is itself an entity that just creates data (predictions) from other data too (input).
ML Pipelines consists of the following key components.
DataFrame - The Apache Spark ML API uses DataFrames provided in the Spark SQL library to hold a variety of data types such as text, feature vectors, labels and predictions.
Transformer - A transformer is an algorithm that transforms one dataframe into another dataframe.
Estimators - An estimator is an algorithm that can be applied on a dataframe to produce a Transformer.
Serialization is an important concept in most distributed applications, of course Spark is included. A serialization framework helps you convert objects into a stream of bytes and vice versa in new computing environment. This is very helpful when you try to save objects to disk or send them through networks. Those situations happen in Spark when things are shuffle around.
In spark development, it is very common to come across serialization errors, especially in Spark Streaming applications. There are typically two ways to handle this:
1) make the object/class serializable;
2) declare the instance within the lambda function