
Apache Spark vs Hadoop MapReduce - Who wins the Battle?


Susan May
Blog
23rd Nov, 2016

This article compares Apache Spark with Hadoop MapReduce. But before jumping into the river we should know how to swim, and here swimming means Big Data. So what actually is Big Data? It is a broad term for data sets so large or complex that traditional data processing applications are inadequate.

Here is something to raise your curiosity about Big Data: over 90% of the world's data has been created in the last two years. Data growth has been tremendous. Estimates for 2009 put the total at about 0.8 zettabytes, and it is expected to reach 35 zettabytes by 2020. According to the New York Times, the digital universe is estimated to reach 40 trillion gigabytes (40 zettabytes) by 2020.

Apache Spark and Hadoop are two of the most significant members of the Big Data family. Some view the two frameworks as competitors in the big data space. Comparing them is not that easy, because they do many of the same things but there are also areas where they do not overlap. For example, Apache Spark has no file system of its own and therefore relies on external storage such as the Hadoop Distributed File System (HDFS).

Google Trends showing the interest of Apache Spark and Hadoop over the years.

As Google Trends shows, Hadoop is more popular than Apache Spark. Even so, companies such as Yahoo, Intel, Baidu, Trend Micro, and Groupon are already using Apache Spark.

Apache Spark and Hadoop can be compared on several parameters. In this post, let's compare them and see why Spark has gained popularity.

Performance:

Spark and Hadoop performance is somewhat difficult to compare because the two process data differently. Still, there is a clear reason why Spark can process data faster: it works on the data directly in memory and stores data on disk only when it does not fit in memory. MapReduce, by contrast, writes intermediate results back to disk between steps.
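To make the "memory first, disk when it does not fit" idea concrete, here is a minimal sketch in plain Python. It is not Spark code: the `SpillingStore` class, its `memory_budget` parameter (counted in partitions) and the file naming are all hypothetical, chosen only to illustrate the behaviour described above.

```python
import os
import pickle
import tempfile


class SpillingStore:
    """Toy partition store (hypothetical, not a Spark class).

    Partitions stay in memory until the budget is used up;
    any further partition is written to a temporary directory on disk.
    """

    def __init__(self, memory_budget):
        self.memory_budget = memory_budget  # assumed unit: number of partitions
        self.in_memory = {}                 # partition id -> data
        self.on_disk = {}                   # partition id -> file path
        self.spill_dir = tempfile.mkdtemp()

    def put(self, pid, data):
        if len(self.in_memory) < self.memory_budget:
            self.in_memory[pid] = data
        else:
            # Memory is full: spill this partition to disk instead.
            path = os.path.join(self.spill_dir, f"part-{pid}.pkl")
            with open(path, "wb") as f:
                pickle.dump(data, f)
            self.on_disk[pid] = path

    def get(self, pid):
        if pid in self.in_memory:
            return self.in_memory[pid]       # fast path, no I/O
        with open(self.on_disk[pid], "rb") as f:
            return pickle.load(f)            # slow path, read back from disk


store = SpillingStore(memory_budget=2)
for pid in range(4):
    store.put(pid, list(range(pid * 3, pid * 3 + 3)))

# Every partition is still readable, wherever it ended up.
total = sum(sum(store.get(pid)) for pid in range(4))
print(sorted(store.in_memory), sorted(store.on_disk), total)
```

With a budget of two partitions, the first two stay in memory and the other two are read back from disk on access, which is the slower path the paragraph above refers to.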

Ease of use:

Spark comes with user-friendly APIs for Scala, Python, and Java, plus Spark SQL, which is considered very similar to SQL-92 and so makes Spark easier still to use. MapReduce, on the other hand, has add-ons such as Pig and Hive that make it somewhat easier to use too.
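The difference in programming style can be sketched with a word count in plain Python. This is neither the Hadoop nor the Spark API: the function names `map_phase`, `shuffle_phase` and `reduce_phase` are hypothetical and only mirror the explicit phases of the MapReduce model, while the one-line version stands in for the kind of concise, high-level expression that friendlier APIs aim for.

```python
from collections import Counter, defaultdict

lines = ["spark and hadoop", "hadoop mapreduce", "spark"]


def map_phase(line):
    # Emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    # Group all values that share the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Collapse the grouped values into a single count per word.
    return key, sum(values)


# MapReduce style: three explicit phases wired together by hand.
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts_mr = dict(reduce_phase(k, v) for k, v in grouped.items())

# High-level style: the same result in a single expression.
counts_short = dict(Counter(word for line in lines for word in line.split()))

print(counts_mr)
print(counts_mr == counts_short)
```

Both versions compute the same counts; the point is only how much plumbing the programmer has to write by hand.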

Graph Processing:

MapReduce has no graph library of its own; graph workloads are usually handled by separate tools such as Pregel-style systems and GraphLab, which are scalable and fast but not suitable for post-processing in complex multi-stage algorithms. Spark solves these problems with GraphX, its built-in graph library, whose in-memory computation is more efficient than MapReduce.

The graph below shows that MapReduce performs better than Spark for a small number of iterations but falls behind as the number of iterations grows.

Logistic regression performance in MapReduce and Spark by number of iterations.
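One way to see why such a crossover can happen is a toy cost model. Everything here is an assumption: the function names and the numbers for `disk_read`, `cache_setup` and `compute` are made-up units, not measurements of Hadoop or Spark. The model only assumes that a disk-based job re-reads its input on every iteration, while an in-memory job pays a one-time loading and caching cost and then iterates cheaply.

```python
def mapreduce_cost(iterations, disk_read=10.0, compute=1.0):
    # Disk-based model: the input is read from disk again on every iteration.
    return iterations * (disk_read + compute)


def cached_cost(iterations, disk_read=10.0, cache_setup=25.0, compute=1.0):
    # In-memory model: read once, pay a one-time caching overhead,
    # then every iteration only costs the computation itself.
    return disk_read + cache_setup + iterations * compute


for n in (1, 2, 5, 10):
    print(n, mapreduce_cost(n), cached_cost(n))

# First iteration count at which the in-memory model becomes cheaper.
crossover = next(n for n in range(1, 100) if cached_cost(n) < mapreduce_cost(n))
print("crossover at", crossover, "iterations")
```

With these made-up numbers the disk-based model is cheaper for the first few iterations, and the in-memory model wins from then on. Different assumptions would move the crossover point or remove it entirely.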

Security:

Spark's security is not yet considered up to the mark. It supports authentication by shared secret. Spark can also run on YARN, which makes Kerberos authentication possible, and when Spark runs on HDFS it can use file-level permissions.
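As a rough illustration of what authentication by shared secret means in general, here is a small challenge-response sketch using Python's standard `hmac` module. It is not Spark's actual handshake: the names `make_challenge`, `sign` and `verify` are hypothetical, and the only assumption is that both sides were given the same secret in advance.

```python
import hashlib
import hmac
import secrets

# Assumption: both sides received this secret out of band.
SHARED_SECRET = secrets.token_bytes(32)


def make_challenge():
    # The verifier sends a fresh random challenge for every attempt.
    return secrets.token_bytes(16)


def sign(secret, challenge):
    # The client proves knowledge of the secret without sending it.
    return hmac.new(secret, challenge, hashlib.sha256).digest()


def verify(secret, challenge, response):
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(sign(secret, challenge), response)


challenge = make_challenge()
good = verify(SHARED_SECRET, challenge, sign(SHARED_SECRET, challenge))
bad = verify(SHARED_SECRET, challenge, sign(b"wrong-secret", challenge))
print(good, bad)
```

A scheme like this only shows that the caller knows the secret. It says nothing about who the caller is or what they may do, which is part of why the article treats Kerberos and file-level permissions as the stronger options.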

Hadoop, on the other hand, benefits from Hadoop security projects such as Sentry and Knox Gateway. HDFS supports access control lists (ACLs), and Hadoop also supports Service Level Authorization, which ensures that clients have the proper permissions.

Who wins the Battle?

Without a doubt, Apache Spark dominates Hadoop MapReduce in many areas, and it is considered faster than MapReduce in most cases. Hadoop wins over Spark when the available memory is significantly smaller than the data. In the near future, Spark may well replace MapReduce.

The points above cover the main differences between Hadoop and Spark. The comparison is not exhaustive, however, and further research on the topic is recommended.

Susan May

Writer, Developer, Explorer

Susan is a gamer, internet scholar and an entrepreneur, specialising in Big Data, Hadoop, Web Development and many other technologies. She is the author of several articles published on Zeolearn and KnowledgeHut blogs. She has gained a lot of experience by working as a freelancer and is now working as a trainer. As a developer, she has spoken at various international tech conferences around the globe about Big Data.


Website : https://www.zeolearn.com
