This article focuses on Apache Spark vs Hadoop. But before jumping into the river, we should know how to swim — and in this context, swimming means Big Data. So, what actually is Big Data, and where did the term come from? Big Data is a broad term for data sets so large or complex that traditional data processing applications are inadequate to handle them.
Here is something that will surely increase your curiosity about Big Data: over 90% of the world's data has been created in the last two years! Data growth has been simply unbelievable. In 2009, estimates put the world's data at about 0.8 zettabytes; by 2020, it was expected to reach 35 zettabytes. According to The New York Times, the size of the digital universe was estimated to hit 40 trillion gigabytes by 2020.
Apache Spark and Hadoop are two of the most significant members of the Big Data family. Some view these two frameworks as competitors in the big data space. It's not that easy to compare Apache Spark and Hadoop, because they do many of the same things, yet there are areas where they don't overlap at all. For example, Apache Spark has no file system of its own and therefore often relies on the Hadoop Distributed File System (HDFS).
Google Trends showing interest in Apache Spark and Hadoop over the years.
As shown in Google Trends, Hadoop has been more popular than Apache Spark. Even so, companies like Yahoo, Intel, Baidu, Trend Micro, and Groupon are already using Apache Spark.
Apache Spark and Hadoop can be compared on several parameters. In this post, let's compare them and see why Spark has been gaining popularity.
Performance:
Spark vs Hadoop performance is somewhat difficult to compare because the two process data differently. Still, there is a clear reason why Spark can process data faster: it processes data directly in memory, spilling to disk only when the data does not fit. MapReduce, by contrast, writes intermediate results to disk between stages, which is costly for iterative workloads.
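To make the in-memory point concrete, here is a toy, pure-Python sketch (not Spark itself): one function re-parses its input from disk on every pass, as a chain of MapReduce jobs effectively does, while the other parses once and keeps the data cached in memory, which is the pattern Spark's caching enables. The file, sizes, and function names are invented for illustration.

```python
import os
import tempfile

def make_input(path, n=1000):
    """Write a small numeric input file, standing in for a dataset on HDFS."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")

def iterate_from_disk(path, iterations):
    """MapReduce-style: re-read and re-parse the input on every iteration."""
    total = 0
    for _ in range(iterations):
        nums = [int(line) for line in open(path)]  # disk hit each pass
        total += sum(nums)
    return total

def iterate_in_memory(path, iterations):
    """Spark-style: parse once, cache in memory, then iterate over the cache."""
    nums = [int(line) for line in open(path)]      # single disk hit
    return sum(sum(nums) for _ in range(iterations))

path = os.path.join(tempfile.mkdtemp(), "input.txt")
make_input(path)
# Both produce the same answer; only the number of disk reads differs.
assert iterate_from_disk(path, 5) == iterate_in_memory(path, 5)
```

For a handful of iterations the difference is small, which mirrors why MapReduce can hold its own on short jobs; as iterations grow, repeatedly re-reading input becomes the dominant cost.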
Ease of use:
Spark comes with user-friendly APIs for Scala, Python, and Java, plus Spark SQL, which is quite similar to SQL92, making it easier still to pick up. MapReduce, on the other hand, has add-ons such as Pig and Hive that make it somewhat easier to use as well.
For graph processing, the Hadoop ecosystem turns to tools such as Pregel and GraphLab, which are scalable and fast but not well suited to complex, multi-stage algorithms. Spark addresses this with GraphX, which provides built-in graph abstractions and in-memory computation that are more efficient than MapReduce for such workloads.
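As a rough illustration of the vertex-centric, iterative model that Pregel and GraphX are built around, here is a toy single-machine Python sketch of connected components via label propagation: each vertex repeatedly adopts the smallest label seen among itself and its neighbours. This is not the GraphX API — just the shape of the computation GraphX runs in memory across a cluster.

```python
def connected_components(edges, vertices):
    """Label-propagation connected components, Pregel-style supersteps."""
    label = {v: v for v in vertices}   # each vertex starts as its own component
    changed = True
    while changed:                     # one pass over edges = one "superstep"
        changed = False
        for a, b in edges:
            low = min(label[a], label[b])
            for v in (a, b):
                if label[v] > low:     # vertex adopts the smaller label
                    label[v] = low
                    changed = True
    return label

edges = [(1, 2), (2, 3), (4, 5)]
result = connected_components(edges, {1, 2, 3, 4, 5})
# Vertices 1-3 collapse to component 1; vertices 4-5 to component 4.
assert result == {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

Each pass of the `while` loop depends on the labels from the previous pass, which is exactly the kind of iteration that benefits from keeping the graph in memory rather than rewriting it to disk between MapReduce jobs.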
The graph below shows that MapReduce performs better than Spark at a small number of iterations but falls behind as the number of iterations grows.
Security:
The security of Apache Spark is not considered up to the mark as of now. It supports authentication via a shared secret. Spark can also run on YARN, which makes Kerberos authentication possible, and when Spark reads from HDFS it can honor HDFS file-level permissions.
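As a concrete sketch, shared-secret authentication is switched on through Spark configuration. The property names below are Spark's own; the secret value is a placeholder for illustration.

```properties
# spark-defaults.conf — minimal sketch of shared-secret authentication
spark.authenticate        true
spark.authenticate.secret my-placeholder-secret
```

When Spark runs on YARN, the shared secret is handled automatically for the application, and Kerberos can be used to secure the cluster itself.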
Hadoop, on the other hand, enjoys all the benefits of the Hadoop security projects such as Apache Sentry and Apache Knox Gateway. HDFS supports Access Control Lists (ACLs), and Hadoop also supports Service Level Authorization, ensuring clients have the proper permissions.
Who wins the Battle!!
Without a doubt, Apache Spark dominates Hadoop MapReduce in many different areas, and Spark is considered faster than MapReduce in most cases. Hadoop wins over Spark when the available memory is significantly smaller than the size of the data. In the near future, it is quite possible that Spark will replace MapReduce.
The points covered above are the key differences between Hadoop and Spark. The comparison is by no means limited to this article, though, and further reading on the topic is recommended.