Guest blog post by Tanmay Bhandari
Originally posted on Data Science Central
In the book Hadoop: The Definitive Guide, Tom White quotes Grace Hopper: “In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” Hadoop has long been the data analytics system preferred by businesses everywhere. The arrival of the Spark engine, however, has given businesses an alternative to Hadoop for data analytics.
Much of the discussion among big data experts concerns which of the two engines, Hadoop or Spark, performs better in business applications. Hadoop has been around for a long time, while Spark is a much more recent arrival that became a top-level Apache project only in 2014. Both are open source projects maintained under the Apache Software Foundation.
Both Hadoop and Spark have their own strengths. There are some applications in which Hadoop scores above Spark, but Spark's ease of use and speed of operation are well ahead of Hadoop's. Some functions of the two systems also overlap. All of these factors need to be kept in mind when comparing Hadoop and Spark.
The Hadoop data analytics engine:
In many projects undertaken nowadays, data storage is distributed. This is because of the huge volume of data, often measured in petabytes, that businesses generate. Rather than spending heavily on custom storage devices that keep all the data in one place, it is more practical for businesses to spread the data across many storage devices, such as ordinary disks. Hadoop is a framework for processing data that is distributed across several storage devices in this way. It was initially created to go through millions of web pages and collect data relevant to them. Hadoop MapReduce is an important component of Hadoop: it is the framework's distributed processing engine.
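To make the MapReduce idea more concrete, here is a minimal sketch in plain Python of the classic word count, split into map, shuffle and reduce steps. It runs on a single machine and does not use Hadoop at all; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration. In a real cluster, each step would run in parallel on the machines that hold the data.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: add up the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run map over every document, then shuffle and reduce the results."""
    mapped = []
    for document in documents:
        mapped.extend(map_phase(document))
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input: two tiny "web pages".
pages = ["the ox pulls the log", "the log is heavy"]
print(word_count(pages))
```

Because each document is mapped independently, the map step can be handed out to as many machines as there are chunks of data, which is the property that lets this style of processing scale.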
Hadoop vs Spark:
One of Spark's biggest advantages over Hadoop is its speed of operation. Spark is said to process data sets up to 100 times faster than Hadoop MapReduce when the data fits in memory. Another selling point of Spark is its ability to process data in near real time, whereas Hadoop has a batch processing engine. This allows Spark to apply data analytics to information drawn from business campaigns, internet of things systems, social media, and manufacturing facilities and factories as it arrives. Hadoop MapReduce, on the other hand, cannot process data in real time.
Spark doesn’t have its own distributed file system, while Hadoop has HDFS (the Hadoop Distributed File System). The file system takes care of storing and organizing files across the machines in a cluster. Because Spark is compatible with Hadoop, most businesses run Spark alongside Hadoop to take advantage of both Spark’s superior data analytics and Hadoop’s HDFS.
With Hadoop MapReduce, data is written back to the storage device after processing, so that it can be recovered in case of failure. This approach, however, does not make optimal use of the available memory. Spark instead uses the concept of RDDs (resilient distributed datasets), where data is kept in memory and is written back and saved only if the user wants it.
Another advantage of Spark is lower cost. While Hadoop MapReduce and Spark can run on the same hardware, MapReduce needs more machines than Spark in order to spread its disk I/O across them. Even though each Spark machine costs more, because Spark uses more RAM than Hadoop, the total cost falls because far fewer machines are needed. For example, Spark was used to sort 100 terabytes of data 3 times faster than Hadoop MapReduce on a tenth of the machines, which led to Spark winning the 2014 Daytona GraySort benchmark.
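The "saved only if the user wants it" idea can be illustrated with a toy class in plain Python. This is not Spark's implementation or API: LazyDataset is a hypothetical stand-in that borrows the method names map, persist and collect, and it simply remembers how to compute its data, keeping the result only when asked to.

```python
class LazyDataset:
    """Toy stand-in for an RDD: stores a recipe for the data, not the data."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that produces the data
        self._persist = False     # has the user asked to keep the result?
        self._cache = None        # kept result, filled in on first use
        self.computations = 0     # how many times the data was really computed

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def map(self, fn):
        # Nothing is computed here; only the transformation is recorded.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def persist(self):
        # Ask for the result to be kept after it is first computed.
        self._persist = True
        return self

    def collect(self):
        if self._persist and self._cache is not None:
            return self._cache
        self.computations += 1
        result = self._compute()
        if self._persist:
            self._cache = result
        return result


# Without persist(), every collect() recomputes the data from its recipe.
squares = LazyDataset.from_list([1, 2, 3]).map(lambda x: x * x)
squares.collect()
squares.collect()

# With persist(), the data is computed once and then reused.
kept = LazyDataset.from_list([1, 2, 3]).map(lambda x: x * x).persist()
kept.collect()
kept.collect()

print(squares.computations, kept.computations)
```

Because the recipe is always available, a lost result can be rebuilt by running the recipe again, which is roughly how this style of dataset stays recoverable without writing every intermediate result to storage.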
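As a rough back-of-the-envelope check on the benchmark figures above, the short Python sketch below combines the two numbers from the text (3 times faster, a tenth of the machines) into a single machine-time ratio. The function name relative_machine_time is made up for illustration. Treating machine-time as a proxy for cost is an assumption, and it ignores the fact that each Spark machine needs more RAM.

```python
def relative_machine_time(speedup, machine_fraction):
    """Machine-time used relative to the baseline system (baseline = 1.0).

    speedup: how many times faster the job finished.
    machine_fraction: fraction of the baseline's machine count that was used.
    """
    return machine_fraction / speedup


# Figures from the text: 3 times faster on a tenth of the machines.
ratio = relative_machine_time(speedup=3, machine_fraction=0.1)
print(f"Spark used about 1/{round(1 / ratio)} of the machine-time")
```

On these figures Spark needed roughly one thirtieth of the machine-time, which leaves plenty of room for each individual machine to cost more before the overall saving disappears.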
Which is better?
It is hard to say which of the two systems is better. While Spark certainly has its advantages over Hadoop, especially in speed and ease of use, it lacks certain capabilities that Hadoop provides. Ultimately, businesses would do best to use both Hadoop and Spark in their operations. To return to the quote that opened this article, Hadoop and Spark are a pair of oxen yoked together to lift the log, that is, the business's operations, and improve them to the benefit of the business.