Hadoop is the hot new technology and SQL is the old, tried and tested tool for diving deep into big data, for analysis. This is true, but the number of projects that are putting an SQL front end on Hadoop data stores shows that there is a real need for data querying high level languages in the Hadoop environment. Hadoop MapReduce being a complicated tool for data analysis, developers had come up with Pig and Hive – similar to SQL, which makes it easy to implement Hadoop, without the need for coding in Java, to analyze data. It is important to understand how different these are from each other – this is so that each can be optimally utilized for the right use case.
In the present age of Big Data, a number of querying options are available. While the old giant SQL continues to rein supremacy, organizations’ affinity towards open source programming and querying languages to tame Big Data has created plenty of space for Apache based Pig and Hive. Choosing the right weapon is often half the war won. So, choosing the right platform and language would go a long way in giving you complete control in data extraction, processing and analytics. There is a growing belief that as big data gets bigger; it also needs to get easier. Requirement for faster and easier processing of Big Data is driving the demand for it to get more mainstream.
Talking about Big Data, Apache Pig, Apache Hive and SQL are major options that exist today. All of them have their own advantages in specific situations. Given that the Pig vs Hive, Pig vs SQL and Hive vs SQL debates are never ending, there is hardly a consensus on which is the one-size-fits-all language. Through this article, we present few tips that would help you in choosing the best option to suit a given situation. Before we get into comparisons, we briefly introduce each of them.
Structured Query Language (SQL) has been a programmer’s companion for decades. It was the de-facto solution for extracting data for further processing. Big Data has changed how we visualize and process data. SQL’s demand of storing data in a strict relational database schemas and its declarative nature often deflects focus from the ultimate purpose – to extract data for analysis. For all its popularity, advent of Big Data, challenged SQL’s ability and performance.
SQL programmers required languages that were relatively easy to learn for someone having SQL background and at the same time was –
- Free of SQL’s excess baggage mentioned above and
- Could easily handle large data sets.
Originally developed at Yahoo Research in 2006, Pig addressed all these issues and provided better optimization scope and extensibility. Apache Pig also allows developers to follow multiple query approach, which reduces the data scan iterations. It has provisions for a number of nested data types (Maps, Tuples and Bags) and commonly used data operations such as Filters, Ordering and Joins. These advantages have seen Pig being adopted by a large number of users around the globe. Its simplicity has resulted in Yahoo and Twitter resorting to Pig for the majority of their MapReduce operations.
For all its processing power, Pig requires programmers to learn something on top of SQL. It requires learning and mastering something new. Hive statements are remarkably similar to SQL and despite the limitations of Hive Query Language (HQL) in terms of the commands that it understands, it is still very useful. Hive provides an excellent open source implementation of MapReduce. It works well when it comes to processing data stored in a distributed manner, unlike SQL which requires strict adherence to schemas while storing data.
Out of the three approaches to data extracting, processing and analysis, there is no one-size-fits-all approach. A number of factors such as data storage approach, programming language architecture and expected results should be given due consideration before making the choice. In the following section we compare Pig, Hive and SQL – two at a time – to understand when to use what.
Pig vs SQL
The DBMS systems that SQL operates on, are considered to be faster than MapReduce (operated on by Pig through the PigLatin platform). However, it is the loading of data that is more challenging in case of RDBMS, making the set up difficult. PigLatin offers a number of advantages in terms of declaring execution plans, ETL routines and pipeline modification.
SQL is declarative and PigLatin is procedural to a large extent. What we mean by this is in SQL, we largely specify “what” is to be accomplished and in Pig, we mention “how” a task is to be performed. A script written in Pig is essentially converted to a MapReduce job in the background before it is executed. A Pig script is shorter than the corresponding MapReduce job, which significantly cuts down development time.
Hive vs SQL
SQL is a general purpose database language that has extensively been used for both transactional and analytical queries. Hive, on the other hand, is built with an analytical focus. What this means is Hive lacks update and delete functions but is superfast in reading and processing huge volumes of data faster than SQL. Hence, even though Hive SQL is SQL-like, lack of support for modifying or deleting data is a major difference.
Despite the working differences, once you enter the Hive world from SQL, similarity in language ensures smooth transition but it is important to note the differences in constructs and syntax, else you’re in for frustrating times.
Now that we have an introduction to all the three data mining languages and their head to head comparisons, we close this article by analyzing different situations and listing the best fit option for each.
When is it best to use Apache Pig?
When you are looking to process clusters of unorganized, unstructured, decentralized data and don’t want to deviate too much from your solid SQL foundation, Pig is the option to go with. You no longer need to get into writing core MapReduce jobs. If you already have SQL background, the learning curve will be smooth and development time will be faster.
When is it best to use Apache Hive?
Big Data enterprises require fast analysis of data collected over a period of time. Hive is an excellent tool for analytical querying of historical data. It is to be noted that the data needs to be well organized, which would allow Hive to fully unleash its processing and analytical prowess. Querying real time data with Hive might not be the best idea, as it would be a time consuming job, defeating the original goal of fast processing (HBase is the answer for real time analytics, successfully utilized by Facebook).
When is it best to use SQL?
SQL is the oldest data analysis option among the three and its ability to update itself in line with growing user expectations make it relevant even today. While it is definitely a better option than excel (or any other open source spreadsheet application) for serious data analysts, it comes short when business demands fast paced processing and analysis. However, when the requirements are not too demanding, SQL is an excellent tool. Its familiarity and flexibility find favor with developers. Given that a large portion of developer community around the globe is familiar with SQL, its usage makes them productive from day one. It provides developers the facility to extend and optimize it, which make it highly customizable.
Now that it is evident that not all kinds of data require the same kind of querying tool, it will be easier to pick which one to use depending on the use cases.