Guest blog post by Khushbu Shah.
Hadoop has continued to grow and develop ever since it was introduced in the market 10 years ago. Every new release and abstraction on Hadoop is used to improve one or the other drawback in data processing, storage and analysis. Apache Hive was introduced by Facebook to manage and process the large datasets in the distributed storage in Hadoop. Apache Hive is an abstraction on Hadoop MapReduce and has its own SQL like language HiveQL. Cloudera Impala was developed to resolve the limitations posed by low interaction of Hadoop Sql. Cloudera Impala provides low latency high performance SQL like queries to process and analyze data with only one condition that the data be stored on Hadoop clusters.
Data explosion in the past decade has not disappointed big data enthusiasts one bit. It has thrown up a number of challenges and created new industries which require continuous improvements and innovations in the way we leverage technology.
Big Data keeps getting bigger. It continues to pressurize existing data querying, processing and analytic platforms to improve their capabilities without compromising on the quality and speed. A number of comparisons have been drawn and they often present contrasting results. Cloudera Impala and Apache Hive are being discussed as two fierce competitors vying for acceptance in database querying space. While Hadoop has clearly emerged as the favorite data warehousing tool, the Cloudera Impala vs Hive debate refuses to settle down.
We try to dive deeper into the capabilities of Impala and Hive to see if there is a clear winner or are these two champions in their own rights on different turfs. We begin by prodding each of these individually before getting into a head to head comparison.
Step aside, the SQL engines claiming to do parallel processing! Impala’s open source Massively Parallel Processing (MPP) SQL engine is here, armed with all the power to push you aside. The only condition it needs is data be stored in a cluster of computers running Apache Hadoop, which, given Hadoop’s dominance in data warehousing, isn’t uncommon. Cloudera Impala was announced on the world stage in October 2012 and after a successful beta run, was made available to the general public in May 2013.
Cloudera Impala is an excellent choice for programmers for running queries on HDFS and Apache HBase as it doesn’t require data to be moved or transformed prior to processing. Cloudera Impala easily integrates with Hadoop ecosystem, as its file and data formats, metadata, security and resource management frameworks are same as those used by MapReduce, Apache Hive, Apache Pig and other Hadoop software. It is architected specifically to assimilate the strengths of Hadoop and the familiarity of SQL support and multi user performance of traditional database. Its unified resource management across frameworks has made it the de facto standard for open source interactive business intelligence tasks.
Cloudera Impala has the following two technologies that give other processing languages a run for their money:
Data is stored in columnar fashion which achieves high compression ratio and efficient scanning.
This is fundamental to attaining a massively parallel distributed multi – level serving tree for pushing down a query to the tree and then aggregating the results from the leaves.
Impala massively improves on the performance parameters as it eliminates the need to migrate huge data sets to dedicated processing systems or convert data formats prior to analysis. Salient features of Impala include:
- Hadoop Distributed File System (HDFS) and Apache HBase storage support
- Recognizes Hadoop file formats, text, LZO, SequenceFile, Avro, RCFile and Parquet
- Supports Hadoop Security (Kerberos authentication)
- Fine – grained, role-based authorization with Apache Sentry
- Can easily read metadata, ODBC driver and SQL syntax from Apache Hive
Impala’s rise within a short span of little over 2 years can be gauged from the fact that Amazon Web Services and MapR have both added support for it.
Initially developed by Facebook, Apache Hive is a data warehouse infrastructure build over Hadoop platform for performing data intensive tasks such as querying, analysis, processing and visualization. Apache Hive is versatile in its usage as it supports analysis of huge datasets stored in Hadoop’s HDFS and other compatible file systems such as Amazon S3. To keep the traditional database query designers interested, it provides an SQL – like language (HiveQL) with schema on read and transparently converts queries to MapReduce, Apache Tez and Spark jobs. Other features of Hive include:
- Indexing for accelerated processing
- Support for different storage types such as plain text, RCFile, HBase, ORC and others
- Metadata storage in RDBMS, bringing down time to perform semantic checks during query execution
- Has SQL like queries that get implicitly converted into MapReduce, Tez or Spark jobs
- Familiar built in user defined functions (UDFs) to manipulate strings, dates and other data – mining tools.
If you are looking for an advanced analytics language which would allow you to leverage your familiarity with SQL (without writing MapReduce jobs separately) then Apache Hive is definitely the way to go. HiveQL queries anyway get converted into a corresponding MapReduce job which executes on the cluster and gives you the final output. Hive (and its underlying SQL like language HiveQL) does have its limitations though and if you have a really fine grained, complex processing requirements at hand you would definitely want to take a look at MapReduce.
Impala vs Hive – 4 Differences between the Hadoop SQL Components
Impala has been shown to have performance lead over Hive by benchmarks of both Cloudera (Impala’s vendor) and AMPLab. Benchmarks have been observed to be notorious about biasing due to minor software tricks and hardware settings. However, it is worthwhile to take a deeper look at this constantly observed difference. The following reasons come to the fore as possible causes:
- Cloudera Impala being a native query language, avoids startup overhead which is commonly seen in MapReduce/Tez based jobs (MapReduce programs take time before all nodes are running at full capacity). In Hive, every query has this problem of “cold start” whereas Impala daemon processes are started at boot time itself, always being ready to process a query.
- Hadoop reuses JVM instances to reduce startup overhead partially but introduces another problem when large haps are in use. Cloudera benchmark have 384 GB memory which is a big challenge for the garbage collector of the reused JVM instances.
- MapReduce materializes all intermediate results, which enables better scalability and fault tolerance (while slowing down data processing). Impala streams intermediate results between executors (trading off scalability).
- Hive generates query expressions at compile time whereas Impala does runtime code generation for “big loops”.
Image Credit : csdn.net
The above graph demonstrates that Cloudera Impala is 6 to 69 times faster than Apache Hive.To conclude, Impala does have a number of performance related advantages over Hive but it also depends upon the kind of task at hand. That being said, Jamie Thomson has found some really interesting results through dumb querying published on sqlblog.com, especially in terms of execution time. For all its performance related advantages Impala does have few serious issues to consider. Being written in C/C++, it will not understand every format, especially those written in java. If you are starting something fresh then Cloudera Impala would be the way to go but when you have to take up an upgradation project where compatibility becomes as important a factor as (or may be more important than) speed, Apache Hive would nudge ahead.
In practical terms, Apache Hive and Cloudera Impala need not necessarily be competitors. As both have a MapReduce foundation for executing queries, there can be scenarios where you are able to use them together and get the best of both worlds – compatibility and performance. Hive is the more universal, versatile and pluggable language. Once data integration and storage has been done, Cloudera Impala can be called upon to unleash its brute processing power and give lightning fast analytic results.
Original article here.