Guest blog post by Michael Walker
The Hadoop stack includes more than a dozen components, or subprojects, that are complex to deploy and manage. Installation, configuration and production deployment at scale are challenging.
The main components include:
- Hadoop. Java software framework to support data-intensive distributed applications
- ZooKeeper. A highly reliable distributed coordination system
- MapReduce. A flexible parallel data processing framework for large data sets
- HDFS. Hadoop Distributed File System
- Oozie. A MapReduce job scheduler
- HBase. A column-oriented key-value database that runs on top of HDFS
- Hive. A data warehouse layer with a SQL-like query language (HiveQL), built on top of MapReduce for analyzing large data sets
- Pig. Enables the analysis of large data sets using Pig Latin, a high-level language that is compiled into MapReduce jobs for parallel data processing
The range of applications that use Hadoop shows the versatility of the MapReduce approach, and reviewing them reveals some typical characteristics of problems suited to it:
- Massive data volumes;
- Little or no data dependence;
- Both structured and unstructured data;
- Amenability to massive parallelism;
- Limited communication requirements.
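To make the last two characteristics concrete, here is a minimal single-process sketch in Python of the map, shuffle, and reduce phases, using word counting as the workload. It is an illustration only, not Hadoop's API: the function names `map_words`, `shuffle`, `reduce_counts`, and `word_count` are hypothetical, and a real cluster would run the map and reduce calls on separate machines.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line.

    Each line is handled on its own, so there is no data dependence
    between map calls and they could run in parallel.
    """
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group values by key.

    In this sketch it is the only point where data from different
    map calls is brought together.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence inside one process."""
    mapped = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data moves"]))
```

Because each map call sees only its own line and each reduce call sees only its own key, the work splits cleanly across machines, and communication is confined to the shuffle.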
Good examples that display some or all of these characteristics include:
• Applications that boil lots of data down into ordered or aggregated results – sorting, word and phrase counts, building inverted indices that map phrases to documents, and phrase searching among large document corpora.
• Batch analyses fast enough to satisfy the needs of operational and reporting applications, such as web traffic statistics or product recommendation analysis.
• Iterative analysis using data mining and machine learning algorithms, such as association rule analysis, k-means clustering, link analysis, classification, and Naïve Bayes analysis.
• Statistical analysis and reduction, such as web log analysis or data profiling.
• Behavioral analyses, such as clickstream analysis, discovery of content-distribution networks, and the viewing behavior of video audiences.
• Transformations and enhancements, such as auto-tagging social media, ETL processing, and data standardization.
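As one illustration of the first example in this list, an inverted index can be sketched in a few lines of Python. This is a single-process approximation under simplifying assumptions, not Hadoop code: it indexes single words rather than phrases, and the names `map_document` and `build_inverted_index` are hypothetical.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (word, doc_id) pair per distinct word in a document."""
    distinct_words = {word.lower() for word in text.split()}
    return [(word, doc_id) for word in distinct_words]


def build_inverted_index(documents):
    """Group the emitted pairs by word and return word -> sorted document IDs.

    `documents` is a dict mapping a document ID to its text.
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word, doc in map_document(doc_id, text):
            postings[word].add(doc)
    # Reduce step: turn each set of document IDs into a sorted list.
    return {word: sorted(docs) for word, docs in postings.items()}


if __name__ == "__main__":
    docs = {"d1": "Hadoop stores data", "d2": "Pig analyzes data"}
    index = build_inverted_index(docs)
    print(index["data"])
```

Each document is mapped independently of the others, which is what makes this kind of job a natural fit for the MapReduce approach.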