Guest blog post by Bernard Marr
Hadoop – the software framework which provides the necessary tools to carry out Big Data analysis – is widely used in industry and commerce for many Big Data related tasks.
It is open source, essentially meaning that it is free for anyone to use for any purpose, and can be modified for any use. While designed to be user-friendly, in its “raw” state it still needs considerable specialist knowledge to set up and run.
Because of this, a large number of commercial versions have come onto the market in recent years, as vendors have created their own versions designed to be more easily used, or supplied alongside consultancy services to get you crunching through your data in no time.
These days, this is often provided in the form of “Hadoop-as-a-service” – all of the installation actually takes place within the vendor’s own cloud, with customers paying a subscription to access the services.
Here’s a run-down, in no particular order, of 10 of the most popular or interesting commercial Hadoop platforms on the market today.
Cloudera was one of the first commercial Hadoop offerings and is still the most popular, reportedly with more installations running than any of its competitors. Cloudera also contributes Impala, which brings real-time, massively parallel processing of Big Data to Hadoop.
Open source Big Data frameworks may not be the first thing that springs to mind when you think of Amazon, but the retailer was another one of the first to offer Hadoop in the cloud as part of its Amazon Web Services package. AWS is a hosted solution integrating Hadoop with Amazon’s Elastic Compute Cloud (EC2) and Simple Storage Service (S3) cloud-based data processing and storage services.
Of the vendors listed here, Hortonworks is one of the few which offer 100% open source Hadoop technology without any proprietary (non-open) modifications. They were also the first to integrate support for Apache HCatalog, which creates “metadata” – data about data – simplifying the process of sharing your data across other layers of service such as Apache Hive or Pig.
MapR uses some differing concepts, such as native support for UNIX file systems rather than HDFS, meaning it will be more familiar to DBAs used to working in a UNIX environment. MapR Technologies is also spearheading development of the Apache Drill project, which provides advanced tools for interactive real-time querying of big datasets.
It might be a relative newcomer to the Hadoop ecosystem, but IBM has deep roots in the computing industry, particularly in distributed computing and data management. Its BigInsights package adds its proprietary analytics and visualization algorithms to the core Hadoop infrastructure.
Engineered to run on Microsoft’s Azure cloud platform, Microsoft’s Hadoop package is based on Hortonworks’, and has the distinction of being the only big commercial Hadoop offering which runs in a Windows environment.
Intel is another giant of the tech world which has recently turned its attention towards Hadoop. Its distribution adds the company’s Graph Builder and Analytics Toolkit functions to Hadoop, and Intel claims that security updates to the infrastructure mean that its solution offers added security for your data.
DataStax offers its own distribution of the Apache Cassandra database management system on top of its Hadoop installation. It also includes custom proprietary systems to handle security, search, dashboard and visualization. Customers include Netflix, where it powers the recommendation engine by analyzing over 10 million data points every second!
Teradata offer hardware and software for implementing Big Data solutions, as well as their own Hadoop package, which is also based on the Hortonworks distribution. Proprietary technology supplied alongside the open source components includes their QueryGrid analytics engine and Viewpoint dashboard.
Pivotal was formed as a joint venture between storage system provider EMC and virtualization specialists VMware. Pivotal HD (Hadoop Distribution) forms part of the company’s Big Data Suite, which also includes the Greenplum database and the in-memory data grid GemFire. Customers include China’s national rail operator, China Railway – sorting out the logistics for rail journeys for 3.5 billion passengers certainly qualifies as Big Data!
Which others would you add to my list? What are your views on any of these? Please share in the comments below.