Before we get to the specifics of your database needs, there’s an additional factor you’ll want to keep in mind as you move forward. Do you want an open source version, a distribution version, or a full suite solution? All four flavors of NOSQL database can be had in each of these versions, and there are pros and cons to each. What is open source versus distribution versus suite? Here are the basics.
Like Russian nesting dolls, each of these database application types fits completely inside the next.
Open Source: These are free for the user to download, install, and use under open source licenses. For example, Hadoop is open source from the Apache Software Foundation. All the necessary components are there. However, it is up to the user to have the expertise to configure, maintain, and update the software, and no support is offered. Many users choose this path, but others regard open source as free in the sense of a free puppy.
Distribution: A number of commercial companies add features such as easy installation, learning resources, updating services, and deployment and tooling support. Essentially all of the majors including Oracle, SAP, IBM, and Microsoft offer distribution versions for a fee. Many users think this level of professional support and maintenance is a smart investment.
Suites: Over and above both open source and distributions are suites. These ensembles may support different versions of the open source database (Hadoop for example) and include an integrated development environment, greatly simplified modeling such as code to generate complex MapReduce queries, scheduling functionality, and integration with other database types to allow the easy merging of information. Some also include predictive modeling, visualization, or other analytic tools with enhanced data extract and cleansing features and even graphical development interfaces.
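To give a sense of the kind of code a suite tries to generate for you, here is a minimal sketch of the map, shuffle, and reduce pattern in plain Python. It is an illustration only: it does not use the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the list of values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big database"]))
```

Even this toy word count needs three separate steps. Real jobs that join or aggregate several data sets chain many such steps together, which is the work the modeling tools in a suite aim to take off your hands.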
When to Use Open Source: If you are an IT technical person seeking to learn and experiment with all the functionalities of Hadoop or one of the open source variants, and are willing to keep up with the updates, perform the initial configuration, and do without support, then this might be an option. After all, it’s free. It’s difficult to imagine a corporation making this decision for mission-critical use, except perhaps in a very early prototyping or learning situation. Another exception might be a custom application built for a specific purpose, if you know your professional staff is fully equipped and experienced.
When to Use Distribution or a Suite: Aside from cost, the main consideration here is how useful the additional features of a suite might be for the mission at hand. Increasingly the answer is that they can be extremely valuable. However, if you have the time and inclination, it is also possible to put together the full capabilities of a suite from the separate products of different vendors in a best-of-breed assembly. An example might be to use a particular Hadoop distribution paired with the predictive analytics in SAS or Alteryx.
Another factor is that while distributions tend to be based on a single and sometimes proprietary version of the open source code, suites may support multiple versions. Depending on your circumstances, there can be an advantage to being able to work with more than one type, especially if the additional tools and capabilities allow you to work across several different types of NOSQL databases (say column oriented for analytics and document based for transactional data). As you will discover, you may need more than one.
An additional consideration is whether you are already committed to a single vendor’s software and hardware stack. In these instances there may be real value in sticking with your vendor. For example, EMC’s Pivotal HD has been fused with Greenplum’s analytic database to offer real SQL queries and very good performance on top of Hadoop. Similarly, Intel has optimized its Intel Distribution for Apache Hadoop for solid-state drives, something that other Hadoop companies haven’t done so far.
The Fourth Alternative: Before we leave this discussion it’s worth mentioning that it may not be necessary to buy any of these. There are a variety of hosted solutions that let you rent instead of buy. This can be particularly sound during proof-of-concept. Amazon Elastic MapReduce (EMR) is one of the best known, and runs on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Aside from proof-of-concept, this is a good method of cost control for applications that only occasionally require more processing capacity than you would find prudent to purchase. It is primarily for analytic applications (though MapR is widely used for web recommendation engines), as it has a somewhat higher latency (user waiting time) than you might experience using data located on your own nodes. It is a fully configured solution maintained by Amazon, requiring only a small amount of effort to get up and running.
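The rent-versus-buy reasoning above comes down to simple arithmetic, sketched below in Python. Every figure is a hypothetical placeholder chosen for illustration (the hourly rate, purchase price, hardware lifetime, and upkeep cost are assumptions, not Amazon pricing or vendor quotes), so substitute your own numbers before drawing conclusions.

```python
def monthly_rent_cost(hours_per_month, nodes, rate_per_node_hour):
    """Cost of renting a cluster only for the hours it is actually used."""
    return hours_per_month * nodes * rate_per_node_hour


def monthly_own_cost(nodes, price_per_node, lifetime_months, upkeep_per_node_month):
    """Cost of owning a cluster: purchase price spread over its life, plus upkeep."""
    return nodes * (price_per_node / lifetime_months + upkeep_per_node_month)


def break_even_hours(rate_per_node_hour, price_per_node, lifetime_months,
                     upkeep_per_node_month):
    """Monthly usage per node above which owning becomes cheaper than renting."""
    own_per_node_month = price_per_node / lifetime_months + upkeep_per_node_month
    return own_per_node_month / rate_per_node_hour


if __name__ == "__main__":
    # Hypothetical inputs: 10 nodes, $0.50 per node-hour rented,
    # $6,000 per node purchased, 36-month life, $100 per node-month upkeep.
    rent = monthly_rent_cost(hours_per_month=40, nodes=10, rate_per_node_hour=0.50)
    own = monthly_own_cost(nodes=10, price_per_node=6000, lifetime_months=36,
                           upkeep_per_node_month=100)
    hours = break_even_hours(0.50, 6000, 36, 100)
    print(f"rent: ${rent:,.2f}/month, own: ${own:,.2f}/month")
    print(f"break-even at about {hours:.0f} hours of use per month")
```

Under these assumed figures, a workload that runs only 40 hours a month is far cheaper to rent, while one that runs most of the month crosses the break-even point and favors purchased hardware.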
Stick with this series to learn more about how to select the right Big Data database. As of this writing there are about 150 to choose from. You can see an extensive listing at http://nosql-database.org/.
July 23, 2014
Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2014, all rights reserved.
About the author: Bill Vorhies is President & Chief Data Scientist of Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at:
This original blog can be viewed at:
All nine lessons can be downloaded as a White Paper at:
Originally posted on Data Science Central