Summary: This blog series is designed to help you understand which NOSQL Big Data database is right for you. It is addressed to business executives and managers who need a primer on how this decision should be made.
Starting a Big Data initiative is challenging. It is, after all, not trivial in terms of direct costs or the time of your brightest, most valuable employees. On the benefit side, this decision is rich with opportunity that can easily be lost, or at least impaired, by selecting the wrong solution. The good news is that there are several types of Big Data databases to pick from. The bad news is that no one type does everything best.
We promise to keep jargon and technical detail to a minimum. This is written for you folks who are going to write the check and whose careers may hinge on converting this investment into competitive advantage.
Still, this is a complex topic in a rapidly changing environment. But as Albert Einstein famously said, “Everything should be made as simple as possible, but not simpler”.
If you’re at the ‘should we at all’ stage, see our separate blog series on opportunities, strategies, and use cases. And let me say right up front, if you aren’t already at least a solid B student in the utilization of predictive analytics (predictive modeling, data visualization, and optimization) then you should fall back and start there.
The path to value from data always passes through predictive analytics. If you aren’t currently exploiting the data you have, or the data you could acquire that can reasonably be housed in your current relational database management system (RDBMS) data warehouse, start there. Big data promises incremental gains in profiting from data, but you need predictive analytics, not big data, to achieve the lion’s share of the value.
In this series, we start with the assumption that you’ve read enough to understand that the siren song of big data is strong for your organization. The question to be answered is how to select the right database application to bring this from song to a symphony of opportunities and profits.
Lesson 1: What Are Your Options? NOSQL / NewSQL
Your current data warehouse, and indeed every other database that exists in your business today, is almost certainly a relational database management system (RDBMS). Further, we say that this data is ‘structured’ as opposed to the terms ‘unstructured’ or ‘semi-structured’ that are typically associated with big data stores. We say this data is ‘structured’ specifically because when we stored it we imposed a structure on it. For example, field 1 always contains the customer’s name, field 2 the street address, field 3 the code for the most recent purchase, and so on. This structure is also known as the ‘data schema’.
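For readers who want to see what ‘imposing a structure’ looks like in practice, here is a minimal sketch using SQLite, a small relational database that ships with Python. The table and column names are hypothetical, chosen only to mirror the field 1, field 2, field 3 example above.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "name TEXT NOT NULL, "
    "street_address TEXT NOT NULL, "
    "last_purchase_code TEXT)"
)

# A row that matches the schema is accepted.
conn.execute(
    "INSERT INTO customers VALUES (?, ?, ?)",
    ("Pat Jones", "12 Main St", "SKU-1001"),
)

# A row carrying an extra, unplanned field is rejected by the database.
try:
    conn.execute(
        "INSERT INTO customers VALUES (?, ?, ?, ?)",
        ("Lee Smith", "9 Oak Ave", "SKU-2002", "@leesmith"),
    )
    rejected = False
except sqlite3.Error:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

The point for a decision maker is simply that the structure is enforced at the moment of storage: anything that does not fit the predefined fields is turned away until the schema itself is changed.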
Since at least the early 80s, structured data in RDBMS data warehouses and transactional systems has been an unparalleled benefit. Only as data has become so large in volume, so fast in its generation, or so varied in its character have these systems begun to show weakness. It is not impossible to spread an RDBMS data warehouse across multiple nodes (servers), which is known as horizontal scaling, or to increase the speed and capacity of a single machine to house more and more data (vertical scaling, with larger storage disks and faster multi-core processors to keep retrieval times reasonable), but it is difficult and expensive.
Worse, if your business or customers change in such a way that you need to significantly reorganize that data, well, now you’re in for a real project. Changing the fundamental schema of data in an RDBMS is a project no one welcomes, and one that can ripple through all the applications and reports that rely on that data. Unfortunately this type of change is increasingly common.
Big Data databases all fall into the category known as ‘NOSQL’. The name was originally intended to mean that SQL (structured query language) could not be used to retrieve and analyze the data. There are alternatives to SQL, but the shorthand ‘NOSQL’ is now ubiquitous for describing the new types of databases that are not RDBMS and hence fall into the world of big data.
These NOSQL databases are also called ‘schema-less’ since it is not necessary to create a logical storage ‘structure’ before storing the data. Not to split this hair too finely, it is more correct to say that they are ‘late-schemaed’ or ‘poly-schemaed’ since there is a schema but it’s not fixed and doesn’t have to be determined in advance. This allows maximum flexibility later on without requiring a change to any predefined schema.
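A small sketch may make ‘late-schemaed’ more concrete. It uses plain Python and JSON text rather than any particular NOSQL product, and the records, field names, and the `read_with_schema` helper are all hypothetical.

```python
import json

# Hypothetical records; each one carries whatever fields it happens to have.
raw_records = [
    '{"name": "Pat Jones", "street": "12 Main St"}',
    '{"name": "Lee Smith", "twitter": "@leesmith", "tags": ["vip"]}',
    '{"name": "Ana Ruiz", "street": "4 Elm Rd", "loyalty_points": 120}',
]

# Storing requires no predefined structure: we simply keep the documents.
store = [json.loads(r) for r in raw_records]

def read_with_schema(record, fields):
    """Apply a schema at read time; fields a record lacks come back as None."""
    return {f: record.get(f) for f in fields}

# Today's question needs names and streets...
view_a = [read_with_schema(r, ["name", "street"]) for r in store]
# ...tomorrow's needs loyalty points, with no change to the stored data.
view_b = [read_with_schema(r, ["name", "loyalty_points"]) for r in store]
```

Notice that the structure is chosen by whoever reads the data, at the time they read it. Nothing that was already stored had to be reorganized to answer the second question.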
You are probably aware that NOSQL databases are quite new. The first major big data platform, Hadoop, grew out of research papers that Google published in 2003 and 2004 and was built as an open source project; Yahoo ran one of its first large production deployments in 2008. Hadoop is so widely used and written about that it is easy to confuse Big Data with Hadoop. In fact it is only one of four flavors of NOSQL databases that you can choose from.
What all of these have in common is their ability to distribute processing across a large number of nodes (servers), and that these servers can be common utility machines, much less expensive than high reliability servers. This horizontal distribution of processing, also known as massively parallel processing (MPP), allows much faster storage and retrieval of data than RDBMS systems when the data to be stored is unstructured, semi-structured, or even mixed with structured data.
Here are your four choices. These vary in the way they store and retrieve data, which makes each type more or less efficient at particular types of tasks. Unfortunately there is no one database type that is universally good at everything.
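The idea of spreading work across many inexpensive nodes can be sketched in a few lines. This is an illustration under stated assumptions, not how any specific product does it: the four ‘nodes’ are just Python lists, threads stand in for separate servers, and the record keys and amounts are made up.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

NODE_COUNT = 4  # hypothetical cluster size

def node_for(key: str) -> int:
    """Pick a node by hashing the key, so the same key always lands on the same node."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODE_COUNT

# Spread 1,000 hypothetical records across the nodes.
nodes = [[] for _ in range(NODE_COUNT)]
for i in range(1000):
    key = f"customer-{i}"
    nodes[node_for(key)].append((key, i % 50))  # (key, purchase amount)

def local_total(records):
    """The work each node does on its own slice of the data."""
    return sum(amount for _, amount in records)

# Each node computes a partial answer; the partials are then combined.
with ThreadPoolExecutor(max_workers=NODE_COUNT) as pool:
    partials = list(pool.map(local_total, nodes))
grand_total = sum(partials)
```

Adding capacity in this model means adding another node and re-spreading the keys, rather than buying a bigger single machine.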
- Key Value Stores (e.g. Hadoop 2.0): Similar to dictionaries in that each element of data is assigned a unique ‘key’ for later identifying its location. The Hadoop File System (HDFS) is the most common core of these systems and excels at write-once-read-many applications. KVs allow the storage of data without a pre-existing schema and handle complex, unstructured, and semi-structured data well, with fast storage and retrieval. This type of data encompasses most text, semi-structured text, social media, web logs, and the dominant types of business-oriented data. As a consequence many early Big Data projects defaulted to Key-Value Stores like Hadoop.
- Document Oriented Databases: Second in popularity in the business world behind Key-Value Stores are Document Oriented Databases. Here an entire document is treated as a record. While these can accommodate completely unstructured text, they excel at semi-structured text, that is, text that has been encoded according to a known format such as XML, YAML, JSON, PDF, email, or even MS Office. Search can be further facilitated by adding metadata or keys, and several query languages exist depending on the specific flavor you are using. Document Oriented Databases excel at tasks like patent search, litigation support, legal precedent search, search of scientific papers and experimental data, email compliance searches, or simply retrieving knowledge on a particular topic hidden among a forest of internally or externally prepared reports and documents. They are particularly good at integrating different data sources that may reside on incompatible DB types, and they do well at OLTP (online transaction processing).
- Column Oriented Databases: Also known as ‘big table’, ‘extensible record’, or ‘wide column’ stores. These excel at storing and particularly updating single records (such as your customer’s transaction history) and allowing fast retrieval of small amounts of data. CODBs may be selected when the application focuses on calculating metrics from a particular set of columns or when updating tends to be of columnar data. CODBs are inefficient at analyzing across rows or writing new rows but excel at modifying or updating existing records. They do well at OLAP (online analytic processing).
- Graph Databases: These specialize in the efficient management of heavily linked data such as the relationships among large groups of people (think Facebook). These highly specialized databases use an ‘associative model’ of data for which there is no specific index, are not record based, and associate nodes and edges for relationships. They are excellent for graph algorithms that may be found in image storage and for ‘semantic web’ applications. The semantic web gives meaning to data by defining relationships and properties, or ontologies. Specialized languages such as SPARQL are used for querying. Note that ‘Graph’ typically incorporates the category of ‘Object Databases’ but some folks think these should be separate. We’ll stick with the basic four. Fans of Graph Databases promote them for much broader general use, and this is possible, but requires a major revision in thinking about the way elements of data are related.
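To make the four choices above a little more tangible, here is a toy sketch of how the same kind of business data might be shaped under each model. Plain Python structures stand in for real products, and every name and value is hypothetical.

```python
# Hypothetical data throughout; plain Python structures stand in for real products.

# 1. Key-value: an opaque value looked up by a unique key.
kv_store = {"cust:1001": b"serialized customer record"}
record = kv_store["cust:1001"]

# 2. Document: a whole semi-structured record is stored and searched by its fields.
documents = [
    {"_id": 1, "type": "patent", "title": "Widget", "claims": ["a", "b"]},
    {"_id": 2, "type": "email", "subject": "Widget pricing"},
]
patents = [d for d in documents if d.get("type") == "patent"]

# 3. Column-oriented: values are grouped by column, so a metric can be
#    computed by reading only the column it needs.
columns = {
    "customer": ["Pat", "Lee", "Ana"],
    "amount": [20.0, 35.5, 12.5],
}
total_amount = sum(columns["amount"])

# 4. Graph: nodes plus edges that record relationships between them.
friends = {"Pat": {"Lee"}, "Lee": {"Pat", "Ana"}, "Ana": {"Lee"}}
friends_of_friends = (
    {fof for f in friends["Pat"] for fof in friends[f]} - {"Pat"} - friends["Pat"]
)
```

Each shape makes one kind of question cheap: lookup by key, search by field, a metric over one column, or a walk along relationships. That is the practical reason no single type is best at everything.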
Increasingly you will read about NOSQL interpreted as ‘not only SQL’ and even more about NewSQL. Don’t be confused. There are still only the four types of databases above to choose from. Because companies are already heavily invested in RDBMS and by extension in SQL, and because literally all our current IT staffs are grounded in SQL skills, the major NOSQL database providers have all moved rapidly to create interfaces that allow SQL to be used in NOSQL databases.
The most common native data retrieval mechanism in NOSQL is called ‘MapReduce’. Essentially all NOSQL providers now offer an interface that translates SQL queries into MapReduce, allowing our IT experts to keep right on working with a minimum of retraining. NewSQL providers started with SQL as their native query language.
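For the curious, the shape of a MapReduce job can be shown in miniature. This is a single-machine sketch in plain Python, not Hadoop's actual API; the log lines and the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical.

```python
from collections import defaultdict

# Hypothetical web-log lines in the form "<customer> <page>".
log_lines = [
    "pat /home", "lee /pricing", "pat /pricing", "ana /home", "pat /checkout",
]

def map_phase(line):
    """Map: emit a (key, 1) pair for each record."""
    customer, _page = line.split()
    yield customer, 1

def shuffle(pairs):
    """Group all values by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all the values collected for one key."""
    return key, sum(values)

mapped = [pair for line in log_lines for pair in map_phase(line)]
visits = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# Roughly the question a SQL user would phrase as:
#   SELECT customer, COUNT(*) FROM logs GROUP BY customer
```

In a real cluster the map step runs on many nodes at once, each over its own slice of the data, which is why a SQL-to-MapReduce translation layer lets familiar queries run against very large stores.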
This is not to say that the only difference between NOSQL and NewSQL is the ability to use SQL-like queries. NewSQL database developers are hewing closer to the ACID (atomicity, consistency, isolation, durability) guarantees that RDBMS provide, allowing NewSQL databases to be used more directly as replacements. This promises lower cost, higher availability, and easier scaling while retaining some critical elements such as immediate consistency. For more on immediate consistency see Lesson 2.
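What an ACID guarantee buys you is easiest to see with a money transfer. The sketch below uses SQLite from Python's standard library as a stand-in for any ACID-compliant database; it is not a NewSQL product, and the `accounts` table and `transfer` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 0)")
conn.commit()

def transfer(amount):
    """Move money from A to B; either both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'B'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'A'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok_small = transfer(30)    # succeeds: A has enough money
ok_large = transfer(500)   # fails: A would go negative, so B's credit is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The failed transfer leaves no trace: B is never credited with money that A could not pay. That all-or-nothing behavior is the kind of guarantee transactional systems depend on.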
It’s also important to note that no one is suggesting that RDBMS / SQL databases are going away. In fact, before you assume you will need a NOSQL DB, it’s worth revisiting whether your project might not still work in your familiar RDBMS structure. Of the three “Vs” of Big Data (Volume, Variety, and Velocity), volume is the characteristic most likely to mislead. A popular graphic makes the point that 90% of web based companies will likely never rise to the volume levels requiring NOSQL.
On the other hand, the most common Big Data projects today involve mixing data from sources that are not compatible. When there is variety, or sufficient velocity, or the complexity of the relationships among your data elements exceeds the ability of your RDBMS to return results as fast as needed, then there’s a NOSQL DB for you.
RDBMS/SQL will continue to dominate transactional systems and structured data warehouses. Bringing all that new and traditional data together on a common platform so that it can be analyzed together is the real payoff of NOSQL/NewSQL. All of the major providers (Oracle, IBM, Microsoft, SAP, and others) are working to provide environments where RDBMS and NOSQL databases work easily together in a common architecture to bring more and more data to bear on your business problems.
And always remember, things are changing fast.
Watch for the next 8 lessons to be posted shortly.
July 23, 2014
Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2014, all rights reserved.
About the author: Bill Vorhies is President & Chief Data Scientist of Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at:
This original blog can be viewed at:
All nine lessons can be downloaded as a White Paper at:
Originally posted on Data Science Central