Summary: These are the features common to most NOSQL databases. Be on the lookout for any fundamental differences.
Before we get to the specific pros and cons of the four NOSQL database types, there are some features and capabilities true of most of them that you should know. We underline ‘most’ since these may not be universally true of all offerings. You should be on the lookout for differences at this fundamental level, since these are basic and important characteristics that can greatly impact the success of your project.
In this lesson we will also introduce some technical terms and issues. We continue with the assumption that the reader is not an information technology manager and will try to keep these basic. Some of our technologist readers may complain these are too basic. We harken back to the Einstein quote in lesson one: “Everything should be made as simple as possible, but not simpler.”
Replication and Distributed Processing: The most important feature of NOSQL databases is their ability to break up the storage and retrieval of information and distribute it across many low-cost servers, in some cases even across data centers that are physically distant from one another. At the same time, they automatically store duplicates of the information (replicas) on several different servers as protection against failure. Although something similar is possible with RDBMS using SANs (storage area networks), doing so is both complex and labor intensive for your DBAs (database administrators). This one automated feature, true of essentially all NOSQL databases, supports high availability and high reliability (disaster recovery) and dramatically reduces the need for DBAs, which in turn reduces the associated labor overhead.
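For our technologist readers, the replication idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular NOSQL product works: the `TinyCluster` class, the `replica_servers` helper, the server names, and the choice of three copies are all hypothetical and exist only for illustration.

```python
import hashlib


def replica_servers(key, servers, copies=3):
    """Pick `copies` distinct servers for a key: a primary plus its replicas."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(servers)
    count = min(copies, len(servers))
    # Walk around the server list so the copies land on different machines.
    return [servers[(start + i) % len(servers)] for i in range(count)]


class TinyCluster:
    """Hypothetical in-memory stand-in for a replicated data store."""

    def __init__(self, servers, copies=3):
        self.servers = list(servers)
        self.copies = copies
        self.storage = {name: {} for name in self.servers}
        self.down = set()  # names of servers that have "failed"

    def put(self, key, value):
        # Write the value to every healthy server chosen for this key.
        for name in replica_servers(key, self.servers, self.copies):
            if name not in self.down:
                self.storage[name][key] = value

    def get(self, key):
        # Read from the first healthy server that holds a copy.
        for name in replica_servers(key, self.servers, self.copies):
            if name not in self.down and key in self.storage[name]:
                return self.storage[name][key]
        raise KeyError(key)


cluster = TinyCluster(["server-a", "server-b", "server-c", "server-d"])
cluster.put("customer:42", {"name": "Ada"})

# Simulate losing the primary server; a replica still answers the read.
primary = replica_servers("customer:42", cluster.servers)[0]
cluster.down.add(primary)
print(cluster.get("customer:42"))
```

The point of the sketch is only that the application asks for a key and never needs to know which copy answered. Real systems add a great deal more, such as repairing replicas after a failed server returns.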
Dynamic Schemas: In previous lessons we described NOSQL databases as being schema-less. In fact, depending on the type of NOSQL database, it would be more correct to say they are ‘poly-schemaed’ or ‘late schemaed’, which means it is not necessary to fully define a schema before loading the data. Depending on type, the available kinds of poly-schemas may be a benefit or a hindrance for certain applications, which we will discuss later when describing each NOSQL type in more detail. By comparison with RDBMS this is a huge advantage. Changing the schema of an RDBMS for a large database can be a very slow process, in the range of weeks or months. In one example, a pharma company doing research wanted to add an element of data from a new diagnostic machine alongside the data already being collected from other devices. Modifying its RDBMS schema required 90 days before any of the new data could be loaded or processed. After switching to NOSQL, the process took less than three days.
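A minimal sketch of what 'late schemaed' means in practice, using plain Python dictionaries as stand-ins for stored records. The field names, device names, and values below are invented for illustration and are not taken from the pharma example.

```python
# Each record is a free-form dictionary; no table definition is declared up front.
readings = []

# Data from a device that was already being collected.
readings.append({"patient_id": 101, "device": "monitor-a", "heart_rate": 72})

# A new diagnostic machine reports a field the older records never had.
# Nothing has to be altered or migrated before this record can be stored.
readings.append({"patient_id": 101, "device": "analyzer-b", "glucose_mg_dl": 95})


def field_values(records, field):
    """Return the values of `field` from only the records that contain it."""
    return [record[field] for record in records if field in record]


print(field_values(readings, "heart_rate"))
print(field_values(readings, "glucose_mg_dl"))
```

The trade-off is that the schema has not disappeared; it has moved into the code that reads the data, which must now cope with records that lack a given field.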
Latency: Latency is literally the time it takes for a system to return an answer. If this is a real-time system serving up information or recommendations to your web shoppers, ½ second (500 ms) is considered too long. On the other hand, if this is work being done by your data scientists to study an issue, then a matter of hours or even overnight might not be too long. High availability, fast data ingest speeds, and disaster recovery may be critically important, but it is latency that your users notice most. Latency has many contributors. These are the ones you should pay attention to.
Transactional versus Analytic Applications: One of the most fundamental splits in how fast one must get an answer is whether or not there is a customer waiting for it (transactional applications). By contrast, our analysts often work to find useful patterns over many days or weeks (analytic applications) and can plan their efforts so that a delay of even 24 hours in returning data isn’t disruptive. Later, when we describe specific NOSQL types, you will see that Key-Value stores are essentially always run in batch mode. You wouldn’t select this type to provide instant returns to your customer. Other types of NOSQL can be run in near real time and are more suitable for transactional applications.
Auto Sharding and Automatic Load Balancing: It is a fact of all database types that it is faster to load data than it is to search and retrieve it. The way the data is divided up (sharded) among servers is the first step in ensuring that too much highly searched data does not end up on a small number of servers, creating a bottleneck. Automatic load balancing is the procedure the NOSQL system uses to check that the sharding remains efficient: it constantly monitors search and retrieve queries, always on the lookout for too many queries hitting a single server. If this happens consistently, the NOSQL controller will automatically move some of the data or redirect the query to an alternate copy so that no bottleneck occurs.
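The following is a rough sketch of the two ideas in this section, assuming simple hash-based sharding. Real products use a variety of sharding schemes, and the function names, the four-shard setup, and the 50% threshold here are our own illustrative choices.

```python
import hashlib
from collections import Counter


def shard_for(key, shard_count):
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def hot_shards(query_keys, shard_count, threshold=0.5):
    """Return the shards that receive more than `threshold` of all queries."""
    hits = Counter(shard_for(key, shard_count) for key in query_keys)
    total = sum(hits.values())
    return sorted(shard for shard, n in hits.items() if n / total > threshold)


# Eight of ten queries ask for the same popular product.
queries = ["product:1"] * 8 + ["product:2", "product:3"]
print(hot_shards(queries, shard_count=4))
```

A load balancer built on this idea would respond to a reported hot shard by moving some of its keys elsewhere or by sending reads to a replica. That step is left out here to keep the sketch short.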
Integrated Caching: Database designers, both RDBMS and NOSQL, have long been aware that certain types of queries are more common than others. In response, they place caches of high-use data in several places to speed up queries. The pursuit of ever quicker query response times is a major area of development for all competitors. Many NOSQL databases have excellent automated and integrated caching capabilities, which do indeed improve read performance. However, caches do not improve write performance, and they add complexity. So if your application is dominated by writes, caches are of little advantage. If your application is dominated by reads, caching capability is critical to performance. And while it doesn’t specifically relate to caching, we should mention here that some vendors offer ‘in memory’ databases (SAP HANA, for example), in which very large amounts of memory (RAM) are used so that the entire very large database is always in memory, which can also dramatically reduce latency.
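To make the read-versus-write point concrete, here is a minimal sketch of a read-through cache that evicts the least recently used item when it is full. It is an assumption-laden toy: the `ReadThroughCache` class, the capacity of two, and the `database` dictionary are hypothetical stand-ins, and integrated caches in real products are considerably more sophisticated.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve repeated reads from memory; fall back to the database on a miss."""

    def __init__(self, load_from_database, capacity=2):
        self.load = load_from_database  # the slow lookup being protected
        self.capacity = capacity
        self.items = OrderedDict()      # ordered oldest to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.items:
            self.hits += 1
            self.items.move_to_end(key)     # mark as most recently used
            return self.items[key]
        self.misses += 1
        value = self.load(key)              # slow path: go to the database
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
        return value


# A plain dictionary stands in for the slow database.
database = {"sku-1": "red shirt", "sku-2": "blue shirt", "sku-3": "green shirt"}
cache = ReadThroughCache(database.__getitem__)

for key in ["sku-1", "sku-1", "sku-2", "sku-1", "sku-3", "sku-2"]:
    cache.get(key)

print(cache.hits, cache.misses)
```

Only repeated reads of the same key are answered from memory. Every write would still have to reach the database, which is why a cache does little for a write-heavy application.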
July 23, 2014
Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2014, all rights reserved.
About the author: Bill Vorhies is President & Chief Data Scientist of Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at:
All nine lessons can be downloaded as a White Paper at:
Originally posted on Data Science Central