In the beginning (about 2008) there was Hadoop. More precisely, when people speak about Hadoop they are really talking about a group of programs built around the Hadoop Distributed File System (HDFS), which needs other applications (the Apache Software Foundation calls them projects) such as YARN, MapReduce, Hive, Pig and others to rise to the level of the first Big Data database. Today, Hadoop 2.0 is the foundation for most offerings in this category.
Key value stores (KVs) are the simplest of the NoSQL types, consisting only of a unique key and a bucket containing any data you wish to store there. The contents of the buckets do not need to be consistent or follow any schema (they are schema-less).
The content of the bucket can be literally anything you like, but applications around unstructured or semi-structured data are the most common. Buckets can store large blocks of unstructured data (e.g. customer service logs, the stored backup image of a smartphone, weblogs, the Gettysburg Address, anything) and can hold quite large entries, including BLOBs (Binary Large Objects). To read the value you need to know the key and bucket.
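To make the model concrete, here is a minimal in-memory sketch, not any vendor's API. It assumes a bucket is a namespace and that a (bucket, key) pair maps to an opaque value; the class name `KeyValueStore`, its `put` and `get` methods, and the sample bucket and key names are all hypothetical.

```python
from typing import Dict, Optional, Tuple


class KeyValueStore:
    """Toy in-memory key-value store: (bucket, key) -> opaque bytes."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str], bytes] = {}

    def put(self, bucket: str, key: str, value: bytes) -> None:
        # The store never inspects the value, so no schema is enforced.
        self._data[(bucket, key)] = value

    def get(self, bucket: str, key: str) -> Optional[bytes]:
        # A read needs both the bucket and the key; there is no query language.
        return self._data.get((bucket, key))


store = KeyValueStore()
# Values in different buckets (or the same bucket) need not look alike.
store.put("speeches", "gettysburg", b"Four score and seven years ago ...")
store.put("logs", "call-0001", b'{"agent": "A17", "note": "billing question"}')

speech = store.get("speeches", "gettysburg")
missing = store.get("speeches", "unknown-key")  # None: the key was never written
```

A real KV distributes these pairs across many servers, but the contract seen by the application is essentially this put and get pair.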
KVs are row-based systems designed to efficiently return the data for an entire bucket (interpreted as a row or record) in as few operations as possible. Essentially all KVs run in batch mode and are therefore used for analytic or caching projects rather than transactional applications.
Strengths
- Highly fault tolerant – always available.
- Schema-less design offers an easier upgrade path for changing data requirements (however, see Document stores for even greater flexibility).
- Efficient at retrieving information about a particular object (bucket) with a minimum of disk operations. For example, returning a contact record in a Rolodex application.
- Very simple data model. Very fast to set up and deploy.
- Great at scaling horizontally across hundreds or thousands of servers.
- No requirement for SQL queries, indexes, triggers, stored procedures, temporary tables, forms, views, or the other technical overheads of RDBMS.
- Very high data ingest rates. Favors write once, read many applications.
- Powerful offline reporting with very large data sets.
- Some vendors are offering advanced forms of KVs that approach the capabilities of document stores or column oriented stores.
Weaknesses
- Not suitable for complex applications.
- Not efficient at updating records where only a portion of a bucket is to be updated. Generally not for transactional processing.
- Generally runs in batch only. Expect queries to take minutes or hours.
- Not efficient at retrieving limited data from specific records. For example, returning only the records of employees making between $40K and $60K from an employee database.
- Offers only eventual consistency (see Lesson 2 – some vendors are challenging this limitation).
- Unsuited for interconnected (graph) data.
- As the volume of data increases, maintaining unique keys becomes more difficult and requires some complexity in generating character strings that remain unique across a large set of keys.
- Queries generally need to read all the records in a bucket unless you construct secondary indexes.
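The salary example and the secondary index point above can be sketched as follows. This is an illustration under stated assumptions, not how any particular KV product implements indexing: the `store` dictionary, the employee records, and the function names `scan_salary_range` and `indexed_salary_range` are all hypothetical.

```python
import bisect
import json
from typing import Dict, List, Tuple

# Hypothetical employee records stored as opaque JSON strings under unique keys.
store: Dict[str, str] = {
    "emp:1": json.dumps({"name": "Ana", "salary": 38000}),
    "emp:2": json.dumps({"name": "Ben", "salary": 45000}),
    "emp:3": json.dumps({"name": "Cy", "salary": 59000}),
    "emp:4": json.dumps({"name": "Dee", "salary": 72000}),
}


def scan_salary_range(low: int, high: int) -> List[str]:
    """Full scan: every value must be read and decoded to test the range."""
    return sorted(
        key for key, raw in store.items()
        if low <= json.loads(raw)["salary"] <= high
    )


# Secondary index: (salary, key) pairs kept sorted by salary.
# In a real system this would have to be updated on every write.
salary_index: List[Tuple[int, str]] = sorted(
    (json.loads(raw)["salary"], key) for key, raw in store.items()
)


def indexed_salary_range(low: int, high: int) -> List[str]:
    """Index lookup: binary search finds the matching slice without reading values."""
    start = bisect.bisect_left(salary_index, (low, ""))
    # chr(0x10FFFF) sorts after any key, so salaries equal to `high` are included.
    end = bisect.bisect_right(salary_index, (high, chr(0x10FFFF)))
    return sorted(key for _, key in salary_index[start:end])
```

Both functions return the same keys. The difference is the work done: the scan touches every record, while the index touches only the matching slice, at the cost of keeping a second structure in step with the data.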
Particular Opportunities and Project Characteristics
- Rapidly ingesting large volumes of unstructured and semi-structured text and data. Text analysis and customer sentiment analysis were among the earliest and most widely adopted project types for KVs. Examples include:
  - Text and document data from inside or outside your company.
  - Call center logs.
  - Social media feeds.
  - Web logs and click data.
  - Bits of web pages.
- Real-time data collection such as point-of-sale data or factory control systems.
- Complex objects that were expensive to join in a relational database, stored whole to reduce latency.
- High ingest rates lend themselves to the “Velocity” element of big data, where constant streams of data must be captured at speed.
- Applications with lots of small continuous reads and writes that may be volatile (see also Document stores for even greater capability).
- Creating datasets that are rarely accessed but grow continuously over time (caching).
- Retrieving data from an entire bucket, such as the contact information in a Rolodex system or product information in an online shopping system.
- Where write performance is your highest priority.
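As a rough sketch of the write-once ingest pattern described above, the snippet below appends point-of-sale events under freshly generated keys, which also shows one common way to keep keys unique as volume grows. The helper names `make_key` and `ingest`, the key layout, and the `pos-terminal-7` source label are assumptions made for illustration only.

```python
import time
import uuid
from typing import Dict


def make_key(source: str) -> str:
    """Build a key that stays unique without coordinating between writers.

    The timestamp is only there for readability; uniqueness comes from uuid4.
    """
    timestamp_ms = int(time.time() * 1000)
    return f"{source}:{timestamp_ms}:{uuid.uuid4().hex}"


def ingest(store: Dict[str, bytes], source: str, payload: bytes) -> str:
    # Write once: each event gets a fresh key, so no existing value is read or updated.
    key = make_key(source)
    store[key] = payload
    return key


events: Dict[str, bytes] = {}
keys = [ingest(events, "pos-terminal-7", f"sale #{n}".encode()) for n in range(1000)]
```

Because no write depends on reading an earlier value, many writers can ingest in parallel, which is what makes this pattern a natural fit for high-velocity streams.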
July 23, 2014
Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2014, all rights reserved.
About the author: Bill Vorhies is President & Chief Data Scientist of Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.
All nine lessons can be downloaded as a White Paper.
Originally posted on Data Science Central