Originally posted on Data Science Central
RethinkDB is an open source NoSQL database that stores JSON documents, which can be great for open-ended data analytics. The company officially provides drivers for Ruby, Python and Node.js, and community-supported drivers and ORMs are available in around a dozen languages.
The production-ready version 2.0 was released very recently, on April 14, 2015, after 5 years of development. RethinkDB is a boon when it comes to writing real-time applications. Traditionally, applications had to poll databases to get updated data, which made them slow and hard to maintain. RethinkDB's architecture solves this problem by pushing the updated results of a query to the application as soon as they are available.
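To make the contrast with polling concrete, here is a minimal sketch in plain Python. It is not RethinkDB's API: the Table class, its subscribe method and the sample keys are hypothetical names invented for illustration, and they only mimic the idea of a data store pushing each change to interested clients instead of waiting to be asked.

```python
# Hypothetical in-memory illustration of "push" delivery; not the RethinkDB API.
class Table:
    def __init__(self):
        self.rows = {}
        self.subscribers = []

    def subscribe(self, callback):
        # Register a function to be called on every write.
        self.subscribers.append(callback)

    def insert(self, key, value):
        old = self.rows.get(key)
        self.rows[key] = value
        # Push the change to every subscriber as soon as it happens.
        for callback in self.subscribers:
            callback({"old_val": old, "new_val": value})


received = []
scores = Table()
scores.subscribe(received.append)  # the client registers once, then just listens
scores.insert("alice", 10)
scores.insert("alice", 12)
```

After the two writes, received holds one entry per change without the client ever re-running a query. In RethinkDB itself this role is played by changefeeds, which are requested with the changes command.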
Apart from solving the real-time data push problem, RethinkDB offers many advantages, such as:
Its advanced query language, ReQL, supports table joins and subqueries. The monitoring API also integrates with the query language, which makes scaling distributed databases very easy.
Unlike some earlier NoSQL systems, RethinkDB never acknowledges a write until it is safely written to disk.
Additionally, the database supports MapReduce functionality out of the box and does not need additional Hadoop-style software to run the analysis.
RethinkDB has its own query language called ReQL (RethinkDB Query Language), which is used to manipulate the JSON documents stored in the database. Unlike other NoSQL query languages, ReQL embeds into the programming language being used. Queries are constructed by making function calls in that language, not by concatenating strings.
All operations, such as fetching data, updating data and joining tables, are accomplished by calling the appropriate methods in the language.
For example, to create a table using the driver in the Python programming language:
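connection = rethinkdb.connect()
rethinkdb.table_create('students').run(connection)  # table name 'students' is assumed from the later examples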
In the above code, table_create is the method that creates the table.
Queries are chainable with the . operator and are lazy. Data flows from left to right: you can select data from a table and pass it to a method that transforms it, and the transformed data can in turn be passed to another transformer method, and so on until the query is built.
Assuming the data is in the following format
Queries are executed on the server, and they are sent to the server only when the run command is called, so the above query can also be written as:
query_var = rethinkdb.table('students').pluck('name').count()
query_var.run(connection)
Whenever possible, the query is split across multiple CPU cores or across servers in a cluster; the partial results are then combined and the final result is returned.
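The chaining and laziness described above can be sketched in plain Python. This is only an illustration of the idea, not the RethinkDB driver: the Query class is hypothetical, the sample students list is made up, and run takes a local list where the real driver takes a connection.

```python
# Hypothetical sketch of a lazy, chainable query builder; not the RethinkDB driver.
class Query:
    def __init__(self, steps=()):
        self.steps = tuple(steps)  # recorded operations, nothing executed yet

    def pluck(self, *fields):
        # Keep only the named fields of each document.
        step = lambda rows: [{f: row[f] for f in fields if f in row} for row in rows]
        return Query(self.steps + (step,))

    def count(self):
        # Count the documents that reach this step.
        step = lambda rows: len(rows)
        return Query(self.steps + (step,))

    def run(self, rows):
        # Only here does the data flow through the steps, left to right.
        result = rows
        for step in self.steps:
            result = step(result)
        return result


# Made-up sample data; the field names are assumptions for illustration.
students = [
    {"name": "Ann", "class": "A"},
    {"name": "Bob", "class": "B"},
    {"name": "Cy", "class": "A"},
]

query_var = Query().pluck("name").count()  # built, but not yet executed
total = query_var.run(students)            # executed only now
```

Each method returns a new Query that remembers one more step, so a query can be stored in a variable, extended, and executed later, which is the same shape as assigning a ReQL query to query_var and calling run on it afterwards.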
MapReduce in RethinkDB
RethinkDB supports MapReduce through its map and reduce commands. The map command filters and transforms the elements of the input sequence into a new sequence (a sequence is a list data type). The reduce command aggregates the values produced by map into a single value.
The map and reduce commands can be preceded by an optional group command, which partitions the input sequence into multiple groups based on some field. When a grouped sequence is passed to map, it outputs grouped sequences, and reduce then produces a single value for each group.
For example, to count the number of students in a class using MapReduce on the above students table, we first map each document in the students table to the number 1 (denoting a single student) and then pass the sequence of 1s to reduce, which adds them up to give the number of students:
rethinkdb.table('students').map(lambda student: 1)
rethinkdb.table('students').map(lambda student: 1).reduce(lambda a, b: a + b).run(connection)
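The same steps can be traced without a database. The sketch below uses only Python's built-in map and functools.reduce on a made-up list of students (the field names are assumptions), and adds a manual grouping step to show what group followed by map and reduce amounts to. It illustrates the semantics only and says nothing about how RethinkDB executes or distributes the work.

```python
from collections import defaultdict
from functools import reduce

# Made-up sample data; the field names are assumptions for illustration.
students = [
    {"name": "Ann", "class": "A"},
    {"name": "Bob", "class": "B"},
    {"name": "Cy", "class": "A"},
]

# map: turn every student document into the number 1
ones = list(map(lambda student: 1, students))

# reduce: add the 1s together to get the total count
total = reduce(lambda a, b: a + b, ones)

# group: partition by the "class" field, then map and reduce within each group
groups = defaultdict(list)
for student in students:
    groups[student["class"]].append(student)

per_class = {
    name: reduce(lambda a, b: a + b, map(lambda student: 1, members))
    for name, members in groups.items()
}
```

Here total is the count over the whole list, while per_class holds one reduced value per group, matching the description of reduce producing a single value for each group.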
RethinkDB has provided a case study on how they used the MapReduce function on the US presidential election data set.
Though the production-ready version was released very recently, the company claims that hundreds of technology startups and even Fortune 500 companies are using it in production.
Currently, RethinkDB powers things such as large multiplayer games, reactive mobile and web apps, and real-time analytics for various companies. Some companies, such as Lavaboom, have blogged about the company's excellent support.