
RethinkDB for Advanced Analytics

Originally posted on Data Science Central

RethinkDB is an open-source NoSQL database that stores JSON documents, which makes it a good fit for open-ended data analytics. The company officially provides drivers for Ruby, Python, and Node.js, and community-supported drivers and ORMs are available in around a dozen languages.

The production-ready version 2.0 was released very recently, on April 14, 2015, after five years of development. RethinkDB is a boon for writing real-time applications. Traditionally, applications had to poll the database to pick up changes, which made them slow and hard to maintain. RethinkDB's architecture addresses this problem by pushing the updated results of a query to the application as soon as they are available.
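The push model can be sketched in plain Python, with no database involved. This is only an illustration of the idea: the ToyTable class and its subscribe method are hypothetical names invented here, not part of the RethinkDB driver, and the old_val/new_val keys are simply a convenient shape for a change notification.

```python
class ToyTable:
    """In-memory stand-in for a table that pushes changes to subscribers."""

    def __init__(self):
        self.rows = {}
        self.subscribers = []

    def subscribe(self, callback):
        # Register a function to call whenever a row is written.
        self.subscribers.append(callback)

    def insert(self, row):
        old = self.rows.get(row["id"])
        self.rows[row["id"]] = row
        # Push the change to every subscriber instead of waiting to be polled.
        for callback in self.subscribers:
            callback({"old_val": old, "new_val": row})


table = ToyTable()
received = []
table.subscribe(received.append)

table.insert({"id": "a1", "name": "foo", "number": 1})
table.insert({"id": "a1", "name": "foo", "number": 2})
```

The application never asks the table whether anything changed; each write arrives in received as it happens, which is the behaviour a polling loop has to approximate with repeated queries.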

Apart from solving the real-time data push problem, RethinkDB offers other advantages:

  • Its advanced query language, ReQL, supports table joins and subqueries. The monitoring API also integrates with the query language, which makes scaling distributed databases much easier.

  • Unlike some earlier NoSQL systems, RethinkDB by default never acknowledges a write until it is safely written to disk.

  • Additionally, the database supports MapReduce functionality out of the box and does not need additional Hadoop-style software to run the analysis.

JSON documents in RethinkDB are manipulated with ReQL, the RethinkDB query language. Unlike other NoSQL query languages, ReQL embeds into the programming language being used: queries are constructed by making function calls in that language, not by concatenating strings.

All operations, such as fetching and updating data or joining tables, are accomplished by calling the appropriate methods in the language.

For example, to create a table using the driver for the Python programming language:

import rethinkdb

connection = rethinkdb.connect()

rethinkdb.table_create('students').run(connection)

In the above code, table_create is the method that creates the table.

Queries are chainable with the . operator and are lazy. Data flows from left to right: you can select data from a table and pass it to a method that transforms it, pass the transformed data in turn to another transformer method, and so on until the query is built.

Assume each document in the students table has the following format:

{

'id': 'a1',

'name': 'foo',

'number': 1

}

The following query plucks the name field from each document and counts the results:

rethinkdb.table('students').pluck('name').count().run(connection)

Queries run on the server, and they are sent to the server only when the run command is called, so the above query can also be written as:

query_var = rethinkdb.table('students').pluck('name').count()

query_var.run(connection)
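To make the lazy behaviour concrete, here is a toy query builder in plain Python. It is a sketch of the general technique, not how the RethinkDB driver is implemented: LazyQuery is a hypothetical class written for this example, and its run method takes a list of rows where the real driver takes a connection.

```python
class LazyQuery:
    """Toy query object: chaining records steps, run() executes them."""

    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def pluck(self, field):
        # Return a new query with one more step; no data is touched here.
        return LazyQuery(self.steps + (("pluck", field),))

    def count(self):
        return LazyQuery(self.steps + (("count",),))

    def run(self, rows):
        # Only now are the recorded steps applied, from left to right.
        result = rows
        for step in self.steps:
            if step[0] == "pluck":
                field = step[1]
                result = [{field: row[field]} for row in result if field in row]
            elif step[0] == "count":
                result = len(result)
        return result


students = [
    {"id": "a1", "name": "foo", "number": 1},
    {"id": "a2", "name": "bar", "number": 2},
]

query = LazyQuery().pluck("name").count()  # nothing has executed yet
total = query.run(students)                # the whole chain executes here
```

Because each method only returns a new description of the query, a partially built query can be stored in a variable, extended later, and run as many times as needed.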

Whenever possible, RethinkDB splits the query across multiple CPU cores or across servers in a cluster, then combines the partial results and returns the final result.
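The split-and-combine idea can be illustrated with a count spread over several partitions. This is a plain Python sketch under assumed data: the three "shards" below are made-up lists, and count_across_shards is a hypothetical helper, not a RethinkDB function.

```python
def count_across_shards(shards):
    # Each shard computes its own partial count independently...
    partial_counts = [len(shard) for shard in shards]
    # ...and the partial counts are then combined into one result.
    return partial_counts, sum(partial_counts)


# Hypothetical split of a students table across three shards.
shards = [
    [{"id": "a1"}, {"id": "a2"}],
    [{"id": "a3"}],
    [{"id": "a4"}, {"id": "a5"}, {"id": "a6"}],
]

partial_counts, total = count_across_shards(shards)
```

This only works because counting can be done piecewise: adding the partial counts gives the same answer regardless of how the documents were divided among the shards.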

MapReduce in RethinkDB

RethinkDB supports MapReduce through its map and reduce commands. The map command filters and transforms the elements of the input sequence into a new sequence (a sequence is a list-like data type). The reduce command aggregates the values produced by map into a single value.

The map and reduce commands can be preceded by an optional group command, which partitions the input sequence into multiple groups based on some field. When a grouped sequence is passed to map, map outputs grouped sequences, and reduce then produces a single value for each group.

For example, to count the number of students in a class using MapReduce on the students table above, first map each document in the table to the number 1 (the count for a single student), then pass the resulting sequence of 1s to reduce, which adds them up to give the number of students:

rethinkdb.table('students').map(lambda student: 1)

rethinkdb.table('students').map(lambda student: 1).reduce(lambda a, b: a + b).run(connection)
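The optional group step described earlier can be sketched in plain Python to show what each stage contributes. This is an illustration only: group_map_reduce is a hypothetical helper, and the class field is an assumption added for the example, since the sample document above does not have one.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical student documents with an added "class" field.
students = [
    {"id": "a1", "name": "foo", "class": "math"},
    {"id": "a2", "name": "bar", "class": "math"},
    {"id": "a3", "name": "baz", "class": "art"},
]


def group_map_reduce(rows, key, mapper, reducer):
    # group: partition the rows by the value of `key`
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # map every row in each group, then reduce each group to a single value
    return {
        group: reduce(reducer, [mapper(row) for row in members])
        for group, members in groups.items()
    }


counts = group_map_reduce(students, "class", lambda s: 1, lambda a, b: a + b)
```

The mapper and reducer are the same two lambdas used in the query above; the only difference is that reduce now runs once per group and yields one count per class instead of a single total.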

RethinkDB has provided a case study on how they used the MapReduce function on the US presidential election data set.

Usage

Although the production-ready version was released very recently, the company claims that hundreds of technology startups and even Fortune 500 companies are using it in production.

RethinkDB currently powers large multiplayer games, reactive mobile and web apps, and real-time analytics for various companies. Some companies, such as Lavaboom, have blogged about the company's excellent support.
