Guest Blog post by Michael Walker
Hadoop (a framework in which code is expressed as map and reduce jobs, and Hadoop runs those jobs across a cluster) is the best-known "Big Data" technology because it allows an organization to store huge quantities of data at very low cost. R is a programming language and software environment for statistical computing and graphics. Put the two together to provide easy-to-use R interfaces to the distributed computing Hadoop environment and you have one king-hell data-crunching tool for serious data analytics.
RHadoop is a small, open-source collection of R packages developed by Revolution Analytics that binds R to Hadoop and allows MapReduce algorithms to be expressed in R, giving data scientists access to Hadoop's scalability from their favorite language. Users can write general MapReduce programs with the full power and ecosystem of an existing, established programming language.
RHadoop comprises three packages:
- rhdfs, which provides file-level manipulation for HDFS, the Hadoop distributed file system
- rhbase, which provides access to HBase, the Hadoop database
- rmr, which allows you to write MapReduce programs in R
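To give a feel for what rmr code looks like, here is a minimal sketch based on the canonical "squares" example from the rmr documentation. It assumes the rmr2 package is installed; the `rmr.options(backend = "local")` line runs the job in-process for experimentation, and would be removed to run against a real Hadoop cluster.

```r
library(rmr2)

# Run in-process for local experimentation; remove to use a real Hadoop cluster.
rmr.options(backend = "local")

# Push a small vector of integers into the (H)DFS-backed store.
small.ints <- to.dfs(1:10)

# A map-only MapReduce job: for each value, emit the value as the key
# and its square as the value.
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)
)

# Pull the key/value pairs back out of the store.
squares <- from.dfs(result)
```

The point is that the map function is ordinary R code operating on ordinary R values, while rmr handles serialization and job submission behind the scenes.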
There are two basic types of graph engines:
(1) Graph databases providing real-time, traversal-based algorithms over linked-list graphs represented on a single server (vendors include Neo4j, OrientDB, DEX, and InfiniteGraph).
(2) Batch-processing engines that use vertex-centric message passing over a graph distributed across a cluster of machines (examples include Hama, GoldenOrb, Giraph, and Pregel).
Here, instead of focusing on a particular vertex-centric, BSP-based graph-processing package such as Hama or Giraph, the results are produced with Hadoop itself (HDFS + MapReduce). Moreover, instead of developing the MapReduce algorithms in Java, the R programming language is used.
When a graph is on the order of 100+ billion elements (vertices + edges), a single-server graph database can neither represent nor process it; a multi-machine graph engine is required. While Hadoop is not a graph engine, a graph can be represented in its distributed file system (HDFS) and processed with its distributed MapReduce framework.
The graph generated previously is loaded into R, and its vertices and edges are counted. Next, the graph is represented as an edge list. An edge list (for a single-relational graph) is a list of ordered pairs, each denoting the tail vertex id and the head vertex id of an edge. The edge list can be pushed to HDFS using RHadoop.
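As a concrete illustration of the edge-list representation, here is a made-up five-edge toy graph in base R (hypothetical example data, not the graph generated earlier in the post):

```r
# A toy directed graph as an edge list: each row is (tail id, head id).
# Hypothetical example data, not the graph from the post.
edge.list <- matrix(c(1, 2,
                      1, 3,
                      2, 3,
                      3, 4,
                      4, 1), ncol = 2, byrow = TRUE)
colnames(edge.list) <- c("tail", "head")

# Counting edges and vertices from the edge list.
num.edges    <- nrow(edge.list)               # one edge per row
num.vertices <- length(unique(c(edge.list)))  # distinct vertex ids

# With RHadoop's rmr package loaded, the edge list could then be pushed
# to HDFS with a call along the lines of: to.dfs(edge.list)
```

Because the edge list is just a two-column structure of vertex ids, it partitions naturally across HDFS blocks, which is what makes it a convenient format for MapReduce-style graph processing.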
Click here for a brief introduction to Hadoop and R that describes how to program in the MapReduce framework and presents an alternative way to implement MapReduce programs, one that strikes a delicate compromise between power and usability.