Apache Spark and R: The Best of Both Worlds

Guest blog post by Kumaran Ponnambalam

As anyone working in Data Science and Analytics knows, R is one of the best languages for data analytics and machine learning. Its simple, easy-to-use syntax and its huge library of capabilities make it a top Data Science language. But R's biggest limitation is the amount of data it can process: its capacity is limited to the memory of a single node (at least in the free version).

Apache Spark is taking the Big Data world by storm. On one side, it offers fast parallel computing that can scale across hundreds of nodes. On the other, it is easy to program: libraries like Spark SQL and MLlib are straightforward to learn and code with, and data transformation and processing are a breeze. Spark supports Scala, Python and Java with the same set of features and libraries, which makes transitioning from a familiar language easier. And its interpreter mode provides the ad hoc analytics experience that data analysts love.

Now there is SparkR from Apache Spark. SparkR provides an interface from R to Spark: you can use the R language and the RStudio IDE to connect to and work with data in Spark. Sitting at a Windows laptop running RStudio, you can process data on the parallel nodes of a Spark cluster. The syntax is simple, straightforward and powerful. Data cleansing and transformation operations run as distributed jobs across the cluster, and the summarized results can then be visualized using R's graphics capabilities.
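To make the workflow concrete, here is a minimal sketch of a SparkR session, assuming a Spark 1.x-era SparkR installation with SPARK_HOME configured; the master URL, file path and column names are hypothetical:

```r
library(SparkR)

# Connect from RStudio to a Spark cluster (master URL is an assumption)
sc <- sparkR.init(master = "spark://cluster-host:7077", appName = "SparkRDemo")
sqlContext <- sparkRSQL.init(sc)

# Load a distributed DataFrame from HDFS (hypothetical path and schema)
df <- read.df(sqlContext, "hdfs:///data/sales.parquet", source = "parquet")

# Aggregations execute on the cluster; collect() brings only the
# summarized result back into local R memory for plotting
byRegion <- summarize(groupBy(df, df$region), total = sum(df$amount))
localDF <- collect(byRegion)  # an ordinary R data.frame from here on

sparkR.stop()
```

Once `collect()` returns a local data.frame, any of R's usual plotting functions can be applied to it.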

A marriage made in heaven for R loyalists? Not yet. SparkR does not support the same suite of machine learning algorithms as the other Spark languages; in fact, only a couple of algorithms are available. I expect and hope that the Spark developers are working to add these in later versions, making SparkR feature-compatible with PySpark and Scala.
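Generalized linear models were among the few algorithms exposed in SparkR at the time. A sketch, again assuming a Spark 1.x-era API, using R's built-in `iris` dataset (SparkR replaces dots in column names with underscores):

```r
library(SparkR)
sc <- sparkR.init(appName = "SparkRGlmDemo")
sqlContext <- sparkRSQL.init(sc)

# Push a small local data.frame to the cluster as a Spark DataFrame
df <- createDataFrame(sqlContext, iris)

# glm() on a Spark DataFrame dispatches to MLlib's distributed fit
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df,
             family = "gaussian")
summary(model)

predictions <- predict(model, newData = df)
sparkR.stop()
```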

So is SparkR not useful until then? Not at all. You can still use SparkR for data cleansing and transformation without having to break your data into smaller pieces to fit R's memory limits. Spark transformations are also many times faster, since they run in memory on parallel nodes.
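A cleansing-and-transformation sketch along these lines, with a hypothetical dataset too large for a single R session (Spark 1.x-era API; the path and column names are assumptions):

```r
library(SparkR)
sc <- sparkR.init(appName = "SparkRCleanseDemo")
sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, "hdfs:///data/transactions.parquet",
              source = "parquet")

# Drop bad rows and derive a column -- all executed on the cluster,
# nothing is pulled into local R memory yet
clean <- filter(df, isNotNull(df$amount) & df$amount > 0)
clean <- withColumn(clean, "amount_k", clean$amount / 1000)

# Only the small summarized result crosses back into R
byDay <- summarize(groupBy(clean, clean$day), total = sum(clean$amount_k))
head(collect(byDay))

sparkR.stop()
```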

Let us hope that the Spark developers answer our prayers and deliver a fully capable version of SparkR.
