Guest blog by Bernard Marr

If you're looking at ways you can harness the power of Big Data analytics in your business, but are not necessarily a techie person yourself, it can be a confusing field at first.

For this reason I'm publishing a series of short posts explaining some of the key concepts and technologies behind Big Data and data analytics, written for an audience which is not primarily composed of IT specialists or data scientists.

I firmly believe that any business can benefit from the new wave of analytics applications and services which can crunch through as much data as you can throw at them, in order to come out with surprising and valuable insights to drive growth.

These projects usually require a mix of skills, and communication between people with different skillsets (e.g. data science and marketing) is essential. So in this post I'll give an overview of R, the programming language favored by many statisticians.

R is a computer programming language which is particularly well suited to handling and sorting the large datasets associated with Big Data projects.

Like Python, which I covered previously, the software environment used to create code in R is open source, meaning it is free to download, anyone can use it, and there is a plethora of guidance and advice available on how to use it most effectively. However, commercial distributions are also available, which often offer additional proprietary functionality or support packages.

Named after the first initials of the two men who first developed the language at the University of Auckland, Robert Gentleman and Ross Ihaka, R has become very popular in recent years and is continuing to become more so, due to the explosion in analytic activities being carried out by businesses.

R's strengths as a statistical programming language stem from the fact that it was designed from the ground up to facilitate matrix arithmetic - carrying out complex, often automated calculations on data which is held in a grid of rows and columns. R is very good for creating programs which can carry out calculations on these datasets, even when the datasets are growing rapidly, and for producing real-time visualizations based on this data.
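To give a flavor of what "matrix arithmetic" means in practice, here is a minimal sketch in R. The sales figures, prices and names (`sales`, `prices`, `revenue`) are made up purely for illustration; only the functions `matrix`, `colSums` and the `%*%` operator are standard parts of R.

```r
# Hypothetical sales counts: rows are products, columns are months.
sales <- matrix(c(10, 20, 30,
                  40, 50, 60),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("widgets", "gadgets"),
                                c("jan", "feb", "mar")))

# Hypothetical unit prices, one per product (one per row of `sales`).
prices <- c(widgets = 2.5, gadgets = 4)

# Element-wise arithmetic on the whole grid at once:
# every row is multiplied by that product's price, with no explicit loop.
revenue <- sales * prices

# Total revenue per month, summed down each column.
monthly_revenue <- colSums(revenue)

# The same totals computed as a true matrix product (1x2 times 2x3).
total_by_month <- prices %*% sales

print(revenue)
print(monthly_revenue)
```

The point to notice is that a single expression such as `sales * prices` operates on every cell of the grid, which is why this style of calculation scales naturally as rows and columns are added.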

Its capability at producing these visualizations is another core strength of R. Its designers realized that visualization was key to being able to understand the complex datasets that are being explored, so they built functionality to translate data into charts, graphs and complex multi-dimensional plots - as well as many user-defined methods of visualization - into its core.
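As a small illustration of how little code a basic chart needs, the sketch below uses `mtcars`, a sample dataset of car measurements that ships with R, together with the built-in `plot` and `hist` functions. The titles and axis labels are my own choices for the example.

```r
# `mtcars` is a sample dataset bundled with R (32 cars, 11 measurements each).
cars <- mtcars

# Scatter plot: does a heavier car get fewer miles per gallon?
plot(cars$wt, cars$mpg,
     main = "Fuel economy against weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")

# Histogram: how is fuel economy distributed across the cars?
hist(cars$mpg,
     main = "Distribution of fuel economy",
     xlab = "Miles per gallon")
```

When run in an interactive R session, each call opens a chart window; when run as a script, R writes the charts to a file instead.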

Online, R is widely used, although you won't see it, as its work is hidden behind pretty graphical interfaces. Companies such as Google, Facebook and Twitter have reported using R for data analysis behind the scenes, so when you use their services you are probably benefiting from analysis carried out in R. In fact it is often cited as the most widely used programming language for data science. APIs exist for almost all of these services, allowing applications written in R to access data from these outside sources and include it in their own analytics routines.

Thanks to this huge user base, just about every function that you might need for data analysis is available, often through open source extensions (known as packages) made available by the community. R can also call code written in other languages such as C++ or Java, so resources coded in those languages can be made available. Because R itself is available for every major operating system, R code can usually be ported between Unix, Windows and Mac environments with little effort.

Python is probably R's biggest rival - but as both are non-commercial entities (as are most languages, computer or otherwise!) it's not necessarily a rivalry in the traditional sense. However, coders will often argue vociferously for their favorite of the two. Python, having more in common with traditional, longer-established programming languages, is often cited as being easier to learn, particularly for someone with prior experience of other high-level programming languages. The R environment, on the other hand, is likely to be more familiar to someone with an academic background in statistics. It's worth noting that Python tends to have a wider range of uses outside of the world of statistics and analytics, whereas R is used almost exclusively for those purposes.

With a reported two million users worldwide, and thousands of deployed applications created using it, R is undoubtedly one of the backbone technologies of the Big Data revolution. If you are thinking of getting involved with the techie end of data analysis, then a thorough grounding in the language should be considered an essential element of your toolbox. If you want to learn more, or have a go at creating your own code in R to see what it can do, there are plenty of great resources online, such as those at Coursera, Code School and RStudio.
