The R language is often perceived as a language for statisticians and data scientists. Quite a long time ago, this was mostly true. However, over the years the flexibility R provides via packages has made R into a more general-purpose language. R was open sourced in 1995, and since that time repositories of R packages are constantly growing. Still, compared to languages like Python, R is strongly oriented around data.
Speaking of data, tabular data deserves particular attention, as it is one of the most commonly used data types. It corresponds to the table structure known from databases, where each column can be of a different type, and the processing performance of this particular data type is a crucial factor for many applications.
In this article, we are going to show how to transform tabular data efficiently. Many people who already use R for machine learning are not aware that data munging can be done faster in R, and that they do not need another tool for it.
High-performance Solution in R
Base R introduced the data.frame class in 1997, based on the earlier S-PLUS implementation. Unlike commonly used databases, which store data row by row, an R data.frame stores data in memory as a column-oriented structure, making it more cache-efficient for the column operations that are common in analytics. Additionally, even though R is a functional programming language, it does not enforce that on the developer. Both opportunities have been well addressed by the data.table R package, available in the CRAN repository. It is quite fast at grouping operations, and is particularly memory-efficient by being careful about materializing intermediate data subsets, e.g. materializing only those columns necessary for a certain task. It also avoids unnecessary copies through its reference semantics when adding or updating columns. The first version of the package was published in April 2006, significantly improving on data.frame performance at that time. The initial package description was:
This package does very little. The only reason for its existence is that the white book specifies that data.frame must have rownames. This package defines a new class data.table which operates just like a data.frame, but uses up to 10 times less memory, and can be up to 10 times faster to create (and copy). It also takes the opportunity to allow subset() and with() like expressions inside the []. Most of the code is copied from base functions with the code manipulating row.names removed.
Since then, both data.frame and data.table implementations have been improved, but data.table remains vastly faster than base R. In fact, data.table isn't just faster than base R: it appears to be one of the fastest open-source data-wrangling tools available, competing with tools like Python's pandas, and with columnar-storage databases or big-data applications like Spark. Its performance over distributed shared infrastructure hasn't yet been benchmarked, but being able to handle up to two billion rows on a single instance gives promising prospects. Its outstanding performance goes hand in hand with its functionality. Additionally, with recent efforts at parallelizing time-consuming parts for incremental performance gains, one direction for pushing its performance limits looks quite clear.
Data Transformation Examples
Learning R is a little easier because it works interactively, so we can follow examples step by step and look at the results of each step at any time. Before we start, let's install the data.table package from the CRAN repository.
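Installation and loading can be sketched as follows; the install step only needs to run once per machine:

```r
# Install once from CRAN (uncomment on first use), then load for the session
# install.packages("data.table")
library(data.table)
```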
Loading Data into R
There are tons of packages for extracting data from a wide range of formats and databases, often including native drivers. We will load data from a CSV file, the most common format for raw tabular data. We don't have to worry about CSV reading performance, as the fread function is highly optimized for it. In order to use any function from a package, we need to load it with the library() call.
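A minimal sketch of reading a CSV with fread; to keep the example self-contained, we first write a tiny CSV to a temporary file (the column names are illustrative):

```r
library(data.table)

# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("carrier,dep_delay", "AA,12", "DL,-3", "AA,45"), csv_path)

# fread autodetects the separator, header, and column types
flights <- fread(csv_path)
print(flights)
```

fread returns a data.table, so the result can go straight into the queries shown in the following sections.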
If our data is not well modeled for further processing, e.g. it needs to be reshaped from long-to-wide or wide-to-long (also known as pivot and unpivot) format, we may look at the ?dcast and ?melt functions, known from the reshape2 package. However, data.table implements faster and more memory-efficient methods for the data.table/data.frame class.
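Both reshaping directions can be sketched on a toy table (column names invented for illustration):

```r
library(data.table)

wide <- data.table(id = c(1, 2),
                   q1 = c(10, 20),
                   q2 = c(30, 40))

# Wide-to-long (unpivot): one row per id/quarter pair
long <- melt(wide, id.vars = "id",
             variable.name = "quarter", value.name = "sales")

# Long-to-wide (pivot): back to one column per quarter
wide2 <- dcast(long, id ~ quarter, value.var = "sales")
```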
If You’re Familiar with data.frame

Querying a data.table is very similar to querying a data.frame. While filtering in the i argument, we can use column names directly, without the need to access them with the $ sign as in df[df$col > 1, ]. In the next argument, j, we provide an expression to be evaluated within the scope of our data.table. To pass a non-expression j argument, use with=FALSE. The third argument, by, not present in the data.frame method, defines the groups, making the expression in j be evaluated by group.
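The general DT[i, j, by] form described above can be sketched on a toy table (data and column names invented for illustration):

```r
library(data.table)

dt <- data.table(col = c(2, 5, 1, 7), grp = c("a", "b", "a", "b"))

# i: filter rows -- column names are in scope, no dt$ prefix needed
dt[col > 1]

# j: an expression evaluated inside the scope of the data.table
dt[, sum(col)]

# by: the j expression is evaluated once per group,
# and all three arguments can be combined in one query
dt[col > 1, .(total = sum(col)), by = grp]
```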
Sorting Rows and Re-Ordering Columns
Sorting data is a crucial transformation for time series, and it is also important for data extraction and presentation. Sorting can be achieved by providing the integer vector of row order to the i argument, the same way as with data.frame. The first argument in the query, order(carrier, -dep_delay), will select data in ascending order on the carrier field and descending order on the dep_delay measure. The second argument, j, as described in the previous section, defines the columns (or expressions) to be returned and their order.
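This query can be sketched on a toy flights table (data invented for illustration):

```r
library(data.table)

flights <- data.table(carrier   = c("DL", "AA", "AA", "DL"),
                      dep_delay = c(5L, 20L, 3L, 40L))

# Ascending on carrier, descending on dep_delay;
# j defines the columns to return and their order
sorted <- flights[order(carrier, -dep_delay), .(carrier, dep_delay)]
print(sorted)
```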
Let’s create a subset dataset for flight origin “JFK” and months from 6 to 9. In the second argument, we subset the results to the listed columns, adding one calculated variable. By default, when subsetting a dataset on a single column, data.table will automatically create an index for that column. This results in real-time answers on any further filtering calls on that column.
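A sketch of this subset on a toy flights table (the original article uses a real flights dataset; the rows and the calculated variable name here are invented):

```r
library(data.table)

flights <- data.table(origin    = c("JFK", "LGA", "JFK", "JFK"),
                      month     = c(7L, 7L, 2L, 9L),
                      dep_delay = c(10L, 5L, 0L, 30L),
                      arr_delay = c(12L, 2L, -5L, 25L))

# A single-column `==` subset; data.table auto-creates an index on `origin`,
# so later filters on that column use fast binary search
jfk <- flights[origin == "JFK"]

# Filter rows and select columns, adding one calculated variable on the fly
jfk_summer <- flights[origin == "JFK" & month %in% 6:9,
                      .(month, dep_delay, arr_delay,
                        total_delay = dep_delay + arr_delay)]
```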
Adding a new column by reference is performed using the := operator, which assigns a variable into the dataset in place. This avoids an in-memory copy of the dataset, so we don't need to assign results to each new variable.
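Both forms of update by reference can be sketched as follows (toy data, column names invented):

```r
library(data.table)

flights <- data.table(dep_delay = c(10L, -2L, 30L),
                      arr_delay = c(12L, -5L, 25L))

# Add a column by reference: no copy of the table is made
flights[, total_delay := dep_delay + arr_delay]

# Sub-assign by reference: update only the rows matched by i
flights[dep_delay < 0L, dep_delay := 0L]
```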
To aggregate data, we provide the third argument, by, to the square bracket. Then, in j, we need to provide aggregate function calls so that the data can actually be aggregated. The .N symbol used in the j argument corresponds to the number of observations in each group. As previously mentioned, aggregates can be combined with subsetting rows and selecting columns.
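A sketch of grouped aggregation, including one combined with a row subset (toy data invented for illustration):

```r
library(data.table)

flights <- data.table(carrier   = c("AA", "AA", "DL", "DL", "DL"),
                      dep_delay = c(10, 20, 5, 15, 25))

# .N is the number of rows in each group; other aggregates go alongside it
agg <- flights[, .(n_flights = .N, mean_delay = mean(dep_delay)), by = carrier]

# Aggregation combines freely with a row subset in i
agg_pos <- flights[dep_delay > 5, .(n_flights = .N), by = carrier]
```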
As in base R, joining and merging of datasets is handled as a special type of subset operation. We provide the dataset we want to join as the first square bracket argument, i. For each row in the dataset provided to i, we match rows from the dataset on which we use [. If we want to keep only matching rows (an inner join), we pass an extra argument, nomatch = 0L. We use the on argument to specify the columns on which we want to join the two datasets.
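Both join types can be sketched on two toy tables (data invented for illustration):

```r
library(data.table)

flights  <- data.table(carrier = c("AA", "DL", "UA"), dep_delay = c(10, 5, 7))
carriers <- data.table(carrier = c("AA", "DL"),
                       name    = c("American Airlines", "Delta Air Lines"))

# Outer join: every row of `flights` (the i argument) is kept,
# with NA for carriers that have no match
carriers[flights, on = "carrier"]

# Inner join: drop rows of `flights` that have no match in `carriers`
inner <- carriers[flights, on = "carrier", nomatch = 0L]
```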
Be aware that, because of the consistency with base R subsetting, the default outer join is a RIGHT OUTER join. If we are looking for a LEFT OUTER join, we need to swap the tables, as in the example above. The exact behavior can also be easily controlled with the merge data.table method, using the same API as base R merge().
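A sketch of controlling the join type through the merge method (toy data invented for illustration):

```r
library(data.table)

flights  <- data.table(carrier = c("AA", "DL", "UA"), dep_delay = c(10, 5, 7))
carriers <- data.table(carrier = c("AA", "DL"),
                       name    = c("American Airlines", "Delta Air Lines"))

# Same API as base R merge(): all.x/all.y/all control the join type
left <- merge(flights, carriers, by = "carrier", all.x = TRUE)  # left outer
full <- merge(flights, carriers, by = "carrier", all = TRUE)    # full outer
```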
If we simply want to look up a column (or columns) into our dataset, we can do it efficiently with the := operator in the j argument while joining. The same way we sub-assign by reference, as described in the Update dataset section, we now just add a column by reference from the dataset to which we join. This avoids an in-memory copy of the data, so we don't need to assign results to new variables.
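A sketch of such a lookup join (toy data invented for illustration); the i. prefix refers to columns of the table passed in i:

```r
library(data.table)

flights  <- data.table(carrier = c("AA", "DL", "UA"), dep_delay = c(10, 5, 7))
carriers <- data.table(carrier = c("AA", "DL"),
                       name    = c("American Airlines", "Delta Air Lines"))

# Look up `name` from `carriers` and add it to `flights` by reference;
# unmatched rows get NA
flights[carriers, name := i.name, on = "carrier"]
```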
A non-equi join feature, for joining datasets on non-equal conditions, is currently being developed.
When exploring our dataset, we may sometimes want to collect technical information about it, to better understand the quality of the data.
We can check the uniqueness of data by using the uniqueN function and applying it to every column. The object .SD in the query below corresponds to the Subset of the Data.table:
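The query looks as follows (toy data invented for illustration):

```r
library(data.table)

flights <- data.table(origin    = c("JFK", "JFK", "LGA"),
                      carrier   = c("AA", "DL", "AA"),
                      dep_delay = c(10, 5, 10))

# .SD holds all columns here; uniqueN is applied to each one,
# returning a one-row table of distinct counts
flights[, lapply(.SD, uniqueN)]
```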
Fast export of tabular data to CSV format is also provided by the data.table package. At the time of writing, the fwrite function hadn't yet been published to the CRAN repository. To use it, we need to install the data.table development version; otherwise we can use the base R write.csv function, but don't expect it to be fast.
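A sketch of fwrite usage, assuming a data.table version that ships the function (it has since been released); the output path is a temporary file for illustration:

```r
library(data.table)

flights <- data.table(carrier = c("AA", "DL"), dep_delay = c(10L, 5L))

out_path <- tempfile(fileext = ".csv")
fwrite(flights, out_path)   # multi-threaded CSV writer

readLines(out_path)
```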
There are plenty of resources available. Besides the manuals available for each function, there are also package vignettes, which are tutorials focused on particular subjects. Those can be found on the Getting Started page. Additionally, the Presentations page lists more than 30 materials (slides, videos, etc.) from data.table presentations around the globe. Community support has also grown over the years, recently reaching the 4,000th question under the Stack Overflow data.table tag, while still keeping a high ratio (91.9%) of answered questions. The plot below presents the number of data.table-tagged questions on Stack Overflow over time.
This article presents selected examples of efficient tabular data transformation in R using the data.table package. Actual figures on performance can be examined by looking for reproducible benchmarks. I published a summarized blog post about data.table solutions for the 50 top-rated StackOverflow questions for the R language, called Solve common R problems efficiently with data.table, where you can find a lot of figures and reproducible code. The data.table package uses a native implementation of fast radix ordering for its grouping operations, and binary search for fast subsets/joins. This radix ordering has been incorporated into base R as of version 3.3.0. Additionally, the algorithm was recently implemented in the H2O machine learning platform and parallelized over the H2O cluster, enabling efficient big joins on 10B x 10B rows.
Article by JAN GORECKI , original post here