Guest blog post by Dr. Vincent Granville
This could be a new startup idea: creating an Excel-compatible software, working just like Excel, but able to handle bigger datasets, much faster.
Like most data scientists, I've been using Excel a lot in my career, and it definitely has some very powerful features. Probably the greatest one is its ability to help design, test, and update models in real time: just changing the value of a few core parameters, and all your cells, thousands of them, and all you charts, get updated at once. If you are not familiar with this, download our most recent spreadsheet to see how it works.
Other nice features includes Excel ability to connect to databases or the Internet (e.g. to the Bing API) and extract useful information, and summarize it in cubes and pivot tables. Although cubes and pivot tables have a very strong feel of old-fashioned, SQL relational database environment. But it is still useful in many contexts. Not sure if Excel can easily retrieve data via the Internet, from non-Microsoft API's, e.g. Google API's. It should.
Yet the main drawback is Excel slowness. It is slow in ways that are unexpected. If you sort 500,000 observations (one column) it's actually quite fast. But let's say that you simultaneously sort two columns: A, and B, where B is the log of A values. So B contains a formula in each row. This dramatically slows down the sort. It is much faster to sort A alone and leave B as is. The formulas in B will correctly update all the values, very fast, and automatically.
As a general rule, any operation that involves re-computing (say) 200,000+ rows across multiple cells linked via dependence relationships, is done very slowly if:
- one of the columns includes functions such as VLOOKUP at each cell
- SORT or other sub-linear processes are required (by sub-linear, I mean processes that are O(n log n) or worse)
There's a very easy, efficient (but ugly) way around this, and I'm wondering why it's not built-in in Excel, and transparent to the user. For instance:
- Replace VLOOKUP formulas by hard-coded values, perform the update on hard-coded values, then put back the formulas
- Perform SORT only on the minimum number of columns where necessary, then update cells in columns involving formulas
Also, I don't know why VLOOKUP is so slow in the first place. I use it all the time to join a (say) 3,000,000 row dataset with a 200,000 lookup table in Perl, and it's very fast. In Excel, the dataset and the lookup table would be stored in two separate worksheets within the same spreadsheet, and it would take days if it was possible to perform this "join" (it isn't). Maybe Excel indirectly performs a full join, thus exponentially slowing down the operation? I almost never do, and people almost never do a full join on large datasets. This is an area where significant improvements could be done.
Finally, not sure if Excel leverages the cloud, but this would be a great way to further speed computations and process data sets far bigger than 1 million rows. Microsoft should allow the user to export the data on some (Microsoft?) cloud in a transparent way, in one click (e.g. just click on "export to cloud") then allow the user to simulate the time-consuming operations on the local Excel version on her desktop (this amounts to using your local Excel spreadsheet as a virtual spreadsheet), and when done, click on "retrieve from cloud" and get your spreadsheet updated. The "retrieve from cloud" would:
- Send your Excel formulas to the cloud via the Internet
- Apply your formula to your cloud version of your data, leveraging Map Reduce as needed
- Get the processed data back to your local spreadsheet on your desktop
Another painfully slow process is when you need to apply a formula to a whole column with 500,000 cells. Fortunately, there is a trick. Let's say you want to store the product of A and B in column C.
- Firstly, select the whole Column C.
- Enter the formula of =A1*B1
- Press the Ctrl key and Enter key together.
I wish it would be easier than that, something like ==A1*A1 (formula with double equal to indicate that the formula applies to the entire column, not just one cell). This is another example of how Excel is not user-friendly. Many times, there are some obscure ways to efficiently do something. We'll see an another example in my next spreadsheet, which will teach you how to write a formula that returns multiple cells - in particular with LINEST which returns the regression coefficients associated with a linear regression. Yes you can do it in basic Excel, without add-in!
For those interested in creating a column with 1,000,000 values (e.g. to test the speed of some Excel computations), here's how to proceed - this would indeed be a good job interview question:
- Start with 200 cells.
- Duplicate these cells, now you have 400.
- Duplicate these cells, now you have 800.
Another 10 iterations of this process, and you'll be at 800,000 cells. It's much faster than the naive approach. And if your initial 200 numbers consist of the formula =RAND(), at the end you'll end up with one million pseudo-random numbers, though of poor quality.
Finally, I use Perl, Python, R, C, Sort, Grep and other languages to do data summarization, before feeding stuff to Excel. But if Excel came with the features I just discussed, much more could be done within Excel. Users would spend more time in Excel.
And one weird question to finish: why does Microsoft not show ads in Excel? They would get advertising dollars that could be used to improve Excel. They could then ship a version of Excel that transparently uses the cloud, a version that even the non technical guys could use, with all the benefits that I mentioned.