 ### A synthetic variance designed for Hadoop and big data

The new variance introduced in this article fixes two big data problems associated with the traditional variance and the way it is computed in Hadoop, using a numerically unstable formula. Synthetic Metrics

This new metric is synthetic: It was not derived naturally from mathematics like the variance taught in any statistics 101 course, or the variance currently implemented in Hadoop (see above picture). By synthetic, I mean that it was built to address issues with big data (outliers) and the way many big data computations are now done: Map Reduce framework, Hadoop being an implementation. It is a top-down approach to metric design - from data to theory, rather than the bottom-up traditional approach - from theory to data.

Other synthetic metrics designed in our research laboratory include:

There are two issues with the formula used for computing Variance in Hadoop. First, the formula used, namely Var(x1, ... , xn) = {SUM(xi^2)/n} - {SUM(xi)/n}^2, is notoriously unstable. For large n, while both terms cancel out somewhat, each one taken separately can take a huge value, because of the squares aggregated over billions of observations. It results in numerical inaccuracies, with people having reported negative variances. Read the comments attached to my article The curse of Big Data for details. Besides, there are variance formula that do not require two passes of the entire data sets, and that are numerically stable.

Of course, the formula currently used is so easy to implement in a Map Reduce environment such as Hadoop. You first split your data into 200 buckets. You compute both sums separately on these 200 data buckets (each computation simultaneously done on a separate server to decrease computation time by a factor 200), then perform 200 aggregations and a final square to get the value for the right-hand term.

My main concern though is not about numerical instability, but with the fact that in large data sets, outliers are a big problem and will screw up your variance. This variance formula will make variance comparisons across data bins meaningless, and will result in highly volatile, inaccurate predictions if the data is used for predictive modeling. The solution is to use a different type of variance, one that is not sensitive to outliers yet easy to compute with Hadoop.

The abstract concept of variance

Before introducing a new definition of variance, lets's first discuss what a variance metric should satisfy. The following are desirable properties, listed in order of importance: #1, #2 and #3 being the most fundamental, while #5, #6 and #7 are rarely needed.

1. The variance is positive. It is equal to 0 if and only if x1 = ... = xn.
2. The variance (denoted as V) is symmetrical. Any permutation of (x1, ... , xn) has same variance as V(x1, ... , xn).
3. The further away you are from x1 = ... = xn, the higher the variance.
4. The variance is translation-invariant: V(x1+c, ... , xn+c) = V(x1, ... , xn) for any constant c.
5. The variance is scale-invariant: V(c*x1, ... , c*xn) = V(x1, ... , xn) for any constant c. Read first technical note below for explanations.
6. The variance is invariant under a number of matrix transformations such as rotations, or reflections.
7. The variance is bounded. For instance, V(x1, ... , xn) < 1.

Properties #1 and #2 are mandatory. Property #5 is not necessary; it is easy to derive a scale-invariant variance W from any variance V: define W(x1, ... , xn) as V(x1, ... , xn) divided by the average V(xi, xj) computed on all subsets {xi, xj} of {x1, ... , xn}. Property #7 is not necessary. You can turn any non-bounded variance V into a bounded one using W = V / (1+V). The disadvantage of a non-bounded variance is that a single outlier can make the variance arbitrarily large.

Property #4 is not necessary. You can derive a translation-invariant variance W from V as follows: W(x1, ..., xn) = max V(x1+c, ..., xn+c), where the maximum is computed over all c's. This works only when the c achieving maximum V is finite, regardless of (x1, ..., xn).

I'd like to redefine property #3 differently. All the variances described here share a property similar to #3 anyway. But strictly speaking, property #3 means that the higher the distance d between (x1, ... ,xn) and the line containing the points (0, ... , 0) and (1, ..., 1) in the n-dimensional space, the higher the variance. It turns out that d is exactly the traditional standard deviation computed on (x1, ..., xn), and thus, the only variances satisfying property #3 would be monotonic functions of the traditional variance.

The new variance introduced here meets property #1 and #2. It is defined as follows:

Let g be an arbitrary strictly convex function, for instance g(x) = |x|^a (with a>1). Then

V(x1, ... , xn) = {SUM(g(xi))/n} - g(SUM(xi)/n).

If you use g(x)=|x|^a with a=2, then the formula coincides with the traditional variance. I recommend a much lower value, for instance a=1.25, to solve both the numerical instability and outlier issues. I wish we could use a=1, but then g would be convex but not strictly convex, and it just does not work at all. Interestingly, when transforming V (defined using g(x)= |x|^a) to make it translation-invariant, we get a new variance W which works perfectly well even with a=1, making computations faster: it takes far more time to compute |x|^1.25 than |x|^1.

When g(x) = |x|^a and 1 < a < 2, the c that maximizes V(x1+c, ..., xn+c) is c = -(x1 + ... + xn)/n. I haven't proved this result yet; I invite you to come with a formal proof. But I tested it on several data sets. Clearly, the result no longer holds when a=1 or a=2; if (and only if) a=2, the variance V and its translation-invariant version W are identical. So our final variance, after this transformation, is

W(x1, ... , xn) = SUM{g(xi - u)/n}, with u = (x1 + ... + xn)/n, g(x) = |x|^a and a=1.

Technical note: W satisfies property #4 (translation-invariant) under the L^a norm. If you transform the above W into a scale-invariant variance using the technique previously described, you will end up with a final metric that is good at detecting clusters and data structures with holes: The resulting metric takes higher values when strong cluster patterns are present, and conversely, low values when data points are more evenly spread. This works for 1 < a < 2, as well as for a=1. It does not work for a=2 (standard variance) as the resulting metric would then depend only on n: it would always be equal to (n-1)/(2n) regardless of (x1, ... ,xn), if a=2. So, the traditional variance is the only variance (with parameter a between 1 and 2) that can not be made scale-invariant using the approach described here. As an exercise, compute the variance for a=1, after transforming W (the translation-invariant version of V) to make it scale-invariant, on four data sets with n=10 points: first data set is {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, second data set is {0,  0, 10/3, 4.5, 4.5. 4.5. 4.5, 9, 9, 9}, third data set is {0, 0, 0, 0, 0, 9, 9, 9, 9, 9} and forth data set is {0, 0, 0, 0, 0, 1, 1, 1, 1, 1}. This will help you understand what a scale-invariant variance is. Note the the values computed on the third and fourth data sets  are identical precisely because of the scale-invariant property. Finally, note that our formula to obtain a scale-invariant variance is O(n^2) in terms of computation, that is, very time consuming. Everything else discussed here is O(n). How can you improve this O(n^2)?

A New Big Data Theorem

Our research lab occasionally produces new theorems with immediate application to better process data - big and small.

Theorem: The formula V(x1, ... , xn) = {SUM(g(xi))/n} - g(SUM(xi)/n) defines a variance, satisfying properties #1 and #2, when g is a strictly convex function.

This result - I call it the Fourth AnalyticBridge Theorem - is a direct consequence of Jensen's inequality. For those interested, here are our three previous theorems:

Exercise: Assuming X=(x1, ..., xn) and Y=(y1, ...,yn) are two vectors, then Cov(X, Y) = {V(X+Y) - [V(X)+V(Y)]}/2 defines the traditional covariance, if g(x)=|x|^a and a=2. How to adapt this formula to 1 < a < 2? Can you generalize this covariance definition to more than two vectors?

Transformation-Invariant Metrics

It is easy to prove that when a=2 (the traditional variance), V satisfies property #4, that is, it is translation invariant: adding a constant to all data points x1, ... , xn does not change V. This is no longer true when 1 < a < 2. For instance, with n=3, we have:

• when a=2 and x1=3, x2=5, x3=10, then V=8.67
• when a=2 and x1=10,003, x2=10,005, x3=10,010, then V=8.67 (unchanged)
• when a=1.25 and x1=3, x2=5, x3=10, then V=0.34
• when a=1.25 and x1=10,003, x2=10,005, x3=10,010, then V=0.00

This is one of the reasons why I don't like translation invariance and prefer a=1.25 over the standard a=2.

It can be easily be proved that to satisfy the translation-invariant property, the function g must satisfy g(x) - 2g(x+1) + g(x+2) = constant. The only functions g satisfying this constraint are quadratic forms (for instance when a=2), and are thus linked to the traditional variance. Likewise, to satisfy the scale-invariant property, the function g would have to satisfy the equation g(2x) - 2g(x) = constant. There is no strictly convex function satisfying this constraint, so none of our V's is scale-invariant. But you can use a transformation described above to build a variance W which is scale-invariant.

Technical note: The traditional variance defined by a=2 is also invariant under orthogonal transformations (reflections or rotations) that preserve the mean, that is, it is invariant under linear operators that are both orthogonal and stochastic. In matrix notations, such an operator is a n x n matrix A applied to the column vector X = (x1, ..., xn)' and having the sum of each column equal to 1. Here ' denotes the transposed vector (from row to column).  In short, V(AX) = V(X) if A is such a matrix and a=2. A particular interesting case is when A is both doubly stochastic and orthogonal. Then A is a permutation matrix. Such matrices A can be produced using the following iterative algorithm: Start with any doubly stochastic matrix A_0; define A_{k+1}= (A_k - s*I) / (I - s*A'_k) for k>0. Here we assume 0 < s < 1 for convergence, and the symbol I denotes the n x n identity matrix.

Implementation: Communications versus Computational Costs

One of the issues with Hadoop and Map Reduce is the time spent in data transfers, as you split a task into multiple parralel processes, each running on a different server. It shifts the problem of computational complexity from algorithm complexity to communication or bandwidth costs. By communication, I mean the amount of time moving data around - from a centralized server (data base of file management system) or distributed system, to memory or to hard disks. How to optimize these data transfers? How to optimize in-memory computations? You can not rely on a built-in function such as variance if it is not properly implemented, as shown here.

Indeed, the variance computation is an interesting problem that illustrates some Hadoop technical challenges. If no efficient built-in algorithm is available, it is necessary to do home-made computations that leverage the Map Reduce architecture, while at the same time moving as little data as possible over intranet networks (in particular, not downloading all the data once to compute the mean, then once again for the variance). Given the highly unstable variance formula used (a=2) in Hadoop, the solution is either to stabilize it, by removing a large value to all xi's before computing it. After all, with a=2, the variance is translation-invariant, so why not exploit this property? Even better, use a=1.25, which is much more robust against outliers. Or better, its translation-invariant version W, with a=1. But then you to have to write the code yourself or use some libraries.

Bayesian Models

Much of this discussion focused on a specific parameter a, with potential values between 1 and 2. This parameter could be assigned a prior distribution in a Bayesian framework.

• Here's one recent comment from one of our readers:

Online (single pass) algorithm have been studied extensively. For variance Welford's method is the most popular http://jonisalonen.com/2013/deriving-welfords-method-for-computing-...For higher statistical moments there is work by Timothy B. Terriberryhttp://people.xiph.org/~tterribe/notes/homs.html Higher moment are less stable statistical measures though.

• Interesting approach to teaching statistics: start with variance (the most visual metric), then move to covariance, and finally to the mean, using the concept of translation-independence to introduce the mean.  It also shows that such an approach can not be done with traditional theory (a=2). It only works if a<2.

• Vincent,

When I wrote "As for your proposal to fix the unstable textbook algorithm for numerical computation of variance, it is not a proper fix, and I hope that whoever does fix it has a solid grounding in numerical analysis and mathematical computation, which you apparently do not." I chose to do so in as non-inflammatory a manner as possible, consistent with the evidence provided in your article.  Despite my comment in the other article with link to numerically stable methods of computation, you proposed the fix "Given the highly unstable variance formula used (a=2) in Hadoop, the solution is either to stabilize it, by removing a large value to all xi's before computing it. After all, with a=2, the variance is translation-invariant, so why not exploit this property?"

Removing a large value to all xi's before computing it is NOT a proper fix to the numerically unstable textbook formula for computing variance.  If your "removed value"  happens to be the exact mean, then your proposal results in the same numerical accuracy as the two-pass method.  And if the removed value is close enough to the mean relative to the size of the data set, its variance, and the number of digits used in the computation, then it may be of adequate accuracy.  But in general, in a one-pass algorithm, you don't have advance knowledge of the mean, and your "guess" may not be good enough.  That is not a proper way to write a robust library function.  Using stable one-pass methods, including those which are massively parallelizable, does not require advance knowledge of the mean, and ensure that all summands are non-negative, thereby precluding cancellation errors.

• Mark,

My purpose here is to explain concepts in an intuitive, simple way. Sure a=1 is L^1, a=2 is L^2, I'm not sure mentioning this fact makes my article easier to understand. You can call my variance (defined in my theorem) a generalized variance if you think it is not a variance. Here the word variance is to be understood from a layman (not statistician) perspective: a metric that measures dispersion, spread, volatility or indeed variance (if you look at the definition in an English dictionary rather than in a statistics textbook).

I am aware of much stable algorithms when a=2, you can find them in my referenced article (the curse of big data, cited in my article). I attended advanced lectures on numerical analysis and optimization when I did my math studies, so it's not that I don't know about it. Heck, I might even know more than you, but here I'm trying to help people, not to prove that I know more than someone else (trying to look smarter than someone else is usually a sign of low self-confidence).

Evidently I am familiar with the Mahalanobis distance. The scale-invariant variance is useful when dealing for instance with data not expressed in units, where scale is not important. My approach to get a scale-invariant metric is to divide by a U-statistics, which is more easy to understand and straightforward and stable from a numerical point of view, unless all your data points are identical. There are always multiple approaches to solve a problem.

Vincent

• Vincent,

May I suggest you just think about this in terms of L^p norms? The square of the 2 norm of (X-mean(X)) corresponds to variance, which is sensitive to outliers.  Increasing p all the way to infinity is based on maximum absolute deviation, which is the most sensitive to outliers, and going in the direction from 2 down toward or equals 1 decreases sensitivity to outliers.  The L^1 norm is frequently used in robust statistics calculations.  It is hardly a new concept in statistics.  Regressions can be performed in the L^1 norm, or with an L^1 regularization term.

Your proposed "variances", except for the real variance, are not variances, but may be useful measures of data dispersion, and potentially useful for optimizing a fit.   Also note that variance is not scale invariant, Var(c*X) where c is a constant, is c^2*Var(x).  If you desire a scale-invariant metric related to variance, or indeed for n-dimensional Covariance matrices, then use Mahalanobis Distance squared, which is scale invariant and is x' *inv(C)*x, where x is the vector whose Mahalanobis Distance squared is being calculated relative to the covariance matrix C.  Take the sqrt to get Mahalanobis Distance, which is the scale invariant distance of x relative to its covariance.  In one dimension, Mahalanobis Distance of x is just abs(x)/(standard deviation of random variable).

As for your proposal to fix the unstable textbook algorithm for numerical computation of variance, it is not a proper fix, and I hope that whoever does fix it has a solid grounding in numerical analysis and mathematical computation, which you apparently do not.  As described in the paper http://webmail.cs.yale.edu/publications/techreports/tr222.pdf , which I referenced in the comments to http://www.analyticbridge.com/profiles/blogs/the-curse-of-big-data , there are stable one-pass algorithms which essentially use running estimates of the mean, with suitable adjustments such that the computation, when performed in exact arithmetic, is exactly equivalent to the two-pass or unstable one-pass formulas, and do not require a priori, possibly erroneous, estimates of the mean.  These formulas are readily massively parallellizabe by performing and aggregating pairs computations.