Subscribe to our Newsletter

Errors are the norm, not the exception, when working with data. By now, you’ve probably heard the statistic that 88% of spreadsheets contain errors. Since we cannot safely assume that any of the data we work with is error-free, our mission should be to find and tackle errors in the most efficient way possible. Everyone has their own way of catching errors, but here are a few quick strategies that have helped me over the years:

Gauge min and max values

For continuous variables, checking the minimum and maximum values for each column can give you a quick idea of whether your values are falling within the correct range. For example, with a variable like age, I want to make sure that my minimum value makes sense, and that I’m not maxing out somewhere in the 400’s. Checking min and max values is a great way to spot extra zeroes or missing digits, and fix them before they enter an analysis.

Look for missings

The easiest way to find missings is to perform a count, if you have this function available. If not, there are other ways to find missing values. Try sorting your columns (both ‘ascending’ and ‘descending’) to see if any missing values exist in your columns, or filtering your dataset such that you’re only looking at records with a missing value. While sometimes missing values are inevitably due to chance, it’s worth double-checking to see if there might be an underlying reason for missingness, and address them as best you can.

Check the values of categorical variables

Depending on your methodology and the number of people contributing to a database, there can be lots of room for error when entering data. One quick way to find these is to pull up all of the different categories that a categorical variable can take on. A quick example: recently I was looking at a field set to store U.S. states. The values in the upper part of the spreadsheet looked good – ‘NH’, ‘ME’, etc. – but when I got to the lower part, the entry method had switched to more full values – ‘New Hampshire’ and ‘Maine’. Correcting the categories was a quick fix, but it’s best to catch these errors as early on as possible.

Look at the ‘incidence rate’ of binary variables

If we think of a true binary variable as one made up of 1’s and 0’s, looking at its mean (or the incidence rate) will tell you the proportion of 1’s you have in your dataset. It’s worth double-checking this to make sure that your binary is set up correctly. One common mistake I’ve seen is to have 1’s and nulls, rather than 1’s and 0’s.  This becomes easy to spot because the “rate” of the binary variable will be equal to 1. The proportion of 1’s you get should make sense for the behavior you are trying to flag within your dataset.

Catching errors early on in the process is important so that early mistakes don’t influence decisions later on in the analysis. After all, our analysis is only as good as the data we use to create it. These are some of my most efficient mistake-catching techniques, but I’m sure there are others. So I’d love to know -what are your best strategies for finding errors in your data?

-Caitlin Garrett, Statistical Analyst

 > Original article here. <

E-mail me when people leave their comments –

You need to be a member of Hadoop360 to add comments!

Join Hadoop360

Featured Blog Posts - DSC