Why Worry About Data Quality?


If you just assume your data are OK and all you’ll need, you’re probably going to have a lot to worry about.

A good data analyst may seem like a pretty neurotic person to everyone else. Why do they worry so darn much about everything? Why do they keep asking me all these questions?

Because when the client spots something funny in your report, you’ll rush to us in a panic! The role of a statistician can in some ways be likened to that of a relief pitcher in baseball. Statisticians treat worrying as a kind of upfront investment: worry now so we won’t have to worry later, and so we’ll have less to worry about overall.

Though data science and, more recently, statistics have been called sexy and in various ways hyped as if they were exotic occupations reminiscent of Indiana Jones, the reality is less spectacular. There is data science and there is data science fiction. As in real archaeology, a considerable amount of grunt work is part of the job. A more positive analogy, perhaps, is warming up and stretching before vigorous exercise – you may not like it, but skip it and you’ll feel the pain.

Means, standard deviations and the results of most statistical procedures – even exotic ones – can be seriously distorted by poor-quality data, and costly or even disastrous decisions may be made as a result.

Never assume your data are OK and are all the data you’ll need.

Whenever possible, the person doing the data analysis – especially when advanced statistical analysis is planned – should be involved in research design and data collection. This minimizes the risk that the data they’ll need won’t be available or will require an inordinate amount of time to set up for analysis.

Data cleaning, recoding and restructuring, including merging and appending data files, are part of any data analysis project. This is also a great way for the analyst to get to know the data – even when they’ve been part of the team designing the research or collecting the data, there will always be surprises in the data. Exploratory data analysis is an essential part of data analysis.

Recoding is done case by case. For example, a variable (data field) may have too many categories (levels), which complicates analysis; postal code and occupation are two cases in which categories may be combined to create a more meaningful and usable variable. New variables may also be created, for instance by using principal components analysis or variable clustering to reduce the number of variables or account for their underlying dimensionality. Another example is when decision makers use classification schemes based on several customer characteristics (e.g., Recency, Frequency, Monetary Value).
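To make the recoding idea concrete, here is a minimal sketch in Python that collapses a high-cardinality occupation field into broader groups via a lookup table. The mapping and category names below are illustrative, not a real coding scheme:

```python
# Collapse detailed occupation labels into broader groups.
# This mapping is made up for illustration only.
occupation_groups = {
    "software engineer": "professional",
    "accountant": "professional",
    "retail clerk": "sales/service",
    "waiter": "sales/service",
    "carpenter": "trades",
}

records = ["accountant", "waiter", "plumber"]

# Unmapped categories fall into a catch-all "other" level.
recoded = [occupation_groups.get(occ, "other") for occ in records]
```

The catch-all level is itself a judgment call – lumping rare categories into “other” simplifies analysis but can hide meaningful variation.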

Missing data are always a cause for concern with real-life data. As with anomaly detection (discussed next), what to do is seldom straightforward. The use of Full Information Maximum Likelihood (FIML) estimation and Multiple Imputation (MI) is common with the computing power we now have, but that is not the end of the story. Applied Missing Data Analysis by Craig K. Enders and Handbook of Missing Data Methodology (Molenberghs et al.) are two comprehensive sources if you or your statistician would like a more detailed look.
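The core idea behind multiple imputation – impute several times, analyze each completed dataset, then pool the results – can be sketched in a few lines of Python. This toy version uses a naive hot-deck draw from the observed values; real MI (as covered in Enders’ book) uses a proper imputation model, and the data here are made up:

```python
import random
import statistics

# Toy data with missing values encoded as None (illustrative only).
data = [4.2, None, 5.1, 3.8, None, 4.9, 5.4]
observed = [x for x in data if x is not None]

def impute_once(rng):
    # Naive hot-deck draw: fill each hole with a random observed value.
    return [x if x is not None else rng.choice(observed) for x in data]

rng = random.Random(0)

# Analyze each of m = 20 completed datasets, then pool the estimates.
estimates = [statistics.mean(impute_once(rng)) for _ in range(20)]
pooled = statistics.mean(estimates)       # pooled point estimate
between = statistics.variance(estimates)  # between-imputation spread
```

The between-imputation variance is what single imputation throws away – it reflects the extra uncertainty introduced by the missing values themselves.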

Anomaly detection is a full-time occupation for some people working in security or fraud prevention fields. For most data analysts, it’s also an important part of our work. Detecting outliers – extreme data points – can be done in many ways (see Outlier Analysis by Charu C. Aggarwal), and most data contain unusual cases. These may distort the analysis and should be deleted, or they may have trivial impact and should be retained – and there are many options beyond those two.
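As one concrete example among the many methods Aggarwal surveys, here is a minimal Python sketch of Tukey’s boxplot rule: flag points more than 1.5 interquartile ranges beyond the quartiles. The data and the 1.5 multiplier are illustrative – the threshold is a judgment call:

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 95, 12, 10, 11]

# Quartiles via the default (exclusive) method; q2 is the median.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Tukey's rule: fences at 1.5 IQRs beyond the quartiles.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]
```

Whether a flagged point is deleted, retained or handled some other way is an analytic decision, not an automatic one.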

Conversely, someone attempting fraud will make efforts not to stand out, and cases that appear very typical may actually be problematic. Outliers tell us a lot about our data and our models – some cases may be outliers in one model but not another. Residual analysis is part of anomaly detection as well as a critical part of model diagnostics.

Statistical Process Control (SPC) is a big area most of you have probably heard of. It uses control charts and other statistical methods to ensure product and service quality – and the product may well be data. It can be applied ad hoc by statisticians once the data have arrived, but some aspects of it should be part of quality control in any organization or organizational unit producing data. Statistical Quality Control (Montgomery) and Introduction to Statistical Process Control (Qiu) are two books covering these topics in depth.
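A Shewhart-style individuals chart is one of the simplest SPC tools: estimate a center line and control limits from an in-control baseline, then flag new observations that fall outside them. Here is a minimal Python sketch with made-up measurements; real charts typically estimate sigma from moving ranges rather than the plain standard deviation used here:

```python
import statistics

# Phase I: baseline data assumed to be in control (illustrative values).
baseline = [9.9, 10.1, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.1]
center = statistics.mean(baseline)
sigma = statistics.stdev(baseline)

# Three-sigma control limits.
ucl, lcl = center + 3 * sigma, center - 3 * sigma

# Phase II: monitor new observations against the fixed limits.
new_points = [10.0, 9.9, 11.5, 10.1]
signals = [x for x in new_points if x > ucl or x < lcl]
```

A signal does not prove the process is broken – it flags a point for investigation, which is exactly the worry-now-rather-than-later investment described earlier.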

Total Survey Error is a framework developed by survey methodologists that has utility in many applications of data science. Briefly, it decomposes errors in surveys into sampling and non-sampling error. A short entry in Wikipedia introduces the basic idea, and a longer article in Public Opinion Quarterly gets into its philosophy as well as the nitty-gritty.

This has been a snapshot of a very big, complex and important topic. Garbage In, Garbage Out (GIGO) is a popular phrase, which can also be interpreted as: please don’t load the bases on your statistician.

I hope you’ve found this interesting and useful!


Kevin Gray is President of Cannon Gray, a marketing science and analytics consultancy. He has more than 30 years’ experience in marketing research with Nielsen, Kantar, McCann and TIAA-CREF.