“You need to give yourself permission to be human.”
—Joyce Brothers
Data pitfalls. Anyone who has worked with data has fallen into them many, many times. I certainly have. It's as if we've used data to pave the way for a better future, but the road we've made is filled with craters we just don't seem to notice until we're at the bottom looking up. Sometimes we fall into them and don't even know it. Finding out about it much later can be quite humbling.
If you've worked with data before, you know the feeling. You're giving an important presentation, your data is insightful beyond belief, your charts and graphs are impeccable and Tufte-compliant, the build to your grand conclusion is unassailable and awe-inspiring. And then that one guy in the back of the room – the guy with folded arms and furrowed brow – waits until the very end to ask you if you're aware that the database you're working with is fundamentally flawed, pulling the rug right out from underneath you, and plunging you to the bottom of yet another data pitfall. It's enough to make a poor data geek sweat bullets.
The nature of data pitfalls is that we have a particular blindness to them. It makes sense if you think about it. The human race didn't need to work with billions of records of data in the form of zeros and ones until the second half of the last century. Just a couple of decades later, though, our era is characterized by an ever-increasing abundance of data and a growing array of incredibly powerful tools. In many ways, our brains just haven't quite caught up yet.
These data pitfalls don't doom our every endeavor, though. Far from it. We've accomplished great things in this new era of data. We've mapped the human genome and begun to understand the complexity of the human brain, how its neurons interact so as to stimulate cognition. We've charted vast galaxies out there and we've come to a better understanding of geological and meteorological patterns right here on our own planet. Even in the simpler endeavors of life like holiday shopping, recommendation engines on e-commerce sites have evolved to be incredibly helpful. Our successes with data are too numerous to list.
But our slipups with data are mounting as well. Misuse of data has led to great harm and loss. From the colossal failure of Wall Street quants and their models in the financial crisis of the previous decade to the parable of Google Flu Trends and its lesson in data-induced hubris,1 our use of data isn't always so successful. In fact, sometimes it's downright disastrous.
Why is that? Simply because we have a tendency to make certain kinds of mistakes time and time again. Noticing those mistakes early in the process is quite easy – just as long as it's someone else who's making them. When I'm the one committing the blunder, it seems I don't find out until that guy in the back of the room launches his zinger.
And like our good friend and colleague, we're all quite adept at spotting the screw-ups of other people, aren't we? I had an early lesson in this haphazard trade. In my seventh-grade science fair exhibition, a small group of us budding student scientists had a chance to walk around with the judges and explain our respective science fair projects while the other would-be blue-ribbon winners listened along. The judges, wanting to foster dialogue and inquisitiveness, encouraged the students to ask questions after each presentation. In spite of the noble intention behind this prompting, we basically just used the opportunity to poke holes in the methods and analysis of our competition. Kids can be cruel.
I don't do science fair projects anymore, unlike many other parents at my sons' schools, but I do work with data a lot. And I work with others who work with data a lot, too. In all of my data wrangling, data remixing, data analyzing, data visualizing, and data surmising, I've noticed that there are specific types of pitfalls that exist on the road to data paradise.
In fact, I've found that the pitfalls we fall into can be grouped into seven categories.
Seven Types of Data Pitfalls
Pitfall 1: Epistemic Errors: How We Think About Data
What can data tell us? Maybe even more importantly, what can't it tell us? Epistemology is the field of philosophy that deals with the theory of knowledge – what's a reasonable belief versus what is just opinion. We often approach data with the wrong mind-set and assumptions, leading to errors all along the way, regardless of what chart type we choose. Such errors include:
- Assuming that the data we are using is a perfect reflection of reality
- Forming conclusions about the future based on historical data only
- Seeking to use data to verify a previously held belief rather than to test it to see whether it's actually false
Avoiding epistemic errors and making sure we are thinking clearly about what's reasonable and what's unreasonable is an important foundation for successful data analysis.
Pitfall 2: Technical Traps: How We Process Data
Once we've decided to use data to help solve a particular problem, we have to gather it, store it, join it with other data sets, transform it, clean it up, and get it in the right shape. Doing so can result in:
- Dirty data with mismatching category levels and data entry typos
- Units of measurement or date fields that aren't consistent or compatible
- Nulls or duplicated rows that appear when we bring together disparate data sets and that skew analysis
These steps can be complex and messy, but accurate analysis depends on doing them right. Sometimes the truth contained within data gets “lost in translation,” and it's possible to plow ahead and make decisions without even knowing we're dealing with a seriously flawed data set.
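As a rough illustration of the joining problem, here is a minimal Python sketch. The tables, field names, and values are all hypothetical, invented only to show the effect: a lookup table with one accidentally repeated row quietly inflates a total, and a customer missing from the lookup comes back with a null.

```python
# Hypothetical data: one row per order, plus a region lookup table
# that accidentally lists customer "C2" twice and omits "C3".
orders = [
    {"customer": "C1", "amount": 100},
    {"customer": "C2", "amount": 250},
    {"customer": "C3", "amount": 50},
]
regions = [
    {"customer": "C1", "region": "East"},
    {"customer": "C2", "region": "West"},
    {"customer": "C2", "region": "West"},  # duplicate lookup row
]


def join_regions(order_rows, region_rows):
    """Naive left join on 'customer': one output row per matching pair,
    and a row with region=None when an order has no match."""
    joined = []
    for order in order_rows:
        matches = [r for r in region_rows if r["customer"] == order["customer"]]
        if not matches:
            joined.append({**order, "region": None})
        for match in matches:
            joined.append({**order, "region": match["region"]})
    return joined


joined = join_regions(orders, regions)

# Simple sanity checks: compare row counts and totals before and after.
total_before = sum(row["amount"] for row in orders)
total_after = sum(row["amount"] for row in joined)
null_regions = sum(1 for row in joined if row["region"] is None)

print("rows:", len(orders), "->", len(joined))
print("total amount:", total_before, "->", total_after)
print("rows with no region:", null_regions)
```

Comparing row counts and totals before and after a join is a cheap check, and it is often enough to reveal that something went wrong along the way.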
Pitfall 3: Mathematical Miscues: How We Calculate Data
Working with data almost always involves calculations – doing math with the quantitative data we have at our disposal:
- Summing at various levels of aggregation
- Calculating rates or ratios
- Working with proportions and percentages
- Dealing with different units
These are just a few examples of how we take data fields that exist and create new data fields out of them. Just like in grade school, it's very possible to get the math wrong. These mistakes can be quite costly – an error of this type led to the loss of a $125 million Mars orbiter in 1999.2 That was more like falling into a black hole than a pitfall.
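Two of these miscues can be sketched in a few lines of Python. The store figures and the impulse reading below are hypothetical, made up for illustration. The first part shows that averaging per-store rates is not the same as computing the rate from the totals. The second shows a unit conversion from pound-force seconds to newton-seconds, the kind of mismatch widely reported in the Mars orbiter loss.

```python
# Hypothetical per-store figures, invented for illustration.
stores = [
    {"name": "A", "visitors": 100, "purchases": 10},
    {"name": "B", "visitors": 300, "purchases": 90},
]

# Rate per store, then a simple (unweighted) average of those rates.
store_rates = [s["purchases"] / s["visitors"] for s in stores]
mean_of_rates = sum(store_rates) / len(store_rates)

# Rate computed from the totals, which weights each store by its visitors.
total_purchases = sum(s["purchases"] for s in stores)
total_visitors = sum(s["visitors"] for s in stores)
overall_rate = total_purchases / total_visitors

print(f"mean of store rates: {mean_of_rates:.3f}")
print(f"overall rate:        {overall_rate:.3f}")

# Units: a value recorded in pound-force seconds must be converted
# before it is used where newton-seconds are expected.
LBF_TO_NEWTON = 4.4482216152605  # newtons per pound-force
impulse_lbf_s = 10.0             # hypothetical reading in lbf*s
impulse_n_s = impulse_lbf_s * LBF_TO_NEWTON
print(f"{impulse_lbf_s} lbf*s = {impulse_n_s:.2f} N*s")
```

Neither number in the first part is wrong on its own terms; the miscue is reporting one when the question calls for the other.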
Pitfall 4: Statistical Slipups: How We Compare Data
“There are lies, damned lies, and statistics.” This saying usually implies that someone is fudging the numbers to mislead others, but we can just as often be lying to ourselves when it comes to statistics. Whether we're talking about descriptive or inferential statistics, the pitfalls abound:
- Are the measures of central tendency or variation that we're using leading us astray?
- Are the samples we're working with representative of the population we wish to study?
- Are the means of comparison we're using valid and statistically sound?
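To make the first of these questions concrete, here is a minimal Python sketch using made-up salary figures (in thousands) with a single extreme value. It is only an illustration of how two common measures of central tendency can tell different stories about the same data.

```python
import statistics

# Hypothetical salaries in thousands; the last value is an outlier.
salaries = [40, 42, 45, 48, 50, 52, 400]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

# How many people actually earn less than the "average" salary?
below_mean = sum(1 for s in salaries if s < mean_salary)

print(f"mean:   {mean_salary:.1f}")
print(f"median: {median_salary}")
print(f"{below_mean} of {len(salaries)} salaries fall below the mean")
```
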
These pitfalls are numerous and particularly hard to spot on the horizon, because they deal with a way of thinki...