1 Introduction
- 1.1 The problem
- 1.2 The purpose of research
- 1.3 What causes problems in the research process?
- 1.4 About this book
- 1.5 The most important sections in this book
- 1.6 Quantitative vs. qualitative research
- 1.7 Stata and R code
- 1.8 Chapter summary
If you don’t read the newspaper you are uninformed; if you do read the newspaper, you are misinformed.
– Mark Twain
I’m on a mission, and I need your help.
My mission is to make this a better world.
I want society to make better choices and decisions. Sometimes, these decisions should be based on what is morally and ethically the right thing to do. I cannot help on that front. But, at other times, decisions need to be based on how they affect people and institutions. And, in that realm, sometimes statistical analysis can speak to best practices. Statistical analysis sometimes can tell us what health practices and interventions are most beneficial to people, what factors lead to better economic outcomes for people (individually or collectively), what factors contribute to the academic achievement of children, how to make government functions more cost-efficient (which could reduce the tax burden on society), and much more.
All that said, statistical analysis is sometimes unable to speak to many of these issues. This may be because the data are not adequate in terms of having sufficient observations and sufficient accuracy. Or, it could be because the statistical analysis was flawed, or because there are no solutions to certain biases. So, we need to be careful in how we interpret and use the results from statistical analyses so that we draw correct and prudent conclusions, without overreaching or being swayed by our preconceived notions and biases.
My goal with this book is not to answer the important questions on how to make the world better. In fact, I will address some research issues that some of you will care nothing about, such as whether discrimination is a self-fulfilling prophecy in France, whether a Medicaid expansion in Oregon improved health outcomes for participants, or whether the hot hand in basketball is real or just a figment of our imaginations. But, these research issues that I use will serve as useful applications to learn the concepts and tools of regression analysis.
So, my goal is to teach you the tools needed to address important issues. This book is designed to teach you how to better conduct, interpret, and scrutinize statistical analyses. From this, I hope you will help others make better decisions that will help towards making this world, eventually, a better place.
1.1 The problem
Jay Leno, in one of his Tonight Show monologues several years ago, mentioned a study that found that 50% of all academic research is wrong. His punchline: there’s a 50% chance this study itself is wrong.
The study Leno referred to may actually understate the true percentage of studies that are inaccurate. The major causes of all these errors in research are likely faulty research designs and improper interpretations of the results. These accuracy issues bring into doubt the value of academic research.
Most quantitative academic research, particularly in the social sciences, business, and medicine, relies on regression analyses. The primary objective of regressions is to quantify cause-effect relationships. These cause-effect relationships are part of the knowledge that should guide society in developing good public policies and good strategies for conducting business, educating people, promoting health and general welfare, and more. Regressions are useful for estimating such relationships because they are able to “hold constant” other factors that may confound the cause-effect relationship in question. That is, regressions, if done well, can rule out reasons for two variables to be related other than the causal one.
Here are some examples of how regressions can be used to estimate the causal effect of one factor (or a set of factors) on some outcome:
- How does some new cancer drug affect the probability of a patient surviving 10 years after diagnosis?
- How does a parental divorce affect children’s test scores?
- What factors make teachers more effective?
- What encourages people to save more for retirement?
- What factors contribute to religious extremism and violence?
- How does parental cell phone use affect children’s safety?
- How does oat-bran consumption affect bad cholesterol levels?
- Do vaccines affect the probability of a child becoming autistic?
- How much does one more year of schooling increase a person’s earnings?
- Does smiling when dropping my shirt off at the cleaners affect the probability that my shirt will be ready by Thursday?
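The “holding constant” idea behind questions like these can be illustrated with a small simulation in R (one of the two languages used for code in this book). This is a minimal sketch with made-up data and numbers, not an example from any real study:

```r
# A minimal sketch, using simulated data, of how a regression can
# "hold constant" a confounding factor. All numbers are hypothetical.
set.seed(1)
n <- 1000
z <- rnorm(n)                    # confounder (e.g., family income)
x <- 0.8 * z + rnorm(n)          # "treatment," which z also drives
y <- 0 * x + 1.5 * z + rnorm(n)  # true effect of x on y is zero

coef(lm(y ~ x))["x"]      # substantially positive: x picks up z's effect
coef(lm(y ~ x + z))["x"]  # near zero once z is held constant
```

Without z in the model, the estimated coefficient on x is biased upward because x and y are both driven by z; adding z as a control recovers an estimate near the true effect of zero.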
A regression is a remarkable tool in its ability to measure how certain variables move together, while holding certain factors constant. A natural human reaction is to be mesmerized by things people do not understand, such as how regressions can produce these numbers. And so, in the roughly 10 times that I have used regression results in briefings to somewhat-high-level officials at the Department of Defense (mostly as a junior researcher, with a senior researcher tagging along to make sure I didn’t say anything dumb), the people I was briefing never asked me whether there were any empirical issues with the regression analysis I had used or how confident I was with the findings. Most of the time, based on the leading official’s response to the research, they would act as if I had just given them the absolute truth on an important problem based on these “magic boxes” called “regressions.” Unfortunately, I was caught up in the excitement of the positive response from these officials, and I wasn’t as forthright as I should have been about the potential pitfalls (and uncertainty) in my findings. And so, I usually let them believe the magic.
But, regressions are not magic boxes. The inaccuracy Leno joked about is real, as there are many pitfalls of regression analysis. And, from what I have seen in research, at conferences, from journal referees, etc., many researchers (most of whom have Ph.D.s) have a limited understanding of these issues. And so, published quantitative research is often rife with severely biased estimates and erroneous interpretations and conclusions.
How bad is it? In the medical-research field, where incorrect research has the potential to result in lost lives, John Ioannidis has called out the entire field on its poor research methods and records. The Greek doctor/medical-researcher was featured in a 2010 article in The Atlantic (Freedman 2010). Ioannidis and his team of researchers have demonstrated that a large portion of the existing medical research is wrong, misleading, or highly exaggerated. He attributes it to several parts of the research process: bias in the way that research questions were being posed, how studies and empirical models were set up (e.g., establishing the proper control group), what patients were recruited for the studies, how results were presented and portrayed, and how journals chose what to publish.
Along these lines, the magazine The Economist had a much-needed op-ed and accompanying article in 2013 on how inaccurate research has become.1 Among the highlights they note are:
- Amgen, a biotech company, could replicate only 6 of 53 “landmark” cancer-research studies;
- Bayer, a pharmaceutical company, was able to replicate just one-quarter of 67 important health studies;
- Studies with “negative results,” meaning insignificant estimated effects of the treatment variables, constituted 30% of all studies in 1990 and just 14% today, suggesting that important results showing no evidence that a treatment has an effect are being suppressed – and/or extra efforts are being made to make results statistically significant.
All of this highlights an interesting irony. The potential for valuable research has perhaps never been greater, with more data available on many important outcomes (such as student test scores, human DNA, health, logistics, consumer behavior, and ball and player movements in sports), yet the reputation of academic research has perhaps never been so low.
This is fixable!
This book is meant to effectively train the next generation of quantitative researchers.
1.2 The purpose of research
To understand where research goes wrong, we first have to understand the overall purpose of research. We conduct research to improve knowledge, which often involves trying to get us closer to understanding cause-effect and other empirical relationships. To demonstrate, let’s start with the highly contentious issue of global warming. You may have some belief on the probability that the following statement is true:
Human activity is contributing to global warming.
And, hopefully, that probability of yours lies somewhere between 0.3% and 99.7% – that is, you may have your beliefs, but you recognize that you probably are not an expert on the topic, and so there is a possibility that you are wrong. I’m guessing that most people would be below 10% or above 90% (or even 5% and 95%). But, for the sake of argument, let’s say that you have a subjective probability of 45% that the statement is true.
Suppose a study comes out that has new evidence that humans are causing global warming. This may shift your probability upwards. If the new research were reported on cable news channel MSNBC (which leans toward the liberal side of politics) and you tended to watch MSNBC, then let’s say that it would shift your probability up by 7 percentage points (to 52%). If you tended to watch Fox News (a more conservative channel) instead, then the news from MSNBC may shift your probability up by some negligible amount, say 0.2 percentage points (up to 45.2%). But, ideally, the amount that your subjective probability of the statement above would shift upwards should depend on:
- How the study contrasts with prior research on the issue;
- The validity and extensiveness of the prior research;
- The extent to which any viable alternative explanations to the current findings can be ruled out – i.e., how valid the methods of the study are.
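This belief-updating logic can be made concrete with Bayes’ rule. The R sketch below is only an illustration – the likelihood numbers are hypothetical, chosen to show how a prior of 45% shifts when a study’s evidence is twice as likely if the statement is true as if it is false:

```r
# Hedged sketch: Bayes' rule for updating a subjective probability.
# p_if_true / p_if_false are hypothetical likelihoods of observing the
# study's evidence when the statement is true / false.
update_belief <- function(prior, p_if_true, p_if_false) {
  (p_if_true * prior) /
    (p_if_true * prior + p_if_false * (1 - prior))
}

prior <- 0.45                    # subjective probability from the text
posterior <- update_belief(prior, p_if_true = 0.6, p_if_false = 0.3)
round(posterior, 3)              # 0.621: the belief shifts upward
```

A more convincing study corresponds to a larger ratio of p_if_true to p_if_false, and thus a bigger shift in belief – which is what the three criteria above are assessing.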
The same logic of shifting beliefs applies to regression analysis. People have some prior beliefs about an issue – say, whether class size is important for student achievement. Suppose that, using regression analysis, a new study finds no evidence that class size has an effect on student achievement. This finding should not necessarily be taken as concrete evidence for that side of the issue. Rather, the evidence has to be judged based on the strength of the study relative to that of other studies – that is, on the three criteria listed above. People would then shift their subjective probability appropriately. The more convincing the analysis, the more it should swing a person’s belief in the direction of the study’s conclusions.
This is where it is up to researchers, the media, and the public to properly scrutinize research to assess how convincing it is. As I will describe below, you cannot always rely on the peer-review process that determines what research gets published in journals.
1.3 What causes problems in the research process?
The only real fiction is non-fiction.
– Mark Totten
Where do I begin?
Well, let’s discuss some structural issues first, which lead to misguided incentives for researchers.
One major problem in research is publication bias (discussed in more detail in Section 13.2), which results from the combination of the pressure among academics to publish and journals seeking articles with interesting results that will sell to readers, get publicity, and get more citations from subsequent research. All of this improves the standing of the journal. But, it leads to published research being biased towards results with significant (statistically and meaningfully) effects – so, studies finding statistically insignificant effects tend not to be disseminated. Given the greater likelihood of getting published with significant and interesting results, researchers at times will not spend time attempting to publish research that has insignificant results. In addition, research can be easily finagled, as adding or cutting a few semi-consequential variables can sometimes push the coefficient estimate on a key treatment variable over the threshold of “significance,” which could make ...