
eBook (ePub)
Thinking Through Statistics
About this book
Simply put, Thinking Through Statistics is a primer on how to maintain rigorous data standards in social science work, and one that makes a strong case for revising the way that we try to use statistics to support our theories. But don't let that daunt you. With clever examples and witty takeaways, John Levi Martin proves himself to be a most affable tour guide through these scholarly waters.
Martin argues that the task of social statistics isn't to estimate parameters, but to reject false theory. He illustrates common pitfalls that can keep researchers from doing just that using a combination of visualizations, re-analyses, and simulations. Thinking Through Statistics gives social science practitioners accessible insight into troves of wisdom that would normally have to be earned through arduous trial and error, and it does so with a lighthearted approach that ensures this field guide is anything but stodgy.
Information

Publisher: University of Chicago Press
Year: 2018
Print ISBN: 9780226567631, 9780226567464
eBook ISBN: 9780226567778

* 1 *
Introduction
Map: I’m going to start by identifying a meta-problem: most of what we learn in “statistics” class doesn’t solve our actual problems, which have to do with the fact that we don’t know what the true model is, not that we don’t know how best to fit it. This book can help with that. But first, we need to understand how we can use statistics to learn about the social world. I will draw on pragmatism (and falsificationism) to sketch out what I think is the most plausible justification for statistical practice.
Statistics and the Social Sciences
What Is Wrong with Statistics
Most of statistics is irrelevant for us. What we need are methods to help us adjudicate between substantively different claims about the world. In a very few cases, refining the estimates from one model, or from one class of models, is relevant to that undertaking. In most cases, it isn’t. Here’s an analogy: there’s a lot of criticism of medical science for using up a lot of resources (and a lot of monkeys and rabbits) trying to do something we know it can’t do: make us live forever. Why do researchers concentrate their attention on this impossible project, when there are so many more substantively important ones? I don’t deny that this might be where the money is, but still, there are all sorts of interesting biochemical questions in how you keep a ninety-nine-year-old millionaire spry. But if you look worldwide, and not only where the “effective demand” is, you note that the major medical problems, in contrast, are simple. They’re things like nutrition, exercise, and environmental hazards, things we’ve known about for years. But those things, simple though they are, are difficult to solve in practice. It’s a lot more fun to concentrate on complex problems for which we can imagine a magic bullet.
So too with statistical work. Almost all of the discipline of statistics is about getting the absolutely best estimates of parameters from true models (which I’ll call “bestimates”). Statisticians will always admit that they consider their job only this: to figure out how to estimate parameters given that we already know the most important things about the world, namely the model we should be using. (Yes, there is also work on model selection that I’ll get to later, and work on diagnostics for having the wrong model that I won’t be able to discuss.) Unfortunately, usually, if we knew the right model, we wouldn’t bother doing the statistics. The problem that we have isn’t getting the bestimates of parameters from true models; it’s keeping model results from misleading us. Because what we need to do is to propose ideas about the social world, and then have the world be able to tell us that we’re wrong . . . and have it do this more often when we are wrong than when we aren’t.
How do we do this? At a few points in this book, I’ll use a metaphor of carpentry. Getting truth from data is a craft, and you need to learn your craft. And one part of this is knowing when not to get fancy. If you were writing a book on how to make a chair, you wouldn’t tell someone to start right in with 280-grit, extra-fine sandpaper after sawing up the pieces of wood. You’d tell them to first use a rasp, then 80 grit, then 120, then 180, then 220, and so on. But most of our statistics books push you right to the 280. If you’ve got your piece in that kind of shape, be my guest. But if you’re staring at a pile of lumber, read on.
Many readers will object that it simply isn’t true that statisticians always assume that you have the right model. In fact, much of the excitement right now involves adopting methods for classes of models, some of which don’t even require that the true model be in the set you are examining (Burnham and Anderson 2004: 276). These approaches can be used to select a best model from a set, or to come up with a better estimate of a parameter across models, or to get a better estimate of parameter uncertainty given our model uncertainty. In sociology, this is going to be associated with Bayesian statistics, although there are also related information-theoretic approaches. The Bayesian notion starts from the idea that we are thinking about a range of models, and attempting to compare a posteriori to a priori probability distributions: before and after we look at the data.
Like almost everyone else, I’ve been enthusiastic about this work (take a look at Raftery 1985; Western 1996). But we have to bear in mind that even with these criteria, we are only looking at a teeny fraction of all possible models. (There are some Bayesian statistics that don’t require a set of models, but those don’t solve the problem I’m discussing here.) When we do model selection or model averaging, we usually have a fixed set of possible variables (closer to the order of 10 than that of 100), and we usually don’t even look at all possible combinations of variables. And we usually restrict ourselves to a single family of specifications (link functions and error distributions, in the old GLM [General Linear Models] lingo).
Now I don’t in any way mean to lessen the importance of this sort of work. And I think because of the ease of computerization, we’re going to see more and more such exhaustive searches through families of models. This should, I believe, increasingly be understood as “best practices,” and it can be done outside of a Bayesian framework to examine the robustness of our methods to other sorts of decisions. (For example, in an awesome recent paper, Frank et al. [2013] compared their preferred model to all possible permutations of all possible collapsings of certain variables to choose the best model.) But it doesn’t solve our basic problem, which is not being able to be sure we’re somewhere even close to the true model.
You might think that even if it doesn’t solve our biggest problems, at least it can’t hurt to have statisticians developing more rigorously defined estimates of model parameters. If we’re lucky enough to be close to the true model, then our estimates will be way better, and if we aren’t, no harm done. But in fact, it is often (though, happily, not invariably) the case that the approaches that are best for the perfect model can be worse for the wrong model.
When I was in graduate school, there was a lot of dumping on Ordinary Least Squares (OLS) regression. Almost never was it appropriate, we thought, and so it was, we concluded, the thing that thoughtless people would do; really, the smartest people wouldn’t be caught dead within miles of a linear model anyway. We loved to list the assumptions of regression analysis, thereby (we thought) demonstrating how implausible it was to believe the results.
I once had two motorcycles. One was a truly drop-dead-gorgeous 850 cc parallel-twin Norton Commando, the last of the kick-start-only big British twins, with separate motor, gearbox, and primary chain, and a roar that was like music. The other was a Honda CB 400 T2: boring, straight ahead, what at the time was jokingly called a UJM, a “Universal Japanese Motorcycle.” No character whatsoever.
I knew nearly every inch of that Commando, from stripping it down to replace parts, from poring over exploded parts diagrams to figure out what incredibly weird special wrench might be needed to get at some insignificant part. And my wife never worried about the danger of me having a somewhat antiquated motorcycle when we had young children. The worst that happened was that sometimes I’d scrape my knuckles on a particularly stuck nut. Because it basically stayed in the garage, sheltering a pool of oil on the floor, while I worked on it.
The Honda, on the other hand, was very boring. You just pressed a button, it started, you put it in gear, it went forward, until you got where you were going and turned it off.1 If I needed to make an impression, I’d fire up the Norton. But if I needed to be somewhere “right now,” I’d jump on the Honda. OLS regression is like that UJM. Easy to scorn, hard to appreciate, until you really need something to get done.
Proof by Anecdote
I find motorcycle metaphors pretty convincing. But if you don’t, here’s a simple example from some actual data, coming from the American National Election Study (ANES) from 1976. Let’s say that you were interested in consciousness raising at around this time, and you’re wondering whether the parties’ different stances on women’s issues made some sort of difference in voting behavior, so you look at congressional voting as a simple dichotomy, with 1 = Republican and 0 = Democrat. You’re interested in the gender difference primarily, but with the idea that education may also make a difference. So you start out with a normal OLS regression, and you get what is in table 1.1 as model 1 (the R code is R1.1).
Table 1.1. Proof by Anecdote

| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| Type of Model | OLS | LOGISTIC | HGLM | OLS | HGLM |
| GENDER | −.017 | −.071 | −.011 | −.039 | −.107 |
| | (.031) | (.130) | (.152) | (.033) | (.164) |
| EDUCATION | −.083*** | −.348*** | −.383*** | −.029 | −.166 |
| | (.018) | (.078) | (.093) | (.022) | (.108) |
| CONSTANT | .620 | .497 | .734 | .633 | .808 |
| RANDOM EFFECTS VARIANCE | | | 1.588 | | 1.552 |
| SECRET | | | | −.081*** | −.352** |
| | | | | (.021) | (.111) |
| R² | .021 | | | .028 | |
| AIC | | 1374.9 | 1264.5 | | 1112.6 |

N = 1088; Number of districts = 117; *** p < .001; ** p < .01
Gender isn’t significant (there goes that theory) but education is. That’s a finding! You write up a nice paper for submission, and show it to a statistician friend, very interested that those with more education are more likely to vote Republican. He smiles, and says that makes a lot of sense given that education would make people better able to understand the economic issues at hand (I think he is a Republican), but he tells you that you have made a major error. Your dependent variable is a dichotomy, and so you have run the wrong model. You need to instead use a logistic regression. He gives you the manual.
You go back, and re-run it as model 2. You know enough not to make the mistake of thinking that your coefficients from model 1 and model 2 are directly comparable. You note, to your satisfaction, that your basic findings are the same: the gender coefficient is around half its standard error, and the education coefficient around four times its standard error. So you add this to your paper, and go show it to an even more sophisticated statistician friend, and he says that your results make a lot of sense (I think he too is a Republican) but that you’ve made a methodological error. Actually, your cases are not statistically independent. ANES samples congressional districts,2 and persons in the same congressional district have a non-independent chance of getting in the sample. This is especially weighty because that means that they are voting for the same congressperson. “What do I do?” you ask, befuddled. He says, well, a robust standard error model could help with the non-independence of observations, but the “common congressperson” issue suggests that the best way is to add a random intercept at the district level.
So you take a minicourse on mixed models, and finally, you are able to fit what is in model 3: a hierarchical generalized linear model (HGLM). Your statistician friend (some friend!) was right: your coefficient for education has changed a little bit, and now its standard error is a bit bigger. But your results are all good! Pretty robust! But then you show it to me. It doesn’t make sense to me that education would increase Republican vote by making people smarter (strike one) or by helping you understand economic issues (strike two). I tell you that I bet the problem is that educated people tend to be richer, not smarter. Your problem is a misspecification one, not a statistical one.
I have the data and quickly run an OLS and toss income in the mix (model 4). The row marked “SECRET” is the income measure (I didn’t want you to guess where this is going, but you probably did anyway). Oh no! Now your education coefficient has been reduced to about a third of its original size, and it is no longer significant! It really looks like it’s income, and not education, that predicts voting. Your paper may need to go right in the trash. “Hold on!” you think. “Steady now. None of these numbers are right. I need to run a binary logistic HGLM model instead! That might be the ticket and save my finding!” So you do model 5. And it basically tells you the exact same thing.
At this point, you are seriously thinking about murdering your various statistician friends. But it’s not their fault. They did their jobs. But never send in a statistician to do a sociologist’s job. They’re only able to help you get bestimates of the right parameters. But you don’t know what they are. The lesson (I know you get it, but it needs to stick) is that it rarely makes any sense to spend a lot of time worrying about the bells and whistles, being like the fool mentioned by Denis Diderot, who was afraid of pissing in the ocean because he didn’t want to contribute to drowning someone. Worry about the omitted variables. That’s what’s really drowning you.
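The omitted-variable moral lends itself to a quick simulation. Below is a minimal pure-Python sketch, not the book's own analysis (Martin's code is in R, and this uses invented, simulated data rather than the actual ANES file): by construction, income drives the vote and education matters only through its correlation with income, yet regressing the vote on education alone still produces a healthy-looking education coefficient that collapses once income enters.

```python
import random

random.seed(42)

# Hypothetical data-generating process (all names and coefficients invented):
# income is correlated with education, and only income moves the vote.
n = 2000
education = [random.gauss(0, 1) for _ in range(n)]
income = [0.7 * e + random.gauss(0, 1) for e in education]      # correlated with education
vote = [1.0 if 0.3 * inc + random.gauss(0, 1) > 0 else 0.0      # 1 = Republican, 0 = Democrat
        for inc in income]

def ols(y, xs):
    """OLS coefficients via the normal equations (X'X)b = X'y.
    xs is a list of predictor columns; an intercept column is prepended."""
    cols = [[1.0] * len(y)] + xs
    k = len(cols)
    xtx = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(a * b for a, b in zip(c, y)) for c in cols]
    # Gaussian elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(xtx[r][i]))
        xtx[i], xtx[p] = xtx[p], xtx[i]
        xty[i], xty[p] = xty[p], xty[i]
        for r in range(i + 1, k):
            f = xtx[r][i] / xtx[i][i]
            for c in range(i, k):
                xtx[r][c] -= f * xtx[i][c]
            xty[r] -= f * xty[i]
    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (xty[i] - sum(xtx[i][j] * beta[j]
                                for j in range(i + 1, k))) / xtx[i][i]
    return beta  # [intercept, b_1, b_2, ...]

b_short = ols(vote, [education])          # income omitted: misspecified
b_long = ols(vote, [education, income])   # income included

print("education coefficient, income omitted: %.3f" % b_short[1])
print("education coefficient, income included: %.3f" % b_long[1])
```

The short regression's education slope is (by the usual omitted-variable algebra) roughly the income effect times the slope of income on education, which is the same pattern the SECRET row reveals in table 1.1: no amount of fancier estimation of the misspecified model would have exposed it.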
So movi...
Table of contents
- Cover
- Title Page
- Copyright Page
- Contents
- Preface
- Chapter 1: Introduction
- Chapter 2: Know Your Data
- Chapter 3: Selectivity
- Chapter 4: Misspecification and Control
- Chapter 5: Where Is the Variance?
- Chapter 6: Opportunity Knocks
- Chapter 7: Time and Space
- Chapter 8: When the World Knows More about the Processes than You Do
- Chapter 9: Too Good to Be True
- Conclusion
- References
- Index
- Footnotes