1
Introduction
Christof Wolf and Henning Best
In recent years, the social sciences have made tremendous progress in quantitative methodology and data analysis. The classical linear model, while remaining an important foundation for more advanced methods, has been increasingly complemented by specialized techniques. Major improvements include the widespread use of non-linear models, advances in multilevel modeling and Bayesian estimation, the diffusion of longitudinal analyses and, more recently, the focus on novel methods for causal inference.
The interested reader can choose from a number of excellent textbooks on a wide range of topics: from general econometrics texts such as Wooldridge (2009, 2010) or Greene (2012), through volumes on regression and Bayesian methods (Gelman et al., 2003; Fox, 2008; Gelman and Hill, 2007), multilevel modeling (Hox, 2010), non-linear models for limited dependent variables (Long, 1997; Train, 2009) and event history techniques (Blossfeld et al., 2007), to trend-setting textbooks on causal inference (Pearl, 2009; Angrist and Pischke, 2009; Morgan and Winship, 2007) and specialized handbooks such as the one edited by Morgan (2013).
Having so many excellent monographs on matters of regression analysis and causal inference makes it difficult for scholars and researchers to obtain an overview of these different approaches. Our aim with this Sage Handbook of Regression Analysis and Causal Inference is to give readers an accessible outline of a broad set of regression techniques and methods for causal inference, written by international experts in the field. Many students and researchers in the social sciences will find this handbook useful, as it provides an overview of a range of different methods: ordinary least squares and logistic regression, multilevel and panel regression, and time-series cross-section models, as well as methods for causal inference, for example instrumental variables regression, regression discontinuity designs or propensity score matching. Hence, this volume covers the most commonly used techniques for the statistical analysis of cross-sectional and longitudinal data as well as a number of newer and advanced regression models.
Each chapter provides an accessible yet rigorous presentation of a statistical method. With few exceptions, the contributions follow a common structure, making it easy for readers to navigate through the text. Each chapter begins with an easily accessible, non-technical introduction to the respective method, providing a basic understanding of the method's logic, scope and unique features. The introduction is followed by a presentation of the statistical foundations of the method. To give readers a better understanding of how a particular method can be applied, the next step consists of a comprehensive discussion of the method's application in an example analysis based on publicly available real-world data. Whenever possible, authors used the European Social Survey (see http://www.europeansocialsurvey.org/). Readers can download Stata or R code from the companion website to this book and reproduce the analyses (see https://study.sagepub.com/bestwolf). The example is followed by a discussion of frequently made errors and caveats of the method and its applications. Each chapter ends with a brief annotated list of references for further reading.
The book is divided into three major blocks: two chapters on estimation techniques, eight chapters on regression models for cross-sectional data, and six chapters focusing on causal inference and the analysis of longitudinal data.
The volume opens with two chapters on different estimation techniques used in regression analysis. In the first of these, Martin Elff discusses ordinary least squares and maximum likelihood methods for estimating the parameters of linear regression and other statistical models. One of the caveats discussed by Elff is that maximum likelihood estimation can become very difficult if sample sizes are small. A technique particularly suited to this situation is Bayesian estimation, which Susumu Shikano presents in the following chapter. After an introduction to the general idea of Bayesian analysis, Shikano shows how the coefficients of a regression model are estimated in the Bayesian framework.
The second block of chapters in this volume deals with regression analysis for cross-sectional data. Linear regression, a powerful tool often termed the workhorse of the social sciences, is introduced by Christof Wolf and Henning Best. Sound applications can only be expected if the assumptions underlying this model are understood. These are discussed in detail in the next chapter by Bart Meuleman, Geert Loosveldt and Viktor Emonds, who also present the tools used to diagnose deviations from the assumptions. In the following chapter Henning Lohmann shows how non-linear and non-additive effects can be incorporated into linear regression models. He discusses interaction effects, polynomials and splines in great detail and demonstrates how flexible multiple linear regression is. Joop Hox and Leoniek Wijngaards-de Meij's contribution focuses on regression models for hierarchical, multilevel data. These models are suitable if the units of observation are 'nested' within higher-level units (e.g. students in schools, residents in neighborhoods or employees in firms). The authors discuss these models for both metric and binary dependent variables. In-depth coverage of regression models for binary outcomes can be found in the next chapter, on logistic regression, by Henning Best and Christof Wolf. This is directly followed by a presentation of regression models for multinomial and ordinal variables authored by Scott Long. In both chapters dealing with non-metric outcome variables, the authors emphasize that interpreting the results of these kinds of models is anything but straightforward. Graphical displays are an indispensable tool for meeting the challenge of correctly interpreting regression results; these are presented and discussed in the subsequent chapter by Gerrit Bauer. The block on regression analysis for cross-sectional data closes with a contribution by Steven Heeringa, Brady West, and Patricia Berglund, who address regression modeling for complex sample survey data.
The third block of chapters is devoted to methods for longitudinal data analysis and causal inference that are based on a counterfactual model of causality. Markus Gangl opens this part with a contribution on matching estimators for treatment effects. The chapter discusses the analytical goals and mathematical foundations that underlie the use of matching estimators for causal inference. As the name of the method suggests, two types of units, the 'treated' and the 'nontreated', are matched based on some common characteristic. An alternative method for causal inference is introduced in the chapter by Christopher Muller, Christopher Winship, and Stephen Morgan, who provide a non-technical introduction to instrumental variables regression. This kind of regression helps in dealing with endogeneity by using an additional instrumental variable that is correlated with the causal factor of interest but otherwise exogenous. Another important method, the regression discontinuity design, is presented by Thomas Lemieux and David Lee. They present the conceptual framework behind this research design and draw a parallel between regression discontinuity and randomized experiments. The next chapter, by Josef Brüderl and Volker Ludwig, offers a description of fixed-effects panel regression, which they compare to random-effects models and models including a lagged dependent variable. In addition to the basic model of fixed-effects panel regression, the authors discuss a more advanced variant of this approach that allows for heterogeneous change, that is, a model with individual slopes. Another form of longitudinal data is event history data, which provides information on the sequence of different states occupied by each unit of analysis and the timing of changes among these states. Hans-Peter Blossfeld and Gwendolin Blossfeld present regression models to analyze such data structures. For them, event history models are closely linked to an understanding of causation as a generative process. The book closes with a contribution by Jessica Fortin-Rittberger on models for time-series cross-section data. These models are particularly useful if we have data on a comparatively small number of units for a comparatively large number of time points, a data structure that often arises in comparative political science.
We hope that readers will find this Sage Handbook useful for their daily practice in social science teaching and research. We are confident that the book will help students and researchers in conducting quantitative social research and contribute to the further diffusion of important methods for causal inference. If the book helps advance the methodologically sound analysis of society, the time invested will have been well spent.
REFERENCES
Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton: Princeton University Press.
Blossfeld, H.-P., Golsch, K., and Rohwer, G. (2007). Event History Analysis with Stata. Mahwah: Erlbaum.
Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. Thousand Oaks: Sage.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis (2nd ed.). Boca Raton: Chapman and Hall/CRC.
Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.
Greene, W. H. (2012). Econometric Analysis. New York: Prentice Hall.
Hox, J. J. (2010). Multilevel Analysis: Techniques and Applications. New York: Routledge.
Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks: Sage.
Morgan, S. L., (Ed.) (2013). Handbook of Causal Analysis for Social Research. New York: Springer.
Morgan, S. L. and Winship, C. (2007). Counterfactuals and Causal Inference. New York: Cambridge University Press.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge: Cambridge University Press.
Train, K. (2009). Discrete Choice Methods with Simulation. Cambridge: Cambridge University Press.
Wooldridge, J. M. (2009). Introductory Econometrics: A Modern Approach. Mason: Thomson/South-Western.
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. Cambridge: MIT Press.
PART I
Estimation and Inference
2
Estimation techniques: Ordinary least squares and maximum likelihood
Martin Elff
INTRODUCTION
A major task in regression analysis, and in much of data analysis in the social sciences in general, is the construction of a model that best represents (1) the substantive assumptions and hypotheses a researcher may entertain and (2) auxiliary information or assumptions about the way the data under analysis are generated. To complete this task of model specification successfully, a researcher will need a fair knowledge of a variety of statistical models and their assumptions. Introducing these is one of the main purposes of this volume. In contrast to most other chapters, the present one presumes that all questions of model specification have already been addressed and focuses on the theoretical foundations of the step that comes thereafter: the estimation of model parameters.
While model specification sometimes appears to be something of an art, estimation clearly is a technique, the application of which researchers often gladly delegate to their computers. For scholars intent on gaining a full understanding of the research process, however, it is important to know the foundations of estimation. The purpose of this chapter is therefore to introduce these foundations, to provide an understanding of what it means to estimate parameters and to give some idea of what a 'good' estimator is.
The task of model specification usually leads us to a probability model of the process by which the data under analysis are generated. That is, we assume that each piece of data that we have observed, could have observed or may observe in the future occurs with a particular probability. In other words, our data are observations of random variables. Roughly speaking, a random variable is a set of numbers, called the sample space, together with probabilities assigned to them or to subsets of the sample space. The set of rules by which probabilities are assigned to numbers or sets of numbers is the probability distribution of the random variable. For example, if we roll a die, then the number it shows is an observation of a random variable whose sample space is the set of numbers 1 to 6 and whose distribution, for a fair die, assigns a probability of 1/6 to each of these numbers.
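The die example can be made concrete with a few lines of R; the following is a minimal illustrative sketch (not part of the book's companion code): it draws repeatedly from the sample space with equal probabilities and shows that the relative frequencies of the draws approximate the probabilities assigned by the distribution.

# A fair die as a random variable: sample space 1..6, each with probability 1/6
sample_space <- 1:6
probs <- rep(1 / 6, times = 6)

set.seed(42)  # fix the random number generator so the draws are reproducible
draws <- sample(sample_space, size = 10000, replace = TRUE, prob = probs)

# Relative frequencies of the observed draws; by the law of large numbers
# these come close to the theoretical probabilities of 1/6 = 0.167
round(table(draws) / length(draws), 3)

Each call to sample() here plays the role of one observation of the random variable; the probability distribution is encoded in the prob argument, while the observed data are the realized draws.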