Suggested readings
•Angrist, J. D. and Pischke, J. S. (2008). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press, Princeton, NJ.
•Dunning, T. (2012). Natural Experiments in the Social Sciences: A Design-Based Approach. Cambridge University Press, Cambridge.
•Lewis-Beck, C. and Lewis-Beck, M. (2016). Applied Regression: An Introduction. SAGE, Thousand Oaks, CA.
•Wooldridge, J. M. (2016). Introductory Econometrics: A Modern Approach. Cengage Learning, Boston, MA, 6th edition.
Packages you need to install
•tidyverse (Wickham, 2019), politicalds (Urdinez and Cruz, 2020), skimr (Waring et al., 2020), car (Fox et al., 2020), ggcorrplot (Kassambara, 2019), texreg (Leifeld, 2020), prediction (Leeper, 2019), lmtest (Hothorn et al., 2019), sandwich (Zeileis and Lumley, 2019), miceadds (Robitzsch et al., 2020).
5.0.1 Introduction
In this chapter, we will learn how to estimate linear regressions. In the simplest (bivariate) case, the function is a straight line described by two parameters: the intercept and the slope. When we move to multivariate analysis, with several explanatory variables, the estimation gets more complex. We will cover how to interpret the different coefficients, how to create regression tables, how to visualize predicted values, and how to evaluate the Ordinary Least Squares (OLS) assumptions, so that you can assess how well your models fit.
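To make the idea of "two parameters" concrete, here is a minimal sketch in base R. It uses simulated data rather than the chapter's dataset: the variable names `x` and `y`, the sample size, and the true values (intercept 2, slope 0.5) are all assumptions chosen for illustration.

```r
# Simulated data (hypothetical): true intercept 2 and true slope 0.5
set.seed(123)
x <- runif(200, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(200, sd = 1)

# Fit the bivariate linear model y = b0 + b1 * x by OLS
fit <- lm(y ~ x)

# The two estimated parameters: intercept and slope
coef(fit)

# The same estimates computed from the closed-form OLS formulas
b1 <- cov(x, y) / var(x)          # slope
b0 <- mean(y) - b1 * mean(x)      # intercept
c(intercept = b0, slope = b1)
```

Both approaches should return the same pair of numbers, close to (but not exactly) the true values used in the simulation.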
5.1 OLS in R
In this chapter, the dataset we will work with is a merge of two datasets constructed by Evelyne Huber and John D. Stephens. These datasets are:
•Latin America Welfare Dataset, 1960-2014 (Evelyne Huber and John D. Stephens, Latin American Welfare Dataset, 1960-2014, University of North Carolina at Chapel Hill, 2014.): it contains variables on Welfare States in all Latin American and Caribbean countries between 1960 and 2014.
•Latin America and Caribbean Political Data Set, 1945-2012 (Evelyne Huber and John D. Stephens, Latin America and Caribbean Political Dataset, 1945-2012, University of North Carolina at Chapel Hill, 2012): it contains political variables for all Latin American and Caribbean countries between 1945 and 2012.
The resulting dataset contains 1074 observations for 25 countries between 1970 and 2012 (data from the 1960s was excluded since it contained many missing values).
First, we load the tidyverse package:
library(tidyverse)
We will import the dataset from the book’s package:
library(politicalds)
data("welfare")
Now, the dataset has been loaded into our R session.
In this chapter, we will use the article by Huber et al. (2006) as our running example. In it, the authors estimate the determinants of inequality in Latin America and the Caribbean. Working from this article allows us to estimate a model with multiple control variables that have already been identified as relevant for explaining the variation in inequality in the region. Thus, the dependent variable we are interested in explaining is income inequality in Latin American and Caribbean countries, operationalized with the Gini index (gini). The independent variables that we will incorporate into the model are the following:
•Sectoral dualism (the coexistence of a traditional low-productivity sector and a modern high-productivity sector) - sector_dualism
•GDP - gdp
•Foreign Direct Investment (net income as % of GDP) - foreign_inv
•Ethnic diversity (dummy variable coded as 1 when at least 20% but no more than 80% of the population is ethnically diverse) - ethnic_diversity
•Democracy (type of regime) - regime_type
•Education expenditure (as a percentage of GDP) - education_budget
•Health expenditure (as a percentage of GDP) - health_budget
•Social security expenditure (as a percentage of GDP) - socialsec_budget
•Legislative balance - legislative_bal
Throughout this chapter, we will try to estimate the effect of education expenditure on the level of inequality in Latin American and Caribbean countries. Thus, our independent variable of interest will be education_budget.
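The variables listed above can be collected into an R model formula. This is only a sketch of one possible full specification: the object names `model_formula` and `fit` are illustrative, and the chapter may build its models up step by step or use a different set of controls. The model is fitted only if the politicalds package happens to be installed.

```r
# Dependent variable (gini) and regressors, using the variable names listed above
model_formula <- gini ~ education_budget + sector_dualism + gdp +
  foreign_inv + ethnic_diversity + regime_type + health_budget +
  socialsec_budget + legislative_bal

# Variables the formula refers to (the response comes first)
all.vars(model_formula)

# Fit by OLS only if the book's data package is available (assumption:
# the dataset's columns carry exactly the names used in the list above)
if (requireNamespace("politicalds", quietly = TRUE)) {
  data("welfare", package = "politicalds")
  fit <- lm(model_formula, data = welfare)
  print(coef(fit))
}
```

Storing the formula in an object makes it easy to reuse the same specification later, for example when comparing models with and without particular controls.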
5.1.1 Descriptive Statistics
Before estimating a linear model with Ordinary Least Squares (OLS), it is recommended that you first examine the distribution of the variables you are interested in: the dependent variable y (also called the response variable) and the independent variable of interest x (also called the explanatory variable or regressor). In general, our models will have, besides the independent variable of interest, other independent (or explanatory) variables that we will call “controls”, since t...