Part II
Statistical Modeling
Chapter 5
Linear Models
5.1 Introduction
5.2 From t-test to Linear Models
5.3 Simple and Multiple Linear Regression Models
5.3.1 The Least Squares
5.3.2 Regression with One Predictor
5.3.3 Multiple Regression
5.3.4 Interaction
5.3.5 Residuals and Model Assessment
5.3.6 Categorical Predictors
5.3.7 Collinearity and the Finnish Lakes Example
5.4 General Considerations in Building a Predictive Model
5.5 Uncertainty in Model Predictions
5.5.1 Example: Uncertainty in Water Quality Measurements
5.6 Two-Way ANOVA
5.6.1 ANOVA as a Linear Model
5.6.2 More Than One Categorical Predictor
5.6.3 Interaction
5.7 Bibliography Notes
5.8 Exercises
5.1 Introduction
In Chapter 4, we defined a model as a probability distribution. Once a model is proposed, we make inferences about the unknown model parameters based on data. In a one-sample t-test problem, we are interested in learning about the mean of a normal distribution:
$$y_i \sim N(\mu, \sigma^2) \tag{5.1}$$
It is often convenient to think of the data yi in terms of the mean and a remainder:
$$y_i = \mu + \varepsilon_i \tag{5.2}$$
That is, we can split an observed value into two parts, the mean (μ) and the remainder (εi). Mathematically, the above two expressions are equivalent. The remainder is the difference between the observed value and the mean. These remainders are often known as residuals, and they have a normal distribution with mean 0 and standard deviation σ (εi ∼ N(0, σ²)).
In a two-sample t-test problem, we are interested in the difference between the means of two populations or groups. We present the problem as follows:
$$y_{1i} \sim N(\mu_1, \sigma^2), \qquad y_{2j} \sim N(\mu_2, \sigma^2) \tag{5.3}$$
and we are interested in the difference between the two means, δ = μ2 − μ1. We can present the problem in the format of equation (5.2) by combining the data from the two groups into a data frame with a second column to indicate the group association (or “treatment”). A mathematically convenient construction of the treatment column is to use a column of 0’s (for y1i) and 1’s (for y2j). The data frame then consists of two columns, the data column (y) and the treatment (or more generally, group) column (g). Each row represents an observed data point and its group association (0 for group 1 and 1 for group 2). The two-sample t-test problem in equation (5.3) can be expressed in the form of equation (5.4):
$$y_j = \mu_1 + \delta g_j + \varepsilon_j \tag{5.4}$$
where j is the index for the combined data and gj is the group association of the jth observation. For data from group 1 (gj = 0), equation (5.4) reduces to yj = μ1 + εj, and for data from group 2 (gj = 1), the model is yj = μ1 + δ + εj = μ2 + εj.
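The dummy-variable construction can be checked numerically. The sketch below is only an illustration: the data values are made up, the helper `simple_least_squares` is a hypothetical name, and it assumes the parameters are estimated by least squares (the method listed for Section 5.3.1). Under those assumptions, the fitted intercept equals the group 1 sample mean and the fitted slope equals the difference between the two sample means.

```python
from statistics import mean

# Hypothetical measurements for two groups (illustrative values, not from the text).
group1 = [4.1, 5.0, 4.6, 5.3]
group2 = [6.2, 5.8, 6.9, 6.5, 6.1]

# Combine the data into one response column y and one dummy column g
# (0 for group 1, 1 for group 2), as described for equation (5.4).
y = group1 + group2
g = [0] * len(group1) + [1] * len(group2)


def simple_least_squares(x, y):
    """Return (intercept, slope) of the least squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# The intercept plays the role of mu_1 and the slope the role of delta.
mu1_hat, delta_hat = simple_least_squares(g, y)

print("intercept:", mu1_hat, "group 1 mean:", mean(group1))
print("slope:", delta_hat, "difference in means:", mean(group2) - mean(group1))
```

The two printed pairs agree up to floating-point rounding, which is the sense in which the two-sample problem is a special case of a linear model.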
The group indicator g is often known as a “dummy variable.” A dummy variable takes value 0 or 1. When we have data from more than two groups, we will use p − 1 dummy variables to represent the p groups. For example, if we have three groups in an ANOVA problem (e.g., Exercise 7 in Chapter 4), we combine observed data from all three groups into one column. The first dummy variable g1 takes value 1 if the observation is from group 2 and 0 otherwise. The second dummy variable g2 takes value 1 if the observation is from group 3 and 0 otherwise. The ANOVA problem can now be expressed as a linear model problem:
$$y_i = \mu_1 + \delta_1 g_{1i} + \delta_2 g_{2i} + \varepsilon_i \tag{5.5}$$
For data from group 1, the model is reduced to yi = μ1 + εi. For data from group 2, the model is yi = μ1 + δ1 + εi, and for group 3, yi = μ1 + δ2 + εi.
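As a small illustration of this coding, the sketch below builds the two dummy variables for three groups and evaluates the mean function μ1 + δ1 g1 + δ2 g2 for each observation. The data and the helper `group_mean` are hypothetical, and the parameters are simply set to the sample group mean and the differences from it, an assumption made here for illustration rather than an estimation method taken from the text.

```python
from statistics import mean

# Hypothetical observations with group labels 1, 2, 3 (illustrative only).
y = [2.0, 2.4, 1.9, 3.1, 3.5, 3.3, 5.0, 4.6, 4.8]
group = [1, 1, 1, 2, 2, 2, 3, 3, 3]

# p - 1 = 2 dummy variables for p = 3 groups; group 1 is the baseline.
g1 = [1 if k == 2 else 0 for k in group]  # 1 if the observation is from group 2
g2 = [1 if k == 3 else 0 for k in group]  # 1 if the observation is from group 3


def group_mean(k):
    """Sample mean of the observations belonging to group k."""
    return mean([yi for yi, gi in zip(y, group) if gi == k])


# Parameters expressed as a baseline mean and two differences from it.
mu1 = group_mean(1)
delta1 = group_mean(2) - mu1
delta2 = group_mean(3) - mu1

# Mean function: mu1 + delta1 * g1 + delta2 * g2 for each observation.
fitted = [mu1 + delta1 * a + delta2 * b for a, b in zip(g1, g2)]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for k in (1, 2, 3):
    print("group", k, "mean:", group_mean(k))
```

Each fitted value equals the sample mean of the group the observation belongs to, so the three-group comparison is fully described by one baseline mean and two differences.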
By representing the t-test and ANOVA problems in terms of a “statistical model,” I want to convey two main messages. First, we use different models for different problems. Second, statistical inference is mostly about the relationship among variables. Likewise, a main goal in science is the understanding of the relationship among important variables. The relationship, whether described qualitatively or quantitatively, is a model.
In a statistical problem, we define a model as the probability distribution of the variable of interest (the response variable). A probability distribution has a mean (or location) parameter and a parameter representing spread (e.g., standard deviation). When a distribution model is specified, we want to understand how the mean of the distribution varies as a function of other variables (predictor variables). In equation (5.2), the mean is a constant (no predictor variable). In equations (5.4) and (5.5), the mean varies by group (g is the predictor variable). For a response variable with a normal distribution, the standard deviation can be estimated from the residuals. As a result, we can often express a statistical model as yi = f(x, θ) + εi, where x represents predictor variable(s), θ represents unknown parameter(s) to be estimated, and εi is a normal random variable with mean 0 and an unknown standard deviation (σ). In equation (5.5), x represents both g1 and g2, and θ includes μ1, δ1, and δ2. The function f(x, θ) is an example of a mean function of a statistical model: a function that defines the relationship between the mean parameter of the response variable distribution and a number of predictors. Using equation (5.5), we can define a statistical modeling problem as follows:
• Model formulation – response variable is a normal random variable with different group means and a constant standard deviation (e.g., equation (5.5)).
• Parameter estimation – how to estimate unknown parameters (e.g., μ1, δ1, δ2, σ in equation (5.5...