1 Statistical modelling: An overview
1.1 Introduction
In defining a model, the dictionary talks of making a representation, an imitation, an image, a copy or a paradigm.
We talk of models and use them in many different contexts. Perhaps the term is used too readily or too loosely, devaluing its meaning and power. Nevertheless, in the built environment it is common to build a physical mock-up (a model) of how the construction will look. Engineers build prototypes (models) of a new machine, for example a car. Planners build simulation models to represent traffic flows as a way of assessing new road layouts.
In their objective, models in the social sciences are no different. What the theoretician or researcher is attempting to produce is a representation of how phenomena or concepts (as measured by their variables) relate to each other. In other words, the social scientist is attempting to understand the complex social world and represent the essential inter-relationships in a simplified but meaningful way.
What distinguishes a statistical model is that it is constructed from empirical quantitative data and uses statistical theory to guide its development. The details of this will be given later, but at this stage it is important to note that the theory or the research question guides the construction of the model. Statistical modelling is a technique to aid understanding; it is not an end in itself.
1.2 Why model?
Statistical modelling is an important analytical tool as it enables social researchers to consider in a coherent and unified procedure complex inter-relationships between social phenomena and to isolate and make judgements about the separate effects of each. More specifically, in social science statistical modelling is undertaken for one of four main reasons: (1) to improve understanding of causality and the development of theory, (2) to make predictions, (3) to assess the effect of different characteristics, (4) to reduce the dimensionality of data.
To aid the development of theory
Constructing models can help develop theoretical perspectives or test the claims of competing perspectives. Relatively recently attention has been given to the part lifestyle plays in crime victimisation and routine activity theories have been promulgated.
These theoretical perspectives can be informed by developing models examining how aspects of people's lives, such as where they work, what recreational activities they take part in, how they travel, what time they travel and who with, relate to their experience of being a victim of crime.
To make predictions
Many models, particularly in the economic sphere, are constructed with the purpose of making forecasts or predictions about the future. The ability to anticipate any changes in unemployment or interest rates or the cost of living offers decision makers the opportunity to take any necessary remedial action. Similarly, statistical models can be used to estimate the relative risks of certain outcomes, for example, the risk that an offender will reoffend within a particular time period. Knowledge of the risk can be an aid to decision making; in this example it may inform decisions about when the offender is to be released from prison or whether or not the offender should be placed on a particular programme.
To assess the effect of different characteristics
Often the aim of a social research project is to evaluate the effect of a particular characteristic on an outcome. For example, are women offenders treated differently from male offenders in terms of the sentence they are awarded at court? Are women discriminated against in the workplace and paid less than men? In answer to the first question, women are generally awarded lesser sentences than men, but can it be inferred that women are treated differently? Other attributes contribute to the sentence imposed: the seriousness of the offence, the age of the offender and the previous criminal history of the offender. Compared with men, women offenders generally commit less serious offences and have less extensive criminal careers, so it is not surprising that on average they receive lesser sentences. To answer the important question of whether women are treated differently from men, account has to be taken of the offence committed, age and criminal record, in order that we may treat 'like with like'. A statistical model enables us to assess the effect of gender on sentencing after adjusting for the other important characteristics known to influence sentencing decisions.
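The idea of adjusting for other characteristics can be illustrated with a minimal sketch in Python. The data below are entirely synthetic and constructed for illustration only (they come from no real sentencing study): sentence length is made to depend on offence seriousness alone, and the women in the sample happen to have committed less serious offences. The helper `fit_ols` and all variable names are hypothetical.

```python
def fit_ols(X, y):
    """Least-squares coefficients, solving (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(row[r] * row[c] for row in X) for c in range(k)]
        + [sum(row[r] * yi for row, yi in zip(X, y))]
        for r in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[r][k] / A[r][r] for r in range(k)]


# Synthetic data: 8 men (female = 0) followed by 8 women (female = 1).
female = [0] * 8 + [1] * 8
seriousness = [1, 2, 3, 4, 5, 5, 4, 3, 1, 1, 2, 2, 3, 1, 2, 3]
# By construction, sentence (in months) depends on seriousness only, not on gender.
sentence = [6 + 4 * s for s in seriousness]

# Unadjusted model: sentence = b0 + b1 * female
unadjusted = fit_ols([[1, f] for f in female], sentence)

# Adjusted model: sentence = b0 + b1 * female + b2 * seriousness
adjusted = fit_ols([[1, f, s] for f, s in zip(female, seriousness)], sentence)

print("Unadjusted gender coefficient:", round(unadjusted[1], 6))
print("Adjusted gender coefficient:  ", round(adjusted[1], 6))
```

In this constructed example the unadjusted coefficient shows women receiving sentences six months shorter on average, whereas the coefficient after adjusting for seriousness is essentially zero. Real data would of course include random variation and further characteristics such as age and criminal record.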
To reduce the dimensionality of data and to uncover latent variables
A situation may exist where many variables are highly inter-correlated, for example a child's marks on various school tests, or a person's answers to a large number of similar attitudinal questions. Leaving until later the technical problems encountered by including all these similar variables in a model, one might, in any case, prefer a summary measure of them. An average (which is itself a linear model) could be calculated and used to represent the set of variables or a more sophisticated model, which weighted each of the variables differently, could be constructed.
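To see why an average can be regarded as a linear model, the following sketch writes a summary measure as a weighted sum of the variables. The marks and the unequal weights are hypothetical values chosen for illustration; in practice the weights of a more sophisticated model would be estimated from data rather than assumed.

```python
# Hypothetical marks for one child on four related school tests.
marks = [62, 70, 58, 66]


def linear_summary(values, weights):
    """Weighted sum of the values: w1*x1 + w2*x2 + ... + wp*xp."""
    return sum(w * x for w, x in zip(weights, values))


# The ordinary average is the special case where every weight is 1/p.
p = len(marks)
equal_weights = [1 / p] * p
average = linear_summary(marks, equal_weights)

# Illustrative unequal weights (assumed, not estimated from data).
unequal_weights = [0.1, 0.4, 0.2, 0.3]
weighted_score = linear_summary(marks, unequal_weights)

print(average, weighted_score)
```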
In other applications the purpose may be to uncover latent variables, which are underlying social constructs (such as social deprivation, quality of life or fear of crime) but for which no direct measurement scale exists. In order to undertake research and perform analysis on these concepts, a measurement scale has to be constructed from manifest variables, that is, from variables that can be measured. Latent variables are discussed further in Chapter 10.
Before concluding this section it should be emphasised that the four purposes outlined above are not themselves mutually exclusive. Models that are firmly grounded in theory are likely to achieve better predictions and will be better placed to isolate the relative effects of different characteristics. That is, the better the model represents true relationships between the underlying phenomena the better able it is to achieve any of the objectives set out above.
1.3 The general linear statistical model
A statistical model takes the form of a mathematical equation in which the concepts of interest (as measured by their variables) are hypothesised to be related to each other in some way. The statistical model of interest in this book is known as the general linear statistical model and is defined in Equation (1.1).
$$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_p x_{pi} + e_i \qquad (1.1)$$
where:
y is the dependent variable, yi is the value of the dependent variable for the ith subject
x1i, x2i, …, xpi are any number (from 1 to p) of explanatory/independent variables.
The value of each explanatory variable will vary between subjects; xpi is the value of xp for the ith subject.
b1, b2, …, bp are the coefficients, or parameters, of the corresponding explanatory variables, x1, x2, …, xp.
e is the error term or residual, ei is the residual for the ith subject.
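As a rough illustration of how the terms defined above fit together, the sketch below assumes the usual additive form of Equation (1.1), namely yi = b0 + b1x1i + … + bpxpi + ei. The coefficient values and data are hypothetical and are simply supplied rather than estimated; the function name `linear_predictor` is also an assumption made for this example.

```python
# Hypothetical coefficients for a model with p = 2 explanatory variables.
b0 = 10.0          # constant term
b = [2.0, -0.5]    # b1, b2

# Hypothetical data for three subjects: (x1i, x2i) and the observed yi.
x = [(1.0, 4.0), (3.0, 2.0), (5.0, 6.0)]
y = [10.5, 14.0, 18.0]


def linear_predictor(b0, b, xi):
    """b0 + b1*x1i + ... + bp*xpi for one subject."""
    return b0 + sum(bj * xji for bj, xji in zip(b, xi))


fitted = [linear_predictor(b0, b, xi) for xi in x]
# The residual ei is whatever the systematic part leaves unexplained.
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for i, (yi, fi, ei) in enumerate(zip(y, fitted, residuals), start=1):
    # Each observed value decomposes as fitted value plus residual.
    print(f"subject {i}: y = {yi}, fitted = {fi}, e = {ei}")
```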
Each variable is explained in turn.
Dependent variable, also known as response variable or outcome variable
The dependent variable is the variable of prime interest in our research, that is, the variable which we wish to explain or predict. The dependent variable is regarded as a random variable, which is free to vary in response (hence response variable or outcome variable) to the explanatory variables. Note that in Equation (1.1) there is only one dependent variable although we may have alternative definitions and measures of it. (For example, if the focus of the study was the remuneration people received, the dependent variable could be annual salary or hourly rate of pay. Remuneration could also include pensions and/or interest on savings and investments, etc.) Although there is only one dependent variable per equation, we will also consider in Chapter 11 the situation where there is more than one equation (each with its own y) which are related or need to be analysed simultaneously.
SPSS and Stata consistently use the term dependent variable in all their models and it is the convention in texts and computer program manuals to denote the dependent variable by the letter y.
Explanatory variables, also known as predictor variables, covariates, factors or independent variables
In most real-life applications there is more than one, and potentially many, explanatory variables. As the name implies, these variables are associated with the dependent variable and explain or predict values of the dependent variable.
Although in common usage, the term independent variable is not favoured by statisticians as it does not accurately convey the nature of the relationship between such a variable and the dependent variable. As will be seen later, these variables are far from independent in statistical models but are often highly correlated with the dependent variable and each other. I prefer the term explanatory variable as it better describes the nature of the relationship between it and the dependent variable. However, I concede that the use of independent variable is pervasive and I will continue to use it interchangeably with explanatory variable.
Explanatory (independent) variables can be continuous or categorical (these terms are defined in section 2.4). Continuous explanatory (independent) variables are often called covariates, to signify that they vary in some relationship with the dependent variable. Categorical explanatory (independent) variables are also called factors or indicators and the individual categories of the factor are sometimes called levels.
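One common way of representing a factor numerically is as a set of 0/1 indicator variables, with one level set aside as a reference category. The sketch below illustrates that idea only; the variable `offence_type`, its levels and the choice of the first level (alphabetically) as the reference are all assumptions made for the example, not a description of how any particular package codes factors.

```python
# Hypothetical categorical explanatory variable (a factor) with three levels.
offence_type = ["theft", "violence", "drugs", "theft", "drugs"]

levels = sorted(set(offence_type))      # distinct categories of the factor
reference = levels[0]                   # level chosen as the reference category
other_levels = [lev for lev in levels if lev != reference]

# One 0/1 indicator column per non-reference level.
indicators = [
    [1 if value == lev else 0 for lev in other_levels]
    for value in offence_type
]

for value, row in zip(offence_type, indicators):
    # Subjects in the reference category have 0 on every indicator.
    print(value, row)
```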
Stata consistently uses the term independent variable throughout. However, SPSS is not so consistent. Rather confusingly, it uses independent variable in the regression component but covariate in the logistic regression component, regardless of whether the variable is continuous or categorical. The terminology changes again for the multinomial logistic regression component, where covariates and factors are separately identified.
In most texts and computer program manuals it is the convention to denote explanatory variables by the letter x, and if there is more than one to number them sequentially with a subscript: x1, x2, x3 and so on.
Whether a variable is considered to be a dependent variable or an independent variable depends wholly on the context, that is, the nature of the research being undertaken. In one study a variable may be considered as the dependent variable but in another as an explanatory variable. For example, in a study of school pupils' achievements, some measure of educational attainment might be taken as the dependent variable. However, in a study of adults' occupations or salary, school attainment might well be considered as an explanatory variable. Furthermore, in any one study, a variable may be considered as both a dependent variable and an explanatory variable at different stages of the analysis, especially when developing causal models (which are the subject of Chapter 11).
Coefficients or parameters
Associated with each explanatory (independent) variable xp is a coefficient or parameter bp. bp indicates the magnitude by which y changes as xp changes after taking into account (or adjusting for) the other explanatory variables included in the model. To be more precise, for a one-unit change in xp, y will change by an amount bp after adjusting for the contribution of the other explanatory variables. Thus bp is the partial effect of xp given the other xs in the model. It is important to understand that the value of bp may well change from model to model depending on which other variables are included. In addition, bp will also depend on the units of measurement chosen. For example, if y represented a person's weight and x his or her height, b would take a different value depending on whether weight was measured in pounds or kilograms and height in feet or centimetres. However, whatever units of measurement were chosen, b would represent the same underlying relationship: the change in weight associated with a one-unit change in height.
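The dependence of b on the units of measurement can be checked with a small sketch. The heights and weights below are hypothetical, and the function `slope` is an illustrative helper that computes the least-squares slope for a model with a single explanatory variable; only the unit conversion factors (30.48 cm per foot, 0.45359237 kg per pound) are fixed quantities.

```python
def slope(x, y):
    """Least-squares slope b for the simple model y = b0 + b*x + e."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical heights (cm) and weights (kg) for five people.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 77, 84]

CM_PER_FOOT = 30.48
KG_PER_POUND = 0.45359237

# The same people measured in feet and pounds.
height_ft = [h / CM_PER_FOOT for h in height_cm]
weight_lb = [w / KG_PER_POUND for w in weight_kg]

b_metric = slope(height_cm, weight_kg)      # kg per cm
b_imperial = slope(height_ft, weight_lb)    # pounds per foot

# Converting the imperial coefficient back to kg per cm recovers b_metric.
b_back = b_imperial * KG_PER_POUND / CM_PER_FOOT
print(b_metric, b_imperial, b_back)
```

The two coefficients differ numerically, yet converting one into the units of the other gives the same value, which is the sense in which b describes the same relationship whatever units are used.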
b0 is a constant term and represents the value of y when all the explanatory variables equal zero, that is, when no explanatory variables are included in the model. It will be seen that the constant value, b0, is itself often meaningless; its only importance is to help determine the values of bp. (The exception is where the explanatory variables are categorical when b0 represents the effect on y of being in the reference category; ...