CHAPTER ONE
OVERVIEW OF MULTIVARIATE AND REGRESSION METHODS
1.1 INTRODUCTION
More information about human functioning has accrued in the past five decades than in the preceding five millennia, and many of those recent gains can be attributed to the application of multivariate and regression statistics. The scientific experimentation that proliferated during the 19th century was a remarkable advance over previous centuries, but the advent of the computer in the mid-20th century opened the way for the widespread use of complex analytic methods that exponentially increased the pace of discovery. Multivariate and regression methods of data analysis have completely transformed the bio-behavioral and social sciences.
Multivariate and regression statistics provide several essential tools for scientific inquiry. They allow for detailed descriptions of data, and they identify patterns impossible to discern otherwise. They allow for empirical testing of complex theoretical propositions. They enable enhanced prediction of events, from disease onset to likelihood of remission. Stated simply, multivariate statistics can be applied to a broad variety of research questions about the human condition.
Given the widespread application and utility of multivariate and regression methods, this book covers many of the statistical methods commonly used in a broad range of bio-behavioral and social sciences, such as psychology, business, biology, medicine, education, and sociology. In these disciplines, mathematics is not typically a student's primary focus. Thus, the approach of the book is conceptual. This does not mean that the mathematical account of the methods is compromised, just that the mathematical developments are employed in the service of the conceptual basis for each method. The math is presented in an accessible form, called the simplest case. The idea is that for each method we seek a demonstration using the simplest case we can find that still has all the key attributes of the full-blown cases of actual practice. We provide exercises that enable students to learn the simplified case thoroughly, after which the focus is expanded to more realistic cases.
We have learned that it is possible to make these complex mathematical concepts accessible and enjoyable, even to those who may see themselves as nonmathematical. It is possible with this simplest-case approach to teach the underlying conceptual basis so thoroughly that some students can perform many multivariate and regression analyses on simple "student-accommodating" data sets from memory, without referring to written formulas. This kind of deep conceptual acquaintance brings the method up close for the student, so that the meaning of the analytical results becomes clearer.
This first chapter defines multivariate data analysis methods and introduces the fundamental concepts. It also outlines and explains the structure of the remaining chapters in the book. All analysis method chapters follow a common format. The main body of each chapter starts with an example of the method, usually from an article in a prominent journal. It then explains the rationale for the method and gives complete but simplified numerical demonstrations of its various expressions using simplest-case data. Each chapter ends with a section entitled Study Questions, consisting of three types: essay questions, calculation questions, and data-analysis questions. A complete set of answers to all of these questions is available electronically at https://mvgraphics.byu.edu.
1.2 MULTIVARIATE METHODS AS AN EXTENSION OF FAMILIAR UNIVARIATE METHODS
The term multivariate denotes the analysis of multiple dependent variables. If the data set has only one dependent variable, it is called univariate. In elementary statistics, you were probably introduced to the two-way analysis of variance (ANOVA) and learned that any ANOVA that is two-way or higher is referred to as a factorial model. Factorial in this instance means having multiple independent variables or factors. The advantage of a factorial ANOVA is that it enables one to examine the interaction between the independent variables in the effects they exert upon the dependent variable.
Multivariate models have a similar advantage, but applied to the multiple dependent variables rather than independent variables. Multivariate methods enable one to deal with the covariance among the dependent variables in a way that is analogous to the way factorial ANOVA enables one to deal with interaction.
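To make this parallel concrete, here is a brief sketch in Python (not from the book; the statsmodels library and all data and variable names are illustrative assumptions) that fits a factorial ANOVA with an interaction term alongside a one-way MANOVA on two dependent variables analyzed jointly:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.multivariate.manova import MANOVA

# Hypothetical toy data: two categorical factors, two dependent variables.
rng = np.random.default_rng(seed=1)
df = pd.DataFrame({
    "drug":    np.repeat(["A", "B"], 20),
    "therapy": np.tile(np.repeat(["yes", "no"], 10), 2),
    "anxiety": rng.normal(50, 10, 40),
    "mood":    rng.normal(100, 15, 40),
})

# Factorial ANOVA: one DV, two factors, plus their interaction term.
anova_fit = ols("anxiety ~ C(drug) * C(therapy)", data=df).fit()
print(sm.stats.anova_lm(anova_fit, typ=2))

# One-way MANOVA: two DVs tested jointly, respecting their covariance.
manova_fit = MANOVA.from_formula("anxiety + mood ~ C(drug)", data=df)
print(manova_fit.mv_test())
```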
Fortunately, many of the multivariate methods are straightforward extensions of the corresponding univariate methods (Table 1.1). This means that your considerable investment up to this point in understanding univariate statistics will go a long way toward helping you to understand multivariate statistics. (This is particularly true of Chapters 7, 8, and 9, where the t-tests are extended to multivariate t-tests, and various ANOVA models are extended to corresponding multivariate ANOVA [MANOVA] models.) Indeed, one can think of multivariate statistics in a simplified way as the same univariate methods that you already know (t-test, ANOVA, correlation/regression, etc.) rewritten in matrix algebra, with the matrices extended to include multiple dependent variables.
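As one standard instance of this correspondence (the notation here is generic, not drawn from a later chapter): squaring the one-sample t statistic and replacing the sample mean and variance with a mean vector and covariance matrix yields Hotelling's T2,

$$
t^2 = \frac{n(\bar{x} - \mu_0)^2}{s^2}
\qquad\Longrightarrow\qquad
T^2 = n\,(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\,\mathbf{S}^{-1}\,(\bar{\mathbf{x}} - \boldsymbol{\mu}_0),
$$

where $\mathbf{S}$ is the sample covariance matrix of the $p$ dependent variables. When $p = 1$, $T^2$ reduces to $t^2$.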
Table 1.1 Overview of Univariate and Multivariate Statistical Methods

| Predictor variables | Univariate methods | Multivariate methods |
| --- | --- | --- |
| No predictor variable | (none) | Factor analysis; principal component analysis; cluster analysis |
| One categorical predictor variable, two levels | t tests; z tests | Hotelling's T2 tests; profile analysis using Hotelling's T2 |
| One categorical predictor variable, three or more levels | ANOVA, one-way models | MANOVA, one-way models |
| Two or more categorical predictor variables | ANOVA, factorial models | MANOVA, factorial models |
| Categorical predictor(s) with one or more quantitative control variables | ANCOVA, one-way or factorial models | MANCOVA, one-way or factorial models |
| One quantitative predictor variable | Bivariate regression | Multivariate regression |
| Two or more quantitative predictor variables | Multiple regression | Multivariate multiple regression; canonical correlation* |
Matrix algebra is a tool for working efficiently with data matrices. Many of the formulas you learned in elementary statistics (variance, covariance, correlation coefficients, ANOVA, etc.) can be expressed much more compactly with matrix algebra. Matrix multiplication in particular is closely connected to the calculation of variances and covariances: it directly produces sums of squares and sums of cross products of the input vectors. It is as if matrix algebra were invented specifically for the calculation of covariance structures. Chapter 3 provides an introduction to the fundamentals of matrix algebra. Readers unfamiliar with matrix algebra should therefore read Chapter 3 carefully before the chapters that follow it, since all of them build upon it.
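As a small outside illustration of this connection (assuming NumPy, which the book itself does not use): premultiplying a column-centered data matrix by its own transpose yields the matrix of sums of squares and cross products (SSCP), and dividing by n - 1 gives the covariance matrix.

```python
import numpy as np

# Toy data matrix: 5 observations (rows) on 3 variables (columns).
X = np.array([
    [4.0, 2.0, 6.0],
    [5.0, 3.0, 7.0],
    [3.0, 1.0, 5.0],
    [6.0, 5.0, 9.0],
    [2.0, 4.0, 8.0],
])

n = X.shape[0]
Xc = X - X.mean(axis=0)   # center each column (deviation scores)

sscp = Xc.T @ Xc          # sums of squares (diagonal), cross products (off-diagonal)
cov = sscp / (n - 1)      # sample covariance matrix

# Agrees with NumPy's built-in covariance (rowvar=False: columns are variables).
assert np.allclose(cov, np.cov(X, rowvar=False))
print(cov)
```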
The second prerequisite for understanding this book is a knowledge of elementary statistical methods: the normal distribution, the binomial distribution, confidence intervals, t-tests, ANOVA, correlation coefficients, and regression. It is assumed that you begin this course with a fairly good grasp of basic statistics. Chapter 2 provides a review of the fundamental principles of elementary statistics, expressed in matrix notation where applicable.
1.3 MEASUREMENT SCALES AND DATA TYPES
Choosing an appropriate statistical method requires an accurate categorization of the data to be analyzed. The four kinds of measurement scales identified by S. S. Stevens (1946) are nominal, ordinal, interval, and ratio. However, there are almost no examples of interval data that are not also ratio, so we often refer to the two collectively as an interval/ratio scale. Effectively, then, we have only three kinds of data: categorical (nominal), ordinal (ordered categorical), and fully quantitative (interval/ratio). As we investigate the methods of this book, we will discover that ordinal is not a particularly meaningful category of data for multivariate methods. Therefore, from the standpoint of data, the major distinction will be between methods that apply to fully quantitative data (interval/ratio), methods that apply to categorical data, and methods that apply to data sets containing both quantitative and categorical data.
Factor analysis (Chapter 4) is an example of a method that uses only quantitative variables, as is multiple regression. Log-linear analysis (Chapter 9) is an example of a method that deals with completely categorical data. MANOVA (Chapter 8) is an example of an analysis that requires both quantitative and categorical data: it has categorical independent variables and quantitative dependent variables.
Another important issue with respect to data types is the distinction between discrete and continuous data. Discrete data are whole numbers, such as the number of persons voting for a proposition or the number voting against it. Continuous data are decimal numbers, with an infinite number of possible values between any two points. In measuring cut lengths of wire, it is possible in principle to identify an infinitude of lengths that lie between any two points, for example, between 23 and 24 inches. The number one can distinguish in practice depends on the accuracy of one's measuring instrument. Measured length is therefore continuous. By extension, variables measured in the biomedical and social sciences that take multiple possible values along a continuum, such as oxytocin levels or scores on a measure of personality traits, are treated as continuous data.
All categorical data are by definition discrete. It is not possible for data to be both categorical and continuous. Quantitative data, on the other hand, can be either continuous or discrete. Most measured quantities, such as height, width, length, and weight, are both continuous and fully quantitative (interval/ratio). There are, however, many examples of data that are fully quantitative and yet discrete. For example, the count of the number of persons in a room is discrete, because it can only be a whole number, but it is also fully quantitative, with interval/ratio properties. If there are 12 persons in one room and 24 in another, it makes sense to say that there are twice as many persons in the second room. Counts of the number of persons therefore have interval/ratio properties.1
When all the variables are measured on the same scale, we refer to them as commensurate. When the variables are measured on different scales, they are noncommensurate. An example of commensurate data would be the width, length, and height of a box, each measured in inches. An example of noncommensurate data would be if the width and length of the box were measured in inches, but the height was measured in centimeters. (Of course, one could make them commensurate by converting all to inches or all to centimeters.) Another example of noncommensurate variables would be IQ scores and blood lead levels. Variables that are not commensurate can always be made so by standardizing them (transforming them into Z-scores or percentiles). A few multivariate methods, such as profile analysis (associated with Chapter 7 in connection with Hotelling's T2) or principal component analysis of a covariance matrix, require commensurate variables.
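As a brief illustrative sketch (the variables and values are hypothetical, and NumPy is assumed), standardizing two noncommensurate variables to Z-scores puts them on a common, unitless scale:

```python
import numpy as np

# Hypothetical noncommensurate variables: IQ scores and blood lead levels.
iq = np.array([95.0, 110.0, 102.0, 88.0, 120.0])
lead = np.array([3.1, 1.4, 2.2, 5.0, 0.9])

def zscore(x):
    """Standardize to mean 0, SD 1 (ddof=1 gives the sample SD)."""
    return (x - x.mean()) / x.std(ddof=1)

iq_z, lead_z = zscore(iq), zscore(lead)
# Both variables are now commensurate: unitless, with mean 0 and SD 1.
print(iq_z.round(2), lead_z.round(2))
```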