An Introduction to Categorical Data Analysis

Alan Agresti

A valuable new edition of a standard reference

The use of statistical methods for categorical data has increased dramatically, particularly for applications in the biomedical and social sciences. An Introduction to Categorical Data Analysis, Third Edition summarizes these methods and shows readers how to apply them with software. Readers will find a unified generalized linear models approach that connects logistic regression and loglinear models for discrete data with normal regression for continuous data.

Adding to the value of the new edition are:

• Illustrations of the use of R software to perform all the analyses in the book

• A new chapter on alternative methods for categorical data, including smoothing and regularization methods (such as the lasso), classification methods such as linear discriminant analysis and classification trees, and cluster analysis

• New sections in many chapters introducing the Bayesian approach for the methods of that chapter

• More than 70 analyses of data sets to illustrate application of the methods, and about 200 exercises, many containing other data sets

• An appendix showing how to use SAS, Stata, and SPSS, and an appendix with short solutions to most odd-numbered exercises

Written in an applied, nontechnical style, this book illustrates the methods using a wide variety of real data, including medical clinical trials, environmental questions, drug use by teenagers, horseshoe crab mating, basketball shooting, correlates of happiness, and much more.

An Introduction to Categorical Data Analysis, Third Edition is an invaluable tool for statisticians and biostatisticians as well as methodologists in the social and behavioral sciences, medicine and public health, marketing, education, and the biological and agricultural sciences.

From helping to assess the value of new medical treatments to evaluating the factors that affect our opinions on controversial issues, scientists today are finding myriad uses for categorical data analyses. It is primarily for these scientists and their collaborating statisticians – as well as those training to perform these roles – that this book was written.

This first chapter reviews the most important probability distributions for categorical data: the binomial and multinomial distributions. It also introduces maximum likelihood, the most popular method for using data to estimate parameters. We use this type of estimate and a related likelihood function to conduct statistical inference. We also introduce the Bayesian approach to statistical inference, which utilizes probability distributions for the parameters as well as for the data. We begin by describing the major types of categorical data.


A categorical variable has a measurement scale consisting of a set of categories. For example, political ideology might be measured as liberal, moderate, or conservative; choice of accommodation might use categories house, condominium, and apartment; a diagnostic test to detect e-mail spam might classify an incoming e-mail message as spam or legitimate. Categorical variables are often referred to as qualitative, to distinguish them from quantitative variables, which take numerical values, such as age, income, and number of children in a family.
Categorical variables are pervasive in the social sciences for measuring attitudes and opinions, with categories such as (agree, disagree), (yes, no), and (favor, oppose, undecided). They also occur frequently in the health sciences, for measuring responses such as whether a medical treatment is successful (yes, no), mammogram-based breast diagnosis (normal, benign, probably benign, suspicious, malignant with cancer), and stage of a disease (initial, intermediate, advanced). Categorical variables are common for service-quality ratings of any company or organization that has customers (e.g., with categories excellent, good, fair, poor). In fact, categorical variables occur frequently in most disciplines. Other examples include the behavioral sciences (e.g., diagnosis of type of mental illness, with categories schizophrenia, depression, neurosis), ecology (e.g., primary land use in satellite image, with categories woodland, swamp, grassland, agriculture, urban), education (e.g., student responses to an exam question, with categories correct, incorrect), and marketing (e.g., consumer cell-phone preference, with categories Samsung, Apple, Nokia, LG, Other). They even occur in highly quantitative fields such as the engineering sciences and industrial quality control, when items are classified according to whether or not they conform to certain standards.
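
To make the distinction between categorical and quantitative variables concrete, the following minimal sketch summarizes one variable of each kind. The responses, the ages, and the helper name category_proportions are all hypothetical, invented only for illustration, and Python is used here merely as a convenient notation. A categorical variable is naturally summarized by the count or proportion of observations in each category, whereas a quantitative variable supports numerical summaries such as the mean.

```python
from collections import Counter

# Hypothetical data, invented purely for illustration (not from the book).
ideology = ["liberal", "moderate", "conservative", "moderate",
            "liberal", "moderate", "conservative", "moderate"]  # categorical
ages = [34, 51, 47, 29, 62, 40, 55, 38]                         # quantitative


def category_proportions(values):
    """Return the sample proportion of observations in each category."""
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}


# A categorical variable is summarized by counts or proportions per category.
ideology_counts = Counter(ideology)
ideology_props = category_proportions(ideology)

# A quantitative variable takes numerical values, so a mean is meaningful.
mean_age = sum(ages) / len(ages)

print(dict(ideology_counts))
print(ideology_props)
print(mean_age)
```

Averaging the ideology labels would be meaningless, which is why the category counts, rather than a mean, serve as the basic summary for such a variable.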

1.1.1 Response Variable and Explanatory Variables

Most statistical analyses distinguish between a response variable and explanatory variables. For instance, ordinary regression models describe how the mean of a quantitative response variable, such as annual income, changes according to levels of explanatory variables, such as number of years of education and numb...

