CHAPTER 1
Descriptive Statistics
Introduction
A simple fact of life is that most phenomena have a random component. Human beings have a natural height that is different from the natural height of a dog or a tree. However, human beings are not all of the same height. The range of heights, which is usually small, is governed by random variation, called error. For example, the height of adult human beings ranges roughly from 52 to 75 inches. This does not mean that 100% of humankind falls in this range. The few people who fall outside this range are considered outliers. Summarizing the height of human beings is very common in statistics. In science, however, it is helpful to state the level of confidence associated with a statement. For example, it is important to be able to state that a particular percentage, say 90%, of women living in the United States have a height between 59 and 69 inches.
One might think that it is important, or maybe even necessary, to provide a range that covers all cases. However, such a range may prove too wide to be of actual use. For example, one might be able to say with 100% certainty that annual income in the United States is between zero and $100,000,000,000. Although the lower end is a certainty, the upper end is not as definite. The chance of anyone making $100,000,000,000 in a year is very low, but nothing rules it out. Therefore, one has to provide the probability of someone making such a huge income. Since that likelihood is very low, it is more meaningful to state an income range for a sizable majority, such as the income range of 95% of people.
For example, it is useful to know that 99% of all people in the United States earned less than $434,682 per individual return in 2012 (Internal Revenue Service 2014¹), which is the same as saying that the top 1% made at least that much per return in the same year. According to the same source, the top 10% made more than $125,195 per return. The particular percentage is not important, as the choice of the top 1% versus the top 10% (or some other percentage) depends on the task at hand.
For example, the government might want to help the middle class by granting it a tax break that brings its tax burden in line with that of the upper or lower classes. One way of determining the middle-class income of a population is to find the 50% of people whose incomes are in the middle. Another way of stating this is to identify the cutoff income level for the lower 25% of incomes and the cutoff income level for the upper 25% of incomes. The two cutoffs mark the income range that contains the middle 50% of incomes. The computations necessary to determine these and other useful values are the subject of descriptive statistics.
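The two cutoffs described above can be illustrated with a short Python sketch. The income figures are hypothetical and chosen only for illustration, and the "inclusive" interpolation rule is one of several conventions for computing cutoffs; other conventions give slightly different values.

```python
import statistics

# Hypothetical annual incomes in dollars (illustrative values, not real data).
incomes = [18_000, 24_000, 31_000, 38_000, 45_000, 52_000,
           61_000, 75_000, 98_000, 140_000, 420_000]

# quantiles(n=4) returns the three cut points that split the data into quarters.
# method="inclusive" treats the data as the whole population of interest.
q1, median, q3 = statistics.quantiles(incomes, n=4, method="inclusive")

# Incomes between the lower-25% cutoff and the upper-25% cutoff.
middle = [x for x in incomes if q1 <= x <= q3]

print(f"Lower 25% cutoff: {q1:,.0f}")
print(f"Upper 25% cutoff: {q3:,.0f}")
print(f"Incomes in the middle range: {middle}")
```

Note that the single very large income of 420,000 has no effect on either cutoff, which is one reason cutoffs of this kind are useful for describing incomes.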
Descriptive statistics provide quick and representative information about a population or a sample, such as that a typical man is 5 ft 10 in, the average high temperature on the Fourth of July in Washington, DC, is 85° Fahrenheit,² eight out of nine runners in the men’s 100-meter dash at the 2012 Olympics finished in less than 10 seconds,³ etc. These statistics describe something of interest about the population and condense all the facts into a single parameter. Note the subtle differences in terms such as “most common,” “typical,” or “average.” Descriptive statistics is the science of summarizing and condensing information into a few parameters.
There are many ways of condensing information to create descriptive statistics. Different types of data require different tools. Data can be qualitative or quantitative. These naming conventions actually refer to the way variables are measured and not to an inherent characteristic of a phenomenon. Variables are used for statistical analysis and are measured based on their characteristics. The preferred name for qualitative variables is categorical variables, because the word “qualitative” has a value connotation, which is often reflected in the literature.
In many cases, analyzing qualitative and quantitative variables requires different tools, but in some cases the tools are similar for both, if not identical. However, the interpretations of qualitative and quantitative variables are usually different. Note that a population is not defined as either qualitative or quantitative. Rather, it is the variable of interest in the population that is either qualitative or quantitative. For example, the population may consist of people. If the age of the person is of interest, then the variable is quantitative; but if the gender of the person is of interest, then the variable is qualitative. If the population under study is a firm and the variable is the firm’s status as a polluter (i.e., the firm either pollutes or does not pollute), then it is a qualitative variable. However, if the amount of pollution is of interest, then it is a quantitative variable.
Definition 1.1 Qualitative variables are nonnumeric. They represent a label for a category of similar items. For example, the status of a firm as a polluter is a qualitative variable.
Definition 1.2 Quantitative variables are numerical. The distance each student travels to get to school is a quantitative variable.
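As an informal illustration of Definitions 1.1 and 1.2, the following Python sketch summarizes one variable of each kind. The data values and variable names are hypothetical. The point is only that a label can be counted but not averaged, while a numerical variable supports arithmetic.

```python
import statistics

# Qualitative variable: each firm's status as a polluter (hypothetical labels).
polluter_status = ["pollutes", "does not pollute", "pollutes",
                   "pollutes", "does not pollute"]

# Quantitative variable: distance each student travels to school,
# in miles (hypothetical values).
distance_miles = [0.5, 1.5, 3.0, 4.5, 2.5]

# Labels cannot be added or averaged; the most common label is a valid summary.
most_common_status = statistics.mode(polluter_status)

# Numerical values support arithmetic, so a mean is meaningful.
mean_distance = statistics.mean(distance_miles)

print(most_common_status)
print(mean_distance)
```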
Measurement Scales
Variables must be measured in a meaningful way. The following definitions provide brief descriptions of different types of measurement scales. Most of the methods in this text require an interval scale or a measurement scale with stronger relational requirements.
Definition 1.3 Nominal or categorical data are the “count” of the number of times an event occurs. Countries might be grouped according to their policy toward trade and might be classified as open or closed economies. Care must be taken to ensure that each case belongs to only one group. An ID number is an example of nominal data. Since relative size does not matter for nominal data, the customary arithmetic computations and statistical methods do not apply to these numbers.
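The counting described in Definition 1.3 can be sketched in Python as follows. The country names and their classifications are hypothetical placeholders. Storing the data as a mapping from each country to a single label is one simple way, assumed here, of ensuring that each case belongs to only one group.

```python
from collections import Counter

# Hypothetical classification of countries by trade policy.
# A dict holds exactly one label per country, so no case is in two groups.
trade_policy = {
    "Country A": "open",
    "Country B": "closed",
    "Country C": "open",
    "Country D": "open",
    "Country E": "closed",
}

# Verify that every label is one of the allowed categories.
allowed = {"open", "closed"}
assert set(trade_policy.values()) <= allowed

# Nominal data are summarized by counting cases in each category.
counts = Counter(trade_policy.values())

# Shares are computed from the counts, not from the labels themselves.
shares = {label: n / len(trade_policy) for label, n in counts.items()}

print(counts)
print(shares)
```

The labels carry no magnitude, so only the counts and the shares derived from them enter any computation.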
Definition 1.4 When there are only two nominal types, the data is dichotomous. When there is no particular order, a dichotomous variable is called a discrete dichotomous variable. Gender is an example of a discrete dichotomous variable. Alternatively, when one can place an order on the type of data, as in the case of young and old, then the variable is a continuous dichotomous variable.
Definition 1.5 An ordinal scale indicates that data is ordered in some way...