Two data analysis tasks that you will use quite often are computing frequencies for categorical data and analyzing two-way frequency tables (tables in which the values of one variable form the rows and the values of another variable form the columns). You might want to see survey results showing the number and percentage of subjects in one or more categories. Another common task is to test whether two categorical variables are related. For example, are people with high cholesterol more likely to suffer a heart attack than people with normal cholesterol levels? On average, do women earn less than men, based on a sample of weekly salaries? This chapter covers two of the SAS Studio statistical tasks that deal with frequency data: the first is called One-Way Frequencies; the other, Table Analysis.
One of the data sets created when you ran the Create_Datasets.sas program is called Salary, and it is placed in the STATS library. This data set contains simulated data on weekly salaries broken down by gender, age group, and educational level. Besides the actual weekly salary (variable Weekly_Salary), there is another numeric variable (Salary) that has values of 0 (salary below the median) and 1 (salary above the median). Although this data set was simulated, the values were based on a data set from the U.S. Department of Labor. The Department of Labor data contains salaries for multiple categories: age group, level of education, gender, and several other variables. For simplicity, the Salary data set contains only two levels for most of these variables (typically ones with extreme differences, such as ages 20–24 compared to ages 45–54, where large differences in salaries exist).
Before we get into the actual analysis of this data set, it is important to understand how these statistics are collected. One method, called uncontrolled, simply looks at median salaries for each category of predictor, such as gender or age group. The other, called controlled, looks at salaries for identical jobs and other factors such as years of employment. All of the data values for this simulated data set are based on the uncontrolled labor statistics values.
Variables in the Salary data set are displayed in the table that follows.
| Variable | Description | Values |
| --- | --- | --- |
| Weekly_Salary | Weekly salary | Actual amount in dollars |
| Salary | Above or below the median salary | 0 = Below, 1 = Above |
| Gender | Gender | M = Male, F = Female |
| Age_Group | Age group | 20–24, 45–54 |
| Education | Educational level | Less than HS, College Degree or higher |
To compute one-way frequencies (frequencies for a single variable), go to the Statistics tab under Tasks to see a list of statistical tasks (Figure 14.1).
Figure 14.1: Demonstrating the One-Way Frequency Task
Double-click One-Way Frequencies in the Statistics task list to bring up the following screen (Figure 14.2).
Figure 14.2: DATA Tab Selections
On the DATA tab, choose the Salary data set stored in the permanent STATS library. Next, select the variables Salary, Gender, Age_Group, and Education in the Analysis variables box. Notice that the variable Weekly_Salary (which is a numeric variable) is included in your list of choices. If you include numeric variables with many different values (such as Weekly_Salary), the One-Way Frequencies task will list frequencies for every unique value of the selected variable. If you included the variable Weekly_Salary as an analysis variable, the output would tell you how many people earned $800, how many people earned $801, and so on. The only numeric variable chosen in this example is Salary. This is OK because this numeric variable is coded as 0 or 1, representing below the median and above the median.
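If you are not sure how many unique values a variable has, one way to check before choosing analysis variables is the NLEVELS option of PROC FREQ. The step below is a hand-written sketch, not code produced by the task, and it assumes that the STATS library is already assigned in your session.

```sas
/* Sketch: count the distinct values (levels) of every variable  */
/* in STATS.Salary. Assumes the STATS library is assigned.       */
proc freq data=stats.salary nlevels;
   /* NOPRINT suppresses the frequency tables themselves, so     */
   /* only the "Number of Variable Levels" table is displayed.   */
   tables _all_ / noprint;
run;
```

A variable with only a few levels, such as Salary or Gender, is a reasonable choice for a frequency table. A variable with many levels, such as Weekly_Salary, would produce one row for every distinct value.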
Before you run the procedure, click the OPTIONS tab to select additional options.
Figure 14.3: One-Way Frequency Options
In this example, you have chosen to suppress plots and to deselect the default option to include cumulative frequencies (which you rarely need). You are now ready to run the procedure. The output is shown in Figure 14.4.
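The selections on the DATA and OPTIONS tabs correspond roughly to the PROC FREQ step below. This is a hand-written approximation that assumes the STATS library is already assigned. The code that SAS Studio generates for the task might differ in its details.

```sas
/* Hand-written approximation of the One-Way Frequencies settings. */
/* Assumes the STATS library has already been assigned.            */
proc freq data=stats.salary;
   /* NOCUM omits the cumulative frequency and percent columns;    */
   /* PLOTS=NONE suppresses plots for these tables.                */
   tables Salary Gender Age_Group Education / nocum plots=none;
run;
```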
Figure 14.4: Frequency Tables
You see the frequency and percent for each unique value of these variables. Although this is useful information, it could be improved by replacing the values of Gender (F and M) with the labels Female and Male and replacing the values of Salary (0 and 1) with the labels Below the Median and Above the Median...