Mathematics

Chi Square Test for Independence

The Chi Square Test for Independence is a statistical test used to determine if there is a significant association between two categorical variables. It compares the observed frequencies of the variables with the frequencies that would be expected if there was no relationship between them. The test is commonly used in research and data analysis to assess the independence of variables.
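
As a quick illustration of the comparison between observed and expected frequencies, here is a minimal sketch using SciPy on a small made-up 2 × 3 table; the counts are hypothetical and chosen only to show the mechanics of the test.

```python
# Minimal sketch of a chi-square test of independence (hypothetical counts).
from scipy.stats import chi2_contingency

# Rows: group A / group B; columns: three outcome categories (made-up data).
observed = [[30, 20, 10],
            [20, 30, 10]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, df = {dof}, p-value = {p:.4f}")
print("expected counts under independence:\n", expected)
```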

Written by Perlego with AI-assistance

11 Key excerpts on "Chi Square Test for Independence"

  • Compassionate Statistics
    Applied Quantitative Analysis for Social Services (With exercises and instructions in SPSS)

    p value that is equal to or less than .05, you can assume that a relationship between the variables does exist (i.e., the variables are not independent of each other), and that relationship is probably not due simply to sampling error within the data you have collected.
    Reviewing now what Kurz et al. (2005) reported in this example cited above, you should be able to understand three of their findings: (1) The variable English language is related to (i.e., not independent of) the variable migration status since the data show that more immigrants than nonimmigrants do not have English as their first language, (2) the variable age is related to (i.e., not independent of) the variable migration status since the data show that fewer immigrants than nonimmigrants are reported as young in age, and (3) the variable health insurance is related to (i.e., not independent of) the variable migration status since the data show that more immigrants than nonimmigrants do not have any health insurance.
    This chapter will delve a lot deeper into the meaning and the usefulness of the chi-square test of independence.
    The chi-square statistical test of independence, also referred to simply as χ², appears often in articles published in professional journals, especially journals that appeal to practitioners in the areas of human services or social services. The reason for the popularity of using the chi-square test is simple: It requires only nominal-level data, and social agency case records tend to contain a lot of demographic data at the nominal level (e.g., gender, race/ethnicity, religion, place of residence, marital status) as well as other identifying information measured at the nominal level (e.g., type of presenting problem, types of services needed, characteristics of professional staff providing those services). If a researcher is not depending on agency case records but is collecting data by sending out a survey or interviewing respondents, then at least some of the data collected are typically demographic and nominal in nature (e.g., in addition to the variables noted above, a researcher might want to know level of education, type of social agency, kinds of social services provided). These nominal-level variables are also referred to as categorical
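
    The excerpt reports the relationships (English language, age, and health insurance versus migration status) without the underlying counts, so the sketch below uses hypothetical counts purely to show how such a nominal-by-nominal table would be tested and how the "p ≤ .05" rule described above would be applied.

    ```python
    # Hypothetical 2x2 table: migration status (rows) vs. health insurance (columns).
    # The counts are invented for illustration; only the procedure mirrors the excerpt.
    from scipy.stats import chi2_contingency

    observed = [[40, 60],   # immigrants:    insured, not insured (hypothetical)
                [70, 30]]   # nonimmigrants: insured, not insured (hypothetical)

    chi2, p, dof, expected = chi2_contingency(observed)  # Yates correction applied for 2x2 by default
    if p <= 0.05:
        print(f"p = {p:.4f} <= .05: treat the variables as related (not independent)")
    else:
        print(f"p = {p:.4f} > .05: no evidence against independence")
    ```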
  • Statistical Reasoning in the Behavioral Sciences
    • Bruce M. King, Patrick J. Rosopa, Edward W. Minium (Authors)
    • 2018 (Publication Date)
    • Wiley (Publisher)
    21 Chi-Square and Inference about Frequencies
    When you have finished studying this chapter, you should be able to:
      • Understand that the chi-square test is used to test hypotheses about the number of cases falling into the categories of a frequency distribution;
      • Understand that χ² provides a measure of the difference between observed frequencies and the frequencies that would be expected if the null hypothesis were true;
      • Explain why the chi-square test is best viewed as a test about proportions;
      • Compute χ² for one-variable goodness-of-fit problems;
      • Compute χ² to test for independence between two variables; and
      • Compute effect size for the chi-square test.
    In previous chapters, we have been concerned with numerical scores and testing hypotheses about the mean or the correlation coefficient. In this chapter, you will learn to make inferences about frequencies—the number of cases falling into the categories of a frequency distribution. For example, among four brands of soft drinks, is there a difference in the proportion of consumers who prefer the taste of each? Is there a difference among registered voters in their preference for three candidates running for local office? To answer questions like these, a researcher compares the observed (sample) frequencies for the several categories of the distribution with those frequencies expected according to his or her hypothesis. The difference between observed and expected frequencies is expressed in terms of a statistic named chi-square (χ²), introduced by Karl Pearson in 1900.
    21.1 The Chi-Square Test for Goodness of Fit
    The chi-square (pronounced "ki") test was developed for categorical data; that is, for data comprising qualitative categories, such as eye color, gender, or political affiliation. Although the chi-square test is conducted in terms of frequencies, it is best viewed conceptually as a test about proportions.
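
    The soft-drink question in the excerpt is a goodness-of-fit problem: compare observed category counts with the counts expected under a hypothesized set of proportions. The sketch below uses invented preference counts for four brands and tests the hypothesis of equal proportions with scipy.stats.chisquare.

    ```python
    # Goodness-of-fit sketch: do consumers prefer four brands equally? (hypothetical counts)
    import numpy as np
    from scipy.stats import chisquare

    observed = np.array([48, 35, 62, 55])          # invented preference counts for brands A-D
    expected = np.full(4, observed.sum() / 4)      # equal proportions under H0

    stat, p = chisquare(f_obs=observed, f_exp=expected)
    print(f"chi-square = {stat:.3f}, df = {len(observed) - 1}, p-value = {p:.4f}")
    ```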
  • Essential Statistics for Economics, Business and Management
    • Teresa Bradley (Author)
    • 2014 (Publication Date)
    • Wiley (Publisher)
    The calculation of the χ² test statistic is based on the difference between the observed and expected frequencies. If the χ² test statistic is so large that it exceeds the critical value of χ², then the null hypothesis is rejected (there is insufficient evidence to support the null hypothesis). See Figure 11.2.
    11.4 χ² tests for independence (no association)
    The first χ² test is a test for the independence of two variables, which is also described as a test for no association between two variables. For example, in Table 11.2 independence would mean that a candidate's performance in the theory element of the driving test would give no indication (so is independent) of their performance in the practical test. Stated another way, 'their performance in one test has no association with their performance in the other'. The sample data in Table 11.2 will be used as an example to discuss tests for independence.
    Figure 11.2  All χ² tests in this chapter for significance levels α are right-tailed tests.
    The null hypothesis. The null hypothesis in tests for independence is always 'variables (name the variables) are independent' or there is 'no association between the variables', while the alternative hypothesis states that the variables are dependent or there is association.
    H0: The performance in the theory test is independent of performance in the practical test of the driving test (or there is no association between performance in the theory and practical tests for the driving test).
    H1: The performance in the theory test and practical test are dependent (performance in the theory test is related to (associated with) performance in the practical test).
    To calculate the test statistic it is necessary to begin by setting up the contingency table of observed frequencies.
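
    The excerpt refers to the counts in Table 11.2, which are not reproduced here, so the sketch below uses hypothetical pass/fail counts for the theory and practical elements; it carries out the right-tailed comparison of the χ² statistic with the critical value exactly as described above.

    ```python
    # Right-tailed chi-square test of independence (hypothetical driving-test counts).
    from scipy.stats import chi2, chi2_contingency

    # Rows: theory test pass / fail; columns: practical test pass / fail (invented counts).
    observed = [[55, 25],
                [20, 40]]

    stat, p, dof, expected = chi2_contingency(observed, correction=False)

    alpha = 0.05
    critical = chi2.ppf(1 - alpha, dof)   # right-tail critical value of chi-square
    print(f"chi-square = {stat:.3f}, critical value (alpha={alpha}, df={dof}) = {critical:.3f}")
    if stat > critical:
        print("Reject H0: performance on the two elements is associated.")
    else:
        print("Do not reject H0: no evidence of association.")
    ```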
  • Mathematical Statistics with Resampling and R
    • Laura M. Chihara, Tim C. Hesterberg (Authors)
    • 2022 (Publication Date)
    • Wiley (Publisher)
    Note that for every resample, the row and column totals of the contingency table are the same, as are the expected values; only the cells in the table change. By forming many such resamples and computing the corresponding chi-square test statistic, we obtain the permutation distribution of the chi-square test statistic. We follow this algorithm:
    Permutation Test of Independence of Two Categorical Variables
      Store the data with one row per observation, and one column per variable.
      Calculate a test statistic for the original data. Normally large values of the test statistic suggest dependence.
      repeat
        Randomly permute the rows in one of the columns.
        Create a contingency table for the resampled data.
        Calculate the test statistic for the new contingency table.
      until we have enough samples
      Calculate the P-value as the fraction of times the random statistics exceed the original statistic. Optionally, plot a histogram of the resampled statistic values.
    For instance, in the GSS2018 data set, one permutation of the values in the DeathPenalty column while leaving the Degree column fixed results in the contingency table shown in Table 10.3. The corresponding chi-square statistic (from Equation (10.1)) is c = 50.449. Repeating this permutation many times and computing the chi-square statistic each time gives the permutation distribution of this statistic, shown in Figure 10.1. The estimated P-value, based on 10⁵ − 1 replications, is 0.00001, near zero.
    Table 10.3  Contingency table after permuting the DeathPenalty column.
                                Death penalty?
      Degree                    Favor    Oppose
      Less than high school       147        94
      High school                 687       417
      Junior college              119        62
      Bachelors                   280       155
      Graduate                    152        80
    Figure 10.1  Null distribution for the chi-square statistic for death penalty opinions; the overlaid density is a chi-square distribution with 4 degrees of freedom.
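
    The algorithm above translates directly into code. The sketch below follows those steps on a small synthetic data frame (two categorical columns of made-up labels, not the GSS2018 data): it permutes one column, rebuilds the contingency table, recomputes the chi-square statistic, and estimates the P-value from the resampled statistics.

    ```python
    # Permutation test of independence for two categorical variables (synthetic data).
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)

    def chi_square_stat(table) -> float:
        """Chi-square statistic sum((O - E)^2 / E) for a contingency table."""
        obs = table.to_numpy(dtype=float)
        expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
        return float(((obs - expected) ** 2 / expected).sum())

    # One row per observation, one column per variable (labels are invented).
    df = pd.DataFrame({
        "Degree": rng.choice(["HS", "Bachelors", "Graduate"], size=300),
        "DeathPenalty": rng.choice(["Favor", "Oppose"], size=300),
    })

    observed_stat = chi_square_stat(pd.crosstab(df["Degree"], df["DeathPenalty"]))

    n_resamples = 9999
    count = 0
    for _ in range(n_resamples):
        permuted = rng.permutation(df["DeathPenalty"].to_numpy())   # shuffle one column only
        resampled = pd.crosstab(df["Degree"], permuted)              # margins stay fixed
        if chi_square_stat(resampled) >= observed_stat:
            count += 1

    # Counting the original statistic as one replication, as in the book's 10^5 - 1 convention.
    p_value = (count + 1) / (n_resamples + 1)
    print(f"observed chi-square = {observed_stat:.3f}, permutation P-value = {p_value:.5f}")
    ```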
  • Statistics for The Behavioral Sciences
    5. Degrees of freedom for the test for goodness of fit are df = C − 1, where C is the number of categories in the variable. Degrees of freedom measure the number of categories for which f_e values can be freely chosen. As can be seen from the formula, all but the last f_e value to be determined are free to vary.
    6. The chi-square distribution is positively skewed and begins at the value of zero. Its exact shape is determined by degrees of freedom.
    7. The test for independence is used to assess the relationship between two variables. The null hypothesis states that the two variables in question are independent of each other. That is, the frequency distribution for one variable does not depend on the categories of the second variable. On the other hand, if a relationship does exist, then the form of the distribution for one variable depends on the categories of the other variable.
    8. For the test for independence, the expected frequencies for H0 can be directly calculated from the marginal frequency totals, f_e = (f_c × f_r) / n, where f_c is the total column frequency and f_r is the total row frequency for the cell in question.
    9. Degrees of freedom for the test for independence are computed by df = (R − 1)(C − 1), where R is the number of row categories and C is the number of column categories (see the sketch after this list).
    10. For the test of independence, a large chi-square value means there is a large discrepancy between the f_o and f_e values. Rejecting H0 in this test provides support for a relationship between the two variables.
    11. Both chi-square tests (for goodness of fit and independence) are based on the assumption that each observation is independent of the others. That is, each observed frequency reflects a different individual, and no individual can produce a response that would be classified in more than one category or more than one frequency in a single category.
    12. The chi-square statistic is distorted when f_e values are small.
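
    Points 8 and 9 above give the expected-frequency and degrees-of-freedom formulas for the test of independence. The sketch below applies f_e = (f_c × f_r) / n and df = (R − 1)(C − 1) to a small made-up table of observed frequencies.

    ```python
    # Expected frequencies f_e = (f_c * f_r) / n and df = (R - 1)(C - 1) for a made-up table.
    import numpy as np

    f_o = np.array([[20, 30, 10],    # observed frequencies (hypothetical)
                    [25, 15, 20]])

    f_r = f_o.sum(axis=1)            # row totals
    f_c = f_o.sum(axis=0)            # column totals
    n = f_o.sum()

    f_e = np.outer(f_r, f_c) / n     # expected frequency for each cell
    df = (f_o.shape[0] - 1) * (f_o.shape[1] - 1)

    print("expected frequencies:\n", np.round(f_e, 2))
    print("degrees of freedom:", df)
    ```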
  • Understanding Business Statistics
    • Ned Freed, Stacey Jones, Timothy Bergquist (Authors)
    • 2013 (Publication Date)
    • Wiley (Publisher)
    14.4 Chi-Square Tests of Independence
    In an extension of the ideas introduced in sections 14.2 and 14.3, the chi-square distribution can be used to determine whether certain factors represented in sample data are statistically independent. We'll use the following situation to demonstrate the idea:
    Situation: The table below shows the results of a national survey of 1000 adults chosen randomly from the population of all adults in the country. Each individual in the sample was asked: "How optimistic are you about the future of the American economy?" Three possible answers were provided: Optimistic, Not Sure, and Pessimistic. Respondents were classified by age, as either young adults (18 to 30) or older adults (over 30).
    OBSERVED SAMPLE FREQUENCIES
      Age Group       Optimistic   Unsure   Pessimistic   Totals
      Young Adults           240      110            70      420
      Older Adults           390       80           110      580
      Totals                 630      190           180     1000
    Your job is to determine whether the different age groups represented in the survey have different attitudes about the future, or whether their attitudes are the same. Put another way, we want to know whether, for the population represented, attitude is dependent on age group or whether the two factors—attitude and age group—are independent. Tables like the one shown here are often referred to as contingency tables. In fact, the procedure we're about to show is commonly called contingency table analysis. These sorts of tables can also be labeled cross-tabulation tables or pivot tables.
    The Hypotheses
    To answer the question that's been posed, we'll set up a hypothesis test to test proportion differences and use the chi-square distribution to conduct the test. The hypotheses for the test are:
    H0: Attitude is independent of age group. Translation: If we were to put this same question to all adults in the country, there would be no difference between age groups in their response to the question.
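
    Using the observed frequencies from the survey table above, here is a sketch of the contingency-table analysis in Python; the counts come from the excerpt, while the χ² computation itself is standard SciPy rather than the book's own worked solution.

    ```python
    # Chi-square test of independence for the attitude-by-age-group table above.
    from scipy.stats import chi2_contingency

    observed = [[240, 110,  70],   # Young Adults: Optimistic, Unsure, Pessimistic
                [390,  80, 110]]   # Older Adults: Optimistic, Unsure, Pessimistic

    chi2, p, dof, expected = chi2_contingency(observed)
    print(f"chi-square = {chi2:.3f}, df = {dof}, p-value = {p:.6f}")
    print("expected counts under H0 (independence):\n", expected)
    ```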
  • Introductory Statistics
    • Prem S. Mann (Author)
    • 2020 (Publication Date)
    • Wiley (Publisher)
    The value of the test statistic χ² in a test of independence is obtained using the same formula as in the goodness-of-fit test described in Section 11.2.
    Test Statistic for a Test of Independence. The value of the test statistic χ² for a test of independence is calculated as
      χ² = Σ (O − E)² / E
    where O and E are the observed and expected frequencies, respectively, for a cell.
    The null hypothesis in a test of independence always specifies that the two attributes are not related. The alternative hypothesis is that the two attributes are related. The frequencies obtained from the performance of an experiment for a contingency table are called the observed frequencies. The procedure to calculate the expected frequencies for a contingency table for a test of independence is different from that for a goodness-of-fit test. Example 11.5 describes this procedure.
    EXAMPLE 11.5  Calculating Expected Frequencies for a Test of Independence (Lack of Discipline in Schools)
    Many adults think that lack of discipline has become a major problem in schools in the United States. A random sample of 300 adults was selected, and these adults were asked if they favor giving more freedom to school teachers to punish students for lack of discipline. The two-way classification of the responses of these adults is presented in the following table.
                    In Favor (F)   Against (A)   No Opinion (N)
      Men (M)                 93            70               12
      Women (W)               87            32                6
    The numbers 93, 70, 12, 87, 32, and 6 listed inside Tables 11.5 and 11.6 are the observed frequencies of the respective cells. These frequencies are obtained from the sample. As mentioned earlier, the null hypothesis in a test of independence specifies that the two attributes (or classifications) are independent. In a test of independence, first we assume that the null hypothesis is true and that the two attributes are independent.
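
    A sketch of the expected-frequency calculation for the table in Example 11.5, using the row and column totals; the counts are from the excerpt, but the code is an illustration rather than the book's own Minitab or by-hand solution.

    ```python
    # Expected frequencies E = (row total x column total) / n for the Example 11.5 table.
    import numpy as np

    observed = np.array([[93, 70, 12],   # Men:   In Favor, Against, No Opinion
                         [87, 32,  6]])  # Women: In Favor, Against, No Opinion

    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    n = observed.sum()

    expected = np.outer(row_totals, col_totals) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()   # sum of (O - E)^2 / E

    print("expected frequencies:\n", np.round(expected, 2))
    print(f"chi-square test statistic = {chi2:.3f}")
    ```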
  • Mann's Introductory Statistics
    • Prem S. Mann (Author)
    • 2017 (Publication Date)
    • Wiley (Publisher)
    Performing a Chi-Square Goodness of Fit Test for Example 11–3 of the Text
    1. Enter the observed counts from Example 11–3 into C1.
    2. Select Stat > Tables > Chi-Square Goodness-of-Fit Test (One Variable).
    3. Use the following settings in the dialog box that appears on screen (see Screen 11.6):
       • Select Observed counts and type C1 in the box.
       • Select Equal proportions at the Test submenu. Note: If the alternative hypothesis does not specify equal proportions, go to the worksheet and type the proportions in C2. Return to the dialog box, select Specific proportions at the Test submenu, and type C2 in the box.
    4. Click OK.
    5. The output, including the test statistic and p-value, will be displayed in the Session window. (See Screen 11.7.) Note: By default, Minitab will also generate two different bar graphs: one of the observed and expected counts and another of the (O − E)²/E values, which are called the contributions to the Chi-Square statistic. These graphs are not shown here.
    Now compare the χ²-value with the critical value of χ², or the p-value from Screen 11.7 with α, and make a decision.
    Performing a Chi-Square Independence/Homogeneity Test for Example 11–6 of the Text
    1. Enter the contingency table from Example 11–6 into the first two rows of C1 through C3. (See Screen 11.8.)
    2. Select Stat > Tables > Chi-Square Test for Association.
    3. Use the following settings in the dialog box that appears on screen (see Screen 11.8):
       • Select Summarized data in a two-way table from the drop-down menu. Note: For raw data in C1 and C2, select Raw data (categorical variables) from the drop-down menu, type C2 in the Rows box, and C1 in the Columns box. Then go to step 4.
       • Type C1-C3 in the Columns containing the table box.
    4. Click OK.
    5. The output, including the test statistic and p-value, will be displayed in the Session window.
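
    The steps above are Minitab menu instructions. As a rough non-Minitab counterpart, the sketch below performs the same two computations in Python (an equal-proportions goodness-of-fit test on a column of observed counts, and an independence test on a summarized two-way table), with invented counts standing in for the textbook's Examples 11–3 and 11–6.

    ```python
    # Rough Python counterpart to the Minitab steps (counts are invented placeholders).
    import numpy as np
    from scipy.stats import chisquare, chi2_contingency

    # Goodness-of-fit with equal proportions (analogue of the "Equal proportions" setting).
    observed_counts = np.array([41, 33, 28, 38])        # stand-in for the counts in C1
    stat, p = chisquare(observed_counts)                # equal expected counts by default
    print(f"goodness of fit: chi-square = {stat:.3f}, p-value = {p:.4f}")

    # Independence/homogeneity test on a summarized two-way table
    # (analogue of "Summarized data in a two-way table").
    two_way = [[50, 30, 20],
               [40, 45, 15]]
    stat, p, dof, expected = chi2_contingency(two_way)
    print(f"independence: chi-square = {stat:.3f}, df = {dof}, p-value = {p:.4f}")
    ```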
  • Fundamental Statistics for the Behavioral Sciences
    For this reason the expected values that result are those that would be expected if H0 were true and the variables were independent. A large discrepancy in the fit between expected and observed would reflect a large departure from independence, which is what we want to test.
      χ² = Σ (O − E)² / E
         = (13 − 14.226)²/14.226 + (36 − 34.774)²/34.774 + (14 − 12.774)²/12.774 + (30 − 31.226)²/31.226
         = 0.315
    Degrees of Freedom
    Before we can compare our value of χ² to the value in Table E.1, we must know the degrees of freedom. For the analysis of contingency tables, the degrees of freedom are given by
      df = (R − 1)(C − 1)
    where R = the number of rows in the table and C = the number of columns in the table. For our example we have R = 2 and C = 2; therefore, we have (2 − 1)(2 − 1) = 1 df. It may seem strange to have only 1 df when we have four cells, but once you know the row and column totals, you need to know only one cell frequency to be able to determine the rest.
    Evaluation of χ²
    With 1 df the critical value of χ², as found in Table E.1, is 3.84. Because our value of 0.315 falls below the critical value, we will not reject the null hypothesis that the variables are independent of each other (p = .5746). In this case we will conclude that we have no evidence to suggest that whether a girl does or does not relapse is dependent on whether she was provided with Prozac or a placebo.
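
    The worked 2 × 2 example above (relapse by Prozac versus placebo) can be checked in a few lines. The observed counts 13, 36, 14, and 30 are taken from the excerpt; the row/column labels in the comments are an assumption about the table's layout, and Yates' continuity correction is turned off so that SciPy reproduces the book's uncorrected χ² of about 0.315.

    ```python
    # Check of the worked 2x2 example: chi-square without continuity correction.
    from scipy.stats import chi2_contingency

    observed = [[13, 36],   # e.g. Prozac:  relapse, no relapse (counts from the excerpt)
                [14, 30]]   # e.g. placebo: relapse, no relapse (labels assumed)

    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    print("expected frequencies:\n", expected)                    # ~14.226, 34.774, 12.774, 31.226
    print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p:.4f}")     # ~0.315, df = 1
    print("reject H0 at the .05 level?", chi2 > 3.84)
    ```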
  • Introductory Statistics
    • Prem S. Mann (Author)
    • 2016 (Publication Date)
    • Wiley (Publisher)
    Based on the results of the survey, a two-way classification table was prepared and presented in Example 11–5. Does the sample provide sufficient evidence to conclude that the two attributes, gender and opinions of adults, are dependent? Use a 1% significance level.
    Solution. The test involves the following five steps.
    Step 1. State the null and alternative hypotheses. As mentioned earlier, the null hypothesis must be that the two attributes are independent. Consequently, the alternative hypothesis is that these attributes are dependent.
      H0: Gender and opinions of adults are independent.
      H1: Gender and opinions of adults are dependent.
    Step 2. Select the distribution to use. We use the chi-square distribution to make a test of independence for a contingency table.
    Step 3. Determine the rejection and nonrejection regions. The significance level is 1%. Because a test of independence is always right-tailed, the area of the rejection region is .01 and falls in the right tail of the chi-square distribution curve. The contingency table contains two rows (Men and Women) and three columns (In Favor, Against, and No Opinion). Note that we do not count the row and column of totals. The degrees of freedom are df = (R − 1)(C − 1) = (2 − 1)(3 − 1) = 2. From Table VI of Appendix B, for df = 2 and α = .01, the critical value of χ² is 9.210. This value is shown in Figure 11.6. (Making a test of independence: 2 × 3 table.)
    Step 4. Calculate the value of the test statistic. Table 11.7, with the observed and expected frequencies constructed in Example 11–5, is reproduced as Table 11.8.
    Figure 11.6  Rejection and nonrejection regions (α = .01; critical value of χ² = 9.210).
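
    The critical-value step can be reproduced directly: chi2.ppf(1 − α, df) returns the right-tail cutoff (approximately 9.210 for df = 2 and α = .01), and the decision compares the computed statistic for the gender-by-opinion table (the counts shown in the Example 11.5 excerpt above) with that cutoff. This is a sketch, not the textbook's own worked solution.

    ```python
    # Right-tailed decision rule for the test of independence (df = 2, alpha = .01).
    from scipy.stats import chi2, chi2_contingency

    observed = [[93, 70, 12],   # Men:   In Favor, Against, No Opinion
                [87, 32,  6]]   # Women: In Favor, Against, No Opinion

    stat, p, dof, expected = chi2_contingency(observed)

    alpha = 0.01
    critical = chi2.ppf(1 - alpha, dof)          # ~9.210 for df = 2
    print(f"test statistic = {stat:.3f}, critical value = {critical:.3f}, p-value = {p:.4f}")
    print("Reject H0 (attributes dependent)?", stat > critical)
    ```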
  • Mind on Statistics (with JMP Printed Access Card)
    a. Are the conditions necessary for carrying out a chi-square test met? Explain.
    b. Test whether there is a statistically significant relationship between these two variables. Show all five steps for the hypothesis test. Be sure to state the level of significance that you are using.
    15.54 Refer to Exercise 15.51, in which each student guessed the results of ten coin flips. If all students are just guessing, and if the coins are fair, then the number of correct guesses for each student should follow a binomial distribution.
    a. What are the parameters n and p for the binomial distribution, assuming that the coins are fair and students were just guessing?
    b. Specify the probabilities of getting 0 correct, 1 correct, ..., 10 correct for this experiment if students were just guessing. (Hint: These are the probabilities in the pdf for a binomial distribution with parameters specified in part (a).)
    c. The following table shows how many students got two or less right, three right, four right, and so on, separately, for students classified as Sheep (believe in ESP) and classified as Goats (don't believe in ESP). Using your results from part (b), fill in the null probabilities that correspond to the hypothesis that students are just guessing.
      Number Correct   Sheep   Goats   Null Probabilities
      2                    6       5
      3                   11      12
      4                   16      15
      5                   29      19
      6                   28      16
      7                   14      10
      8                    8       3
      Total              112      80
    a. Identify the two cells with the highest "contributions to chi-square." Specify the numerical value of the "contribution" and the row and column categories for each of the two cells.
    b. For each of the two cells identified in part (a), determine whether the expected count is higher or lower than the observed count.
    c. Using the information in parts (a) and (b), explain how the women in those category combinations contribute to the overall conclusion for this study.
    15.51 Example 15.12 (p. 612) described an experiment in which students were classified as "Sheep" who believe in ESP or as "Goats" who do not.
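
    For part (b) of Exercise 15.54, the null probabilities come from a binomial pmf with n = 10 and p = 0.5. The sketch below computes them with SciPy and shows how the bottom category ("two or less") would be pooled; how the upper rows of the table are grouped is left to the exercise, so only the lower tail is pooled here.

    ```python
    # Null probabilities for "just guessing": Binomial(n = 10, p = 0.5).
    from scipy.stats import binom

    n, p = 10, 0.5
    pmf = [binom.pmf(k, n, p) for k in range(n + 1)]
    for k, prob in enumerate(pmf):
        print(f"P(X = {k:2d}) = {prob:.4f}")

    # The table's first row pools "two or less" correct guesses:
    print(f"P(X <= 2) = {binom.cdf(2, n, p):.4f}")
    ```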
Index pages curate the most relevant extracts from our library of academic textbooks. They've been created using an in-house natural language model (NLM), with each page adding context and meaning to a key research topic.