Psychology
Statistical Significance
Statistical significance in psychology refers to the likelihood that a research finding is not due to chance. It is determined through statistical tests and indicates the reliability of the results. If a finding is statistically significant, it suggests that there is a true relationship or difference in the population being studied, rather than one that occurred by random chance.
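The definition above can be made concrete with a short simulation. All the scores below are invented for illustration; the point is that a permutation test asks how often a difference as large as the one observed would arise if group membership were assigned purely by chance:

```python
import random
from statistics import mean

random.seed(42)
# Hypothetical memory-test scores for two groups of participants
treatment = [78, 85, 90, 81, 88, 84, 92, 79, 86, 83]
control   = [72, 80, 75, 77, 74, 81, 70, 76, 73, 78]

observed = mean(treatment) - mean(control)

# Permutation test: if group labels were arbitrary (the null hypothesis),
# how often would shuffled labels produce a difference at least this large?
pooled = treatment + control
n = len(treatment)
trials, count = 10_000, 0
for _ in range(trials):
    random.shuffle(pooled)
    if mean(pooled[:n]) - mean(pooled[n:]) >= observed:
        count += 1
p_value = count / trials
print(f"observed difference: {observed:.1f}, p ~ {p_value:.4f}")
```

A small p-value here means the observed 9-point difference almost never appears under random relabeling, which is what "not due to chance" cashes out to operationally.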
Written by Perlego with AI-assistance
8 Key excerpts on "Statistical Significance"
- eBook - ePub
- John Maltby, Liz Day, Glenn Williams(Authors)
- 2014(Publication Date)
- Routledge(Publisher)
We have divided these considerations into two main areas, but they are, like many other statistical procedures, related. These two areas are: (1) statistical and clinical significance and (2) hypothesis testing and confidence intervals. Therefore, at the end of this chapter you should be able to outline the ideas that underlie statistical and clinical significance and how these relate to effect size and percentage improvement in a research participant's condition. You will also be able to outline the ideas that form hypothesis testing and confidence intervals, and how these two concepts are used in the literature to provide context to statistical findings.
Statistical versus clinical significance
Within the statistical literature there is a distinction between Statistical Significance and clinical significance. Throughout this book we have concentrated on reporting Statistical Significance, because these are findings that arise primarily from the use of statistical tests. However, when we report the findings from statistical tests, a number of questions can arise about the practical importance of those findings. These questions are best summarised by one question: are the findings clinically (or practically) significant?
Let us frame this distinction with the following example. Researchers might have found that a drug treatment has had a statistically significant effect on a particular illness. To establish this, doctors and researchers would have administered the drug to different groups and looked at changes in the symptoms of the illness of individuals in all groups. These groups usually include:
- Experimental groups – groups that receive an intervention (for example, a drug, a counselling session).
- Control groups
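The effect-size and percentage-improvement ideas mentioned above can be sketched in a few lines. The symptom scores and group names below are hypothetical, chosen only to show how a standardized effect size (Cohen's d) and a raw percentage improvement can tell different stories:

```python
import math
from statistics import mean, stdev

# Hypothetical symptom scores (lower = better) after treatment
drug_group    = [14, 12, 15, 11, 13, 12, 14, 13]
placebo_group = [15, 13, 16, 12, 14, 13, 15, 14]

m_drug, m_placebo = mean(drug_group), mean(placebo_group)
s_drug, s_placebo = stdev(drug_group), stdev(placebo_group)

# Pooled standard deviation (equal group sizes)
pooled_sd = math.sqrt((s_drug**2 + s_placebo**2) / 2)

# Cohen's d: standardized effect size (positive = drug group better here)
cohens_d = (m_placebo - m_drug) / pooled_sd

# Percentage improvement in mean symptom score relative to placebo
improvement_pct = 100 * (m_placebo - m_drug) / m_placebo

print(f"Cohen's d = {cohens_d:.2f}, improvement = {improvement_pct:.1f}%")
```

In this invented example the standardized effect is sizeable, yet the symptom reduction is only about 7%; whether that is clinically meaningful is a judgment the statistics alone cannot make.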
The Psychology Research Handbook
A Guide for Graduate Students and Research Assistants
- Frederick T. L. Leong, James T. Austin(Authors)
- 2005(Publication Date)
- SAGE Publications, Inc(Publisher)
11 Statistical Power
Brett Myors
Psychological researchers have become increasingly aware of the importance of designing more powerful studies. In this context, power is a technical term that indicates the sensitivity of Statistical Significance tests used in the analysis of data. The power of a statistical test is the probability of rejecting the null hypothesis when the alternative hypothesis is true (i.e., when there is a real effect in the population). To put it simply, power is the probability of getting a significant result when you deserve to get one. Assuming you are seeking significant effects, so you will have something to talk about in your discussion, power is the probability that your experiment will work out the way you hope it will. Before power was routinely calculated, it was not uncommon to find studies carried out in the psychological and behavioral sciences with power of .50 or less (Cohen, 1962; Sedlmeier & Gigerenzer, 1989). When power is as low as .50, the chance of a successful research outcome is no better than the toss of a coin. When power is less than .50, you'd be lucky to get a significant result at all! With a bit of forethought, however, the odds of a significant result can be improved considerably, and methods for doing so fall under the rubric of power analysis. Obviously power is something worth finding out about before you start collecting data. Most researchers are disappointed with nonsignificant results, especially after going to all the time and effort required to run their study; and it can be especially disappointing to realize that the odds were stacked against you right from the start. A power analysis during the planning stages of research is a good insurance policy against this kind of disappointment. Over the years, many commentators have lamented the low levels of power found in some areas of psychological and behavioral research (e.g., Cohen, 1962).
- Caitlin Gerrity, Scott Lanning(Authors)
- 2024(Publication Date)
- Bloomsbury Libraries Unlimited(Publisher)
If the statistical test you used for your research returns a p-value of less than .05, we say the results are statistically significant. Your p-value should always be reported, and in cases where it is around .05, your readers can determine how to interpret your results.
Practical Significance
You created a research project, gathered your statistics, and ran your analysis. It was a lot of hard work. Your results indicate a strong Statistical Significance. The p-value is much less than .05. You are rightly excited. You report your findings to a colleague, and she just shrugs and says she is sorry you didn't find anything. What just happened? What does she mean? Beyond the p-value, we need to make sure our results have practical significance. In your experiment, group A received special test instructions and group B did not. Group A had a mean score of 88, while group B had a mean score of 86 out of 100. Is this enough to be considered practically significant? If the difference in the means was 20 points, 88 versus 68, then there is no question that your results are both statistically and practically significant. If the mean scores were 88.2 and 88, then clearly you have only a statistically significant result, and not one that is practically significant. Your 2-point difference is statistically significant, but is it practically so? Does it make that much of a real-world difference? What time, effort, and costs went into achieving that increase? Would something else cause the same result? Use your common sense in thinking about the results you achieved and be sure to talk about the practical aspect of those results in your analysis.
Sample
We talked about populations in Chapter 5. To review, a population is every member of a specified group, and the number of people in that group is denoted by the letter N.
The population of your study could be all the freshmen at your high school or your university.
- eBook - ePub
- Lisa L. Harlow, Stanley A. Mulaik, James H. Steiger(Authors)
- 2013(Publication Date)
- Psychology Press(Publisher)
Lively debate on a controversial issue is often regarded as a healthy sign in science. Anomalous or conflicting findings generated from alternative theoretical viewpoints often precede major theoretical advances in the more developed sciences (Kuhn, 1970), but this does not seem to be the case in the social and behavioral sciences. As Meehl (1978) pointed out nearly 20 years ago, theories in the behavioral sciences do not emerge healthier and stronger after a period of challenge and debate. Instead, our theories often fade away as we grow tired, confused, and frustrated by the lack of consistent research evidence. The reasons are many, including relatively crude measurement procedures and the lack of strong theories underlying our research endeavors (Platt, 1964; Rossi, 1985, 1990). But not least among our problems is that the accumulation of knowledge in the behavioral sciences often relies upon judgments and assessments of evidence that are rooted in Statistical Significance testing.
At the outset I should point out that I do not intend here to enumerate yet again the many problems associated with the significance testing paradigm. Many competent critiques have appeared in recent years (Cohen, 1994; Folger, 1989; Goodman & Royall, 1988; Oakes, 1986; Rossi, 1990; Schmidt, 1996); in fact, such criticisms are almost as old as the paradigm itself (Berkson, 1938, 1942; Bolles, 1962; Cohen, 1962; Grant, 1962; Jones, 1955; Kish, 1959; McNemar, 1960; Rozeboom, 1960). However, one consequence of significance testing is of special concern here. This is the practice of dichotomous interpretation of p values as the basis for deciding on the existence of an effect. That is, if p < .05, the effect exists. If p > .05, the effect does not exist.
Unfortunately, this is a common decision-making pattern in the social and behavioral sciences (Beauchamp & May, 1964; Cooper & Rosenthal, 1980; Cowles & Davis, 1982; Rosenthal & Gaito, 1963, 1964).
The consequences of this approach are bad enough for individual research studies: all too frequently, publication decisions are contingent on which side of the .05 line the test statistic lands. But the consequences for the accumulation of evidence across studies are even worse. As Meehl (1978) has indicated, most reviewers simply tend to "count noses" in assessing the evidence for an effect across studies. Traditional vote-counting methods generally underestimate the support for an effect and have been shown to have low statistical power (Cooper & Rosenthal, 1980; Hedges & Olkin, 1980). At the same time, those studies that find a statistically significant effect (and that are therefore more likely to be published) are in fact very likely to overestimate the actual strength of the effect (Lane & Dunlap, 1978; Schmidt, 1996). Combined with the generally poor power characteristics of many primary studies (Cohen, 1962; Rossi, 1990), the prospects for a meaningful cumulative science seem dismal.
- eBook - PDF
Validity and Social Experimentation
Donald Campbell′s Legacy
- Leonard Bickman(Author)
- 2000(Publication Date)
- SAGE Publications, Inc(Publisher)
Significance testing, therefore, marks a fork in the road for intervention research that has very pronounced implications for the interpretation of the effects of the intervention under investigation. As with any decision-making procedure, of course, there is a possibility of error. An analysis might show statistical significance when, in fact, there were no meaningful intervention effects, or fail to reach significance when there were. The validity of the statistical conclusion about the relationship of the independent and dependent variables is what Cook and Campbell (1979) called statistical conclusion validity. They described the situation as follows: Covariation is a necessary condition for inferring cause, and practicing scientists begin by asking of their data: Are the presumed independent and dependent variables related? Therefore, it is useful to consider the particular reasons why we can draw false conclusions about covariation. We shall call these reasons (which are threats to valid inference-making) threats to statistical conclusion validity, for conclusions about covariation are made on the basis of statistical evidence. (p. 37) Statistical conclusion validity thus was among the four types of validity Don Campbell discussed as relevant to experimental and quasi-experimental design in field settings. Among the four, however, it has received the least attention in his writings. On that basis, one might conclude that Campbell thought it was relatively unimportant or unproblematic. Indeed, within the historical context of his seminal volumes on quasi-experimental design (Campbell & Stanley, 1966; Cook & Campbell, 1979), there was little reason to believe that statistical conclusion validity was as troublesome in intervention research as internal validity or external validity. Within recent decades, however, evidence has mounted that the statistical conclusion validity of much intervention research is very questionable and does not justify complacency.
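The first kind of error described above (declaring significance when there is no real effect) can be demonstrated with a quick simulation. The sample size, number of simulated experiments, and use of a simple z-test with known spread are arbitrary choices made for illustration:

```python
import random
from statistics import mean

random.seed(0)
n, experiments = 30, 4000
false_positives = 0
for _ in range(experiments):
    # Both groups are drawn from the SAME population, so the null
    # hypothesis is true and any "significant" result is a Type I error.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = (mean(a) - mean(b)) / (2 / n) ** 0.5  # known sd = 1
    if abs(z) > 1.96:                          # two-sided test at alpha = .05
        false_positives += 1
print(f"Type I error rate ~ {false_positives / experiments:.3f}")
```

Under a true null hypothesis, roughly 5% of tests at the .05 level come out "significant", which is exactly the false-positive rate the threshold is designed to cap; the complementary Type II risk is what power analysis addresses.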
- eBook - PDF
Evaluating Learning Algorithms
A Classification Perspective
- Nathalie Japkowicz, Mohak Shah(Authors)
- 2011(Publication Date)
- Cambridge University Press(Publisher)
6 Statistical Significance Testing
The advances in performance measure characterization discussed in Chapters 3 and 4 have armed researchers with more precise estimates of classifier performance. However, these are not by themselves sufficient to fully evaluate the difference in performances between classifiers on one or more test domains. More precisely, even though the performance of different classifiers may be shown to be different on specified sets of data, it needs to be confirmed whether the observed differences are statistically significant and not merely coincidental. Chapter 5 started to look at this issue, but focused primarily on the objectivity and stability of the results. This can be construed as the first step to assessing the significance of a difference. Only in the case of the comparison of two classifiers on a single domain did the discussion actually move on to significance issues. Statistical Significance testing, which is the subject of this chapter, enables researchers to move on to more precise assessments of the significance of the results obtained (within certain constraints). The importance of statistical significance testing hence cannot be overstated. Nonetheless, the use of available statistical tools for such testing in the fields of machine learning and data mining has been limited at best. Researchers have concentrated on using the paired t test, many times inappropriately, to confirm the difference in classifiers' performance. Moreover, this has sometimes been done at the cost of excluding other, more appropriate, tests. Thus, although we have at our disposal a vast choice of tools to perform such testing, it is unarguably important for researchers in the field to be aware of these tests and, even more so, to understand the framework within which they operate.
- Bodo Winter(Author)
- 2019(Publication Date)
- Routledge(Publisher)
Inferential Statistics 1: Significance
Notice that at no point in this procedure did you directly compute a statistic that relates to the alternative hypothesis. Everything is computed with respect to the null hypothesis. Researchers commonly pretend that the alternative hypothesis is true when p < 0.05. However, this is literally pretense because the significance testing procedure has only measured the incompatibility of the data with the null hypothesis.
9.8. Chapter Conclusions
This chapter started with the fundamental notion that in inferential statistics sample estimates are used to make inferences about population parameters. This chapter covered the basics of null hypothesis significance testing (NHST). A null hypothesis is posited that is assumed to characterize a phenomenon in the population, such as the means of two groups being equal (μ1 = μ2). Then, sample data is collected to see whether the sample is incompatible with this original assumption. Three ingredients influence one's confidence in rejecting the null hypothesis: the magnitude of an effect, the variability in the data, and the sample size. Standardized effect size measures such as Cohen's d and Pearson's r combine two of these (magnitude and variability), but they ignore sample size. Standard errors and confidence intervals combine variability and sample size. The test statistics used in significance testing (such as t) combine all three ingredients, and they are used to compute the p-value. Once a p-value reaches a certain community standard (such as p < 0.05), a researcher may act as if the null hypothesis is to be rejected.
9.9. Exercises
9.9.1. Exercise 1: Gauging Intuitions About Cohen's d
In this exercise, you will generate some random data to gauge your intuitions about Cohen's d.
- Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, Roi Reichart(Authors)
- 2022(Publication Date)
- Springer(Publisher)
This shows that Statistical Significance tests and the calculation of the p-value are parallel tools that help quantify the likelihood of the observed results under the null hypothesis. In this chapter we move from describing the general framework of Statistical Significance testing to the specific considerations involved in the selection of a Statistical Significance test for an NLP application. We shall define the difference between parametric and nonparametric tests, and explore another important characteristic of the sample of scores that we work with, one that is highly critical for the design of a valid statistical test. We will present prominent tests useful for NLP setups, and conclude our discussion by providing a simple decision tree that aims to guide the process of selecting a significance test.
3.1 PRELIMINARIES
We previously presented an example of using the Statistical Significance testing framework for deciding between an LSTM and a phrase-based MT system, based on a certain dataset and evaluation metric, BLEU in our example. We defined our test statistic δ(X) as the difference in BLEU score between the two algorithms, and wanted to compute the p-value, i.e., the probability of observing such a δ(X) under the null hypothesis. But wait, how can we calculate this probability without knowing the distribution of δ(X) under the null hypothesis? Could we possibly choose a test statistic about which we have solid prior knowledge? A major consideration in the selection of a Statistical Significance test is the distribution of the test statistic, δ(X), under the null hypothesis. If the distribution of δ(X) is known, then the suitable test will come from the family of parametric tests, which use δ(X)'s distribution under the null hypothesis in order to obtain statistically powerful results, i.e., to have a small probability of making a type II error.
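When the distribution of the test statistic under the null hypothesis is unknown, one common nonparametric option is a paired sign-flip (permutation) test over per-example score differences. The sketch below uses invented per-document metric scores, not BLEU outputs from any real system, purely to show the mechanics:

```python
import random
from statistics import mean

random.seed(7)
# Hypothetical per-document metric scores for two paired systems
system_a = [0.31, 0.42, 0.29, 0.35, 0.40, 0.38, 0.33, 0.37, 0.41, 0.36]
system_b = [0.28, 0.39, 0.30, 0.31, 0.37, 0.35, 0.32, 0.33, 0.38, 0.34]

diffs = [a - b for a, b in zip(system_a, system_b)]
observed = mean(diffs)

# Sign-flip test: under the null hypothesis the two systems are
# interchangeable, so each paired difference is equally likely to have
# either sign. No assumption about the statistic's distribution is needed.
trials, extreme = 10_000, 0
for _ in range(trials):
    flipped = [d if random.random() < 0.5 else -d for d in diffs]
    if abs(mean(flipped)) >= abs(observed):
        extreme += 1
p_value = extreme / trials
print(f"mean difference = {observed:.3f}, p ~ {p_value:.4f}")
```

Because the null distribution is built from the data itself, this kind of test trades some statistical power for robustness, which is exactly the parametric-versus-nonparametric trade-off the chapter goes on to discuss.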
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.