Part I
Introduction
Module 1
Introduction and Overview
Thousands of important, and oftentimes life-altering, decisions are made every day. Who should we hire? Which students should be placed in accelerated or remedial programs? Which defendants should be incarcerated and which paroled? Which treatment regimen will work best for a given client? Should custody of this child be granted to the mother or the father or the grandparents? In each of these situations, a “test” may be used to help provide guidance. There are many vocal opponents to the use of standardized tests to make such decisions. However, the bottom line is that these critical decisions will ultimately be made with or without the use of test information. The question we have to ask ourselves is, “Can a better decision be made with the use of relevant test information?” In many, although not all, instances, the answer will be yes, if a well-developed and appropriate test is used in combination with other relevant, well-justified information available to the decision maker. What many individuals object to is the use of standardized tests as the sole basis for making an important, sometimes life-altering, decision. Thus, it would behoove any decision maker to take full advantage of other relevant, well-justified information, where available, to make the best and most informed decision possible.
A quick point regarding “other relevant and well-justified information” is in order. What one decision maker sees as “relevant” may not seem relevant and well justified to another constituent in the testing process. For example, as one of the reviewers of an earlier edition of this book pointed out, a manager in an organization may be willing to use tests that demonstrate validity and reliability for selecting workers in his organization. However, he may ultimately decide to rely more heavily on what he deems to be “other relevant information,” which in fact is simply his belief in his own biased intuition about people or non-job-relevant information obtained from social media profiles. To this manager, his intuitions, or non-systematic information gathered from social media profiles, are legitimate “other relevant information” beyond test scores. However, others in the testing process may not view the manager’s intuitions, or non-systematic information obtained from social media profiles, as relevant. Thus, when we say that other relevant information beyond well-developed and validated tests should be used when appropriate, we are not talking about information such as intuition (which should be distinguished from professional judgment, which, more often than not, is in fact relevant) or non-systematic information obtained from, say, casually perusing a job applicant’s social media profiles. Rather, we are referring to additional relevant information such as professional references, systematic background checks, structured observations, professional judgments, and the like; that is, additional information that can be well justified, as well as systematically developed, collected, and evaluated. Thus, we are not recommending collecting and using additional information beyond tests simply for the sake of doing so.
Rather, any “other relevant information” that is used in addition to test information to make critical decisions should be well justified and supported by professional standards, as well as appropriate for the context in which it is proposed.
What Makes Tests Useful
Tests can take many forms, from traditional paper-and-pencil exams to portfolio assessments, job interviews, case histories, behavioral observations, computer adaptive assessments, and peer ratings, to name just a few. The common theme in all of these assessment procedures is that they represent a sample of behaviors from the test taker. Thus, psychological testing is similar to any science in that a sample is taken to make inferences about a population. In this case, the sample consists of behaviors (e.g., test responses on a paper-and-pencil test or performance of physical tasks on a physical ability test) from a larger domain of all possible behaviors representing a construct. For example, the first test we take when we come into the world is called the APGAR test. That’s right, just one minute into the world we get our first test. You probably do not remember your score on your APGAR test, but our guess is your mother does, given the importance this first test has in revealing your initial physical functioning. The purpose of the APGAR test is to assess a newborn’s general functioning right after birth. Table 1.1 displays the five categories that newborn infants are tested on at one and five minutes after birth: Appearance, Pulse, Grimace, Activity, and Respiration (hence, the acronym APGAR). A score is obtained by summing the newborn infant’s assessed value on each of the dimensions. Scores can range from 0 to 10. A score of 7–10 is considered normal. A score of 4–6 indicates that the newborn infant may require some resuscitation, while a score of 3 or less means the newborn would require immediate and intensive resuscitation. The infant is then assessed again at five minutes, and if the score still is below a 7, the infant may be assessed again at 10 minutes. If the infant’s APGAR score is 7 or above five minutes after birth, which is typical, then no further intervention is called for.
Hence, by taking a relatively small sampling of behavior, we are (or at least a competent obstetrics nurse or doctor is) able to quickly, and quite accurately, assess the functioning of a newborn infant to determine if resuscitation interventions are required to help the newborn function properly.
Table 1.1 The APGAR Test Scoring Table
Sign | 0 points | 1 point | 2 points |
Appearance (color) | Pale or blue | Body pink, extremities blue | Pink (normal for non-Caucasian) |
Pulse (heartbeat) | Not detectible | Lower than 100 bpm | Higher than 100 bpm |
Grimace (reflex) | No response | Grimace | Lusty cry |
Activity (muscle tone) | Flaccid | Some movement | A lot of activity |
Respiration (breathing) | None | Slow, irregular | Good (crying) |
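The scoring rule described for Table 1.1 (five signs, each rated 0 to 2, summed and then read against three score bands) can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the text, not a clinical tool. The function names `apgar_total` and `apgar_band`, the band labels, and the example ratings are hypothetical and were chosen for this sketch.

```python
# The five APGAR signs listed in Table 1.1; each is rated 0, 1, or 2.
APGAR_SIGNS = ("appearance", "pulse", "grimace", "activity", "respiration")


def apgar_total(ratings):
    """Sum the five sign ratings into a total score from 0 to 10."""
    missing = [sign for sign in APGAR_SIGNS if sign not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for sign in APGAR_SIGNS:
        if ratings[sign] not in (0, 1, 2):
            raise ValueError(f"{sign} must be rated 0, 1, or 2")
    return sum(ratings[sign] for sign in APGAR_SIGNS)


def apgar_band(total):
    """Map a total score onto the three interpretive bands given in the text."""
    if not 0 <= total <= 10:
        raise ValueError("total must be between 0 and 10")
    if total >= 7:
        return "normal"
    if total >= 4:
        return "may require some resuscitation"
    return "immediate and intensive resuscitation"


# Hypothetical one-minute ratings for a single newborn (invented for illustration).
one_minute = {
    "appearance": 1,
    "pulse": 2,
    "grimace": 1,
    "activity": 1,
    "respiration": 1,
}
total = apgar_total(one_minute)
print(total, "->", apgar_band(total))
```

With these invented ratings the total is 6, which falls in the 4–6 band, so under the rule in the text the infant would be reassessed at five minutes.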
The utility of any assessment device, however, will depend on the qualities of the test and the intended use of the test. Test information can be used for a variety of purposes, from making predictions about the likelihood that a patient will commit suicide to making personnel selection decisions by determining which entry-level workers to hire. Tests can also be used for classification purposes, as when students are designated as remedial, gifted, or somewhere in between. Tests can also be used for evaluation purposes, as in the use of a classroom test to evaluate performance of students in a given subject matter. Counseling psychologists routinely use tests to assess clients for emotional adjustment problems or possibly for help in providing vocational and career counseling. Finally, tests can also be used for research-only purposes, such as when an experimenter uses a test to prescreen study participants to assign each one to an experimental condition. If the test is not used for its intended purpose, however, it will not be very useful and, in fact, may actually be harmful. As Anastasi and Urbina (1997) note, “Psychological tests are tools … Any tool can be an instrument of good or harm, depending on how it is used” (p. 2).
For example, most American children in grades 2–12 are required to take standardized tests on a yearly basis. These tests were initially intended for the sole purpose of assessing students’ learning outcomes. Over time, however, a variety of other uses, including misuses, for these tests have emerged. For instance, they are frequently used to determine school funding and, in some cases, teachers’ or school administrators’ “merit” pay. However, given that determining the pay levels of educational employees was not the intended use of such standardized educational tests when they were developed, they almost always serve poorly in this capacity. Thus, a test that was developed with good (i.e., appropriate) intentions can be (mis)used for inappropriate purposes, limiting the usefulness of the test. In this instance, not only is the test of little use in setting pay for teachers and administrators, but it may also be causing harm to students by coercing teachers to “teach to the test,” thereby trading long-term gains in learning for short-term increases in standardized test performance.
In addition, no matter how the test is used, it will only be useful if it meets certain psychometric and practical requirements. From a psychometric or measurement standpoint, we want to know if the test is accurate, standardized, and reliable; if it demonstrates evidence of validity; and if it is free of both measurement and predictive bias. Procedures for determining these psychometric qualities form the core of the rest of this book. From a practical standpoint, the test must be cost effective as well as relatively easy to administer and score. Reflecting on our earlier example, we would surmise that the APGAR meets most of these practical requirements. Trained doctors and nurses in a hospital delivery room can administer the APGAR quickly and efficiently. Our key psychometric concern in this situation may be how often different doctors and nurses are able to provide similar APGAR scores in a given situation (i.e., the inter-rater reliability of the APGAR).
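To make the inter-rater concern concrete, the sketch below compares APGAR totals assigned by two raters to the same newborns. The text does not say which agreement index should be used; exact agreement and Cohen's kappa are two common choices and are used here as assumptions. The rater data are invented, and the helper names (`exact_agreement`, `cohens_kappa`, `band`) are hypothetical.

```python
from collections import Counter


def exact_agreement(values_a, values_b):
    """Proportion of cases on which two raters gave the identical value."""
    if len(values_a) != len(values_b) or not values_a:
        raise ValueError("need two equal-length, non-empty lists")
    matches = sum(a == b for a, b in zip(values_a, values_b))
    return matches / len(values_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: categorical agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = exact_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: sum over categories of the product of marginal proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def band(total):
    """Score bands from the text: 7-10, 4-6, and 3 or less (labels are ours)."""
    return "normal" if total >= 7 else "some" if total >= 4 else "immediate"


# Hypothetical APGAR totals given by two raters to the same eight newborns.
rater_a = [8, 9, 6, 7, 3, 9, 5, 8]
rater_b = [8, 9, 7, 7, 3, 8, 5, 8]

print("exact agreement on totals:", exact_agreement(rater_a, rater_b))

bands_a = [band(t) for t in rater_a]
bands_b = [band(t) for t in rater_b]
print("kappa on score bands:", round(cohens_kappa(bands_a, bands_b), 3))
```

In this invented data set the raters give identical totals for six of the eight newborns, and they disagree on the score band for only one. Agreement on the band is arguably what matters most in practice, since the band is what drives the decision to intervene.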
Individual Differences
Ultimately, those interested in applied psychological measurement are usually interested in some form of individual differences (i.e., how individuals differ on test scores and the underlying traits being measured by those tests). If there are no differences in how target individuals score on the test, then the test will have little value to us. For example, if we give a group of elite athletes the standard physical ability test given to candidates for a police officer job, there will likely be very little variability in scores, with all the athletes scoring extremely high on the test. Thus, the test data would provide little value in predicting which athletes would make good police officers. On the other hand, if we took a more typical group of job candidates who had passed previous hurdles in the personnel selection process for police officer (e.g., cognitive tests, background checks, psychological evaluations) and administered the same physical ability test to them, we would see much wider variability in scores. Thus, the test would at least have the potential to be a useful predictor of job success, as we would have at least some variability in the observed test scores.
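The contrast between the two groups can be illustrated by comparing the spread of their scores. The sketch below is a minimal example using made-up physical ability scores on an assumed 0 to 100 scale. The helper name `score_spread` and both data sets are hypothetical.

```python
import statistics


def score_spread(scores):
    """Return the range and sample standard deviation of a set of test scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    return {
        "range": max(scores) - min(scores),
        "sd": statistics.stdev(scores),
    }


# Hypothetical physical ability scores (0-100 scale), invented for illustration.
elite_athletes = [98, 99, 97, 99, 98, 100]
typical_candidates = [55, 72, 88, 64, 93, 79]

for label, scores in (
    ("elite athletes", elite_athletes),
    ("typical candidates", typical_candidates),
):
    spread = score_spread(scores)
    print(f"{label}: range = {spread['range']}, SD = {spread['sd']:.2f}")
```

When nearly everyone scores at the ceiling, as in the first group, the scores barely distinguish one person from another and so carry little information for prediction. The wider spread in the second group is what gives the test at least the potential to predict job success.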
Individual differences on psychological tests can take several different forms. Typically, we look at inter-individual differences where we examine differences on the same construct across individuals. In such cases, the desire is usually prediction. That is, how well does the test predict some criterion of interest? For example, in the preceding scenario, we would u...