Chapter 1
Variation
If there were no variation, if every observation were predictable, a mere repetition of what had gone before, there would be no need for statistics.
In this chapter, you'll learn what statistics is all about, variation and its potential sources, and how to use R to display the data you've collected. You'll start to acquire additional vocabulary, including such terms as accuracy and precision, mean and median, and sample and population.
1.1 VARIATION
We find physics extremely satisfying. In high school, we learned the formula S = VT, which in symbols relates the distance traveled by an object to its velocity multiplied by the time spent in traveling. If the speedometer says 60 mph, then in half an hour, you are certain to travel exactly 30 mi. Except that during our morning commute, the speed we travel is seldom constant, and the formula is not really applicable. Yahoo Maps told us it would take 45 minutes to get to our teaching assignment at UCLA. Alas, it rained and it took us two and a half hours.
Politicians always tell us the best that can happen. If a politician had spelled out the worst-case scenario, would the United States have gone to war in Iraq without first gathering a great deal more information?
In college, we had the ideal gas law, V = KT/P, with its tidy relationship between the volume V, temperature T, and pressure P of a perfect gas. This is just one example of the perfection encountered there. The problem was we could never quite duplicate this (or any other) law in the freshman physics laboratory. Maybe it was the measuring instruments, our lack of familiarity with the equipment, or simple measurement error, but we kept getting different values for the constant K.
By now, we know that variation is the norm. Instead of getting a fixed, reproducible volume V to correspond to a specific temperature T and pressure P, one ends up with a distribution of values of V as a result of errors in measurement. But we also know that with a large enough representative sample (defined later in this chapter), the center and shape of this distribution are reproducible.
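To make this concrete, here is a minimal R sketch that simulates repeated determinations of V from the relationship V = KT/P. Every number in it (the constant K, the temperature, the pressure, and the size of the measurement error) is an assumption chosen for illustration, as is the helper function measure_volume; none of it comes from an actual laboratory.

```r
# Simulate repeated laboratory determinations of V = K * T / P.
# All numeric values are illustrative assumptions, not data from the text.
set.seed(1)                        # make the simulation repeatable

K      <- 8.314                    # assumed constant (arbitrary units)
T_true <- 300                      # assumed temperature
P_true <- 100                      # assumed pressure
V_true <- K * T_true / P_true      # the volume the formula predicts

# Hypothetical helper: each measurement is the predicted volume
# plus a random measurement error with standard deviation error_sd.
measure_volume <- function(n, error_sd = 0.5) {
  V_true + rnorm(n, mean = 0, sd = error_sd)
}

small1 <- measure_volume(5)
small2 <- measure_volume(5)
large1 <- measure_volume(5000)
large2 <- measure_volume(5000)

# Individual measurements differ from one another and from V_true
print(round(small1, 2))
print(round(small2, 2))

# The centers of two large samples agree closely
print(c(mean(large1), mean(large2)))
print(c(median(large1), median(large2)))

# So do their shapes, summarized here by the quartiles
print(quantile(large1, c(0.25, 0.50, 0.75)))
print(quantile(large2, c(0.25, 0.50, 0.75)))
```

Run it a few times with different seeds: the small samples jump about, while the means, medians, and quartiles of the large samples barely move.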
Here's more good and bad news: Make astronomical, physical, or chemical measurements and the only variation appears to be due to observational error. Purchase a more expensive measuring device and get more precise measurements and the situation will improve.
But try working with people. Anyone who spends any time in a schoolroom, whether as a parent or as a child, soon becomes aware of the vast differences among individuals. Our most distinct memories are of how large the girls were in the third grade (ever been beaten up by a girl?) and the trepidation we felt on the playground whenever teams were chosen (not right field again!). Much later, in our college days, we were to discover there were many individuals capable of devouring larger quantities of alcohol than we could without noticeable effect. And a few, mostly of other nationalities, whom we could drink under the table.
Whether or not you imbibe, we're sure you've had the opportunity to observe the effects of alcohol on others. Some individuals take a single drink and their nose turns red. Others can't seem to take just one drink.
Despite these obvious differences, scheduling for outpatient radiology at many hospitals is done by a computer program that allots exactly 15 minutes to each patient. Well, I've news for them and their computer. Occasionally, the technologists are left twiddling their thumbs. More often the waiting room is overcrowded because of routine exams that weren't routine or where the radiologist wanted additional X-rays. (To say nothing of those patients who show up an hour or so early or a half hour late.)
The majority of effort in experimental design, the focus of Chapter 6 of this text, is devoted to finding ways in which this variation from individual to individual won't swamp or mask the variation that results from differences in treatment or approach. It's probably safe to say that what distinguishes statistics from all other branches of applied mathematics is that it is devoted to characterizing and then accounting for variation in the observations.
Consider the Following Experiment
You catch three fish. You heft each one and estimate its weight; you weigh each one on a pan scale when you get back to the dock, and you take them to a chemistry laboratory and weigh them there. Your two friends on the boat do exactly the same thing. (All but Mike; the chemistry professor catches him in the lab after hours and calls campus security. This is known as missing data.)
The 26 weights you've recorded (3 × 3 × 3 − 1 when they nabbed Mike) differ as a result of measurement error, observer error, differences among observers, differences among measuring devices, and differences among fish.
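One way to picture the layout of this experiment is as a table with one row per fish, method, and observer. The R sketch below builds such a table and fills it with simulated weights. It rests on several assumptions of our own: the fish weights, the error size we assign to each weighing method, the label "friend" for the second friend, and the choice of which one of Mike's laboratory weighings goes missing are all hypothetical.

```r
# Lay out the fish-weighing experiment: 3 fish x 3 methods x 3 observers.
# All weights and error sizes are hypothetical, chosen only for illustration.
set.seed(2)

design <- expand.grid(
  fish     = c("fish1", "fish2", "fish3"),
  method   = c("heft", "pan scale", "lab balance"),
  observer = c("you", "friend", "Mike"),
  stringsAsFactors = FALSE
)

# Assumed true weights (kg) and assumed measurement error for each method
true_weight <- c(fish1 = 1.2, fish2 = 2.5, fish3 = 0.8)
method_sd   <- c("heft" = 0.30, "pan scale" = 0.05, "lab balance" = 0.001)

# Recorded weight = true weight of that fish + error that depends on the method
design$weight <- unname(true_weight[design$fish]) +
  rnorm(nrow(design), mean = 0, sd = method_sd[design$method])

# Missing data: one of Mike's laboratory weighings was never recorded
# (which one is an arbitrary choice made here)
lost <- which(design$observer == "Mike" & design$method == "lab balance")[1]
design$weight[lost] <- NA

print(nrow(design))                  # 27 planned measurements
print(sum(!is.na(design$weight)))    # 26 recorded weights

# Spread among observers for each fish and method (missing value dropped)
print(aggregate(weight ~ fish + method, data = design, FUN = sd))
```

In the simulated table, the standard deviations for hefting are far larger than those for the laboratory balance, while the differences among fish show up whichever method is used.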
1.2 COLLECTING DATA
The best way to observe variation is for you, the reader, to collect some data. But before we make some suggestions, a few words of caution are in order: 80% of the effort in any study goes into data collection and preparation for data collection. Any effort you don't expend initially goes into cleaning up the resulting mess. Or, as my carpenter friends put it, "measure twice; cut once."
We constantly receive letters and emails asking which statistic we would use to rescue a misdirected study. We know of no magic formula, no secret procedure known only to statisticians with a PhD. The operative phrase is GIGO: garbage in, garbage out. So think carefully before you embark on your collection effort. Make a list of possible sources of variation and see if you can eliminate any that are unrelated to the objectives of your study. If midway through, you think of a better method, don't use it.* Any inconsistency in your procedure will only add to the undesired variation.
1.2.1 A Worked-Through Example
Let's get started. Suppose we were to record the time taken by an individual to run around the school track. Before turning the page to see a list of some possible sources of variation, test yourself by writing down a list of all the factors you feel will affect the individual's performance. Obviously, the running time will depend upon the individual's sex, age, weight (for height and age), and race. It also will depend upon the weather, as I can testify from personal experience.
Soccer referees are required to take an annual physical examination that includes a mile and a quarter run. On a cold March day, the last time I took the exam in Michigan, I wore a down parka. Halfway through the first lap, a light snow began to fall that melted as soon as it touched my parka. By the third go around the track, the down was saturated with moisture and I must have been carrying a dozen extra pounds. Needless to say, my running speed varied considerably over the mile and a quarter.
As we shall see in the chapter on analyzing experiments, we can't just add the effects of the various factors, for they often interact. Consider that Kenyans dominate the long-distance races, while Jamaicans and African-Americans do best in...