### Stages

We gather data because we want to know something. These data are useful only if they provide information about what we want to know. A scientist usually seeks to develop knowledge in three stages. The first stage is to *describe* a class of scientific events and formulate hypotheses regarding the nature of the events. The second stage is to *explain* these events. The third stage is to *predict* the occurrence of these events. The ability to predict an event implies some level of understanding of the rule of nature governing the event. The ability to predict outcomes of actions allows the scientist to make better decisions about such actions. At best, a general scientific rule may be inferred from repeated events of this type.

The latter two stages of a scientific investigation will generally involve building a *statistical model*. A statistical model is an abstract concept but in the most general of terms it is a *mathematical model* that is built on a (generally) simplifying set of assumptions that attempts to explain the mechanistic or data-generating process that gave rise to the data one has observed. In this way a statistical model allows one to infer from a sample to the larger population by relying upon the assumptions made to describe the data-generating process.

While the topic may seem abstract, the reader has undoubtedly encountered and possibly utilized multiple statistical models in practice. As an example, we might look at body mass index, or BMI. BMI is an indicator of body weight adjusted for body size. The model is BMI = weight/height^{2}, where weight is measured in kilograms and height in meters. (If pounds and inches are used, the equation becomes BMI = 703 × weight/height^{2}.)

A 6-ft person (1.83 m) weighing 177 lb (80 kg) would have a BMI of 80/1.83^{2} = 23.9. To build such a model, we would start with weights and heights recorded for a representative sample of normal people. (We will ignore underweight for this example.) For a given height, there is an ideal weight, and the greater the excess weight, the poorer the health. But ideal weight varies with body size. If we plot weights for various heights, we find a curve that increases in slope as height increases, something akin to the way *y*^{2} looks when plotted for *x*, so we try height^{2}. For a fixed weight the body mass measure goes down as height goes up, so the height term should be a divider of weight, not a multiplier. Thus we have the BMI formula. Of course, many influences are ignored to achieve simplicity. A better model would adjust for muscle mass, bone density, and other factors, but such measures are hard to come by. Height and weight are normally in every person's medical history.
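The BMI arithmetic above can be checked with a short sketch. The function names `bmi_metric` and `bmi_imperial` are illustrative choices, not part of any standard library; the formulas are the two given in the text.

```python
def bmi_metric(weight_kg: float, height_m: float) -> float:
    """BMI = weight / height^2, with weight in kilograms and height in meters."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_imperial(weight_lb: float, height_in: float) -> float:
    """BMI = 703 * weight / height^2, with weight in pounds and height in inches."""
    if weight_lb <= 0 or height_in <= 0:
        raise ValueError("weight and height must be positive")
    return 703 * weight_lb / height_in ** 2


# The 6-ft (1.83 m, 72 in), 177-lb (80 kg) person from the text.
print(round(bmi_metric(80, 1.83), 1))    # 23.9
print(round(bmi_imperial(177, 72), 1))   # 24.0
```

The two results differ slightly only because 80 kg and 1.83 m are rounded conversions of 177 lb and 72 in.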

The model gives an estimate of, or approximation to, the influence of body weight on the person's health. More generally, *a model approximates a state or condition based on measurements of influencing variables*, whence its name: a model of the state, not a direct measure. The greater the predictive accuracy and reliability required of a model, the more complicated the model usually needs to be. Models are therefore trade-offs between accessibility of measures and simplicity of interpretation on the one hand and the requirement for accuracy on the other.

Sometimes it is necessary to formulate more complicated models in order to ensure better predictive accuracy. For example, the American College of Cardiology utilizes a model to estimate an individual's 10-year risk of atherosclerotic cardiovascular disease (ASCVD). This model utilizes 13 variables to obtain an estimate of the probability that an individual will experience ASCVD within the next 10 years. These variables include factors such as age, sex, weight, smoking status, systolic and diastolic blood pressure, cholesterol levels, and medication use. The model then weights each of these factors in order to compute an estimate of ASCVD risk. Because of its complexity, this model, unlike BMI, is not easy to write down and communicate. Instead, it is easier to produce an "online calculator" that takes in each of the influencing variables and, behind the scenes, feeds these values into the model to report a final estimate of the probability of ASCVD. As an example, the ASCVD online calculator from the American College of Cardiology can be found at http://tools.acc.org/ASCVD-Risk-Estimator-Plus.

Following is a brief explanation of the three stages of gathering knowledge.

#### The causative process is of interest, not the data

A process, or set of forces, generates data related to an event. It is this process, not the data per se, that interests us.

*Description:* The stage in which we seek to describe the data-generating process in cases for which we have data from that process. Description would answer questions such as: What is the range of prostate volumes for a sample of urology patients? What is the difference in average volume between patients with negative biopsy results and those with positive results?

*Explanation:* The stage in which we seek to *infer* characteristics of the (overall) data-generating process when we have only part (usually a small part) of the possible data. Inference would answer questions such as: Based on a sample of patients with prostate problems, are the average volumes of patients with positive biopsy results less than those of patients with negative biopsy results, for all men with prostate problems? Such inferences usually take the form of tests of hypotheses.

*Prediction:* The stage in which we seek to make predictions about a characteristic of the data-generating process on the basis of newly taken related observations. Such a prediction would answer questions such as: On the basis of a patient's negative digital rectal examination, prostate-specific antigen of 9, and prostate volume of 30 mL, what is the probability that he has prostate cancer? Such predictions allow us to make decisions on how to treat our patients to change the chances of an event. For example, should I perform a biopsy on my patient? Predictions usually take the form of a mathematical model of the relationship between the predicted (dependent) variable and the predictor (independent) variables.
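To make the idea of a prediction model concrete, the following is a minimal sketch of how predictor values might be turned into a probability. It assumes a logistic form, which the text does not specify, and every coefficient in `HYPOTHETICAL_COEF` is a made-up placeholder chosen only so the code runs. The numbers are not estimates from any data and must not be used clinically.

```python
import math

# HYPOTHETICAL coefficients: placeholders for illustration only,
# not fitted to any data set and not clinically meaningful.
HYPOTHETICAL_COEF = {
    "intercept": -2.0,
    "dre_positive": 1.0,   # digital rectal examination result (1 = positive)
    "psa": 0.1,            # prostate-specific antigen level
    "volume_ml": -0.02,    # prostate volume in mL
}


def predicted_probability(dre_positive: bool, psa: float, volume_ml: float) -> float:
    """Combine the predictors into a weighted sum, then map it to (0, 1)."""
    linear = (
        HYPOTHETICAL_COEF["intercept"]
        + HYPOTHETICAL_COEF["dre_positive"] * (1.0 if dre_positive else 0.0)
        + HYPOTHETICAL_COEF["psa"] * psa
        + HYPOTHETICAL_COEF["volume_ml"] * volume_ml
    )
    # Logistic function: converts any real-valued score to a probability.
    return 1.0 / (1.0 + math.exp(-linear))


# The patient from the text: negative DRE, PSA of 9, volume of 30 mL.
p = predicted_probability(dre_positive=False, psa=9, volume_ml=30)
print(f"illustrative probability: {p:.3f}")
```

In a real investigation the coefficients would be estimated from a sample, and the resulting model would be validated before being used to guide a decision such as whether to perform a biopsy.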