Part I
Methodological Aspects
1 Exploratory Data Mining Using Decision Trees in the Behavioral Sciences
John J. McArdle
Introduction
This first chapter starts off with a discussion of confirmatory versus exploratory analyses in behavioral research, and exploratory approaches are considered most useful. Decision Tree Analysis (DTA) is defined in historical and technical detail. Four real-life examples are presented to give a flavor of what is now possible with DTA: (1) Predicting Coronary Heart Disease from Age; (2) Some New Approaches to the Classification of Alzheimer's Disease; (3) Exploring Predictors of College Academic Performances from High School; and (4) Exploring Patterns of Changes in Longitudinal WISC Data. In each case, current questions regarding DTA are raised. The discussion that follows considers the benefits and limitations of this exploratory approach, and the author concludes that confirmatory analyses should always be done first, but should at all times be followed by exploratory analyses.
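To give a concrete sense of what a fitted decision tree looks like before the formal definition below, here is a minimal sketch using scikit-learn on simulated data. The single predictor ("age"), the outcome, and all parameter values are illustrative assumptions for this sketch; they are not the chapter's actual coronary heart disease data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Simulated predictor and binary outcome (illustrative only, not real data)
age = rng.uniform(30, 80, size=200)
outcome = (age + rng.normal(0, 10, size=200) > 60).astype(int)

# A shallow tree: DTA searches for optimal cut points on the predictor
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(age.reshape(-1, 1), outcome)

# Print the induced decision rules (splits on "age")
print(export_text(tree, feature_names=["age"]))
```

The printed rules show the defining feature of DTA: the data themselves suggest the cut points, rather than the analyst specifying a functional form in advance.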
The term "exploratory" is considered by many as less than an approach to data analysis and more a confession of guilt: a dishonest act has been performed with one's data. This becomes obvious when we reflexively recoil at the thought of exploratory methods, or when immediate rejections occur when one proposes research exploration in a research grant application, or when one tries to publish new results found by exploration. We need to face up to the fact that we now have a clear preference for confirmatory and a priori testing of well-formulated research hypotheses in psychological research. One radical interpretation of this explicit preference is that we simply do not yet trust one another.
Unfortunately, as many researchers know, quite the opposite is actually the truth. That is, it can be said that exploratory analyses predominate in our actual research activities. To be more extreme, we can assert there is actually no such thing as a true confirmatory analysis of data, nor should there be. Either way, we can try to be clearer about this problem. We need better responses when well-meaning students and colleagues ask, "Is it OK to do procedure X?" I assume they are asking, "Is there a well-known probability basis for procedure X, and will I be able to publish it?" Fear of rejection is strong among many good researchers, and one side effect is that rejection leaves scientific creativity only to the bold. As I will imply several times here, the only real requirement for a useful data analysis is that we remain honest (see McArdle, 2010).
When I was searching around for materials on this topic I stumbled upon the informative work by Berk (2009) where he starts out by saying:
As I was writing my recent book on regression analysis (Berk, 2003), I was struck by how few alternatives to conventional regression there were. In the social sciences, for example, one either did causal modeling econometric style, or largely gave up quantitative work ... The life sciences did not seem quite as driven by causal modeling, but causal modeling was a popular tool. As I argued at length in my book, causal modeling as commonly undertaken is a loser.
There also seemed to be a more general problem. Across a range of scientific disciplines there was often too little interest in statistical tools emphasizing induction and description. With the primary goal of getting the "right" model and its associated p-values, the older and more interesting tradition of exploratory data analysis had largely become an under-the-table activity: the approach was in fact commonly used, but rarely discussed in polite company. How could one be a real scientist, guided by "theory" and engaged in deductive model testing, while at the same time snooping around in the data to determine which models to test? In the battle for prestige, model testing had won.
At the same time, I became aware of some new developments in applied mathematics, computer sciences, and statistics making data exploration a virtue. And with this virtue came a variety of new ideas and concepts, coupled with the very latest in statistical computing. These new approaches, variously identified as "data mining," "statistical learning," "machine learning," and other names, were being tried in a number of natural and biomedical sciences, and the initial experience looked promising.
As I started to read more deeply, however, I was struck by how difficult it was to work across writings from such disparate disciplines. Even when the material was essentially the same, it was very difficult to tell if it was. Each discipline brought its own goals, concepts, naming conventions, and (maybe worst of all) notation to the table. Finally, there is the matter of tone. The past several decades have seen the development of a dizzying array of new statistical procedures, sometimes introduced with the hype of a big-budget movie. Advertising from major statistical software providers has typically made things worse. Although there have been genuine and useful advances, none of the techniques have ever lived up to their original billing. Widespread misuse has further increased the gap between promised performance and actual performance. In this book, the tone will be cautious, some might even say dark ...
(p. xi)
The problems raised by Berk (2009) are pervasive and we need new ways to overcome them. In my own view, the traditional use of the simple independent groups t-test should have provided our first warning message that something was wrong about the standard "confirmatory" mantras. For example, we know it is fine to calculate the classic test of the mean difference between two groups and calculate the "probability of equality" or "significance of the mean difference" under the typical assumptions (i.e., random sampling of persons, random assignment to groups, equal variance within cells). But we also know it is not appropriate to achieve significance by: (a) using another variable when the first variable fails to please, (b) getting data on more people until the observed difference is significant, (c) using various transformations of the data until we achieve significance, (d) tossing out outliers until we achieve significance, (e) examining possible differences in the variance instead of the means when we do not get what we want, (f) accepting a significant difference in the opposite direction to the one we originally thought. I assume all good researchers do these kinds of things all the time. In my view, the problem is not with us but with the way we are taught to revere the apparent objectivity of the t-test approach. It is bound to be even more complex when we use this t-test procedure over and over again in hopes of isolating multivariate relationships.
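The classic calculation referred to above can be reproduced in a few lines. This is a minimal sketch using SciPy's independent-groups t-test on simulated scores; the group means, standard deviations, and sample sizes are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated scores for two independent groups (illustrative values only)
group_a = rng.normal(loc=100.0, scale=15.0, size=40)
group_b = rng.normal(loc=108.0, scale=15.0, size=40)

# Classic test of the mean difference, under the usual
# equal-variance-within-cells assumption
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Note that each of the questionable practices (a) through (f) amounts to rerunning a calculation like this one until p falls below the criterion, at which point the stated probability no longer means what it claims to mean.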
For similar reasons, the one-way analysis of variance (ANOVA) should have been our next warning sign about the overall statistical dilemma. When we have three or more groups and perform a one-way ANOVA we can consider the resulting F-ratio as an indicator of "any group difference." In practice, we can calculate ...
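The omnibus F-ratio for "any group difference" can likewise be computed directly. The following is a minimal sketch with three simulated groups using SciPy's one-way ANOVA; the group parameters are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Three simulated independent groups (illustrative values only)
g1 = rng.normal(loc=100.0, scale=15.0, size=30)
g2 = rng.normal(loc=105.0, scale=15.0, size=30)
g3 = rng.normal(loc=110.0, scale=15.0, size=30)

# One-way ANOVA: the F-ratio indexes "any group difference" among the means
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

A significant F says only that some difference exists somewhere among the group means; which groups differ, and in what pattern, requires further (and typically exploratory) probing.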