1
Introduction
Whatever you are able to do with your might, do it.
Kohelet 9:10
1.1 The Personal Computer and Statistics
The personal computer (PC) has changed everything, for both better and worse, in the world of statistics. The PC can effortlessly produce precise calculations and eliminate the computational burden associated with statistics. One needs only to provide the right information. With minimal knowledge of statistics, the user points to the location of the input data, selects the desired statistical procedure, and directs the placement of the output. Thus, tasks such as testing, analyzing, and tabulating raw data into summary measures, along with many other statistical computations, are fairly rote. The PC has advanced statistical thinking in the decision-making process, as evidenced by visual displays such as bar charts and line graphs, animated three-dimensional rotating plots, and interactive marketing models found in management presentations. The PC also facilitates support documentation, which includes the calculations for measures such as mean profit across market segments from a marketing database; statistical output is copied from the statistical software and then pasted into the presentation application. Interpreting the output and drawing conclusions still require human intervention.
Unfortunately, the confluence of the PC and the world of statistics has turned generalists with minimal statistical backgrounds into quasi-statisticians and affords them a false sense of confidence because they can now produce statistical output. For instance, calculating the mean profit is standard fare in business. However, the mean provides a "typical value" only when the distribution of the data is symmetric. In marketing databases, the distribution of profit commonly has a positive skewness.* Thus, the mean profit is not a reliable summary measure.† The quasi-statistician would doubtless not know to check this supposition, thus rendering the interpretation of the mean profit as floccinaucinihilipilification.‡
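To make the distinction concrete, the short sketch below (hypothetical data, not from the text) contrasts the mean and the median of a positively skewed profit distribution; the lognormal draw and all numbers are illustrative assumptions.

```python
# A minimal sketch: why the mean misleads on positively skewed profit data.
# The data are hypothetical (a lognormal draw), chosen only to show skewness.
import numpy as np

rng = np.random.default_rng(0)
profit = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # long right tail

print(f"mean profit:   {profit.mean():.2f}")      # pulled upward by the tail
print(f"median profit: {np.median(profit):.2f}")  # closer to a typical customer
```

On such data the mean lands well above the median, which is why the median (or a trimmed mean) is the safer summary of a typical profit.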
Another example of how the PC fosters a "quick-and-dirty"§ approach to statistical analysis is in the use of the ubiquitous correlation coefficient (second in popularity to the mean as a summary measure), which measures the association between two variables. There is an assumption, namely that the underlying relationship between the two variables is linear (a straight line), that must be met for the proper interpretation of the correlation coefficient. Rare is the quasi-statistician who is aware of this assumption. Meanwhile, well-trained statisticians often do not check it either, a habit developed by the uncritical use of statistics with the PC.
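A quick numerical check makes the point. In the hypothetical sketch below, y is completely determined by x, yet the Pearson correlation is essentially zero because the relationship is quadratic rather than linear.

```python
# Illustrative sketch: a perfect but nonlinear relationship can produce
# a Pearson correlation near zero, so r alone says little without a plot.
import numpy as np

x = np.linspace(-3, 3, 201)
y = x ** 2  # y is a deterministic function of x, but not a linear one

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")  # approximately 0 despite perfect dependence
```

A scatter plot of x against y would reveal the curvature immediately, which is why plotting the data should precede reporting the coefficient.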
The PC with its unprecedented computational strength has also empowered professional statisticians to perform proper analytical due diligence; for example, without it, the natural seven-step cycle of statistical analysis would not be practical [1]. The PC and the analytical cycle make a perfect pairing as long as the work starts at Step 1 and continues straight through Step 7, without a break in the cycle. Unfortunately, statisticians are human and succumb to taking shortcuts through the seven-step cycle: they ignore the cycle and focus solely on the sixth step. A careful statistical endeavor requires performance of all the steps in the seven-step cycle.* The seven-step sequence is as follows:
- Definition of the problem: Determining the best way to tackle the problem is not always obvious. Management objectives are often expressed qualitatively, in which case the selection of the outcome or target (dependent) variable is subjectively biased. When the objectives are clearly stated, the appropriate dependent variable is often not available, in which case a surrogate must be used.
- Determining technique: The technique first selected is often the one with which the data analyst is most comfortable; it is not necessarily the best technique for solving the problem.
- Use of competing techniques: Applying alternative techniques increases the odds that a thorough analysis is conducted.
- Rough comparisons of efficacy: Comparing variability of results across techniques can suggest additional techniques or the deletion of alternative techniques, as the sketch after this list illustrates.
- Comparison in terms of a precise (and thereby inadequate) criterion: An explicit criterion is difficult to define. Therefore, precise surrogates are often used.
- Optimization in terms of a precise and inadequate criterion: An explicit criterion is difficult to define. Therefore, precise surrogates are often used.
- Comparison in terms of several optimization criteria: This constitutes the final step in determining the best solution.
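Steps 2 through 4 in particular lend themselves to a brief illustration. The hedged sketch below tries several competing techniques on the same data and makes a rough comparison of their efficacy; it assumes scikit-learn is available, and the synthetic data and choice of models are illustrative only.

```python
# A sketch of steps 2-4: apply competing techniques, then roughly compare
# the level and variability of their results. All choices here are illustrative.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

candidates = {
    "linear regression":   LinearRegression(),
    "decision tree":       DecisionTreeRegressor(random_state=0),
    "k-nearest neighbors": KNeighborsRegressor(n_neighbors=5),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    # High variability for one technique may argue for dropping it;
    # similar results across techniques may argue for the simpler one.
    print(f"{name:20s} R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing both the average and the spread of the cross-validated scores is one precise, and therefore admittedly inadequate, surrogate criterion in the sense of steps 5 and 6.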
The founding fathers of classical statistics, Karl Pearson and Sir Ronald Fisher, would have delighted in the PC's ability to free them from time-consuming empirical validations of their concepts. Pearson, whose contributions include regression analysis, the correlation coefficient, the standard deviation (a term he coined in 1893), and the chi-square test of statistical significance (to name but a few), would likely have developed even more concepts with the free time afforded by the PC. One can further speculate that the functionality of the PC would have allowed Fisher's methods (e.g., maximum likelihood estimation, hypothesis testing, and analysis of variance) to have immediate and practical applications.
The PC took the classical statistics of Pearson and Fisher from their theoretical blackboards into practical classrooms and boardrooms. In the 1970s, statisticians were starting to acknowledge that their methodologies had the potential for wider application. However, they knew an accessible computing device was required to perform their on-demand statistical analyses with acceptable accuracy and within a reasonable turnaround time. Because the statistical techniques had been developed for a small data setting, consisting of one or two handfuls of variables and up to hundreds of records, the hand tabulation of data was computationally demanding and almost insurmountable. Accordingly, conducting the statistical techniques on large data (big data were not born until the late 2000s) was virtually out of the question. With the inception of the microprocessor in the mid-1970s, statisticians finally had their computing device, the PC, to perform statistical analyses on large data with excellent accuracy and turnaround time. Desktop PCs replaced handheld calculators in classrooms and boardrooms. From the 1990s to the present, the PC has offered statisticians advantages that were imponderable decades earlier.
1.2 Statistics and Data Analysis
As early as 1957, Roy believed that classical statistical analysis was likely to be supplanted by assumption-free, nonparametric approaches that were more realistic and meaningful [2]. It was an onerous task to understand the robustness of the classical (parametric) techniques to violations of the restrictive and unrealistic assumptions underlying their use. In practical applications, the primary assumption of "a random sample from a multivariate normal population" is virtually untenable. The effects of violating this assumption and additional model-specific assumptions (e.g., linearity between predictor and dependent variables, constant variance among errors, and uncorrelated errors) are hard to determine with any exactitude. It is difficult to encourage the use of statistical techniques, given that their limitations are not fully understood.
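Although the full effects of violations are hard to quantify, rough diagnostics are straightforward. The hypothetical sketch below fits a simple regression to data whose error variance grows with the predictor and applies two crude checks; the data, the Shapiro-Wilk test, and the split-half variance comparison are illustrative assumptions, not a complete diagnostic workflow.

```python
# Illustrative sketch: rough checks of two classical regression assumptions,
# normality of errors and constant error variance, on data built to violate them.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + rng.normal(scale=0.5 * x)  # error spread grows with x

slope, intercept, *_ = stats.linregress(x, y)
residuals = y - (intercept + slope * x)

# Normality of errors: a small p-value casts doubt on the assumption.
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

# Constant variance: compare residual spread in the lower and upper halves
# of x; a large ratio hints at heteroscedasticity.
lower, upper = residuals[x < 5], residuals[x >= 5]
print("residual std, lower half:", lower.std(), " upper half:", upper.std())
```

Even such crude checks would alert the analyst that textbook standard errors and significance tests for this fit cannot be taken at face value.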
In 1962, in his influential article "The Future of Data Analysis," John Tukey expressed concern that the field of statistics was not advancing [1]. He felt there was too much focus on the mathematics of statistics and not enough on the analysis of data, and he predicted a movement to unlock the rigidities that characterize the discipline. In an act of statistical heresy, Tukey took the first step toward revolutionizing statistics by referring to himself not as a statistician but as a data analyst. However, it was not until the publication of his seminal masterpiece, Exploratory Data Analysis, in 1977, that Tukey led the discipline away from the rigors of statistical inference into a new area known as EDA (the initialism of the title) [3]. For his part, Tukey tried to advance EDA as a separate and distinct discipline from statistics, an idea that never took hold. EDA offered a fresh, assumption-free, nonparametric approach to problem-solving in which the data guide the analysis through self-educating techniques, such as iteratively testing and modifying the analysis in response to feedback, thereby improving the final analysis and yielding reliable results.
Tukey's words best describe the essence of EDA:
Exploratory data analysis is detective work – numerical detective work – or counting detective work – or graphical detective work. … [It is] about looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights. [3, p. 1]
EDA includes the following characteristics; a short sketch of simple, data-guided summaries follows the list:
- Flexibility: Techniques with greater flexibility to delve into the data
- Practicality: Advice on procedures for analyzing data
- Innovation: Techniques for interpreting results
- Universality: Use of all statistics that apply to analyzing data
- Simplicity: Above all, the belief that simplicity is the golden rule
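In that spirit of simplicity, the minimal sketch below computes Tukey's five-number summary and his conventional 1.5 × IQR fences; the data are hypothetical, chosen only to show how simple arithmetic lets the unexpected announce itself.

```python
# A minimal EDA sketch: five-number summary and Tukey's 1.5 * IQR fences.
# The data are hypothetical, with two oddities planted to be noticed.
import numpy as np

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(50, 10, 300), [150, 160]])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("five-number summary:", data.min(), q1, median, q3, data.max())
print("beyond the fences:", data[(data < lo_fence) | (data > hi_fence)])
```

The fences are not a significance test; they are a simple rule of thumb for flagging points that deserve the detective's attention.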
On a personal note, when I learned that Tukey preferred to be called a data analyst, I felt both validated and liberated because many of my analyses fell outside the realm of the classical statistical framework. Also, I had virtually eliminated the mathematical machinery, such as the calculus of maximum likelihood. In homage to Tukey, I use the terms data analyst and statistician interchangeably throughout this book.
1.3 EDA
Tukey's book is more than a collection of new and creative rules and operations; it defines EDA as a discipline, which holds that data analysts fail only if they fail to try many things. It further espouses the belief that data analysts are especially successful if their detective work forces them to notice the unexpected. In other words, the philosophy of EDA is a trinity of attitude, flexibility to do whatever it takes to refine the analysis, and sharp-sightedness to observe the unexpected when it does appear. EDA is thus a self-propagating theory: each data analyst adds his or her own contribution, thereby advancing the discipline, as I hope to accomplish with this book.
The sharp-sightedness of EDA warrants more attention because it is an important feature of the EDA approach. The data analyst should be a keen observer of indicators that are capable of being dealt with successfully and should use them to paint an analytical picture of the data. In addition to the ever-ready visual graphical displays as indicators of what the data reveal, there are numerical indicators, such as counts, percentages, averages, and the other classical descriptive statistics (e.g., standard deviation, minimum, maximum, and missing values). The data analyst's personal judgme...