Statistical and Machine-Learning Data Mining:

Techniques for Better Predictive Modeling and Analysis of Big Data, Third Edition

About this book

Interest in predictive analytics of big data has grown exponentially in the four years since the publication of Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data, Second Edition. In the third edition of this bestseller, the author has completely revised, reorganized, and repositioned the original chapters and produced 13 new chapters of creative and useful machine-learning data mining techniques. In sum, the 43 chapters of simple yet insightful quantitative techniques make this book unique in the field of data mining literature.

What is new in the Third Edition:

  • The current chapters have been completely rewritten.
  • The core content has been extended with strategies and methods for problems drawn from the top predictive analytics conference and statistical modeling workshops.
  • Adds thirteen new chapters, including coverage of data science and its rise, market share estimation, share of wallet modeling without survey data, latent market segmentation, statistical regression modeling that deals with incomplete data, decile analysis assessment in terms of the predictive power of the data, and a user-friendly version of text mining, not requiring an advanced background in natural language processing (NLP).
  • Includes SAS subroutines which can be easily converted to other languages.
As in the previous edition, this book offers detailed background, discussion, and illustration of specific methods for solving the most commonly experienced problems in predictive modeling and analysis of big data. The author addresses each methodology and assigns its application to a specific type of problem. To better ground readers, the book provides an in-depth discussion of the basic methodologies of predictive modeling and analysis. While this type of overview has been attempted before, this approach offers a truly nitty-gritty, step-by-step method that both tyros and experts in the field can enjoy playing with.

1
Introduction
Whatever you are able to do with your might, do it.
—Kohelet 9:10
1.1 The Personal Computer and Statistics
The personal computer (PC) has changed everything—for both better and worse—in the world of statistics. The PC can effortlessly produce precise calculations and eliminate the computational burden associated with statistics. One needs only to provide the right information. With minimal knowledge of statistics, the user points to the location of the input data, selects the desired statistical procedure, and directs the placement of the output. Thus, tasks such as testing, analyzing, and tabulating raw data into summary measures as well as many other statistical criteria are fairly rote. The PC has advanced statistical thinking in the decision-making process as evidenced by visual displays such as bar charts and line graphs, animated three-dimensional rotating plots, and interactive marketing models found in management presentations. The PC also facilitates support documentation, which includes the calculations for measures such as mean profit across market segments from a marketing database; statistical output is copied from the statistical software and then pasted into the presentation application. Interpreting the output and drawing conclusions still require human intervention.
Unfortunately, the confluence of the PC and the world of statistics has turned generalists with minimal statistical backgrounds into quasi-statisticians and affords them a false sense of confidence because they can now produce statistical output. For instance, calculating the mean profit is standard fare in business. However, the mean provides a “typical value” only when the distribution of the data is symmetric. In marketing databases, the distribution of profit commonly has a positive skewness.* Thus, the mean profit is not a reliable summary measure.† The quasi-statistician would doubtlessly not know to check this supposition, thus rendering the interpretation of the mean profit as floccinaucinihilipilification.‡
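To make the point concrete, the following sketch (illustrative only and not from the book; the dataset and variable names are hypothetical) uses a short SAS program to simulate a positively skewed profit variable and to compare the mean with the median before the mean is reported as a typical value.

/* Illustrative sketch: simulate a positively skewed profit variable and      */
/* compare the mean with the median before accepting the mean as typical.     */
data profit_sim;
   call streaminit(123);
   do i = 1 to 1000;
      profit = rand('lognormal');   /* lognormal draws give positive skewness */
      output;
   end;
   drop i;
run;

proc univariate data=profit_sim noprint;
   var profit;
   output out=profit_stats mean=mean_profit median=median_profit skewness=skew_profit;
run;

proc print data=profit_stats noobs;
run;

With data skewed like these, the median falls well below the mean, a sign that the long right tail is pulling the mean away from the typical value.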
Another example of how the PC fosters a “quick-and-dirty”§ approach to statistical analysis is in the use of the ubiquitous correlation coefficient (second in popularity to the mean as a summary measure), which measures the association between two variables. There is an assumption (that the underlying relationship between the two variables is linear or a straight line) to be met for the proper interpretation of the correlation coefficient. Rare is the quasi-statistician who is aware of the assumption. Meanwhile, well-trained statisticians often do not check this assumption, a habit developed by the uncritical use of statistics with the PC.
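A modest safeguard, sketched below with simulated data (the code is illustrative and not the book's), is to plot the pair with a straight-line overlay before quoting the Pearson coefficient, and to request a Spearman coefficient alongside it for comparison.

/* Illustrative sketch: inspect the relationship visually before interpreting r. */
data xy_sim;
   call streaminit(456);
   do i = 1 to 500;
      x = 10 * rand('uniform');
      y = x**2 + rand('normal', 0, 5);   /* a curved, not straight-line, relationship */
      output;
   end;
   drop i;
run;

proc sgplot data=xy_sim;
   scatter x=x y=y;
   reg x=x y=y;                          /* straight-line fit for a visual check */
run;

proc corr data=xy_sim pearson spearman;
   var x y;
run;

When the scatter bends away from the fitted line, as it does here, the Pearson value misstates the strength of the association, and a re-expression of x or y toward straightness is in order.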
The PC with its unprecedented computational strength has also empowered professional statisticians to perform proper analytical due diligence; for example, without the PC, the natural seven-step cycle of statistical analysis would not be practical [1]. The PC and the analytical cycle comprise the perfect pairing as long as the information obtained starts at Step 1 and continues straight through Step 7, without a break in the cycle. Unfortunately, statisticians are human and succumb to taking shortcuts in the path through the seven-step cycle. They ignore the cycle and focus solely on the sixth step. A careful statistical endeavor requires performance of all the steps in the seven-step cycle.* The seven-step sequence is as follows (a brief sketch illustrating Steps 2 through 4 appears after the list):
  1. Definition of the problem—Determining the best way to tackle the problem is not always obvious. Management objectives are often expressed qualitatively, in which case the selection of the outcome or target (dependent) variable is subjectively biased. When the objectives are clearly stated, the appropriate dependent variable is often not available, in which case a surrogate must be used.
  2. Determining technique—The technique first selected is often the one with which the data analyst is most comfortable; it is not necessarily the best technique for solving the problem.
  3. Use of competing techniques—Applying alternative techniques increases the odds that a thorough analysis is conducted.
  4. Rough comparisons of efficacy—Comparing variability of results across techniques can suggest additional techniques or the deletion of alternative techniques.
  5. Comparison in terms of a precise (and thereby inadequate) criterion—An explicit criterion is difficult to define. Therefore, precise surrogates are often used.
  6. Optimization in terms of a precise and inadequate criterion—An explicit criterion is difficult to define. Therefore, precise surrogates are often used.
  7. Comparison in terms of several optimization criteria—This constitutes the final step in determining the best solution.
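The short sketch below, which uses simulated data and is not taken from the book, illustrates Steps 2 through 4 in miniature: two competing regression specifications are fit to the same data so that their fits can be roughly compared before any optimization is attempted.

/* Illustrative sketch of Steps 2-4: fit two competing techniques and compare roughly. */
data cycle_sim;
   call streaminit(789);
   do i = 1 to 300;
      x = 1 + 9 * rand('uniform');
      y = exp(0.3 * x) + rand('normal', 0, 2);
      log_y = log(max(y, 0.1));          /* guard against nonpositive values */
      output;
   end;
   drop i;
run;

proc reg data=cycle_sim;                 /* technique 1: y regressed on x */
   model y = x;
run;
quit;

proc reg data=cycle_sim;                 /* technique 2: log(y) regressed on x */
   model log_y = x;
run;
quit;

Comparing the residual patterns and fit statistics of the two runs is the kind of rough comparison Step 4 calls for; it may suggest a third technique or the deletion of one of these two.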
The founding fathers of classical statistics—Karl Pearson and Sir Ronald Fisher—would have delighted in the PC’s ability to free them from time-consuming empirical validations of their concepts. Pearson, whose contributions include regression analysis, the correlation coefficient, the standard deviation (a term he coined in 1893), and the chi-square test of statistical significance (to name but a few), would have likely developed even more concepts with the free time afforded by the PC. One can further speculate that the functionality of the PC would have allowed Fisher’s methods (e.g., maximum likelihood estimation, hypothesis testing, and analysis of variance) to have immediate and practical applications.
The PC took the classical statistics of Pearson and Fisher from their theoretical blackboards into the practical classrooms and boardrooms. In the 1970s, statisticians were starting to acknowledge that their methodologies had the potential for wider applications. However, they knew an accessible computing device was required to perform their on-demand statistical analyses with an acceptable accuracy and within a reasonable turnaround time. Although the statistical techniques were developed for a small data setting consisting of one or two handfuls of variables and up to hundreds of records, the hand tabulation of data was computationally demanding and almost insurmountable. Accordingly, conducting the statistical techniques on large data (big data were not born until the late 2000s) was virtually out of the question. With the inception of the microprocessor in the mid-1970s, statisticians now had their computing device, the PC, to perform statistical analyses on large data with excellent accuracy and turnaround time. Desktop PCs replaced handheld calculators in classrooms and boardrooms. From the 1990s to the present, the PC has offered statisticians advantages that were imponderable decades earlier.
1.2 Statistics and Data Analysis
As early as 1957, Roy believed that classical statistical analysis was likely to be supplanted by assumption-free, nonparametric approaches that were more realistic and meaningful [2]. It was an onerous task to understand the robustness of the classical (parametric) techniques to violations of the restrictive and unrealistic assumptions underlying their use. In practical applications, the primary assumption of ā€œa random sample from a multivariate normal populationā€ is virtually untenable. The effects of violating this assumption and additional model-specific assumptions (e.g., linearity between predictor and dependent variables, constant variance among errors, and uncorrelated errors) are hard to determine with any exactitude. It is difficult to encourage the use of statistical techniques, given that their limitations are not fully understood.
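Although the exact effects of such violations are hard to pin down, the standard diagnostic plots at least expose gross departures. The sketch below, run on simulated data and offered only as an illustration, requests them for an ordinary regression whose error variance grows with the predictor.

/* Illustrative sketch: request residual diagnostics to screen for violations of */
/* linearity and constant error variance in an ordinary regression.              */
data assume_sim;
   call streaminit(2024);
   do i = 1 to 400;
      x = 10 * rand('uniform');
      y = 2 + 0.5 * x + rand('normal', 0, 0.2 + 0.3 * x);   /* error variance grows with x */
      output;
   end;
   drop i;
run;

ods graphics on;
proc reg data=assume_sim plots=diagnostics;
   model y = x;
run;
quit;

A funnel-shaped spread in the residual-by-predicted panel signals nonconstant variance, and curvature in it signals a failure of linearity.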
In 1962, in his influential article, “The Future of Data Analysis,” John Tukey expressed concern that the field of statistics was not advancing [1]. He felt there was too much focus on the mathematics of statistics and not enough on the analysis of data; he predicted a movement to unlock the rigidities that characterize the discipline. In an act of statistical heresy, Tukey took the first step toward revolutionizing statistics by referring to himself not as a statistician but as a data analyst. However, it was not until the publication of his seminal masterpiece, Exploratory Data Analysis, in 1977, that Tukey led the discipline away from the rigors of statistical inference into a new area known as EDA (the initialism of the book’s title) [3]. For his part, Tukey tried to advance EDA as a separate and distinct discipline from statistics—an idea that never took hold. EDA offered a fresh, assumption-free, nonparametric approach to problem-solving in which the data guide the analysis and self-educating techniques, such as iteratively testing and modifying the analysis as feedback is evaluated, improve the final analysis for reliable results.
Tukey’s words best describe the essence of EDA:
Exploratory data analysis is detective work—numerical detective work—or counting detective work—or graphical detective work. … [It is] about looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights. [3, p. 1]
EDA includes the following characteristics:
  1. Flexibility—Techniques with greater flexibility to delve into the data
  2. Practicality—Advice for procedures of analyzing data
  3. Innovation—Techniques for interpreting results
  4. Universality—Use all statistics that apply to analyzing data
  5. Simplicity—Above all, the belief that simplicity is the golden rule
On a personal note, when I learned that Tukey preferred to be called a data analyst, I felt both validated and liberated because many of my analyses fell outside the realm of the classical statistical framework. Also, I had virtually eliminated the mathematical machinery, such as the calculus of maximum likelihood. In homage to Tukey, I use the terms data analyst and statistician interchangeably throughout this book.
1.3 EDA
Tukey’s book is more than a collection of new and creative rules and operations; it defines EDA as a discipline, which holds that data analysts fail only if they fail to try many things. It further espouses the belief that data analysts are especially successful if their detective work forces them to notice the unexpected. In other words, the philosophy of EDA is a trinity of attitude, flexibility to do whatever it takes to refine the analysis, and sharp-sightedness to observe the unexpected when it does appear. EDA is thus a self-propagating theory; each data analyst adds his or her own contribution, thereby advancing the discipline, as I hope to accomplish with this book.
The sharp-sightedness of EDA warrants more attention because it is an important feature of the EDA approach. The data analyst should be a keen observer of indicators that are capable of being dealt with successfully and should use them to paint an analytical picture of the data. In addition to the ever-ready visual graphical displays as indicators of what the data reveal, there are numerical indicators, such as counts, percentages, averages, and the other classical descriptive statistics (e.g., standard deviation, minimum, maximum, and missing values). The data analyst’s personal judgme...

Table of contents

  1. Cover
  2. Half Title
  3. Title
  4. Copyright
  5. Dedication
  6. Contents
  7. Preface to Third Edition
  8. Preface of Second Edition
  9. Acknowledgments
  10. Author
  11. 1. Introduction
  12. 2. Science Dealing with Data: Statistics and Data Science
  13. 3. Two Basic Data Mining Methods for Variable Assessment
  14. 4. CHAID-Based Data Mining for Paired-Variable Assessment
  15. 5. The Importance of Straight Data: Simplicity and Desirability for Good Model-Building Practice
  16. 6. Symmetrizing Ranked Data: A Statistical Data Mining Method for Improving the Predictive Power of Data
  17. 7. Principal Component Analysis: A Statistical Data Mining Method for Many-Variable Assessment
  18. 8. Market Share Estimation: Data Mining for an Exceptional Case
  19. 9. The Correlation Coefficient: Its Values Range between Plus and Minus 1, or Do They?
  20. 10. Logistic Regression: The Workhorse of Response Modeling
  21. 11. Predicting Share of Wallet without Survey Data
  22. 12. Ordinary Regression: The Workhorse of Profit Modeling
  23. 13. Variable Selection Methods in Regression: Ignorable Problem, Notable Solution
  24. 14. CHAID for Interpreting a Logistic Regression Model
  25. 15. The Importance of the Regression Coefficient
  26. 16. The Average Correlation: A Statistical Data Mining Measure for Assessment of Competing Predictive Models and the Importance of the Predictor Variables
  27. 17. CHAID for Specifying a Model with Interaction Variables
  28. 18. Market Segmentation Classification Modeling with Logistic Regression
  29. 19. Market Segmentation Based on Time-Series Data Using Latent Class Analysis
  30. 20. Market Segmentation: An Easy Way to Understand the Segments
  31. 21. The Statistical Regression Model: An Easy Way to Understand the Model
  32. 22. CHAID as a Method for Filling in Missing Values
  33. 23. Model Building with Big Complete and Incomplete Data
  34. 24. Art, Science, Numbers, and Poetry
  35. 25. Identifying Your Best Customers: Descriptive, Predictive, and Look-Alike Profiling
  36. 26. Assessment of Marketing Models
  37. 27. Decile Analysis: Perspective and Performance
  38. 28. Net T-C Lift Model: Assessing the Net Effects of Test and Control Campaigns
  39. 29. Bootstrapping in Marketing: A New Approach for Validating Models
  40. 30. Validating the Logistic Regression Model: Try Bootstrapping
  41. 31. Visualization of Marketing Models: Data Mining to Uncover Innards of a Model
  42. 32. The Predictive Contribution Coefficient: A Measure of Predictive Importance
  43. 33. Regression Modeling Involves Art, Science, and Poetry, Too
  44. 34. Opening the Dataset: A Twelve-Step Program for Dataholics
  45. 35. Genetic and Statistic Regression Models: A Comparison
  46. 36. Data Reuse: A Powerful Data Mining Effect of the GenIQ Model
  47. 37. A Data Mining Method for Moderating Outliers Instead of Discarding Them
  48. 38. Overfitting: Old Problem, New Solution
  49. 39. The Importance of Straight Data: Revisited
  50. 40. The GenIQ Model: Its Definition and an Application
  51. 41. Finding the Best Variables for Marketing Models
  52. 42. Interpretation of Coefficient-Free Models
  53. 43. Text Mining: Primer, Illustration, and TXTDM Software
  54. 44. Some of My Favorite Statistical Subroutines
  55. Index