Handbook of Statistical Analysis and Data Mining Applications

Name: Handbook of Statistical Analysis and Data Mining Applications
Author: Ken Yale, Robert Nisbet, Gary D. Miner

822 pages
English

About This Book

Handbook of Statistical Analysis and Data Mining Applications, Second Edition, is a comprehensive professional reference book that guides business analysts, scientists, engineers and researchers, both academic and industrial, through all stages of data analysis, model building and implementation. The handbook helps users discern technical and business problems, understand the strengths and weaknesses of modern data mining algorithms and employ the right statistical methods for practical application.

This book is an ideal reference for users who want to address massive and complex datasets with novel statistical approaches and be able to objectively evaluate analyses and solutions. It has clear, intuitive explanations of the principles and tools for solving problems using modern analytic techniques and discusses their application to real problems in ways accessible and beneficial to practitioners across several areas—from science and engineering, to medicine, academia and commerce.

Includes input by practitioners for practitioners
Includes tutorials in numerous fields of study that provide step-by-step instruction on how to use supplied tools to build models
Contains practical advice from successful real-world implementations
Brings together, in a single resource, all the information a beginner needs to understand the tools and issues in data mining to build successful data mining solutions
Features clear, intuitive explanations of novel analytical tools and techniques, and their practical applications

Information

Publisher: Academic Press
Year: 2017
ISBN: 9780124166455
Topic: Mathematics
Subtopic: Probability & Statistics

Foreword 1 for 1st Edition

This book will help the novice user become familiar with data mining. Basically, data mining is doing data analysis (or statistics) on data sets (often large) that have been obtained from potentially many sources. As such, the miner may not have control of the input data, but must rely on sources that have gathered the data. Consequently, there are problems that every data miner must be aware of as he or she begins (or completes) a mining operation. I strongly resonated with the material on “The Top 10 Data Mining Mistakes,” which gives a worthwhile checklist:

• Ensure you have a response variable and predictor variables—and that they are correctly measured.

• Beware of overfitting. With scads of variables, it is easy with most statistical programs to fit incredibly complex models, but they cannot be reproduced. It is good to save part of the sample to use to test the model. Various methods are offered in this book.

• Don't use only one method. Using only linear regression can be a problem. Try dichotomizing the response or categorizing it to remove nonlinearities in the response variable. Often, there are clusters of values at zero, which messes up any normality assumption. This, of course, loses information, so you may want to categorize a continuous response variable and use an alternative to regression. Similarly, predictor variables may need to be treated as factors rather than linear predictors. A classic example is using marital status or race as a linear predictor when there is no order.

• Asking the wrong question: when looking for a rare phenomenon, it may be helpful to identify the most common pattern. These may lead to complex analyses, as in the third item above, but they may also be conceptually simple. Again, you may need to take care that you don't overfit the data.

• Don't become enamored with the data. There may be a substantial history from earlier data or from domain experts that can help with the modeling.

• Be wary of using an outcome variable (or one highly correlated with the outcome variable) and becoming excited about the result. The predictors should be “proper” predictors in the sense that they (a) are measured prior to the outcome and (b) are not a function of the outcome.

• Do not discard outliers without solid justification. Just because an observation is out of line with others is insufficient reason to ignore it. You must check the circumstances that led to the value. In any event, it is useful to conduct the analysis with the observation(s) included and excluded to determine the sensitivity of the results to the outlier.

• Extrapolating is a fine way to go broke; the best example is the stock market. Stick within your data, and if you must go outside, put plenty of caveats. Better still, restrain the impulse to extrapolate. Beware that pictures are often far too simple and we can be misled. Political campaigns oversimplify complex problems (“my opponent wants to raise taxes”; “my opponent will take us to war”) when the realities may imply we have some infrastructure needs that can be handled only with new funding or we have been attacked by some bad guys.
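
The advice above to save part of the sample for testing can be sketched in a few lines. This is only an illustration under stated assumptions, not a method taken from the handbook: the synthetic data (y = 2x plus noise), the 30% holdout fraction, and the helper names `split_holdout`, `fit_line`, and `fit_memorizer` are all hypothetical. The "memorizer" is a deliberately overfit model that simply returns the response of the nearest training point.

```python
import random

def split_holdout(rows, holdout_fraction=0.3, seed=0):
    """Shuffle the rows and set aside a fraction that is never used for fitting."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def fit_line(train):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(train):
    """Deliberately overfit: predict the y of the nearest training x."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]

def mse(model, rows):
    """Mean squared prediction error of a model on (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Hypothetical data: a linear signal plus unit-variance Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(300):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + rng.gauss(0, 1)))

train, holdout = split_holdout(data)
results = {}
for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, holdout))
    print(name, "train MSE:", round(results[name][0], 3),
          "holdout MSE:", round(results[name][1], 3))
```

The memorizer reproduces its training data perfectly, so its training error says nothing about how it behaves on new rows; only the holdout error exposes that. The same comparison applies to any model fitted with many variables.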

Be wary of your data sources. If you are combining several sets of data, they need to meet a few standards:

• The definitions of variables that are being merged should be identical. Often, they are close but not exact (especially in metaanalysis where clinical studies may have somewhat different definitions due to different medical institutions or laboratories).

• Be careful about missing values. Often, when multiple data sets are merged, missing values can be induced: one variable isn't present in another data set; what you thought was a unique variable name was slightly different in the two sets, so you end up with two variables that both have a lot of missing values.

• How you handle missing values can be crucial. In one example, I used complete cases and lost half of my sample; all variables had at least 85% completeness, but when put together, the sample lost half of the data. The residual sum of squares from a stepwise regression was about 8. When I included more variables using mean replacement, almost the same set of predictor variables surfaced, but the residual sum of squares was 20. I then used multiple imputation and found approximately the same set of predictors but had a residual sum of squares (median of 20 imputations) of 25. I find that mean replacement is rather optimistic but surely better than relying on only complete cases. Using stepwise regression, I find it useful to replicate it with a bootstrap or with multiple imputations. However, with large data sets, this approach may be expensive computationally.
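
The loss of half a sample despite 85% completeness per variable is easy to reproduce with simple arithmetic. The sketch below is an assumption-laden illustration rather than a reconstruction of the analysis described above: it assumes four variables and independent missingness (the foreword does not state either), and the helpers `complete_case_fraction`, `complete_cases`, and `mean_replace` are hypothetical names. Under those assumptions, the expected share of fully observed rows is 0.85 to the fourth power, roughly 0.52.

```python
import random

def complete_case_fraction(completeness, n_variables):
    """Expected share of fully observed rows if missingness is independent."""
    return completeness ** n_variables

def complete_cases(rows):
    """Keep only rows with no missing (None) entries."""
    return [row for row in rows if all(v is not None for v in row)]

def mean_replace(rows):
    """Fill each missing entry with the observed mean of its column."""
    means = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

# Expected fraction under the stated assumptions (about 0.52).
print("expected complete-case share:", complete_case_fraction(0.85, 4))

# Simulated check: 2000 rows, 4 variables, each value observed with probability 0.85.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) if rng.random() < 0.85 else None for _ in range(4)]
        for _ in range(2000)]
kept = complete_cases(rows)
filled = mean_replace(rows)
print("simulated complete-case share:", len(kept) / len(rows))
print("rows available after mean replacement:", len(filled))
```

Mean replacement keeps every row, which is why it recovers the sample that complete-case analysis discards, but each filled value carries no real information about that row. Multiple imputation, as mentioned above, is the more careful alternative and is not shown here.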

To conclude, there is a wealth of material in this handbook that will repay study.

Peter A. Lachenbruch, Oregon State University, Corvallis, OR, United States, American Statistical Association, Alexandria, VA, United States, Johns Hopkins University, Baltimore, MD, United States, UCLA, Los Angeles, CA, United States, University of Iowa, Iowa City, IA, United States, University of North Carolina, Chapel Hill, NC, United States

Foreword 2 for 1st Edition

A November 2008 search on https://www.amazon.com/ for “data mining” books yielded over 15,000 hits—including 72 to be published in 2009. Most of these books either describe data mining in very technical and mathematical terms, beyond the reach of most individuals, or approach data mining at an introductory level without sufficient detail to be useful to the practitioner. The Handbook of Statistical Analysis and Data Mining Applications is the book that strikes the right balance between these two treatments of data mining.

This volume is not a theoretical treatment of the subject—the authors themselves recommend other books for this—but rather contains a description of data mining principles and techniques in a series of “knowledge-transfer” sessions, where examples from real data mining projects illustrate the main ideas. This aspect of the book makes it most valuable for practitioners, whether novice or more experienced.

While it would be easier for everyone if data mining were merely a matter of finding and applying the correct mathematical equation or approach for any given problem, the reality is that both “art” and “science” are necessary. The “art” in data mining requires experience: when one has seen and overcome the difficulties in finding solutions from among the many possible approaches, one can apply newfound wisdom to the next project. However, this process takes considerable time, and particularly for data mining novices, the iterative process inevitable in data mining can lead to discouragement when a “textbook” approach doesn't yield a good solution.

This book is different; it is organized with the practitioner in mind. The volume is divided into four parts. Part I provides an overview of analytics from a historical perspective and frameworks from which to approach data mining, including CRISP-DM and SEMMA. These chapters will provide a novice analyst an excellent overview by defining terms and methods to use and will provide program managers a framework from which to approach a wide variety of data mining problems. Part II describes algorithms, though without extensive mathematics. These will appeal to practitioners who are or will be involved with day-to-day analytics and need to understand the qualitative aspects of the algorithms. The inclusion of a chapter on text mining is particularly timely, as text mining has shown tremendous growth in recent years.

Part III provides a series of tutorials that are both domain-specific and software-specific. Any instructor knows that examples make the abstract concept more concrete, and these tutorials accomplish exactly that. In addition, each tutorial shows how the solutions were developed using popular data mining software tools, such as Clementine, Enterprise Miner, Weka, and STATISTICA. The step-by-step specifics will assist practitioners in learning not only how to approach a wide variety of problems but also how to use these software products effectively. Part IV presents a look at the future of data mining, including a treatment of model ensembles and “The Top 10 Data Mining Mistakes,” from the popular presentation by Dr. Elder.

However, the book is best read a few chapters at a time while actively doing the data mining rather than read cover to cover (a daunting task for a book this size). Practitioners will appreciate tutorials that match their business objectives and choose to ignore other tutorials. They may choose to read sections on a particular algorithm to increase insight into that algorithm and then decide to add a second algorithm after the first is mastered. For those new to a particular software tool highlighted in the tutorials section, the step-by-step approach will operate much like a user's manual. Many chapters stand well on their own, such as the excellent “History of Statistics and Data Mining” chapter and chapters 16, 17, and 18. These are broadly applicable and should be read by even the most experienced data miners.

The Handbook of Statistical Analysis and Data Mining Applications is an exceptional book that should be on every data miner's bookshelf or, better yet, found lying open next to their computer.

Dean Abbott, Abbott Analytics, San Diego, CA, United States

Preface

Bob Nisbet; Gary Miner; Ken Yale

Much has happened in the professional discipline known previously as data mining since the first edition of this book was written in 2008. This discipline has broadened and deepened to a very large extent, requiring a major reorganization of its elements. A new parent discipline was formed, data science, which includes previous subjects and activities in data mining and many new elements of the scientific study of data, including storage structures optimized for analytic use, data ethics, and performance of many activities in business, industry, and education. Analytic aspects that used to be included in data mining have broadened considerably to include image analysis, facial recognition, industrial performance and control, threat detection, fraud detection, astronomy, national security, weather forecasting, and financial forensics. Consequently, several subdisciplines have been erected to contain various specialized data analytic applications. These subdisciplines of data science include the following:

• Machine learning—analytic algorithm design and optimization

• Data mining—generally restricted in scope now to pattern recognition apart from causes and interpretation

• Predictive analytics—using algorithms to predict things, rather than describe them or manage them

• Statistical analysis—use of parametric statistical algorithms for analysis and prediction

•...