Data Analytics
eBook - ePub

Data Analytics

A Small Data Approach

  1. 276 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android
eBook - ePub

Data Analytics

A Small Data Approach

About this book

Data Analytics: A Small Data Approach is suitable for an introductory data analytics course to help students understand some main statistical learning models. It has many small datasets to guide students to work out pencil solutions of the models and then compare with results obtained from established R packages. Also, as data science practice is a process that should be told as a story, in this book there are many course materials about exploratory data analysis, residual analysis, and flowcharts to develop and validate models and data pipelines.

The main models covered in this book include linear regression, logistic regression, tree models and random forests, ensemble learning, sparse learning, principal component analysis, kernel methods including the support vector machine and kernel regression, and deep learning. Each chapter introduces two or three techniques. For each technique, the book highlights the intuition and rationale first, then shows how mathematics is used to articulate the intuition and formulate the learning problem. R is used to implement the techniques on both simulated and real-world dataset. Python code is also available at the book's website: http://dataanalyticsbook.info.

Frequently asked questions

Yes, you can cancel anytime from the Subscription tab in your account settings on the Perlego website. Your subscription will stay active until the end of your current billing period. Learn how to cancel your subscription.
At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.
Perlego offers two plans: Essential and Complete
  • Essential is ideal for learners and professionals who enjoy exploring a wide range of subjects. Access the Essential Library with 800,000+ trusted titles and best-sellers across business, personal growth, and the humanities. Includes unlimited reading time and Standard Read Aloud voice.
  • Complete: Perfect for advanced learners and researchers needing full, unrestricted access. Unlock 1.4M+ books across hundreds of subjects, including academic and specialized titles. The Complete Plan also includes advanced features like Premium Read Aloud and Research Assistant.
Both plans are available with monthly, semester, or annual billing cycles.
We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.
Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.
Yes! You can use the Perlego app on both iOS or Android devices to read anytime, anywhere — even offline. Perfect for commutes or when you’re on the go.
Please note we cannot support devices running on iOS 13 and Android 7 or earlier. Learn more about using the app.
Yes, you can access Data Analytics by Shuai Huang,Houtao Deng in PDF and/or ePUB format, as well as other popular books in Mathematics & Statistics for Business & Economics. We have over one million books available in our catalogue for you to explore.

Chapter 1: Introduction

Who will benefit from this book?

Students who will find this book useful are those who have not systematically learned statistics or machine learning, but have had some exposure to basic statistical knowledge such as normal distribution, hypothesis testing, and are interested in finding data scientist jobs in a variety of areas. And practitioners with or without formal training in data science-related disciplines, who use data science in interdisciplinary areas, will find this book a useful addition. For example, I know a friend who learned statistics in college, more or less as applied mathematics that less emphasized data, computation, and storytelling, had found a remarkable resemblance between many data science methods with some concepts that she learned 20 years ago. She said if she could have a book that helps connect all the dots, and go through the materials with an easy-to-follow programming tool, it would help her move to a new field of work, as she is a physicist who is now working on genetics data.
We feel the same way. We have been working with medical doctors to diagnose surgical site infections using mobile phone images, with healthcare professionals to use hospital data to optimize the care process, with biologists and epidemiologists to understand the natural history of diseases, and with manufacturing companies to build Internet-of-Things, among others. The challenge of interdisciplinary collaboration is to cross boundaries and build new platforms. To embark on this adventure, a flexible understanding of our methods is important, as well as the skill of storytelling.
To help readers develop these skills, the style of the book highlights a combination of two aspects: technical concreteness and holistic thinking2. Holistic thinking is the foundation of how we formulate problems and why we could trust our formulations, knowing that our formulations inevitably are only a partial representation of a real-world problem. Holistic thinking is also the foundation of communication between team members of different backgrounds. With a diverse team, things that make sense intuitively are important to build team-wide trust in decision-making. And technical concreteness is the springboard for us to jump into a higher awareness and understanding of the problem to make holistic decisions.
2 The chapters are named using different qualities of holistic thinking in decision-makings, including ”Abstraction”, ”Recognition”, ”Resonance”, ”Learning”, ”Diagnosis”, ”Scalability”, ”Pragmatism”, and ”Synthesis”.
To begin our journey, first, let’s look at the big picture, the data analytics pipeline.

Overview of a data analytics pipeline

A typical data analytics pipeline consists of several major pillars. In the example shown in Figure 1, it has four pillars: sensor and devices, data preprocessing and feature engineering, feature selection and dimension reduction, and modeling and data analysis. While this is not the only way to present the diverse data pipelines in the real world, these pipelines more or less resemble this architecture.
Figure 1
Figure 1: Overview of a data analytics pipeline
The pipeline starts with a real-world problem, for which we are not sure about the underlying system/mechanism, but we are able to characterize the system by defining some variables. Then, we could develop sensors and devices to acquire measurements of these variables3. Before analyzing the data and building models, there is a step for data preprocessing and feature engineering. For example, some signals acquired by sensors are not interpretable or not easily compatible with human perceptions, such as the signal acquired by MRI scanning machines in the Fourier space. Data preprocessing also refers to removal of outliers or imputation of missing data, detection and removal of redundant features, to name a few. After data preprocessing, we may conduct feature selection and dimension reduction to distill or condense signals in the data and reduce noise. Finally, we conduct modeling and data analysis on the prepared dataset to gain knowledge and build models of the real-world system4.
3 These measurements, we call data, are objective evidences that we can use to explore the statistical principles or mechanistic laws regulating the system behaviors.
4 Prediction models are quite common, but other models for decision-makings, such as system modeling, monitoring, intervention, and control, can be built as well.
This book focuses on the last two pillars of this pipeline, the modeling, data analysis, feature selection, and dimension reduction methods. But it is helpful to keep in mind the big picture of a data analytics pipeline. Because in practice, it takes a whole pipeline to make things work.

Topics in a nutshell

Specific techniques that will be introduced in this book are shown below.

Data models (i.e., regression-based techniques)

  • Chapter 2: Linear regression, least squares estimation, hypothesis testing, R-squared, First Derivative Test, connection with experimental design, data-generating mechanism, history of adventures in understanding errors, exploratory data analysis (EDA)
  • Chapter 3: Logistic regression, generalized least squares estimation, iterative reweighted least squares (IRLS) algorithm, ranking (formulated as a regression problem)
  • Chapter 4: Bootstrap, data resampling, nonparametric hypothesis testing, nonparametric confidence interval estimation
  • Chapter 5: Overfitting and underfitting, limitation of R-squared, training dataset and testing dataset, random sampling, K-fold cross-validation, the confusion matrix, false positive and false negative, the Receiver Operating Characteristics (ROC) curve, the law of errors, how data scientists work with clients
  • Chapter 6: Residual analysis, normal Q-Q plot, Cook’s distance, leverage, multicollinearity, heterogeneity, clustering, Gaussian mixture model (GMM), the Expectation-Maximization (EM) algorithm, Jensen Inequality
  • Chapter 7: Support Vector Machine (SVM), generalize data versus memorize data, maximum margin, support vectors, model complexity and regularization, primal-dual formulation, quadratic programming, KKT condition, kernel trick, kernel machines, SVM as a neural network model
  • Chapter 8: LASSO, sparse learning, L1-norm and L2-norm regularization, Ridge regression, feature selection, shooting algorithm, Principal Component Analysis (PCA), eigenvalue decomposition, scree plot
  • Chapter 9: Kernel regression as generalization of linear regression model, local smoother regression model, k-nearest neighbor (KNN) regression model, conditional variance regression model, heteroscedasticity, weighted least squares estimation, model extension and stacking
  • Chapter 10: Deep learning, neural network, activation function, model primitives, convolution, max pooling, convolutional neural network (CNN)

Algorithmic models (i.e., tree-based techniques)

  • Chapter 2: Decision tree, entropy, information gain (IG), node splitting, pre- and post-pruning, empirical error, generalization error, pessimistic error by binomial approximation, greedy recursive splitting
  • Chapter 3: System monitoring reformulated as classification problem, real-time contrasts method (RTC), design of monitoring statistics, sliding window, anomaly detection, false alarm
  • Chapter 4: Random forest, Gini index, weak classifiers, the probabilistic mechanism about why random forests can create a strong classifier out of many weak classifiers, importance score, partial dependency plot
  • Chapter 5: Out-of-bag (OOB) error in random forest
  • Chapter 6: Residual analysis, clustering by random forests
  • Chapter 7: Ensemble learning, Adaboost, analysis of ensemble learning from statistical, computational, and representational perspectives
  • Chapter 10: Automations of pipelines, integration of tree models, feature selection, and regression models in ‘inTrees‘, random forest as a rule generator, rule extraction, pruning, selection, and summarization, confidence and support of rules, variable interactions, rule-based prediction
In this book, we will use lower case letters, e.g., x, to represent scalars, bold face, lower case letters, e.g., x, to represent vectors, and bold face, upper case letters, e.g., X, to represent matrices.

Table of contents

  1. Cover
  2. Half Title
  3. Series Page
  4. Title Page
  5. Copyright Page
  6. Dedication
  7. Contents
  8. Preface
  9. Acknowledgments
  10. Chapter 1: Introduction
  11. Chapter 2: Abstraction Regression & Tree Models
  12. Chapter 3: Recognition Logistic Regression & Ranking
  13. Chapter 4: Resonance Bootstrap & Random Forests
  14. Chapter 5: Learning (I) Cross-validation & OOB
  15. Chapter 6: Diagnosis Residuals & Heterogeneity
  16. Chapter 7: Learning (II) SVM & Ensemble Learning
  17. Chapter 8: Scalability LASSO & PCA
  18. Chapter 9: Pragmatism Experience & Experimental
  19. Chapter 10: Synthesis Architecture & Pipeline
  20. Conclusion
  21. Appendix: A Brief Review of Background Knowledge
  22. Index