Contents
Preface
List of Figures
List of Tables
1 Data Science
1.1 Exercises
2 Introducing R
2.1 Tooling For R Programming
2.2 Packages and Libraries
2.3 Functions, Commands and Operators
2.4 Pipes
2.5 Getting Help
2.6 Exercises
3 Data Wrangling
3.1 Data Ingestion
3.2 Data Review
3.3 Data Cleaning
3.4 Variable Roles
3.5 Feature Selection
3.6 Missing Data
3.7 Feature Creation
3.8 Preparing the Metadata
3.9 Preparing for Model Building
3.10 Save the Dataset
3.11 A Template for Data Preparation
3.12 Exercises
4 Visualising Data
4.1 Preparing the Dataset
4.2 Scatter Plot
4.3 Bar Chart
4.4 Saving Plots to File
4.5 Adding Spice to the Bar Chart
4.6 Alternative Bar Charts
4.7 Box Plots
4.8 Exercises
5 Case Study: Australian Ports
5.1 Data Ingestion
5.2 Bar Chart: Value/Weight of Sea Trade
5.3 Scatter Plot: Throughput versus Annual Growth
5.4 Combined Plots: Port Calls
5.5 Further Plots
5.6 Exercises
6 Case Study: Web Analytics
6.1 Sourcing Data from CKAN
6.2 Browser Data
6.3 Entry Pages
6.4 Exercises
7 A Pattern for Predictive Modelling
7.1 Loading the Dataset
7.2 Building a Decision Tree Model
7.3 Model Performance
7.4 Evaluating Model Generality
7.6 Comparison of Performance Measures
7.7 Save the Model to File
7.8 A Template for Predictive Modelling
7.9 Exercises
8 Ensemble of Predictive Models
8.1 Loading the Dataset
8.2 Random Forest
8.3 Extreme Gradient Boosting
8.4 Exercises
9 Writing Functions in R
9.1 Model Evaluation
9.2 Creating a Function
9.3 Function for ROC Curves
9.4 Exercises
10 Literate Data Science
10.1 Basic LaTeX Template
10.2 A Template for our Narrative
10.3 Including R Commands
10.4 Inline R Code
10.5 Formatting Tables Using Kable
10.6 Formatting Tables Using XTable
10.7 Including Figures
10.8 Add a Caption and Label
10.9 Knitr Options
10.10 Exercises
11 R with Style
11.1 Why We Should Care
11.2 Naming
11.3 Comments
11.4 Layout
11.5 Functions
11.6 Assignment
11.7 Miscellaneous
11.8 Exercises
Bibliography
Index
Preface
From data we derive information, and by combining different bits of information we build knowledge. It is then with wisdom that we deploy knowledge into enterprises, governments, and society. Data is core to every organisation as we continue to digitally capture a large volume and variety of data at an unprecedented velocity. The demand for data science continues to grow substantially, with a shortfall of data scientists worldwide.
Professional data scientists combine a good grounding in computer science and statistics with an ability to explore the space of data to make sense of the world. Data science relies on their aptitude and art for observation, mathematics, and logical reasoning.
This book introduces the essentials of data analysis and machine learning as the foundations for data science. It uses the free and open source software R (R Core Team, 2017), which is available to anyone at no cost. All are permitted, and indeed encouraged, to read the source code to learn, understand, verify, and extend it. Because R is open source, we also have the assurance that the software will always be available. R is supported by a worldwide network that includes some of the leading statisticians and professional data scientists.
Features
A key feature of this book, differentiating it from other textbooks on data science, is the focus on the hands-on end-to-end process. It covers data analysis including loading data into R, wrangling the data to improve its quality and utility, visualising the data to gain understanding and insight, and, importantly, using machine learning to discover knowledge from the data.
This book brings together the essentials of doing data science based on over 30 years of the practice and teaching of data science. It presents a programming-by-example approach that allows students to quickly achieve outcomes whilst building a skill set and knowledge base, without getting sidetracked into the details of programming.
The book systematically develops an end-to-end process flow for ...