Section 1: Getting Started with Statistics for Data Science
In this section, you will learn how to preprocess data and inspect distributions and correlations from a statistical perspective.
This section consists of the following chapters:
- Chapter 1, Fundamentals of Data Collection, Cleaning, and Preprocessing
- Chapter 2, Essential Statistics for Data Assessment
- Chapter 3, Visualization with Statistical Graphs
Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing
Thank you for purchasing this book and welcome to a journey of exploration and excitement! Whether you are already a data scientist, preparing for an interview, or just starting to learn, this book will serve you well as a companion. You may already be familiar with common Python toolkits and have followed trending tutorials online. However, a systematic approach to the statistical side of data science is often missing. This book is designed and written to close this gap for you.
This first chapter starts with the very first step of a data science project: collecting and cleaning data and performing some initial preprocessing. It is like preparing fish for cooking. You get the fish from the water or from the fish market, examine it, and process it a little bit before bringing it to the chef.
You are going to learn five key topics in this chapter. They are closely related to other topics, such as visualization and basic statistical concepts. For example, outlier removal is very hard to conduct without a scatter plot, and data standardization requires an understanding of statistics such as the standard deviation. We have prepared a GitHub repository that contains ready-to-run code for this chapter and the rest of the book.
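As a small preview of why standardization depends on the standard deviation, here is a minimal sketch in plain Python. The numbers are made up for illustration, and the population standard deviation is an assumption chosen to keep the arithmetic simple; the scikit-learn tools covered later in the chapter are not used here.

```python
from statistics import mean, pstdev

# Made-up values, used only to illustrate the calculation.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(values)        # arithmetic mean of the values
sigma = pstdev(values)   # population standard deviation

# Standardize: subtract the mean, then divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in values]

print(z_scores)
```

Each standardized value expresses how many standard deviations the original value lies above or below the mean, which is why the concept cannot be separated from the statistic itself.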
Here are the topics that will be covered in this chapter:
- Collecting data from various data sources with a focus on data quality
- Data imputation with an assessment of downstream task requirements
- Outlier removal
- Data standardization – when and how
- Examples involving the scikit-learn preprocessing module
This chapter serves as a primer, so it is not possible to cover the topics in an entirely sequential fashion. For example, removing outliers relies on statistical plots, specifically the box plot and the scatter plot. We will come back to those techniques in detail in later chapters, so please bear with us for now. Sometimes, the only way to get started on a new topic is to bootstrap: to use a technique before you have studied it in full. You will enjoy it, because the more topics you pick up along the way, the more confident you will become.
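To give a first impression of what a box plot encodes, the following sketch flags values that fall outside the whiskers using the common 1.5 × IQR convention. This is only one possible rule, the data is made up, and the quartile method is whatever Python's statistics.quantiles uses by default; treat all of these as assumptions rather than the method the later sections prescribe.

```python
from statistics import quantiles

# Made-up data with one obviously extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# Quartiles: the three cut points that split the data into four parts.
q1, _median, q3 = quantiles(data, n=4)
iqr = q3 - q1  # interquartile range, the height of the box in a box plot

# Conventional whisker limits: 1.5 * IQR beyond the box.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
print(outliers)
```

A box plot draws the same information visually, which makes it easier to judge whether a flagged point is a genuine outlier or a legitimate extreme observation.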
Technical requirements
The best environment for running the Python code in this book is Google Colaboratory (https://colab.research.google.com). Google Colaboratory is a product that runs Jupyter Notebook in the cloud. It comes with common Python packages pre-installed and runs in a browser. It can also connect to Google Drive, so you can upload local files there and read them from a notebook. The recommended browsers are the latest versions of Chrome and Firefox.
For more information about Colaboratory, check out the official notebooks at https://colab.research.google.com.
You can find the code for this chapter in the following GitHub repository: https://github.com/PacktPublishing/Essential-Statistics-for-Non-STEM-Data-Analysts
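If you work in Colaboratory, a small helper such as the one below can check whether a data file is already in the working directory and, if not, open the browser's upload dialog. This is a sketch under assumptions: the helper name ensure_local_file is hypothetical, and the upload branch only runs inside a Colab runtime, where the google.colab package is available.

```python
import os

try:
    # This package exists only inside a Colab runtime.
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False


def ensure_local_file(filename):
    """Return True if filename is available in the working directory.

    Inside Colab, prompt for an upload when the file is missing.
    (Hypothetical helper, written for illustration.)
    """
    if os.path.exists(filename):
        return True
    if IN_COLAB:
        uploaded = files.upload()  # opens a file picker in the browser
        return filename in uploaded
    return False
```

Outside Colab, the function simply reports whether the file exists, so the same notebook cell can be run locally without changes.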
Collecting data from various data sources
It is crucial to keep in mind that data doesn't have to come as well-formatted tables. There are three major ways to collect data:
- Obtaining structured tabulated data directly: For example, the Federal Reserve (https://www.federalreserve.gov/data.htm) releases well-structured and well-documented data in various formats, including CSV, which pandas can read directly into a DataFrame.
- Requesting data from an API: For example, the Google Maps API (https://developers.google.com/maps/documentation) allows developers to request data at a rate capped by their pricing plan. The returned format is usually JSON or XML.
- Building a dataset from scratch: For example, social scientists often perform surveys and collect participants' answers to build proprietary data.
Let's look at some examples involving these three approaches. You will use the UCI Machine Learning Repository, the Google Maps API, and the USC President's Office website as data sources, respectively.
Reading data directly from files
Reading data from local or remote files through a URL usually requires a good, publicly accessible data archive. For example, the University of California, Irvine (UCI) maintains a data repository for machine learning. We will read the heart disease dataset with pandas. You can obtain the files from https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. From these files, we use processed.hungarian.data, which you need to upload to the same folder where the notebook resides. In case the following code fails, check the book's official GitHub repository, where the latest URL will be kept up to date.
The following code snippet reads the data and displays the first several rows of the dataset:
import pandas as pd
df = pd.read_csv("processed.hungarian.data",
                 sep=",",
                 names=["age", "sex", "cp", "trestbps",
                        "chol", "fbs", "restecg", "thalach",
                        "exang", "oldpeak", "slope", "ca",
                        "thal", "num"])
df.head()
This produces the following output:
Figure 1.1 – Head of the Hungarian heart disease dataset
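Because this section focuses on data quality, it is worth checking how missing entries are encoded right after loading. The sketch below assumes that missing entries are marked with a ? placeholder, which you should verify against the file you downloaded. The two rows are made up for illustration and are not taken from the dataset; only the column names come from the code above.

```python
import io

import pandas as pd

columns = ["age", "sex", "cp", "trestbps",
           "chol", "fbs", "restecg", "thalach",
           "exang", "oldpeak", "slope", "ca",
           "thal", "num"]

# Made-up rows in the same comma-separated layout (NOT real records).
sample = io.StringIO(
    "50,1,2,130,200,0,0,150,0,0.0,?,?,?,0\n"
    "60,0,3,140,?,0,0,120,1,1.5,2,?,?,1\n"
)

# na_values tells pandas to treat the placeholder as a missing value (NaN).
df_sample = pd.read_csv(sample, names=columns, na_values="?")

# Count the missing entries in each column.
missing_per_column = df_sample.isna().sum()
print(missing_per_column)
```

Without na_values, a column containing the placeholder would be read as text rather than numbers, which tends to surface later as confusing errors in numerical code.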
In the following section, you will learn how to obtain data from an API.
Obtaining data from an API
In plain English, an Application Programming Interface (API) defines protocols, agreements, or treaties between applications or parts of applications. You send a request to an API and receive data back in JSON or another format specified in the API documentation. You can then extract the data you want from the response.
Note
When working with an API, you need to follow its usage guidelines and restrictions. Improper usage of an API can result in the suspension of your account or even legal issues.
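The request and response cycle described above can be sketched without touching a real service. In the following example, the endpoint, the parameter names, and the JSON payload are all hypothetical and chosen only for illustration; a real API documents its own URL, parameters, and response fields.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/v1/places"


def build_request_url(query, api_key):
    """Encode the query parameters and append them to the base URL."""
    params = {"query": query, "key": api_key}
    return f"{BASE_URL}?{urlencode(params)}"


# Made-up JSON text standing in for what an API might return.
SAMPLE_RESPONSE = (
    '{"status": "OK", '
    '"results": [{"name": "Example School", "types": ["school"]}]}'
)


def extract_names(response_text):
    """Parse the JSON text and pull out the name of each result."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK":
        raise ValueError(f"Request failed: {payload.get('status')}")
    return [item["name"] for item in payload.get("results", [])]


url = build_request_url("schools near campus", "YOUR_API_KEY")
names = extract_names(SAMPLE_RESPONSE)
print(url)
print(names)
```

In practice, the URL would be sent with an HTTP client and the response body passed to the parsing step; checking a status field first is a common defensive habit, although the field name differs between APIs.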
Let's take the Google Maps Places API as an example. The Places API (https://developers.google.com/places/web-service/intro) is one of many Google Maps APIs that Google offers. Developers can use HTTP requests to obtain information about certain geographic locations, the opening hours of establishments, and the types of establishment, such as schools, government offices, ...