1.1 Before You Start
Statistics starts with a problem, proceeds with the collection of data, continues with the data analysis and finishes with conclusions. It is a common mistake of inexperienced statisticians to plunge into a complex analysis without paying attention to the objectives or even whether the data are appropriate for the proposed analysis. As Einstein said, the formulation of a problem is often more essential than its solution, which may be merely a matter of mathematical or experimental skill.
To formulate the problem correctly, you must:
Understand the physical background. Statisticians often work in collaboration with others and need to understand something about the subject area. Regard this as an opportunity to learn something new rather than a chore.
Understand the objective. Again, often you will be working with a collaborator who may not be clear about what the objectives are. Beware of “fishing expeditions” — if you look hard enough, you will almost always find something, but that something may just be a coincidence.
Make sure you know what the client wants. You can often do quite different analyses on the same dataset. Sometimes statisticians perform an analysis far more complicated than the client really needed. You may find that simple descriptive statistics are all that are needed.
Put the problem into statistical terms. This is a challenging step and where irreparable errors are sometimes made. Once the problem is translated into the language of statistics, the solution is often routine. This is where human intelligence is decidedly superior to artificial intelligence; defining the problem is hard to program. It is not enough that a statistical method can read in and process the data — the results of an inapt analysis may be meaningless.
It is important to understand how the data were collected.
Are the data observational or experimental? Are the data a sample of convenience or were they obtained via a designed sample survey? How the data were collected has a crucial impact on what conclusions can be made.
Is there nonresponse? The data you do not see may be just as important as the data you do see.
Are there missing values? This is a common problem that is troublesome and time consuming to handle.
How are the data coded? In particular, how are the categorical variables represented?
What are the units of measurement?
Beware of data entry errors and other corruption of the data. This problem is all too common — almost a certainty in any real dataset of at least moderate size. Perform some data sanity checks.
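As a sketch of what such sanity checks might look like in code — using a small, entirely hypothetical data frame with deliberately planted problems (the column names and limits are illustrative assumptions, not part of any real dataset):

```python
import pandas as pd

# Hypothetical patient records with deliberate problems for illustration
df = pd.DataFrame({
    "age": [34, 51, -2, 45, 51],                  # -2 is an impossible value
    "weight_kg": [70.5, 82.0, 64.3, 0.0, 82.0],   # 0.0 likely codes "missing"
    "sex": ["F", "F", "M", "F", "F"],
})

# Range checks: flag values outside plausible physical limits
bad_age = df[(df["age"] < 0) | (df["age"] > 120)]
zero_weight = df[df["weight_kg"] == 0.0]

# Exact duplicate rows often indicate data-entry errors
dupes = df[df.duplicated()]

print(len(bad_age), len(zero_weight), len(dupes))
```

Checks like these take only a few lines but routinely catch the kinds of corruption described above before they contaminate an analysis.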
1.2 Initial Data Analysis
This is a critical step that should always be performed. It is simple but it is vital. You should make numerical summaries such as means, standard deviations (SDs), maximum and minimum, correlations and whatever else is appropriate to the specific dataset. Equally important are graphical summaries. There is a wide variety of techniques to choose from. For one variable at a time, you can make boxplots, histograms, density plots and more. For two variables, scatterplots are standard while for even more variables, there are numerous good ideas for display including interactive and dynamic graphics. In the plots, look for outliers, data-entry errors, skewed or unusual distributions and structure. Check whether the data are distributed according to prior expectations.
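A minimal sketch of such numerical and graphical summaries, using synthetic data so it is self-contained (the variable names and distributions are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Synthetic stand-in for a real dataset: two related variables
df = pd.DataFrame({"x": rng.normal(50, 10, 200)})
df["y"] = 2 * df["x"] + rng.normal(0, 5, 200)

# Numerical summaries: count, mean, SD, min, max and quartiles in one call
summary = df.describe()
corr = df["x"].corr(df["y"])

# Graphical summaries: a histogram for one variable, a scatterplot for two
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(df["x"], bins=20)
ax2.scatter(df["x"], df["y"], s=5)
fig.savefig("summary.png")
```

The point of such plots is not polish but inspection: outliers, skewness and unexpected structure are usually visible at a glance.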
Getting data into a form suitable for analysis by cleaning out mistakes and aberrations is often time consuming. It often takes more time than the data analysis itself. One might consider this the core work of data science. In this book, all the data will be ready to analyze, but you should realize that in practice this is rarely the case.
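A sketch of what this cleaning work often involves in pandas — the raw values here are hypothetical, chosen to show two common problems: text that should be numeric, and zeros used as a missing-value code:

```python
import numpy as np
import pandas as pd

# Hypothetical raw records as they might arrive: mixed types, sentinel zeros
raw = pd.DataFrame({
    "glucose": ["148", "85", "bad", "89"],   # one unparseable entry
    "diastolic": [72, 66, 0, 66],            # 0 used as a missing-value code
})

clean = raw.copy()

# Coerce text to numbers; unparseable entries become NaN rather than crashing
clean["glucose"] = pd.to_numeric(clean["glucose"], errors="coerce")

# Replace implausible zeros with NaN so they are not treated as real readings
clean["diastolic"] = clean["diastolic"].replace(0, np.nan)

print(clean.isna().sum())  # count of missing values per column after cleaning
```

Real cleaning jobs involve many such decisions, each requiring judgment about what a suspect value actually means.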
Let’s look at an example. The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The following variables were recorded: number of times pregnant, plasma glucose concentration at 2 hours in an oral glucose tolerance test, diastolic blood pressure (mmHg), triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), body mass index (weight in kg/(height in m)²), diabetes pedigree function, age (years) and a test of whether the patient showed signs of diabetes (coded zero if negative, one if positive). The data may be obtained from the UCI Repository of machine learning databases at archive.ics.uci.edu/ml.
Base Python has only limited functionality for numerical work. You will surely need to import some packages before you can accomplish anything. It is common to load all the packages you will need in a session at the beginning. We start with:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import seaborn as sns
import statsmodels.formula.api as smf
```
You could wait and import each package when you first need it, but listing them all at the beginning makes it clear to anyone sharing or returning to the work which packages are required. The as pd means we can refer to functions in the pandas package with the abbreviation pd.
Before doing anything else, one should find out the purpose of the study and more about how the data were collected. However, let’s skip ahead to a look at the data:

```python
import faraway.datasets.pima
pima = faraway.datasets.pima.load()
pima.head()
```

```
   pregnant  glucose  diastolic  triceps  insulin   bmi  diabetes  age  test
0         6      148         72       35        0  33.6     0.627   50     1
1         1       85         66       29        0  26.6     0.351   31     0
2         8      183         64        0        0  23.3     0.672   32     1
3         1       89         66       23       94  28.1     0.167   21     0
4         0      137         40       35      168  43.1     2.288   33     1
```
Many of the datasets used in this book are supplied in the faraway package. See the appendix for how to install this package. Any time you want to use one of these datasets, you will need to import the package containing the data you require and then load it.
The command pima.head() prints out the first five lines of the data frame. This is a good way to see what variables we have and what sort of values they take. You can type pima to see the whole data frame but...