Linear Models with Python
eBook - ePub

  1. 298 pages
  2. English

About this book

Praise for Linear Models with R:

This book is a must-have tool for anyone interested in understanding and applying linear models. The logical ordering of the chapters is well thought out and portrays Faraway's wealth of experience in teaching and using linear models. … It lays down the material in a logical and intricate manner and makes linear modeling appealing to researchers from virtually all fields of study. -Biometrical Journal

Throughout, it gives plenty of insight … with comments that even the seasoned practitioner will appreciate. Interspersed with R code and the output that it produces one can find many little gems of what I think is sound statistical advice, well epitomized with the examples chosen…I read it with delight and think that the same will be true with anyone who is engaged in the use or teaching of linear models. -Journal of the Royal Statistical Society

Like its widely praised, best-selling companion version, Linear Models with R, this book replaces R with Python to seamlessly give a coherent exposition of the practice of linear modeling. Linear Models with Python offers up-to-date insight on essential data analysis topics, from estimation, inference and prediction to missing data, factorial models and block designs. Numerous examples illustrate how to apply the different methods using Python.

Features:

  • Python is a powerful, open source programming language increasingly being used in data science, machine learning and computer science. Python and R are similar, but R was designed for statistics, while Python is multi-talented.
  • This version replaces R with Python to make it accessible to a greater number of users outside of statistics, including those from Machine Learning.
  • A reader coming to this book from an ML background will learn new statistical perspectives on learning from data.
  • Topics include Model Selection, Shrinkage, Experiments with Blocks and Missing Data.
  • Includes an Appendix on Python for beginners.

Linear Models with Python explains how to use linear models in physical science, engineering, social science and business applications. It is ideal as a textbook for linear models or linear regression courses.

Chapter 1

Introduction

1.1 Before You Start

Statistics starts with a problem, proceeds with the collection of data, continues with the data analysis and finishes with conclusions. It is a common mistake of inexperienced statisticians to plunge into a complex analysis without paying attention to the objectives or even whether the data are appropriate for the proposed analysis. As Einstein said, the formulation of a problem is often more essential than its solution which may be merely a matter of mathematical or experimental skill.
To formulate the problem correctly, you must:
  1. Understand the physical background. Statisticians often work in collaboration with others and need to understand something about the subject area. Regard this as an opportunity to learn something new rather than a chore.
  2. Understand the objective. Again, often you will be working with a collaborator who may not be clear about what the objectives are. Beware of “fishing expeditions” — if you look hard enough, you will almost always find something, but that something may just be a coincidence.
  3. Make sure you know what the client wants. You can often do quite different analyses on the same dataset. Sometimes statisticians perform an analysis far more complicated than the client really needed. You may find that simple descriptive statistics are all that are needed.
  4. Put the problem into statistical terms. This is a challenging step and where irreparable errors are sometimes made. Once the problem is translated into the language of statistics, the solution is often routine. This is where human intelligence is decidedly superior to artificial intelligence. Defining the problem is hard to program. That a statistical method can read in and process the data is not enough. The results of an inapt analysis may be meaningless.
It is important to understand how the data were collected.
  1. Are the data observational or experimental? Are the data a sample of convenience or were they obtained via a designed sample survey? How the data were collected has a crucial impact on what conclusions can be made.
  2. Is there nonresponse? The data you do not see may be just as important as the data you do see.
  3. Are there missing values? This is a common problem that is troublesome and time consuming to handle.
  4. How are the data coded? In particular, how are the categorical variables represented?
  5. What are the units of measurement?
  6. Beware of data entry errors and other corruption of the data. This problem is all too common — almost a certainty in any real dataset of at least moderate size. Perform some data sanity checks.
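A few of the checks listed above can be sketched in pandas. This is only an illustration under stated assumptions: the data frame, its column names (age, weight_kg, smoker) and the plausible-range limits are all invented for the example and do not come from any dataset in the book.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names and values are invented for illustration.
df = pd.DataFrame({
    "age": [34, 51, 27, 430, 45],                    # 430 looks like a data-entry error
    "weight_kg": [61.0, np.nan, 72.5, 80.1, 55.3],   # one missing value
    "smoker": [0, 1, 1, 0, 2],                       # documented codes are 0/1; 2 is not
})

# Missing values per column.
n_missing = df.isna().sum()

# Rows outside a plausible range (the limits 0 and 120 are assumptions).
bad_age = df[(df["age"] < 0) | (df["age"] > 120)]

# Rows whose categorical code is not among the documented categories.
bad_codes = df[~df["smoker"].isin([0, 1])]

print(n_missing)
print(bad_age)
print(bad_codes)
```

Checks like these do not prove the data are clean; they only surface values that deserve a question to whoever collected them.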

1.2 Initial Data Analysis

This is a critical step that should always be performed. It is simple but it is vital. You should make numerical summaries such as means, standard deviations (SDs), maximum and minimum, correlations and whatever else is appropriate to the specific dataset. Equally important are graphical summaries. There is a wide variety of techniques to choose from. For one variable at a time, you can make boxplots, histograms, density plots and more. For two variables, scatterplots are standard while for even more variables, there are numerous good ideas for display including interactive and dynamic graphics. In the plots, look for outliers, data-entry errors, skewed or unusual distributions and structure. Check whether the data are distributed according to prior expectations.
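As a rough sketch of the numerical side of this step, the following uses synthetic data (invented here, with one planted outlier) to compute the usual summaries and to flag points by the 1.5 × IQR rule that a boxplot uses. The column names x and y are placeholders, and the plots themselves are omitted.

```python
import numpy as np
import pandas as pd

# Synthetic data (invented for illustration): 50 draws plus one gross outlier.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(loc=10.0, scale=2.0, size=50),
    "y": rng.normal(loc=5.0, scale=1.0, size=50),
})
df.loc[0, "x"] = 100.0  # planted outlier

# Count, mean, SD, min, quartiles and max for each column.
summary = df.describe()

# Pairwise correlations.
corr = df.corr()

# Boxplot-style flag: points more than 1.5 IQR beyond the quartiles.
q1, q3 = df["x"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["x"] < q1 - 1.5 * iqr) | (df["x"] > q3 + 1.5 * iqr)]

print(summary.round(2))
print(corr.round(2))
print(outliers)
```

A flagged point is a prompt to look more closely, not an instruction to delete it.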
Getting data into a form suitable for analysis by cleaning out mistakes and aberrations is often time consuming. It often takes more time than the data analysis itself. One might consider this the core work of data science. In this book, all the data will be ready to analyze, but you should realize that in practice this is rarely the case.
Let’s look at an example. The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The following variables were recorded: number of times pregnant, plasma glucose concentration at 2 hours in an oral glucose tolerance test, diastolic blood pressure (mmHg), triceps skin fold thickness (mm), 2-hour serum insulin (μU/ml), body mass index (weight in kg/(height in m)²), diabetes pedigree function, age (years) and a test of whether the patient showed signs of diabetes (coded zero if negative, one if positive). The data may be obtained from the UCI Repository of machine learning databases at archive.ics.uci.edu/ml.
Base Python has only limited functionality for numerical work. You will surely need to import some packages before you can accomplish anything. It is common to load all the packages you will need in a session at the beginning. We start with:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy as sp
    import seaborn as sns
    import statsmodels.formula.api as smf
You can wait until you need them, but listing them all at the beginning is helpful when you share or return to your work later, because everyone can see which packages are needed. The as pd means we can refer to functions in the pandas package with the abbreviation pd.
Before doing anything else, one should find out the purpose of the study and more about how the data were collected. However, let’s skip ahead to a look at the data:
    import faraway.datasets.pima
    pima = faraway.datasets.pima.load()
    pima.head()
       pregnant  glucose  diastolic  triceps  insulin   bmi  diabetes  age  test
    0         6      148         72       35        0  33.6     0.627   50     1
    1         1       85         66       29        0  26.6     0.351   31     0
    2         8      183         64        0        0  23.3     0.672   32     1
    3         1       89         66       23       94  28.1     0.167   21     0
    4         0      137         40       35      168  43.1     2.288   33     1
Many of the datasets used in this book are supplied in the faraway package. See the appendix for how to install this package. Any time you want to use one of these datasets, you will need to import the package containing the data you require and then load it.
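If the faraway package is not yet installed, one way to experiment is to type in the five rows printed above by hand. The sketch below does that and counts zeros in each column; the name pima5 is made up for this example, and five rows say nothing reliable about the full 768.

```python
import pandas as pd

# The five rows shown by pima.head() above, typed in by hand.
# pima5 is a hypothetical name used only in this sketch.
pima5 = pd.DataFrame({
    "pregnant": [6, 1, 8, 1, 0],
    "glucose": [148, 85, 183, 89, 137],
    "diastolic": [72, 66, 64, 66, 40],
    "triceps": [35, 29, 0, 23, 35],
    "insulin": [0, 0, 0, 94, 168],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1],
    "diabetes": [0.627, 0.351, 0.672, 0.167, 2.288],
    "age": [50, 31, 32, 21, 33],
    "test": [1, 0, 1, 0, 1],
})

# Dimensions and storage type of each column.
print(pima5.shape)
print(pima5.dtypes)

# Number of zeros in each column.
zero_counts = (pima5 == 0).sum()
print(zero_counts)
```

A zero is a sensible value for pregnant or test, but a zero for triceps or insulin may be worth a second look, since it could be a code for a missing measurement rather than a real reading.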
The command pima.head() prints out the first five lines of the data frame. This is a good way to see what variables we have and what sort of values they take. You can type pima to see the whole data frame but...

Table of contents

  1. Cover
  2. Half Title
  3. Series Page
  4. Title Page
  5. Copyright Page
  6. Contents
  7. Preface
  8. 1 Introduction
  9. 2 Estimation
  10. 3 Inference
  11. 4 Prediction
  12. 5 Explanation
  13. 6 Diagnostics
  14. 7 Problems with the Predictors
  15. 8 Problems with the Error
  16. 9 Transformation
  17. 10 Model Selection
  18. 11 Shrinkage Methods
  19. 12 Insurance Redlining — A Complete Example
  20. 13 Missing Data
  21. 14 Categorical Predictors
  22. 15 One-Factor Models
  23. 16 Models with Several Factors
  24. 17 Experiments with Blocks
  25. A About Python
  26. Bibliography
  27. Index

Linear Models with Python by Julian J. Faraway is available in PDF and/or ePUB format and is catalogued under Economics & Statistics for Business and Economics.