Essential Statistics for Non-STEM Data Analysts

Get to grips with the statistics and math knowledge needed to enter the world of data science with Python

  1. 392 pages
  2. English

About this book

Reinforce your understanding of data science and data analysis from a statistical perspective to extract meaningful insights from your data using Python programming

Key Features

  • Work your way through the entire data analysis pipeline with statistics concerns in mind to make reasonable decisions
  • Understand how various data science algorithms function
  • Build a solid foundation in statistics for data science and machine learning using Python-based examples

Book Description

Statistics remain the backbone of modern analysis tasks, helping you to interpret the results produced by data science pipelines. This book is a detailed guide covering the math and various statistical methods required for undertaking data science tasks.

The book starts by showing you how to preprocess data and inspect distributions and correlations from a statistical perspective. You'll then get to grips with the fundamentals of statistical analysis and apply its concepts to real-world datasets. As you advance, you'll find out how statistical concepts emerge from different stages of data science pipelines, understand the summary of datasets in the language of statistics, and use it to build a solid foundation for robust data products such as explanatory models and predictive models. Once you've uncovered the working mechanism of data science algorithms, you'll cover essential concepts for efficient data collection, cleaning, mining, visualization, and analysis. Finally, you'll implement statistical methods in key machine learning tasks such as classification, regression, tree-based methods, and ensemble learning.

By the end of this Essential Statistics for Non-STEM Data Analysts book, you'll have learned how to build and present a self-contained, statistics-backed data product to meet your business goals.

What you will learn

  • Find out how to grab and load data into an analysis environment
  • Perform descriptive analysis to extract meaningful summaries from data
  • Discover probability, parameter estimation, hypothesis tests, and experiment design best practices
  • Get to grips with resampling and bootstrapping in Python
  • Delve into statistical tests with variance analysis, time series analysis, and A/B test examples
  • Understand the statistics behind popular machine learning algorithms
  • Answer questions on statistics for data scientist interviews

Who this book is for

This book is an entry-level guide for data science enthusiasts, data analysts, and anyone starting out in the field of data science and looking to learn the essential statistical concepts with the help of simple explanations and examples. If you're a developer or student with a non-mathematical background, you'll find this book useful. Working knowledge of the Python programming language is required.

Information

Year: 2020
Print ISBN: 9781838984847
Edition: 1
eBook ISBN: 9781838987565

Section 1: Getting Started with Statistics for Data Science

In this section, you will learn how to preprocess data and inspect distributions and correlations from a statistical perspective.
This section consists of the following chapters:
  • Chapter 1, Fundamentals of Data Collection, Cleaning, and Preprocessing
  • Chapter 2, Essential Statistics for Data Assessment
  • Chapter 3, Visualization with Statistical Graphs

Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing

Thank you for purchasing this book and welcome to a journey of exploration and excitement! Whether you are already a data scientist, preparing for an interview, or just starting to learn, this book will serve you well as a companion. You may already be familiar with common Python toolkits and have followed trending tutorials online. However, few of those resources take a systematic approach to the statistical side of data science. This book is designed and written to close this gap for you.
As the first chapter in the book, we start with the very first step of a data science project: collecting and cleaning data and performing some initial preprocessing. It is like preparing fish for cooking. You get the fish from the water or from the fish market, examine it, and process it a little bit before bringing it to the chef.
You are going to learn five key topics in this chapter. They are closely related to other topics, such as visualization and basic statistics concepts. For example, outlier removal is very hard to conduct without a scatter plot, and data standardization requires an understanding of statistics such as the standard deviation. We have prepared a GitHub repository that contains ready-to-run code for this chapter as well as the rest of the book.
Here are the topics that will be covered in this chapter:
  • Collecting data from various data sources with a focus on data quality
  • Data imputation with an assessment of downstream task requirements
  • Outlier removal
  • Data standardization – when and how
  • Examples involving the scikit-learn preprocessing module
This chapter serves as a primer, so it is not possible to cover the topics in an entirely sequential fashion. For example, removing outliers relies on statistical plotting techniques, specifically the box plot and the scatter plot. We will come back to those techniques in detail in later chapters, so please bear with us for now. Sometimes, in order to learn new topics, bootstrapping (using a tool before it has been fully introduced) may be one of the few ways to break the shell. You will enjoy it because the more topics you learn along the way, the higher your confidence will be.
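As a small preview of two of the topics listed above, the following sketch standardizes a list of numbers with the mean and standard deviation and then flags outliers with the 1.5 × IQR rule that a box plot's whiskers conventionally use. The numbers are made up for illustration, and the choice of the population standard deviation and of the 1.5 multiplier are assumptions of this sketch, not necessarily the choices made later in the book.

```python
import statistics

# Made-up measurements for illustration only; 45.0 is deliberately extreme.
values = [12.0, 14.0, 15.0, 15.0, 16.0, 18.0, 45.0]

# Standardization (z-scores): subtract the mean, divide by the standard deviation.
# Assumption: population standard deviation (pstdev) is used here.
mean = statistics.mean(values)
std = statistics.pstdev(values)
z_scores = [(v - mean) / std for v in values]

# Box-plot style outlier rule: anything beyond 1.5 * IQR from the quartiles.
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]

print("z-scores:", [round(z, 2) for z in z_scores])
print("outliers:", outliers)
```

After standardization the values have mean 0 and standard deviation 1, which is why the step depends on understanding the standard deviation in the first place.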

Technical requirements

The best environment for running the Python code in the book is Google Colaboratory (https://colab.research.google.com). Google Colaboratory is a product that runs Jupyter Notebook in the cloud. It runs in a browser and comes with common Python packages pre-installed. It can also connect to Google Drive, so you can upload local files there and read them from a notebook. The recommended browsers are the latest versions of Chrome and Firefox.
For more information about Colaboratory, check out the official notebooks at https://colab.research.google.com.
You can find the code for this chapter in the following GitHub repository: https://github.com/PacktPublishing/Essential-Statistics-for-Non-STEM-Data-Analysts

Collecting data from various data sources

There are three major ways to collect and gather data. It is crucial to keep in mind that data doesn't have to be well-formatted tables:
  • Obtaining structured tabulated data directly: For example, the Federal Reserve (https://www.federalreserve.gov/data.htm) releases well-structured and well-documented data in various formats, including CSV, so that pandas can read the file into a DataFrame format.
  • Requesting data from an API: For example, the Google Maps API (https://developers.google.com/maps/documentation) allows developers to request data from Google at a capped rate that depends on the pricing plan. The returned format is usually JSON or XML.
  • Building a dataset from scratch: For example, social scientists often perform surveys and collect participants' answers to build proprietary data.
Let's look at some examples involving these three approaches. You will use the UCI machine learning repository, the Google Maps API, and the USC President's Office website as data sources, respectively.

Reading data directly from files

Reading data from local files or remote files through a URL usually requires a good source of publicly accessible data archives. For example, the University of California, Irvine maintains a data repository for machine learning. We will be reading the heart disease dataset with pandas. You may obtain the file from https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. From the datasets there, we are using the processed.hungarian.data file. You need to upload the file to the same folder where the notebook resides. In case the following code fails, the latest URL will be kept up to date in the book's official GitHub repository.
The following code snippet reads the data and displays the first several rows of the dataset:
import pandas as pd

# The file has no header row, so the 14 column names are supplied explicitly.
df = pd.read_csv("processed.hungarian.data",
                 sep=",",
                 names=["age", "sex", "cp", "trestbps",
                        "chol", "fbs", "restecg", "thalach",
                        "exang", "oldpeak", "slope", "ca",
                        "thal", "num"])
df.head()  # show the first five rows
This produces the following output:
Figure 1.1 – Head of the Hungarian heart disease dataset
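If the data file is not at hand, the same call can be tried on an in-memory stand-in. The sketch below is only an illustration: the two rows are made up (they are not records from the dataset), and passing na_values="?" assumes that missing entries are marked with a question mark, which is worth confirming against the actual file before relying on it.

```python
import io

import pandas as pd

# Two made-up rows in a 14-column, header-less, comma-separated layout.
# They are NOT real records from the heart disease dataset.
sample = io.StringIO(
    "40,1,2,120,200,0,0,150,0,0.0,?,?,?,0\n"
    "55,0,3,?,250,0,1,130,1,1.5,2,?,?,1\n"
)

columns = ["age", "sex", "cp", "trestbps",
           "chol", "fbs", "restecg", "thalach",
           "exang", "oldpeak", "slope", "ca",
           "thal", "num"]

# Assumption: "?" marks a missing value, so it is converted to NaN on load.
df = pd.read_csv(sample, sep=",", names=columns, na_values="?")

# Count the missing entries per column to see where imputation may be needed.
missing_per_column = df.isna().sum()
print(df.head())
print(missing_per_column)
```

Counting missing values right after loading gives an early view of data quality, which matters for the imputation decisions discussed later in the chapter.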
In the following section, you will learn how to obtain data from an API.

Obtaining data from an API

In plain English, an Application Programming Interface (API) defines protocols, agreements, or treaties between applications or parts of applications. You need to pass requests to an API and obtain returned data in JSON or other formats specified in the API documentation. Then you can extract the data you want.
Note
When working with an API, you need to follow the guidelines and restrictions regarding API usage. Improper usage of an API can result in the suspension of an account or even legal issues.
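The request-and-extract pattern described above can be sketched without touching a real service. In the example below, the JSON text stands in for a response body; its field names (status, results, name, types, open_now) and the extract_places helper are hypothetical and chosen only for illustration, so they should not be read as the schema of any particular API.

```python
import json

# Hypothetical response body; a real API documents its own field names.
raw_response = """
{
  "status": "OK",
  "results": [
    {"name": "Example School", "types": ["school"], "open_now": true},
    {"name": "Example City Hall", "types": ["local_government_office"], "open_now": false}
  ]
}
"""


def extract_places(payload):
    """Parse a JSON payload and keep only the fields of interest."""
    data = json.loads(payload)
    # Check the status field before trusting the rest of the payload.
    if data.get("status") != "OK":
        raise ValueError(f"request failed with status {data.get('status')!r}")
    return [
        {
            "name": item["name"],
            "primary_type": item["types"][0],
            "open_now": item.get("open_now"),
        }
        for item in data["results"]
    ]


places = extract_places(raw_response)
for place in places:
    print(place)
```

Keeping the parsing step in its own function makes it easy to test against saved responses, which also helps you stay within an API's rate limits while developing.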
Let's take the Google Maps Places API as an example. The Places API (https://developers.google.com/places/web-service/intro) is one of many Google Maps APIs that Google offers. Developers can use HTTP requests to obtain information about certain geographic locations, the opening hours of establishments, and the types of establishment, such as schools, government offices, ...

Table of contents

  1. Essential Statistics for Non-STEM Data Analysts
  2. Why subscribe?
  3. Preface
  4. Section 1: Getting Started with Statistics for Data Science
  5. Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing
  6. Chapter 2: Essential Statistics for Data Assessment
  7. Chapter 3: Visualization with Statistical Graphs
  8. Section 2: Essentials of Statistical Analysis
  9. Chapter 4: Sampling and Inferential Statistics
  10. Chapter 5: Common Probability Distributions
  11. Chapter 6: Parametric Estimation
  12. Chapter 7: Statistical Hypothesis Testing
  13. Section 3: Statistics for Machine Learning
  14. Chapter 8: Statistics for Regression
  15. Chapter 9: Statistics for Classification
  16. Chapter 10: Statistics for Tree-Based Methods
  17. Chapter 11: Statistics for Ensemble Methods
  18. Section 4: Appendix
  19. Chapter 12: A Collection of Best Practices
  20. Chapter 13: Exercises and Projects
  21. Other Books You May Enjoy

Essential Statistics for Non-STEM Data Analysts by Rongpeng Li is available in PDF and/or ePUB format and is catalogued under Computer Science & Data Modelling & Design.