eBook - ePub

Python Data Cleaning Cookbook

Name: Python Data Cleaning Cookbook
ISBN: 9781800564596

Modern techniques and Python tools to detect and remove dirty data and extract key insights

Michael Walker,

436 pages
English
ePUB (mobile friendly)
Available on iOS & Android

eBook - ePub

Python Data Cleaning Cookbook

Modern techniques and Python tools to detect and remove dirty data and extract key insights

Michael Walker,

About this book

Discover how to describe your data in detail, identify data issues, and find out how to solve them using commonly used techniques and tips and tricks

Key Features

Get well-versed with various data cleaning techniques to reveal key insights
Manipulate data of different complexities to shape them into the right form as per your business needs
Clean, monitor, and validate large data volumes to diagnose problems before moving on to data analysis

Book Description

Getting clean data to reveal insights is essential, as directly jumping into data analysis without proper data cleaning may lead to incorrect results. This book shows you tools and techniques that you can apply to clean and handle data with Python. You'll begin by getting familiar with the shape of data by using practices that can be deployed routinely with most data sources. Then, the book teaches you how to manipulate data to get it into a useful form. You'll also learn how to filter and summarize data to gain insights and better understand what makes sense and what does not, along with discovering how to operate on data to address the issues you've identified. Moving on, you'll perform key tasks, such as handling missing values, validating errors, removing duplicate data, monitoring high volumes of data, and handling outliers and invalid dates. Next, you'll cover recipes on using supervised learning and Naive Bayes analysis to identify unexpected values and classification errors, and generate visualizations for exploratory data analysis (EDA) to visualize unexpected values. Finally, you'll build functions and classes that you can reuse without modification when you have new data.

By the end of this Python book, you'll be equipped with all the key skills that you need to clean data and diagnose problems within it.

What you will learn

Find out how to read and analyze data from a variety of sources
Produce summaries of the attributes of data frames, columns, and rows
Filter data and select columns of interest that satisfy given criteria
Address messy data issues, including working with dates and missing values
Improve your productivity in Python pandas by using method chaining
Use visualizations to gain additional insights and identify potential data issues
Enhance your ability to learn what is going on in your data
Build user-defined functions and classes to automate data cleaning

Who this book is for

This book is for anyone looking for ways to handle messy, duplicate, and poor data using different Python tools and techniques. The book takes a recipe-based approach to help you to learn how to clean and manage data. Working knowledge of Python programming is all you need to get the most out of the book.

Tools to learn more effectively

Saving Books

Keyword Search

Annotating Text

Listen to it instead

Information

Publisher

Year

Print ISBN

eBook ISBN

Edition

Topic

Computer Science

Subtopic

Data Mining

Index

Computer Science

Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas

Scientific distributions of Python (Anaconda, WinPython, Canopy, and so on) provide analysts with an impressive range of data manipulation, exploration, and visualization tools. One important tool is pandas. Developed by Wes McKinney in 2008, but really gaining in popularity after 2012, pandas is now an essential library for data analysis in Python. We work with pandas extensively in this book, along with popular packages such as numpy, matplotlib, and scipy.

A key pandas object is the data frame, which represents data as a tabular structure, with rows and columns. In this way, it is similar to the other data stores we discuss in this chapter. However, a pandas data frame also has indexing functionality that makes selecting, combining, and transforming data relatively straightforward, as the recipes in this book will demonstrate.

Before we can make use of this great functionality, we have to get our data into pandas. Data comes to us in a wide variety of formats: as CSV or Excel files, as tables from SQL databases, from statistical analysis packages such as SPSS, Stata, SAS, or R, from non-tabular sources such as JSON, and from web pages.

We examine tools for importing tabular data in this recipe. Specifically, we cover the following topics:

Importing CSV files
Importing Excel files
Importing data from SQL databases
Importing SPSS, Stata, and SAS data
Importing R data
Persisting tabular data

Technical requirements

The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook

Importing CSV files

The read_csv method of the pandas library can be used to read a file with comma separated values (CSV) and load it into memory as a pandas data frame. In this recipe, we read a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical missing data.

Raw data is often stored as CSV files. These files have a carriage return at the end of each line of data to demarcate a row, and a comma between each data value to delineate columns. Something other than a comma can be used as the delimiter, such as a tab. Quotation marks may be placed around values, which can be helpful when the delimiter occurs naturally within certain values, which sometimes happens with commas.

All data in a CSV file are characters, regardless of the logical data type. This is why it is easy to view a CSV file, presuming it is not too large, in a text editor. The pandas read_csv method will make an educated guess about the data type of each column, but you will need to help it along to ensure that these guesses are on the mark.

Getting ready

Create a folder for this chapter and create a new Python script or Jupyter Notebook file in that folder. Create a data subfolder and place the landtempssample.csv file in that subfolder. Alternatively, you could retrieve all of the files from the GitHub repository. Here is a code sample from the beginning of the CSV file:

locationid,year,month,temp,latitude,longitude,stnelev,station,countryid,country

USS0010K01S,2000,4,5.27,39.9,-110.75,2773.7,INDIAN_CANYON,US,United States

CI000085406,1940,5,18.04,-18.35,-70.333,58.0,ARICA,CI,Chile

USC00036376,2013,12,6.22,34.3703,-91.1242,61.0,SAINT_CHARLES,U...

Python Data Cleaning Cookbook
Why subscribe?
Preface
Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas
Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas
Chapter 3: Taking the Measure of Your Data
Chapter 4: Identifying Missing Values and Outliers in Subsets of Data
Chapter 5: Using Visualizations for the Identification of Unexpected Values
Chapter 6: Cleaning and Exploring Data with Series Operations
Chapter 7: Fixing Messy Data when Aggregating
Chapter 8: Addressing Data Issues When Combining DataFrames
Chapter 9: Tidying and Reshaping Data
Chapter 10: User-Defined Functions and Classes to Automate Data Cleaning
Other Books You May Enjoy

Frequently asked questions

Yes, you can cancel anytime from the Subscription tab in your account settings on the Perlego website. Your subscription will stay active until the end of your current billing period. Learn how to cancel your subscription

No, books cannot be downloaded as external files, such as PDFs, for use outside of Perlego. However, you can download books within the Perlego app for offline reading on mobile or tablet. Learn how to download books offline

Perlego offers two plans: Essential and Complete

Essential is ideal for learners and professionals who enjoy exploring a wide range of subjects. Access the Essential Library with 800,000+ trusted titles and best-sellers across business, personal growth, and the humanities. Includes unlimited reading time and Standard Read Aloud voice.
Complete: Perfect for advanced learners and researchers needing full, unrestricted access. Unlock 1.4M+ books across hundreds of subjects, including academic and specialized titles. The Complete Plan also includes advanced features like Premium Read Aloud and Research Assistant.

Both plans are available with monthly, semester, or annual billing cycles.

We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 990+ topics, we’ve got you covered! Learn about our mission

Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more about Read Aloud

Yes! You can use the Perlego app on both iOS and Android devices to read anytime, anywhere — even offline. Perfect for commutes or when you’re on the go.
Please note we cannot support devices running on iOS 13 and Android 7 or earlier. Learn more about using the app

Yes, you can access Python Data Cleaning Cookbook by Michael Walker in PDF and/or ePUB format, as well as other popular books in Computer Science & Data Mining. We have over one million books available in our catalogue for you to explore.