Python Data Cleaning Cookbook
eBook - ePub

Modern techniques and Python tools to detect and remove dirty data and extract key insights

  1. 436 pages
  2. English
  3. ePUB (mobile friendly)

About this book

Discover how to describe your data in detail, identify data issues, and find out how to solve them using commonly used techniques and tips and tricks

Key Features

  • Get well-versed with various data cleaning techniques to reveal key insights
  • Manipulate data of different complexities to shape them into the right form as per your business needs
  • Clean, monitor, and validate large data volumes to diagnose problems before moving on to data analysis

Book Description

Getting clean data to reveal insights is essential, as directly jumping into data analysis without proper data cleaning may lead to incorrect results. This book shows you tools and techniques that you can apply to clean and handle data with Python. You'll begin by getting familiar with the shape of data by using practices that can be deployed routinely with most data sources. Then, the book teaches you how to manipulate data to get it into a useful form. You'll also learn how to filter and summarize data to gain insights and better understand what makes sense and what does not, along with discovering how to operate on data to address the issues you've identified. Moving on, you'll perform key tasks, such as handling missing values, validating errors, removing duplicate data, monitoring high volumes of data, and handling outliers and invalid dates. Next, you'll cover recipes on using supervised learning and Naive Bayes analysis to identify unexpected values and classification errors, and generate visualizations for exploratory data analysis (EDA) to visualize unexpected values. Finally, you'll build functions and classes that you can reuse without modification when you have new data.

By the end of this Python book, you'll be equipped with all the key skills that you need to clean data and diagnose problems within it.

What you will learn

  • Find out how to read and analyze data from a variety of sources
  • Produce summaries of the attributes of data frames, columns, and rows
  • Filter data and select columns of interest that satisfy given criteria
  • Address messy data issues, including working with dates and missing values
  • Improve your productivity in Python pandas by using method chaining
  • Use visualizations to gain additional insights and identify potential data issues
  • Enhance your ability to learn what is going on in your data
  • Build user-defined functions and classes to automate data cleaning

Who this book is for

This book is for anyone looking for ways to handle messy, duplicate, and poor data using different Python tools and techniques. The book takes a recipe-based approach to help you to learn how to clean and manage data. Working knowledge of Python programming is all you need to get the most out of the book.

Python Data Cleaning Cookbook by Michael Walker is available in PDF and/or ePUB format. It is listed under Computer Science & Data Mining.

Information

Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas

Scientific distributions of Python (Anaconda, WinPython, Canopy, and so on) provide analysts with an impressive range of data manipulation, exploration, and visualization tools. One important tool is pandas. Developed by Wes McKinney in 2008, but really gaining in popularity after 2012, pandas is now an essential library for data analysis in Python. We work with pandas extensively in this book, along with popular packages such as numpy, matplotlib, and scipy.
A key pandas object is the data frame, which represents data as a tabular structure, with rows and columns. In this way, it is similar to the other data stores we discuss in this chapter. However, a pandas data frame also has indexing functionality that makes selecting, combining, and transforming data relatively straightforward, as the recipes in this book will demonstrate.
Before we can make use of this great functionality, we have to get our data into pandas. Data comes to us in a wide variety of formats: as CSV or Excel files, as tables from SQL databases, from statistical analysis packages such as SPSS, Stata, SAS, or R, from non-tabular sources such as JSON, and from web pages.
We examine tools for importing tabular data in this chapter. Specifically, we cover the following topics:
  • Importing CSV files
  • Importing Excel files
  • Importing data from SQL databases
  • Importing SPSS, Stata, and SAS data
  • Importing R data
  • Persisting tabular data

Technical requirements

The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook

Importing CSV files

The read_csv function of the pandas library reads a file of comma-separated values (CSV) and loads it into memory as a pandas data frame. In this recipe, we read a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical missing data.
Raw data is often stored as CSV files. Each line of the file ends with a line break (a newline, sometimes preceded by a carriage return) that demarcates a row, and a comma between data values delineates columns. A delimiter other than a comma, such as a tab, can also be used. Values may be enclosed in quotation marks, which helps when the delimiter occurs naturally within a value, as commas sometimes do.
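To make the points about delimiters and quotation marks concrete, here is a minimal sketch. The data is hypothetical and held in strings rather than files so that the example is self-contained; it is not taken from the book's data files.

```python
import io
import pandas as pd

# Hypothetical tab-delimited text: sep="\t" tells read_csv that the
# delimiter is a tab rather than a comma.
tab_text = (
    "station\tcountry\ttemp\n"
    "ARICA\tChile\t18.04\n"
    "INDIAN_CANYON\tUnited States\t5.27\n"
)
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")

# Hypothetical comma-delimited text in which one value contains a comma.
# The quotation marks keep "Arica, Chile" together as a single value.
quoted_text = (
    "station,location,temp\n"
    'ARICA,"Arica, Chile",18.04\n'
)
quoted_df = pd.read_csv(io.StringIO(quoted_text))

print(tab_df)
print(quoted_df)
```

In both cases the result has three columns; without the quotation marks, the second example would appear to have four values in its data row.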
All data in a CSV file are characters, regardless of the logical data type. This is why a CSV file, provided it is not too large, is easy to view in a text editor. The pandas read_csv function makes an educated guess about the data type of each column, but you will often need to help it along to ensure that these guesses are on the mark.
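One way a guess can go wrong is with codes that look like numbers. The sketch below uses a hypothetical zipcode column (not part of the book's data) to show the default guess and one way of overriding it with the dtype parameter of read_csv.

```python
import io
import pandas as pd

# Hypothetical data: zipcode is logically a label, but its values
# consist only of digits.
sample = "zipcode,temp\n02134,5.27\n90210,18.04\n"

# With no hints, read_csv infers an integer column and the
# leading zero of the first value is lost.
guessed = pd.read_csv(io.StringIO(sample))

# Passing dtype keeps the column as text, preserving the value as written.
explicit = pd.read_csv(io.StringIO(sample), dtype={"zipcode": str})

print(guessed["zipcode"].iloc[0])
print(explicit["zipcode"].iloc[0])
```

Checking the dtypes attribute of a data frame right after import is a quick way to see which guesses were made.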

Getting ready

Create a folder for this chapter and create a new Python script or Jupyter Notebook file in that folder. Create a data subfolder and place the landtempssample.csv file in that subfolder. Alternatively, you could retrieve all of the files from the GitHub repository. Here is a sample from the beginning of the CSV file:
locationid,year,month,temp,latitude,longitude,stnelev,station,countryid,country
USS0010K01S,2000,4,5.27,39.9,-110.75,2773.7,INDIAN_CANYON,US,United States
CI000085406,1940,5,18.04,-18.35,-70.333,58.0,ARICA,CI,Chile
USC00036376,2013,12,6.22,34.3703,-91.1242,61.0,SAINT_CHARLES,U...
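
The three tasks this recipe names (more meaningful column names, date parsing, and dropping rows with critical missing data) could be sketched as follows. This is an illustration under stated assumptions, not necessarily the book's own code: it uses only the two complete rows shown above, adds one clearly hypothetical row with a missing temperature, and the new names avgtemp, elevation, and measuredate are made up for the example.

```python
import io
import pandas as pd

csv_text = (
    "locationid,year,month,temp,latitude,longitude,stnelev,station,countryid,country\n"
    "USS0010K01S,2000,4,5.27,39.9,-110.75,2773.7,INDIAN_CANYON,US,United States\n"
    "CI000085406,1940,5,18.04,-18.35,-70.333,58.0,ARICA,CI,Chile\n"
    # Hypothetical row, invented for this sketch, with no temperature value.
    "XX000000000,1990,1,,0.0,0.0,10.0,EXAMPLE,XX,Example\n"
)

landtemps = pd.read_csv(io.StringIO(csv_text))

# Rename columns to names that are easier to read (assumed names).
landtemps = landtemps.rename(columns={"temp": "avgtemp", "stnelev": "elevation"})

# Build a date from the year and month columns, using the first day
# of the month as a placeholder day.
landtemps["measuredate"] = pd.to_datetime(
    landtemps[["year", "month"]].assign(day=1)
)

# Drop rows where the critical value, the temperature, is missing.
landtemps = landtemps.dropna(subset=["avgtemp"])

print(landtemps[["locationid", "measuredate", "avgtemp"]])
```

To run this against the actual file, the io.StringIO(csv_text) argument would be replaced with the path to landtempssample.csv in the data subfolder.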

Table of contents

  1. Python Data Cleaning Cookbook
  2. Why subscribe?
  3. Preface
  4. Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas
  5. Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas
  6. Chapter 3: Taking the Measure of Your Data
  7. Chapter 4: Identifying Missing Values and Outliers in Subsets of Data
  8. Chapter 5: Using Visualizations for the Identification of Unexpected Values
  9. Chapter 6: Cleaning and Exploring Data with Series Operations
  10. Chapter 7: Fixing Messy Data when Aggregating
  11. Chapter 8: Addressing Data Issues When Combining DataFrames
  12. Chapter 9: Tidying and Reshaping Data
  13. Chapter 10: User-Defined Functions and Classes to Automate Data Cleaning
  14. Other Books You May Enjoy