eBook - ePub

Python Data Cleaning Cookbook

Name: Python Data Cleaning Cookbook
Author: Michael Walker

Modern techniques and Python tools to detect and remove dirty data and extract key insights

Michael Walker

436 pages
English

About this book

Discover how to describe your data in detail, identify data issues, and find out how to solve them using commonly used techniques and tips and tricks

Key Features

Get well-versed with various data cleaning techniques to reveal key insights
Manipulate data of different complexities to shape them into the right form as per your business needs
Clean, monitor, and validate large data volumes to diagnose problems before moving on to data analysis

Book Description

Getting clean data to reveal insights is essential, as directly jumping into data analysis without proper data cleaning may lead to incorrect results. This book shows you tools and techniques that you can apply to clean and handle data with Python. You'll begin by getting familiar with the shape of data by using practices that can be deployed routinely with most data sources. Then, the book teaches you how to manipulate data to get it into a useful form. You'll also learn how to filter and summarize data to gain insights and better understand what makes sense and what does not, along with discovering how to operate on data to address the issues you've identified. Moving on, you'll perform key tasks, such as handling missing values, validating errors, removing duplicate data, monitoring high volumes of data, and handling outliers and invalid dates. Next, you'll cover recipes on using supervised learning and Naive Bayes analysis to identify unexpected values and classification errors, and generate visualizations for exploratory data analysis (EDA) to visualize unexpected values. Finally, you'll build functions and classes that you can reuse without modification when you have new data.

By the end of this Python book, you'll be equipped with all the key skills that you need to clean data and diagnose problems within it.

What you will learn

Find out how to read and analyze data from a variety of sources
Produce summaries of the attributes of data frames, columns, and rows
Filter data and select columns of interest that satisfy given criteria
Address messy data issues, including working with dates and missing values
Improve your productivity in Python pandas by using method chaining
Use visualizations to gain additional insights and identify potential data issues
Enhance your ability to learn what is going on in your data
Build user-defined functions and classes to automate data cleaning

Who this book is for

This book is for anyone looking for ways to handle messy, duplicate, and poor data using different Python tools and techniques. The book takes a recipe-based approach to help you to learn how to clean and manage data. Working knowledge of Python programming is all you need to get the most out of the book.

Information

Publisher: Packt Publishing
Year: 2020
ISBN: 9781800564596
Categories: Computer Science, Data Processing

Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas

Scientific distributions of Python (Anaconda, WinPython, Canopy, and so on) provide analysts with an impressive range of data manipulation, exploration, and visualization tools. One important tool is pandas. Developed by Wes McKinney in 2008, but really gaining in popularity after 2012, pandas is now an essential library for data analysis in Python. We work with pandas extensively in this book, along with popular packages such as numpy, matplotlib, and scipy.

A key pandas object is the data frame, which represents data as a tabular structure, with rows and columns. In this way, it is similar to the other data stores we discuss in this chapter. However, a pandas data frame also has indexing functionality that makes selecting, combining, and transforming data relatively straightforward, as the recipes in this book will demonstrate.
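As a minimal sketch of what that indexing functionality looks like, the following builds a tiny data frame by hand and then selects and transforms values by label. The frame name, the choice of index, and the temp_rounded column are assumptions made for illustration; the values come from two rows of this chapter's sample file.

```python
import pandas as pd

# Hypothetical data frame; values are taken from two rows of the sample file.
temps = pd.DataFrame(
    {
        "station": ["INDIAN_CANYON", "ARICA"],
        "country": ["United States", "Chile"],
        "temp": [5.27, 18.04],
    },
    index=["USS0010K01S", "CI000085406"],
)

# Select a single value by row label and column name.
arica_temp = temps.loc["CI000085406", "temp"]

# Select the rows that satisfy a condition.
warm = temps[temps["temp"] > 10]

# Transform a column and store the result as a new column.
temps["temp_rounded"] = temps["temp"].round(1)

print(arica_temp)
print(warm)
print(temps)
```

Using the station identifier as the index is only one option; any column with unique values could play the same role.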

Before we can make use of this great functionality, we have to get our data into pandas. Data comes to us in a wide variety of formats: as CSV or Excel files, as tables from SQL databases, from statistical analysis packages such as SPSS, Stata, SAS, or R, from non-tabular sources such as JSON, and from web pages.

We examine tools for importing tabular data in this chapter. Specifically, we cover the following topics:

Importing CSV files
Importing Excel files
Importing data from SQL databases
Importing SPSS, Stata, and SAS data
Importing R data
Persisting tabular data

Technical requirements

The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook

Importing CSV files

The read_csv function of the pandas library reads a file of comma-separated values (CSV) and loads it into memory as a pandas data frame. In this recipe, we read a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical missing data.
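The three issues named above can be sketched on a small in-memory example. This is an illustration under stated assumptions, not the recipe's own code: the replacement column names, the measuredate column, and the third data row (added only to show a missing value) are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV; the third data row is invented to show a gap.
raw = io.StringIO(
    "locationid,year,month,temp\n"
    "USS0010K01S,2000,4,5.27\n"
    "CI000085406,1940,5,18.04\n"
    "XX000000000,2001,7,\n"
)

# Supply our own column names and skip the header row in the file.
landtemps = pd.read_csv(
    raw,
    names=["stationid", "year", "month", "avgtemp"],
    skiprows=1,
)

# Build a date from the year and month, assuming the first day of the month.
landtemps["measuredate"] = pd.to_datetime(
    landtemps[["year", "month"]].assign(day=1)
)

# Drop rows where the critical temperature value is missing.
landtemps = landtemps.dropna(subset=["avgtemp"])

print(landtemps)
```

Restricting dropna to a subset of columns keeps rows that are missing only less important values.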

Raw data is often stored as CSV files. Each row of data ends with a line break (a newline character, sometimes preceded by a carriage return), and a comma separates each data value to delineate columns. A delimiter other than a comma, such as a tab, can also be used. Quotation marks may be placed around values, which helps when the delimiter occurs naturally within a value, as commas sometimes do.
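A short sketch of both cases, using small strings in place of files. The column names and values here are made up for illustration; sep is the read_csv parameter that sets the delimiter.

```python
import io

import pandas as pd

# Tab-delimited text: pass the delimiter with sep.
tab_text = "station\ttemp\nARICA\t18.04\n"
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")

# Comma-delimited text in which one value contains a comma.
# The quotation marks keep "Arica, Chile" together as a single value.
quoted_text = 'place,temp\n"Arica, Chile",18.04\n'
quoted_df = pd.read_csv(io.StringIO(quoted_text))

print(tab_df)
print(quoted_df)
```

Without the quotation marks, the second example would have three values in its data row but only two column names.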

All data in a CSV file is stored as characters, regardless of the logical data type. This is why it is easy to view a CSV file in a text editor, provided it is not too large. The pandas read_csv function makes an educated guess about the data type of each column, but you will need to help it along to ensure that these guesses are on the mark.
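One common way the guess goes wrong is an identifier that looks like a number. The sketch below uses hypothetical data to compare the default inference with an explicit dtype for that column.

```python
import io

import pandas as pd

# Hypothetical data: the id column looks numeric but is really a label.
text = "id,year,temp\n00123,2000,5.27\n00456,1940,18.04\n"

# Left to guess, pandas reads id as an integer and drops the leading zeros.
guessed = pd.read_csv(io.StringIO(text))

# Declaring id as a string preserves the original characters.
typed = pd.read_csv(io.StringIO(text), dtype={"id": str})

print(guessed.dtypes)
print(typed.dtypes)
```

Checking the dtypes attribute right after import is a quick way to spot columns that were not read as intended.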

Getting ready

Create a folder for this chapter and create a new Python script or Jupyter Notebook file in that folder. Create a data subfolder and place the landtempssample.csv file in that subfolder. Alternatively, you could retrieve all of the files from the GitHub repository. Here is a sample from the beginning of the CSV file:

locationid,year,month,temp,latitude,longitude,stnelev,station,countryid,country
USS0010K01S,2000,4,5.27,39.9,-110.75,2773.7,INDIAN_CANYON,US,United States
CI000085406,1940,5,18.04,-18.35,-70.333,58.0,ARICA,CI,Chile
USC00036376,2013,12,6.22,34.3703,-91.1242,61.0,SAINT_CHARLES,U...