Hands-On Data Analysis with Pandas

Efficiently perform data collection, wrangling, analysis, and visualization using Python

  1. 716 pages
  2. English

About this book

Get to grips with pandas—a versatile and high-performance Python library for data manipulation, analysis, and discovery

Key Features

  • Perform efficient data analysis and manipulation tasks using pandas
  • Apply pandas to different real-world domains using step-by-step demonstrations
  • Get accustomed to using pandas as an effective data exploration tool

Book Description

Data analysis has become a necessary skill in a variety of positions where knowing how to work with data and extract insights can generate significant value.

Hands-On Data Analysis with Pandas will show you how to analyze your data, get started with machine learning, and work effectively with Python libraries often used for data science, such as pandas, NumPy, matplotlib, seaborn, and scikit-learn. Using real-world datasets, you will learn how to use the powerful pandas library to perform data wrangling to reshape, clean, and aggregate your data. Then, you will learn how to conduct exploratory data analysis by calculating summary statistics and visualizing the data to find patterns. In the concluding chapters, you will explore some applications of anomaly detection, regression, clustering, and classification, using scikit-learn, to make predictions based on past data.

By the end of this book, you will be equipped with the skills you need to use pandas to ensure the veracity of your data, visualize it for effective decision-making, and reliably reproduce analyses across multiple datasets.

What you will learn

  • Understand how data analysts and scientists gather and analyze data
  • Perform data analysis and data wrangling in Python
  • Combine, group, and aggregate data from multiple sources
  • Create data visualizations with pandas, matplotlib, and seaborn
  • Apply machine learning (ML) algorithms to identify patterns and make predictions
  • Use Python data science libraries to analyze real-world datasets
  • Use pandas to solve common data representation and analysis problems
  • Build Python scripts, modules, and packages for reusable analysis code

Who this book is for

This book is for data analysts, data science beginners, and Python developers who want to explore each stage of data analysis and scientific computing using a wide range of datasets. You will also find this book useful if you are a data scientist who is looking to use pandas in machine learning workflows. A working knowledge of the Python programming language will be beneficial.

Hands-On Data Analysis with Pandas is written by Stefanie Molin.

Section 1: Getting Started with Pandas

Our journey begins with an introduction to data analysis and statistics, which will lay a strong foundation for the concepts we will cover throughout the book. Then, we will set up our Python data science environment, which contains everything we will need to work through the examples, and get started with learning the basics of pandas.
The following chapters are included in this section:
  • Chapter 1, Introduction to Data Analysis
  • Chapter 2, Working with Pandas DataFrames

Introduction to Data Analysis

Before we can begin our hands-on introduction to data analysis with pandas, we need to learn about the fundamentals of data analysis. Anyone who has looked at the documentation for a software library knows how overwhelming it can be without a clear idea of what to look for. Therefore, it is essential that we master not only the coding aspect but also the thought process and workflow required to analyze data, which will prove the most useful in augmenting our skill set in the future.
Much like the scientific method, data science has some common workflows that we can follow when we want to conduct an analysis and present the results. The backbone of this process is statistics, which gives us ways to describe our data, make predictions, and also draw conclusions about it. Since prior knowledge of statistics is not a prerequisite, this chapter will give us exposure to the statistical concepts we will use throughout this book, as well as areas for further exploration.
After covering the fundamentals, we will get our Python environment set up for the remainder of this book. Python is a powerful language, and its uses go well beyond data science: building web applications, developing other software, and scraping the web, to name a few. In order to work effectively across projects, we need to learn how to make virtual environments, which isolate each project's dependencies. Finally, we will learn how to work with Jupyter Notebooks in order to follow along with the text.
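As a rough illustration of what isolating each project's dependencies means, the following sketch uses Python's built-in venv module to give two projects their own environments. The project names, the .venv folder name, and the create_project_env helper are assumptions made for this example; they are not taken from the book's setup instructions, which come later in the chapter.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(project_dir: Path) -> Path:
    """Create an isolated virtual environment inside project_dir.

    The `.venv` folder name is a common convention, assumed here for
    illustration only.
    """
    env_dir = project_dir / ".venv"
    # with_pip=False keeps the sketch fast and usable offline; pass
    # with_pip=True to be able to install packages into the environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Two hypothetical projects, each with its own environment.
        for name in ("project_a", "project_b"):
            project = Path(tmp) / name
            project.mkdir()
            env = create_project_env(project)
            # Every environment records its settings in pyvenv.cfg.
            print(name, (env / "pyvenv.cfg").exists())
```

Because each environment lives in its own folder, packages installed into one are not visible to the other, so two projects can depend on different versions of the same library.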
The following topics will be covered in this chapter:
  • The core components of conducting data analysis
  • Statistical foundations
  • How to set up a Python data science environment

Chapter materials

All the files for this book are on GitHub at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas. While having a GitHub account isn't necessary to work through this book, it is a good idea to create one, as it will serve as a portfolio for any data/coding projects. In addition, working with Git will provide a version control system and make collaboration easy.
Check out this article to learn some Git basics: https://www.freecodecamp.org/news/learn-the-basics-of-git-in-under-10-minutes-da548267cc91/.
In order to get a local copy of the files, we have a few options (ordered from least useful to most useful):
  • Download the ZIP file and extract the files locally
  • Clone the repository without forking it
  • Fork the repository and then clone it
This book includes exercises for every chapter; therefore, for those who want to keep a copy of their solutions along with the original content on GitHub, it is highly recommended to fork the repository and clone the forked version. When we fork a repository, GitHub will make a repository under our own profile with the latest version of the original. Then, whenever we make changes to our version, we can push the changes back up. Note that if we simply clone the original repository without forking it, we don't get this benefit, because we typically cannot push changes to a repository we do not own.
The relevant buttons for initiating this process are circled in the following screenshot:
The cloning process will copy the files into a folder called Hands-On-Data-Analysis-with-Pandas in the current working directory. To make a folder to put this repository in, we can run mkdir my_folder && cd my_folder. This will create a new folder (directory) called my_folder and then change the current directory to that folder, after which we can clone the repository. We can chain these two commands (or any number of commands) together by adding && between them. The && can be read as "and then": the next command runs only if the previous one succeeds.
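The "and then, provided the first command succeeds" behavior of && can be modeled in a few lines of Python. This is only a sketch of the chaining logic, not of how a shell implements it: run_chain is a hypothetical helper, and the two stand-in commands simply call the current Python interpreter so that the example behaves the same on any operating system.

```python
import subprocess
import sys


def run_chain(commands):
    """Run commands in order, stopping at the first failure (like `&&`).

    Returns the exit code of the last command that ran (0 for an empty chain).
    """
    returncode = 0
    for command in commands:
        returncode = subprocess.run(command).returncode
        if returncode != 0:
            break  # a failure short-circuits the rest of the chain
    return returncode


# Stand-in commands built on the current Python interpreter.
succeed = [sys.executable, "-c", "print('step ran')"]
fail = [sys.executable, "-c", "import sys; sys.exit(1)"]

if __name__ == "__main__":
    print(run_chain([succeed, succeed]))  # both steps run
    print(run_chain([fail, succeed]))     # the second step is skipped
```

Note that cd is a shell built-in that changes the directory of the shell itself, so the sketch models only the success-gated ordering, not the directory change.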
This repository has folders for each chapter. This chapter's materials can be found at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/tree/master/ch_01. While the bulk of this chapter doesn't involve any coding, feel free to follow along in the introduction_to_data_analysis.ipynb notebook on the GitHub website until we set up our environment toward the end of the chapter. After we do so, we will use the check_your_environment.ipynb notebook to get familiar with Jupyter Notebooks and to run some checks to make sure that everything is set up properly for the rest of this book.
Since the code that's used to generate the content in these notebooks is not the main focus of this chapter, the majority of it has been separated into the check_environment.py and stats_viz.py files. If you choose to inspect these files, don't be overwhelmed; everything that's relevant to data science will be covered in this book.
Every chapter includes exercises; however, for this chapter only, there is an exercises.ipynb notebook, with some code to generate some starting data. Knowledge of basic Python will be necessary to complete these exercises. For those who would like to review the basics, the official Python tutorial is a good place to start: https://docs.python.org/3/tutorial/index.html.

Fundamentals of data analysis

Data analysis is a highly iterative process involving collection, preparation (wrangling), exploratory data analysis (EDA), and drawing conclusions. During an analysis, we will frequently revisit each of these steps. The following diagram depicts a generalized workflow:
In practice, this process is heavily skewed toward the data preparation side. According to a survey reported by Forbes, data preparation is the part of the job that data scientists enjoy the least, yet it accounts for about 80% of their work (https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#419ce7b36f63). This data preparation step is where pandas really shines.
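To make the four stages named above concrete, here is a toy pass through collection, preparation, EDA, and conclusions. It is a minimal sketch under stated assumptions: the temperature readings are invented for illustration, the function names are hypothetical, and only the Python standard library is used so that the example runs before any environment setup. The book itself performs these steps with pandas.

```python
import csv
import io
import statistics

# Hypothetical raw data, invented for illustration: daily temperature
# readings in which one value is missing and one has stray whitespace.
RAW_CSV = """date,temp_c
2019-01-01,3.5
2019-01-02,
2019-01-03, 4.5
2019-01-04,7.0
"""


def collect(source):
    """Collection: read the raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(source)))


def wrangle(rows):
    """Preparation: drop rows with missing readings and convert types."""
    clean = []
    for row in rows:
        value = row["temp_c"].strip()
        if value:
            clean.append({"date": row["date"], "temp_c": float(value)})
    return clean


def explore(rows):
    """EDA: compute summary statistics for the cleaned readings."""
    temps = [row["temp_c"] for row in rows]
    return {
        "count": len(temps),
        "mean": statistics.mean(temps),
        "median": statistics.median(temps),
    }


def conclude(summary):
    """Conclusions: turn the summary into a statement we can report."""
    return f"{summary['count']} valid readings, mean {summary['mean']:.1f} C"


if __name__ == "__main__":
    print(conclude(explore(wrangle(collect(RAW_CSV)))))
```

Even in this tiny example, most of the code deals with cleaning the data rather than analyzing it. In a real analysis, a surprising summary would typically send us back to the collection or preparation stage, which is what makes the process iterative.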

Data collection

Data collection is the natural first step for any data analysis—we can't analyze data we don't have. In reality, our analysis can begin even before we have the data: when we decide what we want to investigate or analyze, we have to think of what kind of data we can collect that will be u...

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. Dedication
  4. About Packt
  5. Foreword
  6. Contributors
  7. Preface
  8. Section 1: Getting Started with Pandas
  9. Introduction to Data Analysis
  10. Working with Pandas DataFrames
  11. Section 2: Using Pandas for Data Analysis
  12. Data Wrangling with Pandas
  13. Aggregating Pandas DataFrames
  14. Visualizing Data with Pandas and Matplotlib
  15. Plotting with Seaborn and Customization Techniques
  16. Section 3: Applications - Real-World Analyses Using Pandas
  17. Financial Analysis - Bitcoin and the Stock Market
  18. Rule-Based Anomaly Detection
  19. Section 4: Introduction to Machine Learning with Scikit-Learn
  20. Getting Started with Machine Learning in Python
  21. Making Better Predictions - Optimizing Models
  22. Machine Learning Anomaly Detection
  23. Section 5: Additional Resources
  24. The Road Ahead
  25. Solutions
  26. Appendix
  27. Other Books You May Enjoy