Mastering pandas
eBook - ePub

Mastering pandas

A complete guide to pandas, from installation to advanced data analysis techniques, 2nd Edition

  1. 674 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Perform advanced data manipulation tasks using pandas and become an expert data analyst.

Key Features

  • Manipulate and analyze your data expertly using the power of pandas
  • Work with missing data and time series data and become a true pandas expert
  • Includes expert tips and techniques on making your data analysis tasks easier

Book Description

pandas is a popular Python library used by data scientists and analysts worldwide to manipulate and analyze their data. This book presents useful data manipulation techniques in pandas to perform complex data analysis in various domains.

An update to our highly successful previous edition, with new features, examples, and updated code, this book is an in-depth guide to getting the most out of pandas for data analysis. Designed for intermediate users and seasoned practitioners alike, it teaches advanced data manipulation techniques, such as multi-indexing, modifying data structures, and sampling your data, which allow for powerful analysis and help you gain accurate insights from your data. With the help of this book, you will apply pandas to different domains, such as Bayesian statistics, predictive analytics, and time series analysis, using an example-based approach. You will also learn how to prepare powerful, interactive business reports in pandas using the Jupyter notebook.

By the end of this book, you will be able to perform efficient data analysis on complex data using pandas, and will be well on your way to becoming an expert data analyst or data scientist.

What you will learn

  • Speed up your data analysis by importing data into pandas
  • Keep relevant data points by selecting subsets of your data
  • Create a high-quality dataset by cleaning data and fixing missing values
  • Compute actionable analytics with grouping and aggregation in pandas
  • Master time series data analysis in pandas
  • Make powerful reports in pandas using Jupyter notebooks

Who this book is for

This book is for data scientists, analysts, and Python developers who wish to explore advanced data analysis and scientific computing techniques using pandas. A fundamental understanding of Python programming and familiarity with basic data analysis concepts are all you need to get started with this book.

Information

Year: 2019
Edition: 2
eBook ISBN: 9781789343359

Section 1: Overview of Data Analysis and pandas

In this section, we give you a quick overview of the concepts of the data analysis process and where pandas fits into that picture. You will also learn how to install and set up the pandas library, along with the other supporting libraries and environments required to build an enterprise-grade data analysis pipeline.
This section comprises the following chapters:
  • Chapter 1, Introduction to pandas and Data Analysis
  • Chapter 2, Installation of pandas and Supporting Software

Introduction to pandas and Data Analysis

We start the book and this chapter by discussing the contemporary data analytics landscape and how pandas fits into that landscape. pandas is the go-to tool for data scientists for data pre-processing tasks. We will learn about the technicalities of pandas in the later chapters. This chapter covers the context, origin, history, market share, and current standing of pandas.
This chapter is divided into the following sections:
  • Motivation for data analysis
  • How Python and pandas can be used for data analysis
  • Description of the pandas library
  • Benefits of using pandas

Motivation for data analysis

In this section, we discuss the trends that are making data analysis an increasingly important field in today's fast-moving technological landscape.

We live in a big data world

The term big data has become one of the hottest technology buzzwords in recent years. Big data now features regularly in media outlets, and big data start-ups have been attracting growing amounts of venture capital. A good example in the area of retail is Target Corporation, which has invested substantially in big data and is now able to identify potential customers by analyzing people's online shopping habits; refer to a related article at http://nyti.ms/19LT8ic.
Loosely speaking, big data refers to the phenomenon wherein the amount of data exceeds the capability of the recipients of the data to process it. Here is an article on big data that sums it up nicely: https://www.oracle.com/in/big-data/guide/what-is-big-data.html.

The four V's of big data

A good way to start thinking about the complexities of big data is through its four dimensions, known as the four V's of big data. This model was first introduced as the three V's by Gartner analyst Doug Laney in 2001. The three V's stood for Volume, Velocity, and Variety, and the fourth V, Veracity, was added later by IBM. Gartner's official definition states the following:
"Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
Laney, Douglas. "The Importance of 'Big Data': A Definition", Gartner

Volume of big data

The volume of data in the big data age is simply mind-boggling. According to IBM, by 2020, the total amount of data on the planet will have ballooned to 40 zettabytes. You heard that right! In decimal (SI) units, 40 zettabytes is 40 trillion gigabytes; the often-quoted figure of 43 trillion corresponds approximately to counting in binary multiples of 1,024. For more information on this, refer to the Wikipedia page on the zettabyte: http://en.wikipedia.org/wiki/Zettabyte.
To get a handle on how much data this is, let me refer to an EMC press release published in 2010, which stated what 1 zettabyte was approximately equal to:
"The digital information created by every man, woman and child on Earth 'Tweeting' continuously for 100 years" or "75 billion fully-loaded 16 GB Apple iPads, which would fill the entire area of Wembley Stadium to the brim 41 times, the Mont Blanc Tunnel 84 times, CERN's Large Hadron Collider tunnel 151 times, Beijing National Stadium 15.5 times or the Taipei 101 Tower 23 times..."
EMC study projects 45× data growth by 2020
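The unit arithmetic behind these figures is easy to check. The following is a minimal sketch in plain Python; the function name zettabytes_to_gigabytes is our own and not part of any library, and the only inputs are the numbers quoted above. It shows how the gigabyte count for 40 zettabytes depends on whether decimal or binary multiples are used, and what the 75 billion 16 GB tablets in the EMC comparison add up to.

```python
# Unit sizes in bytes: decimal (SI) prefixes versus binary (IEC) prefixes.
GB = 10**9
ZB = 10**21
GIB = 2**30
ZIB = 2**70


def zettabytes_to_gigabytes(zb, binary=False):
    """Convert zettabytes to gigabytes using decimal or binary multiples.

    Hypothetical helper written for this illustration only.
    """
    if binary:
        return zb * ZIB / GIB
    return zb * ZB / GB


decimal_gb = zettabytes_to_gigabytes(40)
binary_gib = zettabytes_to_gigabytes(40, binary=True)
print(f"40 ZB  = {decimal_gb / 1e12:.2f} trillion GB (decimal)")
print(f"40 ZiB = {binary_gib / 1e12:.2f} trillion GiB (binary)")

# The EMC comparison: 75 billion tablets at 16 GB each, in decimal units.
ipad_bytes = 75 * 10**9 * 16 * GB
print(f"75 billion x 16 GB = {ipad_bytes / ZB:.1f} ZB")
```

Under these assumptions, the tablet comparison works out to about 1.2 zettabytes, which is consistent with the "approximately 1 zettabyte" framing of the press release.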
The growth rate of data has been fuelled largely by a few factors, such as the following:
  • The rapid growth of the internet.
  • The conversion from analog to digital media, coupled with an increased ability to capture and store data, which in turn has been made possible with cheaper and better storage technology. There has been a proliferation of digital data input devices, such as cameras and wearables, and the cost of huge data storage has fallen rapidly. Amazon Web Services is a prime example of the trend toward much cheaper storage.
  • The internetification of devices, or rather the Internet of Things, is the phenomenon wherein common household devices, such as our refrigerators and cars, will be connected to the internet. This phenomenon will only accelerate the above trend.

Velocity of big data

From a purely technological point of view, velocity refers to the throughput of big data, or how fast the data is coming in and being processed. This has ramifications for how fast the recipient of the data needs to process it to keep up. Real-time analytics is one attempt to handle this characteristic. Tools that can enable this include Amazon Web Services Elastic MapReduce.
At a more macro level, the velocity of data can also be regarded as the speed at which data and information can now be transferred and processed, which is faster and over greater distances than ever before.
The proliferation of high-speed data and communication networks, coupled with the advent of cell phones, tablets, and other connected devices, is a primary factor driving information velocity. Some measures of velocity include the number of tweets per second and the number of emails per minute.
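To make a measure such as "tweets per second" concrete, here is a small sketch that counts events per second from a list of arrival times. It uses only the Python standard library, and the timestamps are invented for illustration; they are not real traffic data.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical event log: arrival times of messages (for example, tweets).
start = datetime(2019, 1, 1, 12, 0, 0)
offsets_ms = [0, 150, 400, 900, 1100, 1250, 1300, 1900, 2500, 2950]
arrivals = [start + timedelta(milliseconds=ms) for ms in offsets_ms]

# Bucket each arrival into the whole second in which it occurred.
per_second = Counter(ts.replace(microsecond=0) for ts in arrivals)

for second in sorted(per_second):
    print(second.time(), per_second[second])

# Average throughput over the observed window, and the busiest second.
window = (max(arrivals) - min(arrivals)).total_seconds()
average_rate = len(arrivals) / window
peak_rate = max(per_second.values())
print(f"average: {average_rate:.2f} events/s, peak: {peak_rate} events/s")
```

The gap between the average and the peak rate is what matters in practice: a system that must keep up with incoming data has to be sized for the bursts, not just the mean.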

Variety of big data

The variety of big data comes from having a multiplicity of data sources that generate data and the different formats of data that are produced.
This results in a technological challenge for the recipients of the data who have to process it. Digital cameras, sensors, the web, cell phones, and so on are some of the data generators that produce data in differing formats, and the challenge is being able to handle all these formats and extract meaningful information from the data. The ever-changing nature of data formats in the big data era has led to a revolution in the database technology industry with the rise of NoSQL databases, which handle what is known as unstructured data, that is, data whose format is fluid or constantly changing.
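The following sketch illustrates the variety challenge on a very small scale: two sources report the same kind of reading, one as CSV and one as nested JSON, and each needs its own adapter before the data can be analyzed together. The payloads, field names, and the from_csv and from_json helpers are all hypothetical and exist only for this illustration; only the Python standard library is used.

```python
import csv
import io
import json

# Hypothetical payloads from two sources describing the same kind of reading.
csv_payload = "device_id,temp_c\nsensor-1,21.5\nsensor-2,19.0\n"
json_payload = '[{"id": "phone-7", "temperature": {"celsius": 22.25}}]'


def from_csv(text):
    """Yield normalized records from a flat CSV payload."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"device": row["device_id"], "temp_c": float(row["temp_c"])}


def from_json(text):
    """Yield normalized records from a nested JSON payload."""
    for item in json.loads(text):
        yield {"device": item["id"], "temp_c": item["temperature"]["celsius"]}


# Once every source is mapped to one schema, the data can be analyzed together.
records = list(from_csv(csv_payload)) + list(from_json(json_payload))
mean_temp = sum(r["temp_c"] for r in records) / len(records)
print(records)
print(f"mean temperature: {mean_temp:.2f} C")
```

Every new format needs another adapter of this kind, which is why tools that read many formats into one common tabular structure are so useful for analysis.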

Veracity of big data

The fourth characteristic of big data—veracity, which was added later—refers to the need to validate or confirm the correctness of the data or the fact that the da...

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. About Packt
  4. Contributors
  5. Preface
  6. Section 1: Overview of Data Analysis and pandas
  7. Introduction to pandas and Data Analysis
  8. Installation of pandas and Supporting Software
  9. Section 2: Data Structures and I/O in pandas
  10. Using NumPy and Data Structures with pandas
  11. I/Os of Different Data Formats with pandas
  12. Section 3: Mastering Different Data Operations in pandas
  13. Indexing and Selecting in pandas
  14. Grouping, Merging, and Reshaping Data in pandas
  15. Special Data Operations in pandas
  16. Time Series and Plotting Using Matplotlib
  17. Section 4: Going a Step Beyond with pandas
  18. Making Powerful Reports In Jupyter Using pandas
  19. A Tour of Statistics with pandas and NumPy
  20. A Brief Tour of Bayesian Statistics and Maximum Likelihood Estimates
  21. Data Case Studies Using pandas
  22. The pandas Library Architecture
  23. pandas Compared with Other Tools
  24. A Brief Tour of Machine Learning
  25. Other Books You May Enjoy

Frequently asked questions

Yes, you can access Mastering pandas by Ashish Kumar in PDF and/or ePUB format, as well as other popular books in Computer Science & Data Modelling & Design. We have over 1.5 million books available in our catalogue for you to explore.