Big Data Analysis with Python
eBook - ePub

Big Data Analysis with Python

Combine Spark and Python to unlock the powers of parallel computing and machine learning

Ivan Marin, Ankit Shukla, Sarang VK

  • 276 pages
  • English
  • ePUB (mobile friendly)
  • Available on iOS & Android

About This Book

Get to grips with processing large volumes of data and presenting it as engaging, interactive insights using Spark and Python.

Key Features

  • Get a hands-on, fast-paced introduction to the Python data science stack
  • Explore ways to create useful metrics and statistics from large datasets
  • Create detailed analysis reports with real-world data

Book Description

Processing big data in real time is challenging due to scalability, information inconsistency, and fault tolerance. Big Data Analysis with Python teaches you how to use tools that can control this data avalanche for you. With this book, you'll learn practical techniques to aggregate data into useful dimensions for later analysis, extract statistical measurements, and transform datasets into features for other systems.

The book begins with an introduction to data manipulation in Python using pandas. You'll then get familiar with statistical analysis and plotting techniques. With multiple hands-on activities in store, you'll be able to analyze data distributed across several computers using Dask. As you progress, you'll study how to aggregate data for plots when the entire dataset cannot fit in memory. You'll also explore Hadoop (HDFS and YARN), which will help you tackle larger datasets. The book also covers Spark and explains how it interacts with other tools.

By the end of this book, you'll be able to bootstrap your own Python environment, process large files, and manipulate data to generate statistics, metrics, and graphs.

What you will learn

  • Use Python to read and transform data into different formats
  • Generate basic statistics and metrics using data on disk
  • Work with computing tasks distributed over a cluster
  • Convert data from various sources into storage or querying formats
  • Prepare data for statistical analysis, visualization, and machine learning
  • Present data in the form of effective visuals

Who this book is for

Big Data Analysis with Python is designed for Python developers, data analysts, and data scientists who want to get hands-on with methods to control data and transform it into impactful insights. Basic knowledge of statistical measurements and relational databases will help you to understand various concepts explained in this book.

Information

Year: 2019
ISBN: 9781789950731

Chapter 1

The Python Data Science Stack

Learning Objectives

We will start our journey by understanding the power of Python to manipulate and visualize data, creating useful analyses.
By the end of this chapter, you will be able to:
  • Use all components of the Python data science stack
  • Manipulate data using pandas DataFrames
  • Create simple plots using pandas and Matplotlib
In this chapter, we will learn how to use NumPy, pandas, Matplotlib, IPython, and Jupyter Notebook. Later in the chapter, we will explore how to deploy virtual environments with virtualenv and pyenv, and then we will create basic visualizations using the Matplotlib and Seaborn libraries.

Introduction

The Python data science stack is an informal name for a set of libraries used together to tackle data science problems. There is no consensus on which libraries are part of this list; it usually depends on the data scientist and the problem to be solved. We will present the libraries most commonly used together and explain how they can be used.
In this chapter, we will learn how to manipulate tabular data with the Python data science stack. The Python data science stack is the first stepping stone to manipulate large datasets, although these libraries are not commonly used for big data themselves. The ideas and the methods that are used here will be very helpful when we get to large datasets.

Python Libraries and Packages

One of the main reasons Python is a powerful programming language is the libraries and packages that come with it. There are more than 130,000 packages on the Python Package Index (PyPI) and counting! Let's explore some of the libraries and packages that are part of the data science stack.
The components of the data science stack are as follows:
  • NumPy: A numerical manipulation package
  • pandas: A data manipulation and analysis library
  • SciPy library: A collection of mathematical algorithms built on top of NumPy
  • Matplotlib: A plotting and graph library
  • IPython: An interactive Python shell
  • Jupyter notebook: A web document application for interactive computing
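As a first taste of how these pieces fit together, here is a minimal sketch, assuming NumPy, pandas, and Matplotlib are installed (the column name and the generated values are arbitrary examples): NumPy generates the raw numbers, pandas organizes them, and Matplotlib draws the picture.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # NumPy: generate 100 random measurements from a normal distribution
    values = np.random.normal(loc=10.0, scale=2.0, size=100)

    # pandas: wrap the array in a DataFrame and compute summary statistics
    df = pd.DataFrame({"measurement": values})
    print(df.describe())

    # Matplotlib (via the pandas plotting interface): visualize the distribution
    df["measurement"].plot(kind="hist", bins=20, title="Measurement distribution")
    plt.show()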
The combination of these libraries forms a powerful tool set for handling data manipulation and analysis. We will go through each of the libraries, explore their functionalities, and show how they work together. Let's start with the interpreters.

IPython: A Powerful Interactive Shell

The IPython shell (https://ipython.org/) is an interactive Python command interpreter that can handle several languages. It allows us to test ideas quickly rather than going through creating files and running them. Most Python installations have a bundled command interpreter, usually called the shell, where you can execute commands iteratively. Although it's handy, this standard Python shell is a bit cumbersome to use. IPython has more features:
  • Input history that is available between sessions, so when you restart your shell, the previous commands that you typed can be reused.
  • Using Tab completion for commands and variables, you can type the first letters of a Python command, function, or variable and IPython will autocomplete it.
  • Magic commands that extend the functionality of the shell, such as the autoreload extension, which reloads imported modules after they change on disk without having to restart IPython (see the example after this list).
  • Syntax highlighting.
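For example, here is a short sketch of a few commonly used magic commands in an IPython session (the prompt numbers are illustrative, and the exact set of magics available depends on your installation):
    In [1]: %timeit sum(range(1000))      # time a small expression over many runs

    In [2]: %load_ext autoreload          # load the autoreload extension
    In [3]: %autoreload 2                 # reload changed modules before running code

    In [4]: %history -n 1-3               # show the first three inputs with their numbers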

Exercise 1: Interacting with the Python Shell Using IPython Commands

Getting started with the IPython shell is simple. Follow these steps to interact with it:
  1. To start the IPython shell, type the ipython command in the console:
    > ipython
    In [1]:
    The IPython shell is now ready and waiting for further commands. First, let's do a simple exercise to solve a sorting problem with one of the basic sorting methods, called straight insertion.
  2. In the IPython shell, copy-paste the following code:
    import numpy as np
    vec = np.random.randint(0, 100, size=5)
    print(vec)
    Now, the output for the randomly generated numbers will be similar to the following:
    [23 66 12 54 98]
  3. Use the following logic to print the elements of the vec array in ascending order:
    for j in np.arange(1, vec.size):
        v = vec[j]
        i = j
        # Shift larger elements one position to the right
        while i > 0 and vec[i-1] > v:
            vec[i] = vec[i-1]
            i = i - 1
        # Insert the saved value at its correct position
        vec[i] = v
    Use the print(vec) command to print the output on the console:
    [12 23 54 66 98]
  4. Now modify the code: instead of creating an array of 5 elements, change the size parameter so that it creates an array with 20 elements, using the up arrow to recall and edit the pasted code. After changing the relevant section, use the down arrow to move to the end of the code and press Enter to execute it, as sketched below.
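    The modified session might look roughly like the following (the prompt numbers and the random values are illustrative and will differ on your machine):
    In [3]: vec = np.random.randint(0, 100, size=20)
    In [4]: print(vec)
    [71  8 45 92 13  4 67 30 88 55 21 76  9 38 99 62  3 50 84 17]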
Notice the number on the left, indicating the instruction number; this number always increases. We assigned a value to a variable and executed an operation on that variable, getting the result interactively. We will use IPython in the following sections.

The Jupyter Notebook

The Jupyter notebook (https://jupyter.org/) started as part of IPython, but was split into a separate project in version 4 and has been extended since. The notebook concept extends the interactive shell model, creating documents that can run code, show documentation, and present results such as graphs and images.
Jupyter is a web application, so it runs directly in your web browser without requiring separate client software, which also enables it to be used across the internet. Jupyter can use IPython as a kernel for running Python, but it also supports more than 40 kernels contributed by the developer community.
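If Jupyter is installed (one common way is pip install jupyter), the notebook server is typically started from the console, and the interface then opens in your default web browser:
    > jupyter notebook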

Note

A kernel, in Jupyter parlance, is a computation engine that runs the code that is typed into a code cell in a notebook. For example, the IPython kernel executes Python code in a notebook. There are kernels for other languages, such as R and Julia.
Jupyter has become a de facto platform for performing data science work, from beginners to power users, and from small enterprises to large ones and even academia. Its popularity has increased tremendously in the last few years. A Jupyter notebook contains both the input and the output of the code you run on it. It allows text, images, mathematical formulas, and more, and is an excellent platform for developing code and communicating results. Because of its web format, notebooks can be shared over the internet. Jupyter also supports the Markdown markup language and renders Markdown text as rich text, with formatting and other features.
As we've seen before, each notebook has a kernel. This kernel is the interpreter that will execute the code in the cells. The basic unit of a notebook is called a cell. A cell is a container for either code or text. W...
