Big Data Analysis with Python
eBook - ePub

Big Data Analysis with Python

Combine Spark and Python to unlock the powers of parallel computing and machine learning

Ivan Marin, Ankit Shukla, Sarang VK

  1. 276 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS and Android

About this book

Get to grips with processing large volumes of data and presenting it as engaging, interactive insights using Spark and Python.

Key Features

  • Get a hands-on, fast-paced introduction to the Python data science stack
  • Explore ways to create useful metrics and statistics from large datasets
  • Create detailed analysis reports with real-world data

Book Description

Processing big data in real time is challenging due to scalability, information inconsistency, and fault tolerance. Big Data Analysis with Python teaches you how to use tools that can control this data avalanche for you. With this book, you'll learn practical techniques to aggregate data into useful dimensions for posterior analysis, extract statistical measurements, and transform datasets into features for other systems.

The book begins with an introduction to data manipulation in Python using pandas. You'll then get familiar with statistical analysis and plotting techniques. With multiple hands-on activities in store, you'll be able to analyze data that is distributed on several computers by using Dask. As you progress, you'll study how to aggregate data for plots when the entire data cannot be accommodated in memory. You'll also explore Hadoop (HDFS and YARN), which will help you tackle larger datasets. The book also covers Spark and explains how it interacts with other tools.

By the end of this book, you'll be able to bootstrap your own Python environment, process large files, and manipulate data to generate statistics, metrics, and graphs.

What you will learn

  • Use Python to read and transform data into different formats
  • Generate basic statistics and metrics using data on disk
  • Work with computing tasks distributed over a cluster
  • Convert data from various sources into storage or querying formats
  • Prepare data for statistical analysis, visualization, and machine learning
  • Present data in the form of effective visuals

Who this book is for

Big Data Analysis with Python is designed for Python developers, data analysts, and data scientists who want to get hands-on with methods to control data and transform it into impactful insights. Basic knowledge of statistical measurements and relational databases will help you to understand various concepts explained in this book.

Information

Year
2019
ISBN
9781789950731

Chapter 1

The Python Data Science Stack

Learning Objectives

We will start our journey by understanding the power of Python to manipulate and visualize data and to produce useful analyses.
By the end of this chapter, you will be able to:
  • Use all components of the Python data science stack
  • Manipulate data using pandas DataFrames
  • Create simple plots using pandas and Matplotlib
In this chapter, we will learn how to use NumPy, pandas, Matplotlib, IPython, and the Jupyter notebook. Later in the chapter, we will explore how to set up environments with virtualenv and pyenv, and then plot basic visualizations using the Matplotlib and Seaborn libraries.

Introduction

The Python data science stack is an informal name for a set of libraries used together to tackle data science problems. There is no consensus on which libraries are part of this list; it usually depends on the data scientist and the problem to be solved. We will present the libraries most commonly used together and explain how they can be used.
In this chapter, we will learn how to manipulate tabular data with the Python data science stack. The Python data science stack is the first stepping stone to manipulate large datasets, although these libraries are not commonly used for big data themselves. The ideas and the methods that are used here will be very helpful when we get to large datasets.

Python Libraries and Packages

One of the main reasons Python is a powerful programming language is the libraries and packages that come with it. There are more than 130,000 packages on the Python Package Index (PyPI) and counting! Let's explore some of the libraries and packages that are part of the data science stack.
The components of the data science stack are as follows:
  • NumPy: A numerical manipulation package
  • pandas: A data manipulation and analysis library
  • SciPy library: A collection of mathematical algorithms built on top of NumPy
  • Matplotlib: A plotting and graph library
  • IPython: An interactive Python shell
  • Jupyter notebook: A web document application for interactive computing
The combination of these libraries forms a powerful tool set for handling data manipulation and analysis. We will go through each of the libraries, explore their functionalities, and show how they work together. Let's start with the interpreters.
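Before turning to the interpreters, here is a minimal sketch of how two of these libraries cooperate: NumPy generates the numbers and pandas wraps them in a DataFrame and summarizes them. The column names and the simulated height values are assumptions made up for illustration; they do not come from the book's datasets.

```python
import numpy as np
import pandas as pd

# Simulated data (an assumption for illustration): 100 heights in centimetres.
rng = np.random.default_rng(seed=42)
heights = rng.normal(loc=170.0, scale=10.0, size=100)

# pandas stores the NumPy array as a DataFrame column.
df = pd.DataFrame({"height_cm": heights})

# Column arithmetic is vectorized by NumPy under the hood.
df["height_m"] = df["height_cm"] / 100.0

# describe() returns count, mean, std, min, quartiles, and max per column.
summary = df.describe()
print(summary)
```

The same DataFrame could then be passed to Matplotlib for plotting, which is covered later in the chapter.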

IPython: A Powerful Interactive Shell

The IPython shell (https://ipython.org/) is an interactive Python command interpreter that can handle several languages. It allows us to test ideas quickly rather than going through creating files and running them. Most Python installations have a bundled command interpreter, usually called the shell, where you can execute commands interactively. Although it's handy, this standard Python shell is a bit cumbersome to use. IPython has more features:
  • Input history that is available between sessions, so when you restart your shell, the previous commands that you typed can be reused.
  • Tab completion for commands and variables: you can type the first letters of a Python command, function, or variable and IPython will autocomplete it.
  • Magic commands that extend the functionality of the shell. For example, a magic command can load an extension that reloads imported modules after they are changed on disk, without having to restart IPython.
  • Syntax highlighting.
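The module-reloading feature mentioned above is provided in IPython by the autoreload extension, enabled with the magic commands %load_ext autoreload and %autoreload 2. The following is only a plain Python sketch of the underlying idea, using importlib.reload from the standard library; it is not how IPython implements the extension. The module name demo_mod and its answer function are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Avoid cached bytecode so the edited source is always recompiled.
sys.dont_write_bytecode = True

# Write a tiny hypothetical module to a temporary directory.
tmp_dir = Path(tempfile.mkdtemp())
module_path = tmp_dir / "demo_mod.py"
module_path.write_text("def answer():\n    return 1\n")

# Make the directory importable and import the module.
sys.path.insert(0, str(tmp_dir))
import demo_mod

first = demo_mod.answer()

# Simulate editing the file on disk while the session is still running.
module_path.write_text("def answer():\n    return 42\n")

# Reload the module in place instead of restarting the interpreter.
importlib.invalidate_caches()
importlib.reload(demo_mod)
second = demo_mod.answer()

print(first, second)
```

With autoreload active, IPython performs the reload step automatically before running each command, so edited modules are picked up without any explicit call.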

Exercise 1: Interacting with the Python Shell Using the IPython Commands

Getting started with the Python shell is simple. Let's follow these steps to interact with the IPython shell:
  1. To start the Python shell, type the ipython command in the console:
    > ipython
    In [1]:
    The IPython shell is now ready and waiting for further commands. First, let's do a simple exercise to solve a sorting problem with one of the basic sorting methods, called straight insertion.
  2. In the IPython shell, copy-paste the following code:
    import numpy as np
    vec = np.random.randint(0, 100, size=5)
    print(vec)
    Now, the output for the randomly generated numbers will be similar to the following:
    [23 66 12 54 98]
  3. Use the following logic to sort the elements of the vec array in ascending order:
    for j in np.arange(1, vec.size):
        v = vec[j]
        i = j
        while i > 0 and vec[i-1] > v:
            vec[i] = vec[i-1]
            i = i - 1
        vec[i] = v
    Use the print(vec) command to print the output on the console:
    [12 23 54 66 98]
  4. Now modify the code. Instead of creating an array of 5 elements, change its parameters so it creates an array with 20 elements, using the up arrow to edit the pasted code. After changing the relevant section, use the down arrow to move to the end of the code and press Enter to execute it.
Notice the number on the left of each prompt, indicating the instruction number. This number always increases. We assigned a value to a variable and executed an operation on that variable, getting the result interactively. We will use IPython in the following sections.
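The straight insertion logic from the exercise can also be wrapped in a function and checked against NumPy's built-in sort. This is a minimal sketch: the function name straight_insertion_sort is hypothetical, and a seeded generator is assumed so that the run is reproducible.

```python
import numpy as np


def straight_insertion_sort(values):
    """Return a sorted copy of values using straight insertion."""
    vec = np.array(values, copy=True)  # leave the caller's data untouched
    for j in range(1, vec.size):
        v = vec[j]  # element to insert into the sorted prefix vec[:j]
        i = j
        # Shift larger elements one position to the right.
        while i > 0 and vec[i - 1] > v:
            vec[i] = vec[i - 1]
            i -= 1
        vec[i] = v
    return vec


# A 20-element array, as suggested in the last step of the exercise.
rng = np.random.default_rng(seed=0)
sample = rng.integers(0, 100, size=20)
sorted_sample = straight_insertion_sort(sample)
print(sorted_sample)
```

Straight insertion takes time proportional to the square of the array length in the worst case, so for real work np.sort is the better choice; the hand-written loop is useful mainly for practising with the shell.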

The Jupyter Notebook

The Jupyter notebook (https://jupyter.org/) started as part of IPython but was separated in version 4 and extended, and now lives as a separate project. The notebook concept is based on the extension of the interactive shell model, creating documents that can run code, show documentation, and present results such as graphs and images.
Jupyter is a web application, so it runs directly in your web browser without requiring separate client software, which also allows it to be used across the internet. Jupyter can use IPython as a kernel for running Python, but it supports more than 40 kernels contributed by the developer community.

Note

A kernel, in Jupyter parlance, is a computation engine that runs the code that is typed into a code cell in a notebook. For example, the IPython kernel executes Python code in a notebook. There are kernels for other languages, such as R and Julia.
Jupyter has become a de facto platform for data science work, used by beginners and power users alike, in small and large enterprises, and in academia. Its popularity has increased tremendously in the last few years. A Jupyter notebook contains both the input and the output of the code you run on it. It allows text, images, mathematical formulas, and more, and is an excellent platform for developing code and communicating results. Because of its web format, notebooks can be shared over the internet. It also supports the Markdown markup language and renders Markdown text as rich text, with formatting and other features supported.
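To make the idea that a notebook stores both input and output concrete, the sketch below builds a simplified notebook document with Python's json module. A notebook file (.ipynb) is a JSON document; the field names used here follow version 4 of the notebook format, but the structure is deliberately reduced and the cell contents are invented for illustration.

```python
import json

# A simplified notebook: one Markdown entry and one code entry with its output.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# A title written in Markdown"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": ["1 + 1"],
            # The result of running the code is saved next to the input.
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    "data": {"text/plain": ["2"]},
                }
            ],
        },
    ],
}

# Serialize to the text that would be written to a .ipynb file.
serialized = json.dumps(notebook, indent=1)

# Reading it back shows which entries hold text and which hold code.
cell_types = [cell["cell_type"] for cell in json.loads(serialized)["cells"]]
print(cell_types)
```

Because the format is plain JSON, notebooks can be versioned, shared, and processed with ordinary tools, although in practice they are created and edited through the Jupyter interface rather than by hand.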
As we've seen before, each notebook has a kernel. This kernel is the interpreter that will execute the code in the cells. The basic unit of a notebook is called a cell. A cell is a container for either code or text. W...
