Learning Objectives
We will start our journey by seeing how Python can be used to manipulate and visualize data and to produce useful analyses.
By the end of this chapter, you will be able to:
- Use all components of the Python data science stack
- Manipulate data using pandas DataFrames
- Create simple plots using pandas and Matplotlib
In this chapter, we will learn how to use NumPy, pandas, Matplotlib, IPython, and the Jupyter notebook. Later in the chapter, we will explore how virtualenv and pyenv are deployed, and soon after that we will plot basic visualizations using the Matplotlib and Seaborn libraries.
Introduction
The Python data science stack is an informal name for a set of libraries used together to tackle data science problems. There is no consensus on which libraries are part of this list; it usually depends on the data scientist and the problem to be solved. We will present the libraries most commonly used together and explain how they can be used.
In this chapter, we will learn how to manipulate tabular data with the Python data science stack. The stack is the first stepping stone toward manipulating large datasets, although these libraries are not commonly used for big data themselves. The ideas and methods used here will be very helpful when we get to large datasets.
Python Libraries and Packages
One of the main reasons Python is a powerful programming language is the libraries and packages that come with it. There are more than 130,000 packages on the Python Package Index (PyPI) and counting! Let's explore some of the libraries and packages that are part of the data science stack.
The components of the data science stack are as follows:
- NumPy: A numerical manipulation package
- pandas: A data manipulation and analysis library
- SciPy library: A collection of mathematical algorithms built on top of NumPy
- Matplotlib: A plotting and graph library
- IPython: An interactive Python shell
- Jupyter notebook: A web document application for interactive computing
The combination of these libraries forms a powerful tool set for handling data manipulation and analysis. We will go through each of the libraries, explore their functionalities, and show how they work together. Let's start with the interpreters.
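As a first taste of how these libraries fit together, the following minimal sketch generates numbers with NumPy and hands them to pandas for a quick summary. The column names a, b, and total and the seed value are illustrative assumptions, not part of the chapter's exercises.

```python
import numpy as np
import pandas as pd

# Generate a small table of random integers with NumPy.
# The seed is an arbitrary choice so that reruns give the same numbers.
rng = np.random.default_rng(seed=0)
values = rng.integers(0, 100, size=(5, 2))

# Wrap the NumPy array in a pandas DataFrame with (hypothetical) column names.
df = pd.DataFrame(values, columns=["a", "b"])

# Column arithmetic in pandas operates on the underlying NumPy data.
df["total"] = df["a"] + df["b"]

# describe() reports count, mean, standard deviation, and quartiles per column.
print(df.describe())
```

The same DataFrame could then be passed to a plotting library, which is the pattern we will follow throughout the chapter.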
IPython: A Powerful Interactive Shell
The IPython shell (https://ipython.org/) is an interactive Python command interpreter that can handle several languages. It allows us to test ideas quickly without having to create files and run them. Most Python installations have a bundled command interpreter, usually called the shell, where you can execute commands interactively. Although it's handy, this standard Python shell is a bit cumbersome to use. IPython has more features:
- Input history that is available between sessions, so when you restart your shell, the previous commands that you typed can be reused.
- Tab completion for commands and variables: you can type the first letters of a Python command, function, or variable and IPython will autocomplete it.
- Magic commands that extend the functionality of the shell. For example, one magic command loads an extension that reloads imported modules after they are changed on disk, without having to restart IPython.
- Syntax highlighting.
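To make the module-reloading feature more concrete, here is a rough sketch in plain Python of the step that such a magic command automates: re-executing a module's source after the file has changed. It uses the standard library's importlib.reload rather than IPython itself, and the module name demo_settings and its GREETING variable are hypothetical, created in a temporary directory only for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip writing .pyc files so the demo always reads the current source.
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp:
    # Create a tiny (hypothetical) module on disk.
    module_path = Path(tmp) / "demo_settings.py"
    module_path.write_text("GREETING = 'hello'\n")

    # Make the temporary directory importable and import the module.
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    import demo_settings
    first = demo_settings.GREETING

    # Change the file on disk, then reload instead of restarting the shell.
    module_path.write_text("GREETING = 'hello again'\n")
    importlib.reload(demo_settings)
    second = demo_settings.GREETING

    sys.path.remove(tmp)

print(first)
print(second)
```

In an IPython session, this manual reload is handled for you once the corresponding extension is enabled.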
Exercise 1: Interacting with the Python Shell Using the IPython Commands
Getting started with the Python shell is simple. Let's follow these steps to interact with the IPython shell:
- To start the IPython shell, type the ipython command in the console:
> ipython
In [1]:
The IPython shell is now ready and waiting for further commands. First, let's do a simple exercise to solve a sorting problem with one of the basic sorting methods, called straight insertion.
- In the IPython shell, copy-paste the following code:
import numpy as np
vec = np.random.randint(0, 100, size=5)
print(vec)
Now, the output for the randomly generated numbers will be similar to the following:
[23 66 12 54 98]
- Use the following logic to sort the elements of the vec array in ascending order:
for j in np.arange(1, vec.size):
    v = vec[j]
    i = j
    while i > 0 and vec[i-1] > v:
        vec[i] = vec[i-1]
        i = i - 1
    vec[i] = v
Use the print(vec) command to print the output on the console:
[12 23 54 66 98]
- Now modify the code. Instead of creating an array of 5 elements, change its parameters so it creates an array with 20 elements, using the up arrow to edit the pasted code. After changing the relevant section, use the down arrow to move to the end of the code and press Enter to execute it.
Notice the number on the left, indicating the instruction number. This number always increases. We assigned a value to a variable and executed an operation on that variable, getting the result interactively. We will use IPython in the following sections.
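For reference, the exercise can also be sketched as a reusable function. This is one possible arrangement, not the only one: the function name insertion_sort is our own choice, and it sorts a copy so that the original array is left untouched. The result is checked against NumPy's built-in np.sort.

```python
import numpy as np


def insertion_sort(values):
    """Return a sorted copy of a 1-D array using straight insertion."""
    result = np.array(values, copy=True)
    for j in range(1, result.size):
        v = result[j]
        i = j
        # Shift larger elements one position to the right
        # until the correct slot for v is found.
        while i > 0 and result[i - 1] > v:
            result[i] = result[i - 1]
            i -= 1
        result[i] = v
    return result


# The 20-element variant from the last step of the exercise.
vec = np.random.randint(0, 100, size=20)
sorted_vec = insertion_sort(vec)

print(vec)
print(sorted_vec)

# Sanity check against NumPy's own sorting routine.
print(np.array_equal(sorted_vec, np.sort(vec)))
```

Straight insertion is easy to follow but slow for large arrays; in practice, np.sort is the usual choice.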
The Jupyter Notebook
The Jupyter notebook (https://jupyter.org/) started as part of IPython but was split off in version 4 and extended, and it now lives as a separate project. The notebook concept extends the interactive shell model, creating documents that can run code, show documentation, and present results such as graphs and images.
Jupyter is a web application, so it runs directly in your web browser without you having to install separate software, which also enables it to be used across the internet. Jupyter can use IPython as a kernel for running Python, but it has support for more than 40 kernels that are contributed by the developer community.
Note
A kernel, in Jupyter parlance, is a computation engine that runs the code that is typed into a code cell in a notebook. For example, the IPython kernel executes Python code in a notebook. There are kernels for other languages, such as R and Julia.
Jupyter has become a de facto platform for data science work, used by everyone from beginners to power users, in small and large enterprises, and in academia. Its popularity has increased tremendously in the last few years. A Jupyter notebook contains both the input and the output of the code you run in it. It allows text, images, mathematical formulas, and more, and is an excellent platform for developing code and communicating results. Because of its web format, notebooks can be shared over the internet. It also supports the Markdown markup language and renders Markdown text as rich text, with formatting and other features supported.
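To illustrate how a notebook keeps input and output together, the sketch below builds a tiny notebook document by hand using only Python's json module. Notebook files are stored as JSON, and the field names follow version 4 of the notebook format. The cell contents are made up for illustration, and real notebooks written by Jupyter carry more metadata than shown here.

```python
import json

# A minimal notebook with one Markdown cell and one code cell.
# The code cell stores both its source (input) and its recorded output.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Sorting demo\n", "A short *Markdown* note."],
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(sorted([3, 1, 2]))"],
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stdout",
                    "text": ["[1, 2, 3]\n"],
                }
            ],
        },
    ],
}

# Serialize to the text that would be saved in an .ipynb file,
# then read it back and list the cell types.
text = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```

Because the file is plain JSON, notebooks are easy to share and to process with other tools.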
As we've seen before, each notebook has a kernel. This kernel is the interpreter that will execute the code in the cells. The basic unit of a notebook is called a cell. A cell is a container for either code or text. W...