Overview
This chapter describes Jupyter Notebooks and their use in data analysis. It also explains the features of Jupyter Notebooks, which allow for additional functionality beyond running Python code. You will learn and implement the fundamental features of Jupyter Notebooks by completing several hands-on exercises. By the end of this chapter, you will be able to use some important features of Jupyter Notebooks and some key libraries available in Python.
Introduction
Our approach to learning in this book is highly applied since hands-on learning is the quickest way to understand abstract concepts. With this in mind, the focus of this chapter is to introduce Jupyter Notebooks—the data science tool that we will be using throughout this book.
Since gaining mainstream popularity, Jupyter Notebooks have been one of the most important tools for data scientists who use Python. They offer a great environment for a variety of tasks, such as performing quick and dirty analysis, researching model selection, and creating reproducible pipelines. They allow data to be loaded, transformed, and modeled inside a single file, where it's quick and easy to test out code and explore ideas along the way. Furthermore, all of this can be documented inline using formatted text, which means you can make notes or even produce a structured report.
Other comparable platforms, such as RStudio or Spyder, offer multiple panels to work between. Frequently, one of these panels will be a Read-Eval-Print Loop (REPL), where code is run in a terminal session that keeps previously defined variables in memory. Code written here may end up being copied and pasted into a different panel within the main codebase, and there may also be additional panels to see visualizations or other files. Such development environments are prone to efficiency issues and can promote bad practices for reproducibility if you're not careful.
Jupyter Notebooks work differently. Instead of having multiple panels for different components of your project, they offer the same functionality in a single component (that is, the Notebook), where the text is displayed along with code snippets, and code outputs are displayed inline. This lets you code efficiently and allows you to look back at previous work for reference, or even make alterations.
We'll start this chapter by explaining exactly what Jupyter Notebooks are and why they are so popular among data scientists. Then, we'll access a Notebook together and go through some exercises to learn how the platform is used.
Basic Functionality and Features of Jupyter Notebooks
In this section, we will briefly demonstrate the usefulness of Jupyter Notebooks with examples. Then, we'll walk through the basics of how they work and how to run them within the Jupyter platform. For those who have used Jupyter Notebooks before, this will be a good refresher, and you are likely to uncover new things as well.
What Is a Jupyter Notebook and Why Is It Useful?
Jupyter Notebooks are locally run web applications that contain live code, equations, figures, interactive apps, and Markdown text. The default programming language is Python. In other words, a Notebook will assume you are writing Python unless you tell it otherwise. We'll see examples of this when we work through our first workbook, later in this chapter.
Note
Jupyter Notebooks support many programming languages through the use of kernels, which act as bridges between the Notebook and the language. These include R, C++, and JavaScript, among many others. A list of available kernels can be found here: https://packt.live/2Y0jKJ0.
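As a rough illustration of the kernel concept, the following sketch lists the kernels installed on your own machine. It assumes that the jupyter_client package (normally installed alongside Jupyter) is available; the helper name list_installed_kernels is hypothetical, and it simply returns an empty dictionary when the package is missing.

```python
def list_installed_kernels():
    """Return a dict mapping each installed kernel name to its display name."""
    try:
        # Assumption: jupyter_client is present, as in a typical Jupyter install.
        from jupyter_client.kernelspec import KernelSpecManager
    except ImportError:
        return {}

    # get_all_specs() returns {kernel_name: {"resource_dir": ..., "spec": {...}}}
    specs = KernelSpecManager().get_all_specs()
    return {
        name: info["spec"].get("display_name", name)
        for name, info in specs.items()
    }


for name, display_name in list_installed_kernels().items():
    print(f"{name}: {display_name}")
```

On a default installation, you would typically expect to see a single Python kernel listed; other languages appear only after their kernels have been installed.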
The following is an example of a Jupyter Notebook:
Figure 1.1: Jupyter Notebook sample workbook
Besides executing Python code, you can write in Markdown to quickly render formatted text, such as titles, lists, or bold font. This can be done in combination with code using the concept of independent cells in the Notebook, as seen in Figure 1.2. Markdown is not specific to Jupyter; it is a simple, general-purpose language for styling text and creating basic documents. For example, most GitHub repositories have a README.md file that is written in Markdown format. It's comparable to HTML but offers much less customization in exchange for simplicity.
Commonly used symbols in Markdown include hashes (#) to make text into a heading, square brackets ([]) and round brackets (()) to insert hyperlinks, and asterisks (*) to create italicized or bold text:
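To make the idea of independent cells more concrete, here is a minimal sketch of how a Notebook could be represented as data. It assumes the version 4 .ipynb JSON layout; the cell contents and the variable names are made up for illustration.

```python
import json

# A Notebook file is JSON: an ordered list of cells plus some metadata.
notebook = {
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Sales analysis\n", "Notes written in **Markdown**."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["total = 2 + 3\n", "total"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# Each cell declares its own type, which is how text and code coexist.
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)  # ['markdown', 'code']

# Serializing the dictionary gives the text that would be stored on disk.
notebook_json = json.dumps(notebook, indent=1)
```

In practice, you would rarely edit this structure by hand, since the Jupyter interface reads and writes it for you.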
Figure 1.2: Sample Markdown document
In addition, Markdown can be used to render images and add hyperlinks in your document, both of which are supported in Jupyter Notebooks.
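The Markdown symbols described above can be sketched as follows. The headings, link target, and image path are placeholders invented for illustration; pasting the contents of the string into a Markdown cell should render them as formatted text.

```python
# Markdown source held in a Python string; all content is placeholder text.
markdown_source = """\
# Top-level heading
## Second-level heading

Some *italic* text and some **bold** text.

- First list item
- Second list item

[Link text](https://example.com)
![Alternative text for an image](path/to/image.png)
"""

print(markdown_source)
```

Note that an image uses the same bracket syntax as a hyperlink, with an exclamation mark in front.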
Jupyter Notebooks were not the first tool to use Markdown alongside code. R Markdown, a hybrid language where R code can be written and executed inline with Markdown text, already followed this design. Jupyter Notebooks essentially offer the equivalent functionality for Python code. However, as we will see, they function quite differently from R Markdown documents. For example, R Markdown assumes you are writing Markdown unless otherwise specified, whereas Jupyter Notebooks assume you are inputting code. This and other features (as we will explore throughout) make it more appealing to use Jupyter Notebooks for rapid development in data science research.
While Jupyter Notebooks offer a blank canvas for a general range of applications, the types of Notebooks commonly seen in real-world data science can be categorized as either lab-style or deliverable.
Lab-style Notebooks serve as the programming analog of research journals. These should contain all the work you've done to load, process, analyze, and model the data. The idea here is to document everything you've done for future reference. For this reason, it's usually not advisable to delete or alter previous lab-style Notebooks. It's also a good idea to accumulate multiple date-stamped versions of the Notebook as you progress through the analysis, in case you want to look back at previous states.
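One possible way to accumulate date-stamped versions is sketched below. The function name save_dated_copy and the naming scheme (the original filename followed by an underscore and the date) are assumptions made for illustration, not a convention required by Jupyter.

```python
import shutil
from datetime import date
from pathlib import Path


def save_dated_copy(notebook_path, stamp=None):
    """Copy a Notebook to a sibling file with a date stamp in its name."""
    source = Path(notebook_path)
    # Default to today's date, formatted as YYYY-MM-DD.
    stamp = stamp or date.today().isoformat()
    target = source.with_name(f"{source.stem}_{stamp}{source.suffix}")
    shutil.copy2(source, target)  # copies the contents and file metadata
    return target
```

For example, calling save_dated_copy("analysis.ipynb") on an existing file would create a copy named analysis_ followed by today's date and the .ipynb extension, leaving the original untouched.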
Deliverable Notebooks are intended to be presentable and should contain only select parts of the lab-style Notebooks. For example, this could be an interesting discovery to share with your colleagues, an in-depth report of your analysis for a manager, ...