eBook - ePub

Data Science with Python

Name: Data Science with Python
Author: Rohan Chopra, Aaron England, Mohamed Noordeen Alaudeen

Combine Python with machine learning principles to discover hidden patterns in raw data

Rohan Chopra, Aaron England, Mohamed Noordeen Alaudeen

Share book

426 pages
English
ePUB (mobile friendly)
Available on iOS & Android

eBook - ePub

Data Science with Python

Combine Python with machine learning principles to discover hidden patterns in raw data

Rohan Chopra, Aaron England, Mohamed Noordeen Alaudeen

Book details

Book preview

Table of contents

Citations

About This Book

Leverage the power of the Python data science libraries and advanced machine learning techniques to analyse large unstructured datasets and predict the occurrence of a particular future event.

Key Features

Explore the depths of data science, from data collection through to visualization
Learn pandas, scikit-learn, and Matplotlib in detail
Study various data science algorithms using real-world datasets

Book Description

Data Science with Python begins by introducing you to data science and teaches you to install the packages you need to create a data science coding environment. You will learn three major techniques in machine learning: unsupervised learning, supervised learning, and reinforcement learning. You will also explore basic classification and regression techniques, such as support vector machines, decision trees, and logistic regression.

As you make your way through chapters, you will study the basic functions, data structures, and syntax of the Python language that are used to handle large datasets with ease. You will learn about NumPy and pandas libraries for matrix calculations and data manipulation, study how to use Matplotlib to create highly customizable visualizations, and apply the boosting algorithm XGBoost to make predictions. In the concluding chapters, you will explore convolutional neural networks (CNNs), deep learning algorithms used to predict what is in an image. You will also understand how to feed human sentences to a neural network, make the model process contextual information, and create human language processing systems to predict the outcome.

By the end of this book, you will be able to understand and implement any new data science algorithm and have the confidence to experiment with tools or libraries other than those covered in the book.

What you will learn

Pre-process data to make it ready to use for machine learning
Create data visualizations with Matplotlib
Use scikit-learn to perform dimension reduction using principal component analysis (PCA)
Solve classification and regression problems
Get predictions using the XGBoost library
Process images and create machine learning models to decode them
Process human language for prediction and classification
Use TensorBoard to monitor training metrics in real time
Find the best hyperparameters for your model with AutoML

Who this book is for

Data Science with Python is designed for data analysts, data scientists, database engineers, and business analysts who want to move towards using Python and machine learning techniques to analyze data and predict outcomes. Basic knowledge of Python and data analytics will prove beneficial to understand the various concepts explained through this book.

Frequently asked questions

How do I cancel my subscription?

Simply head over to the account section in settings and click on “Cancel Subscription” - it’s as simple as that. After you cancel, your membership will stay active for the remainder of the time you’ve paid for. Learn more here.

Can/how do I download books?

At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.

What is the difference between the pricing plans?

Both plans give you full access to the library and all of Perlego’s features. The only differences are the price and subscription period: With the annual plan you’ll save around 30% compared to 12 months on the monthly plan.

What is Perlego?

We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.

Do you support text-to-speech?

Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.

Is Data Science with Python an online PDF/ePUB?

Yes, you can access Data Science with Python by Rohan Chopra, Aaron England, Mohamed Noordeen Alaudeen in PDF and/or ePUB format, as well as other popular books in Computer Science & Programming in Python. We have over one million books available in our catalogue for you to explore.

Information

Publisher

Packt Publishing

Year

2019

ISBN

9781838552169

Edition

Topic

Computer Science

Subtopic

Programming in Python

Index

Computer Science

Chapter 1 Introduction to Data Science and Data Pre-Processing

Learning Objectives

By the end of this chapter, you will be able to:

Use various Python machine learning libraries
Handle missing data and deal with outliers
Perform data integration to bring together data from different sources
Perform data transformation to convert data into a machine-readable form
Scale data to avoid problems with values of different magnitudes
Split data into train and test datasets
Describe the different types of machine learning
Describe the different performance measures of a machine learning model

This chapter introduces data science and covers the various processes included in the building of machine learning models, with a particular focus on pre-processing.

Introduction

We live in a world where we are constantly surrounded by data. As such, being able to understand and process data is an absolute necessity.

Data Science is a field that deals with the description, analysis, and prediction of data. Consider an example from our daily lives: every day, we utilize multiple social media applications on our phones. These applications gather and process data in order to create a more personalized experience for each user – for example, showing us news articles that we may be interested in, or tailoring search results according to our location. This branch of data science is known as machine learning.

Machine learning is the methodical learning of procedures and statistical representations that computers use to accomplish tasks without human intervention. In other words, it is the process of teaching a computer to perform tasks by itself without explicit instructions, relying only on patterns and inferences. Some common uses of machine learning algorithms are in email filtering, computer vision, and computational linguistics.

This book will focus on machine learning and other aspects of data science using Python. Python is a popular language for data science, as it is versatile and relatively easy to use. It also has several ready-made libraries that are well equipped for processing data.

Python Libraries

Throughout this book, we'll be using various Python libraries, including pandas, Matplotlib, Seaborn, and scikit-learn.

pandas

pandas is an open source package that has many functions for loading and processing data in order to prepare it for machine learning tasks. It also has tools that can be used to analyze and manipulate data. Data can be read from many formats using pandas. We will mainly be using CSV data throughout this book. To read CSV data, you can use the read_csv() function by passing filename.csv as an argument. An example of this is shown here:

>>> import pandas as pd

>>> pd.read_csv("data.csv")

In the preceding code, pd is an alias name given to pandas. It is not mandatory to give an alias. To visualize a pandas DataFrame, you can use the head() function to list the top five rows. This will be demonstrated in one of the following exercises.

Note

Please visit the following link to learn more about pandas: https://pandas.pydata.org/pandas-docs/stable/.

NumPy

NumPy is one of the main packages that Python has to offer. It is mainly used in practices related to scientific computing and when working on mathematical operations. It comprises of tools that enable us to work with arrays and array objects.

Matplotlib

Matplotlib is a data visualization package. It is useful for plotting data points in a 2D space with the help of NumPy.

Seaborn

Seaborn is also a data visualization library that is based on matplotlib. Visualizations created using Seaborn are far more attractive than ones created using matplotlib in terms of graphics.

scikit-learn

scikit-learn is a Python package used for machine learning. It is designed in such a way that it interoperates with other numeric and scientific libraries in Python to achieve the implementation of algorithms.

These ready-to-use libraries have gained interest and attention from developers, especially in the data science space. Now that we have covered the various libraries in Python, in the next section we'll explore the roadmap for building machine learning models.

Roadmap for Building Machine Learning Models

The roadmap for building machine learning models is straightforward and consists of five major steps, which are explained here:

Data Pre-processing
This is the first step in building a machine learning model. Data pre-processing refers to the transformation of data before feeding it into the model. It deals with the techniques that are used to convert unusable raw data into clean reliable data.
Since data collection is often not performed in a controlled manner, raw data often contains outliers (for example, age = 120), nonsensical data combinations (for example, model: bicycle, type: 4-wheeler), missing values, scale problems, and so on. Because of this, raw data cannot be fed into a machine learning model because it might compromise the quality of the results. As such, this is the most important step in the process of data science.
Model Learning
After pre-processing the data and splitting it into train/test sets (more on this later), we move on to modeling. Models are nothing but sets of well-defined methods called algorithms that use pre-processed data to learn patterns, which can later be used to make predictions. There are different types of learning algorithms, including supervised, semi-supervised, unsupervised, and reinforcement learning. These will be discussed later.
Model Evaluation
In this stage, the models are evaluated with the help of specific performance metrics. With these metrics, we can go on to tune the hyperparameters of a model in order to improve it. This process is called hyperparameter optimization. We will repeat this step until we are satisfied with the performance.
Prediction
Once we are happy with the results from the evaluation step, we will then move on to predictions. Predictions are made by the trained model when it is exposed to a new dataset. In a business setting, these predictions can be shared with decision makers to make effective business choices.
Model Deployment
The whole...