eBook - ePub

Hands-On Predictive Analytics with Python

Name: Hands-On Predictive Analytics with Python
Author: Alvaro Fuentes

Master the complete predictive analytics process, from problem definition to model deployment

Alvaro Fuentes

330 pages
English
ePUB (mobile friendly)

About This Book

Step-by-step guide to build high performing predictive applications

Key Features

Use the Python data analytics ecosystem to implement end-to-end predictive analytics projects
Explore advanced predictive modeling algorithms with an emphasis on theory with intuitive explanations
Learn to deploy a predictive model's results as an interactive application

Book Description

Predictive analytics is an applied field that employs a variety of quantitative methods using data to make predictions. It involves much more than just throwing data onto a computer to build a model. This book provides practical coverage to help you understand the most important concepts of predictive analytics. Using practical, step-by-step examples, we build predictive analytics solutions while using cutting-edge Python tools and packages.

The book's step-by-step approach starts by defining the problem and moves on to identifying relevant data. It then covers data preparation, exploring and visualizing relationships, and building, tuning, evaluating, and deploying models.

Each stage has relevant practical examples and efficient Python code. You will work with models such as KNN, Random Forests, and neural networks using the most important libraries in Python's data science stack: NumPy, Pandas, Matplotlib, Seaborn, Keras, Dash, and so on. In addition to hands-on code examples, you will find intuitive explanations of the inner workings of the main techniques and algorithms used in predictive analytics.

By the end of this book, you will be all set to build high-performance predictive analytics solutions using Python programming.

What you will learn

Get to grips with the main concepts and principles of predictive analytics
Learn about the stages involved in producing complete predictive analytics solutions
Understand how to define a problem, propose a solution, and prepare a dataset
Use visualizations to explore relationships and gain insights into the dataset
Learn to build regression and classification models using scikit-learn
Use Keras to build powerful neural network models that produce accurate predictions
Learn to serve a model's predictions as a web application

Who this book is for

This book is for data analysts, data scientists, data engineers, and Python developers who want to learn about predictive modeling and would like to implement predictive analytics solutions using Python's data stack. People from other backgrounds who would like to enter this exciting field will greatly benefit from reading this book. All you need is to be proficient in Python programming and have a basic understanding of statistics and college-level algebra.

Information

Publisher: Packt Publishing
Year: 2018
ISBN: 9781789134544
Topic: Computer Science
Subtopic: Programming in Python
Index: Computer Science

Predicting Categories with Machine Learning

In the previous chapter, we learned the basics of machine learning. In this chapter, we will build models that predict categories. Machine learning problems of this kind are known as classification tasks. Classification models are among the most useful in practice, and in this chapter we will talk about some of the most popular and foundational ones.

We begin the chapter by providing an overview of classification tasks and some of their applications. Then we bring back our credit card default dataset and start preparing it for modeling. After that, we introduce one of the most popular models for classification, logistic regression, which is similar in spirit to the multiple regression models we discussed in the previous chapter. The next model we present is the classification tree. We present this model because it is very popular and easy to understand and, besides, it is the basis for one of the most popular and powerful models used in predictive analytics: random forests.

As we did in the previous chapter, we explain at a high level how these models work and we use scikit-learn to train models on our credit card default dataset. After training the models, we compare their performance on the testing set. Finally, because the credit card default dataset poses a binary classification problem, we finish the chapter with a brief section that contains an example of a multiclass classification problem.

These are the learning outcomes for this chapter:

Learn about classification tasks and why classification models are so important
Review the credit card default dataset
Learn about the logistic regression model
Understand the classification trees model
Learn the random forest model
Work through a simple example of multiclass classification
Learn the basics of Naive Bayes classifiers

Technical requirements

The technical requirements for this chapter are as follows:

Python 3.6 or higher
Jupyter Notebook
Recent versions of the following Python libraries: NumPy, pandas, matplotlib, Seaborn, and scikit-learn

Classification tasks

Classification tasks belong to the supervised learning branch of ML. These kinds of tasks are among the most widely used in industry and academia. Here are just a few examples of classification tasks in some domains of application:

Direct marketing: Predict whether a customer will give a positive or a negative response to a campaign
Medicine: Predict whether a patient is healthy or sick; or, for example, which kind of cancer the patient has
Insurance: Classify clients by risk level; for instance, low, average, or high risk
Telecommunication and other industries: Churn models are classification models that predict which customers will switch to another provider
Education: Predict which students will drop out from a program
Email services: Classify emails that go to different places such as inbox, spam, social, and promotions

Of course, our credit card default problem is a classification task because we are trying to predict whether a customer will default or pay their credit card bill next month.

To review what we mentioned in the previous chapter, there are three main types of classification problems:

Binary classification: The target has only two categories, which is the case for our credit card default problem.
Multiclass classification: The target has more than two classes.
Multilabel classification: The problem of assigning more than one category or label to an observation. A popular example is predicting the subjects of a news article based on its contents. Many news articles rarely fall into just one category; one article could be simultaneously about the broad topics of World News, Politics, and Finance.
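
To make the three problem types concrete, here is a minimal sketch of how the target is typically laid out in each case. The label values, the risk-level coding, and the news topics are invented for illustration; they are not taken from the credit card default dataset.

```python
import numpy as np

# Binary classification: one column, two possible values (0 or 1).
y_binary = np.array([0, 1, 0, 0, 1])

# Multiclass classification: one column, more than two possible values.
# Here 0, 1, 2 stand for hypothetical risk levels: low, average, high.
y_multiclass = np.array([0, 2, 1, 0, 2])

# Multilabel classification: one column per label; each observation
# can have several labels switched on at the same time.
topics = ["World News", "Politics", "Finance"]
y_multilabel = np.array([
    [1, 1, 1],  # an article about all three topics
    [0, 1, 0],  # politics only
    [1, 0, 1],  # world news and finance
])

# Recover the topic names assigned to the first article.
first_article_topics = [t for t, flag in zip(topics, y_multilabel[0]) if flag == 1]
print(first_article_topics)
```

The only structural difference is in the target: binary and multiclass problems have a single label per observation, while a multilabel problem has one indicator per possible label.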

Predicting categories and probabilities

ML classification models can output two types of predictions:

Predicted classes: For every observation, the model will directly give the prediction of the class.
Probabilities for each class: For every observation and every class, the model will output probabilities of that observation belonging to that class. Say, for example, we have three classes—A, B, and C—then the output of the model would be a triple of numbers such as [0.2, 0.7, 0.1], meaning the probabilities of the observation belonging to A, B, and C respectively. Note that, since we are dealing with probabilities, the values should add up to 1.

In the case of models that output a probability for every class, the classification is done by predicting the category with the highest probability. This is the default rule; however, we can (and sometimes should) change how the probabilities are turned into predicted classes, depending on the goals we set for our predictive analytics project.
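
The default rule and one possible alternative can be sketched as follows. The class names match the three-class example above, while the probability values, the helper `classify_with_threshold`, and the 0.3 threshold are assumptions chosen only for illustration.

```python
import numpy as np

classes = np.array(["A", "B", "C"])

# Each row holds the predicted probabilities for one observation,
# in the order given by `classes`. The values are invented.
probabilities = np.array([
    [0.2, 0.7, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.3, 0.6],
])

# The probabilities for each observation should add up to 1.
assert np.allclose(probabilities.sum(axis=1), 1.0)

# Default rule: predict the class with the highest probability.
predicted = classes[probabilities.argmax(axis=1)]
print(predicted)  # ['B' 'A' 'C']


def classify_with_threshold(positive_probabilities, threshold=0.5):
    """Predict 1 when the positive-class probability reaches the threshold."""
    return (np.asarray(positive_probabilities) >= threshold).astype(int)


# Binary case: probability of the positive class for four observations.
p_positive = np.array([0.15, 0.35, 0.55, 0.80])
default_rule = classify_with_threshold(p_positive)        # threshold of 0.5
cautious_rule = classify_with_threshold(p_positive, 0.3)  # flags more positives
print(default_rule, cautious_rule)
```

Lowering the threshold labels more observations as positive, which may be preferable when missing a positive case is costlier than raising a false alarm.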

For binary classification models, we often call one of the classes "the positive class" and label it with a 1; the other class becomes "the negative class" and is usually labeled with a 0 (many people like using -1 as well, but I don't like it). The positive class is the class around which the analysis is built. Keep in mind that in this context the term "positive" has nothing to do with the everyday sense of the word, in which it indicates that something is "good". For instance, in the credit card default problem, our positive class will be "default", which from the point of view of the financial institution is of course not "positive" at all.
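
As a small sketch of this labeling convention, the snippet below encodes a handful of outcomes with "default" as the positive class. The five outcomes and the names `outcomes` and `positive_class` are hypothetical and do not come from the dataset.

```python
import numpy as np

# Hypothetical raw outcomes for five customers.
outcomes = np.array(["pay", "default", "pay", "pay", "default"])

# The positive class is the one the analysis is built around.
positive_class = "default"

# 1 = positive class ("default"), 0 = negative class ("pay").
y = (outcomes == positive_class).astype(int)
print(y)  # [0 1 0 0 1]

# Share of observations that belong to the positive class.
positive_rate = y.mean()
print(positive_rate)
```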

Credit card default dataset

OK, time to get our hands dirty with the credit card default data. We saw the descriptions of the features back in Chapter 2, Problem Understanding and Data Preparation:

SEX: Gender (1 = male; 2 = female).
EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).
AGE: Age (year).
LIMIT_BAL: Amount of the given credit (New Taiwan dollar)—it includes both the individual consumer credit and his/her family (supplementary) credit.
PAY_1 - PAY_6: History of past payment. We tracked the past monthly payment records (from April, 2005, to September, 2005) as follows: PAY_1 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; . . .; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and ...