Machine Learning with scikit-learn Quick Start Guide

Classification, regression, and clustering techniques in Python

  • 172 pages
  • English
  • ePUB (mobile friendly)
  • Available on iOS & Android

About this book

Deploy supervised and unsupervised machine learning algorithms using scikit-learn to perform classification, regression, and clustering.

Key Features

  • Build your first machine learning model using scikit-learn
  • Train supervised and unsupervised models using popular techniques such as classification, regression, and clustering
  • Understand how scikit-learn can be applied to different types of machine learning problems

Book Description

Scikit-learn is a robust machine learning library for the Python programming language. It provides a set of supervised and unsupervised learning algorithms. This book is the easiest way to learn how to deploy, optimize, and evaluate all of the important machine learning algorithms that scikit-learn provides.

This book teaches you how to use scikit-learn for machine learning. You will start by setting up and configuring your machine learning environment with scikit-learn. To put scikit-learn to use, you will learn how to implement various supervised and unsupervised machine learning models. You will learn classification, regression, and clustering techniques to work with different types of datasets and train your models.

Finally, you will learn about an effective pipeline to help you build a machine learning project from scratch. By the end of this book, you will be confident in building your own machine learning models for accurate predictions.

What you will learn

  • Learn how to work with all of scikit-learn's machine learning algorithms
  • Install and set up scikit-learn to build your first machine learning model
  • Employ unsupervised machine learning algorithms to cluster unlabeled data into groups
  • Build classification and regression machine learning models
  • Use an effective pipeline to build a machine learning project from scratch

Who this book is for

This book is for aspiring machine learning developers who want to get started with scikit-learn. Intermediate knowledge of Python programming and some fundamental knowledge of linear algebra and probability will help.


Classification and Regression with Trees

Tree-based algorithms are very popular for two reasons: they are interpretable, and they make accurate predictions that have won many machine learning competitions on online platforms such as Kaggle. Furthermore, they have many use cases outside of machine learning, for solving problems both simple and complex.
Building a tree is an approach to decision-making used in almost all industries. Trees can be used to solve both classification and regression problems, and their many use cases make them a go-to solution!
This chapter is broadly divided into the following two sections:
  • Classification trees
  • Regression trees
Each section will cover the fundamental theory of the different types of tree-based algorithms, along with their implementation in scikit-learn. By the end of this chapter, you will have learned how to aggregate several algorithms into an ensemble and have them vote on the best prediction.
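As a preview of that final step, the following is a minimal sketch of a hard-voting ensemble built with scikit-learn's VotingClassifier; the synthetic dataset from make_classification and the hyperparameter values are illustrative placeholders, not the book's own example:

# A minimal sketch of hard voting over several tree-based classifiers.
# The synthetic dataset and hyperparameters are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each estimator casts a vote; the majority class wins.
ensemble = VotingClassifier(estimators=[
    ('tree', DecisionTreeClassifier(random_state=42)),
    ('forest', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('boost', AdaBoostClassifier(random_state=42)),
], voting='hard')

ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))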

Technical requirements

You will need Python 3.6 or greater, with pandas ≥ 0.23.4, scikit-learn ≥ 0.20.0, and Matplotlib ≥ 3.0.0 installed on your system.
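If you want to confirm that your environment meets these requirements, a quick check along the following lines should work:

# Verify that the installed versions meet the minimums listed above.
import matplotlib
import pandas
import sklearn

print(pandas.__version__)      # expect >= 0.23.4
print(sklearn.__version__)     # expect >= 0.20.0
print(matplotlib.__version__)  # expect >= 3.0.0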
The code files of this chapter can be found on GitHub:
https://github.com/PacktPublishing/Machine-Learning-with-scikit-learn-Quick-Start-Guide/blob/master/Chapter_06.ipynb.
Check out the following video to see the code in action:
http://bit.ly/2SrPP7R

Classification trees

Classification trees are used to predict a category or class. This is similar to the classification algorithms that you have learned about previously in this book, such as the k-nearest neighbors algorithm or logistic regression.
Broadly speaking, there are three tree-based algorithms that are used to solve classification problems:
  • The decision tree classifier
  • The random forest classifier
  • The AdaBoost classifier
In this section, you will learn how each of these tree-based algorithms works in order to classify a row of data as a particular class or category.
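All three live in scikit-learn and share the same interface; as a quick orientation (the hyperparameter values shown here are illustrative, not the book's):

# The three tree-based classifiers covered in this section.
# Hyperparameter values are illustrative defaults.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=4)
forest = RandomForestClassifier(n_estimators=100)
boost = AdaBoostClassifier(n_estimators=50)

# All of them expose the same interface: fit(X, y), then predict(X).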

The decision tree classifier

The decision tree is the simplest tree-based algorithm, and serves as the foundation for the other two algorithms. Let's consider the following simple decision tree:
Figure: A simple decision tree
A decision tree, in simple terms, is a set of rules that help us classify observations into distinct groups. In the preceding diagram, the rule could be written as follows:
If the value of the feature is less than 50, then put the triangles in the left-hand box and the circles in the right-hand box.
The preceding decision tree perfectly divides the observations into two distinct groups. This is the characteristic of an ideal decision tree. The box at the top is called the root of the tree, and it represents the feature that matters most when deciding how to group the observations.
The boxes under the root node are known as its children. In the preceding tree, the children are also leaf nodes. The leaves are the final set of boxes, usually at the bottommost part of the tree. As you might have guessed, a decision tree looks like a regular tree, but inverted, or upside down.
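To see that a fitted tree really is just a set of rules, you can print them with export_text (available in scikit-learn 0.21 and later); the iris dataset here is only a stand-in for illustration:

# Print the learned if/then rules of a small decision tree.
# export_text requires scikit-learn >= 0.21; iris is a stand-in dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree))  # one line per split, indented by depth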

Picking the best feature

How does the decision tree decide which feature is the best? The best feature is the one that offers the best possible split, dividing the data into two or more distinct groups, depending on the number of classes or categories in the data. Let's have a look at the following diagram:
Figure: A decision tree showing a good split
In the preceding diagram, the following happens:
  1. The tree splits the data from the root node into two distinct groups.
  2. In the left-hand group, we see that there are two triangles and one circle.
  3. In the right-hand group, we see that there are two circles and one triangle.
  4. Since the tree has put the majority of each class into its own group, we can say that the tree has done a good job of splitting the data into distinct groups.
Let's take a look at another example—this time, one in which the split is bad. Consider the following diagram:
Figure: A decision tree with a bad split
In the preceding diagram, the following happens:
  1. The tree splits the data in the root node into four distinct groups. This is bad in itself, as there are clearly only two categories (circles and triangles).
  2. Furthermore, each group has one triangle and one circle.
  3. There is no majority class or category in any of the four groups. Each group is a 50/50 mix of the two categories; therefore, the tree cannot come to a conclusive decision unless it relies on more features, which then increases the complexity of the tree.

The Gini coefficient

The metric that the decision tree uses to decide which feature becomes the root node is called the Gini coefficient. The higher the value of this coefficient, the better the job that this particular feature does at splitting the data into distinct groups. In order to learn how to compute the Gini coefficient for a feature, let's consider the following diagram:
Figure: Computing the Gini coefficient
In the preceding diagram, the following happens:
  1. The feature splits the data into two groups.
  2. In the left-hand group, we have two triangles and one circle.
  3. Therefore, the Gini for the left-hand group is (2 triangles / 3 total data points)² + (1 circle / 3 total data points)².
  4. Working this out gives (2/3)² + (1/3)² = 4/9 + 1/9 = 5/9 ≈ 0.56.
  5. A value of 0.56 for the Gini coefficient indicates that the root of this tree splits the data in such a way that each group has a majority category.
  6. A perfect root feature would have a Gini coefficient of 1. This means that each group contains only one class/category.
  7. A bad root feature would have a Gini coefficient of 0.5 (for two classes), which indicates that no group has a distinct majority class/category.
In reality, the decision tree is built in a recursive manner: at each node, the tree computes the Gini coefficient for each candidate attribute and picks the attribute that best splits the data in that node into groups with distinct classes and categories.
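As a sanity check on the arithmetic above, here is a small helper that computes the score exactly as defined in this section, that is, the sum of squared class proportions (the function name is ours, not scikit-learn's):

# Gini coefficient as defined above: the sum of squared class proportions.
# 1.0 means a pure group; 0.5 is the worst case for two classes.
def gini_score(group):
    total = len(group)
    proportions = [group.count(label) / total for label in set(group)]
    return sum(p ** 2 for p in proportions)

# The good split: two triangles and one circle in the left-hand group.
print(gini_score(['triangle', 'triangle', 'circle']))  # 5/9, approx. 0.56

# The bad split: one triangle and one circle in each group.
print(gini_score(['triangle', 'circle']))  # 0.5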

Implementing the decision tree classifier in scikit-learn

In this section, you will learn how to implement the decision tree classifier in scikit-learn. We will work with the same fraud detection dataset. The f...
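As a rough sketch of the fit-and-predict pattern this section describes (the fraud detection dataset itself is not shown here, so a synthetic dataset from make_classification stands in for it):

# A minimal fit/predict sketch of the decision tree classifier.
# make_classification stands in for the book's fraud detection dataset.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
print(accuracy_score(y_test, tree.predict(X_test)))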

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. Dedication
  4. About Packt
  5. Contributors
  6. Preface
  7. Introducing Machine Learning with scikit-learn
  8. Predicting Categories with K-Nearest Neighbors
  9. Predicting Categories with Logistic Regression
  10. Predicting Categories with Naive Bayes and SVMs
  11. Predicting Numeric Outcomes with Linear Regression
  12. Classification and Regression with Trees
  13. Clustering Data with Unsupervised Machine Learning
  14. Performance Evaluation Methods
  15. Other Books You May Enjoy