Python Data Mining Quick Start Guide
eBook - ePub

A beginner's guide to extracting valuable insights from your data

  1. 188 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Explore the different data mining techniques using the libraries and packages offered by Python

Key Features

  • Grasp the basics of data loading, cleaning, analysis, and visualization
  • Use the popular Python libraries such as NumPy, pandas, matplotlib, and scikit-learn for data mining
  • Your one-stop guide to building efficient data mining pipelines without going into too much theory

Book Description

Data mining is a necessary and predictable response to the dawn of the information age. It is typically defined as the pattern and/or trend discovery phase in the data mining pipeline, and Python is a popular choice for these tasks because it offers a wide variety of data mining tools.

This book will serve as a quick introduction to the concept of data mining and putting it to practical use with the help of popular Python packages and libraries. You will get a hands-on demonstration of working with different real-world datasets and extracting useful insights from them using popular Python libraries such as NumPy, pandas, scikit-learn, and matplotlib. You will then learn the different stages of data mining such as data loading, cleaning, analysis, and visualization. You will also get a full conceptual description of popular data transformation, clustering, and classification techniques.

By the end of this book, you will be able to build an efficient data mining pipeline using Python without any hassle.

What you will learn

  • Explore the methods for summarizing datasets and visualizing/plotting data
  • Collect and format data for analytical work
  • Assign data points into groups and visualize clustering patterns
  • Learn how to predict continuous and categorical outputs for data
  • Clean, filter noise from, and reduce the dimensions of data
  • Serialize a data processing model using scikit-learn's pipeline feature
  • Deploy the data processing model using Python's pickle module

Who this book is for

Python developers interested in getting started with data mining will love this book. Budding data scientists and data analysts looking to quickly get to grips with practical data mining with Python will also find this book to be useful. Knowledge of Python programming is all you need to get started.

Prediction with Regression and Classification

This chapter covers the basics of predictive modeling: the mathematical machinery, the types of predictive models, and tuning strategies. For many readers, prediction is the ultimate goal of their work, so it is important to understand that this topic is a full field of its own. Take this chapter as an introduction and launching-off point for your learning.
The following topics will be covered in this chapter:
  • Mathematical machinery, including loss functions and gradient descent
  • Linear regression and penalties
  • Logistic regression
  • Tree-based classification, including random forests
  • Support vector machines
  • Tuning methodologies including cross-validation and hyperparameter selection

Scikit-learn Estimator API

One of the reasons scikit-learn is so popular is its ease of use. The library has only a few well-thought-out API designs, and they are applied consistently across many different methods and routines. This chapter will make use of the Estimator API. It is extremely straightforward and, once you understand how to use it, you can try out new regression and classification estimators with ease, because they all work in the same way (in other words, they all make use of the Estimator API).
The steps are as follows:
  1. Import the module
  2. Instantiate the estimator object (a regression or classification model)
  3. Fit the model to map the input training data (X_train) to the ground truth labels (y_train)
  4. Predict y_pred on the new test data (X_test)
These steps can also be represented as a workflow diagram:
[Figure: workflow diagram of the Estimator API steps]
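The four steps above can be sketched in a few lines. This is a minimal illustration, not the book's own example: the synthetic dataset from make_classification, the choice of LogisticRegression, and the 25% test split are all assumptions made here so that the snippet is self-contained.

```python
# Step 1: import the modules (synthetic data stands in for a real dataset)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 samples, 4 features, binary labels
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 2: instantiate the estimator object
model = LogisticRegression()

# Step 3: fit the model to map X_train to the ground truth y_train labels
model.fit(X_train, y_train)

# Step 4: predict y_pred on the new test data
y_pred = model.predict(X_test)

print(y_pred.shape)
```

Swapping in a different estimator should only require changing the import and the instantiation line; the fit and predict calls stay the same.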

Introducing prediction concepts

Predicting the output value (that is, regression) or label (that is, classification) on future unseen data is a common final step in data mining projects.
Before reading the rest of this chapter, please be sure to digest the prerequisite concepts introduced in the Basic data terminology and Basic summary statistics sections in Chapter 2, Basic Terminology and Our End-to-End Example. In particular, the content on data types, variable types, and prediction metrics is assumed knowledge throughout this chapter.
The main strategy is to collect a training set and build a mapping function (that is, fit a model) from the input variables (X) to the output variable (y). Let's collect our assumptions before moving on:
  • (Assumption) There is a relationship between X and y, namely that the X variables are independent and y depends on X
  • (Assumption) Future data will have the same distribution as the training set
If both assumptions hold, then you can build the model on a training set and apply the model to new unseen data to generate a meaningful prediction.
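To see why the second assumption matters, consider the following sketch. Everything in it is hypothetical and chosen only for illustration: the ground truth y = x², the straight-line model, and the input ranges. A line fitted on inputs from [0, 1] predicts reasonably well on new inputs from the same range, but poorly on inputs from a shifted range that the training set never covered.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b (closed form, one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def mean_abs_error(w, b, xs, ys):
    """Average absolute difference between predictions and ground truth."""
    return sum(abs((w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)


def true_relationship(x):
    # Hypothetical ground truth, assumed only for this illustration
    return x ** 2


# Training set: inputs spread evenly over [0, 1]
x_train = [i / 20 for i in range(21)]
y_train = [true_relationship(x) for x in x_train]
w, b = fit_line(x_train, y_train)

# New data from the same range as the training set
x_same = [i / 20 + 0.025 for i in range(20)]
# New data from a shifted range the model never saw
x_shifted = [2 + i / 20 for i in range(21)]

err_same = mean_abs_error(w, b, x_same, [true_relationship(x) for x in x_same])
err_shifted = mean_abs_error(
    w, b, x_shifted, [true_relationship(x) for x in x_shifted]
)
print(f"same distribution: {err_same:.3f}, shifted distribution: {err_shifted:.3f}")
```

The model itself is unchanged between the two evaluations; only the distribution of the new data differs.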
Mapping functions can model both linear and non-linear relationships and typically have multiple internal parameters that must be optimized for the best fit. Of course, we do not want to jump in and manually choose parameters of the mapping function, so we need to design an algorithm for building the function and finding the best parameters, which will allow our computing machines to learn the mapping function. If we can use mathematics to quantitatively describe what we want to get done, then our computers can do it for us. This means we have to quantitatively define the following:
  • What behavior is important to our problem statement
  • A strategy for optimizing that behavior
The most common strategy is to formulate the prediction algorithm as a minimization problem. In this mindset, we define bad behavior and how to minimize it. Bad behavior is defined as missed predictions and is quantified by a metric called loss, which measures the number and extent of the misses. The function that calculates loss is appropriately named a loss function and typically compares the predicted output (y_pred) to the ground truth output variable (y). Loss can be minimized in a multitude of ways, but the most common is called gradient descent, which uses a trick from differential calculus to move a system in the direction of minimization. The details of loss and gradient descent are presented in the following sections.
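As a rough sketch of this idea, the snippet below pairs a mean squared error loss function with plain gradient descent to fit a straight line y = w * x + b. The choice of loss, the toy data, the learning rate, and the number of steps are all assumptions made for illustration; library implementations are considerably more sophisticated.

```python
def mse_loss(y_true, y_pred):
    """Mean squared error: the average squared miss between y_pred and y."""
    return sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)


def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Minimize the MSE of y = w * x + b by stepping against the gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        y_pred = [w * x + b for x in xs]
        # Partial derivatives of the MSE loss with respect to w and b
        grad_w = (2 / n) * sum((yp - y) * x for x, y, yp in zip(xs, ys, y_pred))
        grad_b = (2 / n) * sum(yp - y for y, yp in zip(ys, y_pred))
        # Move each parameter a small step in the direction that lowers the loss
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

initial_loss = mse_loss(ys, [0.0 for _ in xs])
w, b = gradient_descent(xs, ys)
final_loss = mse_loss(ys, [w * x + b for x in xs])

print(f"w={w:.3f}, b={b:.3f}, loss went from {initial_loss:.3f} to {final_loss:.6f}")
```

The derivative is the "trick from differential calculus": it tells the algorithm which direction to move each parameter so that the loss decreases.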
Furthermore, a number of model parameters for each prediction algorithm can be preset to affect the minimization path. These are called hyperparameters and are set independently of the minimization problem. The process of building and tuning a prediction model's hyperparameters to ensure its reliable generalization to new data is accomplished by following an established and systematic series of steps. This process will be described in the Tuning a prediction model section later in the chapter.
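The distinction between hyperparameters and learned parameters can be illustrated with scikit-learn's Ridge regression. This is a sketch under assumptions: the synthetic data from make_regression and the two alpha values are picked arbitrarily. The penalty strength alpha is a hyperparameter set when the estimator is instantiated, while the coefficients in coef_ are found by the fit itself.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Hypothetical toy regression data
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# alpha is a hyperparameter: chosen up front, not learned from the data
weak_penalty = Ridge(alpha=0.1).fit(X, y)
strong_penalty = Ridge(alpha=100.0).fit(X, y)

# coef_ holds the learned parameters; a stronger penalty shrinks them
weak_norm = float((weak_penalty.coef_ ** 2).sum())
strong_norm = float((strong_penalty.coef_ ** 2).sum())

print("hyperparameters:", weak_penalty.get_params()["alpha"],
      strong_penalty.get_params()["alpha"])
print("squared coefficient norms:", weak_norm, strong_norm)
```

Both models solve the same kind of minimization problem, but the preset alpha changes which solution the minimization arrives at.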

Prediction nomenclature

This chapter will use the X, Y terminology introduced in the Variable types ...

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. Dedication
  4. About Packt
  5. Contributors
  6. Preface
  7. Data Mining and Getting Started with Python Tools
  8. Basic Terminology and Our End-to-End Example
  9. Collecting, Exploring, and Visualizing Data
  10. Cleaning and Readying Data for Analysis
  11. Grouping and Clustering Data
  12. Prediction with Regression and Classification
  13. Advanced Topics - Building a Data Processing Pipeline and Deploying It
  14. Other Books You May Enjoy

Python Data Mining Quick Start Guide by Nathan Greeneltch is available in PDF and/or ePUB format, in the Computer Science & Data Modelling & Design category.