eBook - ePub

Mastering Data Mining with Python – Find patterns hidden in your data

Name: Mastering Data Mining with Python – Find patterns hidden in your data
Author: Megan Squire

268 pages
English
ePUB (mobile friendly)
Available on iOS and Android

Book information

Learn how to create more powerful data mining applications with this comprehensive Python guide to advanced data analytics techniques

About This Book

Dive deeper into data mining with Python – don't be complacent, sharpen your skills!
From the most common elements of data mining to cutting-edge techniques, we've got you covered for any data-related challenge
Become a more fluent and confident Python data analyst, in full control of Python's extensive range of libraries

Who This Book Is For

This book is for data scientists who are already familiar with basic data mining tools and techniques, such as SQL and machine learning, and who are comfortable with Python. If you are ready to learn more advanced data mining techniques and become a data mining expert, this is the book for you!

What You Will Learn

Explore techniques for finding frequent itemsets and association rules in large data sets
Learn identification methods for entity matches across many different types of data
Identify the basics of network mining and how to apply it to real-world data sets
Discover methods for detecting the sentiment of text and for locating named entities in text
Observe multiple techniques for automatically extracting summaries and generating topic models for text
See how to use data mining to fix data anomalies and how to use machine learning to identify outliers in a data set

In Detail

Data mining is an integral part of the data science pipeline. It is the foundation of any successful data-driven strategy – without it, you'll never be able to uncover truly transformative insights. Since data is vital to just about every modern organization, it is worth taking the next step to unlock even greater value and more meaningful understanding.

If you already know the fundamentals of data mining with Python, you are now ready to experiment with more interesting, advanced data analytics techniques using Python's easy-to-use interface and extensive range of libraries.

In this book, you'll go deeper into many often-overlooked areas of data mining, including association rule mining, entity matching, network mining, sentiment analysis, named entity recognition, text summarization, topic modeling, and anomaly detection. For each data mining technique, we'll review the state of the art and current best practices before comparing a wide variety of strategies for solving each problem. We will then implement example solutions using real-world data from the domain of software engineering, and we will spend time learning how to understand and interpret the results we get.

By the end of this book, you will have solid experience implementing some of the most interesting and relevant data mining techniques available today, and you will have achieved a greater fluency in the important field of Python data analytics.

Style and approach

This book teaches the intricacies of applying data mining through real-world scenarios and serves as a practical guide for your data mining needs.

Information

Publisher: Packt Publishing
Year: 2016
ISBN: 9781785889950
Categories: Computer Science; Programming in Python

Table of Contents

Credits

About the Author

About the Reviewers

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Expanding Your Data Mining Toolbox

What is data mining?

How do we do data mining?

The Fayyad et al. KDD process

The Han et al. KDD process

The CRISP-DM process

The Six Steps process

Which data mining methodology is the best?

What are the techniques used in data mining?

What techniques are we going to use in this book?

How do we set up our data mining work environment?

Summary

2. Association Rule Mining

What are frequent itemsets?

The diapers and beer urban legend

Frequent itemset mining basics

Towards association rules

Support

Confidence

Association rules

An example with data

Added value – fixing a flaw in the plan

Methods for finding frequent itemsets

A project – discovering association rules in software project tags

Summary

3. Entity Matching

What is entity matching?

Merging data

Merging datasets vertically

Merging datasets horizontally

Techniques for matching

Attribute-based similarity matching

Be careful of pairwise comparisons

Leverage rare values

Methods for matching attributes

Range-based or distance from target

String edit distance

Hamming distance

Levenshtein distance

Soundex

Leveraging disjoint sets

Context-based similarity matching

Machine learning-based entity matching

Evaluation of entity matching techniques

Efficiency – how long does it take to do the matching?

Effectiveness – how accurate are the matches that we generate?

Usefulness – how practical is the matching procedure to use?

Entity matching project

Difficulties with matching software projects

Two examples

Matching on project names

Matching on people names

Matching on URLs

Matching on topics and description keywords

The dataset

The code

The results

How many entity matches did we find?

How good are the pairs we found?

Summary

4. Network Analysis

What is a network?

Measuring a network

Degree of a network

Diameter of a network

Walks, paths, and trails in a network

Components of a network

Centrality of a network

Closeness centrality

Degree centrality

Betweenness centrality

Other measures of centrality

Representing graph data

Adjacency matrix

Edge lists and adjacency lists

Differences between graph data structures

Importing data into a graph structure

Adjacency list format

Edge list format

GEXF and GraphML

GDF

Python pickle

JSON

JSON node and link series

JSON trees

Pajek format

A real project

Exploring the data

Generating the network files

Understanding our data as a network

Generating simple network metrics

Playing with the parameters of a network

Analyzing subgraphs

Analyzing cliques and centrality in the subgraphs

Looking for change over time

Summary

5. Sentiment Analysis in Text

What is sentiment analysis?

The basics of sentiment analysis

The structure of an opinion

Document-level and sentence-level analysis

Important features of opinions

Sentiment analysis algorithms

General-purpose data collections

Hu and Liu's sentiment analysis lexicon

SentiWordNet

Vader sentiment

Sentiment mining application

Motivating the project

Data preparation

Data analysis of chat messages

Data analysis of e-mail messages

Summary

6. Named Entity Recognition in Text

Why look for named entities?

Techniques for named entity recognition

Tagging parts of speech

Classes of named entities

Building and evaluating NER systems

NER and partial matches

Handling partial matches

Named entity recognition project

A simple NER tool

Apache Board meeting minutes

Django IRC chat

GnuIRC summaries

LKML e-mails

Summary

7. Automatic Text Summarization

What is automatic text summarization?

Tools for text summarization

Naive text summarization using NLTK

Text summarization using Gensim

Text summarization using Sumy

Sumy's Luhn summarizer

Sumy's TextRank summarizer

Sumy's LSA summarizer

Sumy's Edmundson summarizer

Summary

8. Topic Modeling in Text

What is topic modeling?

Latent Dirichlet Allocation

Gensim for topic modeling

Understanding Gensim LDA topics

Understanding Gensim LDA passes

Applying a Gensim LDA model to new documents

Serializing Gensim LDA objects

Serializing a dictionary

Serializing a corpus

Serializing a model

Gensim LDA for a larger project

Summary

9. Mining for Data Anomalies

What are data anomalies?

Missing data

Locating missing data

Zero values

Fixing missing data

Ignore the problem rows

Fix the problem manually

Use a fabricated value

Use a central measure

Use Last Observation Carried Forward

Use a similar value

Use the most likely value

Data errors

Truncated fields

Data type and character set errors

Logic or semantic errors

Outliers

Visual mining for outliers

Statistical detection of outliers

Detecting outliers with modified z-scores

Detecting outliers by combining statistics and visual mining

Detecting outliers with machine learning

Summary

Index

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2016

Production reference: 1240816

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-995-0

www.packtpub.c...
