Mastering Data Mining with Python – Find patterns hidden in your data
eBook - ePub

Megan Squire

  1. 268 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS and Android

About the book

Learn how to create more powerful data mining applications with this comprehensive Python guide to advanced data analytics techniques

About This Book

  • Dive deeper into data mining with Python – don't be complacent, sharpen your skills!
  • From the most common elements of data mining to cutting-edge techniques, we've got you covered for any data-related challenge
  • Become a more fluent and confident Python data analyst, in full control of its extensive range of libraries

Who This Book Is For

This book is for data scientists who are already familiar with some basic data mining techniques such as SQL and machine learning, and who are comfortable with Python. If you are ready to learn some more advanced techniques in data mining in order to become a data mining expert, this is the book for you!

What You Will Learn

  • Explore techniques for finding frequent itemsets and association rules in large data sets
  • Learn identification methods for entity matches across many different types of data
  • Identify the basics of network mining and how to apply it to real-world data sets
  • Discover methods for detecting the sentiment of text and for locating named entities in text
  • Observe multiple techniques for automatically extracting summaries and generating topic models for text
  • See how to use data mining to fix data anomalies and how to use machine learning to identify outliers in a data set

In Detail

Data mining is an integral part of the data science pipeline. It is the foundation of any successful data-driven strategy – without it, you'll never be able to uncover truly transformative insights. Since data is vital to just about every modern organization, it is worth taking the next step to unlock even greater value and more meaningful understanding.

If you already know the fundamentals of data mining with Python, you are now ready to experiment with more interesting, advanced data analytics techniques using Python's easy-to-use interface and extensive range of libraries.

In this book, you'll go deeper into many often-overlooked areas of data mining, including association rule mining, entity matching, network mining, sentiment analysis, named entity recognition, text summarization, topic modeling, and anomaly detection. For each data mining technique, we'll review the state of the art and current best practices before comparing a wide variety of strategies for solving each problem. We will then implement example solutions using real-world data from the domain of software engineering, and we will spend time learning how to understand and interpret the results we get.

By the end of this book, you will have solid experience implementing some of the most interesting and relevant data mining techniques available today, and you will have achieved a greater fluency in the important field of Python data analytics.

Style and approach

This book will teach you the intricacies of applying data mining using real-world scenarios and will act as a very practical solution to your data mining needs.

Information

Year: 2016
ISBN: 9781785889950
Edition: 1
Category: Computer Science

Table of Contents

Mastering Data Mining with Python – Find patterns hidden in your data
Credits
About the Author
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Expanding Your Data Mining Toolbox
What is data mining?
How do we do data mining?
The Fayyad et al. KDD process
The Han et al. KDD process
The CRISP-DM process
The Six Steps process
Which data mining methodology is the best?
What are the techniques used in data mining?
What techniques are we going to use in this book?
How do we set up our data mining work environment?
Summary
2. Association Rule Mining
What are frequent itemsets?
The diapers and beer urban legend
Frequent itemset mining basics
Towards association rules
Support
Confidence
Association rules
An example with data
Added value – fixing a flaw in the plan
Methods for finding frequent itemsets
A project – discovering association rules in software project tags
Summary
3. Entity Matching
What is entity matching?
Merging data
Merging datasets vertically
Merging datasets horizontally
Techniques for matching
Attribute-based similarity matching
Be careful of pairwise comparisons
Leverage rare values
Methods for matching attributes
Range-based or distance from target
String edit distance
Hamming distance
Levenshtein distance
Soundex
Leveraging disjoint sets
Context-based similarity matching
Machine learning-based entity matching
Evaluation of entity matching techniques
Efficiency – how long does it take to do the matching?
Effectiveness – how accurate are the matches that we generate?
Usefulness – how practical is the matching procedure to use?
Entity matching project
Difficulties with matching software projects
Two examples
Matching on project names
Matching on people names
Matching on URLs
Matching on topics and description keywords
The dataset
The code
The results
How many entity matches did we find?
How good are the pairs we found?
Summary
4. Network Analysis
What is a network?
Measuring a network
Degree of a network
Diameter of a network
Walks, paths, and trails in a network
Components of a network
Centrality of a network
Closeness centrality
Degree centrality
Betweenness centrality
Other measures of centrality
Representing graph data
Adjacency matrix
Edge lists and adjacency lists
Differences between graph data structures
Importing data into a graph structure
Adjacency list format
Edge list format
GEXF and GraphML
GDF
Python pickle
JSON
JSON node and link series
JSON trees
Pajek format
A real project
Exploring the data
Generating the network files
Understanding our data as a network
Generating simple network metrics
Playing with the parameters of a network
Analyzing subgraphs
Analyzing cliques and centrality in the subgraphs
Looking for change over time
Summary
5. Sentiment Analysis in Text
What is sentiment analysis?
The basics of sentiment analysis
The structure of an opinion
Document-level and sentence-level analysis
Important features of opinions
Sentiment analysis algorithms
General-purpose data collections
Hu and Liu's sentiment analysis lexicon
SentiWordNet
Vader sentiment
Sentiment mining application
Motivating the project
Data preparation
Data analysis of chat messages
Data analysis of e-mail messages
Summary
6. Named Entity Recognition in Text
Why look for named entities?
Techniques for named entity recognition
Tagging parts of speech
Classes of named entities
Building and evaluating NER systems
NER and partial matches
Handling partial matches
Named entity recognition project
A simple NER tool
Apache Board meeting minutes
Django IRC chat
GnuIRC summaries
LKML e-mails
Summary
7. Automatic Text Summarization
What is automatic text summarization?
Tools for text summarization
Naive text summarization using NLTK
Text summarization using Gensim
Text summarization using Sumy
Sumy's Luhn summarizer
Sumy's TextRank summarizer
Sumy's LSA summarizer
Sumy's Edmundson summarizer
Summary
8. Topic Modeling in Text
What is topic modeling?
Latent Dirichlet Allocation
Gensim for topic modeling
Understanding Gensim LDA topics
Understanding Gensim LDA passes
Applying a Gensim LDA model to new documents
Serializing Gensim LDA objects
Serializing a dictionary
Serializing a corpus
Serializing a model
Gensim LDA for a larger project
Summary
9. Mining for Data Anomalies
What are data anomalies?
Missing data
Locating missing data
Zero values
Fixing missing data
Ignore the problem rows
Fix the problem manually
Use a fabricated value
Use a central measure
Use Last Observation Carried Forward
Use a similar value
Use the most likely value
Data errors
Truncated fields
Data type and character set errors
Logic or semantic errors
Outliers
Visual mining for outliers
Statistical detection of outliers
Detecting outliers with modified z-scores
Detecting outliers by combining statistics and visual mining
Detecting outliers with machine learning
Summary
Index

Mastering Data Mining with Python – Find patterns hidden in your data

Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: August 2016
Production reference: 1240816
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-995-0
www.packtpub.c...
