Learning Data Mining with Python - Second Edition
eBook - ePub

  • 358 pages
  • English
  • ePUB (mobile friendly)
  • Available on iOS & Android

About this book

Harness the power of Python to develop data mining applications, analyze data, delve into machine learning, explore object detection using Deep Neural Networks, and create insightful predictive models.

About This Book
  • Use a wide variety of Python libraries for practical data mining purposes.
  • Learn how to find, manipulate, analyze, and visualize data using Python.
  • Step-by-step instructions on data mining techniques with Python that have real-world applications.

Who This Book Is For
If you are a Python programmer who wants to get started with data mining, then this book is for you. If you are a data analyst who wants to leverage the power of Python to perform data mining efficiently, this book will also help you. No previous experience with data mining is expected.

What You Will Learn
  • Apply data mining concepts to real-world problems
  • Predict the outcome of sports matches based on past results
  • Determine the author of a document based on their writing style
  • Use APIs to download datasets from social media and other online services
  • Find and extract good features from difficult datasets
  • Create models that solve real-world problems
  • Design and develop data mining applications using a variety of datasets
  • Perform object detection in images using Deep Neural Networks
  • Find meaningful insights from your data through intuitive visualizations
  • Compute on big data, including real-time data from the internet

In Detail
This book teaches you to design and develop data mining applications using a variety of datasets, starting with basic classification and affinity analysis. This book covers a large number of libraries available in Python, including the Jupyter Notebook, pandas, scikit-learn, and NLTK.
You will gain hands-on experience with complex data types including text, images, and graphs. You will also discover object detection using Deep Neural Networks, which is one of the big, difficult areas of machine learning right now.
With restructured examples and code samples updated for the latest edition of Python, each chapter of this book introduces you to new algorithms and techniques. By the end of the book, you will have great insights into using Python for data mining and an understanding of the algorithms as well as their implementations.

Style and approach
This book will be your comprehensive guide to learning the various data mining techniques and implementing them in Python. A variety of real-world datasets is used to explain data mining techniques in a very crisp and easy to understand manner.

Information

Year: 2017
eBook ISBN: 9781787129566
Edition: 2
Subtopic: Data Mining

Features and scikit-learn Transformers

The datasets we have used so far have been described in terms of features. In the previous chapter, we used a transaction-centric dataset. However, ultimately this was just a different format for representing feature-based data.
There are many other types of datasets, including text, images, sounds, movies, or even real objects. Most data mining algorithms rely on having numerical or categorical features. This means we need a way to represent these types before we input them into the data mining algorithm. We call this representation a model.
In this chapter, we will discuss how to extract numerical and categorical features, and choose the best features when we do have them. We will discuss some common patterns and techniques for extracting features. Choosing your model appropriately is critically important to the outcome of the data mining exercise, more so than the choice of classification algorithm.
The key concepts introduced in this chapter include:
  • Extracting features from datasets
  • Creating models for your data
  • Creating new features
  • Selecting good features
  • Creating your own transformer for custom datasets
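As a rough preview of the last item, the sketch below shows the fit/transform pattern that scikit-learn transformers follow. It is a dependency-free illustration only: the class name MeanThresholdTransformer and the sample data are hypothetical, and the class deliberately does not subclass scikit-learn's base classes, so it should not be read as the book's own implementation.

```python
class MeanThresholdTransformer:
    """Hypothetical transformer: marks a value 1 if it is above the
    mean of its column (learned in fit), otherwise 0."""

    def fit(self, X, y=None):
        n_rows = len(X)
        # Learn one mean per column from the rows passed to fit.
        self.means_ = [sum(column) / n_rows for column in zip(*X)]
        return self  # returning self allows fit(...).transform(...) chaining

    def transform(self, X):
        # Compare every value with the mean learned for its column.
        return [
            [1 if value > mean else 0 for value, mean in zip(row, self.means_)]
            for row in X
        ]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Made-up data: three samples with two numerical features each.
X_train = [[1.0, 10.0], [3.0, 20.0], [5.0, 60.0]]
transformer = MeanThresholdTransformer()
X_binary = transformer.fit_transform(X_train)
print(X_binary)
```

Keeping the learned state (the column means) in fit and the conversion in transform means the same thresholds can later be applied to data that was not seen during fitting.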

Feature extraction

Extracting features is one of the most critical tasks in data mining, and it generally affects your end result more than the choice of data mining algorithm. Unfortunately, there are no hard and fast rules for choosing features that will result in high-performance data mining. The choice of features determines the model that you are using to represent your data.
Model creation is where the science of data mining becomes more of an art, and it is why automated data mining methods (of which there are several) focus on algorithm choice rather than model creation. Creating good models relies on intuition, domain expertise, data mining experience, trial and error, and sometimes a little luck.

Representing reality in models

Given what we have done so far in the book, it is easy to forget that the reason we perform data mining is to affect real-world objects, not just to manipulate a matrix of values. Not all datasets are presented in terms of features. Sometimes, a dataset consists of nothing more than all of the books written by a given author. Sometimes, it is the footage of every movie released in 1979. At other times, it is a library collection of interesting historical artifacts.
From these datasets, we may want to perform a data mining task. For the books, we may want to know the different categories that the author writes in. For the films, we may wish to see how women are portrayed. For the historical artifacts, we may want to know whether they are from one country or another. It isn't possible to just pass these raw datasets into a decision tree and see what the result is.
For a data mining algorithm to assist us here, we need to represent these as features. Features are a way to create a model and the model provides an approximation of reality in a way that data mining algorithms can understand. Therefore, a model is just a simplified version of some aspect of the real world. As an example, the game of chess is a simplified model (in game form) for historical warfare.
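To make the idea concrete, here is a minimal sketch of one possible model for text: each document is represented by how often each word appears in it (a bag-of-words model). The two sample sentences and the helper names tokenize and bag_of_words are assumptions made for illustration; they are not taken from the book.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of letters (and apostrophes) as words.
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    # Count word occurrences per document.
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct word, in a fixed (sorted) order.
    vocabulary = sorted(set().union(*counts))
    # One row per document, one column per vocabulary word.
    matrix = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, matrix


# Made-up stand-ins for raw text, such as passages from an author's books.
documents = [
    "The ship sailed at dawn.",
    "At dawn the detective found the ship.",
]
vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

The resulting rows are plain numerical features, so they can be passed to an algorithm that could not work with the raw text directly. Word order is discarded, which is exactly the kind of simplification a model makes.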
Selecting features has another advantage: it reduces the complexity of the real world to a more manageable model.
Imagine how much information it would take to properly, accurately, and fully describe a real-world object to someone who has no background knowledge of the item. You would need to describe the size, weight, texture, composition, age, flaws, purpose, origin, and so on.
As the complexity of real objects is too much for current algorithms, we use these simpler models instead.
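As an illustration of such a simpler model, the sketch below reduces an object to three chosen features, two numerical and one categorical, and encodes the categorical one as one-hot columns. The Artifact class, its fields, and the sample values are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    """Hypothetical simplified description of a real object."""
    weight_kg: float   # numerical feature
    age_years: int     # numerical feature
    composition: str   # categorical feature


def to_feature_vector(artifact, compositions):
    # One-hot encode the category: one column per known composition.
    one_hot = [1.0 if artifact.composition == name else 0.0
               for name in compositions]
    return [artifact.weight_kg, float(artifact.age_years)] + one_hot


# Made-up sample objects.
artifacts = [
    Artifact(weight_kg=2.5, age_years=1200, composition="bronze"),
    Artifact(weight_kg=0.8, age_years=300, composition="ceramic"),
]
# Fix the category order so every vector uses the same columns.
compositions = sorted({a.composition for a in artifacts})
features = [to_feature_vector(a, compositions) for a in artifacts]
print(compositions)
print(features)
```

Everything not listed as a field (texture, flaws, purpose, and so on) is simply absent from the model.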
This simplification also focuses our intent in the data mining application. In later chapters, we will look at clustering, where this focus is critically important. If you put random features in, you will get random results out.
However, there is a downside: this simplification reduces detail and may remove good indicators of the things we wish to perform data mining on.
Thought should always be given to how to represent reality in the form of a model. Rather than just using what has been used in the pas...

Table of contents

  1. Title Page
  2. Copyright
  3. Credits
  4. About the Author
  5. About the Reviewer
  6. www.PacktPub.com
  7. Customer Feedback
  8. Preface
  9. Getting Started with Data Mining
  10. Classifying with scikit-learn Estimators
  11. Predicting Sports Winners with Decision Trees
  12. Recommending Movies Using Affinity Analysis
  13. Features and scikit-learn Transformers
  14. Social Media Insight using Naive Bayes
  15. Follow Recommendations Using Graph Mining
  16. Beating CAPTCHAs with Neural Networks
  17. Authorship Attribution
  18. Clustering News Articles
  19. Object Detection in Images using Deep Neural Networks
  20. Working with Big Data
  21. Next Steps...

Learning Data Mining with Python - Second Edition, by Robert Layton, is available in PDF and/or ePUB format.