Mastering Machine Learning with R
eBook - ePub

Advanced machine learning techniques for building smart applications with R 3.5, 3rd Edition

  1. 354 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Stay updated with expert techniques for solving data analytics and machine learning challenges, gain insights from complex projects, and power up your applications

Key Features

  • Build independent machine learning (ML) systems leveraging the best features of R 3.5
  • Understand and apply different machine learning techniques using real-world examples
  • Use methods such as multi-class classification, regression, and clustering

Book Description

Given the growing popularity of R, the zero-cost statistical programming environment, there has never been a better time to start applying ML to your data. This book will teach you advanced techniques in ML, using the latest code in R 3.5. You will delve into various complex features of supervised learning, unsupervised learning, and reinforcement learning algorithms to design efficient and powerful ML models.

This newly updated edition is packed with fresh examples covering a range of tasks from different domains. Mastering Machine Learning with R starts by showing you how to quickly manipulate data and prepare it for analysis. You will explore simple and complex models and understand how to compare them. You'll also learn to use the latest library support, such as TensorFlow and Keras-R, for performing advanced computations. Additionally, you'll explore complex topics, such as natural language processing (NLP), time series analysis, and clustering, which will further refine your skills in developing applications. Each chapter will help you implement advanced ML algorithms using real-world examples. You'll even be introduced to reinforcement learning, along with its various use cases and models. In the concluding chapters, you'll get a glimpse into how some of these black-box models can be diagnosed and understood.

By the end of this book, you'll be equipped with the skills to deploy ML techniques in your own projects or at work.

What you will learn

  • Prepare data for machine learning methods with ease
  • Understand how to write production-ready code and package it for use
  • Produce simple and effective data visualizations for improved insights
  • Master advanced methods, such as Boosted Trees and deep neural networks
  • Use natural language processing to extract insights in relation to text
  • Implement tree-based classifiers, including Random Forest and Boosted Tree

Who this book is for

This book is for data science professionals, machine learning engineers, or anyone who is looking for the ideal guide to help them implement advanced machine learning algorithms. The book will help you take your skills to the next level and advance further in this field. Working knowledge of machine learning with R is mandatory.

Information

Year
2019
Edition
3
eBook ISBN
9781789613568

Text Mining

"What then is, generally speaking, the truth of history? A fable agreed upon. As it has been very ingeniously remarked"
- Napoleon Bonaparte
The world is awash with textual data. If you Google, Bing, or Yahoo! how much of that data is unstructured, that is, in a textual format, estimates would range from 80 to 90 percent. The real number doesn't matter. It matters that a large proportion of the data is in text format. The implication is that anyone seeking to find insights in that data must develop the capability to process and analyze text.
When I first started out as a market researcher, I used to manually pore through page after page of moderator-led focus group and interview transcripts with the hope of capturing some qualitative insight, an aha moment if you will, and then haggle with fellow team members over whether they had the same insight or not. Then, you would always have that one individual in a project who would swoop in and listen to two interviews—out of the 30 or 40 on the schedule—and, alas, they had their mind made up on what was really happening in the world. Contrast that with the techniques being used now, where an analyst can quickly distill data into meaningful quantitative results, support qualitative understanding, and maybe even sway the swooper.
Over the last few years, I've applied the techniques discussed here to mine physician-patient interactions, understand FDA fears on prescription drug advertising, capture patient concerns about rare cancer, and capture customer maintenance problems, to name just a few. Using R and the methods in this chapter, you too can extract the powerful information in textual data.
The following topics will be covered in this chapter:
  • Text mining framework and methods
  • Data overview
  • Word frequency
  • Sentiment analysis
  • N-grams
  • Topic models
  • Classifying text
  • Additional quantitative analysis

Text mining framework and methods

There are many different methods to use in text mining. The goal here is to provide a basic framework to apply to such an endeavor. This framework is not inclusive of all the possible methods, but will cover those that are probably the most important for the vast majority of projects that you will work on. Additionally, I will discuss the modeling methods in as succinct and clear a manner as possible, because they can get quite complicated. Gathering and compiling text data is a topic that could take up several chapters. One of the things I prefer and will put forward here is the use of the tidy framework. It will allow us to use tibbles and data frames for most of the steps, and the tidytext functions allow an easy transition to other types of text mining structures, such as a corpus.
The first task is to put the text files into a data frame. With that created, the data preparation can begin with the text transformation.
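As a minimal sketch of this first task, the following base R code reads a folder of text files into a data frame with one document per row. The folder name, file names, and file contents are invented for illustration, and base R is used here only to keep the example self-contained.

```r
# Write two tiny example files to a temporary folder (invented content).
txt_dir <- file.path(tempdir(), "text_example")
dir.create(txt_dir, showWarnings = FALSE)
writeLines(c("The turbocharger failed.", "Replaced under warranty."),
           file.path(txt_dir, "ticket_01.txt"))
writeLines("Hockey season starts in October.",
           file.path(txt_dir, "ticket_02.txt"))

# Read every .txt file and collapse its lines into a single string,
# giving one row per document.
files <- sort(list.files(txt_dir, pattern = "\\.txt$", full.names = TRUE))
docs_df <- data.frame(
  doc_id = basename(files),
  text = vapply(files, function(f) paste(readLines(f), collapse = " "),
                character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
docs_df
```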
The following list covers some of the most common and useful transformations for text files:
  • Change capital letters to lowercase
  • Remove numbers
  • Remove punctuation
  • Remove stop words
  • Remove excess whitespace characters
  • Word stemming
  • Word replacement
With these transformations, you are creating a more compact dataset and simplifying the structure in order to facilitate relationships between the words, thereby leading to increased understanding. However, keep in mind that not all of these transformations are necessary all the time and judgment must be applied, or you can iterate to find the transformations that make the most sense.
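Several of the simpler transformations can be sketched with base R regular expressions. The helper name clean_text() is hypothetical, and which steps to keep is a judgment call for each project.

```r
# Hypothetical helper that chains four of the transformations listed above.
clean_text <- function(x) {
  x <- tolower(x)                     # capital letters to lowercase
  x <- gsub("[[:digit:]]+", " ", x)   # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)   # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)   # collapse tabs, newlines, repeated spaces
  trimws(x)                           # drop leading and trailing whitespace
}

cleaned <- clean_text("The  3 Turbochargers\tFAILED, twice!")
cleaned
```

Numbers and punctuation are replaced with a space rather than deleted outright so that neighboring words are not glued together; the whitespace step then removes the extra spaces.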
By changing words to lowercase, you can prevent the improper counting of words. Say that you have a count for hockey three times and Hockey once, where it is the first word in a sentence. R will not give you a count of hockey=4, but hockey=3 and Hockey=1.
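The hockey example can be reproduced in a few lines of base R; the word vector is made up to match the counts described above.

```r
# Three lowercase occurrences and one capitalized occurrence.
words <- c("Hockey", "hockey", "hockey", "hockey")

raw_counts <- table(words)               # counts Hockey and hockey separately
lower_counts <- table(tolower(words))    # a single count after lowercasing

raw_counts
lower_counts
```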
Removing punctuation also achieves the same purpose, but in some cases, punctuation is important, especially if you want to tokenize your documents by sentences.
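To see why punctuation can matter, here is a rough sketch of sentence tokenization in base R. It assumes that sentences end in a period, question mark, or exclamation point followed by whitespace, which is too simple for abbreviations such as Dr. but is enough to show the dependence on punctuation.

```r
text <- "I like hockey. Do you? Yes!"

# Split on whitespace that follows sentence-ending punctuation.
# The lookbehind keeps the punctuation attached to its sentence.
sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
sentences

# Once punctuation is stripped, the same split finds no boundaries.
no_punct <- gsub("[[:punct:]]", "", text)
unsplit <- strsplit(no_punct, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
length(unsplit)
```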
In removing stop words, you are getting rid of the common words that have no value; in fact, they are detrimental to the analysis, as their frequency masks important words. Examples of stop words are and, is, the, not, and to.
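A minimal sketch of stop word removal follows, using only the five example stop words named above. A real project would use a much longer list, and the token vector here is invented.

```r
# The five example stop words from the text; real lists are far longer.
stop_words <- c("and", "is", "the", "not", "to")

tokens <- c("the", "turbocharger", "is", "not", "connected",
            "to", "the", "engine")

# Keep only the tokens that are not in the stop word list.
kept <- tokens[!tokens %in% stop_words]
kept
```

Dropping not in this example also drops a negation, which is one reason to apply judgment to the stop word list rather than accept it blindly.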
Removing whitespace makes data more compact by getting rid of things such as tabs, paragraph breaks, double-spacing, and so on.
Stemming can get tricky and might add to your confusion because it deletes word suffixes, creating the base word, or what is known as the radical. I personally am not a big fan of stemming, and the analysts I've worked with agree with that sentiment. As an example, suppose your text contains both family and families. Recall that R would count these as two separate words. By running a stemming algorithm, the stemmed word for the two instances would become famili. This would prevent the incorrect count, but in some cases it can be odd to interpret and is not very visually appealing in a word cloud for presentation purposes. In some cases, it may make sense to run your analysis with both stemmed and unstemmed words in order to see which one facilitates understanding.
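The effect of stemming can be checked directly. This sketch assumes the SnowballC package, which is not named in the text above, and skips the step if the package is not installed.

```r
words <- c("family", "families")

# SnowballC is an assumption here; any Porter-style stemmer would do.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  stems <- SnowballC::wordStem(words, language = "porter")
  print(table(stems))   # both words collapse to the same stem
} else {
  message("SnowballC is not installed; skipping the stemming example.")
}
```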
Probably the most optional of the transformations is word replacement. The goal of replacement is to combine words with a similar meaning, for example, management and leadership. You can also use it in lieu of stemming. I once examined the outcome of stemmed and unstemmed words and concluded that I could achieve a more meaningful result by replacing about a dozen words instead of stemming. It can be important when you have manual data entry and different operators input data differently. For example, tech support person one types turbocharger into the system, while tech support person two types in turbo charger half the time, and turbo-charger the other half. All three versions mean the same thing, so applying a replacement function such as gsub(), with grepl() to check which entries match a pattern, will solve the problem.
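The turbocharger case can be sketched as follows. The entries and the pattern are illustrative; grepl() only reports which entries match, while gsub() performs the replacement.

```r
# Invented help-desk entries with three spellings of the same part.
entries <- c("turbocharger leaking",
             "turbo charger leaking",
             "turbo-charger replaced")

# Match "turbo" and "charger" separated by a space or a hyphen.
pattern <- "turbo[ -]charger"

needs_fix <- grepl(pattern, entries)            # which entries use a variant
fixed <- gsub(pattern, "turbocharger", entries) # standardize the spelling

needs_fix
fixed
```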
With transformations completed, one structure to create for topic modeling or classification is either a document-term matrix (DTM) or a term-document matrix (TDM). Either of these is a matrix of word counts for each individual document in the corpus. A DTM has the documents as rows and the words as columns, while in a TDM, the reverse is true. We will be using a DTM for our example.
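To make the two layouts concrete, here is a small sketch that builds both from a table of tokens in base R. The documents and words are invented, and a dedicated text mining package would normally store these matrices in a sparse format instead.

```r
# One row per token: which document it came from and the word itself.
tokens <- data.frame(
  doc  = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  word = c("hockey", "puck", "hockey", "puck", "ice"),
  stringsAsFactors = FALSE
)

# DTM: documents as rows, words as columns, cells hold word counts.
dtm <- table(tokens$doc, tokens$word)

# TDM: the transpose, with words as rows and documents as columns.
tdm <- t(dtm)

dtm
tdm
```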

Topic models

Topic models are a powerful method to group documents by their main topics. Topic models allow probabilistic modeling of term frequency occurrence in documents. The fitted model can be used to estimate the similarity between documents, as well as between a set of specified keywords using an additional layer of latent variables, which are referred to as topics (Grun and Hornik, 2011). In essence, a document is assigned to a topic based on the distribution of the words in that document, and the other documents in that topic will have roughly the same frequency of words.
The algorithm that we will focus on is Latent Dirichlet Allocation (LDA) with Gibbs sampling, which is probably the most commonly used sampling algorithm. In building topic models, the number of topics must be determined before running the algorithm (k-dimensions). If no a priori reason for the number of topics exists, then you can build several and apply judgment and knowledge to the final selection. LDA with Gibbs sampling is quite complicated mathematically, but my intent is to provide an introduction so that you are at least able to describe how the algorithm learns to assign a document to a topic in layperson terms. If you are interested in mastering the math associated with the method, block out a couple of hours on your calendar and have a go at it. Excellent background material is av...
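For readers who want to see the mechanics, the following is a toy sketch of LDA fitted by collapsed Gibbs sampling, written in base R. It is not the implementation of any particular package; the function name lda_gibbs(), the prior values alpha and beta, the iteration count, and the four-document corpus are all assumptions made for illustration.

```r
set.seed(42)

# Toy corpus (invented): two documents about hockey, two about baking.
docs <- list(
  c("hockey", "puck", "ice", "goal", "hockey"),
  c("goal", "ice", "puck", "hockey", "goal"),
  c("oven", "flour", "bake", "sugar", "bake"),
  c("sugar", "bake", "oven", "flour", "oven")
)

lda_gibbs <- function(docs, k, iters = 200, alpha = 0.1, beta = 0.01) {
  vocab <- sort(unique(unlist(docs)))
  ids <- lapply(docs, function(d) match(d, vocab))  # words as integer ids
  n_docs <- length(ids)
  n_vocab <- length(vocab)

  # Start from a random topic for every token, then tally the counts.
  z <- lapply(ids, function(w) sample.int(k, length(w), replace = TRUE))
  doc_topic <- matrix(0, n_docs, k)
  topic_word <- matrix(0, k, n_vocab, dimnames = list(NULL, vocab))
  for (d in seq_len(n_docs)) {
    for (i in seq_along(ids[[d]])) {
      t0 <- z[[d]][i]
      doc_topic[d, t0] <- doc_topic[d, t0] + 1
      topic_word[t0, ids[[d]][i]] <- topic_word[t0, ids[[d]][i]] + 1
    }
  }

  for (iter in seq_len(iters)) {
    for (d in seq_len(n_docs)) {
      for (i in seq_along(ids[[d]])) {
        w <- ids[[d]][i]
        old_topic <- z[[d]][i]
        # Take the token out of the counts before resampling its topic.
        doc_topic[d, old_topic] <- doc_topic[d, old_topic] - 1
        topic_word[old_topic, w] <- topic_word[old_topic, w] - 1
        # A topic is more likely if this document already uses it and
        # if it already favors this word.
        p <- (doc_topic[d, ] + alpha) * (topic_word[, w] + beta) /
          (rowSums(topic_word) + n_vocab * beta)
        new_topic <- sample.int(k, 1, prob = p)
        z[[d]][i] <- new_topic
        doc_topic[d, new_topic] <- doc_topic[d, new_topic] + 1
        topic_word[new_topic, w] <- topic_word[new_topic, w] + 1
      }
    }
  }

  # Smoothed document-topic proportions; each row sums to 1.
  theta <- (doc_topic + alpha) / rowSums(doc_topic + alpha)
  list(theta = theta, topic_word = topic_word)
}

fit <- lda_gibbs(docs, k = 2)
round(fit$theta, 2)
```

Each row of theta gives the estimated share of a document's words assigned to each topic, and the number of topics k has to be supplied up front, as noted above. Results depend on the random seed, so the topic labels themselves carry no meaning.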

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. About Packt
  4. Contributors
  5. Preface
  6. Preparing and Understanding Data
  7. Linear Regression
  8. Logistic Regression
  9. Advanced Feature Selection in Linear Models
  10. K-Nearest Neighbors and Support Vector Machines
  11. Tree-Based Classification
  12. Neural Networks and Deep Learning
  13. Creating Ensembles and Multiclass Methods
  14. Cluster Analysis
  15. Principal Component Analysis
  16. Association Analysis
  17. Time Series and Causality
  18. Text Mining
  19. Creating a Package
  20. Other Books You May Enjoy

Mastering Machine Learning with R by Cory Lesmeister is available in PDF and/or ePUB format.