Machine Learning Algorithms
eBook - ePub

Machine Learning Algorithms

Giuseppe Bonaccorso

  1. 360 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About This Book

Build a strong foundation for entering the world of Machine Learning and data science with the help of this comprehensive guide.

• Get started in the field of Machine Learning with this solid, concept-rich, yet highly practical guide.
• Your one-stop solution for mastering the whats and whys of Machine Learning algorithms and their implementation.
• Strengthen your roots (algorithms) before entering the field of Machine Learning.

Who This Book Is For
This book is for IT professionals who want to enter the field of data science and are new to Machine Learning. Familiarity with languages such as R and Python will be invaluable here.

What You Will Learn
• Acquaint yourself with the important elements of Machine Learning
• Understand the feature selection and feature engineering processes
• Assess performance and error trade-offs for Linear Regression
• Build a data model and understand how it works by using different types of algorithms
• Learn to tune the parameters of Support Vector Machines
• Apply clustering algorithms to a dataset
• Explore the concepts of Natural Language Processing and Recommendation Systems
• Create an ML architecture from scratch

In Detail
As the amount of data continues to grow at an almost incomprehensible rate, being able to understand and process it is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere: self-driving cars, spam detection, document search, trading strategies, and speech recognition. This makes machine learning well suited to the present-day era of Big Data and Data Science. The main challenge is how to transform data into actionable knowledge.

In this book you will learn all the important Machine Learning algorithms that are commonly used in the field of data science. These algorithms can be used for supervised as well as unsupervised learning, reinforcement learning, and semi-supervised learning. Among the famous algorithms and techniques covered are Linear Regression, Logistic Regression, SVM, Naive Bayes, K-Means, Random Forest, TensorFlow, and feature engineering. You will also learn how these algorithms work and how to implement them in practice to solve your problems. The book will also introduce you to Natural Language Processing and Recommendation Systems. On completing the book, you will have mastered selecting Machine Learning algorithms for clustering, classification, or regression based on your problem.

Style and approach
An easy-to-follow, step-by-step guide that will help you get to grips with real-world applications of Machine Learning algorithms.


Information

Year: 2017
ISBN: 9781785884511
Edition: 1

Topic Modeling and Sentiment Analysis in NLP

In this chapter, we're going to introduce some common topic modeling methods and discuss some of their applications. Topic modeling is a very important branch of NLP whose purpose is to extract semantic information from a corpus of documents. We're going to discuss latent semantic analysis, one of the most famous methods; it's based on the same philosophy already discussed for model-based recommendation systems. We'll also discuss its probabilistic variant, PLSA, which is aimed at building a latent factor probability model without any assumption about prior distributions. Latent Dirichlet Allocation, on the other hand, is a similar approach that assumes a Dirichlet prior distribution for the latent variables. In the last section, we're going to discuss sentiment analysis with a concrete example based on a Twitter dataset.

Topic modeling

The main goal of topic modeling in natural language processing is to analyze a corpus in order to identify common topics among documents. In this context, even if we talk about semantics, the concept has a particular meaning, driven by a very important assumption: a topic derives from the usage of particular terms in the same document, and it is confirmed when the same pattern appears across a multiplicity of different documents.
In other words, we don't consider a human-oriented semantics but a statistical modeling that works with meaningful documents (this guarantees that the usage of terms is aimed at expressing a particular concept and that, therefore, there's a human semantic purpose behind them). For this reason, the starting point of all our methods is an occurrence matrix, normally defined as a document-term matrix Mdw, where each element mij measures the importance (for example, the count or the tf-idf weight) of word j in document i (we have already discussed count vectorizing and tf-idf in Chapter 12, Introduction to NLP).
In many papers, this matrix is transposed (it's a term-document one); however, scikit-learn produces document-term matrices, and, to avoid confusion, we are going to consider this structure.

Latent semantic analysis

The idea behind latent semantic analysis is to factorize Mdw so as to extract a set of latent variables (this means that we can assume their existence, but they cannot be observed directly) that work as connectors between the documents and the terms. As discussed in Chapter 11, Introduction to Recommendation Systems, a very common decomposition method is SVD:

Mdw = U Σ V^T
However, we're not interested in the full decomposition; we are interested only in the subspace defined by the top k singular values:

Mdw ≈ Mdwk = Uk Σk Vk^T
This approximation is the best possible rank-k one in terms of the Frobenius norm (the Eckart–Young theorem), so it minimizes the reconstruction error among all matrices of the same rank. When applying it to a document-term matrix, we obtain the following decomposition:

Mdw ≈ Uk Σk Vk^T

Or, in a more compact way:

Mdwk = Mdtk Mtwk, where Mdtk = Uk Σk and Mtwk = Vk^T
Here, the first matrix defines a relationship among documents and the k latent variables, and the second a relationship among the k latent variables and words. Considering the structure of the original matrix and what was explained at the beginning of this chapter, we can consider the latent variables as topics that define a subspace onto which the documents are projected. A generic document can now be defined as a weighted combination of topics:

di ≈ mi1 t1 + mi2 t2 + ... + mik tk, with the coefficients mij taken from Mdtk
Furthermore, each topic becomes a linear combination of words. As the weight of many words is close to zero, we can decide to take only the top r words to define a topic; therefore, we get:

tj ≈ hj1 w1 + hj2 w2 + ... + hjr wr
Here, each hji is obtained after sorting the columns of Mtwk. To better understand the process, let's show a complete example based on a subset of the Brown corpus (500 sentences from the news category):
>>> from nltk.corpus import brown

>>> sentences = brown.sents(categories=['news'])[0:500]
>>> corpus = []

>>> for s in sentences:
...     corpus.append(' '.join(s))
After defining the corpus, we need to tokenize and vectorize using a tf-idf approach:
>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> vectorizer = TfidfVectorizer(strip_accents='unicode', stop_words='english', norm='l2', sublinear_tf=True)
>>> Xc = vectorizer.fit_transform(corpus).todense()
Now it's possible to apply an SVD to the Xc matrix (remember that in SciPy, the V matrix is already transposed):
>>> from scipy.linalg import svd

>>> U, s, V = svd(Xc, full_matrices=False)
As the corpus is not very small, it's useful to set the parameter full_matrices=False to compute the thin SVD and save computational time. We assume we have two topics, so we can extract our sub-matrices:
>>> import numpy as np

>>> rank = 2
>>> Uk = U[:, 0:rank]
>>> sk = np.diag(s)[0:rank, 0:rank]
>>> Vk = V[0:rank, :]
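As a sanity check of the Eckart–Young optimality property mentioned earlier, the following self-contained NumPy sketch (using a hypothetical toy matrix rather than the Brown corpus) shows that the Frobenius error of the rank-k reconstruction equals the energy of the discarded singular values:

```python
import numpy as np

# Hypothetical toy "document-term" matrix (4 documents x 5 terms)
M = np.array([[2., 1., 0., 0., 0.],
              [1., 2., 1., 0., 0.],
              [0., 0., 1., 2., 1.],
              [0., 0., 0., 1., 2.]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
Mk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # best rank-2 approximation

err = np.linalg.norm(M - Mk)                    # Frobenius norm of the residual
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))   # True
```

No rank-2 matrix can achieve a smaller residual, which is why truncating the SVD is a principled way to compress the document-term matrix onto a topic subspace.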
If we want to analyze the top 10 words per topic, we need to consid...
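A plausible sketch of this top-word analysis ranks each topic's words by absolute weight in the topic-word matrix. The snippet below uses hypothetical stand-ins for Vk and the fitted vocabulary so that it runs standalone:

```python
import numpy as np

# Hypothetical stand-ins for the variables built earlier in the chapter:
# Vk (topic-word weight matrix) and the vectorizer's vocabulary
rank = 2
rng = np.random.default_rng(1000)
words = np.array(['word%d' % i for i in range(50)])
Vk = rng.normal(size=(rank, len(words)))

for t in range(rank):
    # Indices of the 10 words with the largest absolute weight in topic t
    top10 = np.argsort(np.abs(Vk[t]))[::-1][:10]
    print('Topic %d:' % t, list(words[top10]))
```

With the real variables from the previous snippets, `words` would come from `vectorizer.get_feature_names_out()` and each row of Vk would yield the 10 most characteristic terms of that topic.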
