Natural Language Processing and Computational Linguistics

A practical guide to text analysis with Python, Gensim, spaCy, and Keras

Bhargav Srinivasa-Desikan

About this book

Work with Python and powerful open source tools such as Gensim and spaCy to perform modern text analysis, natural language processing, and computational linguistics algorithms.

About This Book
  ‱ Discover the open source Python text analysis ecosystem, using spaCy, Gensim, scikit-learn, and Keras
  ‱ Hands-on text analysis with Python, featuring natural language processing and computational linguistics algorithms
  ‱ Learn deep learning techniques for text analysis

Who This Book Is For
This book is for you if you want to dive in, hands-first, into the interesting world of text analysis and NLP, and you're ready to work with the rich Python ecosystem of tools and datasets waiting for you!

What You Will Learn
  ‱ Why text analysis is important in our modern age
  ‱ Understand NLP terminology and get to know the Python tools and datasets
  ‱ Learn how to pre-process and clean textual data
  ‱ Convert textual data into vector space representations
  ‱ Use spaCy to process text
  ‱ Train your own NLP models for computational linguistics
  ‱ Use statistical learning and topic modeling algorithms for text, using Gensim and scikit-learn
  ‱ Employ deep learning techniques for text analysis using Keras

In Detail
Modern text analysis is now very accessible using Python and open source tools, so discover how you can now perform modern text analysis in this era of textual data. This book shows you how to use natural language processing and computational linguistics algorithms to make inferences and gain insights about the data you have. These algorithms are based on statistical machine learning and artificial intelligence techniques. The tools to work with these algorithms are available to you right now, with Python, and tools like Gensim and spaCy.

You'll start by learning about data cleaning, and then how to perform computational linguistics from first concepts. You're then ready to explore the more sophisticated areas of statistical NLP and deep learning using Python, with realistic language and text samples. You'll learn to tag, parse, and model text using the best tools. You'll gain hands-on knowledge of the best frameworks to use, and you'll know when to choose a tool like Gensim for topic models, and when to work with Keras for deep learning. This book balances theory and practical hands-on examples, so you can learn about and conduct your own natural language processing projects and computational linguistics. You'll discover the rich ecosystem of Python tools you have available to conduct NLP, and enter the interesting world of modern text analysis.

Style and approach
The book teaches NLP from the angle of a practitioner as well as that of a student. This is a tad unusual, but given the enormous speed at which new algorithms and approaches travel from scientific beginnings to industrial implementation, first principles can be clarified with the help of entirely practical examples.

Information

Year: 2018
ISBN: 9781788837033

Word2Vec, Doc2Vec, and Gensim

We have talked about vectors a lot throughout the book – they are used to understand and represent our textual data in mathematical form, and all the machine learning methods we use rely on these representations. We will now take this one step further and use machine learning techniques to generate vector representations of words that better encapsulate their meaning. Such representations are generally referred to as word embeddings, and Word2Vec and Doc2Vec are two popular variants. In this chapter, we will cover the following topics:
  • Word2Vec
  • Doc2Vec
  • Other word embeddings

Word2Vec

Arguably the most important application of machine learning in text analysis, the Word2Vec algorithm is both a fascinating and very useful tool. As the name suggests, it creates a vector representation of words based on the corpus we are using. But the magic of Word2Vec is in how it manages to capture the semantic representation of words in a vector. The papers, Efficient Estimation of Word Representations in Vector Space [1] [Mikolov and others, 2013], Distributed Representations of Words and Phrases and their Compositionality [2] [Mikolov and others, 2013], and Linguistic Regularities in Continuous Space Word Representations [3] [Mikolov and others, 2013] lay the foundations for Word2Vec and describe their uses.
We've mentioned that these word vectors help represent the semantics of words – what exactly does this mean? Well, for starters, it means we can use vector reasoning with these words. One of the most famous examples comes from Mikolov's paper: if we take the word vectors and compute V(King) - V(Man) + V(Woman) (here, we use V(word) to represent the vector representation of the word), the resulting vector is closest to V(Queen). It is easy to see why this is remarkable – our intuitive understanding of these words is reflected in their learned vector representations!
This gives us the ability to add more of a punch to our text analysis pipelines – having vectors that capture the semantics of words (and, by extension, documents – but we'll get to that later) will come in handy more than once.
Finding word-pair relationships is one such interesting use – if we define a relationship between two words such as France : Paris, using the appropriate vector difference we can identify other similar relationships – Italy : Rome, Japan : Tokyo are two such examples which are found using Word2Vec. We can continue to play with these vectors like any other vectors – by adding two vectors, we can attempt to get what we would consider the addition of two words. For example, V(Vietnam) + V(Capital) is closest to the vector representation of V(Hanoi).
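These analogies are easy to try for yourself once you have a set of trained vectors. The following is a minimal sketch, assuming gensim's downloader API is available in your installation and that the pretrained model name below is valid for your version (any pretrained set of word vectors would work the same way):

import gensim.downloader as api

# Load pretrained word vectors (the model name is an assumption about
# what is available through the gensim-data downloader).
vectors = api.load("word2vec-google-news-300")

# V(King) - V(Man) + V(Woman) should land close to V(Queen)
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# France : Paris :: Italy : ?  (expecting something like Rome)
print(vectors.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))

# V(Vietnam) + V(capital) should land close to V(Hanoi)
print(vectors.most_similar(positive=["Vietnam", "capital"], topn=1))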
How exactly does this technique result in such an understanding of words? Word2Vec works by learning from context – in particular, which words tend to appear around a given word? We choose a sliding window size and, based on this window, attempt to identify the conditional probability of observing the output word given its surrounding words. For example, if the sentence is The personal nature of text data always adds an extra bit of motivation, and it also likely means we are aware of the nature of the data, and what kind of results to expect., and our target word is the word in bold, motivation, we try to figure out the odds of finding the word motivation when the context is always adds an extra bit of on the left-hand side of the window and and it also likely means on the right. Of course, this is just an illustrative example – the exact training procedure requires us to choose a window size and the number of dimensions, among other details.
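To make the notion of a context window concrete, here is a small illustrative sketch (not the actual Word2Vec training code, just a hypothetical helper) that collects (context, target) pairs from a tokenized sentence for a given window size:

def context_pairs(tokens, window=2):
    # Pair each target word with the words that fall within
    # `window` positions on either side of it.
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        pairs.append((context, target))
    return pairs

sentence = "text data always adds an extra bit of motivation".split()
for context, target in context_pairs(sentence, window=2):
    print(target, "<-", context)

Pairs like these are the raw material of training: the skip-gram model predicts the context words from the target word, while CBOW predicts the target word from its context.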
There are two main methods to perform Word2Vec training, which are the Continuous Bag of Words (CBOW) model and the Skip-gram model. The underlying architecture of these models is described in the original research paper, but both of these methods involve understanding the context we talked about before. The papers written by Mikolov and others provide further details of the training process, and since the code is public, we actually know what's going on under the hood!
The blog post [4], Word2Vec Tutorial - The Skip-Gram Model, by Chris McCormick explains some of the mathematical intuition behind the skip-gram Word2Vec model, and the post [5], The amazing power of word vectors, by Adrian Colyer talks about some of the things we can do with Word2Vec. The links are useful if you wish to dig a little deeper into the mathematical details of Word2Vec, a topic we will not be covering in this chapter. The resources page [6] contains theory and code resources for Word2Vec and is also useful in case you wish to look up the original material or other implementation details.
While Word2Vec remains the most popular word vector implementation, this is not the first time it has been attempted, and certainly not the last either – we will discuss some of the other word embeddings techniques in the last section of this chapter. Right now, let's jump into using these word vectors ourselves.
Gensim comes to our assistance again: it provides arguably the most reliable open source implementation of the algorithm, and we will explore how to use it.

Using Word2Vec with Gensim

While the original C code [7] released by Google does an impressive job, Gensim's implementation is a case where an open source implementation is more efficient than the original.
The Gensim implementation was coded up back in 2013, around the time the original algorithm was released – the blog post by Radim ƘehƯƙek [8] chronicles some of the thoughts and problems encountered in implementing it for Gensim, and is worth reading if you would like to know the process of coding word2vec in Python. The interactive web tutorial [9] involving Word2Vec is quite fun and illustrates some of the examples of Word2Vec we previously talked about. It is worth looking at if you're interested in running Gensim's Word2Vec code online, and can also serve as a quick tutorial on using Word2Vec in Gensim.
We will now get into actually training our own Word2Vec model. The first step, as with all the other Gensim models we have used, involves importing the appropriate model:
from gensim.models import word2vec
At this point, it is important to go through the documentation for the word2vec class, as well as the KeyedVectors class, both of which we will use a lot. From the documentation page, we list the parameters for the word2vec.Word2Vec class:
  1. sg: This defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.
  2. size: This is the dimensionality of the feature vectors.
  3. window: This is the maximum distance between the current and predicted word within a sentence.
  4. alpha: This is the initial learning rate (will linearly drop to min_alpha as training progresses).
  5. seed: This is used for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires the use of the PYTHONHASHSEED environment variable to control hash randomization.)
  6. min_count: Ignore all words with a total frequency lower than this.
  7. max_vocab_size: Limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1 GB of RAM. Set to None for no limit (default).
  8. sample: This is the ...
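To see how these parameters come together in practice, here is a rough sketch of training a model on a toy corpus. The sentences are made up purely for illustration, and keyword names can vary between Gensim versions (for example, size was later renamed vector_size):

from gensim.models import word2vec

# A toy corpus: Word2Vec expects an iterable of tokenized sentences.
sentences = [
    ["text", "analysis", "with", "python", "is", "fun"],
    ["word", "vectors", "capture", "semantic", "relationships"],
    ["gensim", "implements", "word2vec", "efficiently"],
]

model = word2vec.Word2Vec(
    sentences,
    sg=1,         # skip-gram; sg=0 would use CBOW
    size=100,     # dimensionality of the word vectors
    window=5,     # maximum distance between target and context words
    min_count=1,  # keep even rare words in this tiny toy corpus
    seed=42,
)

# The trained vectors live in model.wv, a KeyedVectors instance.
print(model.wv["python"].shape)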
