Natural Language Processing and Computational Linguistics

A practical guide to text analysis with Python, Gensim, spaCy, and Keras

Bhargav Srinivasa-Desikan

About This Book

Work with Python and powerful open source tools such as Gensim and spaCy to perform modern text analysis, natural language processing, and computational linguistics.

  • Discover the open source Python text analysis ecosystem, using spaCy, Gensim, scikit-learn, and Keras
  • Hands-on text analysis with Python, featuring natural language processing and computational linguistics algorithms
  • Learn deep learning techniques for text analysis

Who This Book Is For

This book is for you if you want to dive in, hands-first, into the interesting world of text analysis and NLP, and you're ready to work with the rich Python ecosystem of tools and datasets waiting for you!

What You Will Learn

  • Why text analysis is important in our modern age
  • Understand NLP terminology and get to know the Python tools and datasets
  • Learn how to pre-process and clean textual data
  • Convert textual data into vector space representations
  • Use spaCy to process text
  • Train your own NLP models for computational linguistics
  • Use statistical learning and Topic Modeling algorithms for text, using Gensim and scikit-learn
  • Employ deep learning techniques for text analysis using Keras

In Detail

Modern text analysis is now very accessible using Python and open source tools, so discover how you can perform modern text analysis in this era of textual data. This book shows you how to use natural language processing and computational linguistics algorithms to make inferences and gain insights about the data you have. These algorithms are based on statistical machine learning and artificial intelligence techniques. The tools to work with these algorithms are available to you right now, with Python, and tools like Gensim and spaCy.

You'll start by learning about data cleaning, and then how to perform computational linguistics from first concepts. You're then ready to explore the more sophisticated areas of statistical NLP and deep learning using Python, with realistic language and text samples. You'll learn to tag, parse, and model text using the best tools. You'll gain hands-on knowledge of the best frameworks to use, and you'll know when to choose a tool like Gensim for topic models, and when to work with Keras for deep learning.

This book balances theory and practical hands-on examples, so you can learn about and conduct your own natural language processing projects and computational linguistics. You'll discover the rich ecosystem of Python tools available to conduct NLP, and enter the interesting world of modern text analysis.

Style and approach

The book teaches NLP from the angle of a practitioner as well as that of a student. This is a tad unusual, but given the enormous speed at which new algorithms and approaches travel from scientific beginnings to industrial implementation, first principles can be clarified with the help of entirely practical examples.

Word2Vec, Doc2Vec, and Gensim

We have previously talked about vectors a lot throughout the book – they are used to understand and represent our textual data in a mathematical form, and all the machine learning methods we use rely on these representations. We will be taking this one step further, using machine learning techniques to generate vector representations of words that better encapsulate the meaning of a word. This technique is generally referred to as word embeddings, and Word2Vec and Doc2Vec are two of the most popular variations. In this chapter, we will cover the following topics:
  • Word2Vec
  • Doc2Vec
  • Other word embeddings

Word2Vec

Arguably the most important application of machine learning in text analysis, the Word2Vec algorithm is both a fascinating and very useful tool. As the name suggests, it creates a vector representation of words based on the corpus we are using. But the magic of Word2Vec is in how it manages to capture the semantic representation of words in a vector. The papers, Efficient Estimation of Word Representations in Vector Space [1] [Mikolov and others, 2013], Distributed Representations of Words and Phrases and their Compositionality [2] [Mikolov and others, 2013], and Linguistic Regularities in Continuous Space Word Representations [3] [Mikolov and others, 2013] lay the foundations for Word2Vec and describe their uses.
We've mentioned that these word vectors help represent the semantics of words – what exactly does this mean? Well, for starters, it means we can use vector reasoning for these words. One of the most famous examples is from Mikolov's paper, where we see that if we perform V(King) - V(Man) + V(Woman) (here, we use V(word) to represent the vector representation of the word), the resulting vector is closest to V(Queen). It is easy to see why this is remarkable – our intuitive understanding of these words is reflected in the learned vector representations of the words!
This gives us the ability to add more of a punch to our text analysis pipelines – having an intuitive semantic representation of vectors (and by extension, documents, but we'll get to that later) will come in handy more than once.
Finding word-pair relationships is one such interesting use – if we define a relationship between two words such as France : Paris, using the appropriate vector difference we can identify other similar relationships – Italy : Rome, Japan : Tokyo are two such examples which are found using Word2Vec. We can continue to play with these vectors like any other vectors – by adding two vectors, we can attempt to get what we would consider the addition of two words. For example, V(Vietnam) + V(Capital) is closest to the vector representation of V(Hanoi).
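To make this concrete, here is a minimal sketch of that vector arithmetic using Gensim's KeyedVectors interface. It assumes a small set of pretrained word vectors fetched through gensim.downloader (the model name below is just one convenient choice, and the exact neighbours returned will vary with the vectors used); the same most_similar calls work on any trained Word2Vec model, including one we train ourselves later in this chapter.

import gensim.downloader as api

# download (on first use) a small set of pretrained word vectors;
# any trained word vectors expose the same KeyedVectors interface
word_vectors = api.load("glove-wiki-gigaword-100")

# V(King) - V(Man) + V(Woman) should land near V(Queen)
print(word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# France : Paris as Italy : ? - we expect something close to "rome"
print(word_vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))

# V(Vietnam) + V(Capital) should be close to V(Hanoi)
print(word_vectors.most_similar(positive=["vietnam", "capital"], topn=1))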
How exactly does this technique result in such an understanding of words? Word2Vec works by understanding context – in particular, which words tend to appear around a given word? We choose a sliding window size, and based on this window size, attempt to identify the conditional probability of observing the output word given the surrounding words. For example, if the sentence is The personal nature of text data always adds an extra bit of motivation, and it also likely means we are aware of the nature of the data, and what kind of results to expect., and our target word is motivation, we try to figure out the odds of finding the word motivation if the context is always adds an extra bit of on the left-hand side of the window and and it also likely means on the right. Of course, this is just an illustrative example – the exact training procedure requires us to choose a window size and the number of dimensions, among other details.
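As a purely illustrative sketch (not the actual training code, which Gensim handles for us), the following snippet shows what "context" means here: for every target word, the words falling inside the sliding window on either side form its context.

# illustrative only: collect the context window around each target word
sentence = ("the personal nature of text data always adds an extra bit of "
            "motivation and it also likely means we are aware of the nature "
            "of the data and what kind of results to expect").split()
window_size = 5

def context_pairs(tokens, window_size):
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        yield target, left + right

for target, context in context_pairs(sentence, window_size):
    if target == "motivation":
        # prints the five words on either side of "motivation"
        print(target, "->", context)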
There are two main methods to perform Word2Vec training: the Continuous Bag of Words (CBOW) model and the Skip-gram model. The underlying architecture of these models is described in the original research paper, but both of these methods involve understanding the context we talked about before. The papers written by Mikolov and others provide further details of the training process, and since the code is public, it means we actually know what's going on under the hood!
The blog post [4], Word2Vec Tutorial - The Skip-Gram Model, by Chris McCormick explains some of the mathematical intuition behind the skip-gram Word2Vec model, and the post [5], The amazing power of word vectors, by Adrian Colyer talks about some of the things we can do with Word2Vec. The links are useful if you wish to dig a little deeper into the mathematical details of Word2Vec, a topic we will not be covering in this chapter. The resources page [6] contains theory and code resources for Word2Vec and is also useful in case you wish to look up the original material or other implementation details.
While Word2Vec remains the most popular word vector implementation, it is not the first attempt at learning such representations, and certainly not the last either – we will discuss some of the other word embedding techniques in the last section of this chapter. Right now, let's jump into using these word vectors ourselves.
Gensim comes to our assistance again, providing arguably the most reliable open source implementation of the algorithm, and we will explore how to use it.

Using Word2Vec with Gensim

While the original C code [7] released by Google does an impressive job, Gensim's implementation is a case where an open source implementation is more efficient than the original.
The Gensim implementation was coded up back in 2013, around the time the original algorithm was released – the blog post by Radim Řehůřek [8] chronicles some of the thoughts and problems encountered in implementing it for Gensim, and is worth reading if you would like to know the process of coding Word2Vec in Python. The interactive web tutorial [9] involving Word2Vec is quite fun and illustrates some of the examples of Word2Vec we previously talked about. It is worth looking at if you're interested in running Gensim Word2Vec code online, and can also serve as a quick tutorial on using Word2Vec in Gensim.
We will now get into actually training our own Word2Vec model. The first step, as with all the other Gensim models we have used, involves importing the appropriate model:
from gensim.models import word2vec
At this point, it is important to go through the documentation for the Word2Vec class, as well as the KeyedVectors class, both of which we will use a lot. From the documentation page, we list the parameters for the word2vec.Word2Vec class; a minimal training sketch using these parameters follows the list.
  1. sg: This defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.
  2. size: This is the dimensionality of the feature vectors.
  3. window: This is the maximum distance between the current and predicted word within a sentence.
  4. alpha: This is the initial learning rate (will linearly drop to min_alpha as training progresses).
  5. seed: This is used for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires the use of the PYTHONHASHSEED environment variable to control hash randomization.)
  6. min_count: Ignore all words with a total frequency lower than this.
  7. max_vocab_size: Limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1 GB of RAM. Set to None for no limit (default).
  8. sample: This is the ...
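To tie the parameters above together, here is a minimal training sketch. The tiny common_texts corpus bundled with Gensim is only a stand-in for your own pre-processed, tokenised sentences, and the parameter names follow the list above; note that in Gensim 4.x the size parameter has been renamed vector_size.

from gensim.models import word2vec
from gensim.test.utils import common_texts  # tiny toy corpus bundled with Gensim

model = word2vec.Word2Vec(
    common_texts,  # an iterable of tokenised sentences (lists of strings)
    sg=0,          # 0 = CBOW (the default), 1 = skip-gram
    size=100,      # dimensionality of the word vectors (vector_size in Gensim 4.x)
    window=5,      # maximum distance between the current and predicted word
    min_count=1,   # keep even rare words, since this corpus is tiny
    seed=42,       # together with workers=1, helps reproducibility as noted above
    workers=1,
)

# the trained vectors live on a KeyedVectors object under model.wv
print(model.wv.most_similar("computer", topn=3))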
