Advanced Natural Language Processing with TensorFlow 2

Build effective real-world NLP applications using NER, RNNs, seq2seq models, Transformers, and more

Ashish Bansal


About This Book

A one-stop solution for NLP practitioners, ML developers, and data scientists to build effective NLP systems that can perform complex real-world tasks

Key Features

  • Apply deep learning algorithms and techniques such as BiLSTMs, CRFs, and BPE using TensorFlow 2
  • Explore applications such as text generation, summarization, weakly supervised labelling, and more
  • Read cutting-edge material, with the seminal papers referenced and full working code provided in the GitHub repository

Book Description

Recently, there have been tremendous advances in NLP, and these advances are now moving out of research labs and into practical applications. This book blends the theoretical and practical aspects of these current, complex NLP techniques.

The book focuses on innovative applications in the field of NLP, language generation, and dialogue systems. It helps you apply text pre-processing techniques such as tokenization, part-of-speech tagging, and lemmatization with popular libraries such as Stanford NLP and spaCy. You will then build Named Entity Recognition (NER) from scratch using Conditional Random Fields (CRFs) and Viterbi decoding on top of RNNs.
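
As a flavour of this kind of pre-processing, here is a minimal sketch using spaCy; the sample sentence and the small English pipeline (en_core_web_sm) are illustrative choices rather than examples from the book:

```python
import spacy

# Load spaCy's small English pipeline (assumes it was installed with
# `python -m spacy download en_core_web_sm`).
nlp = spacy.load("en_core_web_sm")

doc = nlp("The striped bats were hanging on their feet.")

# Tokenization, part-of-speech tagging, and lemmatization happen in one pass.
for token in doc:
    print(token.text, token.pos_, token.lemma_)
```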

The book covers key emerging areas such as generating text for use in sentence completion and text summarization, bridging images and text by generating captions for images, and managing dialogue aspects of chatbots. You will learn how to apply transfer learning and fine-tuning using TensorFlow 2.

Further, it covers practical techniques that can simplify the labelling of textual data. For each technique, the book provides working code that you can adapt to your own use cases.

By the end of the book, you will have advanced knowledge of the tools, techniques, and deep learning architectures used to solve complex NLP problems.

What you will learn

  • Grasp important preliminary steps in building NLP applications, such as POS tagging
  • Apply transfer learning and weakly supervised learning using libraries such as Snorkel
  • Perform sentiment analysis using BERT
  • Apply encoder-decoder neural network architectures and beam search for summarizing text
  • Use Transformer models with attention to bring images and text together
  • Build apps that generate captions and answer questions about images using custom Transformers
  • Use advanced TensorFlow techniques such as learning rate annealing, custom layers, and custom loss functions to build the latest DeepNLP models (a brief sketch of two of these techniques follows this list)
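
For example, a learning rate annealing schedule and a custom layer can be set up in TensorFlow 2 roughly as follows; this is a generic sketch of the relevant Keras APIs (the layer name and hyperparameter values are illustrative), not code taken from the book:

```python
import tensorflow as tf

# Learning rate annealing: decay the learning rate exponentially during training.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

# A custom layer: a dense projection followed by layer normalization.
class ProjectionLayer(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(units, activation="relu")
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, inputs):
        return self.norm(self.dense(inputs))
```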

Who this book is for

This is not an introductory book; it assumes the reader is familiar with the basics of NLP and has fundamental Python skills, as well as basic knowledge of machine learning and undergraduate-level calculus and linear algebra.

The readers who will benefit most from this book are intermediate ML developers who are familiar with the basics of supervised learning and deep learning techniques, and professionals who already use TensorFlow/Python for data science, ML, research, analysis, and similar purposes.


Information

Year: 2021
ISBN: 9781800201057
Edition: 1
Pages: 380
Language: English
Format: ePUB

7

Multi-Modal Networks and Image Captioning with ResNets and Transformer Networks

"A picture is worth a thousand words" is a famous adage. In this chapter, we'll put this adage to the test and generate captions for an image. In doing so, we'll work with multi-modal networks. Thus far, we have operated on text as input. Humans can handle multiple sensory inputs together to make sense of the environment around them. We can watch a video with subtitles and combine the information provided to understand the scene. We can use facial expressions and lip movement along with sounds to understand speech. We can recognize text in an image, and we can answer natural language questions about images. In other words, we have the ability to process information from different modalities at the same time, and then put them together to understand the world around us. The future of artificial intelligence and deep learning is in building multi-modal networks as they closely mimic human cognitive functions.
Recent advances in image, speech, and text processing lay a solid foundation for multi-modal networks. This chapter transitions you from the world of NLP to the world of multi-modal learning, where we will combine visual and textual features using the familiar Transformer architecture.
We will cover the following topics in this chapter:
  • Overview of multi-modal deep learning
  • Vision and language tasks
  • Detailed overview of the Image Captioning task and the MS-COCO dataset
  • The architecture of residual networks (ResNets)
  • Extracting features from images using a pre-trained ResNet50 (a brief sketch of this step follows the list)
  • Building a full Transformer model from scratch
  • Ideas for improving the performance of image captioning
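As a preview of that feature-extraction step, the following is a minimal sketch using the Keras applications API in TensorFlow 2; the input size and preprocessing choices are illustrative and may differ from the exact configuration used later in the chapter:

```python
import tensorflow as tf

# Pre-trained ResNet50 without its classification head; the final convolutional
# feature map serves as the image representation for captioning.
feature_extractor = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))

def extract_features(image_batch):
    # image_batch: float tensor of shape (batch, 224, 224, 3) with raw pixel values.
    preprocessed = tf.keras.applications.resnet50.preprocess_input(image_batch)
    return feature_extractor(preprocessed)  # shape (batch, 7, 7, 2048)
```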
Our journey starts with an overview of the various tasks in the visual understanding domain, with a focus on tasks that combine language and images.

Multi-modal deep learning

The dictionary definition of "modality" is "a particular mode in which something exists or is experienced or expressed." Sensory modalities, like touch, taste, smell, vision, and sound, allow humans to experience the world around them. Suppose you are out at a farm picking strawberries, and your friend tells you to pick ones that are ripe and red. The instruction "ripe and red strawberries" is processed and converted into visual and haptic criteria. As you see and feel each strawberry, you know instinctively whether it matches those criteria. This is an example of multiple modalities working together on a single task. As you can imagine, such capabilities are essential for robotics.
As a direct application of the preceding example, consider a harvesting robot that needs to pick ripe, ready fruit. In December 1976, Harry McGurk and John MacDonald published a paper titled Hearing lips and seeing voices (https://www.nature.com/articles/264746a0) in the journal Nature. They recorded a video of a young woman talking, in which utterances of the syllable ba had been dubbed onto lip movements of the syllable ga. When this video was played back to adults, people reported hearing the syllable da. When the audio track was played without the video, the correct syllable was reported. This paper highlighted the role of vision in speech recognition, and speech recognition models that use lip-reading information were later developed in the field of Audio-Visual Speech Recognition (AVSR). There are several exciting applications of multi-modal deep learning models in medical devices and diagnosis, learning technology, and other Artificial Intelligence (AI) areas.
Let's drill down into the specific interaction of vision and language and the various tasks we can perform.

Vision and language tasks

A combination of Computer Vision (CV) and Natural Language Processing (NLP) allows us to build smart AI systems that can see and talk, and together they give rise to interesting tasks for model development. Taking an image and generating a caption for it is a well-known task. A practical application is generating alt-text tags for images on web pages; visually impaired readers use screen readers, which can read these tags aloud along with the rest of the page, improving the accessibility of web pages. Other topics in this area include video captioning and storytelling – composing a story from a sequence of images. The following image shows some examples of images and captions. Our primary focus in this chapter is on image captioning:
Figure 7.1: Example images with captions
Visual Question Answering (VQA) is the challenging task of answering questions about objects in the image. The following image shows some examples from the VQA dataset. Compared to image captioning, where prominent objects are reflected in the caption, VQA is a more complex task. Answering the question may also require some reasoning.
Consider the bottom-right panel in the following image. Answering the question, "Does this person have 20/20 vision?" requires reasoning. Datasets for VQA are available at visualqa.org:
Figure 7.2: Examples from the VQA Dataset (Source: VQA: Visual Question Answering by Agrawal et al.)
Reasoning leads to another challenging but fascinating task – Visual Commonsense Reasoning (VCR). When we look at an image, we can guess emotions and actions and frame a hypothesis about what is happening. This is quite easy for people and may even happen without conscious effort. The aim of VCR is to build models that can do the same, and that can also explain, or choose an appropriate reason for, the logical inference that has been made. The following image shows an example from the VCR dataset. More details on the VCR dataset can be found at visualcommonsense.com:
Figure 7.3: VCR example (Source: From Recognition to Cognition: Visual Commonsense Reasoning by Zellers et al.)
Thus far, we have gone from images to text. The reverse is also possible and is an active area of research: generating images or videos from text using GANs and other generative architectures. Imagine being able to generate an illustrated comic book from the text of a story! This particular task is currently at the forefront of research.
A critical concept in this area is visual grounding. Grounding ties concepts in language to the real world; simply put, it matches words to objects in a picture. By combining vision and language, we can ground concepts from language to parts of an image. For example, mapping the word "basketball" to something that looks like one in an image is visual grounding. More abstract concepts can also be grounded: a short elephant and a short person, for instance, have very different measurements. Grounding gives us a way to see what models are learning and helps us guide them in the right direction.
Now that we have a proper perspective on vision and language tasks, let's dive deep into an image captioning task.

Image captioning

Image captioning is all about describing the contents of an image in a sentence. Captions can help in content-based image retrieval and visual search. We already discussed how captions could improve the accessibility of websites by making it easier for screen readers to summarize the content of an image. A caption can be considered a summary of the image. Once we frame the problem as an image summarization problem, we can adapt the seq2seq model from the previous chapter to solve...
