Natural Language Processing with Java
eBook - ePub

Natural Language Processing with Java

Techniques for building machine learning and neural network models for NLP, 2nd Edition

  1. 318 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android
About this book

Explore various approaches to organize and extract useful text from unstructured data using Java

Key Features

  • Use deep learning and NLP techniques in Java to discover hidden insights in text
  • Work with popular Java libraries such as CoreNLP, OpenNLP, and Mallet
  • Explore machine translation, identifying parts of speech, and topic modeling

Book Description

Natural Language Processing (NLP) allows you to take any sentence and identify patterns, special names, company names, and more. The second edition of Natural Language Processing with Java teaches you how to perform language analysis with Java libraries and how to draw insights from the results.

You'll start by understanding how NLP and its various concepts work. Having got to grips with the basics, you'll explore important tools and libraries in Java for NLP, such as CoreNLP, OpenNLP, Neuroph, and Mallet. You'll then start performing NLP tasks on different inputs, such as tokenization, model training, part-of-speech tagging, and building parse trees. You'll learn about statistical machine translation, summarization, dialog systems, complex searches, supervised and unsupervised NLP, and more.

By the end of this book, you'll have learned more about NLP, neural networks, and various other trained models in Java for enhancing the performance of NLP applications.

What you will learn

  • Understand basic NLP tasks and how they relate to one another
  • Discover and use the available tokenization engines
  • Apply search techniques to find people, as well as things, within a document
  • Construct solutions to identify parts of speech within sentences
  • Use parsers to extract relationships between elements of a document
  • Identify topics in a set of documents
  • Explore topic modeling from a document

Who this book is for

Natural Language Processing with Java is for you if you are a data analyst, data scientist, or machine learning engineer who wants to extract information from a language using Java. Knowledge of Java programming is needed, while a basic understanding of statistics will be useful but not mandatory.

Finding Parts of Text

Finding parts of text is concerned with breaking text down into individual units, called tokens, and optionally performing additional processing on those tokens. This additional processing can include stemming, lemmatization, stopword removal, synonym expansion, and converting text to lowercase.
We will first demonstrate several tokenization techniques found in the standard Java distribution. These are included because they are sometimes all you need to do the job, with no NLP library to import. However, these techniques are limited. We will then discuss the specific tokenizers and tokenization approaches supported by NLP APIs. These examples provide a reference for how the tokenizers are used and the type of output they produce. Finally, we will give a simple comparison of the differences between the approaches.
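As a rough illustration of what the standard Java distribution offers, the following sketch tokenizes the same string with String.split, StringTokenizer, and Scanner. The class and method names are invented for this example and are not taken from the book's own listings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;

// Hypothetical helper class; the names are illustrative only.
final class StandardJavaTokenizers {

    // Splits on runs of whitespace using a regular expression.
    static List<String> withSplit(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // StringTokenizer's default delimiters are space, tab, newline,
    // carriage return, and form feed.
    static List<String> withStringTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Scanner uses whitespace as its default delimiter pattern.
    static List<String> withScanner(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        System.out.println(withSplit(text));
        System.out.println(withStringTokenizer(text));
        System.out.println(withScanner(text));
    }
}
```

For this input, all three approaches yield the same five tokens, and punctuation stays attached to the neighboring word (for example, "pause," and "reflect."). This is one of the limitations of whitespace-only tokenization.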
There are many specialized tokenizers. For example, the Apache Lucene project supports tokenizers for various languages and specialized documents. The WikipediaTokenizer class is a tokenizer that handles Wikipedia-specific documents, and the ArabicAnalyzer class handles Arabic text. It is not possible to illustrate all of these varying approaches here.
We will also examine how certain tokenizers can be trained to handle specialized text. This can be useful when a different form of text is encountered. It can often eliminate the need to write a new and specialized tokenizer.
Next, we will illustrate how some of these tokenizers can be used to support specific operations, such as stemming, lemmatization, and stopword removal. Part-of-speech (POS) detection can also be considered a special instance of finding parts of text. However, that topic is covered in Chapter 5, Detecting Parts of Speech.
We will cover the following topics in this chapter:
  • What is tokenization?
  • Uses of tokenizers
  • NLP tokenizer APIs
  • Understanding normalization

Understanding the parts of text

There are a number of ways to categorize parts of text. For example, we may be concerned with character-level issues, such as punctuation, with a possible need to ignore or expand contractions. At the word level, we may need to perform different operations, such as the following:
  • Identifying morphemes using stemming and/or lemmatization
  • Expanding abbreviations and acronyms
  • Isolating number units
We cannot always split words at punctuation, because the punctuation is sometimes considered to be part of the word, as in the contraction can't. We may also be concerned with grouping multiple words to form meaningful phrases. Sentence detection can also be a factor, since we do not necessarily want to group words that cross sentence boundaries.
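The contraction problem can be made concrete with a small sketch. It compares a naive split on non-word characters with a regular expression that keeps an apostrophe when it sits between letters. The class name and the pattern are assumptions made for this illustration. The pattern only handles the ASCII apostrophe and is not a complete treatment of English contractions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not part of any NLP library.
final class ContractionAwareTokenizer {

    // Letters or digits, optionally followed by apostrophe-plus-letters
    // groups, so that "can't" is matched as a single token.
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:'\\p{L}+)*");

    // Naive approach: every non-word character is a delimiter.
    static List<String> naive(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\W+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Keeps apostrophes that appear inside a word.
    static List<String> keepContractions(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "We can't stop now.";
        System.out.println(naive(text));            // [We, can, t, stop, now]
        System.out.println(keepContractions(text)); // [We, can't, stop, now]
    }
}
```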
In this chapter, we are primarily concerned with the tokenization process and a few specialized techniques, such as stemming. We will not attempt to show how they are used in other NLP tasks. Those efforts are reserved for later chapters.

What is tokenization?

Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Tokens are split based on a set of delimiters, which are frequently whitespace characters. At times, however, a different set of delimiters is needed. For example, different delimiters can be useful when whitespace delimiters obscure text breaks, such as paragraph boundaries, and detecting these text breaks is important. Whitespace in Java is defined by the Character class's isWhitespace method, and the characters it accepts are listed in the following table:
Character                 Meaning
Unicode space character   space_separator, line_separator, or paragraph_separator
\t                        U+0009 horizontal tabulation
\n                        U+000A line feed
\u000B                    U+000B vertical tabulation
\f                        U+000C form feed
\r                        U+000D carriage return
\u001C                    U+001C file separator
\u001D                    U+001D group separator
\u001E                    U+001E record separator
\u001F                    U+001F unit separator
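The whitespace characters listed above can be checked directly against Character.isWhitespace. The following sketch uses an invented class name and passes code points as integers. It also includes the non-breaking spaces U+00A0, U+2007, and U+202F, which the method's documentation explicitly excludes even though they are Unicode space characters.

```java
// Hypothetical demonstration class for Character.isWhitespace.
final class WhitespaceCheck {

    // Formats a code point together with Java's whitespace verdict.
    static String describe(int codePoint) {
        return String.format("U+%04X -> %b",
                codePoint, Character.isWhitespace(codePoint));
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0020, 0x0009, 0x000A, 0x000B, 0x000C, 0x000D,
            0x001C, 0x001D, 0x001E, 0x001F,
            0x00A0, 0x2007, 0x202F // non-breaking spaces: reported as false
        };
        for (int codePoint : samples) {
            System.out.println(describe(codePoint));
        }
    }
}
```

One consequence is that text copied from web pages, which often contains U+00A0, may not split where a reader would expect when a tokenizer relies on this definition of whitespace.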
The tokenization process is complicated by a large number of factors, such as the following:
  • Language: Different languages present unique challenges. Whitespace is a commonly used delimiter, but it will not be sufficient if we need to work with Chinese, where words are not separated by whitespace.
  • Text format: Text is often stored or presented using different formats. Processing HTML or other markup, as opposed to plain text, complicates the tokenization process.
  • Stopwords: Commonly-used words might not be important for some NLP tasks, such as general searches. These common words are called stopwords. Stopwords are sometimes removed when they do not contribute to the NLP task at hand. These can include words such as a, and, and she.
  • Text-expansion: For acronyms and abbreviations, it is sometimes desirable to expand them so that postprocesses can produce better-quality results. For example, if a search is interested in the word machine, knowing that IBM stands for International Business Machines can be useful.
  • Case: The case of a word (upper or lower) may be significant in some situations. For example, the case of a word can help identify proper nouns. When identifying the par...
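Several of the factors listed above (stopwords, text expansion, and case) can be combined into a small post-tokenization step. The sketch below is only an illustration: the class name, the four-word stopword list, and the one-entry expansion table are made up for this example, and it assumes Java 9 or later for Set.of and Map.of.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical normalizer; the word lists are illustrative only.
final class TokenNormalizer {

    // A deliberately tiny stopword list; real lists are far longer.
    private static final Set<String> STOPWORDS = Set.of("a", "and", "she", "the");

    // A made-up expansion table with a single acronym.
    private static final Map<String, String> EXPANSIONS =
            Map.of("IBM", "International Business Machines");

    static List<String> normalize(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            // Expand first: acronyms are matched case-sensitively here,
            // so this must happen before lowercasing.
            String expanded = EXPANSIONS.getOrDefault(token, token);
            for (String part : expanded.split(" ")) {
                String lower = part.toLowerCase(Locale.ROOT);
                if (!STOPWORDS.contains(lower)) {
                    result.add(lower);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("She", "joined", "IBM", "and", "Oracle");
        // Prints [joined, international, business, machines, oracle]
        System.out.println(normalize(tokens));
    }
}
```

Two trade-offs are visible here. Lowercasing discards the capitalization that could help identify proper nouns such as Oracle, and a search for machine would still need stemming or lemmatization to match the expanded token machines.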

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. Dedication
  4. Packt Upsell
  5. Contributors
  6. Preface
  7. Introduction to NLP
  8. Finding Parts of Text
  9. Finding Sentences
  10. Finding People and Things
  11. Detecting Part of Speech
  12. Representing Text with Features
  13. Information Retrieval
  14. Classifying Texts and Documents
  15. Topic Modeling
  16. Using Parsers to Extract Relationships
  17. Combined Pipeline
  18. Creating a Chatbot
  19. Other Books You May Enjoy

Natural Language Processing with Java is by Richard M. Reese and AshishSingh Bhatia, and is available in PDF and/or ePUB format. It is categorized under Computer Science & Natural Language Processing.