eBook - ePub

Natural Language Processing with Java

Name: Natural Language Processing with Java
ISBN: 9781788993067

Techniques for building machine learning and neural network models for NLP, 2nd Edition

Richard M. Reese, AshishSingh Bhatia

318 pages
English
ePUB (mobile friendly)
Available on iOS & Android

About this book

Explore various approaches to organize and extract useful text from unstructured data using Java

Key Features

Use deep learning and NLP techniques in Java to discover hidden insights in text
Work with popular Java libraries such as CoreNLP, OpenNLP, and Mallet
Explore machine translation, identifying parts of speech, and topic modeling

Book Description

Natural Language Processing (NLP) allows you to take any sentence and identify patterns, special names, company names, and more. The second edition of Natural Language Processing with Java teaches you how to perform language analysis with the help of Java libraries, while constantly gaining insights from the outcomes.

You'll start by understanding how NLP and its various concepts work. Having got to grips with the basics, you'll explore important tools and libraries in Java for NLP, such as CoreNLP, OpenNLP, Neuroph, and Mallet. You'll then start performing NLP on different inputs and tasks, such as tokenization, model training, part-of-speech tagging, and parse trees. You'll learn about statistical machine translation, summarization, dialog systems, complex searches, supervised and unsupervised NLP, and more.

By the end of this book, you'll have learned more about NLP, neural networks, and various other trained models in Java for enhancing the performance of NLP applications.

What you will learn

Understand basic NLP tasks and how they relate to one another
Discover and use the available tokenization engines
Apply search techniques to find people, as well as things, within a document
Construct solutions to identify parts of speech within sentences
Use parsers to extract relationships between elements of a document
Identify topics in a set of documents
Explore topic modeling from a document

Who this book is for

Natural Language Processing with Java is for you if you are a data analyst, data scientist, or machine learning engineer who wants to extract information from a language using Java. Knowledge of Java programming is needed, while a basic understanding of statistics will be useful but not mandatory.

Publisher: Packt Publishing
Year: 2018
Edition: 2nd
eBook ISBN: 9781788993067
Topic: Computer Science
Subtopic: Natural Language Processing

Finding Parts of Text

Finding parts of text is concerned with breaking text down into individual units, called tokens, and optionally performing additional processing on those tokens. This additional processing can include stemming, lemmatization, stopword removal, synonym expansion, and converting text to lowercase.
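As a rough illustration of the extra processing mentioned above, the following sketch lowercases whitespace-separated tokens and drops stopwords. The class name SimpleNormalizer and the four-word stopword list are hypothetical and chosen only for this example; they are not taken from the book, and a real application would use a much larger list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of any NLP library.
class SimpleNormalizer {

    // Assumption: a deliberately tiny stopword list, for illustration only.
    private static final Set<String> STOPWORDS = Set.of("a", "and", "she", "the");

    // Splits on runs of whitespace, lowercases each token, and drops stopwords.
    static List<String> normalize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            // An empty or all-whitespace input yields one empty string; skip it.
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [read, book, manual]
        System.out.println(normalize("She read a book and the Manual"));
    }
}
```

Stemming, lemmatization, and synonym expansion need linguistic resources and are left out of this sketch.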

We will demonstrate several tokenization techniques found in the standard Java distribution. These are included because sometimes they are all you need to do the job, with no need to import an NLP library. However, these techniques are limited. We then discuss specific tokenizers and tokenization approaches supported by NLP APIs. These examples provide a reference for how the tokenizers are used and the type of output they produce. We finish with a simple comparison of the differences between the approaches.
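The excerpt does not say which core Java techniques it has in mind. As a minimal sketch, and assuming that String.split, java.util.StringTokenizer, and java.util.Scanner are reasonable candidates, the three can be compared on the same input. The class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;

// Hypothetical comparison class, for illustration only.
class CoreJavaTokenizers {

    // String.split with a regular expression matching runs of whitespace.
    static List<String> withSplit(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // StringTokenizer with its default delimiters (space, tab, newline, CR, form feed).
    static List<String> withStringTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Scanner, whose default delimiter is whitespace.
    static List<String> withScanner(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        System.out.println(withSplit(text));
        System.out.println(withStringTokenizer(text));
        System.out.println(withScanner(text));
    }
}
```

All three approaches leave punctuation attached to the neighboring word (for example, "pause," and "reflect."), which is one of the limitations that motivates dedicated NLP tokenizers.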

There are many specialized tokenizers. For example, the Apache Lucene project supports tokenizers for various languages and specialized documents. The WikipediaTokenizer class is a tokenizer that handles Wikipedia-specific documents, and the ArabicAnalyzer class handles Arabic text. It is not possible to illustrate all of these varying approaches here.

We will also examine how certain tokenizers can be trained to handle specialized text. This can be useful when a different form of text is encountered. It can often eliminate the need to write a new and specialized tokenizer.

Next, we will illustrate how some of these tokenizers can be used to support specific operations, such as stemming, lemmatization, and stopword removal. Part-of-speech (POS) tagging can also be considered a special instance of finding parts of text. However, this topic is investigated in Chapter 5, Detecting Parts of Speech.

Therefore, we will be covering the following topics in this chapter:

What is tokenization?
Uses of tokenizers
NLP tokenizer APIs
Understanding normalization

Understanding the parts of text

There are a number of ways to categorize parts of text. For example, we may be concerned with character-level issues, such as punctuation, with a possible need to ignore or expand contractions. At the word level, we may need to perform different operations, such as the following:

Identifying morphemes using stemming and/or lemmatization
Expanding abbreviations and acronyms
Isolating number units

We cannot always split words at punctuation, because the punctuation is sometimes considered to be part of the word, as in the word can't. We may also be concerned with grouping multiple words to form meaningful phrases. Sentence detection can also be a factor: we do not necessarily want to group words that cross sentence boundaries.
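One way to keep contractions such as can't together while still discarding other punctuation is a regular expression that allows an apostrophe only between letters. The following is a sketch under that assumption. The class name and the pattern are hypothetical, and the pattern handles only the straight ASCII apostrophe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based tokenizer, for illustration only.
class ContractionAwareTokenizer {

    // A run of letters or digits, optionally followed by apostrophe-plus-letters
    // groups. Assumption: only the ASCII apostrophe (') joins word parts.
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:'\\p{L}+)*");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [We, can't, stop, it's, late]
        System.out.println(tokenize("We can't stop; it's late."));
    }
}
```

The semicolon and the final period are dropped because they never match the pattern, while the apostrophes inside can't and it's are kept as part of the token.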

In this chapter, we are primarily concerned with the tokenization process and a few specialized techniques, such as stemming. We will not attempt to show how they are used in other NLP tasks. Those efforts are reserved for later chapters.

What is tokenization?

Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Tokens are split based on a set of delimiters, which are frequently whitespace characters. Whitespace in Java is defined by the Character class's isWhitespace method, and the characters it accepts are listed in the following table. At times, however, a different set of delimiters is needed. For example, different delimiters can be useful when whitespace delimiters obscure text breaks, such as paragraph boundaries, and detecting these text breaks is important:

Character	Meaning
Unicode space character	SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR, excluding the non-breaking spaces \u00A0, \u2007, and \u202F
\t	U+0009 horizontal tabulation
\n	U+000A line feed
\u000B	U+000B vertical tabulation
\f	U+000C form feed
\r	U+000D carriage return
\u001C	U+001C file separator
\u001D	U+001D group separator
\u001E	U+001E record separator
\u001F	U+001F unit separator
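The following sketch splits text wherever Character.isWhitespace returns true, so it treats every character in the table as a delimiter. The class name is hypothetical. Note that isWhitespace returns false for non-breaking spaces such as U+00A0, so those do not separate tokens here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tokenizer built directly on Character.isWhitespace.
class WhitespaceTokenizer {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Iterate by code point so supplementary characters stay intact.
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (Character.isWhitespace(codePoint)) {
                // A delimiter ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(codePoint);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tab, file separator (U+001C), and vertical tab (U+000B) all split.
        System.out.println(tokenize("one\ttwo\u001Cthree\u000Bfour").size()); // 4
        // A non-breaking space (U+00A0) is not whitespace for isWhitespace.
        System.out.println(tokenize("a\u00A0b").size()); // 1
    }
}
```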

The tokenization process is complicated by a large number of factors, such as the following:

Language: Different languages present unique challenges. Whitespace is a commonly used delimiter, but it will not be sufficient if we need to work with Chinese, where it is not used.
Text format: Text is often stored or presented using different formats. Plain text is processed differently from HTML or other markup, which complicates the tokenization process.
Stopwords: Commonly used words might not be important for some NLP tasks, such as general searches. These common words are called stopwords. Stopwords are sometimes removed when they do not contribute to the NLP task at hand. These can include words such as a, and, and she.
Text expansion: For acronyms and abbreviations, it is sometimes desirable to expand them so that later processing steps can produce better-quality results. For example, if a search is interested in the word machine, knowing that IBM stands for International Business Machines can be useful.
Case: The case of a word (upper or lower) may be significant in some situations. For example, the case of a word can help identify proper nouns. When identifying the par...

Title Page
Copyright and Credits
Dedication
Packt Upsell
Contributors
Preface
Introduction to NLP
Finding Parts of Text
Finding Sentences
Finding People and Things
Detecting Part of Speech
Representing Text with Features
Information Retrieval
Classifying Texts and Documents
Topic Modeling
Using Parsers to Extract Relationships
Combined Pipeline
Creating a Chatbot
Other Books You May Enjoy
