eBook - ePub

Natural Language Processing with Java

Name: Natural Language Processing with Java
Author: Richard M. Reese, AshishSingh Bhatia

Techniques for building machine learning and neural network models for NLP, 2nd Edition

Richard M. Reese, AshishSingh Bhatia

318 pages
English
ePUB (mobile friendly)

About This Book

Explore various approaches to organize and extract useful text from unstructured data using Java

Key Features

Use deep learning and NLP techniques in Java to discover hidden insights in text
Work with popular Java libraries such as CoreNLP, OpenNLP, and Mallet
Explore machine translation, identifying parts of speech, and topic modeling

Book Description

Natural Language Processing (NLP) allows you to take any sentence and identify patterns, special names, company names, and more. The second edition of Natural Language Processing with Java teaches you how to perform language analysis with the help of Java libraries, while constantly gaining insights from the outcomes.

You'll start by understanding how NLP and its various concepts work. Having got to grips with the basics, you'll explore important tools and libraries in Java for NLP, such as CoreNLP, OpenNLP, Neuroph, and Mallet. You'll then start performing NLP on different inputs and tasks, such as tokenization, model training, part-of-speech tagging, and parse trees. You'll learn about statistical machine translation, summarization, dialog systems, complex searches, supervised and unsupervised NLP, and more.

By the end of this book, you'll have learned more about NLP, neural networks, and various other trained models in Java for enhancing the performance of NLP applications.

What you will learn

Understand basic NLP tasks and how they relate to one another
Discover and use the available tokenization engines
Apply search techniques to find people, as well as things, within a document
Construct solutions to identify parts of speech within sentences
Use parsers to extract relationships between elements of a document
Identify topics in a set of documents
Explore topic modeling from a document

Who this book is for

Natural Language Processing with Java is for you if you are a data analyst, data scientist, or machine learning engineer who wants to extract information from a language using Java. Knowledge of Java programming is needed, while a basic understanding of statistics will be useful but not mandatory.

Information

Publisher

Packt Publishing

Year

2018

ISBN

9781788993067

Edition

2nd

Topic

Computer Science

Subtopic

Programming in Java

Finding Parts of Text

Finding parts of text is concerned with breaking text down into individual units, called tokens, and optionally performing additional processing on those tokens. This additional processing can include stemming, lemmatization, stopword removal, synonym expansion, and converting text to lowercase.
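As a rough illustration of the "additional processing" mentioned above, the following sketch lowercases tokens and removes stopwords using only core Java. The class name SimpleNormalizer and its tiny stopword list are hypothetical and chosen for this example only; real applications typically use a much larger, task-specific list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: a minimal sketch, not an API from any NLP library.
class SimpleNormalizer {

    // Assumed, deliberately tiny stopword list for illustration.
    static final Set<String> STOPWORDS = Set.of("a", "an", "and", "the", "she", "of");

    // Splits on whitespace, lowercases each token, and drops stopwords.
    static List<String> normalize(String text) {
        return Arrays.stream(text.split("\\s+"))
                .map(token -> token.toLowerCase(Locale.ROOT))
                .filter(token -> !token.isEmpty())
                .filter(token -> !STOPWORDS.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [read, book, paper]
        System.out.println(normalize("She read A book and the paper"));
    }
}
```

Lowercasing happens before the stopword check so that "She" and "A" match the lowercase entries in the list.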

We will demonstrate several tokenization techniques found in the standard Java distribution. These are included because they are sometimes all you need to do the job, with no need to import an NLP library. However, these techniques are limited. We will then discuss specific tokenizers and tokenization approaches supported by NLP APIs. These examples will provide a reference for how the tokenizers are used and the type of output they produce. Finally, we will give a simple comparison of the differences between the approaches.
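To give a flavor of what the standard Java distribution offers, here is a minimal sketch that tokenizes the same sentence with String.split, StringTokenizer, and BreakIterator. The class name CoreJavaTokenizers and its method names are hypothetical; the choice to keep only BreakIterator segments that start with a letter or digit is an assumption made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical helper class comparing three core-Java tokenization approaches.
class CoreJavaTokenizers {

    // Splits on runs of whitespace using a regular expression.
    static List<String> withSplit(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // StringTokenizer's default delimiters are space, tab, newline,
    // carriage return, and form feed.
    static List<String> withStringTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // BreakIterator reports locale-aware word boundaries. Segments that do not
    // start with a letter or digit (spaces, punctuation) are skipped here.
    static List<String> withBreakIterator(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.US);
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox. It jumped!";
        System.out.println(withSplit(text));
        System.out.println(withStringTokenizer(text));
        System.out.println(withBreakIterator(text));
    }
}
```

For this input, the split and StringTokenizer versions leave punctuation attached to the neighboring word (fox. and jumped!), whereas the BreakIterator version separates it, which hints at the limitations of purely whitespace-based approaches.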

There are many specialized tokenizers. For example, the Apache Lucene project supports tokenizers for various languages and specialized documents. The WikipediaTokenizer class is a tokenizer that handles Wikipedia-specific documents, and the ArabicAnalyzer class handles Arabic text. It is not possible to illustrate all of these varying approaches here.

We will also examine how certain tokenizers can be trained to handle specialized text. This can be useful when a different form of text is encountered. It can often eliminate the need to write a new and specialized tokenizer.

Next, we will illustrate how some of these tokenizers can be used to support specific operations, such as stemming, lemmatization, and stopword removal. Part-of-speech (POS) tagging can also be considered a special instance of finding parts of text. However, this topic is investigated in Chapter 5, Detecting Parts of Speech.

Therefore, we will be covering the following topics in this chapter:

What is tokenization?
Uses of tokenizers
NLP tokenizer APIs
Understanding normalization

Understanding the parts of text

There are a number of ways to categorize parts of text. For example, we may be concerned with character-level issues, such as punctuation, with a possible need to ignore or expand contractions. At the word level, we may need to perform different operations, such as the following:

Identifying morphemes using stemming and/or lemmatization
Expanding abbreviations and acronyms
Isolating number units

We cannot always split words at punctuation, because the punctuation is sometimes considered to be part of the word, as in the word can't. We may also be concerned with grouping multiple words to form meaningful phrases. Sentence detection can also be a factor, since we do not necessarily want to group words that cross sentence boundaries.
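The problem with punctuation inside words can be seen in a small sketch. The class ContractionAwareTokenizer below is hypothetical: its naive method splits on every non-letter character, while its contractionAware method uses a regular expression that allows an apostrophe between letters. Only the straight apostrophe (') is handled; typographic apostrophes, hyphenated words, and numbers are out of scope for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a sketch contrasting two simple splitting rules.
class ContractionAwareTokenizer {

    // A word is a run of letters, optionally followed by one or more
    // groups consisting of an apostrophe and more letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:'\\p{L}+)*");

    // Splits on anything that is not a letter, so "can't" is broken apart.
    static List<String> naive(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Keeps word-internal apostrophes, so "can't" stays one token.
    static List<String> contractionAware(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "We can't stop.";
        System.out.println(naive(text));            // prints [We, can, t, stop]
        System.out.println(contractionAware(text)); // prints [We, can't, stop]
    }
}
```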

In this chapter, we are primarily concerned with the tokenization process and a few specialized techniques, such as stemming. We will not attempt to show how they are used in other NLP tasks. Those efforts are reserved for later chapters.

What is tokenization?

Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Tokens are split based on a set of delimiters, which are frequently whitespace characters. At times, however, a different set of delimiters is needed. For example, different delimiters can be useful when whitespace delimiters obscure text breaks, such as paragraph boundaries, and detecting these text breaks is important. Whitespace in Java is defined by the Character class's isWhitespace method, which recognizes the characters listed in the following table:

Character	Meaning
Unicode space character	SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR, excluding the non-breaking spaces \u00A0, \u2007, and \u202F
\t	U+0009 horizontal tabulation
\n	U+000A line feed
\u000B	U+000B vertical tabulation
\f	U+000C form feed
\r	U+000D carriage return
\u001C	U+001C file separator
\u001D	U+001D group separator
\u001E	U+001E record separator
\u001F	U+001F unit separator
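The table can be turned into a tokenizer directly. The following is a minimal sketch, assuming that every character for which Character.isWhitespace returns true should act as a delimiter; the class name WhitespaceTokenizer is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits text wherever Character.isWhitespace is true.
class WhitespaceTokenizer {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Iterate by code point so supplementary characters are handled.
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isWhitespace(codePoint)) {
                // A delimiter ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        // Flush the last token when the text does not end with whitespace.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tab, newline, and space all act as delimiters.
        // prints [one, two, three, four]
        System.out.println(tokenize("one\ttwo\nthree four"));
    }
}
```

Because isWhitespace excludes non-breaking spaces, a string such as 10 followed by \u00A0 and km stays a single token with this approach, while the separator characters \u001C through \u001F do split tokens.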

The tokenization process is complicated by a large number of factors, such as the following:

Language: Different languages present unique challenges. Whitespace is a commonly used delimiter, but it will not be sufficient if we need to work with Chinese, where words are not separated by whitespace.
Text format: Text is often stored or presented using different formats. Plain text is processed differently from HTML or other markup, and handling these formats complicates the tokenization process.
Stopwords: Commonly-used words might not be important for some NLP tasks, such as general searches. These common words are called stopwords. Stopwords are sometimes removed when they do not contribute to the NLP task at hand. These can include words such as a, and, and she.
Text-expansion: For acronyms and abbreviations, it is sometimes desirable to expand them so that postprocesses can produce better-quality results. For example, if a search is interested in the word machine, knowing that IBM stands for International Business Machines can be useful.
Case: The case of a word (upper or lower) may be significant in some situations. For example, the case of a word can help identify proper nouns. When identifying the par...