Natural Language Processing with Java
Techniques for building machine learning and neural network models for NLP, 2nd Edition
Richard M. Reese, AshishSingh Bhatia
- 318 pages
- English
About This Book
Explore various approaches to organize and extract useful text from unstructured data using Java
Key Features
- Use deep learning and NLP techniques in Java to discover hidden insights in text
- Work with popular Java libraries such as CoreNLP, OpenNLP, and Mallet
- Explore machine translation, identifying parts of speech, and topic modeling
Book Description
Natural Language Processing (NLP) allows you to take any sentence and identify patterns, special names, company names, and more. The second edition of Natural Language Processing with Java teaches you how to perform language analysis with the help of Java libraries, while constantly gaining insights from the outcomes.
You'll start by understanding how NLP and its various concepts work. Having got to grips with the basics, you'll explore important tools and libraries in Java for NLP, such as CoreNLP, OpenNLP, Neuroph, and Mallet. You'll then start performing NLP tasks on different inputs, such as tokenization, model training, part-of-speech tagging, and parse trees. You'll learn about statistical machine translation, summarization, dialog systems, complex searches, supervised and unsupervised NLP, and more.
By the end of this book, you'll have learned more about NLP, neural networks, and various other trained models in Java for enhancing the performance of NLP applications.
What you will learn
- Understand basic NLP tasks and how they relate to one another
- Discover and use the available tokenization engines
- Apply search techniques to find people, as well as things, within a document
- Construct solutions to identify parts of speech within sentences
- Use parsers to extract relationships between elements of a document
- Identify topics in a set of documents
- Explore topic modeling from a document
Who this book is for
Natural Language Processing with Java is for you if you are a data analyst, data scientist, or machine learning engineer who wants to extract information from a language using Java. Knowledge of Java programming is needed, while a basic understanding of statistics will be useful but not mandatory.
Finding Parts of Text
- What is tokenization?
- Uses of tokenizers
- NLP tokenizer APIs
- Understanding normalization
Understanding the parts of text
- Identifying morphemes using stemming and/or lemmatization
- Expanding abbreviations and acronyms
- Isolating number units
What is tokenization?
Character | Meaning |
Unicode space character | (space_separator, line_separator, or paragraph_separator) |
\t | U+0009 horizontal tabulation |
\n | U+000A line feed |
\u000B | U+000B vertical tabulation |
\f | U+000C form feed |
\r | U+000D carriage return |
\u001C | U+001C file separator |
\u001D | U+001D group separator |
\u001E | U+001E record separator |
\u001F | U+001F unit separator |
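The characters in the table above match the set that Java's `Character.isWhitespace` method reports as whitespace, so a simple whitespace tokenizer can be built on that method. The following is a minimal sketch under that assumption, not the book's own implementation; the class name `WhitespaceTokenizerSketch` and the method `tokenize` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: splits text wherever Character.isWhitespace is true.
class WhitespaceTokenizerSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        // Walk the text by Unicode code point so supplementary characters stay intact.
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);

            if (Character.isWhitespace(codePoint)) {
                // A delimiter ends the current token, if one has been started.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(codePoint);
            }
        }

        // Flush the last token when the text does not end in whitespace.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [Let's, pause,, and, then, reflect.]
        System.out.println(tokenize("Let's pause, \tand then\nreflect."));
    }
}
```

Punctuation stays attached to the neighboring word (`pause,` and `reflect.`), which is one reason whitespace alone is rarely enough. Note also that `Character.isWhitespace` does not treat the non-breaking space (U+00A0) as whitespace, so words joined by it remain a single token in this sketch.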
- Language: Different languages present unique challenges. Whitespace is a commonly used delimiter, but it will not be sufficient if we need to work with Chinese, where words are not separated by whitespace.
- Text format: Text is often stored or presented using different formats. Plain text is processed differently from HTML or other markup, and handling these formats complicates the tokenization process.
- Stopwords: Commonly used words, called stopwords, might not be important for some NLP tasks, such as general searches. Stopwords are sometimes removed when they do not contribute to the NLP task at hand. Examples include words such as a, and, and she.
- Text-expansion: For acronyms and abbreviations, it is sometimes desirable to expand them so that postprocesses can produce better-quality results. For example, if a search is interested in the word machine, knowing that IBM stands for International Business Machines can be useful.
- Case: The case of a word (upper or lower) may be significant in some situations. For example, the case of a word can help identify proper nouns. When identifying the par...