Mastering Java for Data Science

Alexey Grigorev
About This Book

Use Java to create a diverse range of Data Science applications and bring Data Science into production.
  • An overview of modern Data Science and Machine Learning libraries available in Java
  • Coverage of a broad set of topics, from the basics of Machine Learning to Deep Learning and Big Data frameworks
  • Easy-to-follow illustrations and the running example of building a search engine

Who This Book Is For

This book is intended for software engineers who are comfortable with developing Java applications and are familiar with the basic concepts of data science. It will also be useful for data scientists who do not yet know Java but want or need to learn it. If you want to build efficient data science applications and bring them into the enterprise environment without changing your existing stack, this book is for you!

What You Will Learn

  • Get a solid understanding of the data processing toolbox available in Java
  • Explore the data science ecosystem available in Java
  • Find out how to approach different machine learning problems with Java
  • Process unstructured information such as natural language text or images
  • Create your own search engine
  • Get state-of-the-art performance with XGBoost
  • Learn how to build deep neural networks with DeepLearning4j
  • Build applications that scale and process large amounts of data
  • Deploy data science models to production and evaluate their performance

In Detail

Java is the most popular programming language according to the TIOBE index, and it is a typical choice for running production systems in many companies, both in the startup world and among large enterprises. Not surprisingly, it is also a common choice for creating data science applications: it is fast and has a great set of data processing tools, both built-in and external. What is more, choosing Java for data science allows you to easily integrate solutions with existing software and bring data science into production with less effort.

This book will teach you how to create data science applications with Java. First, we will review the most important considerations when starting a data science application, and then brush up on the basics of Java and machine learning before diving into more advanced topics. We start by going over the existing libraries for data processing and the libraries with machine learning algorithms. After that, we cover topics such as classification and regression, dimensionality reduction and clustering, information retrieval and natural language processing, and deep learning and big data. Finally, we finish the book by talking about the ways to deploy a model and evaluate it in production settings.

Style and Approach

This is a practical guide where all the important concepts, such as classification, regression, and dimensionality reduction, are explained with the help of examples.


Information

Year: 2017
ISBN: 9781785887390
Edition: 1

Working with Text - Natural Language Processing and Information Retrieval

In the previous two chapters, we covered the basics of machine learning, including supervised and unsupervised problems.
In this chapter, we will look at how to use these methods for processing textual information, and we will illustrate most of the ideas with our running example: building a search engine. Here, we will finally use the text information from the HTML pages and include it in the machine learning models.
We will start with the basics of natural language processing, implement some of these ideas ourselves, and then look at the efficient implementations available in NLP libraries.
This chapter covers the following topics:
  • Basics of information retrieval
  • Indexing and searching with Apache Lucene
  • Basics of natural language processing
  • Unsupervised models for texts - dimensionality reduction, clustering, and word embeddings
  • Supervised models for texts - text classification and learning to rank
By the end of this chapter, you will know how to do simple text pre-processing for machine learning, how to use Apache Lucene for indexing, how to transform words into vectors, and, finally, how to cluster and classify texts.

Natural Language Processing and Information Retrieval

Natural Language Processing (NLP) is a part of computer science and computational linguistics that deals with textual data. To a computer, texts are unstructured, and NLP helps find the structure and extract useful information from them.
Information retrieval (IR) is a discipline that studies searching in large unstructured datasets. Typically, these datasets are texts, and the IR systems help users find what they want. Search engines such as Google or Bing are examples of such IR systems: they take in a query and provide a collection of documents ranked according to relevance with respect to the query.
Usually, IR systems use NLP for understanding what the documents are about, so that these documents can later be retrieved when a user needs them. In this chapter, we will go over the basics of text processing for information retrieval.

Vector Space Model - Bag of Words and TF-IDF

For a computer, a text is just a string of characters with no particular structure imposed on it. Hence, we call texts unstructured data. To humans, however, texts certainly have a structure, which we use to understand their content. IR and NLP models try to do something similar: they find the structure in texts, use it to extract the information there, and understand what the text is about.
The simplest possible way of achieving it is called Bag of Words: we take a text, split it into individual words (which we call tokens), and then represent the text as an unordered collection of tokens along with some weights associated with each token.
Let us consider an example. If we take a document that consists of one sentence (we use Java for Data Science because we like Java), it can be represented as follows:
 (because, 1), (data, 1), (for, 1), (java, 2), (like, 1), (science, 1), (use, 1), (we, 2) 
Here, each word from the sentence is weighted by the number of times the word occurs there.
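As an illustration, here is a minimal sketch of how such a bag of words could be computed in plain Java; this is our own example, not code from the book, and the lowercase-and-split tokenization is a simplifying assumption:

 import java.util.*;
 
 public class BagOfWords {
     // Tokenize by splitting on non-word characters and count each token
     public static Map<String, Integer> bagOfWords(String text) {
         Map<String, Integer> counts = new TreeMap<>();
         for (String token : text.toLowerCase().split("\\W+")) {
             if (!token.isEmpty()) {
                 counts.merge(token, 1, Integer::sum);
             }
         }
         return counts;
     }
 
     public static void main(String[] args) {
         // Prints {because=1, data=1, for=1, java=2, like=1, science=1, use=1, we=2}
         System.out.println(bagOfWords("we use Java for Data Science because we like Java"));
     }
 }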
Now that we are able to represent documents in such a way, we can use this representation for comparing one document to another.
For example, if we take another sentence such as Java is good for enterprise development, we can represent it as follows:
 (development, 1), (enterprise, 1), (for, 1), (good, 1), (is, 1), (java, 1) 
We can see that there is some intersection between these two documents, which may mean that they are similar: the larger the intersection, the more similar the documents are.
Now, if we think of words as dimensions in some vector space, and of the weights as the values along these dimensions, then we can represent each document as a vector.
If we take this vector representation, we can use the inner product between two vectors as a measure of similarity. Indeed, if two documents have a lot of words in common, the inner product between them will be high, and if they share no words, the inner product is zero.
This idea is called the Vector Space Model, and it is what is used in many information retrieval systems: all documents, as well as the user queries, are represented as vectors. Once the query and the documents are in the same space, we can treat the similarity between a query and a document as the relevance between them, and sort the documents by their similarity to the user query.
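To make the inner product concrete, here is a small sketch in the same style as the previous one (again our own illustration): documents are maps from tokens to weights, and only the tokens that appear in both documents contribute to the sum:

 import java.util.Map;
 
 public class InnerProduct {
     // Inner product of two sparse vectors stored as token -> weight maps;
     // if the documents share no words, the result is 0.0
     public static double innerProduct(Map<String, Double> doc1, Map<String, Double> doc2) {
         double sum = 0.0;
         for (Map.Entry<String, Double> e : doc1.entrySet()) {
             Double w2 = doc2.get(e.getKey());
             if (w2 != null) {
                 sum = sum + e.getValue() * w2;
             }
         }
         return sum;
     }
 }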
Going from raw text to a vector involves a few steps. Usually, they are as follows (a sketch of the whole pipeline follows the list):
  • First, we tokenize the text, that is, convert it into a collection of individual tokens.
  • Then, we remove function words such as is, will, to, and others. They are often used for linking purposes only and do not carry any significant meaning. These words are called stop words.
  • Sometimes, we also convert tokens to some normal form. For example, we may want to map both cat and cats to cat because the concept behind these two words is the same. This is achieved through stemming or lemmatization.
  • Finally, we compute the weight of each token and put them into the vector space.
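Here is what these four steps could look like when put together; this is a simplified sketch of our own, where the tiny stop-word list and the strip-a-trailing-s "stemmer" stand in for what a real NLP library would provide:

 import java.util.*;
 
 public class TextPipeline {
     private static final Set<String> STOP_WORDS =
             new HashSet<>(Arrays.asList("a", "the", "is", "will", "to", "for", "we"));
 
     public static Map<String, Integer> textToVector(String text) {
         Map<String, Integer> vector = new HashMap<>();
         // Step 1: tokenize
         for (String token : text.toLowerCase().split("\\W+")) {
             // Step 2: remove stop words
             if (token.isEmpty() || STOP_WORDS.contains(token)) {
                 continue;
             }
             // Step 3: normalize to a base form (toy stemmer: strip a trailing "s")
             if (token.endsWith("s") && token.length() > 3) {
                 token = token.substring(0, token.length() - 1);
             }
             // Step 4: weight each token (here, by term frequency)
             vector.merge(token, 1, Integer::sum);
         }
         return vector;
     }
 }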
Previously, we used the number of occurrences for weighting terms; this is called Term Frequency weighting. However, some words are more important than others and Term Frequency does not always capture that.
For example, hammer can be more important than tool because it is more specific. Inverse Document Frequency (IDF) is a different weighting scheme, one that penalizes general words and favors specific ones. Internally, it is based on the number of documents that contain the term, and the idea is that more specific terms occur in fewer documents than general ones.
Finally, there is a combination of both Term Frequency and Inverse Document Frequency, which is abbreviated as TF-IDF. As the name suggests, the weight for the token t consists of two parts: TF and IDF:
 weight(t) = tf(t) * idf(t) 
Here is an explanation of the terms mentioned in the preceding equation:
  • tf(t): This is a function on the number of times the token t occurs in the text
  • idf(t): This is a function on the number of documents that contain the token
There are multiple ways to define these functions but, most commonly, the following definitions are used (a sketch of computing IDF follows the list):
  • tf(t): This is the number of times t occurs in the document
  • idf(t) = log(N / df(t)): Here, df(t) is the number of documents that contain t, and N is the total number of documents
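To make these definitions concrete, here is a minimal sketch of computing the IDF values over a corpus; the method is our own illustration, each document is assumed to be pre-tokenized into a set of tokens, and we use the natural logarithm, which is only one of several common choices:

 import java.util.*;
 
 public class Idf {
     // idf(t) = log(N / df(t)), where df(t) is the number of documents containing t
     public static Map<String, Double> idf(List<Set<String>> docs) {
         int n = docs.size();
         Map<String, Integer> df = new HashMap<>();
         for (Set<String> doc : docs) {
             for (String token : doc) {
                 df.merge(token, 1, Integer::sum);
             }
         }
         Map<String, Double> idf = new HashMap<>();
         for (Map.Entry<String, Integer> e : df.entrySet()) {
             idf.put(e.getKey(), Math.log((double) n / e.getValue()));
         }
         return idf;
     }
 }

The TF-IDF weight of a token in a document is then simply its term frequency in that document multiplied by idf.get(token).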
Previously, we suggested that we can use the inner product for measuring the similarity between documents. There is a problem with this approach: it is unbounded, which means that it can take any positive value, and this makes it harder to interpret. Additionally, longer documents will tend to have higher similarity with everything else just because they contain more words.
The solution to this problem is to normalize the weights inside a vector such that its norm becomes 1. Then, computing the inner product will always result in a bounded value between 0 and 1, and longer documents will have less influence. The inner product between normalized vectors is usually called cosine similarity because it corresponds to the cosine of the angle that these two vectors form in the vector space.
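Continuing the sketches above, normalization and cosine similarity could look as follows (our own illustration, reusing the innerProduct method from the earlier sketch):

 import java.util.HashMap;
 import java.util.Map;
 
 public class Cosine {
     // Divide every weight by the vector's Euclidean norm, so that the norm becomes 1
     public static Map<String, Double> normalize(Map<String, Double> doc) {
         double norm = 0.0;
         for (double w : doc.values()) {
             norm = norm + w * w;
         }
         norm = Math.sqrt(norm);
 
         Map<String, Double> result = new HashMap<>();
         for (Map.Entry<String, Double> e : doc.entrySet()) {
             result.put(e.getKey(), e.getValue() / norm);
         }
         return result;
     }
 
     // Cosine similarity is the inner product of two normalized vectors
     public static double cosine(Map<String, Double> doc1, Map<String, Double> doc2) {
         return InnerProduct.innerProduct(normalize(doc1), normalize(doc2));
     }
 }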

Vector space model implementation

Now we have enough background information and are ready to proceed to the code.
First, suppose that we have a text file where each line is a document, and we want to index the content of this file and be able to query it. For example, we can take some text from https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt and save it to simple-text.txt.
Then we can read it this way:
 // uses java.nio.file.Files, Path, Paths, java.nio.charset.StandardCharsets, and java.util.List
 Path path = Paths.get("data/simple-text.txt");
 List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
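From here, each line can be turned into a term-weight vector; as a minimal continuation (our own illustration, reusing the bagOfWords sketch from earlier in this chapter):

 List<Map<String, Integer>> corpus = new ArrayList<>();
 for (String line : lines) {
     corpus.add(BagOfWords.bagOfWords(line));
 }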
