Mastering Java for Data Science
Alexey Grigorev

364 pages · English · ePub
Book information

Use Java to create a diverse range of data science applications and bring data science into production.

About This Book
  • An overview of modern data science and machine learning libraries available in Java
  • Coverage of a broad set of topics, from the basics of machine learning to deep learning and big data frameworks
  • Easy-to-follow illustrations and the running example of building a search engine

Who This Book Is For
This book is intended for software engineers who are comfortable with developing Java applications and are familiar with the basic concepts of data science. It will also be useful for data scientists who do not yet know Java but want or need to learn it. If you want to build efficient data science applications and bring them into the enterprise environment without changing your existing stack, this book is for you!

What You Will Learn
  • Get a solid understanding of the data processing toolbox available in Java
  • Explore the data science ecosystem available in Java
  • Find out how to approach different machine learning problems with Java
  • Process unstructured information such as natural language text or images
  • Create your own search engine
  • Get state-of-the-art performance with XGBoost
  • Learn how to build deep neural networks with DeepLearning4j
  • Build applications that scale and process large amounts of data
  • Deploy data science models to production and evaluate their performance

In Detail
Java is the most popular programming language, according to the TIOBE index, and it is a typical choice for running production systems in many companies, both in the startup world and among large enterprises. Not surprisingly, it is also a common choice for creating data science applications: it is fast and has a great set of data processing tools, both built-in and external. What is more, choosing Java for data science allows you to easily integrate solutions with existing software and bring data science into production with less effort.

This book will teach you how to create data science applications with Java. First, we will review the most important concerns when starting a data science application, and then brush up on the basics of Java and machine learning before diving into more advanced topics. We start by going over the existing libraries for data processing and libraries with machine learning algorithms. After that, we cover topics such as classification and regression, dimensionality reduction and clustering, information retrieval and natural language processing, and deep learning and big data. Finally, we finish the book by discussing ways to deploy the model and evaluate it in production settings.

Style and Approach
This is a practical guide where all the important concepts, such as classification, regression, and dimensionality reduction, are explained with the help of examples.


Information

Year
2017
ISBN
9781785887390
Edition
1
Category
Computer Science

Working with Text - Natural Language Processing and Information Retrieval

In the previous two chapters, we covered the basics of machine learning: we spoke about supervised and unsupervised problems.
In this chapter, we will take a look at how to use these methods for processing textual information, and we will illustrate most of the ideas with our running example: building a search engine. Here, we will finally use the text information from the HTML pages and incorporate it into the machine learning models.
We will start with the basics of natural language processing and implement some of the basic ideas ourselves, and then look at the efficient implementations available in NLP libraries.
This chapter covers the following topics:
  • Basics of information retrieval
  • Indexing and searching with Apache Lucene
  • Basics of natural language processing
  • Unsupervised models for texts - dimensionality reduction, clustering, and word embeddings
  • Supervised models for texts - text classification and learning to rank
By the end of this chapter, you will know how to do simple text pre-processing for machine learning, use Apache Lucene for indexing, transform words into vectors, and, finally, cluster and classify texts.

Natural Language Processing and Information Retrieval

Natural Language Processing (NLP) is a part of computer science and computational linguistics that deals with textual data. To a computer, texts are unstructured, and NLP helps find the structure and extract useful information from them.
Information retrieval (IR) is a discipline that studies searching in large unstructured datasets. Typically, these datasets are texts, and the IR systems help users find what they want. Search engines such as Google or Bing are examples of such IR systems: they take in a query and provide a collection of documents ranked according to relevance with respect to the query.
Usually, IR systems use NLP to understand what the documents are about, so that later, when a user needs them, these documents can be retrieved. In this chapter, we will go over the basics of text processing for information retrieval.

Vector Space Model - Bag of Words and TF-IDF

For a computer, a text is just a string of characters with no particular structure imposed on it. Hence, we call texts unstructured data. To humans, however, texts certainly have a structure, which we use to understand their content. IR and NLP models try to do something similar: they find the structure in texts, use it to extract the information contained there, and understand what the text is about.
The simplest possible way of achieving this is called Bag of Words: we take a text, split it into individual words (which we call tokens), and then represent the text as an unordered collection of tokens along with a weight associated with each token.
Let us consider an example. If we take a document that consists of one sentence (we use Java for Data Science because we like Java), it can be represented as follows:
 (because, 1), (data, 1), (for, 1), (java, 2), (like, 1), (science, 1), (use, 1), (we, 2) 
Here, each word from the sentence is weighted by the number of times the word occurs there.
Now that we are able to represent documents in this way, we can use this representation for comparing one document to another.
For example, if we take another sentence such as Java is good for enterprise development, we can represent it as follows:
 (development, 1), (enterprise, 1), (for, 1), (good, 1), (java, 1) 
We can see that there is some intersection between these two documents, which may mean that these two documents are similar, and the higher the intersection, the more similar the documents are.
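As a quick sketch, counting tokens in this way takes only a few lines of Java (the class name and the regex-based split are illustrative simplifications; later in the chapter we use proper tokenizers):

```java
import java.util.Map;
import java.util.TreeMap;

public class BagOfWords {

    // Split the text into lowercase tokens and count how often each occurs
    public static Map<String, Integer> counts(String text) {
        Map<String, Integer> bag = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                bag.merge(token, 1, Integer::sum);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(counts("we use Java for Data Science because we like Java"));
        // {because=1, data=1, for=1, java=2, like=1, science=1, use=1, we=2}
    }
}
```

A `TreeMap` is used only so that the tokens print in sorted order; any `Map` implementation would do.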
Now, if we think of words as dimensions in some vector space, and of the weights as the values along these dimensions, then we can represent documents as vectors.
With this vector representation, we can use the inner product between two vectors as a measure of similarity. Indeed, if two documents have a lot of words in common, the inner product between them will be high, and if they share no words, the inner product is zero.
This idea is called the Vector Space Model, and it is used in many information retrieval systems: all documents, as well as the user queries, are represented as vectors. Once the query and the documents are in the same space, we can treat the similarity between a query and a document as the relevance between them, and sort the documents by their similarity to the user query.
Going from raw text to a vector involves a few steps. Usually, they are as follows:
  • First, we tokenize the text, that is, convert it into a collection of individual tokens.
  • Then, we remove function words such as is, will, to, and others. They are often used for linking purposes only and do not carry any significant meaning. These words are called stop words.
  • Sometimes, we also convert tokens to some normal form. For example, we may want to map both cat and cats to cat, because the concept behind these two different words is the same. This is achieved through stemming or lemmatization.
  • Finally, we compute the weight of each token and put them into the vector space.
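The first three steps above can be sketched as follows. Note that the stop-word list here is a tiny illustrative sample, and the trailing-"s" rule is only a crude stand-in for real stemming or lemmatization:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Preprocess {

    // A tiny illustrative stop-word list; real systems use much longer ones
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "and", "is", "the", "to", "we", "will"));

    public static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // drop empty tokens and stop words
            }
            // crude normalization: strip a trailing "s" as a stand-in for real stemming
            if (token.endsWith("s") && token.length() > 3) {
                token = token.substring(0, token.length() - 1);
            }
            result.add(token);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("We will index the cats and a cat"));
        // [index, cat, cat]
    }
}
```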
Previously, we used the number of occurrences for weighting terms; this is called Term Frequency weighting. However, some words are more important than others and Term Frequency does not always capture that.
For example, hammer can be more important than tool because it is more specific. Inverse Document Frequency is a different weighting scheme that penalizes general words and favors specific ones. It is based on the number of documents that contain the term; the idea is that more specific terms occur in fewer documents than general ones.
Finally, there is a combination of both Term Frequency and Inverse Document Frequency, which is abbreviated as TF-IDF. As the name suggests, the weight for the token t consists of two parts: TF and IDF:
 weight(t) = tf(t) * idf(t) 
Here is an explanation of the terms mentioned in the preceding equation:
  • tf(t): This is a function on the number of times the token t occurs in the text
  • idf(t): This is a function on the number of documents that contain the token
There are multiple ways to define these functions, but, most commonly, the following definitions are used:
  • tf(t): This is the number of times t occurs in the document
  • idf(t) = log(N / df(t)): Here, df(t) is the number of documents that contain t, and N is the total number of documents
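With these definitions, TF-IDF weights can be computed directly. The sketch below assumes the documents have already been tokenized into lists of strings (the class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdf {

    // df(t): the number of documents in the corpus that contain the term t
    public static Map<String, Integer> documentFrequencies(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            for (String term : new HashSet<>(doc)) { // each term counts once per document
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // weight(t) = tf(t) * log(N / df(t))
    public static Map<String, Double> weights(List<String> doc,
                                              Map<String, Integer> df, int numDocs) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum); // tf(t): occurrences of t in this document
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) numDocs / df.get(e.getKey()));
            result.put(e.getKey(), e.getValue() * idf);
        }
        return result;
    }
}
```

Note that a term that occurs in every document gets idf = log(1) = 0, so it carries no weight at all; smoothed variants of IDF exist to soften this effect.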
Previously, we suggested that we can use the inner product for measuring the similarity between documents. There is a problem with this approach: it is unbounded, which means that it can take any positive value, and this makes it harder to interpret. Additionally, longer documents will tend to have a higher similarity with everything else simply because they contain more words.
The solution to this problem is to normalize the weights inside a vector such that its norm becomes 1. Then, computing the inner product will always result in a bounded value between 0 and 1, and longer documents will have less influence. The inner product between normalized vectors is usually called cosine similarity because it corresponds to the cosine of the angle that these two vectors form in the vector space.
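Normalization and cosine similarity can be sketched as follows, representing each document as a sparse map from terms to weights (the class name is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class Cosine {

    // Inner product over the terms the two sparse vectors share
    public static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            sum += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    // Scale the vector so that its Euclidean norm becomes 1
    public static Map<String, Double> normalize(Map<String, Double> v) {
        double norm = Math.sqrt(dot(v, v));
        Map<String, Double> unit = new HashMap<>();
        for (Map.Entry<String, Double> e : v.entrySet()) {
            unit.put(e.getKey(), e.getValue() / norm);
        }
        return unit;
    }

    // Cosine similarity: the inner product of the two normalized vectors
    public static double similarity(Map<String, Double> a, Map<String, Double> b) {
        return dot(normalize(a), normalize(b));
    }
}
```

A document compared with itself gives a similarity of 1, and documents with no terms in common give 0, which makes the scores easy to interpret and to sort by.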

Vector space model implementation

Now we have enough background information and are ready to proceed to the code.
First, suppose that we have a text file where each line is a document, and we want to index the content of this file and be able to query it. For example, we can take some text from https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt and save it to simple-text.txt.
Then we can read it this way:
 Path path = Paths.get("data/simple-text.txt");
 List<String> lines = Files.readAllLines(path);