Mastering Java for Data Science
Alexey Grigorev
- 364 pages
- English
About This Book
Use Java to create a diverse range of Data Science applications and bring Data Science into production
- An overview of modern Data Science and Machine Learning libraries available in Java
- Coverage of a broad set of topics, going from the basics of Machine Learning to Deep Learning and Big Data frameworks
- Easy-to-follow illustrations and the running example of building a search engine
Who This Book Is For
This book is intended for software engineers who are comfortable with developing Java applications and are familiar with the basic concepts of data science. It will also be useful for data scientists who do not yet know Java but want or need to learn it.
If you want to build efficient data science applications and bring them into the enterprise environment without changing the existing stack, this book is for you!
What You Will Learn
- Get a solid understanding of the data processing toolbox available in Java
- Explore the data science ecosystem available in Java
- Find out how to approach different machine learning problems with Java
- Process unstructured information such as natural language text or images
- Create your own search engine
- Get state-of-the-art performance with XGBoost
- Learn how to build deep neural networks with DeepLearning4j
- Build applications that scale and process large amounts of data
- Deploy data science models to production and evaluate their performance
In Detail
Java is the most popular programming language, according to the TIOBE index, and it is a typical choice for running production systems in many companies, both in the startup world and among large enterprises.
Not surprisingly, it is also a common choice for creating data science applications: it is fast and has a great set of data processing tools, both built-in and external. What is more, choosing Java for data science allows you to easily integrate solutions with existing software, and bring data science into production with less effort.
This book will teach you how to create data science applications with Java. First, we will review the most important things to consider when starting a data science application, and then brush up on the basics of Java and machine learning before diving into more advanced topics. We start by going over the existing libraries for data processing and libraries with machine learning algorithms. After that, we cover topics such as classification and regression, dimensionality reduction and clustering, information retrieval and natural language processing, and deep learning and big data.
Finally, we finish the book by talking about the ways to deploy the model and evaluate it in production settings.
Style and approach
This is a practical guide where all the important concepts such as classification, regression, and dimensionality reduction are explained with the help of examples.
Working with Text - Natural Language Processing and Information Retrieval
- Basics of information retrieval
- Indexing and searching with Apache Lucene
- Basics of natural language processing
- Unsupervised models for texts - dimensionality reduction, clustering, and word embeddings
- Supervised models for texts - text classification and learning to rank
Natural Language Processing and information retrieval
Vector Space Model - Bag of Words and TF-IDF
(because, 1), (data, 1), (for, 1), (java, 2), (science, 1), (use, 1), (we, 2)
(development, 1), (enterprise, 1), (for, 1), (good, 1), (java, 1)
- First, we tokenize the text, that is, convert it into a collection of individual tokens.
- Then, we remove function words such as is, will, to, and others. They are often used for linking purposes only and do not carry any significant meaning. These words are called stop words.
- Sometimes we also convert tokens to some normal form. For example, we may want to map cat and cats to cat because the concept is the same behind these two different words. This is achieved through stemming or lemmatization.
- Finally, we compute the weight of each token and put them into the vector space.
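The first three steps above, plus the counting that feeds the weighting step, can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the book's implementation: the class name `BagOfWords`, the tiny stop-word list, and the crude "strip a trailing s" rule standing in for real stemming or lemmatization are all hypothetical choices made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

class BagOfWords {
    // Tiny illustrative stop-word list (an assumption); real lists are much longer.
    static final Set<String> STOP_WORDS = Set.of("is", "will", "to", "a", "the", "and");

    // Step 1: lowercase the text and split it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // Step 3: crude stand-in for stemming (an assumption): drop a trailing "s"
    // from tokens longer than three characters, so "cats" becomes "cat".
    // It mangles words such as "class"; a real stemmer handles those cases.
    static String normalize(String token) {
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Steps 1-3 combined, followed by counting how often each normalized token occurs.
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys for stable output
        for (String token : tokenize(text)) {
            if (STOP_WORDS.contains(token)) {
                continue; // Step 2: skip stop words
            }
            counts.merge(normalize(token), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {cat=2, chase=1}
        System.out.println(bagOfWords("The cat will chase cats"));
    }
}
```

The resulting map of token counts is the bag-of-words representation; the weighting step then replaces the raw counts with weights.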
weight(t) = tf(t) * idf(t)
- tf(t): This is a function of the number of times the token t occurs in the text
- idf(t): This is a function of the number of documents that contain the token
A common concrete choice for these two functions is:
- tf(t): This is the number of times t occurs in the document
- idf(t) = log(N / df(t)): Here, df(t) is the number of documents that contain t, and N is the total number of documents
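The formula weight(t) = tf(t) * idf(t) with idf(t) = log(N / df(t)) can be computed directly from tokenized documents. The sketch below assumes the documents are already tokenized, uses the natural logarithm (the text does not fix the base), and expects every weighted document to belong to the corpus used for df(t); the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class TfIdf {
    // df(t): the number of documents that contain token t at least once.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures each token is counted once per document.
            for (String token : new HashSet<>(doc)) {
                df.merge(token, 1, Integer::sum);
            }
        }
        return df;
    }

    // weight(t) = tf(t) * log(N / df(t)) for every token of one document.
    // Assumes doc is part of the corpus, so df contains all of its tokens.
    static Map<String, Double> weights(List<String> doc, Map<String, Integer> df, int n) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : doc) {
            tf.merge(token, 1, Integer::sum); // tf(t): raw count in this document
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> entry : tf.entrySet()) {
            double idf = Math.log((double) n / df.get(entry.getKey()));
            result.put(entry.getKey(), entry.getValue() * idf);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two made-up, already tokenized documents.
        List<List<String>> docs = List.of(
                List.of("java", "data", "science", "java"),
                List.of("java", "enterprise", "development"));
        Map<String, Integer> df = documentFrequencies(docs);
        for (List<String> doc : docs) {
            System.out.println(weights(doc, df, docs.size()));
        }
    }
}
```

In this toy corpus, "java" occurs in both documents, so its idf is log(2 / 2) = 0 and its weight is 0 regardless of how often it appears; tokens found in only one document get idf = log(2). This is the intended effect of idf: tokens that appear everywhere do not help to tell documents apart.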
Vector space model implementation
Path path = Paths.get("data/simple-text.txt");
List&l...