Practical Machine Learning with Spark
eBook - ePub

Uncover Apache Spark's Scalable Performance with High-Quality Algorithms Across NLP, Computer Vision and ML

Gourav Gupta, Dr. Manish Gupta, Dr. Inder Singh Gupta

About This Book

Explore the cosmic secrets of Distributed Processing for Deep Learning applications.

Key Features
  • In-depth practical demonstration of ML/DL concepts using a distributed framework.
  • Covers graphical illustrations and visual explanations for ML/DL pipelines.
  • Includes a live codebase for each of the NLP, computer vision, and machine learning applications.

Description
This book provides the reader with an up-to-date explanation of Machine Learning and an in-depth, comprehensive, and straightforward understanding of the architectural techniques used to evaluate and anticipate futuristic insights from data using Apache Spark.
The book walks readers through setting up Hadoop and Spark installations on-premises, on Docker, and on AWS. Readers will learn about Spark MLlib and how to utilize it in supervised and unsupervised machine learning scenarios. With the help of Spark, some of the most prominent technologies, such as natural language processing and computer vision, are evaluated and demonstrated in a realistic setting. Using the capabilities of Apache Spark, this book discusses the fundamental components that underlie each of these natural language processing, computer vision, and machine learning technologies, as well as how you can incorporate them into your business processes.
Towards the end of the book, readers will learn about several deep learning frameworks, such as TensorFlow and PyTorch, and will learn to execute distributed processing of deep learning problems using the Spark framework.

What you will learn
  • Learn how to get started with machine learning projects using Spark.
  • See how to use Spark MLlib's design for machine learning and deep learning operations.
  • Use Spark in tasks involving NLP, unsupervised learning, and computer vision.
  • Experiment with Spark in a cloud environment and with AI pipeline workflows.
  • Run deep learning applications on a distributed network.

Who this book is for
This book is valuable for data engineers, machine learning engineers, data scientists, data architects, business analysts, and technical consultants worldwide. It would be beneficial to have some familiarity with the fundamentals of Hadoop and Python.

Table of Contents
1. Introduction to Machine Learning
2. Apache Spark Environment Setup and Configuration
3. Apache Spark
4. Apache Spark MLlib
5. Supervised Learning with Spark
6. Un-Supervised Learning with Apache Spark
7. Natural Language Processing with Apache Spark
8. Recommendation Engine with Distributed Framework
9. Deep Learning with Spark
10. Computer Vision with Apache Spark

Information

Year: 2022
ISBN: 9789391392086

CHAPTER 1

Introduction to Machine Learning

“Field of study that gives computers the capability to learn without being explicitly programmed.”
— Arthur Samuel

Introduction

Over the last two decades, there has been incessant advancement in Artificial Intelligence (AI) and its related sub-branches such as Machine Learning (ML), Statistical Modelling (SM), and Deep Learning (DL). These technologies power many applications that improve people's lives and day-to-day needs across domains such as bioinformatics, radiology, agriculture, finance, astronomy, banking, healthcare, geo-informatics, seismology, and space exploration. ML extends this capability by augmenting manual operations and enabling machines to learn automatically by understanding and observing key historical experiences. The main objective of this book is to educate readers about the fundamentals, advancements, and real-life applications of ML using a distributed framework.
Furthermore, this chapter gives in-depth knowledge about the journey and taxonomy of AI. The term AI refers to a prototype that imitates intelligent behavior by understanding meaningful information, patterns, or inputs. For example, self-driving cars use AI, especially vision-based technology, to teach a model to make insightful decisions by mimicking and understanding intelligent behaviors or inputs; such models are ideal examples of AI. A report shared by Gartner in 2019 depicts that Intelligent Systems (IS) and their related verticals will become an epicenter and one of the most decisive emerging technologies in the coming years. In the future, many tedious problems will be resolved with the help of AI and ML, and across the globe this has become a subject of interest among researchers, data scientists, data analysts, industry experts, and academicians for mitigating herculean real-time problems. This chapter also provides rigorous knowledge about the evolution of ML, the types of ML, and its emerging applications with their futuristic scope. In addition, a compendious discussion on DL in connection with AI applications is included in this chapter.

Structure

In this chapter, we will discuss the following topics:
  • Evolution of machine learning
  • Fundamentals and definition of machine learning
  • Types of machine learning algorithms
  • Application of machine learning
  • Future of machine learning

Objectives

After studying this chapter, readers will be able to:
  • Learn about the history of machine learning.
  • Get an understanding of the modern definition of machine learning.
  • Grasp the different types of machine learning and their algorithms.
  • Understand the applications of machine learning in various fields.
  • Know the future scope of machine learning.

Evolution of Machine Learning

The origins of AI and ML are interconnected. Hence, to give readers a solid foundation, a detailed history of ML and AI is presented in this section. However, the primary objective of this book is to make readers conversant with practical, real-time scenarios of ML with Apache Spark.
The term 'Machine Learning' first came into existence in 1952 after the distinguished work of the American engineer Arthur Samuel. From 1949 to the late 1960s, he carried out pioneering research on teaching a computer to make decisions on its own by feeding instructions into it. In the early 1950s, he developed an alpha-beta pruning program that used a scoring function to measure the winning chances of each side in two-player games, such as checkers, on computers with limited memory. He then refined the program using the minimax strategy along with mechanisms such as "rote learning". In 1952, Samuel was the first to introduce the term "Machine Learning". Thereafter, in 1957, Frank Rosenblatt of the Cornell Aeronautical Laboratory combined Donald Hebb's model of a brain cell with Samuel's machine learning concepts to design the perceptron, the first neural network for computers. The perceptron algorithm was first implemented in a machine named the Mark I Perceptron, built on IBM 704 hardware. It was used for image recognition applications but still had limitations in recognizing facial patterns.
In the 1960s, a new direction was introduced with the use of multiple layers in neural networks (NN), thereby providing an enhanced capability to solve complex problems with better precision. This multi-layer theory opened up many new possibilities for further improving neural network learning through feedforward and backpropagation neural networks.
In 1967, the nearest neighbor algorithm came into existence for basic pattern recognition and was applied to finding more efficient routes for traveling salespersons. In 1970, the backpropagation algorithm was developed to adjust networks with hidden layers of neurons in order to minimize errors. This algorithm is used to train Deep Neural Networks (DNNs).
During the 1970s and 1980s, AI researchers and computer scientists worked together on neural network research, while some researchers and engineers started working on ML as a new direction. By the early 1980s, ML and AI had taken separate paths: AI mainly focused on logical and knowledge-based approaches, while ML focused on neural network-based algorithms.
In the 1990s, ML flourished because of the availability of the large amounts of data shared over the Internet. In 1990, Robert Schapire developed the boosting algorithm to reduce bias in supervised learning by boosting weak learners. In boosting, a set of weak learners, that is, classifiers that are only weakly correlated with the true classification, is combined to create a single strong learner; many simple models (weak learners) are combined to generate the final result. There are many types of boosting algorithms, such as AdaBoost, BrownBoost, LPBoost, MadaBoost, TotalBoost, xqBoost, LogitBoost, and AnyBoost. A detailed study of the various types of boosting algorithms is presented later in this chapter, and a brief illustrative sketch of boosting with Spark MLlib follows below.
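As a hedged, minimal sketch (not code from this book), the following shows how a boosted ensemble of weak decision trees can be trained with Spark MLlib's GBTClassifier; the toy data, column names, and parameter values are assumptions made purely for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.appName("BoostingSketch").getOrCreate()

# Toy labelled data: (label, feature vector); values are invented for the example.
train = spark.createDataFrame(
    [(0.0, Vectors.dense(1.0, 0.5)),
     (1.0, Vectors.dense(3.2, 1.7)),
     (0.0, Vectors.dense(0.8, 0.3)),
     (1.0, Vectors.dense(2.9, 2.1))],
    ["label", "features"],
)

# Each boosting iteration fits a shallow (weak) tree on the errors of the
# ensemble built so far and then adds it to the ensemble.
gbt = GBTClassifier(labelCol="label", featuresCol="features",
                    maxDepth=2, maxIter=20)
model = gbt.fit(train)
model.transform(train).select("label", "prediction").show()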
Next, in 1996, IBM's chess-playing computer "Deep Blue" won its first game against the world champion Garry Kasparov. The Deep Blue computer used custom-built Very Large-Scale Integration (VLSI) chips for executing the alpha-beta algorithm. In 1997, Jurgen Schmidhuber and Sepp Hochreiter designed the neural network model named Long Short-Term Memory (LSTM), which later became widely used for training speech recognition systems. An LSTM consists of memory cells with input and output gates and was designed to mitigate the vanishing gradient problem. In 2006, face recognition algorithms were tested on 3D face scans, face images, and iris images, and proved more accurate than earlier facial recognition algorithms.
In the same year, the Canadian computer scientist Geoffrey Hinton introduced the term Deep Learning (DL) and developed a fast, greedy unsupervised learning algorithm for recognizing text and objects in digital images and videos.
In 2011, the deep learning research team at Google, also known as "Google Brain", developed a large-scale deep learning software system named DistBelief for learning and categorizing objects in a way similar to how a person does. A year later, the Google X team trained ML algorithms on a cluster of about 16,000 processors to automatically identify images of cats in YouTube videos.
In 2014, the Facebook research team came up with a facial recognition system known as DeepFace for recognizing human faces in digital images using DL. In 2015, Microsoft released a toolkit for solving ML problems in a distributed manner across multiple computers. In 2016, the Google DeepMind team developed AlphaGo, which mastered Go, one of the most complex board games.
Next, in 2017, Google released Google Brain's second-generation system, TensorFlow version 1.0.0, which can run on both the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU) for general-purpose computing. More recently, during 2018 and 2019, Google released TensorFlow.js version 1.0 for ML in JavaScript, TensorFlow 2.0, and TensorFlow Graphics for DL in computer graphics.

Fundamentals and Definition of Machine Learning

This section focuses on creating a solid foundation in ML, starting from its initial definition through to its modern definition, along with the basic terminology essential for grasping its fundamentals. As discussed previously, ML has been adapting and expanding its functionality into almost every automation-related job, so the authors have paid extra attention to the core concepts to strengthen the reader's knowledge of ML. It is also necessary to walk through the journey of ML, covering its importance and the traditional and modern approaches to training, validating, and testing a model on a dataset. This book helps readers stay up to date on the real-time challenges, and their respective solutions, encountered in intelligence- and analytics-based organizations.
Figure 1.1 depicts the branches of Artificial Intelligence, such as Machine Learning, Neural Networks, and Deep Learning. ML draws on different types of learning concepts, such as Supervised Learning (SL), Semi-Supervised Learning (SSL), Unsupervised Learning (USL), and Reinforcement Learning (RL).
Figure 1.1: Artificial Intelligence with its derived technologies
In an NN, a special collection of algorithms is used for training, validating, and testing on patterns or inputs by leveraging artificial neurons that work like the neurons of a human brain. For example, the conversion of voice to text uses an NN as its backbone; Amazon Alexa, Apple Siri, and Google Home are well-known examples of smart personal assistants. On the flip side, the term DL refers to the combination of two or more hidden layers for processing complex problems with high precision. Generally, DL is like an NN; the difference is that DL makes it easier to customize complex neural architectures and to handle cumbersome models. These days, various DL and NN frameworks, such as Keras, Caffe, and TensorFlow, are available for getting an immediate taste of such analytic platforms; a minimal illustrative sketch follows this paragraph.
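As a hedged illustration (not code from this book), the following minimal TensorFlow/Keras sketch shows what a network with two hidden layers looks like in practice; the layer sizes and the random toy data are assumptions made purely for this example.

import numpy as np
import tensorflow as tf

# A small feedforward network with two hidden layers, i.e. a minimal "deep" model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),  # hidden layer 1
    tf.keras.layers.Dense(8, activation="relu"),                     # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),                  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random toy data, used only to demonstrate the training/validation flow.
X = np.random.rand(200, 4).astype("float32")
y = (X.sum(axis=1) > 2.0).astype("float32")
model.fit(X, y, epochs=5, validation_split=0.2, verbose=0)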
In the following list, the reader will learn the basic terminology that is essential for understanding the concepts of ML:
  • Features or Attributes or Variables: These are the unique, key, measurable characteristics of the data to be fed into the system for training and testing a model. For ML algorithms, these features are used as inputs or outputs. For recognizing a human face, associated features such as gender, age, height, lip shape, face shape, color, and so on are used as the decisive attributes.
  • Feature Vector or Tuple: A group of important features listed in a vector or tuple format for training a model (see the Spark MLlib sketch after this list).
  • Model: A specific representation learned from the data using an ML algorithm. There are three types of models in ML: supervised, unsupervised, and reinforcement models. Building a model consists of three important phases: training, validation, and testing.
  • Dataset: A set of informatio...
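As a hedged, minimal sketch (not the authors' code), the following shows how the feature-vector idea above maps onto Spark MLlib: individual attribute columns are assembled into a single vector column that ML algorithms consume; the column names and values are assumptions made purely for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("FeatureVectorSketch").getOrCreate()

# Toy rows with three illustrative attributes (age, height_cm, weight_kg).
df = spark.createDataFrame(
    [(25, 172.0, 68.0), (31, 165.0, 59.0), (42, 180.0, 82.0)],
    ["age", "height_cm", "weight_kg"],
)

# Assemble the individual attribute columns into one feature vector per row.
assembler = VectorAssembler(
    inputCols=["age", "height_cm", "weight_kg"], outputCol="features"
)
assembler.transform(df).select("features").show(truncate=False)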
