Mastering Machine Learning on AWS

Advanced machine learning in Python using SageMaker, Apache Spark, and TensorFlow

Dr. Saket S.R. Mengle, Maximo Gurmendez


About This Book

Gain expertise in ML techniques with AWS to create interactive apps using SageMaker, Apache Spark, and TensorFlow.

Key Features

  • Build machine learning apps on Amazon Web Services (AWS) using SageMaker, Apache Spark, and TensorFlow
  • Learn model optimization and understand how to scale your models using simple and secure APIs
  • Develop, train, tune, and deploy neural network models to accelerate model performance in the cloud

Book Description

AWS is constantly driving new innovations that empower data scientists to explore a variety of machine learning (ML) cloud services. This book is your comprehensive reference for learning and implementing advanced ML algorithms in AWS cloud.

As you go through the chapters, you'll gain insights into how these algorithms can be trained, tuned, and deployed in AWS using Apache Spark on Elastic MapReduce (EMR), SageMaker, and TensorFlow. While you focus on algorithms such as XGBoost, linear models, factorization machines, and deep nets, the book will also provide you with an overview of AWS, as well as detailed practical applications that will help you solve real-world problems. Every practical application includes a series of companion notebooks with all the necessary code to run on AWS. In the next few chapters, you will learn to use SageMaker and EMR Notebooks to perform a range of tasks, from smart analytics and predictive modeling through to sentiment analysis.

By the end of this book, you will be equipped with the skills you need to effectively handle machine learning projects and implement and evaluate algorithms on AWS.

What you will learn

  • Manage AI workflows by using AWS cloud to deploy services that feed smart data products
  • Use SageMaker services to create recommendation models
  • Scale model training and deployment using Apache Spark on EMR
  • Understand how to cluster big data through EMR and seamlessly integrate it with SageMaker
  • Build deep learning models on AWS using TensorFlow and deploy them as services
  • Enhance your apps by combining Apache Spark and Amazon SageMaker

Who this book is for

This book is for data scientists, machine learning developers, deep learning enthusiasts, and AWS users who want to build advanced models and smart applications on the cloud using AWS and its integration services. Some understanding of machine learning concepts, Python programming, and AWS will be beneficial.


Section 1: Machine Learning on AWS

The objective of this section is to introduce readers to machine learning in the context of AWS cloud computing and services. We expect our audience to have some basic knowledge of machine learning; nevertheless, we'll describe what a typical, successful machine learning project looks like and the challenges that are often faced along the way. We will provide an overview of the different AWS services, along with examples of typical machine learning pipelines and the key aspects to consider in order to create smart, AI-powered products.
This section contains the following chapter:
  • Chapter 1, Getting Started with Machine Learning for AWS

Getting Started with Machine Learning for AWS

In this book, we focus on all three aspects of data science by explaining machine learning (ML) algorithms in business applications, demonstrating how they can be implemented in a scalable environment, and examining how to evaluate models and present evaluation metrics as business key performance indicators (KPIs). This book shows how Amazon Web Services (AWS) ML tools can be effectively used on large datasets. We present various scenarios where mastering ML algorithms in AWS helps data scientists to perform their jobs more effectively.
Let's take a look at the topics we will cover in this chapter:
  • How AWS empowers data scientists
  • Identifying candidate problems that can be solved using ML
  • The ML project life cycle
  • Deploying models

How AWS empowers data scientists

The number of digital data records stored on the internet has grown enormously over the last decade. Thanks to falling storage costs and new sources of digital data, it is predicted that the amount of digital data stored in 2025 will reach 163 zettabytes (163,000,000,000 terabytes). Moreover, the rate at which data is generated keeps accelerating, with almost 90% of today's data having been generated during the last two years alone. With more than 3.5 billion people having access to the internet, this data is generated not only by professionals and large companies, but also by each of those 3.5 billion internet users.
Moreover, since companies understand the importance of data, they store all of their transactional data in the hope of analyzing it and uncovering interesting trends that could help them make important business decisions. Financial investors likewise want to store and understand every bit of information they can get about companies, and they rely on their quantitative analysts, or quants, to turn that information into investment decisions.
It is up to the data scientists of the world to analyze this data and find the gems of information embedded in it. In the last decade, the data science team has become one of the most important teams in many organizations. When data science teams were first created, most of the data would fit in Microsoft Excel sheets, and the task was to find statistical trends in the data and provide actionable insights to business teams. However, as the amount of data has increased and ML algorithms have become more sophisticated and powerful, the scope of data science teams has expanded.
In the following diagram, we can see the three basic skills that a data scientist needs:
The job description for data scientists varies from company to company. However, in general, a data scientist needs the following three crucial skills:
  • ML: ML algorithms provide tools for analyzing and learning from large amounts of data, and for generating predictions or recommendations from that data. They are important for analyzing structured data (such as databases) and unstructured data (such as text documents), and for inferring actionable insights from both. Since a data scientist has access to a large library of algorithms that could solve a given problem, they should be familiar with a wide range of ML algorithms and understand which one to apply in each situation.
  • Computer programming: A data scientist should be an adept programmer, able to write code that uses various ML and statistical libraries. Languages such as Scala, Python, and R provide many libraries that let us apply ML algorithms to a dataset, and knowledge of such tools helps a data scientist perform complex tasks within a feasible time frame. This is crucial in a business environment. A short sketch of what this looks like in practice follows this list.
  • Communication: Along with discovering trends in the data and building complex ML models, a data scientist is also tasked with explaining these findings to business teams. Hence, a data scientist must possess not only good communication skills, but also good analytical and visualization skills. These help them present complex data models in a way that is easily understood by people who are not familiar with ML, convey their findings to business teams, and provide guidance on expected outcomes.
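
To make the computer programming point above concrete, the following is a minimal sketch of how a few lines of Python can train and apply an ML algorithm using scikit-learn. The file name, column names, and choice of algorithm are illustrative assumptions, not something prescribed at this point in the book.

```python
# A minimal sketch: train and evaluate a classifier with scikit-learn.
# The CSV file, column names, and algorithm choice are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a (hypothetical) dataset with feature columns and a binary 'label' column.
data = pd.read_csv('customers.csv')
X = data.drop(columns=['label'])
y = data['label']

# Hold out part of the data to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print('Test accuracy:', accuracy_score(y_test, model.predict(X_test)))
```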

Using AWS tools for ML

ML research spans decades and has deep roots in mathematics and statistics. ML algorithms can be used to solve problems in many business applications. In advertising, predictive algorithms are used to identify likely new customers based on trends among previous purchasers. Regression algorithms are used to predict stock prices based on prior trends. Services such as Netflix use recommendation algorithms to study a user's history and improve the discoverability of new shows that they may be interested in. Artificial Intelligence (AI) applications such as self-driving cars rely heavily on image-recognition algorithms that use deep learning to discover and label objects on the road. It is important for a data scientist to understand the nuances of different ML algorithms and where they should be applied. Using pre-existing libraries helps a data scientist explore various algorithms for a given application area and evaluate them. AWS offers a large number of libraries that can be used to perform ML tasks, as explained in the ML algorithms and deep learning algorithms parts of this book.

Identifying candidate problems that can be solved using ML

It is also important for data scientists to understand the scale of the data that they are working with. A medical research task might span thousands of patients with hundreds of features, and can be processed on a single machine. However, tasks such as advertising, where companies collect several petabytes of data on customers from every online advertisement served to users, may require several thousand machines to compute and train ML algorithms. Deep learning algorithms are GPU-intensive and require a different type of machine than other ML algorithms. In this book, for each algorithm, we first describe how it can be implemented simply using Python libraries, and then how it can be scaled to large AWS clusters using technologies such as Spark and AWS SageMaker. We also discuss how TensorFlow is used for deep learning applications.
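
To give a flavor of the cluster-scale side of this contrast, here is a minimal PySpark sketch of the kind of code the Spark-on-EMR chapters build toward. The S3 path and column names are placeholder assumptions; the actual chapters work through the details.

```python
# A sketch of distributed training with Spark MLlib, as one might run on an EMR cluster.
# The S3 path and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName('scaling-example').getOrCreate()

# Read a large dataset directly from S3; Spark partitions it across the cluster.
df = spark.read.csv('s3://my-bucket/training-data/', header=True, inferSchema=True)

# Assemble the raw feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=['feature1', 'feature2'], outputCol='features')
train_df = assembler.transform(df)

# Fit a logistic regression model; the work is parallelized across executors.
lr = LogisticRegression(featuresCol='features', labelCol='label')
model = lr.fit(train_df)
```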
It is also crucial for data scientists to understand who the customer of their ML work is. Although it is challenging to find which algorithm works best for a specific application area, it is just as important to gather evidence on how that algorithm improves the application and to present that evidence to the product owners. Hence, we also discuss how to evaluate each algorithm and visualize the results where necessary. AWS offers a large array of tools for evaluating ML algorithms and presenting the results.
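
As a small example of evaluating and presenting results, the sketch below (continuing the scikit-learn sketch from earlier in this chapter) computes an AUC score and plots an ROC curve with matplotlib. The metric and plot are illustrative choices, not the book's prescribed evaluation workflow.

```python
# A sketch of evaluating a binary classifier and visualizing the result.
# Assumes model, X_test, and y_test come from the earlier scikit-learn sketch.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

scores = model.predict_proba(X_test)[:, 1]   # probability of the positive class
auc = roc_auc_score(y_test, scores)

fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='random baseline')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```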
Finally, a data scientist also needs to be able to decide which types of machines best fit their needs on AWS. Once an algorithm is implemented, there are important considerations regarding how it can be deployed on large clusters in the most economical way. AWS offers more than 25 hardware alternatives, called instance types, to choose from. We will discuss case studies of how applications are deployed on production clusters, and the various issues that a data scientist can face during this process.
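
Instance types are something you specify explicitly when launching training or hosting jobs. As a hedged sketch, this is roughly how an instance type is chosen with the SageMaker Python SDK (v2-style argument names); the container image, IAM role, S3 paths, and instance types below are placeholder assumptions, and argument names differ slightly between SDK versions.

```python
# A sketch of selecting instance types for training and hosting with the SageMaker Python SDK.
# The image URI, IAM role, S3 paths, and instance types are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri='<training-image-uri>',           # a built-in or custom algorithm container
    role='<execution-role-arn>',
    instance_count=2,                           # number of training machines
    instance_type='ml.m5.xlarge',               # hardware choice for training
    output_path='s3://my-bucket/model-output/',
    sagemaker_session=session,
)
estimator.fit({'train': 's3://my-bucket/training-data/'})

# Hosting can use a different (often cheaper) instance type than training.
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.t2.medium')
```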

The ML project life cycle

A typical ML project life cycle starts by understanding the problem at hand. Typically, someone in the organization (possibly a data scientist or business stakeholder) feels that some part of their business can be improved by the use of ML. For example, a music streaming company could conjecture that providing recommendations of songs similar to those played by a user would improve user engagement with the platform. Once we understand the business context and possible business actions to take, the data science team will need to consider several aspects during the project life cycle.
The following diagram describes various steps in the ML project life cycle:

Data gathering

We need to obtain data and organize it appropriately for the problem at hand (in our example, this could mean building a dataset linking users to the songs they've listened to in the past). Depending on the size of the data, we might pick different technologies for storing it. For example, it might be fine to train on a local machine using scikit-learn if we're working with a few million records. However, if the data doesn't fit on a single computer, then we must consider AWS solutions such as S3 for storage, and Apache Spark or SageMaker's built-in algorithms for model building.
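
As a small illustration of this storage decision, the sketch below contrasts reading a modest dataset from a local file with pulling objects from S3 once the data has outgrown one machine. The bucket name, keys, and file names are placeholder assumptions.

```python
# A sketch of the two data-gathering scenarios described above.
# Bucket names, keys, and file names are placeholders.
import boto3
import pandas as pd

# Small dataset: load it straight into memory on a single machine.
local_df = pd.read_csv('user_song_plays.csv')

# Larger dataset: keep it in S3 and download objects as needed
# (or point Spark or SageMaker's built-in algorithms at the S3 prefix).
s3 = boto3.client('s3')
s3.download_file('my-bucket', 'plays/2019/01/part-0000.csv', '/tmp/part-0000.csv')
chunk_df = pd.read_csv('/tmp/part-0000.csv')
```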

Evaluation metrics

Before applying an ML algorithm, we need to consider how to assess the effectiveness of our strategy. In some cases, we can use part of our data to simulate the performance of the algorithm. However, on other occasions, the only viable way to ev...
