Practical Full Stack Machine Learning
eBook - ePub

A Guide to Build Reliable, Reusable, and Production-Ready Full Stack ML Solutions

About this book

Master the ML process, from pipeline development to model deployment in production.

Key Features
• Prime focus on feature engineering, model exploration and optimization, DataOps, ML pipelines, and scaling ML APIs.
• A step-by-step approach that covers every data science task with maximum efficiency and performance.
• Access to advanced data engineering and ML tools such as Airflow, MLflow, and ensemble techniques.

Description
'Practical Full-Stack Machine Learning' introduces data professionals to a set of powerful, open-source tools and concepts required to build a complete data science project. The examples are written in Python, but the ML solutions are language-neutral and the concepts can be applied in other languages. The book covers data pre-processing, feature management, selecting the best algorithm, model performance optimization, exposing ML models as API endpoints, and scaling ML APIs. It shows how to use cookiecutter to create reusable project structures and templates, and explains DVC so that you can implement data version control and reap its benefits in ML projects. It also covers DASK and how to use it to create scalable solutions for data pre-processing tasks. KerasTuner, an easy-to-use, scalable hyperparameter optimization framework that solves the pain points of hyperparameter search, is covered as well. The book explains ensemble techniques such as bagging, stacking, and boosting, along with the ML-Ensemble framework for implementing ensemble learning easily and effectively. It also shows how to use Airflow to automate your ETL tasks for data preparation, explores MLflow, which allows you to train, reuse, and deploy models created with any library, and teaches how to use FastAPI to expose and scale ML models as API endpoints.

What you will learn
• Learn how to create reusable machine learning pipelines that are ready for production.
• Implement scalable solutions for data pre-processing tasks using DASK.
• Experiment with ensemble techniques such as bagging, stacking, and boosting.
• Learn how to use Airflow to automate your ETL tasks for data preparation.
• Learn MLflow for training, reuse, and deployment of models created with any library.
• Work with cookiecutter, KerasTuner, DVC, FastAPI, and much more.

Who this book is for
This book is geared toward data scientists who want to become more proficient in the entire process of developing ML applications from start to finish. Familiarity with the fundamentals of machine learning and Keras programming is an essential prerequisite.

Table of Contents
1. Organizing Your Data Science Project
2. Preparing Your Data Structure
3. Building Your ML Architecture
4. Bye-Bye Scheduler, Welcome Airflow
5. Organizing Your Data Science Project Structure
6. Feature Store for ML
7. Serving ML as API


CHAPTER 1

Organizing Your Data Science Project

You don't need to worry about organizing your project if:
  • You are the only person working on the project.
  • Your first trained model meets the project requirements.
  • Your data does not require pre-processing at all.
If any of the preceding points is not true, then you need to think about organizing your project effectively. A well-organized project enables better collaboration, repeatability, and re-usability. Project organization is not just about the code base but about the environment as well. Not everyone can afford environments like those at Google, Amazon, and Facebook. The cloud is not always the best answer, despite what various cloud vendors' marketing suggests. If you are running model training continuously for weeks, buying a GPU machine is a lot cheaper than doing it in the cloud.

Structure

In this chapter, we will discuss the following topics:
  • Project folder and code organization
  • GPU 101
  • On-premises vs. cloud
  • Deciding your framework
  • Deciding your targets
  • Baseline preparation
  • Managing workflow

Objective

After studying this chapter, you will be able to:
  • Set up the project code base effectively. We will explore a library called cookiecutter that simplifies and standardizes the process a lot.
  • Understand the best practices for selecting a GPU. The options on the market are overwhelming and quite confusing.
  • Learn the best practices for deciding on infrastructure: cloud or on-premises.
  • Select the framework (think TensorFlow, PyTorch, etc.) and hardware.
  • Apply the tools for workflow management setup. We will explore the Sacred and Omniboard projects to keep track of experiments.
  • Decide on the target and define the metrics around it.
  • Define your baseline.

1.1 Project folder and code organization

A good folder structure separates data processing, model definition, and model training. The following is a good method to organize the folders:
Figure 1.1: Folder and code organization
Let us go through each folder to understand its purpose:
  • data: The purpose of the data folder is to organize the data sources. It has two subfolders, which are as follows:
    • raw – This is the original, immutable data dump. This data should never be modified and should be the single source of truth.
    • processed – These are the final, canonical data sets for modelling or training. This will be the data that is cleaned, transformed, and extended to make it amenable for training.
    With this organization, you can immediately see some benefits such as:
    • Data becomes immutable via the raw folder.
    • We don't need to prepare the data every time due to the processed folder.
  • docs: This contains project documentation. We don't need to convince you about the value of project documentation. We recommend using sphinx (https://www.sphinx-doc.org/en/master/) for documentation.
  • weights: The weights folder stores the trained, serialized models, model predictions, or model summaries.
  • src: This contains all the Python scripts required for your project. We recommend splitting it across 3 folders.
  • datasets: This folder contains Python scripts to process the data.
  • network: This folder has scripts to define the architectures. Only the computational graph is defined.
  • model: This folder has the script to handle everything else needed in addition to the network.
  • experiments: This folder contains the parameter/hyperparameters combinations that you would like to try.
  • api: This folder exposes the model through a REST endpoint.
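As a quick illustration, the following sketch (not from the book) shows one plausible way to script this layout with pathlib; the exact set of folder names is an assumption based on the descriptions above, so adjust it to your own conventions.
# scaffold.py - minimal sketch that creates the folder layout described above.
from pathlib import Path

FOLDERS = [
    "data/raw",        # original, immutable data dump
    "data/processed",  # cleaned, transformed data ready for training
    "docs",            # project documentation (e.g., sphinx)
    "weights",         # trained, serialized models and model summaries
    "src/datasets",    # scripts that process the data
    "src/network",     # scripts that define the computational graph only
    "src/model",       # loss, optimizer, and training code around the network
    "experiments",     # hyperparameter combinations to try
    "api",             # REST endpoint that exposes the model
]

for folder in FOLDERS:
    Path(folder).mkdir(parents=True, exist_ok=True)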
Let us go one level further down and look at some of the files. We have kept the number of files minimal for brevity.
Figure 1.2: Folder structure with sample files
This is a condensed view of the structure to help focus on the idea. Let us explore some of the folders. The data/raw folder has a folder for MNIST (handwritten digit dataset) and some proprietary dataset. Let's call it "my-dataset." Essentially, each different dataset is maintained in its own respective folder.
The separation isolates different structures; for example, one image dataset may list its class names in a CSV while the actual images are dumped in a single folder, whereas a second dataset collected from somewhere else may encode the class name in the file name itself. It is worth reiterating that this data should never be changed.
The data/processed folder should hold the processed data; in this case, it could be normalized data ready for training. The models folder, as the name suggests, contains the trained models.
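As a hypothetical illustration of the raw/processed split, the sketch below (not from the book) plays the role of a src/datasets script: it loads the MNIST dump, normalizes it, and writes a training-ready copy under data/processed. The file name and the use of keras.datasets as a stand-in for data/raw are assumptions for illustration only.
# prepare_mnist.py - hypothetical sketch of a src/datasets script: read the raw dump
# (keras.datasets used here as a stand-in for data/raw), normalize it, and write the
# training-ready copy under data/processed.
from pathlib import Path
import numpy as np
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
Path("data/processed").mkdir(parents=True, exist_ok=True)
np.savez_compressed(
    "data/processed/mnist.npz",
    x_train=x_train.astype("float32") / 255.0,  # scaled to [0, 1], ready for training
    y_train=y_train,
    x_test=x_test.astype("float32") / 255.0,
    y_test=y_test,
)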
For deep learning projects, it further helps to separate the core network from everything else. The network folder defines the network architectures used. Think of it as a block which only defines the computational graph, without caring about input and output shapes, model losses, or training methodology. The loss function and the optimization function are managed by the model. How does this help?
Imagine that you want to perform the following experiments:
Figure 1.3: Hyperparameters values grid
A scalable way to try all these experiments without modifying your network is to write a separate script; let's call it "try-experiment". A Python-style pseudocode call to run one experiment would look as follows:
python try-experiment '{network: MLP, epochs: 500, learning_rate: 0.01, dropout: 0.9, batch_size: 128}'
Put all these "try-experiment" calls in a shell script and you have a clean automated way to try different parameters. The credit for this idea goes to Josh Tobin (http://josh-tobin.com/).
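To make the idea concrete, here is a minimal, hypothetical sketch of such a script, assuming a Keras MLP trained on MNIST. The argument format and the split between a build_network function (the network folder's job) and the compile-and-fit code (the model folder's job) are illustrative assumptions, not the author's actual code.
# try_experiment.py - hypothetical sketch: the network definition stays untouched,
# only the hyperparameters passed on the command line change between runs.
import json
import sys
from tensorflow import keras

def build_network(name: str, dropout: float) -> keras.Model:
    # The "network" folder's job: define the computational graph only.
    if name == "MLP":
        return keras.Sequential([
            keras.layers.Flatten(input_shape=(28, 28)),
            keras.layers.Dense(128, activation="relu"),
            keras.layers.Dropout(dropout),
            keras.layers.Dense(10, activation="softmax"),
        ])
    raise ValueError(f"Unknown network: {name}")

def run_experiment(params: dict) -> None:
    # The "model" folder's job: loss, optimizer, and training methodology.
    network = build_network(params["network"], dropout=params["dropout"])
    network.compile(
        optimizer=keras.optimizers.Adam(learning_rate=params["learning_rate"]),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    (x_train, y_train), _ = keras.datasets.mnist.load_data()
    network.fit(x_train / 255.0, y_train,
                epochs=params["epochs"], batch_size=params["batch_size"])

if __name__ == "__main__":
    # Example: python try_experiment.py '{"network": "MLP", "epochs": 500,
    #          "learning_rate": 0.01, "dropout": 0.9, "batch_size": 128}'
    run_experiment(json.loads(sys.argv[1]))
Each call in the shell script then simply passes a different JSON payload.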
Wouldn't it be nice if the whole folder setup could be automated? Of course, this can be done via custom scripts, but then script maintenance becomes a problem. Is there any clean solution?
Cookiecutter is a CLI tool (https://cookiecutter.readthedocs.io/en/1.7.2/index.html) to create an application boilerplate from a template. It uses a templating system—Jinja2 (https://jinja.palletsprojects.com/)—to replace or customize folder and file names as well as file content.
The whole thing may sound complicated, but using and understanding it is very intuitive. Think of cookiecutter this way: earlier, we defined a bunch of folders and subfolders to organize our project. Now you want every new data science project to use that structure (template). So, you create a template from that structure, and then anyone can pass that template to the cookiecutter CLI to generate the same folder and file structure. The left-hand side of the figure below shows a template and the right-hand side shows the generated project.
Figure 1.4: Project generation from a template
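As a quick, hedged illustration of the generation step (the right-hand side of the figure), the snippet below uses cookiecutter's Python API; the equivalent CLI call is simply cookiecutter <template>. The template path and the project_name context key are hypothetical placeholders for your own template.
# generate_project.py - hypothetical sketch: render a new project from a template.
from cookiecutter.main import cookiecutter

cookiecutter(
    "path/to/your-ds-template",    # folder (or git URL) containing cookiecutter.json
    no_input=True,                 # don't prompt interactively
    extra_context={"project_name": "my-ml-project"},
)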
Let us build a template to get a hang of things:
  1. Create a folder that contains th...

Table of contents

  1. Cover Page
  2. Title Page
  3. Copyright Page
  4. About the Author
  5. About the Reviewer
  6. Acknowledgements
  7. Preface
  8. Errata
  9. Table of Contents
  10. 1. Organizing Your Data Science Project
  11. 2. Preparing Your Data
  12. 3. Building Your ML Architecture
  13. 4. Bye-Bye Scheduler, Welcome Airflow
  14. 5. Organizing Your Data Science Project Structure
  15. 6. Feature Store for ML
  16. 7. Serving ML as API
  17. Index