How to fine-tune a machine learning algorithm?
Fine-tuning refers to a technique in machine learning whose goal is to find the optimal hyperparameters of a model. Fine-tuning helps increase model performance and accuracy. Naturally, fine-tuning is performed on the training data and evaluated on validation or test data. Usually, before fine-tuning an algorithm, it is important to try several algorithms and select the best one; fine-tuning comes at the end of the training phase.
Note that fine-tuning also refers to an approach in transfer learning: training a neural network starting from the parameters of another, already trained network. This is done by initializing the new network with the weights of an existing model, usually one trained on a problem in the same domain.
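As a rough illustration of the transfer-learning sense, here is a minimal Keras sketch, assuming a hypothetical image task with five target classes; the choice of MobileNetV2, the input shape, and the layer sizes are illustrative, not a prescribed recipe.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Base network initialized with parameters learned on ImageNet (a related domain).
base = keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                      input_shape=(160, 160, 3))
base.trainable = False  # first train only the new head; later, unfreeze layers to fine-tune

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation="softmax"),  # new output layer for the hypothetical 5-class problem
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```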
Fine-tuning is the last step in the training phase, as it comes after trying multiple machine learning algorithms and selecting the best one. It is not a mandatory phase, since it is possible to build a machine learning model without fine-tuning it. However, if the goal is to increase accuracy, fine-tuning is the most effective way.
Fine-tuning can also be called hyperparameter optimization, and there are multiple techniques to perform it. Manual search relies on the data scientist's experience to select promising hyperparameters and refine them toward the optimal values. For example, a data scientist can decide to reduce the batch size when training a neural network to help it converge faster. Manual search is not the most efficient technique, but it can be combined with the others. Random search creates a grid of hyperparameters and tries different random combinations of them; it is usually combined with cross-validation, so each combination of hyperparameters is evaluated across the folds of the dataset. Grid search sets up a grid of hyperparameters and trains the model on every possible combination; the values included in the grid are often narrowed down from a prior random search. Bayesian optimization is considered the strongest of these techniques, as it uses probabilities to focus the search on the most promising regions of the hyperparameter space.
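A minimal sketch of random search and grid search with scikit-learn follows; the random forest model, the parameter grid, and the toy dataset are hypothetical placeholders for a real project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Toy data standing in for a real training set.
X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {"n_estimators": [100, 200, 500], "max_depth": [5, 10, None]}

# Random search: tries a fixed number of random combinations, each scored with cross-validation.
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_grid,
                                   n_iter=5, cv=5, random_state=42)
random_search.fit(X_train, y_train)

# Grid search: trains on every combination, typically over a grid narrowed down
# by the earlier random search.
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```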
So, fine-tuning is a set of techniques that can help improve performance. When it refers to hyperparameter tuning, it is used at the end of the training phase and can make the difference between a good model and a very good one. When it refers to transfer learning, it can help improve the performance of a deep neural network.
How to build a deep neural network architecture?
Today, deep learning is one of the most promising families of machine learning algorithms, especially for image recognition and unstructured data. A deep neural network is basically a neural network with more than two hidden layers. Hidden layers sit between the input layer and the output layer of a neural network, and their role is to learn features from the input. Increasing the number of hidden layers helps the neural network learn more complex features from the input data.
Building a state-of-the-art deep neural network first depends on the type of problem to be solved. An image classification problem does not require the same architecture as an anomaly detection or forecasting problem. In image classification, the most common layers are convolutional layers, as they are the most suitable for image inputs. In anomaly detection, it is preferable to use an encoder-decoder architecture, as the network deconstructs and reconstructs the input and flags any input that does not follow the general pattern.
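As a rough sketch of these two kinds of architecture in Keras, assuming hypothetical input shapes and layer sizes chosen purely for illustration:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical image classifier: convolutional layers suit image inputs.
cnn = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

# Hypothetical anomaly detector: an encoder-decoder reconstructs the input,
# and a high reconstruction error flags an anomaly.
autoencoder = keras.Sequential([
    layers.Input(shape=(30,)),
    layers.Dense(8, activation="relu"),    # encoder compresses the input
    layers.Dense(30, activation="linear"), # decoder reconstructs it
])
```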
One of the most frequently asked questions about building a deep neural network is how deep the network should be at the beginning of training. The ideal approach is to start with the smallest architecture possible, meaning the fewest layers possible, and then increase the number of layers until the model reaches the best possible performance.
Another frequently asked question is how to select activation functions. An activation function plays a crucial role in a deep learning model, as it introduces nonlinearity and allows the network to approximate complex relationships. To keep it simple, the most popular activation function for hidden layers is ReLU; it is the one that shows the best results in most cases. For the output layer, the selection depends on the type of problem being solved. For example, for a classification problem, the softmax activation function should be used, as it converts the outputs into probabilities. It is also worth occasionally trying other activation functions (tanh, Leaky ReLU, etc.) to see whether they increase performance and accuracy.
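The following Keras sketch shows these defaults in practice; the input size of 20 features, the three output classes, and the helper function are hypothetical choices made only to keep the example small.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hidden_activation="relu"):
    """Small classifier with a configurable hidden-layer activation."""
    return keras.Sequential([
        layers.Input(shape=(20,)),
        layers.Dense(64, activation=hidden_activation),  # ReLU is the usual default
        layers.Dense(32, activation=hidden_activation),
        layers.Dense(3, activation="softmax"),           # softmax turns outputs into class probabilities
    ])

model = build_model()              # baseline with ReLU
model_tanh = build_model("tanh")   # alternative activation to compare against the baseline
```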
Also note that every type of deep neural network architecture has its specific use cases. For example, a generative adversarial network (GAN) is an architecture that can be used to generate data. It can also be used for anomaly detection, but it is not suited to image classification or most other tasks. So, consider the possible neural network architectures as a toolbox and select the most appropriate one for the problem you are trying to solve.
How to train a machine learning algorithm faster?
Sometimes, training a machine learning algorithm can take a long time: days, even weeks or more. This can be due to the amount of training data or to the type and size of the algorithm used; training time is obviously also affected by the capabilities of your computer or server, such as memory and processor.
To make training faster, there are different techniques, such as using a GPU instead of a CPU. This switch speeds up computation, as a GPU can handle far more operations in parallel. Note that not all algorithms and frameworks support GPU computing; the most popular ones that do are neural network frameworks and XGBoost.
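As a hedged sketch with XGBoost: in recent versions (2.x) GPU training is requested with the `device` parameter, while older versions used `tree_method="gpu_hist"`; the toy dataset below is only a placeholder, and running it requires a CUDA-capable GPU and a GPU-enabled XGBoost build.

```python
import xgboost as xgb
from sklearn.datasets import make_classification

# Toy data standing in for a large training set.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# XGBoost 2.x: device="cuda" moves training to the GPU; older versions use tree_method="gpu_hist".
model = xgb.XGBClassifier(tree_method="hist", device="cuda", n_estimators=200)
model.fit(X, y)  # training runs on the GPU (requires a CUDA-capable GPU)
```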
Another technique for speeding up computation is parallel computing. This can be done in multiple ways: by parallelizing the data or by parallelizing the model. For example, data parallelism can be achieved by using a cluster of machines with the support of a framework like Spark MLlib.
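Here is a minimal Spark MLlib sketch of data-parallel training; the session runs locally and the tiny in-memory dataset, column names, and choice of logistic regression are hypothetical stand-ins for a real distributed dataset on a cluster.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Local session for illustration; on a real cluster the data would be partitioned
# across many executors and processed in parallel (data parallelism).
spark = SparkSession.builder.appName("parallel-training").getOrCreate()

# Tiny in-memory dataset standing in for a large distributed one.
df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 0.0), (5.0, 6.0, 1.0), (6.0, 5.0, 1.0)],
    ["f1", "f2", "label"],
)
train = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

# Each partition of the data contributes to training in parallel.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
```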
Another option is to switch to an algorithm with lower complexity, as the complexity of an algorithm plays a crucial role in training time. For example, a support vector machine is considered a complex algorithm, which means that its training time can grow very fast with the size of the data. So, on a large dataset, it is advisable to select a less complex algorithm.
The last option is simply to sample the data. On a large dataset, it is possible to draw a stratified sample, which helps the sample keep the original ratios and characteristics of the dataset. A decade ago, sampling was the most popular technique for reducing computing time; nowadays, GPU usage and parallelization are the most popular.
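A short sketch of stratified sampling with scikit-learn; the imbalanced toy DataFrame and its column names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 90% of rows labeled 0, 10% labeled 1.
df = pd.DataFrame({"feature": range(1000), "label": [0] * 900 + [1] * 100})

# Keep only 10% of the rows while preserving the 90/10 class ratio (stratification).
sample, _ = train_test_split(df, train_size=0.1, stratify=df["label"], random_state=42)
print(sample["label"].value_counts(normalize=True))  # still roughly 90/10
```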
Why do we normalize the input data in a deep neural network?
Normalization of the input is one of the best practices in deep neural networks. In general, normalizing the data speeds up learning and helps the network converge faster.
The data also becomes more suitable for the activation function, especially the sigmoid function. Imagine that the inputs are on different scales (not normalized): the weights of some inputs will receive much larger updates than the others, which can hurt learning. In addition, centering the data around zero provides both positive and negative values as inputs for the next layer, which makes learning more flexible.
Note that other types of transformations can achieve the same result as normalizing the input of a deep neural network, such as standardization, linear scaling of the input data, and so on.
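A small sketch of both transformations with scikit-learn; the two features on very different scales are hypothetical, and in a real project the scaler would be fit on the training set only and then applied to the test set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales (e.g., age in years, income in dollars).
X = np.array([[25, 40_000.0], [32, 120_000.0], [47, 60_000.0], [51, 95_000.0]])

# Normalization rescales each feature to the [0, 1] range.
X_norm = MinMaxScaler().fit_transform(X)

# Standardization rescales each feature to zero mean and unit variance,
# which also yields both positive and negative values.
X_std = StandardScaler().fit_transform(X)
print(X_norm)
print(X_std)
```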
When can we consider that we did a good job in a machine learning project?
In machine learning, it is always hard to evaluate whether a piece of work or a project has been done well. For a given problem, there are usually countless ways to obtain a good solution. A project can also be improved indefinitely, but you do not have an infinite amount of time to deliver it.
In general, the criteria for evaluating whether a job has been done well are that it is logical and follows best practices. Best practices means that the tools and techniques used have been approved by the community and are considered a standard in the industry.
When a data scientist delivers work, they should not be a perfectionist and should think like an engineer trying to solve a practical problem. As an engineer, the data scientist should be result-oriented, focusing on getting the best outcome in the shortest amount of time.
Usually, in a data science project, we apply a lean and agile style where the idea is to deliver a result fast and iterate to improve the work. This means that the data scientist has to update the work on a regular basis, improving it at each iteration.
Sometimes, discussing whether a job has been done well in machine learning can get confusing, because good accuracy does not necessarily mean that your work is good, and bad accuracy does not mean that your work is bad. This is directly related to the problem you are trying to solve: some problems are very hard, and it is almost impossible, due to the data, to reach good accuracy. This means that even if the accuracy is not strong, you may still have done great work.
So, when evaluating a machine learning job, the focus should be on the logic and reasoning behind the work rather than on the accuracy alone.
When should we use deep learning instead of the traditional machine learning models?
To understand when to use deep learning instead of traditional machine learning, it is important to understand the strengths of deep learning compared to traditional machine learning. Deep learning shows better results than traditional machine learning in image recognition, object detection, speech recognition, and natural language processing. This means that for any task involving unstructured data, it is better to use deep learning. This is because deep learning extracts features and patterns by itself, which makes it well adapted to unstructured data such as images.
To process an image with traditional machine learning, we would have to extract all the relevant features from the image before training, which is time-consuming and can be inaccurate. So, deep learning is preferred when it is hard to extract features from the data manually.
Deep learning also shows a strong advantage when we have a large amount of data, which is not the case for some traditional machine learning algorithms. With more data, deep learning keeps learning and improving its performance. So, when a large amount of data is available, deep learning is preferred over other machine learning techniques.
How much time does it take to become a good data scientist?
A good data scientist has a solid understanding of statistics, mathematics, computer science, and, of course, machine learning. A good data scientist is capable of tackling hard problems and finding an effective solution to many types of problems.
Becoming a good data scientist is a journey, as it takes continuously learning new techniques and updating your knowledge. It does not necessarily require a PhD, but it requires discipline and autonomy. It is also a matter of talent, since solving some problems requires good instincts.
Becoming a decent data scientist requires years of hard work, and becoming a good data scientist requires going a step beyond that.
How to evaluate the performance of a model?
Evaluating the performance of a model is one of the most important steps in a machine learning project, as it reveals whether the trained model is good enough to be deployed. To evaluate a model's performance, we use what is called a metric, either a visual metric or a mathematical one. These metrics are usually called performance metrics.
An evaluation metric is chosen based on the type of problem we are trying to solve: a classification problem, a regression problem, an unsupervised model, image recognition, and so on.
There are several types of evaluation metrics. Some of them are as follows:
- Classification problem: area under the curve (AUC), confusion matrix, accuracy, recall, precision, and F1-score.
- Regression problem: mean squared error, root mean squared error, mean absolute error, coefficient of determination (R-squared), and adjusted R-squared.
Each evaluation metric is unique and has its own strengths, so do not hesitate to use multiple evaluation metrics in the same project to evaluate the same machine learning model.
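A compact sketch of several of these metrics with scikit-learn; the label, prediction, and probability arrays are hypothetical toy values used only to make the calls runnable.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error,
                             precision_score, r2_score, recall_score, roc_auc_score)

# Hypothetical classification results: true labels, predicted labels, predicted probabilities.
y_true, y_pred, y_score = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1], [0.2, 0.9, 0.4, 0.1, 0.8]
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print(roc_auc_score(y_true, y_score))  # area under the ROC curve

# Hypothetical regression results.
y_reg, y_reg_pred = [3.0, 5.0, 2.5], [2.8, 5.4, 2.0]
mse = mean_squared_error(y_reg, y_reg_pred)
print(mse, mse ** 0.5,                       # MSE and RMSE
      mean_absolute_error(y_reg, y_reg_pred),
      r2_score(y_reg, y_reg_pred))           # coefficient of determination
```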
In case of a large dataset, should I sample my data or use distributed computing?
In the past, when a statistician or data analyst faced a large dataset, the most popular technique was to sample the data and then apply the machine learning algorithms to the sample. Nowadays, a newer technique has emerged called distributed computing, and more precisely data parallelism, which makes it possible to use all the data when training a machine learning model.
In terms of time, sampling the data is faster to set up than distributed computing. So, for a small project with limited time for delivery, it is more relevant to sample the dataset. When the project is a long one with a focus on performance, distributed computing is more relevant. Likewise, if we are applying a deep learning model, it is advisable to use distributed computing in order to take advantage of the complete dataset.
Distributed computing and data parallelism require strong knowledge of data engineering and computer science, so it is a practice that may take a beginner some time to set up.
How much time should I spend on data transformation?
Data transformation is a process performed by data scientists during the data preparation step, where the data can be transformed in various ways depending on its format, its type, and the purpose. The most popular data transformations are the natural logarithm, applied to a skewed continuous target variable to reduce the skew in the data, and one-hot encoding, used to transform categorical variables int...