eBook - ePub

Data Science Without Makeup

Name: Data Science Without Makeup
Author: Mikhail Zhilkin

A Guidebook for End-Users, Analysts, and Managers

Mikhail Zhilkin

178 pages
English

About This Book

Mikhail Zhilkin, a data scientist who has worked on projects ranging from Candy Crush games to Premier League football players' physical performance, shares his strong views on some of the best and, more importantly, worst practices in data analytics and business intelligence. Why data science is hard, what pitfalls analysts and decision-makers fall into, and what everyone involved can do to give themselves a fighting chance—the book examines these and other questions with the skepticism of someone who has seen the sausage being made.

Honest and direct, full of examples from real life, Data Science Without Makeup: A Guidebook for End-Users, Analysts and Managers will be of great interest to people who aspire to work with data, people who already work with data, and people who work with people who work with data—from students to professional researchers and from early-career to seasoned professionals.

Mikhail Zhilkin is a data scientist at Arsenal FC. He has previously worked on the popular Candy Crush mobile games and in sports betting.

Information

Publisher: Chapman and Hall/CRC
Year: 2021
ISBN: 9781000464856
Topic: Computer Science
Subtopic: Data Mining
Index: Computer Science

II

a new hope

DOI: 10.1201/9781003057420-5

Just because data science is hard, and our brain was not exactly designed to do it, does not mean we should stop trying our best. The hype around data science is not entirely unfounded. When done right, it can, indeed, transform businesses and disrupt industries (and even create new ones). “When” is the operative word here.

A lot of the material on how to do data science the right way falls into two categories:

Articles full of truisms, such as “Data quality is important,” which you cannot and will not argue with, but which leave you none the wiser as to how to actually do things better.
Books and workshops on tools and workflows that enable one to do data analysis, but do not necessarily cover the chasm between using a tool and achieving the desired result.

This book tries to fill in the blanks. Online courses and bootcamps can prepare you for the role of a junior data scientist; saying things like “Deliver relevant results” and “Empower business stakeholders” may work when interviewing for a managerial position; but we will focus on the stuff in between—the magical transformation of data science efforts into something useful.

This part of the book will deal with three major areas:

data science for people, not for its own sake (doing the right thing),
quality assurance (doing it correctly),
automation (never having to do it again).

It will all be essentially about data science best practices.

Just like any best practices, rules of thumb and other kinds of guidelines, data science best practices do not emerge from nowhere. It is important to understand that the process is as follows:

Practitioners try to solve a certain problem.
Some approaches work better than others.
Success and failure stories are distilled into dos and don’ts—a best practice is born.

When writing about best practices, it is difficult to avoid words like “should” and “must.” The reader would do well to remember that there are no sacred texts in data science. And even if there were, this would not be one of them.

“Should” and “must” are only convenient shortcuts. “Thou shalt not lie” is easier to say and remember than “Telling lies is detrimental to one’s character and destroys mutual trust, which is a crucial resource for a group of people with a shared goal.” Similarly, “Always comment your code” is a shortcut for “Commenting your code will make it easier for you and others to maintain it and, if necessary, reuse it in the future.”

If someone tells you that you should do this, or that you must never do that, and you are not sure where they are coming from, it is a good idea to ask: “What will happen if I do? What if I don’t?” When writing about best practices, I will try to make it clear what good things will happen if you follow them, and what bad things will happen if you don’t.

There are no hard rules beyond the laws of physics (and even they are just our best guess for the moment), but experience shows that it is better to start out with known best practices and only deviate from them once you know the ins and outs.

4

data science for people

DOI: 10.1201/9781003057420-6

When discussing data science best practices, it is important to note that there is a hierarchy to them. A best practice never exists without context, and for it to make sense, it may be required that a more high-level best practice (or several) has been put in place.

For example, there is no point in arguing which machine learning technique to use if the data the model will be trained on is rubbish. Thus, data-related best practices come before those specific to machine learning. Improving the data will have a bigger impact on the model accuracy than picking a more sophisticated machine learning algorithm.

This chapter will attempt to outline the most general of data science best practices in a meaningful order:

Align data science efforts with business needs.
Mind data science hierarchy of needs.
Make it simple, reproducible, and shareable.

I do not know about you, but when I am looking at these I cannot help thinking, “Aren’t they all extremely obvious?” Who does not want to align data science efforts with business needs? Who wants to make it unnecessarily complicated? Who does not want to automate everything that can be automated, and save time and money? But then, if these principles were adhered to by—not even all—most organizations, I would not feel the urge to write this book in the first place.

Let’s go through these best practices one by one and try to understand why they are ignored more often than not.

align data science efforts with business needs

In any organization that aspires to be data-driven, the first thing to look at is the alignment of data science efforts with business needs. This may sound obvious, but I have been in, and have observed, situations in which data science efforts were primarily driven not by what the business needed, but by one or both of the following:

what data scientists wanted to do,
what people working with data scientists thought the business needed.

Let’s address the first one: data scientists doing what they want rather than what the business needs.

As science is concerned with seeking truth, data science is concerned with seeking truth in data. The two main reasons to seek truth are:

Curiosity: you want to understand something for the sake of understanding. This is what often drives data scientists. They want to do an exploratory analysis, run an A/B test or master a new tool not because it will necessarily create business value, but simply because they are curious.
Pragmatism: whatever your goal, you can get closer to it by better understanding the domain. In the case of a business, you may, for example, hope to increase revenue by better understanding your customers’ needs and behaviors.

A fundamental challenge of creating a data-driven organization is the marriage of these two: curious people working towards pragmatic goals. The optimal proportion of curiosity and pragmatism will vary from company to company. A research data scientist working on pushing the boundaries of deep learning may do well to be 95% curious and 5% pragmatic, whereas a business analyst supporting a small chain of retail stores is likely to benefit from being only 5% curious and 95% pragmatic. Data analysts in most companies will be at their most productive when combining curiosity and pragmatism in reasonable proportions.

The vast majority of data analysts I have had the privilege to work with had enough curiosity for two people. Some would be more interested in statistical analysis, some in writing efficient code, others in building data pipelines, but all of them would have a pre-existing interest in doing something data-related for its own sake.

The same could not be said about every analyst’s passion for meeting quarterly business objectives. Most of them, especially those just breaking into the field, would look for an interesting project first and think about its value for the business later, if ever. In an ideal world, it would be the other way around.

This challenge is best addressed top down: a business-minded data science team manager will have a shot at aligning less business-minded data scientists and making sure they deliver business value. It is difficult to imagine a bottom-up approach being successful.

I have personally worked with a variety of data science managers, with widely varying degrees of business-mindedness and tech-savviness. My experience is that a manager who understands business needs but knows nothing about data science will generally outperform a tech-savvy manager of comparable general intelligence who has lost touch with business goals.

In a smaller data science organization, it can be the data scientists themselves who determine the overall direction of research.

I once got a question from a data analyst who had just joined a sports club. He wanted my input on how to start off with the data and what questions he should be looking to answer.

While I did my best to answer in a friendly and constructive manner, I could not help thinking, “I am not the person you should be talking to. Your job is to help people running the club. Ask them what they need, not me.”

For a data scientist, it may be useful to know what your peers in other companies work on. If they happen to have solved a problem similar to one you are working on, you may be able to learn from their experience (and you can certainly learn from their mistakes). However, at the end of the day, everything you do, you do in the context of your organization, and you are best positioned to find out what needs to be done. This is arguably the most important part of your job. You cannot outsource understanding the needs of the business.

One well-known management methodology that can help align what the data science team does with what the business needs is objectives and key results (OKR). The idea behind this goal-setting framework, popularized by Google, is to ensure that the company focuses efforts on the same important issues throughout the organization. When OKR is applied correctly, anything a data scientist (or any employee, for that matter) does should be connected to an overarching company objective. Conversely, if a task cannot be connected to such an objective, it can and should be dropped.
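
As a rough illustration of the rule that a task either traces back to a company objective or gets dropped, here is a minimal Python sketch. The `Task` class, the `split_by_alignment` helper, the objective names, and the example backlog are all hypothetical, made up for this example; OKR itself does not prescribe any particular data structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    objective: Optional[str]  # company objective the task supports, if any


def split_by_alignment(tasks, company_objectives):
    """Separate tasks that map to a company objective from those that do not."""
    aligned = [t for t in tasks if t.objective in company_objectives]
    unaligned = [t for t in tasks if t.objective not in company_objectives]
    return aligned, unaligned


# Hypothetical objectives and backlog, for illustration only.
company_objectives = {"increase customer retention", "reduce injury rate"}
backlog = [
    Task("churn model", "increase customer retention"),
    Task("try new deep learning library", None),
    Task("training load dashboard", "reduce injury rate"),
]

aligned, unaligned = split_by_alignment(backlog, company_objectives)
print("keep:", [t.name for t in aligned])
print("drop or justify:", [t.name for t in unaligned])
```

The code is the easy part. Deciding honestly which objective, if any, a task supports is the work that cannot be automated.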

Unfortunately, as often happens with methodologies and frameworks, they can be applied in theory while very much ignored in practice. A certain kind of cargo cult takes place: meetings are held, presentations are shown around, to-do lists are created, but when the dust settles, it is business as usual, with people doing what they have always been doing.

Without a change in organization culture and everyone’s mindset, a management framework is just a yoga mat that was bought and put away in the loft.
