eBook - ePub

Data Science Bookcamp

Name: Data Science Bookcamp
ISBN: 9781638352303

Five real-world Python projects

Leonard Apeltsin,

704 pages
English
ePUB (mobile friendly)
Available on iOS & Android

eBook - ePub

Data Science Bookcamp

Five real-world Python projects

Leonard Apeltsin,

About this book

Learn data science with Python by building five real-world projects! Experiment with card game predictions, tracking disease outbreaks, and more, as you build a flexible and intuitive understanding of data science. In Data Science Bookcamp you will learn: - Techniques for computing and plotting probabilities
- Statistical analysis using Scipy
- How to organize datasets with clustering algorithms
- How to visualize complex multi-variable datasets
- How to train a decision tree machine learning algorithm In Data Science Bookcamp you'll test and build your knowledge of Python with the kind of open-ended problems that professional data scientists work on every day. Downloadable data sets and thoroughly-explained solutions help you lock in what you've learned, building your confidence and making you ready for an exciting new data science career. Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications. About the technology
A data science project has a lot of moving parts, and it takes practice and skill to get all the code, algorithms, datasets, formats, and visualizations working together harmoniously. This unique book guides you through five realistic projects, including tracking disease outbreaks from news headlines, analyzing social networks, and finding relevant patterns in ad click data. About the book
Data Science Bookcamp doesn't stop with surface-level theory and toy examples. As you work through each project, you'll learn how to troubleshoot common problems like missing data, messy data, and algorithms that don't quite fit the model you're building. You'll appreciate the detailed setup instructions and the fully explained solutions that highlight common failure points. In the end, you'll be confident in your skills because you can see the results. What's inside - Web scraping
- Organize datasets with clustering algorithms
- Visualize complex multi-variable datasets
- Train a decision tree machine learning algorithm About the reader
For readers who know the basics of Python. No prior data science or machine learning skills required. About the author
Leonard Apeltsin is the Head of Data Science at Anomaly, where his team applies advanced analytics to uncover healthcare fraud, waste, and abuse. Table of Contents
CASE STUDY 1 FINDING THE WINNING STRATEGY IN A CARD GAME
1 Computing probabilities using Python
2 Plotting probabilities using Matplotlib
3 Running random simulations in NumPy
4 Case study 1 solution
CASE STUDY 2 ASSESSING ONLINE AD CLICKS FOR SIGNIFICANCE
5 Basic probability and statistical analysis using SciPy
6 Making predictions using the central limit theorem and SciPy
7 Statistical hypothesis testing
8 Analyzing tables using Pandas
9 Case study 2 solution
CASE STUDY 3 TRACKING DISEASE OUTBREAKS USING NEWS HEADLINES
10 Clustering data into groups
11 Geographic location visualization and analysis
12 Case study 3 solution
CASE STUDY 4 USING ONLINE JOB POSTINGS TO IMPROVE YOUR DATA SCIENCE RESUME
13 Measuring text similarities
14 Dimension reduction of matrix data
15 NLP analysis of large text datasets
16 Extracting text from web pages
17 Case study 4 solution
CASE STUDY 5 PREDICTING FUTURE FRIENDSHIPS FROM SOCIAL NETWORK DATA
18 An introduction to graph theory and network analysis
19 Dynamic graph theory techniques for node ranking and social network analysis
20 Network-driven supervised machine learning
21 Training linear classifiers with logistic regression
22 Training nonlinear classifiers with decision tree techniques
23 Case study 5 solution

Tools to learn more effectively

Saving Books

Keyword Search

Annotating Text

Listen to it instead

Information

Publisher

Year

Print ISBN

eBook ISBN

Topic

Subtopic

Index

Part 1. Case study 1: Finding the winning strategy in a card game

Problem statement

Would you like to win a bit of money? Let’s wager on a card game for minor stakes. In front of you is a shuffled deck of cards. All 52 cards lie face down. Half the cards are red, and half are black. I will proceed to flip over the cards one by one. If the last card I flip over is red, you’ll win a dollar. Otherwise, you’ll lose a dollar.

Here’s the twist: you can ask me to halt the game at any time. Once you say “Halt,” I will flip over the next card and end the game. That next card will serve as the final card. You will win a dollar if it’s red, as shown in figure CS1.1.

Figure CS1.1 The card-flipping game. We start with a shuffled deck. I repeatedly flip over the top card from the deck. (A) I have just flipped the fourth card. You instruct me to stop. (B) I flip over the fifth and final card. The final card is red. You win a dollar.

We can play the game as many times as you like. The deck will be reshuffled every time. After each round, we’ll exchange money. What is your best approach to winning this game?

Overview

To address the problem at hand, we will need to know how to

Compute the probabilities of observable events using sample space analysis.
Plot the probabilities of events across a range of interval values.
Simulate random processes, such as coin flips and card shuffling, using Python.
Evaluate our confidence in decisions drawn from simulations using confidence interval analysis.

1 Computing probabilities using Python

This section covers

What are the basics of probability theory?
Computing probabilities of a single observation
Computing probabilities across a range of observations

Few things in life are certain; most things are driven by chance. Whenever we cheer for our favorite sports team, or purchase a lottery ticket, or make an investment in the stock market, we hope for some particular outcome, but that outcome cannot ever be guaranteed. Randomness permeates our day-to-day experiences. Fortunately, that randomness can still be mitigated and controlled. We know that some unpredictable events occur more rarely than others and that certain decisions carry less uncertainty than other much-riskier choices. Driving to work in a car is safer than riding a motorcycle. Investing part of your savings in a retirement account is safer than betting it all on a single hand of blackjack. We can intrinsically sense these trade-offs in certainty because even the most unpredictable systems still show some predictable behaviors. These behaviors have been rigorously studied using probability theory. Probability theory is an inherently complex branch of math. However, aspects of the theory can be understood without knowing the mathematical underpinnings. In fact, difficult probability problems can be solved in Python without needing to know a single math equation. Such an equation-free approach to probability requires a baseline understanding of what mathematicians call a sample space.

1.1 Sample space analysis: An equation-free approach for measuring uncertainty in outcomes

Certain actions have measurable outcomes. A sample space is the set of all the possible outcomes an action could produce. Let’s take the simple action of flipping a coin. The coin will land on either heads or tails. Thus, the coin flip will produce one of two measurable outcomes: heads or tails. By storing these outcomes in a Python set, we can create a sample space of coin flips.

Listing 1.1 Creating a sample space of coin flips

sample_space = {'Heads', 'Tails'} ❶

❶ Storing elements in curly brackets creates a Python set. A Python set is a collection of unique, unordered elements.

Suppose we choose an element of sample_space at random. What fraction of the time will the chosen element equal Heads? Well, our sample space holds two possible elements. Each element occupies an equal fraction of the space within the set. Therefore, we expect Heads to be selected with a frequency of 1/2. That frequency is formally defined as the probability of an outcome. All outcomes within sample_space share an identical probability, which is equal to 1 / len(sample_space).

Listing 1.2 Computing the probability of heads

probability_heads = 1 / len(sample_space) print(f'Probability of choosing heads is {probability_heads}')  Probability of choosing heads is 0.5

The probability of choosing Heads equals 0.5. This relates directly to the action of flipping a coin. We’ll assume the coin is unbiased, which means the coin is equally likely to fall on either heads or tails. Thus, a coin flip is conceptually equivalent to choosing a random element from sample_space. The probability of the coin landing on heads is therefore 0.5; the probability of it landing on tails is also equal to 0.5.

We’ve assigned probabilities to our two measurable outcomes. However, there are additional questions we could ask. What is the probability that the coin lands on either heads or tails? Or, more exotically, what is the probability that the coin will spin forever in the air, landing on neither heads nor tails? To find rigorous answers, we need to define the concept of an event. An event is the subset of those elements within sample_space that satisfy some event condition (as shown in figure 1.1). An event condition is a simple Boolean function whose input is a single sample_space element. The function returns True only if the element satisfies our condition constraints.

Figure 1.1 Four event conditions applied to a sample space. The sample space contains two outcomes: heads and tails. Arrows represent the event conditions. Ever...

inside front cover
Data Science Bookcamp
Copyright
dedication
brief contents
contents
front matter
Part 1. Case study 1: Finding the winning strategy in a card game
1 Computing probabilities using Python
2 Plotting probabilities using Matplotlib
3 Running random simulations in NumPy
4 Case study 1 solution
Part 2. Case study 2: Assessing online ad clicks for significance
5 Basic probability and statistical analysis using SciPy
6 Making predictions using the central limit theorem and SciPy
7 Statistical hypothesis testing
8 Analyzing tables using Pandas
9 Case study 2 solution
Part 3. Case study 3: Tracking disease outbreaks using news headlines
10 Clustering data into groups
11 Geographic location visualization and analysis
12 Case study 3 solution
Part 4. Case study 4: Using online job postings to improve your data science resume
13 Measuring text similarities
14 Dimension reduction of matrix data
15 NLP analysis of large text datasets
16 Extracting text from web pages
17 Case study 4 solution
Part 5. Case study 5: Predicting future friendships from social network data
18 An introduction to graph theory and network analysis
19 Dynamic graph theory techniques for node ranking and social network analysis
20 Network-driven supervised machine learning
21 Training linear classifiers with logistic regression
22 Training nonlinear classifiers with decision tree techniques
23 Case study 5 solution
index
inside back cover

Frequently asked questions

Yes, you can cancel anytime from the Subscription tab in your account settings on the Perlego website. Your subscription will stay active until the end of your current billing period. Learn how to cancel your subscription

No, books cannot be downloaded as external files, such as PDFs, for use outside of Perlego. However, you can download books within the Perlego app for offline reading on mobile or tablet. Learn how to download books offline

Perlego offers two plans: Essential and Complete

Essential is ideal for learners and professionals who enjoy exploring a wide range of subjects. Access the Essential Library with 800,000+ trusted titles and best-sellers across business, personal growth, and the humanities. Includes unlimited reading time and Standard Read Aloud voice.
Complete: Perfect for advanced learners and researchers needing full, unrestricted access. Unlock 1.4M+ books across hundreds of subjects, including academic and specialized titles. The Complete Plan also includes advanced features like Premium Read Aloud and Research Assistant.

Both plans are available with monthly, semester, or annual billing cycles.

We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 990+ topics, we’ve got you covered! Learn about our mission

Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more about Read Aloud

Yes! You can use the Perlego app on both iOS and Android devices to read anytime, anywhere — even offline. Perfect for commutes or when you’re on the go.
Please note we cannot support devices running on iOS 13 and Android 7 or earlier. Learn more about using the app

Yes, you can access Data Science Bookcamp by Leonard Apeltsin in PDF and/or ePUB format, as well as other popular books in Computer Science & Data Processing. We have over one million books available in our catalogue for you to explore.