eBook - ePub

Associations and Correlations

Name: Associations and Correlations
ISBN: 9781838982201

Unearth the powerful insights buried in your data

Lee Baker

134 pages
English

About this book

Discover the story of your data using the essential elements of associations and correlations

Key Features

• Get a comprehensive introduction to associations and correlations
• Explore multivariate analysis, understand its limitations, and discover the assumptions on which it's based
• Gain insights into the various ways of preparing your data for analysis and visualization

Book Description

Associations and correlations are ways of describing how a pair of variables change together as a result of their connection. By knowing the various available techniques, you can easily and accurately discover and visualize the relationships in your data. This book begins by showing you how to classify your data into the four distinct types that you are likely to have in your dataset. Then, with easy-to-understand examples, you'll learn when to use the various univariate and multivariate statistical tests. You'll also discover what to do when your univariate and multivariate results do not match. As the book progresses, it describes why univariate and multivariate techniques should be used as a tag team, and also introduces you to the techniques of visualizing the story of your data. By the end of the book, you'll know exactly how to select the most appropriate univariate and multivariate tests, and be able to use a single strategic framework to discover the true story of your data.

What you will learn

• Identify a dataset that's fit for analysis using its basic features
• Understand the importance of associations and correlations
• Use multivariate and univariate statistical tests to confirm relationships
• Classify data as qualitative or quantitative and then into the four subtypes
• Build a visual representation of all the relationships in the dataset
• Automate associations and correlations with CorrelViz

Who this book is for

This is a book for beginners – if you're a novice data analyst or data scientist, then this is a great place to start. Experienced data analysts might also find value in this title, as it will recap the basics and strengthen your understanding of key concepts. This book focuses on introducing the essential elements of association and correlation analysis.

Publisher: Packt Publishing
Year: 2019
eBook ISBN: 9781838982201
Topic: Computer Science
Subtopic: General Computer Science

Chapter 1 Data Collection and Cleaning

The first step in any data analysis project is to collect and clean your data. If you're fortunate enough to have been given a perfectly clean dataset, then congratulations – you're well on your way. For the rest of us, though, there's quite a bit of grunt work to be done before you can get to the joy of analysis (yeah, I know, I really must get a life…).

In this chapter, you'll learn about what the features of a good dataset look like and how the dataset should be formatted to make it amenable to analysis by association and correlation tests.

Most importantly, you'll learn why it's not necessarily a good idea to collect sales data on ice cream and haemorrhoid cream in the same dataset.

If you're happy with your dataset and quite sure that it doesn't need cleaning, then you can safely skip this chapter. I won't take it personally – honest!

Data Collection

The first question you should be asking before starting any project is "What is my question?" If you don't know your question, then you won't know how to get an answer. In science and statistics, this is called having a hypothesis. Typical hypotheses might be:

Is smoking related to lung cancer?
Is there an association between sales of ice cream and haemorrhoid cream?
Is there a correlation between coffee consumption and insomnia?

It's important to start with a question, because this will help you decide what data you should collect (and what data you shouldn't).

It's unusual to be able to answer these types of questions by collecting data on just those variables. It's much more likely that other factors will influence the answer, and all of these factors must be taken into account. If you want to answer the question "Is smoking related to lung cancer?" then you'll typically also collect data on age, height, weight, family history, genetic factors, and environmental factors, and your dataset will start to become quite large in comparison with your hypothesis.

So, what data should you collect? Well, that depends on your hypothesis, the received wisdom of current thinking, and any previous research carried out. Ultimately, if you collect data sensibly, you will likely get sensible results, and vice versa, so it's a good idea to take some time to think it through carefully before you start.

I'm not going to go into the finer points of data collection and cleaning here, but it's important that your dataset conforms to a few simple standards before you can start analyzing it.

By the way, if you want a copy of my book Practical Data Cleaning, you can get a free copy of it by following the instructions in the tiny little advert for it at the end of this section…

Dataset Checklist

OK, so here we go. Here are the essential features of a ready-to-go dataset for association and correlation analysis.

Your dataset is a rectangular matrix of data. If your data is spread across different spreadsheets or tables, then it's not a dataset, it's a database, and it's not ready for analysis:

Each column of data is a single variable corresponding to a single piece of information (such as age, height, or weight, in this case).
Column 1 is a list of unique consecutive numbers starting from one. This allows you to uniquely identify any given row and recover the original order of your dataset with a single sort command.
Row 1 contains the names of the variables. If you use rows 2, 3, 4, and so on as the variable names, you won't be able to enter your dataset into a statistics program.
Each row contains the details for a single sample (patient, case, test tube, and so on).
Each cell should contain a single piece of information. If you have entered more than one piece of information in a cell (such as date of birth and age), then you should separate the column into two or more columns (one for date of birth, another for age).
Don't enter the number zero into a cell unless what has been measured, counted, or calculated results in the answer zero. Don't use the number zero as a code to signify "No Data".

By now, you should have a well-formed dataset that is stored in a single Excel worksheet. Each column should be a single variable, with row 1 containing the names of the variables, and below this, each row should be a distinct sample or patient. It should look something like Figure 1.1.

Figure 1.1: A typical dataset used in association and correlation analysis

For the rest of this book, this is how I assume your dataset is laid out, so I might use the terms variable and column interchangeably, the same going for the terms row, sample, and patient.
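Some of the checklist items above can be tested mechanically. The following is a minimal sketch, not a complete validator: it assumes the worksheet has been exported to CSV, and the function name `check_dataset` and the sample values are made up for illustration. It checks that row 1 holds unique variable names, that the data is rectangular, and that column 1 counts 1, 2, 3, and so on, as originally entered.

```python
import csv
import io

def check_dataset(rows):
    """Return a list of checklist problems; rows[0] is the header row."""
    if not rows:
        return ["dataset is empty"]
    problems = []
    header, body = rows[0], rows[1:]
    width = len(header)

    # Row 1 must hold non-blank, unique variable names.
    if any(not name.strip() for name in header):
        problems.append("blank variable name in row 1")
    if len(set(header)) != len(header):
        problems.append("duplicate variable names in row 1")

    # The dataset must be rectangular: every row as wide as the header.
    for i, row in enumerate(body, start=2):
        if len(row) != width:
            problems.append(f"row {i} has {len(row)} cells, expected {width}")

    # Column 1 must be the consecutive numbers 1, 2, 3, ...
    ids = [row[0] for row in body if row]
    expected = [str(n) for n in range(1, len(ids) + 1)]
    if ids != expected:
        problems.append("column 1 is not 1, 2, 3, ... in order")

    return problems

# Hypothetical sample; the missing weight is left blank, not coded as zero.
sample = io.StringIO("ID,Age,Height,Weight\n1,34,172,70\n2,51,165,\n3,29,180,82\n")
rows = list(csv.reader(sample))
print(check_dataset(rows))
```

An empty list means none of these particular checks failed; it says nothing about whether the values themselves are sensible.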

Data Cleaning

Your next step is cleaning the data. You may well have made some entry errors, and some of your data may not be usable. You need to find such instances and correct them. Otherwise, your data may not be fit for purpose and may mislead you in your pursuit of the answers to your questions.

Even after you've corrected the obvious entry errors, there may be other types of errors in your data that are harder to find.

Check That Your Data Is Sensible

Just because your dataset is clean, it doesn't mean that it is correct – real life follows rules, and your data must follow them, too. There are limits on the heights of participants in your study, so check that all data fits within reasonable limits. Calculate the minimum, maximum, and mean values of variables to see whether all values are sensible.
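As a rough illustration of this kind of range check, here is a short sketch using only the Python standard library. The heights and the plausible limits are invented for the example, so substitute limits that suit your own study.

```python
from statistics import mean

# Hypothetical heights in centimetres; 1720.0 is a likely slip for 172.0.
heights_cm = [172.0, 165.5, 180.2, 1720.0, 158.9]

# Assumed plausible limits for height; adjust to your own data.
LOW, HIGH = 50.0, 250.0

# The minimum, maximum, and mean give a first impression of the variable.
summary = {
    "min": min(heights_cm),
    "max": max(heights_cm),
    "mean": mean(heights_cm),
}

# Collect (row number, value) for every value outside the plausible range.
suspect = [
    (row, h) for row, h in enumerate(heights_cm, start=1)
    if not LOW <= h <= HIGH
]

print(summary)
print(suspect)
```

A maximum far above the assumed limit is enough to send you back to the row in question, and the inflated mean is a second hint that something is wrong.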

Sometimes, putting together two or more pieces of data can reveal errors that can otherwise be difficult to detect. Does the difference between date of birth and date of diagnosis give you a negative number? Is your patient over 300 years old?
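The two questions above can be turned into a simple cross-field check. This is only a sketch: the records, the function name `age_problems`, and the upper age limit are assumptions made for the example.

```python
from datetime import date

# Hypothetical records: (row id, date of birth, date of diagnosis).
records = [
    (1, date(1961, 4, 12), date(2018, 9, 3)),
    (2, date(2019, 1, 20), date(2017, 6, 1)),  # diagnosed before birth
    (3, date(1700, 5, 2), date(2018, 2, 14)),  # over 300 years old
]

MAX_AGE_YEARS = 120  # assumed upper limit for a human age

def age_problems(records):
    """Flag rows whose age at diagnosis is negative or implausibly large."""
    flagged = []
    for row_id, born, diagnosed in records:
        # Approximate age in years from the difference between the two dates.
        age_years = (diagnosed - born).days / 365.25
        if age_years < 0:
            flagged.append((row_id, "diagnosis precedes birth"))
        elif age_years > MAX_AGE_YEARS:
            flagged.append((row_id, "implausible age"))
    return flagged

print(age_problems(records))
```

Neither date in rows 2 and 3 looks wrong on its own; the error only shows up when the two columns are combined.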

Figure 1.2 gives you a list of the most useful measures that will help you discover errors in your data and find out whether real-life rules have been followed.

Figure 1.2: Essential descriptive statistics

Check That Your Variables Are Sensible

Once you have a perfectly clean dataset, it is relatively easy to compare variables with each other to find out whether there is a relationship between them (the subject of this book). But just because you can, it doesn't mean that you should. If there is no good reason why there should be a relationship between sales of ice cream and haemorrhoid cream, then you should consider removing one or both of those variables from the dataset. If you've collected your own data from original sources, then you'll have considered beforehand what data is sensible to collect (you have, haven't you?), but if your dataset is a pastiche of two or more datasets, then you might find strange combinations of variables.

You should check your variables before doing any analyses and consider whether it is sensible to make these comparisons.

So, now you have collected your data, cleaned your data, and checked that your data is sensible and fit for purpose. In the next chapter, we'll go through the basics of data classification and introduce the four types of data.

Chapter 2 Data Classification

In this chapter, you'll learn the difference between quantitative and qualitative data. You'll also learn about ratio, interval, ordinal, and nominal data types, and what operations you can perform on each of them.

Quantitative and Qualitative Data

All data is either quantitative – measured with some kind of measuring implement, such as a ruler, jug, weighing scales, stopwatch, thermometer, and so on – or is qualitative: an observed feature of interest that is placed into categories, as in health (healthy, sick), and opinion (agree, neutral, disagree).

Quantitative and qualitative data can be sub-divided into four further classes of data – Ratio, Interval, Ordinal, and Nominal – as shown in Figure 2.1.

Figure 2.1: There are four distinct types of data

The differences between them can be established by asking just three questions:

Question 1: Are adjacent data points or categories ordered?

All measured data is ordered, but not all categories are. If your categories are named [Small; Medium; Large], then there is an order to them. If you have named your categories [1; 2; 3], then there may be an order, but it all depends on what 1, 2, and 3 signify. Just because you've used numbers to name your categories, it doesn't necessarily follow that there is an order.
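A small sketch may help show why the order has to come from what the categories mean, not from their labels. The rank mapping and the eye-colour codes below are invented for illustration.

```python
# Explicit ranks for an ordered variable; the labels come from the text above.
SIZE_ORDER = {"Small": 0, "Medium": 1, "Large": 2}

sizes = ["Large", "Small", "Medium", "Small"]

# Sorting the labels alphabetically ignores their real order.
alphabetical = sorted(sizes)

# Sorting by the explicit rank respects it.
by_rank = sorted(sizes, key=lambda label: SIZE_ORDER[label])

# Hypothetical numeric codes for an unordered variable. The codes sort
# as 1 < 2 < 3, but that order says nothing about the colours themselves.
EYE_COLOUR = {1: "Blue", 2: "Brown", 3: "Green"}

print(alphabetical)
print(by_rank)
```

Software will happily sort either kind of label, so it is up to you to record whether the order means anything.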

Question 2: Are adjacent data points equidistant?

Look at a ruler and you'll see that the distance between each centimeter marker is precisely the same irrespective of which part of the ruler you're looking at – every centimeter measurement has the same length. In fact, all measuring implements – such as rulers, stopwatches, and jugs – have equidistant data points. Categories, though, do not. The difference in sizes between small and medium is not necessarily the same as that between medium and large. What do you mean by size? Width, height, or depth? Area or volume? It is likely that the reason your data is organized into categories is that it is difficult to accurately measure the feature of interest; therefore, categorical data does not have equidistant data points.

Question 3: Does the scale of measurement have a meaningful zero?

All categorical data have arbitrarily chosen zero points. Extrapolate backward from the categories small, medium, and large – where does the line cross the x-axis? Well, it doesn't – it's sill...
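The excerpt stops before the three questions are mapped onto the four data types, so the sketch below is an assumption. It follows the conventional definitions: nominal data is unordered, ordinal data is ordered but not equidistant, interval data is equidistant without a meaningful zero, and ratio data has all three properties. The function name `classify` and the example variables are made up for illustration.

```python
def classify(ordered, equidistant, meaningful_zero):
    """Map the answers to the three questions onto one of the four data types."""
    if not ordered:
        return "Nominal"
    if not equidistant:
        return "Ordinal"
    if not meaningful_zero:
        return "Interval"
    return "Ratio"

# Illustrative answers for four example variables (assumed, not from the book).
examples = {
    "eye colour": (False, False, False),
    "size (small, medium, large)": (True, False, False),
    "temperature in Celsius": (True, True, False),
    "height in centimetres": (True, True, True),
}

for name, answers in examples.items():
    print(name, "->", classify(*answers))
```

The questions are asked in order, so a "no" to an earlier question settles the type without the later ones needing an answer.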

Table of contents

Preface
Chapter 1 Data Collection and Cleaning
Chapter 2 Data Classification
Chapter 3 Introduction to Associations and Correlations
Chapter 4 Univariate Statistics
Chapter 5 Multivariate Statistics
Chapter 6 Visualizing Your Relationships
Chapter 7 Bonus: Automating Associations and Correlations
Appendix
