Learning Objectives
By the end of this chapter, you will be able to:
- Distinguish between supervised learning and unsupervised learning
- Explain the concept of clustering
- Implement k-means clustering algorithms using built-in Python packages
- Calculate the Silhouette Score for your data
In this chapter, we will have a look at the concept of clustering.
Introduction
Have you ever been asked to take a look at some data and come up empty handed? Maybe you were not familiar with the dataset, or maybe you didn't even know where to start. This may have been extremely frustrating, and even embarrassing, depending on who asked you to take care of the task.
You are not alone, and, interestingly enough, there are many times the data itself is simply too confusing to be made sense of. As you try and figure out what all those numbers in your spreadsheet mean, you're most likely mimicking what many unsupervised algorithms do when they try to find meaning in data. The reality is that many datasets in the real world don't have any rhyme or reason to them. You will be tasked with analyzing them with little background preparation. Don't fret, however – this book will prepare you so that you'll never be frustrated again when dealing with data exploration tasks.
For this book, we have developed some best-in-class content to help you understand how unsupervised algorithms work and where to use them. We'll cover some of the foundations of finding clusters in your data, how to reduce the size of your data so it's easier to understand, and how each of these sides of unsupervised learning can be applied in the real world. We hope you will come away from this book with a strong real-world understanding of unsupervised learning, the problems that it can solve, and those it cannot.
Thanks for joining us and we hope you enjoy the ride!
Unsupervised Learning versus Supervised Learning
Unsupervised learning is one of the most exciting areas of development in machine learning today. If you have explored machine learning bookwork before, you are probably familiar with the common breakout of problems in either supervised or unsupervised learning. Supervised learning encompasses the problem set of having a labeled dataset that can be used to either classify (for example, predicting smokers and non-smokers if you're looking at a lung health dataset) or fit a regression line on (for example, predicting the sale price of a home based on how many bedrooms it has). This model most closely mirrors an intuitive human approach to learning.
If you wanted to learn how to not burn your food with a basic understanding of cooking, you could build a dataset by putting your food on the burner and seeing how long it takes (input) for your food to burn (output). Eventually, as you continue to burn your food, you will build a mental model of when burning will occur and avoid it in the future. Development in supervised learning was once fast-paced and valuable, but it has since simmered down in recent years – many of the obstacles to knowing your data have already been tackled:
Figure 1.1: Differences between unsupervised and supervised learning
Conversely, unsupervised learning encompasses the problem set of having a tremendous amount of data that is unlabeled. Labeled data, in this case, would be data that has a supplied "target" outcome that you are trying to find the correlation to with supplied data (you know that you are looking for whether your food was burned in the preceding example). Unlabeled data is when you do not know what the "target" outcome is, and you only have supplied input data.
Building upon the previous example, imagine you were just dropped on planet Earth with zero knowledge of how cooking works. You are given 100 days, a stove, and a fridge full of food without any instructions on what to do. Your initial exploration of a kitchen could go in infinite directions – on day 10, you may finally learn how to open the fridge; on day 30, you may learn that food can go on the stove; and after many more days, you may unwittingly make an edible meal. As you can see, trying to find meaning in a kitchen devoid of adequate informational structure leads to very noisy data that is completely irrelevant to actually preparing a meal.
Unsupervised learning can be an answer to this problem. By looking back at your 100 days of data, clustering can be used to find patterns of similar days where a meal was produced, and you can easily review what you did on those days. However, unsupervised learning isn't a magical answer –simply finding clusters can be just as likely to help you to find pockets of similar yet ultimately useless data.
This challenge is what makes unsupervised learning so exciting. How can we find smarter techniques to speed up the process of finding clusters of information that are beneficial to our end goals?
Clustering
Being able to find groups of similar data that exist in your dataset can be extremely valuable if you are trying to find its underlying meaning. If you were a store owner and you wanted to understand which customers are more valuable without a set idea of what valuable is, clustering would be a great place to start to find patterns in your data. You may have a few high-level ideas of what denotes a valuable customer, but you aren't entirely sure in the face of a large mountain of available data. Through clustering you can find commonalities among similar groups in your data. If you look more deeply at a cluster of similar people, you may learn that everyone in that group visits your website for longer periods of time than others. This can show you what the value is and also provides a clean sample size for future supervised learning experiments.
Identifying Clusters
The following figure shows two scatterplots:
Figures 1.2: Two distinct scatterplots
The following figure separates the scatterplots into two distinct clusters:
Figure 1.3: Scatterplots clearly showing clusters that exist in a provided dataset
Both figures display randomly generated number pairs (x,y coordinates) pulled from a Gaussian distribution. Simply by glancing at Figure 1.2, it should be plainly obvious where the clusters exist in your data – in real life, it will never be this easy. Now that you know that the data can be clearly separated into two clusters, you can start to understand what differences exist between the two groups.
Rewinding a bit from where unsupervised learning fits into the larger machine learning environmen...