Chapter 1
An Introduction to Classification and Clustering
1.1 Introduction
An intelligent being cannot treat every object it sees as a unique entity unlike anything else in the universe. It has to put objects in categories so that it may apply its hard-won knowledge about similar objects encountered in the past, to the object at hand.
Steven Pinker, How the Mind Works, 1997.
One of the most basic abilities of living creatures involves the grouping of similar objects to produce a classification. The idea of sorting similar things into categories is clearly a primitive one since early man, for example, must have been able to realize that many individual objects shared certain properties such as being edible, or poisonous, or ferocious and so on.
Classification, in its widest sense, is needed for the development of language, which consists of words which help us to recognize and discuss the different types of events, objects and people we encounter. Each noun in a language, for example, is essentially a label used to describe a class of things which have striking features in common; thus animals are named as cats, dogs, horses, etc., and such a name collects individuals into groups. Naming and classifying are essentially synonymous.
As well as being a basic human conceptual activity, classification is also fundamental to most branches of science. In biology, for example, classification of organisms has been a preoccupation since the very first biological investigations. Aristotle built up an elaborate system for classifying the species of the animal kingdom, which began by dividing animals into two main groups, those having red blood (corresponding roughly to our own vertebrates), and those lacking it (the invertebrates). He further subdivided these two groups according to the way in which the young are produced, whether alive, in eggs, as pupae and so on.
Following Aristotle, Theophrastos wrote the first fundamental accounts of the structure and classification of plants. The resulting books were so fully documented, so profound and so all-embracing in their scope that they provided the groundwork of biological research for many centuries. They were superseded only in the 17th and 18th centuries, when the great European explorers, by opening the rest of the world to inquiring travellers, created the occasion for a second, similar programme of research and collection, under the direction of the Swedish naturalist, Linnaeus. In 1737, Carl von Linné published his work Genera Plantarum, from which the following quotation is taken:
All the real knowledge which we possess, depends on methods by which we distinguish the similar from the dissimilar. The greater the number of natural distinctions this method comprehends the clearer becomes our idea of things. The more numerous the objects which employ our attention the more difficult it becomes to form such a method and the more necessary.
For we must not join in the same genus the horse and the swine, though both species had been one hoof'd nor separate in different genera the goat, the reindeer and the elk, tho' they differ in the form of their horns. We ought therefore by attentive and diligent observation to determine the limits of the genera, since they cannot be determined a priori. This is the great work, the important labour, for should the genera be confused, all would be confusion.
In biology, the theory and practice of classifying organisms is generally known as taxonomy. Initially, taxonomy in its widest sense was perhaps more of an art than a scientific method, but eventually less subjective techniques were developed largely by Adanson (1727–1806), who is credited by Sokal and Sneath (1963) with the introduction of the polythetic type of system into biology, in which classifications are based on many characteristics of the objects being studied, as opposed to monothetic systems, which use a single characteristic to produce a classification.
The classification of animals and plants has clearly played an important role in the fields of biology and zoology, particularly as a basis for Darwin's theory of evolution. But classification has also played a central role in the development of theories in other fields of science. The classification of the elements in the periodic table, for example, produced by Mendeleyev in the 1860s, has had a profound impact on the understanding of the structure of the atom. Again, in astronomy, the classification of stars into dwarf stars and giant stars using the Hertzsprung–Russell plot of temperature against luminosity (Figure 1.1) has strongly affected theories of stellar evolution.
Classification may involve people, animals, chemical elements, stars, etc., as the entities to be grouped. In this text we shall generally use the term object to cover all such possibilities.
1.2 Reasons for Classifying
At one level, a classification scheme may simply represent a convenient method for organizing a large data set so that it can be understood more easily and information retrieved more efficiently. If the data can validly be summarized by a small number of groups of objects, then the group labels may provide a very concise description of patterns of similarities and differences in the data. In market research, for example, it may be useful to group a large number of respondents according to their preferences for particular products. This may help to identify a 'niche product' for a particular type of consumer. The need to summarize data sets in this way is increasingly important because of the growing number of large databases now available in many areas of science, and the exploration of such databases using cluster analysis and other multivariate analysis techniques is now often called data mining. In the 21st century, data mining has become of particular interest for investigating material on the World Wide Web, where the aim is to extract useful information or knowledge from web page contents (see Liu, 2007, for more details).
In many applications, however, investigators may be looking for a classification which, in addition to providing a useful summary of the data, also serves some more fundamental purpose. Medicine provides a good example. To understand and treat disease it has to be classified, and in general the classification will have two main aims. The first will be prediction: separating diseases that require different treatments. The second will be to provide a basis for research into aetiology, that is, the causes of different types of disease. It is these two aims that a clinician has in mind when she makes a diagnosis.
It is almost always the case that a variety of alternative classifications exist for the same set of objects. Human beings, for example, may be classified with respect to economic status into groups such as lower class, middle class and upper class; alternatively they might be classified by annual consumption of alcohol into low, medium and high. Clearly such different classifications may not collect the same individuals into groups. Some classifications are, however, more likely to be of general use than others, a point well made by Needham (1965) in discussing the classification of humans into men and women:
The usefulness of this classification does not begin and end with all that can, in one sense, be strictly inferred from it – namely a statement about sexual organs. It is a very useful classification because classing a person as a man or woman conveys a great deal more information, about probable relative size, strength, certain types of dexterity and so on. When we say that persons in class man are more suitable than persons in class woman for certain tasks and conversely, we are only incidentally making a remark about sex, our primary concern being with strength, endurance etc. The point is that we have been able to use a classification of persons which conveys information on many properties. On the contrary a classification of persons into those with hair on their forearms between [...] and [...] inch long and those without, though it may serve some particular use, is certainly of no general use, for imputing membership in the former class to a person conveys information in this property alone. Put another way, there are no known properties which divide up a set of people in a similar manner.
A similar point can be made in respect of the classification of books based on subject matter and their classification based on the colour of the book's binding. The former, with classes such as dictionaries, novels, biographies, etc., will be of far wider use than the latter with classes such as green, blue, red, etc. The reason why the first is more useful than the second is clear; the subject matter classification indicates more of a book's characteristics than the latter.
So it should be remembered that in general a classification of a set of objects is not like a scientific theory and should perhaps be judged largely on its usefulness, rather than in terms of whether it is 'true' or 'false'.
1.3 Numerical Methods of Classification – Cluster Analysis
Numerical techniques for deriving classifications originated largely in the natural sciences such as biology and zoology in an effort to rid taxonomy of its traditionally subjective nature. The aim was to provide objective and stable classifications: objective in the sense that the analysis of the same set of organisms by the same sequence of numerical methods produces the same classification; stable in that the classification remains the same under a wide variety of additions of organisms or of new characteristics describing them.
A number of names have been applied to these numerical methods depending largely on the area of application. Numerical taxonomy is generally used in biology. In psychology the term Q analysis is sometimes employed. In the artificial intelligence literature unsupervised pattern recognition is the favoured label, and market researchers often talk about segmentation. But nowadays cluster analysis is probably the preferred generic term for procedures which seek to uncover groups in data.
In most applications of cluster analysis a partition of the data is sought, in which each individual or object belongs to a single cluster, and the complete set of clusters contains all individuals. In some circumstances, however, overlapping clusters may provide a more acceptable solution. It must also be remembered that one acceptable answer from a cluster analysis is that no grouping of the data is justified.
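The distinction between a partition and an overlapping solution can be made concrete with a minimal Python sketch. The function name `is_partition`, the toy objects and the rule that empty clusters are rejected are all assumptions introduced here for illustration; they do not come from the text.

```python
from typing import Hashable, Iterable, Sequence, Set


def is_partition(objects: Iterable[Hashable], clusters: Sequence[Set[Hashable]]) -> bool:
    """Return True if `clusters` is a partition of `objects`.

    A partition requires that each object belongs to a single cluster
    and that the complete set of clusters contains all objects.
    """
    all_objects = set(objects)
    seen: Set[Hashable] = set()
    for cluster in clusters:
        if not cluster:        # assumption: empty clusters are not allowed
            return False
        if cluster & seen:     # some object is already in an earlier cluster
            return False
        seen |= cluster
    return seen == all_objects  # nothing missing and nothing extra


# Hypothetical objects and two candidate groupings.
objects = ["a", "b", "c", "d", "e"]
hard = [{"a", "b"}, {"c"}, {"d", "e"}]
overlapping = [{"a", "b", "c"}, {"c", "d", "e"}]

print(is_partition(objects, hard))         # True
print(is_partition(objects, overlapping))  # False: "c" is in two clusters
```

Under this definition a single cluster holding every object is also formally a partition, which is one way of recording the answer that no grouping of the data is justified.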
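The basic data for most applications of cluster analysis is the usual n × p multivariate data matrix, X, containing the variable values describing each object to be clustered; that is,

\[
\mathbf{X} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
\]

The entry $x_{ij}$ in X gives the value of the jth variable on object i. Such a matrix is often termed 'two-mode', indicating that the rows and columns correspond to different things.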
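As a small illustration of this layout, the sketch below stores a data matrix as a list of rows in plain Python. The numbers and the helper `entry` are made up for the example; the only point is that rows index objects, columns index variables, and the text's indices i and j are 1-based.

```python
# Hypothetical data: n = 4 objects (rows) measured on p = 3 variables (columns).
X = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
    [5.8, 2.7, 5.1],
]

n = len(X)     # number of objects
p = len(X[0])  # number of variables


def entry(X, i, j):
    """Return x_ij, the value of variable j on object i (1-based, as in the text)."""
    return X[i - 1][j - 1]


print(n, p)            # 4 3
print(entry(X, 3, 2))  # 3.3
```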
The variables in X may often be a mixture of continuous, ordinal and/or categorical, and often some entries will be missing. Mixed variables and missing values may complicate the clustering of data, as we shall see in later chapters. And in some applications, the rows of the matrix X may contain repeated measures of the same variable but under, for example, different conditions, or at different times, or at a number of spatial positions, etc. A simple example in the time domain is provided by measurements of, say, the heights of children each month for several years. Such structured data are of a special nature in that all variables are measured on the same scale, and the cluster analysis of structured data may require different approaches from the clustering of unstructured data, as we will see in Chapter 3 and in Chapter 7.
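To see why mixed variable types and missing entries can complicate matters, consider the following sketch. The records, the variable names and the use of `None` as a missing-value marker are assumptions chosen for the example, not part of the text.

```python
# Hypothetical records: (age in years, pain rating, blood group).
# None marks a missing entry.
rows = [
    (34.0, "mild", "A"),
    (51.5, None, "O"),
    (None, "severe", "B"),
    (28.0, "moderate", None),
]
names = ["age", "pain", "blood_group"]
kinds = {"age": "continuous", "pain": "ordinal", "blood_group": "categorical"}

# Count the missing entries for each variable.
missing = {name: sum(row[j] is None for row in rows) for j, name in enumerate(names)}

# Keep only the objects with no missing entries.
complete = [row for row in rows if None not in row]

print(missing)        # {'age': 1, 'pain': 1, 'blood_group': 1}
print(len(complete))  # 1
```

Although each variable here is missing only one value, discarding incomplete objects would leave a single row, and no single arithmetic distance applies directly to all three variable types.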
Some cluster analysis techniques begin by converting the matrix X into an n × n matrix of inter-object similarities, dissimilarities or distances (a general term is proximity), a procedure to be discussed in detail in Chapter 3. (Such matrices may be designated 'one-mode', indicating that their rows and columns index the same thing.) But in some applications the inter-object similarity or dissimilarity matrix may arise directly, particularly in experiments where people are asked to judge the perceived similarity or dissimilarity of a set of stimuli or objects of interest. As an example, Table 1.1 shows judgements about various brands of cola made by two subjects, using a visual analogue scale with anchor points 'same' (having a score of 0) and 'different' (having a score of 100). In this example the resulting rating for a pair of colas is a dis...