Cluster Analysis

About this book

Cluster analysis comprises a range of methods for classifying multivariate data into subgroups. By organizing multivariate data into such subgroups, clustering can help reveal the characteristics of any structure or patterns present. These techniques have proven useful in a wide range of areas such as medicine, psychology, market research and bioinformatics.

This fifth edition of the highly successful Cluster Analysis includes coverage of the latest developments in the field and a new chapter dealing with finite mixture models for structured data.

Real life examples are used throughout to demonstrate the application of the theory, and figures are used extensively to illustrate graphical techniques. The book is comprehensive yet relatively non-mathematical, focusing on the practical aspects of cluster analysis.

Key Features:

  • Presents a comprehensive guide to clustering techniques, with focus on the practical aspects of cluster analysis
  • Provides a thorough revision of the fourth edition, including new developments in clustering longitudinal data and examples from bioinformatics and gene studies
  • Updates the chapter on mixture models to include recent developments and presents a new chapter on mixture modeling for structured data

Practitioners and researchers working in cluster analysis and data analysis will benefit from this book.

Cluster Analysis by Brian S. Everitt, Sabine Landau, Morven Leese and Daniel Stahl is available in PDF and/or ePUB format.

Information

Publisher
Wiley
Year
2011
Print ISBN
9780470749913
eBook ISBN
9780470978443
Chapter 1
An Introduction to Classification and Clustering
1.1 Introduction
An intelligent being cannot treat every object it sees as a unique entity unlike anything else in the universe. It has to put objects in categories so that it may apply its hard-won knowledge about similar objects encountered in the past, to the object at hand.
Steven Pinker, How the Mind Works, 1997.
One of the most basic abilities of living creatures involves the grouping of similar objects to produce a classification. The idea of sorting similar things into categories is clearly a primitive one since early man, for example, must have been able to realize that many individual objects shared certain properties such as being edible, or poisonous, or ferocious and so on.
Classification, in its widest sense, is needed for the development of language, which consists of words which help us to recognize and discuss the different types of events, objects and people we encounter. Each noun in a language, for example, is essentially a label used to describe a class of things which have striking features in common; thus animals are named as cats, dogs, horses, etc., and such a name collects individuals into groups. Naming and classifying are essentially synonymous.
As well as being a basic human conceptual activity, classification is also fundamental to most branches of science. In biology for example, classification of organisms has been a preoccupation since the very first biological investigations. Aristotle built up an elaborate system for classifying the species of the animal kingdom, which began by dividing animals into two main groups, those having red blood (corresponding roughly to our own vertebrates), and those lacking it (the invertebrates). He further subdivided these two groups according to the way in which the young are produced, whether alive, in eggs, as pupae and so on.
Following Aristotle, Theophrastos wrote the first fundamental accounts of the structure and classification of plants. The resulting books were so fully documented, so profound and so all-embracing in their scope that they provided the groundwork of biological research for many centuries. They were superseded only in the 17th and 18th centuries, when the great European explorers, by opening the rest of the world to inquiring travellers, created the occasion for a second, similar programme of research and collection, under the direction of the Swedish naturalist, Linnaeus. In 1737, Carl von Linné published his work Genera Plantarum, from which the following quotation is taken:
All the real knowledge which we possess, depends on methods by which we distinguish the similar from the dissimilar. The greater the number of natural distinctions this method comprehends the clearer becomes our idea of things. The more numerous the objects which employ our attention the more difficult it becomes to form such a method and the more necessary.
For we must not join in the same genus the horse and the swine, though both species had been one hoof'd nor separate in different genera the goat, the reindeer and the elk, tho' they differ in the form of their horns. We ought therefore by attentive and diligent observation to determine the limits of the genera, since they cannot be determined a priori. This is the great work, the important labour, for should the genera be confused, all would be confusion.
In biology, the theory and practice of classifying organisms is generally known as taxonomy. Initially, taxonomy in its widest sense was perhaps more of an art than a scientific method, but eventually less subjective techniques were developed largely by Adanson (1727–1806), who is credited by Sokal and Sneath (1963) with the introduction of the polythetic type of system into biology, in which classifications are based on many characteristics of the objects being studied, as opposed to monothetic systems, which use a single characteristic to produce a classification.
The classification of animals and plants has clearly played an important role in the fields of biology and zoology, particularly as a basis for Darwin's theory of evolution. But classification has also played a central role in the developments of theories in other fields of science. The classification of the elements in the periodic table for example, produced by Mendeleyev in the 1860s, has had a profound impact on the understanding of the structure of the atom. Again, in astronomy, the classification of stars into dwarf stars and giant stars using the Hertzsprung–Russell plot of temperature against luminosity (Figure 1.1) has strongly affected theories of stellar evolution.
Figure 1.1 Hertzsprung–Russell plot of temperature against luminosity.
Classification may involve people, animals, chemical elements, stars, etc., as the entities to be grouped. In this text we shall generally use the term object to cover all such possibilities.
1.2 Reasons for Classifying
At one level, a classification scheme may simply represent a convenient method for organizing a large data set so that it can be understood more easily and information retrieved more efficiently. If the data can validly be summarized by a small number of groups of objects, then the group labels may provide a very concise description of patterns of similarities and differences in the data. In market research, for example, it may be useful to group a large number of respondents according to their preferences for particular products. This may help to identify a ‘niche product’ for a particular type of consumer. The need to summarize data sets in this way is increasingly important because of the growing number of large databases now available in many areas of science, and the exploration of such databases using cluster analysis and other multivariate analysis techniques is now often called data mining. In the 21st century, data mining has become of particular interest for investigating material on the World Wide Web, where the aim is to extract useful information or knowledge from web page contents (see Liu, 2007, for more details).
In many applications, however, investigators may be looking for a classification which, in addition to providing a useful summary of the data, also serves some more fundamental purpose. Medicine provides a good example. To understand and treat disease it has to be classified, and in general the classification will have two main aims. The first will be prediction – separating diseases that require different treatments. The second will be to provide a basis for research into aetiology – the causes of different types of disease. It is these two aims that a clinician has in mind when she makes a diagnosis.
It is almost always the case that a variety of alternative classifications exist for the same set of objects. Human beings, for example, may be classified with respect to economic status into groups such as lower class, middle class and upper class; alternatively they might be classified by annual consumption of alcohol into low, medium and high. Clearly such different classifications may not collect the same individuals into groups. Some classifications are, however, more likely to be of general use than others, a point well-made by Needham (1965) in discussing the classification of humans into men and women:
The usefulness of this classification does not begin and end with all that can, in one sense, be strictly inferred from it – namely a statement about sexual organs. It is a very useful classification because classing a person as a man or woman conveys a great deal more information, about probable relative size, strength, certain types of dexterity and so on. When we say that persons in class man are more suitable than persons in class woman for certain tasks and conversely, we are only incidentally making a remark about sex, our primary concern being with strength, endurance etc. The point is that we have been able to use a classification of persons which conveys information on many properties. On the contrary a classification of persons into those with hair on their forearms between [fraction] and [fraction] inch long and those without, though it may serve some particular use, is certainly of no general use, for imputing membership in the former class to a person conveys information in this property alone. Put another way, there are no known properties which divide up a set of people in a similar manner.
A similar point can be made in respect of the classification of books based on subject matter and their classification based on the colour of the book's binding. The former, with classes such as dictionaries, novels, biographies, etc., will be of far wider use than the latter with classes such as green, blue, red, etc. The reason why the first is more useful than the second is clear; the subject matter classification indicates more of a book's characteristics than the latter.
So it should be remembered that in general a classification of a set of objects is not like a scientific theory and should perhaps be judged largely on its usefulness, rather than in terms of whether it is ‘true’ or ‘false’.
1.3 Numerical Methods of Classification – Cluster Analysis
Numerical techniques for deriving classifications originated largely in the natural sciences such as biology and zoology in an effort to rid taxonomy of its traditionally subjective nature. The aim was to provide objective and stable classifications. Objective in the sense that the analysis of the same set of organisms by the same sequence of numerical methods produces the same classification; stable in that the classification remains the same under a wide variety of additions of organisms or of new characteristics describing them.
A number of names have been applied to these numerical methods depending largely on the area of application. Numerical taxonomy is generally used in biology. In psychology the term Q analysis is sometimes employed. In the artificial intelligence literature unsupervised pattern recognition is the favoured label, and market researchers often talk about segmentation. But nowadays cluster analysis is probably the preferred generic term for procedures which seek to uncover groups in data.
In most applications of cluster analysis a partition of the data is sought, in which each individual or object belongs to a single cluster, and the complete set of clusters contains all individuals. In some circumstances, however, overlapping clusters may provide a more acceptable solution. It must also be remembered that one acceptable answer from a cluster analysis is that no grouping of the data is justified.
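The two conditions that define a partition (each object belongs to exactly one cluster, and the clusters together contain every object) can be checked mechanically. The following is a minimal sketch; the function name `is_partition` and the toy objects are assumptions made for illustration and do not come from the book.

```python
def is_partition(clusters, objects):
    """Return True if `clusters` forms a partition of `objects`.

    A partition requires that no object appears in more than one
    cluster and that every object appears in some cluster.
    """
    seen = []
    for cluster in clusters:
        seen.extend(cluster)
    no_overlap = len(seen) == len(set(seen))   # no object counted twice
    complete = set(seen) == set(objects)       # every object is covered
    return no_overlap and complete


# Hypothetical objects, used only to exercise the check.
objects = ["a", "b", "c", "d"]

print(is_partition([{"a", "b"}, {"c", "d"}], objects))       # True
print(is_partition([{"a", "b"}, {"b", "c", "d"}], objects))  # False: "b" overlaps
print(is_partition([{"a", "b"}, {"c"}], objects))            # False: "d" is unassigned
```

The second call corresponds to the overlapping-cluster case mentioned above: it is not a partition, although it may still be an acceptable clustering in some applications.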
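The basic data for most applications of cluster analysis is the usual n × p multivariate data matrix, X, containing the variable values describing each object to be clustered; that is,
\[
\mathbf{X} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
\]
The entry $x_{ij}$ in X gives the value of the jth variable on object i. Such a matrix is often termed ‘two-mode’, indicating that the rows and columns correspond to different things.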
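As a concrete illustration of the n × p data matrix X described above, the sketch below stores a small matrix as a list of rows, one row per object and one column per variable. The numerical values and the helper `entry` are hypothetical and chosen only to show the indexing convention.

```python
# Hypothetical data: n = 4 objects (rows) described by p = 3 variables (columns).
X = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
    [5.8, 2.7, 5.1],
]

n = len(X)      # number of objects
p = len(X[0])   # number of variables


def entry(X, i, j):
    """Return x_ij, the value of the jth variable on object i.

    The text indexes from 1, whereas Python lists index from 0,
    so both indices are shifted by one.
    """
    return X[i - 1][j - 1]


print(n, p)            # 4 3
print(entry(X, 3, 2))  # 3.3
```

Rows index objects and columns index variables, which is what makes the matrix two-mode.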
The variables in X may often be a mixture of continuous, ordinal and/or categorical, and often some entries will be missing. Mixed variables and missing values may complicate the clustering of data, as we shall see in later chapters. And in some applications, the rows of the matrix X may contain repeated measures of the same variable but under, for example, different conditions, or at different times, or at a number of spatial positions, etc. A simple example in the time domain is provided by measurements of, say, the heights of children each month for several years. Such structured data are of a special nature in that all variables are measured on the same scale, and the cluster analysis of structured data may require different approaches from the clustering of unstructured data, as we will see in Chapter 3 and in Chapter 7.
Some cluster analysis techniques begin by converting the matrix X into an n × n matrix of inter-object similarities, dissimilarities or distances (a general term is proximity), a procedure to be discussed in detail in Chapter 3. (Such matrices may be designated ‘one-mode’, indicating that their rows and columns index the same thing.) But in some applications the inter-object similarity or dissimilarity matrix may arise directly, particularly in experiments where people are asked to judge the perceived similarity or dissimilarity of a set of stimuli or objects of interest. As an example, Table 1.1 shows judgements about various brands of cola made by two subjects, using a visual analogue scale with anchor points ‘same’ (having a score of 0) and ‘different’ (having a score of 100). In this example the resulting rating for a pair of colas is a dis...
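The conversion of an n × p data matrix into an n × n one-mode proximity matrix can be sketched as follows. Euclidean distance is used here purely as an assumption for illustration; the text defers the choice of proximity measure to Chapter 3. The function name and the toy data are likewise hypothetical.

```python
import math


def euclidean_distance_matrix(X):
    """Return the n x n matrix of Euclidean distances between the rows of X.

    The result is symmetric with zeros on the diagonal, and its rows
    and columns both index the objects (a one-mode matrix).
    """
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[k])))
            D[i][k] = d
            D[k][i] = d  # distance from i to k equals distance from k to i
    return D


# Hypothetical data: three objects measured on two variables.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = euclidean_distance_matrix(X)
for row in D:
    print(row)
# [0.0, 5.0, 10.0]
# [5.0, 0.0, 5.0]
# [10.0, 5.0, 0.0]
```

Only the upper triangle is computed and then mirrored, since a distance matrix of this kind is symmetric.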

Table of contents

  1. Cover
  2. Wiley Series in Probability and Statistics
  3. Title Page
  4. Copyright
  5. Dedication
  6. Preface
  7. Acknowledgement
  8. Chapter 1: An Introduction to classification and clustering
  9. Chapter 2: Detecting clusters graphically
  10. Chapter 3: Measurement of proximity
  11. Chapter 4: Hierarchical clustering
  12. Chapter 5: Optimization clustering techniques
  13. Chapter 6: Finite mixture densities as models for cluster analysis
  14. Chapter 7: Model-based cluster analysis for structured data
  15. Chapter 8: Miscellaneous clustering methods
  16. Chapter 9: Some final comments and guidelines
  17. References
  18. Index