Introduction to LSA
Theory and Methods
LSA as a Theory of Meaning
Thomas K Landauer
Pearson Knowledge Technologies and University of Colorado
The fundamental scientific puzzle addressed by the latent semantic analysis (LSA) theory is that there are hundreds of distinctly different human languages, every one with tens of thousands of words. The ability to understand the meanings of utterances composed of these words must be acquired by virtually every human who grows up surrounded by language. There must, therefore, be some humanly shared method—some computational system—by which any human mind can learn to do this for any language by extensive immersion, and without being explicitly taught definitions or rules for any significant number of words.
Most past and still popular discussions of the problem focus on debates concerning how much of this capability is innate and how much learned (Chomsky, 1991b) or what abstract architectures of cognition might support it—such as whether it rests on association (Skinner, 1957) or requires a theory of mind (Bloom, 2000).
The issue with which LSA is concerned is different. LSA theory addresses the problem of exactly how word and passage meaning can be constructed from experience with language, that is, by what mechanisms—instinctive, learned, or both—this can be accomplished.
Carefully describing and analyzing the phenomenon has been the center of attention for experimental psychology, linguistics, and philosophy. Other areas of interest include pinpointing what parts of the brain are most heavily involved in which functions and how they interact, or positing functional modules and system models. But, although necessary or useful, these approaches do not solve the problem of how it is possible to make the brain, or any other system, acquire the needed abilities at their natural scale and rate.
This leads us to ask the question: Suppose we have available a corpus of data approximating the mass of intrinsic and extrinsic language-relevant experience that a human encounters, a computer with power that could match that of the human brain, and a sufficiently clever learning algorithm and data storage method. Could it learn the meanings of all the words in any language it was given?
The keystone discovery for LSA was that using just a single simple constraint on the structure of verbal meaning, and a rough approximation to the same experience as humans, LSA can perform many meaning-based cognitive tasks as well as humans.
That this provides a proof that LSA creates meaning is a proposition that manifestly requires defense. Therefore, instead of starting with explication of the workings of the model itself, the chapter first presents arguments in favor of that proposition. The arguments rest on descriptions of what LSA achieves and on how the main counterarguments against the proposition can be discounted.
The Traditional Antilearning Argument
Many well-known thinkers—Plato, Bickerton (1995), Chomsky (1991b), Fodor (1987), Gleitman (1990), Gold (1967), Jackendoff (1992), Osherson, Stob, and Weinstein (1984), Pinker (1994), to name a few—have considered this prima facie impossible, usually on the grounds that humans learn language too easily, that they are exposed to too little evidence, correction, or instruction to make all the conceptual distinctions and generalizations that natural languages demand. This argument has been applied mainly to the learning of grammar, but has been asserted with almost equal conviction to apply to the learning of word meanings as well, most famously by Plato, Chomsky, and Pinker. Given this postulate, it follows that the mind (brain, or any equivalent computational system) must be equipped with other sources of conceptual and linguistic knowledge. This is not an entirely unreasonable hypothesis. After all, the vast majority of living things come equipped with or can develop complex and important behavioral capabilities in isolation from other living things. Given this widely accepted assumption, it would obviously be impossible for a computer using input only from a sample of natural language in the form of unmodified text to come even close to doing things with verbal meaning that humans do.
The LSA Breakthrough
It was thus a major surprise to discover that a conceptually simple algorithm applied to bodies of ordinary text could learn to match literate humans on tasks that if done by people would be assumed to imply understanding of the meaning of words and passages. The model that first accomplished this feat was LSA.
LSA is a computational model that does many humanlike things with language. The following are but a few: After autonomous learning from a large body of representative text, it scores well into the high school student range on a standardized multiple-choice vocabulary test; used alone to rate the adequacy of content of expository essays (other variables are added in full-scale grading systems; Landauer, Laham, & Foltz, 2003a, 2003b), estimated in more than one way, it shares 85%–90% as much information with expert human readers as two human readers share with each other (Landauer, 2002a); it has measured the effect on comprehension of paragraph-to-paragraph coherence better than human coding (Foltz, Kintsch, & Landauer, 1998); it has successfully modeled several laboratory findings in cognitive psychology (Howard, Addis, Jing, & Kahana, chap. 7 in this volume; Landauer, 2002a; Landauer & Dumais, 1997; Lund, Burgess, & Atchley, 1995); it detects improvements in student knowledge from before to after reading as well as human judges (Rehder et al., 1998; Wolfe et al., 1998); it can diagnose schizophrenia from what patients say as well as experienced psychiatrists (Elvevåg, Foltz, Weinberger, & Goldberg, 2005); it improves information retrieval by up to 30% by being able to match queries to documents of the same meaning when there are few or no words in common and reject those with many when irrelevant (Dumais, 1991), and can do the same for queries in one language matching documents in another where no words are alike (Dumais, Landauer, & Littman, 1996); it does its basic functions of correctly simulating human judgments of meaning similarity between paragraphs without modification by the same algorithm in every language to which it has been applied, examples of which include Arabic, Hindi, and Chinese in their native orthographic or ideographic form; and when sets of all LSA similarities among words for perceptual entities such as kinds of objects (e.g., flowers, trees, birds, chairs, or colors) are subjected to multidimensional scaling, the resulting structures match those based on human similarity judgments quite well in many cases, moderately well in others (Laham, 1997, 2000), just as we would expect (and later explain) because text lacks eyes, ears, and fingers.
I view these and its several other successful simulations (see Landauer, 2002a; Landauer, Foltz, & Laham, 1998) as evidence that LSA and models like it (Griffiths & Steyvers, 2003; Steyvers & Griffiths, chap. 21 in this volume) are candidate mechanisms to explain much of how verbal meaning might be learned and used by the human mind.
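To make one of the claims above concrete, namely that a query can be matched to a passage with which it shares no words, the following is a minimal sketch rather than the system used in the cited studies. It assumes the usual formulation of LSA as a truncated singular value decomposition of a word-by-passage count matrix. The five-passage corpus, the choice of k = 2 dimensions, and the helper names `bag_of_words`, `to_latent`, and `cosine` are all invented here for illustration; real applications use far larger corpora, typically apply a term-weighting step first, and retain hundreds of dimensions.

```python
import numpy as np

# Toy corpus (hypothetical). "car" and "automobile" never occur together,
# but both occur with "engine".
docs = [
    "car engine",
    "automobile engine",
    "car engine road",
    "flower petal",
    "rose petal",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}


def bag_of_words(text):
    """Count vector over the vocabulary; unknown words are ignored."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v


# Word-by-passage count matrix: one row per word, one column per passage.
X = np.column_stack([bag_of_words(d) for d in docs])

# Truncated SVD: keep only the k largest singular directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # illustrative choice for this tiny corpus
U_k = U[:, :k]


def to_latent(text):
    """Project a passage or query onto the k retained word-space directions."""
    return U_k.T @ bag_of_words(text)


def cosine(a, b):
    """Cosine similarity, defined as 0.0 if either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


query = "automobile"
scores = [cosine(to_latent(query), to_latent(d)) for d in docs]
shared = [len(set(query.split()) & set(d.split())) for d in docs]

for d, sc, sh in zip(docs, scores, shared):
    print(f"{d!r}: similarity={sc:.3f}, shared words={sh}")
```

In this toy case the query "automobile" comes out highly similar to the passage "car engine" despite having no word in common with it, and essentially unrelated to the two passages about flowers. Nothing about the relation between "car" and "automobile" was supplied by hand; it falls out of the reduced-dimension representation of the co-occurrence data.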
About LSA’s Kind of Theory
LSA offers a very different kind of account of verbal meaning from any that went before, including centuries of theories from philosophy, linguistics, and psychology. Its only real predecessor is an explanation inherent in connectionist models but unrealized yet at scale (O’Reilly & Munakata, 2000). Previous accounts had all been in the form of rules, descriptions, or variables (parts of speech, grammars, etc.) that could only be applied by human intercession, products of the very process that needs explanation. By contrast, at least in programmatic goal, the LSA account demands that the only data allowed the theory and its computational instantiations be those to which natural human language users have access. The theory must operate on the data by means that can be expressed with mathematical rigor, not through the intervention of human judgments. This disallows any linguistic rule or structure unless it can be proved that all human minds do equivalent things without explicit instruction from other speakers, the long unattained goal of the search for a universal grammar. It also rules out as explanations—as contrasted with explorations—computational linguistic systems that are trained on corpora that have been annotated by human speakers in ways that only human speakers can.
This way of explaining language and its meaning is so at odds with most traditional views and speculations that, in Piaget’s terminology, it is hard for many people, both lay and scholar, to accommodate. Thus, before introducing its history and more of its evidence and uses, I want to arm readers with a basic understanding of what LSA is and how it illuminates what verbal meaning might be.
But What is Meaning?
First, however, let us take head-on the question of what it signifies to call something a theory of meaning. For a start, I take it that meaning as carried by words and word strings is what allows modern humans to engage in verbal thought and rich interpersonal communication. But this, of course, still leaves open the question of what meaning itself is.
Philosophers, linguists, humanists, novelists, poets, and theologians have used the word “meaning” in a plethora of ways, ranging, for example, from the truth of matters to intrinsic properties of objects and happenings in the world, to mental constructions of the outside world, to physically irreducible mystical essences, as in Plato’s ideas, to symbols in an internal communication and reasoning system, to potentially true but too vague notions such as how words are used (Wittgenstein, 1953). Some assert that meanings are abstract concepts or properties of the world that exist prior to and independently of any language-dependent representation. This leads to assertions that by nature or definition computers cannot create meaning from data; meaning must exist first. Therefore, what a computer creates, stores, and uses cannot, ipso facto, be meaning itself.
A sort of corollary of this postulate is that what we commonly think of as the meaning of a word has to be derived from, “grounded in,” already meaningful primitives in perception or action (Barsalou, 1999; Glenberg & Robertson, 2000; Harnad, 1990; Searle, 1982). In our view (“our” meaning proponents of LSA-like theories), however, what goes on in the mind (and, by identity, the brain) in direct visual or auditory, or any other perception, is fu...