Daniel Veidlinger
Computational Linguistics and the Buddhist Corpus
The process of reading and interpreting religious texts has been going on for millennia, and in fact many of the hermeneutical techniques that scholars throughout the humanities use today were developed over the centuries in attempts to get at the meaning – hidden, metaphorical or otherwise – of various religious texts. Many of the greatest advances in communication technology have also taken place in the effort to preserve and transmit religious texts, from the astonishingly accurate oral transmission of the Hindu Vedas to the legends of Egyptian deities depicted in hieroglyphs, and on through the block printing of the Buddhist sutras and later the movable type of Gutenberg’s Bible. Many people’s first encounter with the radio was through hearing a preacher’s voice emanating from the speaker, and in contemporary times over a quarter of Americans regularly use the Internet to find information about religion (Pew Foundation 2001). It is now possible to use computers and other related digital technologies to help in the hermeneutical enterprise. This chapter will focus on some of the more popular computational language processing and text mining techniques and explain how they can be used to further our understanding of Buddhist texts and reveal new perspectives on their meaning.
A human scholar might read all of the words in a passage and examine their individual meanings, then consider the context in which the words occur, what is known about the author, the historical circumstances surrounding the creation of the text, and perform many other intellectual maneuvers in order to understand the passage. A computer, on the other hand, is not able at its current stage of development to understand the passage in the same way. However, computers are able to digest enormous amounts of text – millions upon millions of words that would take many lifetimes for a human to read – and apply various algorithms to that text in order to find relationships between words, hidden patterns, and stylistic features that are not immediately evident to a human reader. As John Burrows, an important pioneer of this kind of analysis, states,
Statistical analysis is necessary for the management of words that occur too frequently to be studied one by one… they constitute the underlying fabric of a text, a barely visible web that gives shape to whatever is being said… An appropriate analogy, perhaps, is with the contrast between handwoven rugs where the russet tones predominate and those where they give way to the greens and blues. The principal point of interest is neither a single stitch, a single thread, nor even a single color, but the overall effect (Burrows 2004, 323–324).
These digital techniques are not intended to replace human readers; rather, they are best used in tandem with the insights gained by close human reading, for the human scholar is invariably forced to draw conclusions about a corpus based on only a sampling of the texts within it. Ultimately, of course, all literary analysis depends upon massive processing of data and detection of trends. The traditional way, however, relies upon years of research collected in the head of the human critic, who over time develops the ability to detect meaningful patterns reliably, whereas a computer does this explicitly and in an instant. A human critic, in other words, is never just reading one document in isolation, but is processing that document through a neural net constructed in her own brain from previous readings of hundreds or thousands of documents that have left a latent impression. As Burrows puts it,
literary analysis often rests upon seemingly intuitive insights and discriminations, processes that may seem remote from the gathering and combining and classifying on which [digital humanities] have concentrated and in which computational stylistics is usually engaged. But those insights and discriminations are not ultimately intuitive because they draw, albeit covertly, upon data gathered in a lifetime’s reading, stored away in a subconscious memory bank, and put to use, as Samuel Johnson reminds us, through processes of comparison and classification, whether tacit or overt (Burrows 2004, 344).
Digital techniques can help scholars expand the range of data they are able to examine beyond what close reading alone allows. Insights gained from an initial close reading can be confirmed or contradicted by statistical analysis of the entire corpus, and new insights gained from the mechanical reading process can in turn be cycled back and checked against the texts through further close reading. Ideally, therefore, the two approaches should be used to complement each other. In this chapter, I will examine a few of the more popular statistically based methods and provide some examples of how these techniques can be used to discover new insights about Buddhist texts.
The techniques that will be examined in this chapter have been used for some time in the fields of Digital Humanities, Machine Learning and Natural Language Processing, and there are several publicly available systems that can be used to deploy them. These techniques are Term Frequency-Inverse Document Frequency (TF-IDF), Collocation Analysis and Vector Space Semantic Mapping.1 Each of these techniques is able to process very large amounts of text and look for relations between words that can tell us a great deal about the overall topic of a text and the different ways words are used within it.
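As a brief preview of the kind of computation these techniques involve, the following sketch calculates a TF-IDF score by hand for a single word. The three toy “documents” are loose English paraphrases supplied purely for illustration, and the corpus, document names and helper functions are my own assumptions rather than part of any particular system.

```python
import math

# Three tiny toy "documents": loose English paraphrases used only
# to illustrate the arithmetic, not actual corpus files.
docs = {
    "metta": "may all beings be happy may all beings be free from suffering",
    "fire_sermon": "all is burning the eye is burning forms are burning",
    "dhammapada_1": "mind precedes all mental states mind is their chief",
}
tokenized = {name: text.split() for name, text in docs.items()}

def tf(term, tokens):
    # Term frequency: how often the term occurs in one document,
    # normalized by that document's length.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Inverse document frequency: rarer terms (appearing in fewer
    # documents) receive a higher weight.
    containing = sum(1 for toks in corpus.values() if term in toks)
    return math.log(len(corpus) / containing)

# "burning" is frequent in the Fire Sermon paraphrase and absent from
# the other documents, so its TF-IDF score there is relatively high,
# flagging it as a distinctive topic word for that text.
score = tf("burning", tokenized["fire_sermon"]) * idf("burning", tokenized)
print(f"TF-IDF of 'burning' in fire_sermon: {score:.3f}")
```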
Of course, for any of these techniques to work, the text that one wishes to examine must be machine-readable and properly formatted. The first task, then, is to identify a good machine-readable version of the text one wishes to analyze. Ideally, the text should be in a raw text form, such as a file ending in .txt. Various transformations must then be performed on the text during the preprocessing phase, including sentence boundary detection, punctuation cleansing, stemming and normalization of spelling. Punctuation marks can cause a great deal of confusion for the algorithms and skew the results significantly if they are not dealt with properly. For example, in the sentence “The Buddha taught the Dharma, and the Dharma lives on today in many forms” we would want the computer to recognize that “Dharma,” (note the comma) and “Dharma” are the same term. Although this might seem straightforward, a number of complicated issues arise that must be resolved, because the punctuation may sometimes carry important semantic meaning, as in a hyphenated word, so that removing it will lead the computer down the wrong path. One of the benefits of working with an extremely large corpus, however, is that in many cases these issues resolve themselves, as the number of correct hits far outweighs the number of improperly parsed terms. Stemming involves associating the different forms of a word with the same stem or lemma, which, again, can greatly skew the results if not done correctly. Should plural and singular forms of the same noun be associated with each other, for example, so that three occurrences of the word “ox” and two of “oxen” would count as five occurrences of the lemma “ox”? What about different tenses of the same verb? It is also important to associate contractions with the correct long form, for example “isn’t” with “is not.” These are all questions that need to be resolved, although the answers may differ depending on the nature of the text one is dealing with and the kinds of questions one wishes to ask.2
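To make these preprocessing steps concrete, here is a minimal sketch of how they might be implemented for an English translation. The tiny lemma table, the contraction list and the function name are illustrative assumptions; a real project would use a full stemmer or lemmatizer suited to the language of the corpus.

```python
import re

# A tiny hand-made lemma table standing in for a real stemmer or
# lemmatizer: "oxen" maps to "ox" so both forms count toward one lemma.
LEMMAS = {"oxen": "ox", "taught": "teach", "lives": "live"}

# A few contractions expanded to their long forms, e.g. "isn't" -> "is not".
CONTRACTIONS = {"isn't": "is not", "don't": "do not"}

def preprocess(text):
    text = text.lower()
    # Expand contractions before stripping punctuation, so the
    # apostrophe is not lost prematurely.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove punctuation so that "dharma," and "dharma" become the same
    # token; hyphens are kept here because they can carry semantic meaning.
    text = re.sub(r"[^\w\s-]", " ", text)
    tokens = text.split()
    # Map each token to its lemma if the table has one, otherwise keep it.
    return [LEMMAS.get(tok, tok) for tok in tokens]

print(preprocess("The Buddha taught the Dharma, and the Dharma lives on today."))
# ['the', 'buddha', 'teach', 'the', 'dharma', 'and', 'the', 'dharma',
#  'live', 'on', 'today']
```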
Associated with these issues is the question of determining the size of the units that one wants to examine in the analysis. One may wish to process each word separately, or one may wish to process 2, 3, 4 or more words together in order to retain the meaning of phrases, as Chris Handy discusses in his chapter herein. For example, the phrase “the four noble truths” would obviously be processed very differently by an algorithm that allows for 4-gram phrases than by one that just looks at each word individually. There is no single “correct” way to process texts, and the determination of the number of words to be examined as a unit is up to the researcher, with trial and error often being the best or even the only way of knowing which works better. Much depends on exactly what the purpose of the analysis in question is. For some lines of research, a uni-gram parse might be best, and for others, a multi-gram parse might fare better. The results, as any responsible Digital Humanities scholar will admit, always have to be judged and tweaked in light of the learned opinion of the researcher. There will almost always be results in any language processing or data mining project that do not seem to make any sense and can be discarded. It is important, however, at least to try to understand why the system produced the problematic output, because therein might lie some of the most useful insights, precisely because they go against what was previously held to be the case.
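The following short sketch illustrates the difference such a choice makes by extracting n-grams of a chosen size from a token list; the function name and the example sentence are my own, supplied only to show how a uni-gram parse and a 4-gram parse treat the same phrase.

```python
def ngrams(tokens, n):
    """Return all n-grams in a token sequence as tuples,
    sliding a window of size n one token at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the four noble truths were taught at sarnath".split()

# A uni-gram parse treats each word as a separate unit...
print(ngrams(tokens, 1))  # [('the',), ('four',), ('noble',), ('truths',), ...]

# ...while a 4-gram parse keeps "the four noble truths" together
# as a single unit of analysis.
print(ngrams(tokens, 4))  # [('the', 'four', 'noble', 'truths'), ...]
```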
For Buddhist studies, there are many sources of digitized texts that can be used...