1 Introduction
Poetic texts can be analyzed from an infinite number of viewpoints, just like any other text and like human behaviour as a whole. Every viewpoint is interesting for some scientific discipline, and the number of viewpoints increases with the advancement of science. Our aim is very restricted, but, nevertheless, it opens up an infinite domain of new problems. And every problem can be solved in different ways. Hence, there is a path without end, wherever one begins and in whatever direction one goes.
In the present volume, we shall concentrate on a small number of methods used in the study of poetic texts and apply them to some already quantified textual properties. Our textual examples are poems; they are often short and each result can be checked even without the use of a computer. Besides, the study of the phonic structure of poems is reasonable because, according to R. Jakobson, in poetry the form stays in the foreground. In prose, the phonic structure is not as prominent as in poetry, and the rhythmic structure of prose depends also on the character of the given language; it is seldom a conspicuous property of a single text. Nevertheless, there is a discipline engaged in the study of prose rhythm.
The methods presented in this study are applied to a corpus of 150 Romanian poems (including also a few “outliers”) written by Mihai Eminescu as they can be found in many editions of his works, texts analysing his works, or on the Internet: http://ro.wikisource.org/wiki/Autor:Mihai_Eminescu.
In the present investigation, quantitative methods proven and tested in studies of prose texts, including methods for text comparison, are applied to poetic texts. Inter-genre or inter-language comparisons are frequently somewhat futile because each genre and each language has its own characteristic ways of text creation; hence most of the properties are significantly different. A statistical test merely confirms this expectation.
We shall study phonic features, word-form frequencies, word-length, word-classes, and the semantic structure of the poems revealing some parts of the author's world of associations. Each of them has many facets, but we concentrate rather on methods and methodology.
An obvious question at the beginning of any book on text studies is: What is a text? However, in contemporary science, such essentialist questions are rather outdated. They require determinations of a kind of Kantian noumenon, the essence of a thing, which does not exist, or, expressed in a weaker form, which would not explain anything because explanations form an infinite hierarchy whereas the “essence” would be a final (and therefore not acceptable) station on this way. Hence the only rational question is: what do we consider as a text? For the purpose of the present study, a text is a linear sequence of meaningful entities, organized also hierarchically (e.g. in the hierarchy sentence, clause, phrase, word, morpheme, syllable, phoneme). In linguistics, we restrict ourselves to spoken or written material but even within this restricted field, we find exceptions. Hypertexts, e.g. on the Internet, which are full of pictures and links, or texts in comics, etc., belong to the domain of intertextuality. Of course, one can study them, too, from various points of view but they are not standard texts of the kind we are interested in. The texts of our interest are written in some script and their entities do not have only a purpose (like the kitchen in a house) but also a meaning, i.e., they refer to objects outside of the text. Nevertheless, even under this restriction, they have many properties in common with other linear sequences, and consequently, many methods used in non-linguistic disciplines can be applied also in linguistics.
In quantitative linguistics, the explication of a text is not one of the aims or results of the research activities, nor is the description of its content or its evaluation (whether aesthetic or stylistic). Quantitative-linguistic research aims at finding regularities which arise due to the effect of – possibly still unknown – background laws. These regularities should not be confused with grammatical rules, which can be learnt or changed or even violated, and appear, in a manner of speaking, on the surface of the texts. We rather search for textual phenomena which are evoked by, and are evidence of, certain background mechanisms. We shall never know all of them but approaching the matter stepwise allows us to penetrate deeper and deeper.
There are five main approaches to text analysis (cf. Altmann 2007, 2009):
| (1) | The static approach is concerned with the text as a whole, comprising the computation of all known properties, stylistic studies, evaluation of frequencies of different phenomena, lengths, polysemy values, word associations, measurement of grammatical structures, rankings, diversifications, classifications, denotative structures, measurement of differences, entropies, etc. This means that the text will be dissected into well-defined units whose properties are studied. For this approach, at least elementary statistical methods are indispensable. Among the obvious tools, mathematical graphs and their properties provide easy ways to describe and display phenomena and relations. |
| (2) | The sequential approach considers text as a linear sequence of entities forming time series, runs, Markov chains, reference chains, etc. These entities comprise degrees of properties, frequencies, metrical feet, distances between elements of the series, etc., as well as the position of certain degrees of a property in a higher unit, e.g. word length positioning in the given sentence. This approach is more complex and frequently requires more complex methods. Corresponding mathematical models may be based on differential and difference equations. |
| (3) | A systemic approach can be started when some of the problems in the first two domains have been solved. Relations between entities, properties and structures which form control cycles and display the self-regulation mechanisms of text are the focus of this approach. Though we know that texts are produced by authors, who consciously obey only grammatical rules and maybe rules of text structure, there are also latent, subconscious forces which compel the speaker/writer to form the text in a special way, e.g. reducing the decoding effort, reducing the memory effort, reducing sentence difficulty, increasing originality, etc. The writer is free with respect to the content but not free with respect to the external form of the text: s/he must abide by some laws if s/he wants to be understood. The axiom concerning the non-existence of isolated entities in language and text is a sufficient motivation for the systemic approach. Investigations of this kind are known from so-called synergetic linguistics (cf. Köhler 2005) and comprise both language and text. |
| (4) | The typological approach consists of comparing all the above-mentioned properties as they occur in texts of different languages, placing the languages and texts on different scales, building fuzzy classes and studying the variability of various phenomena. Though text analysis has played a secondary role in this research, it is now gaining importance (cf. e.g. Kelih 2009; Popescu, Mačutek, Altmann 2009). However, the notorious classifications based on categorical concepts do not yield anything but new, more general, concepts. We need them, but they seldom lead to theoretical progress. |
| (5) | The chaos theoretical approach. All aspects mentioned above contain some elements of chaos, which is located in a deeper layer of all text phenomena. Some phenomena, e.g. fractals, dimensions, and attractors, are identifiable, but because of their indirect relevance for the text sciences and the computational effort they require, they have not yet been sufficiently discussed (cf. Hřebíček 1997, 2000; Andres 2010; Andres, Benešová 2011). |
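The difference between the first two approaches can be made concrete with a minimal sketch. It is only an illustration under stated assumptions, not a procedure taken from this volume: the tokenizer `tokenize` is a hypothetical helper, the sample line is a toy string rather than a poem from the corpus, and word length is measured in letters merely for convenience. The static view discards word order and yields frequencies and an entropy; the sequential view keeps the order and yields a word-length series together with counts of adjacent length pairs, the raw material for a first-order Markov chain.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer (an assumption): lowercase runs of letters.
    # A real study would need language-specific segmentation rules.
    return re.findall(r"[^\W\d_]+", text.lower())


def static_view(tokens):
    """Static approach: ignore order, count word forms, compute entropy in bits."""
    freqs = Counter(tokens)
    n = len(tokens)
    entropy = -sum((f / n) * math.log2(f / n) for f in freqs.values())
    return freqs, entropy


def sequential_view(tokens):
    """Sequential approach: keep order, here as a series of word lengths."""
    lengths = [len(t) for t in tokens]  # length in letters (an assumption)
    # Counts of adjacent length pairs, as needed for a first-order Markov chain.
    transitions = Counter(zip(lengths, lengths[1:]))
    return lengths, transitions


# Toy input, not taken from the corpus.
tokens = tokenize("A rose is a rose is a rose")
freqs, entropy = static_view(tokens)
lengths, transitions = sequential_view(tokens)
```

Both views start from the same token list; they differ only in whether the linear order of the entities is retained.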
Ideally, a quantitative text analysis engages three different specialists. This is because at the beginning of the research, it is always the task of a linguist/text scientist to set up a hypothesis with linguistic relevance. No hypothesis – no quantitative text research! The linguist states what kind of data would be relevant for testing the hypothesis and the programmer tries to elicit them from texts. As opposed to facts and phenomena, data are not just given; they are the result of a scientific activity, that is, they are constructed. To a text scientist, text is the matter from which data are conceptually constructed. In the meantime, the mathematician translates the verbal hypothesis into the language of mathematics, i.e. formulates it as a statistical hypothesis. At the same time s/he tries, together with the linguist, to find the mechanism that can lead to the rise of the given phenomenon. In other words, the mathematician tries to set up a model of the phenomenon and to subsume it under an existing theory, to embed it in a system of similar hypotheses. The programmer tests the hypothesis on her/his data and the mathematician interprets them statistically. The results of the test are translated into the daily language of linguistics, and the linguist interprets the result linguistically. Hence, the succession of persons in text analysis is: linguist → mathematician → programmer → mathematician → linguist. The linguist is placed at the beginning and the end of this procedure and warrants the linguistic relevance of the problem at the beginning and the relevance of the results at the end. Needless to say, mathematicians and programmers frequently propose excellent ideas; a sound cooperation yields the most reasonable results.
Texts are sources also for other disciplines such as psycholinguistics, sociolinguistics, dialectology, language teaching, etc. in which the respective experts determine the course of research.
Another obvious question is: What can be considered as poetry? The first answer is: Poetry is a kind of literary art where evocative and aesthetic effects are based on form, in addition to (sometimes: instead of) meaning. This volume aims at investigating the universal laws and interrelations of aspects connected with consciously formed texts under consciously imposed form restrictions.
There are many commonalities in these texts but none of the properties can be assumed to be a necessary condition. Rhyme, rhythm, meter, the existence of verse lines, strophes, a fixed number of lines (as in sonnets), meaning, etc., can be found in many but not in all poems. We must rely on the judgement of literary historians, making allowance for the existence of outliers which may destroy even our theories. In many cases they can be rendered harmless by introducing boundary or subsidiary conditions.
A large part of quantitative characterisations is performed by means of indicators. Many of them tell the same story but their interpretation may be different. But if they tell the same story, then there is a clear link between them, even when their method of computation is different.
The indicators should have at least the following properties (cf. Galtung 1967; Grotjahn, Altmann 1988; Wimmer et al. 2003: 25ff): (1) Meaning. This seems to be quite natural, but many indicators arise in the form of a proportion which does not have a clear interpretation. The indicator must tell us what it describes. (2) Simplicity, especially at the beginning of research, because it eases computation and mathematical treatment. It is advantageous to express different properties with different indicators. (3) Variation interval. If an indicator varies in the interval <0, ∞>, a given value of this indicator cannot be interpreted. Every number can be considered large (with respect to the lower limit 0) or small (with respect to the upper limit ∞). It is therefore reasonable to restrict the value to a finite interval by means of normalization. (4) Sampling distribution. This property of an indicator is indispensable for a reliable evaluation of the measured values. It gives information about the frequency or probability of the individual values of the indicator, information which is fundamental for any statistical assessment. Unfortunately, this requirement is still ignored in the humanities in many cases. In order to apply an indicator, e.g. for comparisons, one should know at least its variance, which is needed for asymptotic tests. Exact probabilities can be computed only when the distribution of the indicator is known. The application of non-parametric statistics, a well-established technique, is an alternative. (5) Reliability is the measure of exactness and stability. The indicator should be stable and express the same property in all cases. (6) Validity means that the indicator truly expresses the studied property. An illustrative example in this respect is the large number of available measures of vocabulary richness, whose validity is an open question.
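Requirements (3) and (4) can be illustrated with a small sketch. We use Shannon entropy as the example indicator purely for convenience; it is not singled out by the text above. Entropy ranges from 0 to log2 V for V distinct units, so dividing by log2 V confines it to the interval <0, 1>. For the variance we assume the standard large-sample approximation Var(H) ≈ (Σ p_i (log2 p_i)² − H²) / N, which permits an asymptotic u-test for the difference between two entropies. All function names are hypothetical, and the frequency lists in the usage note are invented toy counts.

```python
import math


def entropy(freqs):
    """Shannon entropy in bits, computed from a list of absolute frequencies."""
    n = sum(freqs)
    return -sum((f / n) * math.log2(f / n) for f in freqs if f > 0)


def normalized_entropy(freqs):
    """Entropy divided by its maximum log2(V), so the value lies in [0, 1]."""
    v = sum(1 for f in freqs if f > 0)
    if v < 2:
        return 0.0  # a single unit carries no uncertainty
    return entropy(freqs) / math.log2(v)


def entropy_variance(freqs):
    """Large-sample approximation of Var(H); assumed here, see the lead-in."""
    n = sum(freqs)
    h = entropy(freqs)
    second_moment = sum((f / n) * math.log2(f / n) ** 2 for f in freqs if f > 0)
    return (second_moment - h ** 2) / n


def u_statistic(freqs_a, freqs_b):
    """Asymptotic u-test statistic for the difference of two entropies."""
    se = math.sqrt(entropy_variance(freqs_a) + entropy_variance(freqs_b))
    if se == 0:
        raise ValueError("both variances are zero; the test is undefined")
    return (entropy(freqs_a) - entropy(freqs_b)) / se
```

For example, `normalized_entropy([8, 4, 2, 1, 1])` yields a value between 0 and 1 that can be compared across texts with different inventory sizes, and `u_statistic(a, b)` can be referred to the standard normal distribution, provided the samples are large enough for the approximation to hold.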
But all this cannot be achieved in an elementary, preliminary investigation. Research always begins with a first step and improves its argumentation step by step: it sets up more complex hypotheses, extends the investigations to other languages and, based on the surface phenomena expressed by indicators, takes further steps towards a theory. A theory is a system of interrelated hypotheses, some of which can be considered laws, i.e. general statements derived from axioms or other laws, or in other words, anchored in antecedent knowledge, and empirically well corroborated (cf. Bunge 1967).
In mainstream linguistics, the term theory is misused. It stands, as a rule, for concepts, isolated phenomena, descriptive approaches, sets of facts, classifications and sets of rules. All that, and even strict definitions – which are not more than conventions – and a preceding formalization do not have the status of a theory. The mentioned definitions and formalisations are merely necessary but not sufficient conditions for the construction of a theory. A theory begins to arise when we derive hypotheses from antecedent knowledge, test them empirically and join them with a system of universal, corroborated statements. This is, of course, not a simple task because language is not a deterministic system with clear-cut units and relations. Though it is always in a steady state, it varies with every speaker, it changes incessantly, and communication is possible only because of its self-regulation. A speaker can – and does – change elements but if s/he aims at communicative success, the change must not surpass a certain limit. With every change, the limit is shifted by a tiny quantity. Since this shift is always advantageous from the point of view of the speaker – s/he is the actor in this play – the phenomena in language are never distributed according to the normal (Gaussian) distribution. Every distribution in language is skewed. Nevertheless, values of whatever property taken from many texts may display normality in a statistical sense (a situation that can be tested) and a comparison with other text groups is possible by means of an asymptotic test based on normality.
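The last point, comparing indicator values from two groups of texts by means of an asymptotic test based on normality, can be sketched as follows. This is an illustration only: the two value lists are invented toy numbers and not measurements from the corpus, the function names are hypothetical, and the sample skewness is used merely as a crude first check of symmetry, not as a formal test of normality. Real applications would need many more texts per group than the eight shown here for the approximation to be trustworthy.

```python
import math
from statistics import NormalDist, mean, variance


def sample_skewness(values):
    """Moment-based skewness; values near 0 are compatible with symmetry."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def asymptotic_mean_test(group_a, group_b):
    """u-statistic and two-sided p-value for the difference of two group means."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    u = (mean(group_a) - mean(group_b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(u)))
    return u, p


# Invented indicator values for two hypothetical text groups (toy data).
group_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80]
group_b = [0.74, 0.77, 0.73, 0.76, 0.75, 0.72, 0.78, 0.75]

skew_a = sample_skewness(group_a)
u, p = asymptotic_mean_test(group_a, group_b)
```

A large absolute value of `u` (equivalently, a small `p`) indicates that the two groups differ in the mean of the indicator; whether this difference is linguistically relevant remains a matter for the linguist.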
The greatest advancements in every empirical science are achieved by introducing mathematical methods. Mathematics is a warrant of exactness, testability, deducibility, and systematisation and it gives us the chance to predict phenomena which are not visible on the surface of texts. In spite of this, there are still objecti...