Chapter 1
Characteristics of data
The methods and techniques used in the analysis of statistical data are in large measure controlled by the very character of the statistical data themselves. It is therefore necessary to begin with a very brief consideration of some of these characteristics so that the varied themes that will be introduced later will be more readily understood.
When any collection of data, representing some quantitative value of any given phenomenon, is to be processed it will be found that although such data all represent the same phenomenon they are not all of exactly the same value. Thus if a study were being made of the distance inland from the coast that vessels of a given draught could sail it would be found that these distances vary markedly between one river and another, or between one part of the world and another. Again, if the number of vessels sailing along these rivers were examined a very wide range in values would be found between the different rivers. This highly variable nature of the numerical data is common, to a greater or lesser extent, to all sets of data, and this quantity which varies (mileage, or numbers of vessels, in the two cases given above) is known as the variate, or sometimes as the variable.
Three broad sets of distinctions concerning such variates need to be borne in mind. Firstly, there is a range of possible types of units in terms of which data are expressed: nominal or classificatory; ordinal or ranking; interval; ratio.
(a) Nominal. This is a group of data which all too often in the past was assumed by geographers to preclude quantitative description or testing. However, it is a frequently occurring category of data in geography: the distinction of settlements into Celtic, Anglo-Saxon and Scandinavian origins; the classifying of soils into podzols, brown earths and rendzinas; the distinction of forest, grassland and heath vegetation complexes; the recognition of various tribal, racial or cultural groups; the functional divisions of towns or the land-use division of rural areas. None of these carries implications of quantity, nor even of relative order of magnitude; they simply refer to categories that are different from one another. Nevertheless, under sampling the various categories may occur with differing degrees of frequency, and these provide data in a form that can be analysed statistically.
(b) Ordinal. This is also a very common group of data in geography, in that the relative importance (or order of magnitude) of data may be known, even though their absolute values are not. In other words, the data can be ranked or put in order, either individually or in classes. Sometimes this reflects constraints that exist upon data collection, such that only rankings are known; in other cases, the use of data in ordinal form is a deliberate choice, even though other data forms could have been used.
(c) Interval. When not only is the order of magnitude known, but also the actual degree of magnitude as well, then an interval scale exists. This is characteristic of rainfall data, production values, population returns, and many other types of data of geographical relevance. In all these and similar cases, either exact measurements are made in some standard unit, or the occurrences of the phenomenon are counted.
(d) Ratio. In this fourth category, interval data have been converted into another form. For example, the number of persons in a given socio-economic group may be expressed as a proportion of the total population, or the number of persons voting for a particular party expressed as a percentage of the total electorate. Again, measured values may have been converted into an index, such as a pH value or an index of production. Such ratio values are often, but not invariably, characterized by finite upper and lower limits.
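As an illustration of how nominal categories yield frequencies under sampling, and how counted data may then be converted into proportions or percentages, the following minimal sketch may help. The soil sample is invented purely for illustration and the variable names are hypothetical; nothing here is taken from the text's own data.

```python
from collections import Counter

# Hypothetical sample of soil types at six sites (nominal data, invented values).
soil_sample = ["podzol", "brown earth", "podzol", "rendzina", "brown earth", "podzol"]

# Nominal categories carry no magnitude, but their frequencies can be counted.
frequencies = Counter(soil_sample)

# Converting the counts into proportions and percentages of the whole sample.
total = sum(frequencies.values())
proportions = {soil: count / total for soil, count in frequencies.items()}
percentages = {soil: 100 * share for soil, share in proportions.items()}

for soil, count in frequencies.items():
    print(f"{soil}: {count} sites, {percentages[soil]:.1f} per cent")
```

The proportions necessarily lie between 0 and 1, and the percentages between 0 and 100, which is one instance of the finite upper and lower limits mentioned under (d).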
Secondly, a distinction must be made between continuous and discrete variates. For example, in the case of the navigable mileage of rivers outlined above, it is possible for any mileage value to be recorded and for fractions of a mile to be included. In other words, it is a continuous variate such that there are no clear-cut or sharp breaks between the values that are possible. Such continuous variates occur with measured interval data, or with ratio data. On the other hand, the number of vessels actually sailing these rivers can only be in terms of whole numbers or integers, for fractions of vessels cannot be recorded. Such a variate is known as discrete, and special care must be taken when interpreting the results of the analysis of such discrete variates. Interval data based on the counting of occurrences fall into this category.
A third distinction that must be made is between data for individual items and data that are grouped into classes or cells. The listing of each item separately is possible for all types of data units except the nominal category, which by definition implies the number of occurrences in a given class. The grouping of data can be effected for all types of data, whether this be because of the form in which data are made available, because of doubts concerning the precise accuracy of interval or ratio measurements, or for convenience in calculations or testing procedures. For example, economic or social data may be obtained from official bodies such as employment exchanges or government departments, which are often precluded by law from making individual values available. Thus the numbers of people employed by individual firms may vary from one to some high value, but data may be available only in a series of classes (1–50, 51–100, etc.). Again, the profitability or costs of certain operations may be defined by firms or farmers as high, medium and low, because they are unwilling to make actual values available. At other times, difficulties of measurement or recording may make the ordinal form of data more convenient than the interval form, as when classifying river-bed load as coarse, medium and fine, or slopes as steep, moderate and gentle, or soils as acid, neutral and alkaline. In all these cases, however, there exists some implicit underlying continuum in terms of magnitude, the discrete categories being merely a convenient division.
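The grouping of individual values into classes such as 1–50 and 51–100 can be sketched as follows. This is only an illustrative sketch: the employment figures are invented, the function name class_label is hypothetical, and a uniform class width of 50 is assumed for simplicity, whereas official sources may well use unequal classes.

```python
from collections import Counter

def class_label(value, width=50):
    """Return the label of the class (e.g. '1-50') containing a positive whole number."""
    lower = ((value - 1) // width) * width + 1  # lower limit of the class
    upper = lower + width - 1                   # upper limit of the class
    return f"{lower}-{upper}"

# Hypothetical numbers employed by eight individual firms (invented values).
employees = [3, 48, 50, 51, 97, 100, 101, 260]

# Number of firms falling into each class; the individual values are no longer visible.
grouped = Counter(class_label(n) for n in employees)

for label, count in grouped.items():
    print(f"{label}: {count} firms")
```

Once grouped in this way, a firm employing 3 people and one employing 50 are indistinguishable, which is exactly the loss of detail that accompanies data released only in class form.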
The variable nature of geographical data can best be understood and appreciated if the data are plotted graphically to show the frequency of occurrence of values of different given amounts. The data are first grouped into 'classes', so that it is known how many occurrences fall into each of a series of quantitatively different sets of conditions. Then the number of occurrences is plotted against the appropriate 'class', and a diagram drawn in the form of 'building blocks'. Such a diagram is known as a histogram and the pattern which it presents is called the frequency distribution for that set of data. From such a diagram a smoothed curve can be interpolated, this being known as the 'frequency curve' of that set of data. Thus in Fig. 1 can be seen the frequency distribution for population densities of the European nation-states. The values for individual states are grouped into various classes depending on their order of magnitude (e.g. 0–49.9 persons per sq. km.; 50–99.9 persons per sq. km.), and the variable character of these population densities is readily apparent. The way in which these population densities vary is shown both by the 'blocks' and by the smoothed curve. A similar frequency distribution curve can be constructed for any and all sets of data. Figure 2, for example, shows the distribution of hill summit heights in North Wales based on summit ring-contours taken from the provisional edition of the O.S. 1:25,000 maps. As with the population densities, these summit heights are a continuous variate. Moreover, both Fig. 1 and Fig. 2 also display another feature of many distribution curves. It can be seen clearly that these curves are not symmetrical, having their peak markedly to one side. Such a distribution is known as skew, and the problems which this introduces, together with various methods by which these problems may be largely solved, w...