Word length and frequency
1. Problem and history
There are two problems that must be strictly separated:
(a) The use of words of specific length in texts. Here word forms are meant, and neither the size of the lexicon nor the number of phonemes in the phoneme inventory is relevant. Other properties such as polysemy, polytexty, or synthetism can be added when modelling textual word length.
(b) Frequency as a factor of lemma construction in the dictionary, where frequency is taken from a frequency dictionary, and the dictionary size, as well as the size of the phoneme inventory, are relevant.
In (b), problems arise because the size of the dictionary can only be roughly estimated. Some authors even consider it infinite (cf. Piotrowski, Bektaev, Piotrovskaja 1985; Kornai 2002), which is a reasonable assumption. If, however, it is considered a fixed finite value, it can be taken into account. Further, frequency dictionaries depend strongly on the kind of texts analyzed. To obtain a reliable frequency dictionary even for a short time period, an astronomical number of words must be counted.
Problem (a) is easily solvable, but it must be considered for individual texts or groups of texts by the same author. Here we consider only the dependence of word-form length on frequency in texts. If this relationship is rejected for some texts, further variables can be added, e.g. the number of meanings.
Another problem is frequency of occurrence and duration (→). G.K. Zipf (1935:25) set up two hypotheses on the relation of frequency and length:
(I) “The magnitude of words tends, on the whole, to stand in an inverse (not necessarily proportionate) relationship to the number of occurrences.” His second hypothesis, on the number of different words occurring x times, is simply that of the distribution of word frequency (→).
Zipf himself demonstrated the inverse of (I) using Kaeding's frequency dictionary of German (Kaeding 1897), i.e. he simply demonstrated the distribution of lengths, which turned out to be monotonically decreasing (→ word length). Baker (1951) used the letters of a woman with mental illness and divided the words into frequency groups. Some authors use frequency directly; others use ranks as the independent variable. Word length has been counted in terms of the number of phonemes (e.g. Miller, Newman, Friedman 1958) or syllables. Different empirical formulas have been proposed (Belonogov 1962, Guiraud 1954, Kalinin 1964, Guiter 1977). Köhler (1986) discovered an oscillation of lengths, and Köhler, Zörnig, Brinkmöller (1990) smoothed the data by taking moving averages. Grzybek, Altmann (2002) and Strauss, Grzybek, Altmann (2005) have shown the dependence of length on frequency in ten languages using individual texts, whereas other researchers tend to use corpora representing mixed samples. Different aspects of this relationship have been shown in Strauss, Grzybek, Altmann (2005). Krott (1996, 2002) reported the same relationship between frequency and morpheme length.
2. Hypothesis
The mean syllabic length (y) of word forms in texts decreases with their frequency of occurrence (x).
Here each form is considered separately, and the mean length of all forms with the same frequency is taken as y. Since frequency can be expressed in relative form, x can be treated as continuous. Further hypotheses can be derived from the one above, e.g. shorter forms of a word are more frequent than its longer forms (e.g. case, mood, aspect, tense). Or, derivatives and compounds of a lexeme occur less often than the lexeme itself.
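The computation of y described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the syllable counter is a crude stand-in (it counts runs of Latin vowel letters) that would have to be replaced by a language-specific procedure, including the treatment of zero-syllabic clitics.

```python
import re
from collections import Counter, defaultdict


def count_syllables(form):
    """Crude stand-in (an assumption): count maximal runs of Latin vowel letters."""
    return len(re.findall(r"[aeiouy]+", form.lower()))


def mean_length_by_frequency(tokens, syllables=count_syllables):
    """Map each frequency x to the mean syllabic length y of the word forms
    that occur exactly x times in the text."""
    freq = Counter(tokens)                 # frequency of every word form
    lengths = defaultdict(list)
    for form, x in freq.items():
        lengths[x].append(syllables(form))  # each form counted once per class
    return {x: sum(v) / len(v) for x, v in sorted(lengths.items())}
```

Each word form enters its frequency class exactly once, so y is the mean over forms (types), not over running words (tokens), which is how the hypothesis is worded.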
3. Derivation
Köhler (1986) and Strauss, Grzybek, Altmann (2005) start from the assumption that the relative rate of change of mean word length decreases proportionally to the relative rate of change of frequency, as is usual in synergetic linguistics. If zero-syllabic clitics are considered parts of the following words, then the mean length cannot take values less than 1. Thus the differential equation is
$$\frac{dy}{y-1} = -b\,\frac{dx}{x} \qquad (1)$$
from which the well-known formula
$$y = a x^{-b} + 1 \qquad (2)$$
follows. Here $a = e^{C}$ (C being the integration constant). If one considers frequency as depending on length, inversion yields
$$x = \left(\frac{y-1}{a}\right)^{-1/b}. \qquad (3)$$
(2) is a special case of the unified derivation (see Introduction, 4.1). Other empirical formulas can be found in the references.
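To make the derivation concrete, the following sketch estimates the two parameters of (2), assuming it has the form y = a·x^(−b) + 1 implied by the assumptions above (the symbols a and b and all function names are chosen here for illustration). It uses an ordinary least-squares line on the log-transformed relation ln(y − 1) = ln a − b·ln x. This is a simple linearization, not necessarily the fitting procedure used in the cited studies, and it has to skip classes with y ≤ 1, where the logarithm is undefined.

```python
import math


def fit_power_law_plus_one(xs, ys):
    """Estimate a and b in y = a * x**(-b) + 1 by a least-squares line
    through the points (ln x, ln(y - 1)). Points with y <= 1 are skipped."""
    pts = [(math.log(x), math.log(y - 1.0))
           for x, y in zip(xs, ys) if x > 0 and y > 1.0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points with x > 0 and y > 1")
    mean_u = sum(u for u, _ in pts) / n
    mean_v = sum(v for _, v in pts) / n
    sxx = sum((u - mean_u) ** 2 for u, _ in pts)
    if sxx == 0:
        raise ValueError("all usable points share the same frequency")
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in pts)
    slope = sxy / sxx
    a = math.exp(mean_v - slope * mean_u)   # intercept is ln a
    b = -slope                              # slope is -b
    return a, b


def predicted_length(x, a, b):
    """Mean length as a function of frequency: y = a * x**(-b) + 1."""
    return a * x ** (-b) + 1.0


def predicted_frequency(y, a, b):
    """Inverse relation, frequency as a function of length (requires y > 1)."""
    return ((y - 1.0) / a) ** (-1.0 / b)
```

Because the regression minimizes errors on the logarithmic scale, its estimates can differ from those of a direct nonlinear fit; they may still serve as starting values for one.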
Example. Russian text
Strauss, Grzybek, Altmann (2005) present the result for the first chapter of Tolstoj's Anna Karenina. For each frequency class the mean word-form length has been computed. Zero-syllabic prepositions (k, s, v) were considered proclitics. Frequency classes containing fewer than 10 records were pooled and the unweighted average was computed. The result is presented in Table 1 and Fig. 1. In the first column, the frequency classes are shown, in the second the observed mean word-form lengths, and in the third the theoretical mean lengths according to (2).
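The pooling step can be read in more than one way, since the text does not say exactly how sparse classes were grouped. The sketch below shows one possible reading, stated here as an assumption: consecutive classes with fewer than 10 word forms are merged until the group holds at least 10 forms (or a sufficiently large class follows), and the pooled class receives the unweighted mean of the member frequencies and of the member mean lengths. The function name and the tuple layout are hypothetical.

```python
def pool_sparse_classes(classes, min_records=10):
    """Pool consecutive sparse frequency classes (one possible reading).

    classes: list of (frequency, n_forms, mean_length), sorted by frequency.
    Returns a list of (frequency, n_forms, mean_length) after pooling.
    """
    pooled, group = [], []

    def flush():
        # Close the current run of sparse classes, if there is one.
        if group:
            k = len(group)
            pooled.append((
                sum(f for f, _, _ in group) / k,   # unweighted mean frequency
                sum(n for _, n, _ in group),       # total number of forms
                sum(m for _, _, m in group) / k,   # unweighted mean length
            ))
            group.clear()

    for f, n, m in classes:
        if n >= min_records:
            flush()                                # keep dense classes separate
            pooled.append((float(f), n, float(m)))
        else:
            group.append((f, n, m))
            if sum(c[1] for c in group) >= min_records:
                flush()
    flush()                                        # leftover sparse classes
    return pooled
```

A leftover run at the high-frequency end may still contain fewer than 10 forms; whether to keep it, merge it backwards, or drop it is a further decision that this sketch leaves open.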
4. Author: U. Strauss, G. Altmann
