Word classes

Revision as of 13:30, 7 June 2006 by Ahans

1. Problem and history

Does the frequency of different word classes abide by a special distribution law? Evidently, word classes are nominal entities, thus they must be ranked (see Ranking →). Historically, word classes represent the diversification of an amorphous word stock which began to be partitioned by the development of grammar, thus this is a problem of diversification (→).

The first investigations were performed by Hammerl (1990), who obtained the Zipf-Alekseev distribution, as did Schweers and Zhu (1991). Köhler (1991) studied the problem from the diversification point of view. A number of individual studies on texts in German (Best 1994, 1997, 2000, 2001a, b; Judt 1995), Russian (Bosselmann 2001), French (Judt 1995), Latin (Schweers, Zhu 1991), Chinese (Zhu, Best 1992; Schweers, Zhu 1991), Czech (Uhlířová 2000) and Portuguese (Ziegler 1998, 2001) have been performed, with differing results. The word classes were always taken in their classical version; nobody tested other possible classifications. Statistical tests for word classes were set up by Wimmer and Altmann (2001). The hypothesis concerning word classes is merely a special case of a more general hypothesis encompassing any kind of classes of linguistic entities.

2. Hypothesis

If language entities are ordered in classes, then their ranked frequencies follow a regular probability distribution or a regular ranking series.

Seen from the opposite point of view, if the ranked frequencies are properly distributed we have a preliminary approximate corroboration of the “correctness” of the classification, i.e. we approximate some linguistic-psychological truth.

3. Derivation

3.1. The Zipf-Alekseev distribution

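The derivation is shown in Word associations (→), Chapter 3.1. Different derivations are shown in Hammerl (1990) and Hřebíček (2000: 14f). The formula, used in right-truncated form and with modified class x = 1, is

$$P_x = \begin{cases} \alpha, & x = 1 \\ \dfrac{(1-\alpha)\, x^{-(a + b \ln x)}}{T}, & x = 2, 3, \ldots, n \end{cases}$$

where $T = \sum_{j=2}^{n} j^{-(a + b \ln j)}$.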

Example: Word classes in Portuguese (Ziegler 2001).



Table 1 Fitting of modified right truncated Zipf-Alekseev distribution to Portuguese data

x    fx    NPx
1    96    96.00
2    67    65.50
3    47    51.56
4    45    41.83
5    34    34.77
6    30    29.46
7    28    25.35
8    20    22.09
9    19    19.45

a = 0.2359, b = 0.1977, n = 9, α = 0.2487
X² = 1.19, DF = 4, P = 0.8798

Figure 1. Fitting of modified right truncated Zipf-Alekseev distribution to Portuguese data
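The fit in Table 1 can be checked numerically. The following is a minimal Python sketch, assuming the modified right-truncated Zipf-Alekseev distribution has the form P_1 = α and P_x = (1 − α) x^(−(a + b ln x)) / T for x = 2, ..., n, where T is the sum of x^(−(a + b ln x)) over x = 2, ..., n. The function names are illustrative and not taken from the cited papers; the observed frequencies and parameter values are those of Table 1.

```python
import math

def zipf_alekseev_expected(total, a, b, n, alpha):
    """Expected frequencies N*P_x for x = 1..n.

    Assumed form: P_1 = alpha; for x >= 2 the remaining mass (1 - alpha)
    is spread proportionally to x ** -(a + b * ln x).
    """
    weights = [x ** -(a + b * math.log(x)) for x in range(2, n + 1)]
    norm = sum(weights)
    probs = [alpha] + [(1 - alpha) * w / norm for w in weights]
    return [total * p for p in probs]

def chi_square(observed, expected):
    """Pearson chi-square statistic, summed over all classes."""
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# Observed frequencies and fitted parameters as given in Table 1.
observed = [96, 67, 47, 45, 34, 30, 28, 20, 19]
expected = zipf_alekseev_expected(sum(observed), a=0.2359, b=0.1977, n=9, alpha=0.2487)

for x, (f, e) in enumerate(zip(observed, expected), start=1):
    print(f"{x}  {f:3d}  {e:6.2f}")
print(f"X2 = {chi_square(observed, expected):.2f}")
```

Under the stated assumption, the printed values should agree with the NPx column and the reported X² up to rounding of the parameters. The sketch only evaluates the distribution at given parameters; it does not estimate a, b or α from data.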


3.2. The negative hypergeometric distribution

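The frequency of ranked word classes abides by the usual proportionality relation between neighboring classes, $P_x = g(x)\, P_{x-1}$. Using the proportionality function

$$g(x) = \frac{(M + x - 1)(n - x + 1)}{x\,(K - M + n - x)}$$

and solving with displacement

(1) $P_x = g(x-1)\, P_{x-1}, \quad x = 2, 3, \ldots, n+1$,

one obtains the 1-displaced negative hypergeometric distribution

(2) $P_x = \dfrac{\binom{M+x-2}{x-1}\binom{K-M+n-x}{n-x+1}}{\binom{K+n-1}{n}}, \quad x = 1, 2, \ldots, n+1.$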

Example: Word classes in Latin (Schweers, Zhu 1991)

Schweers and Zhu examined word classes in Latin, German and Chinese and found satisfactory fits for Latin and German. For Latin, they took Caesar's “Bellum Gallicum”, Book 1, Chapters 1-8, § 2, and obtained the results in Table 2. The majority of researchers have used this distribution.

Table 2 Fitting the modified negative hypergeometric distribution to Latin data (Schweers, Zhu 1991)

x    fx     NPx
1    347    342.46
2    173    184.58
3    142    134.17
4     98    104.21
5     93     81.65
6     57     61.67
7     39     40.25

K = 2.001, M = 0.5771, n = 6
X² = 3.58, DF = 3, P = 0.31

The fit is satisfactory. The right truncated modified Zipf-Alekseev distribution yields worse results.

Fig. 1. Fitting the negative hypergeometric distribution to Latin data
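The Latin fit can be recomputed in the same way. The sketch below assumes the 1-displaced negative hypergeometric distribution in its usual form, P_x = C(M+x−2, x−1) · C(K−M+n−x, n−x+1) / C(K+n−1, n) for x = 1, ..., n+1, with generalized binomial coefficients evaluated through the log-gamma function. The helper names are illustrative; the frequencies and the parameters K, M and n are those reported for the Latin data.

```python
import math

def log_binom(top, k):
    """Log of the generalized binomial coefficient C(top, k), real top, integer k >= 0."""
    return math.lgamma(top + 1) - math.lgamma(k + 1) - math.lgamma(top - k + 1)

def neg_hypergeom_1displaced(x, K, M, n):
    """P_x of the 1-displaced negative hypergeometric distribution, x = 1..n+1.

    Assumes K > M > 0 so that every log-gamma argument is positive.
    """
    j = x - 1  # shift back to the non-displaced support 0..n
    return math.exp(
        log_binom(M + j - 1, j)
        + log_binom(K - M + n - j - 1, n - j)
        - log_binom(K + n - 1, n)
    )

# Observed frequencies and fitted parameters as reported for the Latin text.
observed = [347, 173, 142, 98, 93, 57, 39]
K, M, n = 2.001, 0.5771, 6
total = sum(observed)

probs = [neg_hypergeom_1displaced(x, K, M, n) for x in range(1, n + 2)]
expected = [total * p for p in probs]
chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

for x, (f, e) in enumerate(zip(observed, expected), start=1):
    print(f"{x}  {f:3d}  {e:7.2f}")
print(f"X2 = {chi2:.2f}")
```

Because the published parameters are rounded, the expected frequencies may differ from the tabulated NPx values in the second decimal place.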

3.3. Altmann's series

Altmann (1993) did not strive for a distribution but simply for a series capturing the decreasing frequencies (absolute or relative) of ranked classes. Consider the Zipf-Mandelbrot law as a (not normalized) continuous function. Its differential equation is

(2) ,

i.e. a special case of the unified theory (→). Since ranking proceeds in unit steps, dx = 1 and dy = y_{x+1} − y_x, we obtain

(3) .

Reordering, setting a-c = b, and solving (3), results in

(4) $y_x = \dfrac{\binom{b+x}{x-1}}{\binom{a+x}{x-1}}\, y_1, \quad x = 1, 2, 3, \ldots$

The parameters fulfill one of the conditions (i) a > b ≥ 0, (ii) a > 0, -1 < b < 0, (iii) b < a < 0 when a and b are not integers.

Example: Word class distribution in a German text (Best 1997)

Altmann used (4) only for phoneme ranking, but Best (1994, 1997) used it also for word classes with very satisfactory results. The fitting of (4) to the relative frequencies of word classes in a German text (Bichsel, P., Der Mann, der nichts mehr wissen wollte) yielded results presented in Table 2 and Fig. 2.

Table 2 Fitting series (4) to word classes in a German text (Best 1997)

x    yx (observed)    yx (fitted)
1    24.32            24.32
2    20.36            18.62
3    15.00            14.39
4    12.67            11.21
5    11.66             8.81
6     8.08             6.97
7     4.35             5.56
8     3.50             4.46
9     0.08             3.60

a = 30.1447, b = 22.6085, D = 0.94

Here x are the ranked word classes, yx are their relative frequencies. The fit is very satisfactory.

Fig. 2. Fitting Altmann's series to German data
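The fitted values for the German text can be regenerated from the two parameters. The sketch below assumes that the series takes the form y_x = C(b+x, x−1) / C(a+x, x−1) · y_1, which is equivalent to the recurrence y_x = y_(x−1) · (b + x) / (a + x). It further assumes that D denotes the coefficient of determination, 1 − SS_res / SS_tot; the source does not define D, so this reading is an assumption. Function names are illustrative.

```python
def altmann_series(y1, a, b, n_classes):
    """Values y_1..y_n built from the assumed recurrence
    y_x = y_(x-1) * (b + x) / (a + x), starting at y_1."""
    values = [y1]
    for x in range(2, n_classes + 1):
        values.append(values[-1] * (b + x) / (a + x))
    return values

def determination_coefficient(observed, fitted):
    """1 - SS_res / SS_tot (assumed meaning of D)."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Relative frequencies (in percent) and parameters as reported for the German text.
observed = [24.32, 20.36, 15.00, 12.67, 11.66, 8.08, 4.35, 3.50, 0.08]
fitted = altmann_series(observed[0], a=30.1447, b=22.6085, n_classes=len(observed))
D = determination_coefficient(observed, fitted)

for x, (o, f) in enumerate(zip(observed, fitted), start=1):
    print(f"{x}  {o:6.2f}  {f:6.2f}")
print(f"D = {D:.2f}")
```

The recurrence avoids evaluating binomial coefficients with non-integer arguments directly, and the first class is reproduced exactly because the series is anchored at y_1.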

4. Authors: G. Altmann, K.-H. Best

5. References

Altmann, G. (1991). Word class diversification of Arabic verbal roots. In: Rothe, U. (ed.), Diversification Processes in Language: Grammar: 57-59. Hagen: Rottmann.
Altmann, G. (1993). Phoneme counts. Glottometrika 14, 55-70.
Best, K.-H. (1994). Word class frequencies in contemporary German short prose texts. J. of Quantitative Linguistics 1, 144-147.
Best, K.-H. (1997). Zur Wortartenhäufigkeit in Texten deutscher Kurzprosa der Gegenwart. Glottometrika 16, 276-285.
Best, K.-H. (1998). Zur Interaktion der Wortarten in Texten. Papiere zur Linguistik 58, 83-95.
Best, K.-H. (2000). Verteilung der Wortarten in Anzeigen. Göttinger Beiträge zur Sprachwissenschaft 4, 37-51.
Best, K.-H. (2001a). Zur Gesetzmäßigkeit der Wortartenverteilungen in deutschen Pressetexten. Glottometrics 1, 1-26.
Best, K.-H. (2001b). Quantitative Linguistik. Eine Annäherung. Göttingen: Peust & Gutschmidt.
Bosselmann, A. (2001). Wortartenverteilungen in russischen Texten. Msc.
Hammerl, R. (1990). Untersuchungen zur Verteilung der Wortarten im Text. Glottometrika 11, 142-156.
Hřebíček, L. (2000). Variation in sequences. Prague: Oriental Institute.
Hudson, R. (1994). About 37 % of word-tokens are nouns. Language 70, 331-339.
Judt, B. (1995). Wortartenhäufigkeiten im Deutschen und Französischen. Göttingen: Staatsexamensarbeit.
Köhler, R. (1991). Diversification of coding methods in grammar. In: Rothe, U. (ed.), Diversification Processes in Language: Grammar: 47-55. Hagen: Rottmann.
Lauter, J. (1966). Untersuchungen zur Sprache von Kants “Kritik der reinen Vernunft”. Köln: Westdeutscher Verlag.
Mizutani, S. (1989). Ohno's lexical law: its data adjustment by linear regression. In: Mizutani, S. (ed.), Japanese Quantitative Linguistics: 1-13. Bochum: Brockmeyer.
Schweers, A., Zhu, J. (1991). Wortartenklassifikation im Lateinischen, Deutschen und Chinesischen. In: Rothe, U. (ed.), Diversification Processes in Language: Grammar: 157-167. Hagen: Rottmann.
Uhlířová, L. (2000). On language modelling in automatic speech recognition. J. of Quantitative Linguistics 7, 209-216.
Wimmer, G., Altmann, G. (2001). Some statistical investigations concerning word classes. Glottometrics 1, 109-123.
Zhu, J., Best, K.-H. (1992). Zum Wort im modernen Chinesisch. Oriens Extremus 35, 45-60.
Ziegler, A. (1998). Word class frequencies in Brazilian-Portuguese texts. J. of Quantitative Linguistics 5, 269-280.
Ziegler, A. (2001). Word class frequencies in Portuguese press texts. In: Uhlířová, L., Wimmer, G., Altmann, G., Köhler, R. (eds.), Text as a Linguistic Paradigm: Levels, Constituents, Constructs. Festschrift in Honour of Ludek Hřebíček: 295-312. Trier: WVT.
Ziegler, A., Best, K.-H., Altmann, G. (2002). Nominalstil. ETC 2, 72-85.
Ziegler, A., Best, K.-H., Altmann, G. (2001). A contribution to text spectra. Glottometrics 1, 97-108.