Morph length

Revision as of 09:07, 17 June 2006 by Ahans

1. Problem and history

Morph length is a special case of length or word length (→) research. How a text is segmented into morphs depends on the language and the grammar applied (cf. Best 2001 for German). Quantitative Linguistics has approached morph length in different ways. Gerlach (1982) tested the interaction of word length with morph length in a German lexicon; the hypothesis was: the more morphs a word has, the shorter the morphs are. The test results were very good. Without any data, Saporta (1963: 70) proposes the hypothesis: “The mean length of morphs will be inversely related to the number of phonemes in the inventory.” This idea is said to go back to R. Jakobson (Saporta 1963: 72, footnote 15).

Here we are concerned with another hypothesis: morph lengths in texts abide by a law in the same way as word lengths and other entities do. Since only a small amount of data has been collected up to now (Best 2000a, 2001a), the only distribution found, rather inductively but belonging to the family of length distributions (cf. Wimmer, Altmann 1996), was the 1-displaced Hyperpoisson distribution. In a paper concerning morph segmentation, Creutz (2003: 282) proposes the gamma distribution as a model for morph lengths in the lexicon, but there is no proof of it. As early as 1963, Saporta (²1966: 69) presents a short overview of morph lengths (in phonemes) in Spanish; testing the data (1679 morphs) shows the binomial distribution to be an acceptable model (C = 0.0166). He “cannot help wondering whether or not such a distribution is universal and, if not, what other factors correlate with different distributions.” However, in Lakota the morphs have a specific form, and their frequency distribution must be modelled by means of a second-order difference equation.

2. Hypothesis

2.1. Morph length in texts abides by a regular probability distribution derived from the unified theory, namely the 1-displaced Hyperpoisson distribution.

2.2. If morphs have specific forms, the distribution is multimodal and must be modeled by an appropriate approach.

3. Derivation

3.1. Substituting a_0 = -1, a_1 = a, b_1 = b, a_2 = 0 in formula (10) of the unified theory (→) and solving with displacement by one, one obtains the 1-displaced Hyperpoisson distribution

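(1) P_x = \frac{a^{x-1}}{b^{(x-1)} \, {}_1F_1(1; b; a)}, \quad x = 1, 2, 3, \ldots; \quad a, b > 0

where b^{(x-1)} = b(b+1)\cdots(b+x-2) is the ascending factorial function (with b^{(0)} = 1) and {}_1F_1 is the confluent hypergeometric function.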
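A minimal sketch of how formula (1) can be evaluated numerically, assuming that b^{(j)} denotes the ascending factorial b(b+1)...(b+j-1) and that {}_1F_1(1; b; a) is obtained by summing its series \sum_j a^j / b^{(j)} directly. The function name `hyperpoisson_pmf` and the parameter values a = 1.5, b = 2.0 are illustrative assumptions, not estimates taken from Best (2001a).

```python
import math


def hyperpoisson_pmf(a, b, x_max, tol=1e-16, max_terms=100_000):
    """Return {x: P_x} for x = 1..x_max under the 1-displaced Hyperpoisson.

    Series terms are t_j = a**j / b^(j), where b^(j) is the ascending
    factorial, so t_0 = 1 and t_{j+1} = t_j * a / (b + j).
    The normalising constant 1F1(1; b; a) is the sum of all t_j.
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    terms = []      # t_0 .. t_{x_max - 1}, i.e. the numerators of P_1 .. P_{x_max}
    term = 1.0      # current t_j
    total = 0.0     # running approximation of 1F1(1; b; a)
    for j in range(max_terms):
        total += term
        if j < x_max:
            terms.append(term)
        elif term < tol * total:
            break   # remaining terms no longer change the sum noticeably
        term *= a / (b + j)
    return {x: terms[x - 1] / total for x in range(1, x_max + 1)}


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    for x, p in hyperpoisson_pmf(a=1.5, b=2.0, x_max=8).items():
        print(x, round(p, 5))
    # Sanity check: with b = 1 the model is the 1-displaced Poisson,
    # so P_1 should equal exp(-a).
    print(hyperpoisson_pmf(2.0, 1.0, 5)[1], math.exp(-2.0))
```

For b = 1 the ascending factorial b^{(x-1)} becomes (x-1)! and the distribution reduces to the 1-displaced Poisson distribution, which gives a convenient check of the implementation.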

Example: Morph length distribution in German

Best (2001a) used a text from the Eichsfelder Tageblatt (6.3.1997, p. 8), “Sieben Deutsche in Jemen entführt”; counting the length of the morphs, he obtained the results in Table 1.

[Table 1: image Tabelle1 ML.jpg]
[Figure 1: image Grafik1 ML.jpg]
Fig. 1. Fitting the 1-displaced Hyperpoisson distribution to morph length data


3.2. In Lakota, morphemes usually consist of an even number of syllables. In that case the usual approach must be extended, and the distribution must be modeled by a second-order difference equation, i.e.

(2) P_x = g(x) P_{x-1} + h(x) P_{x-2}.

Pustet and Altmann (2005) set g(x) = a(k + x - 1)/x and h(x) = b(2k + x - 2)/x and obtained as solution the Gegenbauer distribution, given as

(3) P_x = \begin{cases} (1-a-b)^k, & x = 0 \\ P_0 \sum_{j=0}^{[x/2]} \frac{b^j k^{(x-j)} a^{x-2j}}{j! \, (x-2j)!}, & x = 1, 2, 3, \ldots \end{cases}

where [.] is the integer part and k^{(x-j)} is the ascending factorial function.
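The relation between the difference equation (2) and the closed form (3) can be checked numerically. The following is a minimal sketch, assuming P_{-1} = 0, P_0 = (1 - a - b)^k and a + b < 1; the function names and the parameter values a = 0.1, b = 0.6, k = 1.5 are hypothetical and are not the estimates of Pustet and Altmann (2005).

```python
import math


def rising(k, n):
    """Ascending factorial k^(n) = k (k+1) ... (k+n-1), with k^(0) = 1."""
    result = 1.0
    for i in range(n):
        result *= k + i
    return result


def gegenbauer_closed(a, b, k, x_max):
    """P_0..P_{x_max} from the closed form (3); intended for moderate x_max."""
    p0 = (1.0 - a - b) ** k
    probs = [p0]
    for x in range(1, x_max + 1):
        s = sum(
            b ** j * rising(k, x - j) * a ** (x - 2 * j)
            / (math.factorial(j) * math.factorial(x - 2 * j))
            for j in range(x // 2 + 1)      # j = 0 .. [x/2]
        )
        probs.append(p0 * s)
    return probs


def gegenbauer_recurrence(a, b, k, x_max):
    """P_0..P_{x_max} from the second-order difference equation (2)."""
    probs = [(1.0 - a - b) ** k]
    for x in range(1, x_max + 1):
        g = a * (k + x - 1) / x             # g(x) as set in the text
        h = b * (2 * k + x - 2) / x         # h(x) as set in the text
        prev2 = probs[x - 2] if x >= 2 else 0.0   # assumption: P_{-1} = 0
        probs.append(g * probs[x - 1] + h * prev2)
    return probs


if __name__ == "__main__":
    # Hypothetical parameters: small a and larger b favour even lengths.
    a, b, k = 0.1, 0.6, 1.5
    rec = gegenbauer_recurrence(a, b, k, 8)
    clo = gegenbauer_closed(a, b, k, 8)
    for x, (r, c) in enumerate(zip(rec, clo)):
        print(x, round(r, 5), round(c, 5))
```

With a small a and a comparatively large b, the even classes receive more probability than their odd neighbours, which is the kind of multimodal shape hypothesis 2.2 refers to. The recurrence is the cheaper and numerically safer route for long tails; the closed form is included only as a cross-check.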

Example: Morpheme length in Lakota

Pustet and Altmann (2005) modeled the morpheme length distribution in Lakota and obtained the results in Table 2 and Fig. 2.

[Table 2: image Tabelle2 ML.jpg]
[Figure 2: image Grafik2 ML.jpg]
Fig. 2. Fitting the Gegenbauer distribution to the Lakota data


4. Authors: G. Altmann, K.-H. Best

5. References

Best, K.-H. (2000a). Morphlängen in Fabeln von Pestalozzi. Göttinger Beiträge zur Sprachwissenschaft 3, 19-30.

Best, K.-H. (2001a). Zur Länge von Morphen in deutschen Texten. In: Best, K.-H. (ed.), Häufigkeitsverteilungen in Texten: 1-14. Göttingen: Peust & Gutschmidt.

Best, K.-H. (2005). Morphlänge. In: Köhler, R., Altmann, G., Piotrowski, R. (eds.), Quantitative Linguistik - Quantitative Linguistics. Ein internationales Handbuch: 255-260. Berlin/N.Y.: de Gruyter.

Creutz, M. (2003). Unsupervised segmentation of words using prior distributions of morph length and frequency. Proceedings of ACL-03, The 41st Annual Meeting of the Association for Computational Linguistics: 280-287. Sapporo, Japan, 7-12 July.

Gerlach, R. (1982). Zur Überprüfung des Menzerathschen Gesetzes im Bereich der Morphologie. Glottometrika 4, 95-102.

Gorot´, E.I. (1990). Izomorfnye i otličitel´nye čerty morfemy i sloga v raspredelenii dliny. In: Kvantitativnaja lingvistika i avtomatičeskij analiz tekstov: 32-36. Tartu.

Krott, A. (1996). Some remarks on the relation between word length and morpheme length. J. of Quantitative Linguistics 3, 29-37.

Nikonov, V.A. (1978). Dlina slova. Voprosy jazykoznanija 6, 104-111.

Pustet, R., Altmann, G. (2005). Morpheme length distribution in Lakota. J. of Quantitative Linguistics 12(1), 53-63.

Saporta, S. (1963, ²1966). Phoneme distribution and language universals. In: Greenberg, J.H. (ed.), Universals of language. Second edition. Report of a conference held at Dobbs Ferry, New York, April 13-15, 1961: 61-72. Cambridge, Mass. & London: The M.I.T. Press.

Wimmer, G., Altmann, G. (1996). The theory of word length: some results and generalizations. Glottometrika 15, 112-133.
