Difference between revisions of "Morph length"

Latest revision as of 11:05, 26 January 2012

1. Problem and history

Morph length is a special case of length or word length ( $\rightarrow$ ) research. The segmentation of a text into morphs is a problem that depends on language and grammar (cf. Best 2001 for German). Quantitative Linguistics has addressed morph length in different ways. Gerlach (1982) tested the interaction of word length with morph length in a German lexicon; the hypothesis was that the more morphs a word has, the shorter the morphs are. The test supported the hypothesis very well. Without any data, Saporta (1963: 70) proposes the hypothesis: “The mean length of morphs will be inversely related to the number of phonemes in the inventory.” This idea is said to go back to R. Jakobson (Saporta 1963: 72, footnote 15).

Here we are concerned with another hypothesis: morph lengths in texts abide by a law in the same way as word lengths and other entities do. Since only a small amount of data has been collected so far (Best 2000a, 2001a), the only distribution found, rather inductively but belonging to the family of length distributions (cf. Wimmer, Altmann 1996), is the 1-displaced Hyperpoisson distribution. In a paper on morph segmentation, Creutz (2003: 282) proposes the gamma distribution as a model for morph lengths in the lexicon, but there is no proof of it. As early as 1963, Saporta (²1966: 69) presented a short overview of morph lengths (in phonemes) in Spanish; testing these data (1679 morphs) shows that the binomial distribution is an acceptable model (C = 0.0166). He “cannot help wondering whether or not such a distribution is universal and, if not, what other factors correlate with different distributions.” However, in Lakota the morphs have a specific form, and their frequency distribution must be modelled by means of a second-order difference equation.

2. Hypothesis

2.1. Morph length in texts abides by a regular probability distribution derived from the unified theory, namely the 1-displaced Hyperpoisson distribution.

2.2. If morphs have specific forms, the distribution is multimodal and must be modeled by an appropriate approach.

3. Derivation

3.1. Substituting $a_0 = -1, a_1 = a, b_1 = b, a_2 = 0$ in formula (10) of the unified theory (→) and solving with a displacement of 1 yields the Hyperpoisson distribution

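(1) $P_x = \frac{a^{x-1}}{b^{(x-1)}\, {}_1F_1(1; b; a)}, \quad x = 1, 2, 3, \ldots; \; a, b > 0$

where $b^{(x-1)}$ is the ascending factorial and ${}_1F_1$ is the confluent hypergeometric function.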
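As a rough numerical check of formula (1), the following Python sketch evaluates the 1-displaced Hyperpoisson probabilities, reading the denominator term as the ascending factorial b(b+1)...(b+x-2). The function name `hyperpoisson_1displaced`, the truncation parameter `terms`, and the parameter values in the usage note are illustrative assumptions and are not taken from Best (2001a); the normalizing constant ${}_1F_1(1; b; a)$ is approximated by a truncated series.

```python
def hyperpoisson_1displaced(a, b, x_max, terms=200):
    """Return [P_1, ..., P_x_max] of the 1-displaced Hyperpoisson distribution.

    Hypothetical helper for illustration; `terms` is the number of series
    terms used to approximate 1F1(1; b; a).
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    if x_max > terms:
        raise ValueError("x_max must not exceed terms")
    # Unnormalized weights w_x = a^(x-1) / b^((x-1)), built from the ratio
    # w_{x+1} / w_x = a / (b + x - 1) to avoid large factorials.
    weights = [1.0]  # w_1 = 1
    for x in range(1, terms):
        weights.append(weights[-1] * a / (b + x - 1))
    norm = sum(weights)  # truncated series for 1F1(1; b; a)
    return [w / norm for w in weights[:x_max]]
```

For instance, `hyperpoisson_1displaced(1.5, 2.0, 5)` returns the first five probabilities for the arbitrarily chosen values a = 1.5 and b = 2.0. With b = 1 the distribution reduces to the 1-displaced Poisson distribution, which gives a simple sanity check.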

Example: Morph length distribution in German

Best (2001a) used a text from the Eichsfelder Tageblatt (6.3.1997, p. 8), „Sieben Deutsche in Jemen entführt“, and, counting the lengths of the morphemes, obtained the results in Table 1.

Fig. 1. Fitting the 1-displaced Hyperpoisson distribution to morph length data
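The goodness of such a fit can be summarized numerically. The sketch below assumes that the value C quoted in Section 1 denotes the discrepancy coefficient C = X²/N commonly used in quantitative linguistics, where X² is the chi-square statistic and N the sample size; this reading is an assumption, as the text does not define C. The function name `discrepancy_coefficient` is hypothetical, and no data from Table 1 are reproduced here.

```python
def discrepancy_coefficient(observed, probs):
    """Return (chi2, C) for observed class frequencies and model probabilities.

    observed: list of observed frequencies per length class
    probs:    list of model probabilities for the same classes
              (assumed to cover the whole support, e.g. with a pooled tail)
    C = chi2 / N is the assumed definition of the discrepancy coefficient.
    """
    if len(observed) != len(probs):
        raise ValueError("observed and probs must have the same length")
    n = sum(observed)
    chi2 = 0.0
    for f, p in zip(observed, probs):
        expected = n * p  # expected frequency under the model
        chi2 += (f - expected) ** 2 / expected
    return chi2, chi2 / n
```

With toy counts chosen purely for illustration, `discrepancy_coefficient([60, 40], [0.5, 0.5])` gives X² = 4 and C = 0.04. Because C divides X² by N, it is less sensitive to large sample sizes than the chi-square test itself.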

3.2. In Lakota, morphemes usually consist of an even number of syllables. In that case the usual approach must be extended, and the distribution must be modeled by a second-order difference equation, i.e.

(2) $P_x = g(x)P_{x-1}+h(x)P_{x-2}$.

Pustet and Altmann (2005) set g(x) = a(k + x – 1)/x and h(x) = b(2k + x – 2)/x and obtained as solution the Gegenbauer distribution given as

(3) $P_x = \begin{cases} (1-a-b)^k, & x = 0 \\ P_0 \sum_{j=0}^{[x/2]} \frac{b^j k^{(x-j)} a^{x-2j}}{j!\,(x-2j)!}, & x = 1, 2, 3, \ldots \end{cases}$

where [.] is the integer part and $k^{(x-j)}$ is the ascending factorial function.
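The link between the recurrence (2) and the closed form (3) can be checked numerically. The following Python sketch computes the Gegenbauer probabilities both ways, using the g(x) and h(x) given above. The function names and the parameter values in the usage note are illustrative assumptions and are not the estimates of Pustet and Altmann (2005); the sketch also assumes a + b < 1 so that the probabilities sum to one.

```python
from math import factorial

def rising(k, n):
    """Ascending factorial k^(n) = k (k+1) ... (k+n-1), with k^(0) = 1."""
    result = 1.0
    for i in range(n):
        result *= k + i
    return result

def gegenbauer_closed(a, b, k, x_max):
    """P_0..P_x_max from the closed form (3); suitable for moderate x_max."""
    p0 = (1 - a - b) ** k
    probs = []
    for x in range(x_max + 1):
        s = sum(
            b ** j * rising(k, x - j) * a ** (x - 2 * j)
            / (factorial(j) * factorial(x - 2 * j))
            for j in range(x // 2 + 1)  # j runs up to the integer part of x/2
        )
        probs.append(p0 * s)
    return probs

def gegenbauer_recurrence(a, b, k, x_max):
    """P_0..P_x_max from the second-order difference equation (2)."""
    probs = [(1 - a - b) ** k]
    for x in range(1, x_max + 1):
        g = a * (k + x - 1) / x
        h = b * (2 * k + x - 2) / x
        prev2 = probs[x - 2] if x >= 2 else 0.0  # P_{-1} is taken as 0
        probs.append(g * probs[x - 1] + h * prev2)
    return probs
```

For example, `gegenbauer_recurrence(0.2, 0.5, 2.0, 10)` and `gegenbauer_closed(0.2, 0.5, 2.0, 10)` should agree up to rounding. When a is small relative to b, most of the probability mass falls on even values of x, which is the kind of bimodal or multimodal shape that Hypothesis 2.2 refers to.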

Example. Morpheme length in Lakota

Pustet and Altmann (2005) modeled the morpheme length distribution in Lakota and obtained the results in Table 2 and Fig. 2.

Fig. 2. Fitting the Gegenbauer distribution to the Lakota data

4. Authors: U. Strauss, G. Altmann, K.-H. Best

5. References

Best, K.-H. (2000a). Morphlängen in Fabeln von Pestalozzi. Göttinger Beiträge zur Sprachwissenschaft 3, 19-30.

Best, K.-H. (2001a). Zur Länge von Morphen in deutschen Texten. In: Best, K.-H. (ed.), Häufigkeitsverteilungen in Texten: 1-14. Göttingen: Peust & Gutschmidt.

Best, K.-H. (2001b). Probability distributions of language entities. J. of Quantitative Linguistics 8, 1-11.

Best, K.-H. (2005). Morphlänge. In: Köhler, R., Altmann, G., Piotrowski, R. (eds.), Quantitative Linguistik - Quantitative Linguistics. Ein internationales Handbuch: 255-260. Berlin/N.Y.: de Gruyter.

Creutz, M. (2003). Unsupervised segmentation of words using prior distributions of morph length and frequency. Proceedings of ACL-03, The 41st Annual Meeting of the Association for Computational Linguistics: 280-287. Sapporo, Japan, 7-12 July.

Gerlach, R. (1982). Zur Überprüfung des Menzerathschen Gesetzes im Bereich der Morphologie. Glottometrika 4, 95-102.

Gorot´, E.I. (1990). Izomorfnye i otličitel´nye čerty morfemy i sloga v raspredelenii dliny. In: Kvantitativnaja lingvistika i avtomatičeskij analiz tekstov: 32-36. Tartu.

Krott, A. (1996). Some remarks on the relation between word length and morpheme length. J. of Quantitative Linguistics 3, 29-37.

Nikonov, V.A. (1978). Dlina slova. Voprosy jazykoznanija 6, 104-111.

Pustet, R., Altmann, G. (2005). Morpheme length distribution in Lakota. J. of Quantitative Linguistics 12(1), 53-63.

Saporta, S. (1963, ²1966). Phoneme distribution and language universals. In: Greenberg, J.H. (ed.), Universals of language. Second edition. Report of a conference held at Dobbs Ferry, New York, April 13-15, 1961: 61-72. Cambridge, Mass. & London: The M.I.T. Press.

Wimmer, G., Altmann, G. (1996). The theory of word length: some results and generalizations. Glottometrika 15, 112-133.

