Difference between revisions of "Frequency and polytextuality"

Latest revision as of 13:13, 26 June 2007

1. Problem and history

Polytextuality measures the degree to which the usability of a word (or, more generally, of a linguistic unit) is independent of its co-text or context. Linguistic units such as phonemes, syllables, morae, morphemes, and words differ in their usability with respect to different environments. The environment of a syllable, mora or morpheme consists of the words in which it occurs; the environment of a word consists of phrases, sentences, or texts. The number of different environments is often called the number of types. The frequency of a given entity in all its environments in, say, a corpus is considered the number of tokens. It can be shown that there is a lawful relationship between the number of types (environments) and the number of tokens (frequency) of units on a given level.

The degree of independence of the usability of a unit from its context (or the variability of contexts with respect to a given unit) can be measured in several ways. Word polytextuality is often measured as the number of different texts in a corpus that contain at least one token of the given word. The polytextuality of morphemes or syllables is usually measured with reference to an inventory such as a dictionary.
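As a minimal sketch of the text-based measure just described, assuming a toy corpus given as lists of word tokens (the function name `frequency_and_polytextuality` and the sample data are hypothetical and not taken from the cited studies), frequency and polytextuality could be counted like this:

```python
from collections import Counter


def frequency_and_polytextuality(texts):
    """Return (frequency, polytextuality) counters for a list of tokenized texts.

    frequency[w]      = total number of tokens of w in the whole corpus
    polytextuality[w] = number of different texts containing w at least once
    """
    frequency = Counter()
    polytextuality = Counter()
    for tokens in texts:
        frequency.update(tokens)            # every token counts
        polytextuality.update(set(tokens))  # each text counts once per word
    return frequency, polytextuality


# Hypothetical toy corpus of three short "texts" (assumption, for illustration).
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["a", "cat", "and", "a", "dog"],
]
freq, pt = frequency_and_polytextuality(corpus)
print(freq["the"], pt["the"])  # 3 2
print(freq["a"], pt["a"])      # 2 1
```

Here "the" has three tokens spread over two texts, while "a" has two tokens confined to one text, so the two measures are related but not identical.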

The relationship between frequency and polytextuality was postulated and investigated by R. Köhler (1986) as a complement to other quantitative properties, in order to integrate it into his synergetic control cycle. Owing to an erroneous identification with another “type-token” problem ( $\rightarrow$ ), this relationship can also be found under the name “(morphological) productivity” (cf. Baayen 2001), which in fact represents a slightly different aspect (cf. Wimmer, Altmann 1995). The relationship has been studied in various works on language synergetics (cf. e.g. Gieseking 2002). Köhler (2005) reformulated the pertinent part of his control cycle, and Tamaoka, Altmann (2005) showed by means of Japanese morae that the unified theory ( $\rightarrow$ ) leads to an identical result.

Usually one considers frequency as the spiritus movens, the independent variable of many relationships, but here Köhler (1986) assumed the reverse direction of dependence.

2. Hypothesis

The frequency of linguistic units depends on their polytextuality.

3. Derivation

Since in most cases linguistic properties are related by their relative rates of change, Tamaoka (2007), taking into account some ceteris paribus factors and building on the unified theory ( $\rightarrow$ ), set up the equation

(1) $\frac{dy}{y}= \left( c+\frac{b}{x}\right)dx$

where x is polytextuality, y is frequency, b is a parameter, and c represents some additional factors. The resulting solution,

(2) $y = ax^b e^{cx}$,

was used to model the polytextuality and frequency of Japanese morae in a Japanese corpus.
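The step from (1) to (2) is a direct integration. As a short worked sketch, assuming x > 0 and y > 0 and writing the constant of integration as ln a (a notational choice made here, not stated in the source):

$\int \frac{dy}{y} = \int \left( c+\frac{b}{x}\right)dx \quad\Rightarrow\quad \ln y = cx + b \ln x + \ln a$

Taking the exponential of both sides gives

$y = e^{\ln a}\, e^{b \ln x}\, e^{cx} = a x^b e^{cx}$

so a is fixed by the integration constant, while b and c are carried over unchanged from (1).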

Using Köhler's model (Fig. 1), one can write the relationship as follows:

(3) $\ln F = R \ln Appl + B \ln PT - C \exp(\ln PT)$

i.e.

$\ln F = R \ln Appl + B \ln PT - C \cdot PT$

from which it follows that

$F = Appl^R \, PT^B \, e^{-C \cdot PT}$.

Since in the framework of a synchronic study $Appl^R$ can be considered a constant, say A, and since we can set PT = x, F = y, B = b and C = c, we obtain

$y = A x^b e^{-cx}$,

which has the same form as the above solution (2) of the differential equation, with a = A and with $-c$ in place of c.

Fig. 1. The relationship between polytextuality and frequency in general

Thus Köhler's model also explains the additional factors.

Example 1. Types and tokens of Japanese morae

Tamaoka and Makioka (2004) computed the frequencies of 103 Japanese morae in a corpus containing 341,771 different words with total frequency 287,792,797. For each mora its frequency and the contexts (different words) were ascertained. Tamaoka and Altmann (2005) showed that the best fit to these data (in logarithmic transformation) can be obtained by the curve

$y = 26.57366832\,x^{1.31502554}\exp(-0.0000125937521\,x)$,

yielding a determination coefficient D = 0.92. The result of the fitting is displayed in Table 1 and presented graphically in Fig. 2.
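A fit of this kind can be sketched as follows. The sketch assumes that "in logarithmic transformation" means ordinary least squares on ln y, which is linear in (ln a, b, c), and that the determination coefficient is computed on the logarithmic scale. Neither detail is confirmed by the source, and the function name `fit_log_linear` and the synthetic data are hypothetical stand-ins for the mora counts, which are not reproduced here.

```python
import numpy as np


def fit_log_linear(x, y):
    """Fit y = a * x**b * exp(c*x) by least squares on ln(y).

    ln(y) = ln(a) + b*ln(x) + c*x is linear in (ln(a), b, c).
    Returns (a, b, c, D), where D is the determination coefficient
    computed on the logarithmic scale (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    log_y = np.log(y)
    # Design matrix with columns for ln(a), b and c.
    design = np.column_stack([np.ones_like(x), np.log(x), x])
    coef, *_ = np.linalg.lstsq(design, log_y, rcond=None)
    fitted = design @ coef
    ss_res = np.sum((log_y - fitted) ** 2)
    ss_tot = np.sum((log_y - log_y.mean()) ** 2)
    D = 1.0 - ss_res / ss_tot
    return float(np.exp(coef[0])), float(coef[1]), float(coef[2]), float(D)


# Synthetic, noise-free data from made-up parameters (not the mora data).
x = np.array([10.0, 50.0, 200.0, 1000.0, 5000.0, 20000.0])
y = 2.0 * x**1.3 * np.exp(-1e-5 * x)
a, b, c, D = fit_log_linear(x, y)
print(a, b, c, D)
```

Because the synthetic data contain no noise, the generating parameters should be recovered up to rounding error; with real counts, D would fall below 1.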

Fig. 2. Relation between types and tokens of Japanese morae

Example 2. Polytextuality of German words as a function of their frequencies

Köhler (1986) published his results on data from a German text corpus (LIMAS), represented here by the following graph:

4. Authors: G. Altmann, R. Köhler

5. References

Baayen, R.H. (2001). Word frequency distributions. Dordrecht: Kluwer.

Gieseking, K. (2002). Untersuchungen zur Synergetik der englischen Lexik. In: Köhler, R. (ed.), Korpuslinguistische Untersuchungen in die quantitative und systemtheoretische Linguistik: 387-433. http://ubt.opus.hbz-nrw.de/volltexte/2004/279/

Köhler, R. (2006). Frequenz, Kontextualität und Länge von Wörtern. Eine Erweiterung des synergetisch-linguistischen Modells. In: Rapp, R., Sedlmeier, P., Zunker-Rapp, G. (eds.), Perspectives on Cognition. Lengerich, Berlin, Bremen, Miami et al: Pabst Science Publishers, 327-338.

Tamaoka, K. (2007). On the relation between types and tokens of Japanese morae. In: Script problems (in print).

Tamaoka, K., Makioka, Sh. (2004). Frequency of occurrence for units of phonemes, morae, and syllables appearing in a lexical corpus of a Japanese newspaper. Behavior Research Methods, Instruments & Computers 36(3), 531-547.

Wimmer, G., Altmann, G. (1995). A model of morphological productivity. J. of Quantitative Linguistics 2, 212-216.

@@ Line 1: / Line 1: @@
 '''1. Problem and history'''
-Under polytextuality one understands the number of environments of a linguistic entity. The entity can be syllable, mora, morphem, word and other units. The environment for syllable, mora and morphem is the word, the environment of the word are other words. Usually the number of different environments is called number of types. The frequency of the given entity in all its environments in, say, a corpus, is considered as the number of tokens. The question is, whether there is some relationship between the number of types (environments) and the number of tokens (frequency) of  units of the given level.
+Polytextuality measures the degree of independence of the usability of a word (in general of a linguistic unit) of its co-text or context. Linguistic units such as phonemes, syllables, morae, morphemes, words etc. differ in their usability with respect to different environments. The environment of a syllable, mora or morphem consists of the words in which they occur, the environment of a word consists of phrases, sentences, or texts. The number of different environments is often called the number of types. The frequency of a given entity in all its environments in, say, a corpus, is considered as the number of tokens. It can be shown that there exists a lawful relationship between the number of types (environments) and the number of tokens (frequency) of units on the given level.
+The degree of independence of the unsability of a unit from its context (or, the variability of contextes with respect to a given unit), can be measured in several ways. Word polytextuality is often measured in terms of the number of different texts in a text corpus which contain at least one token of the given word. Polytextuality of morphemes or syllables are usually measured with reference to an inventory such as a dictionary.
-The relationship between frequency and polytextuality has been launched by R. Köhler (1986) as a complement to Zipfian properties, in order to enlarge his control cycle. Since the computation of data is very laborious and the erroneous identification with another “type-token” problem (<math>\rightarrow</math>) lead to confusion, one can find this relationship also under the name “(morphological) productivity” (cf. Baayen 2001) which in turn represents a slightly different problem (cf. Wimmer, Altmann 1995). The relationship appeared in different works on language synergetics (cf. e.g. Gieseking 2002), Köhler (2005) reformulated the pertinent part of his control cycle and Tamaoka, Altmann (2005) showed by means of Japanese morae that the unified theory (<math>\rightarrow</math>) leads to the identical result.
+The relationship between frequency and polytextuality has been postulated and investigated by R. Köhler (1986) as a complement to other quantitative properties, in order to integrate it into his synergetic control cycle. As a consequence of an erroneous identification with another “type-token” problem (<math>\rightarrow</math>), one can find this relationship also under the name “(morphological) productivity” (cf. Baayen 2001), which in turn represents a slightly different aspect (cf. Wimmer, Altmann 1995). The relationship was studied in different works on language synergetics (cf. e.g. Gieseking 2002), Köhler (2005) reformulated the pertinent part of his control cycle and Tamaoka, Altmann (2005) showed by means of Japanese morae that the unified theory (<math>\rightarrow</math>) leads to an identical result.
 Usually one considers frequency as the spiritus movens, the independent variable of many relationships, but Köhler (1986) assumed here an inverse relationship.
@@ Line 13: / Line 14: @@
 '''3. Derivation'''
-Since in most cases linguistic properties are related by their relative rates of change, Tamaoka and Altmann (2005), taking into account some ceteris paribus factors, and leaning against the unfied theory (<math>\rightarrow</math>) set up the equation
+Since in most cases linguistic properties are related by their relative rates of change, Tamaoka (2007), taking into account some ceteris paribus factors, and leaning against the unified theory (<math>\rightarrow</math>) set up the equation
-(1)<math> \frac{dy}{y}= \left( c+\frac{b}{x}\right)dx</math>
+(1) <math> \frac{dy}{y}= \left( c+\frac{b}{x}\right)dx</math>
-where x is polytexty, y is frequency and c are some additional factors. They considered Japanese morae, their polytexty and frequency in a Japanese corpus. The resulting equation is
+where x is polytexty, y is frequency and c represents some additional factors. The resulting solution,
-(2)<math>y = ax^b e^{cx}\quad</math>.
+(2) <math>y = ax^b e^{cx}\quad</math>,
+was used to model polytexty and frequency of Japanese morae in a Japanese corpus.
 Using Köhlers model (Fig. 1) one can write the relationships as  follows:
-(3)	ln(F) = R ln(Appl) + B ln(PT) – C exp(ln(PT))
+(3) ln(F) = R ln(Appl) + B ln(PT) – C exp(ln(PT))
 i.e.
-	ln(F) = R ln(Appl) + B ln(PT) – C (PT)\quad
-from which
+ln(F) = R ln(Appl) + B ln(PT) – C (PT)
-<math>F = Appl^R PT^B e^{-c({PT})}\quad</math>
+from which it follows that
-follows. Since <math>Appl^R</math> can, in the framework of a synchronic study, be considered as a constant, say A, PT = x, and F = y, we obtain
+<math>F = Appl^R PT^B e^{-c({PT})}\quad</math>.
-<math>y = A x^b e^{-cx}\quad</math>
+Since in the framework of a synchronic study <math>Appl^R</math> can be considered as a constant, say A, and since we can set PT = x and F = y, we obtain
-whih is identical with the above result of the differential equation.
+<math>y = A x^b e^{-cx}\quad</math>,
+whih is identical with the above solution of the differential equation.
 <div align="center">[[Image:Figur1_Freq.jpg]]</div>
@@ Line 44: / Line 48: @@
 Thus Köhler´s model explains also the additional factors.
-'''Example 1'''. Types and token of Japanese morae
+'''Example 1'''. Types and tokens of Japanese morae
 Tamaoka and Makioka (2004) computed the frequencies of 103 Japanese morae in a corpus containing 341,771 different words with total frequency 287,792,797. For each mora its frequency and the contexts (different words) were ascertained. Tamaoka and Altmann (2005) showed that the best fit to these data (in logarithmic transformation) can be obtained by the curve
@@ Line 55: / Line 59: @@
 <div align="center">[[Image:Grafi1_Freq.jpg]]</div>
 <div align="center">Fig. 2. Relation between types and tokens of Japanese morae</div>
+'''Example 2'''. Polytextuality of German words as a function of their frequencies
+Köhler (1986) published his result on data of a German text corpus (LIMAS), respresented here by the following graph:
+[[image:Graph1.JPG]]
-'''4. Authors: R. Köhler, G. Altmann'''
+'''4. Authors: G. Altmann, R. Köhler'''
@@ Line 66: / Line 77: @@
 '''Gieseking, K.''' (2002). Untersuchungen zur Synergetik der englischen Lexik. In: Köhler, R. (ed.), ''Korpuslinguistische Untersuchungen in die quantitative und systemtheoretische Linguistik: 387-433''. http://ubt.opus.hbz-nrw.de/volltexte/2004/279/
-'''Köhler (2005)………….'''
+'''Köhler, R.''' (2006). Frequenz, Kontextualität und Länge von Wörtern. Eine Erweiterung des synergetisch-linguistischen Modells. In: Rapp, R., Sedlmeier, P., Zunker-Rapp, G. (eds.), ''Perspectives on Cognition''. Lengerich, Berlin, Bremen, Miami et al: Pabst Science Publishers, 327-338.
-'''Tamaoka, K., Altmann, G.''' (2005). On the relation between types and tokens of Japanese morae………….
+'''Tamaoka, K.''' (2007). On the relation between types and tokens of Japanese morae. In: ''Script problems (in print)''.
 '''Tamaoka, K., Makioka, Sh.''' (2004). Frequency of occurrence for units of phonemes, morae, and syllables appearing in a lexical corpus of a Japanese newspaper. ''Behavior Research Methods, Instruments & Computers 36(3), 531-547''.
@@ Line 74: / Line 85: @@
 '''Wimmer, G., Altmann, G.''' (1995). A model of morphological productivity. ''J. of Quantitative Linguistics 2, 212-216.''
-[[Category:Unfertig]]
+[[Category:Quantitative properties]]
