Depth of syntactic constructions

1. Problem and history

Depth (of embedding) of a syntactic construction is defined as the "number of steps from S to the given constituent, by applying production rules" (Köhler, Altmann 2000). In practice, the sentence is analyzed as a graph-theoretical tree, and the depth is the length of the path from S to the given symbol. The problem is to find the distribution of depth (x) in a text or in a corpus.
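The definition can be illustrated with a minimal Python sketch. The toy tree, its labels and the helper name `depths` are hypothetical and not taken from Köhler and Altmann (2000); the sketch also assumes that the root S itself is counted with depth 0.

```python
from collections import Counter

# Hypothetical toy parse tree: each node is (label, [children]);
# terminal-level constituents have an empty child list.
tree = ("S", [
    ("NP", [("Det", []), ("N", [])]),
    ("VP", [
        ("V", []),
        ("NP", [("Det", []), ("N", [])]),
    ]),
])

def depths(node, depth=0):
    """Yield (label, depth) for every constituent; the root S has depth 0."""
    label, children = node
    yield label, depth
    for child in children:
        # Each step down the tree corresponds to one production-rule step.
        yield from depths(child, depth + 1)

# Frequency of each depth value x in this single tree.
depth_counts = Counter(d for _, d in depths(tree))
print(sorted(depth_counts.items()))
```

Summing such counts over all sentences of a text or corpus gives the empirical distribution of depth x.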

The first concept of depth was proposed by Yngve (1960, 1961). Different researchers used it for typological purposes (Householder 1960; Papp 1966; Altmann, Lehfeldt 1973). It can be defined in the framework of any grammar. Köhler (1999) studied the dependence of depth on position (see Vol. 2); Köhler and Altmann (2000) derived the distribution of depth in a corpus.

2. Hypothesis

The depth of syntactic constructions abides by the hyper-Pascal distribution.

3. Derivation

For the derivation the following quantities are necessary (Köhler, Altmann 2000): minT – the requirement of limiting the depth of embedding, which represents the limitation of the language processing memory; maxH – the requirement of maximizing compactness. This enables us to diminish the complexity of the subordinated level of embedding by embedding constituents into the given level… minX on the level m corresponds to the requirement maxH on the level m+1; E – a variable representing the average degree of fullness, the default value of complexity.

Assumptions: The probability of depth x is directly proportional to the probability of depth x-1, the proportionality being given by the default value of complexity E. The requirement maxH increases the tendency towards greater depth; minT restricts it. Thus we obtain

(1) P_x = \frac{\mathrm{maxH} + x}{\mathrm{minT} + x} \, E \, P_{x-1}.

Setting maxH = k-1, minT = m-1, E = q and solving (1) yields the hyper-Pascal distribution

(2) P_x = \frac{\binom{k+x-1}{x}}{\binom{m+x-1}{x}} q^x P_0, \quad x = 0, 1, 2, \ldots

where P_0^{-1} = {}_2F_1(k, 1; m; q).
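As a numerical illustration, the probabilities in (2) can be obtained by iterating recurrence (1) with maxH = k-1, minT = m-1, E = q and normalising afterwards. The following Python sketch does this; the function name `hyper_pascal_pmf`, the truncation parameters `x_max` and `tol`, and the parameter values in the usage lines are assumptions made for illustration, not estimates from any corpus. It assumes k > 0, m > 0 and 0 < q < 1, so that the series converges.

```python
def hyper_pascal_pmf(k, m, q, x_max=2000, tol=1e-15):
    """Return [P_0, P_1, ...] of the hyper-Pascal distribution.

    Unnormalised terms follow the recurrence
        t_x = (k + x - 1) / (m + x - 1) * q * t_{x-1},  t_0 = 1,
    and their sum is a truncated approximation of 2F1(k, 1; m; q) = 1 / P_0.
    """
    terms = [1.0]  # unnormalised term for x = 0
    for x in range(1, x_max + 1):
        t = terms[-1] * (k + x - 1) / (m + x - 1) * q
        terms.append(t)
        if t < tol:  # remaining tail is negligible (assumes 0 < q < 1)
            break
    total = sum(terms)
    return [t / total for t in terms]

# Illustrative (hypothetical) parameter values.
pmf = hyper_pascal_pmf(k=2.0, m=3.0, q=0.5)
print([round(p, 4) for p in pmf[:5]])
```

For k = m the ratio in (1) reduces to q, so the sketch returns the geometric distribution P_x = (1 - q) q^x, which provides a simple check of the implementation.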

Example: The distribution of depth in the Susanne corpus

Köhler and Altmann (2000) considered the depth of embedding in the Susanne corpus and obtained the results presented in Table 1 and Fig. 1.

[Table 1: image Tabelle1 DoSC.jpg]
[Fig. 1: image Grafik1 DoSC.jpg]
Fig. 1. Fitting the hyper-Pascal distribution to the data in Table 1


Since the sample size is very large (N = 101138), the usual chi-square test is inappropriate. Instead, the contingency coefficient C = X^2/N was used, which yields satisfactory results.
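The coefficient can be computed directly from observed and fitted frequencies. The Python sketch below assumes that X^2 is the usual Pearson chi-square statistic; the function name `contingency_coefficient` and the frequencies in the usage lines are invented for illustration and are not the Susanne data.

```python
def contingency_coefficient(observed, expected):
    """Return C = X^2 / N for observed and expected (fitted) frequencies."""
    n = sum(observed)
    # Pearson chi-square statistic summed over all classes.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2 / n

# Hypothetical frequencies, for illustration only.
observed = [50, 30, 15, 5]
expected = [48.0, 32.0, 14.0, 6.0]
C = contingency_coefficient(observed, expected)
print(C)
```

Because X^2 grows roughly in proportion to N for a fixed relative deviation, dividing by N gives a measure of fit that does not automatically signal significance in very large samples.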

4. Authors: G. Altmann

5. References

Altmann, G., Lehfeldt, W. (1973). Allgemeine Sprachtypologie. München: Fink.

Householder, F.W., Jr. (1960). First thoughts on syntactic indices. International Journal of American Linguistics 26, 195-197.

Köhler, R. (1999). Syntactic structures: Properties and interrelations. Journal of Quantitative Linguistics 6, 46-57.

Köhler, R., Altmann, G. (2000). Probability distributions of syntactic units and properties. Journal of Quantitative Linguistics 7, 189-200.

Papp, F. (1966). On the depth of Hungarian sentences. Linguistics 25, 58-77.

Yngve, V. (1960). A model and a hypothesis for language structure. Proceedings of the American Philosophical Society 104, 444-446.

Yngve, V. (1961). The depth hypothesis. In: Jakobson, R. (ed.), Structure of Language and its Mathematical Aspects: 130-138. Providence, R.I.: American Mathematical Society.