Structure of the <text> element


It has already been noted elsewhere on this site that the DECTE content is provided in several types of representation (audio, orthographic transcription, part-of-speech tagged orthographic transcription, and phonetic transcription) and that not all types of representation are available across all the TLS, PVC, and NECTE2 components.

To encode these levels of representation in a TEI-conformant way, each <text> element (i.e. the material associated with a given interview in the corpus) is treated as a composite text. Its components are enclosed in a <group> element, which, in the words of the TEI Guidelines, 'contains the body of a composite text, grouping together a sequence of distinct texts (or groups of such texts) which are regarded as a unit for some purpose' (TEI Guidelines, section 4; see also 4.3.1).

For texts in the TLS subcorpus, the <text> structure looks like this (using decten1tlsg37 as an example):

<text>

<group>

<!--A sequence of five <text> elements-->

<text xml:id="decten1tlsg37audio">

<body>

<!--Content 1: identification of audio file-->

</body>

</text>

<text xml:id="decten1tlsg37necteortho">

<body>

<!--Content 2: DECTE orthographic transcription-->

</body>

</text>

<text xml:id="decten1tlsg37tlsortho">

<body>

<!--Content 3: TLS orthographic transcription-->

</body>

</text>

<text xml:id="decten1tlsg37phonetic">

<body>

<!--Content 4: TLS phonetic transcription-->

</body>

</text>

<text xml:id="decten1tlsg37tagged">

<body>

<!--Content 5: part-of-speech tagged transcription-->

</body>

</text>

</group>

</text>
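
The nesting shown above can also be inspected programmatically. The following is a minimal sketch, not part of the DECTE distribution, that lists the representation types present in one composite <text> by stripping the interview name from each child's @xml:id. It assumes the sample carries no TEI namespace declaration, as in the listing above; the function name representation_types and the cut-down SAMPLE string are illustrative only.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Cut-down copy of the skeleton above, with empty <body> elements.
SAMPLE = """
<text>
  <group>
    <text xml:id="decten1tlsg37audio"><body/></text>
    <text xml:id="decten1tlsg37necteortho"><body/></text>
    <text xml:id="decten1tlsg37tlsortho"><body/></text>
    <text xml:id="decten1tlsg37phonetic"><body/></text>
    <text xml:id="decten1tlsg37tagged"><body/></text>
  </group>
</text>
"""


def representation_types(composite_xml, interview):
    """Return the mnemonic suffix of each <text> inside the <group>."""
    root = ET.fromstring(composite_xml)
    group = root.find("group")
    if group is None:
        return []  # not a composite text
    types = []
    for child in group.findall("text"):
        ident = child.get(XML_ID, "")
        if ident.startswith(interview):
            # What remains after the interview name is the mnemonic.
            types.append(ident[len(interview):])
    return types


print(representation_types(SAMPLE, "decten1tlsg37"))
# ['audio', 'necteortho', 'tlsortho', 'phonetic', 'tagged']
```

Because the function simply iterates over whatever <text> children the <group> holds, it makes no assumption about how many representation types an interview has.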

Notes:

  • Each of the five <text> elements in a <group> contains one type of representation. The @xml:id attribute identifies both the interview and the representational type by concatenating the relevant entity name, taken from the list defined by interviews.ent in the DTD, with a mnemonic indicating the type of representation, as described elsewhere. For example, xml:id="decten1tlsg37audio" identifies the audio file of the decten1tlsg37 interview.
  • Audio: non-text content such as audio and graphics cannot appear explicitly in XML documents, but they can be linked or embedded using a referencing mechanism that an XML processor is able to interpret appropriately. The TEI Guidelines do not at present provide an obvious embedding mechanism for audio, so for the time being the only content here is a reference to an audio file entity defined by the audiofiles.ent file in the DTD. For example:

    <text xml:id="decten1tlsg37audio">

    <body>

    <p>audio file</p>

    </body>

    </text>

  • The other four representational levels involve text, and are thus lexically present in the <body> elements of their respective <text> elements. A brief example of each is given:

    (1) DECTE orthographic transcription

    <text xml:id="decten1tlsg37necteortho">

    <body>

    <u who="#informantTLSG37"><anchor id="decten1tlsg37necteortho0000"/>T L S what's that</u>

    <u who="#interviewerTLSG37">G</u>

    <u who="#informantTLSG37">E <pause/> five two</u>

    <u who="#interviewerTLSG37">thanks <pause/> ta <pause/> eh could you tell us first eh where you were born please <incident><desc>interruption</desc></incident> <unclear/> </u>

    <u who="#informantTLSG37">I was born at eleven Victoria Street <pause/> Gateshead</u>

    <u who="#interviewerTLSG37"> eh <pause/> aye yeah whereabouts is that again [...]

    </body>

    </text>

    (2) TLS orthographic transcription (recording informant only)

    <text xml:id="decten1tlsg37tlsortho">

    <body>

    <u who="#informantTLSG37"><anchor id="decten1tlsg37tlsortho0000"/>I was born at eleven Victoria Street Gateshead thats just against the <pause/> the flats <anchor id="decten1tlsg37tlsortho0020"/> you know no Barney Close the old <pause/> Victoria Street eleven Victoria Street oh well I my mother got out they were building houses for the people then down the Old Fold and I went down the Old <anchor id="decten1tlsg37tlsortho0040"/> Fold to live [...]

    </body>

    </text>

    (3) TLS phonetic transcription (recording informant only)

    <text xml:id="decten1tlsg37phonetic">

    <body>

    <u who="#informantTLSG37"><anchor id="decten1tlsg37phonetic0000"/>01304 02941 02641 02201 00626 02741 08760 02301 02081 02781 00244 02561 02021 02741 02561 00144 02421 02263 00626 02861 17801 02621 02262 02861 00023 02301 02442 01123 02301 02623 02365 02603 00342 02301 09040 02521 00823 02623 02442 11202 02741 02623 09030 08440 08580 02603 02541 02801 00342 02301 28803 [...]

    </body>

    </text>

    (4) Part-of-speech tagged version of the DECTE orthographic transcription

    <text xml:id="decten1tlsg37tagged">

    <body>

    [...]

    <u who="#informantTLSG37">

    <w type="PPIS1" lemma="I">I</w>

    <w type="VABDZ" lemma="be">was</w>

    <w type="VVN" lemma="born">born</w>

    <w type="II" lemma="at">at</w>

    <w type="MC" lemma="eleven">eleven</w>

    <w type="NN1" lemma="Victoria">Victoria</w>

    <w type="NN1" lemma="street">Street</w>

    <pause/> <w type="NP1" lemma="Gateshead">Gateshead</w>

    </u>

    [...]

    </body>

    </text>

  • The interviews in the PVC and NECTE2 components of DECTE have fewer types of content representation than those in the TLS subcorpus, so their <group> elements contain correspondingly fewer <text> elements.
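
The part-of-speech tagged representation in example (4) is the most regular of the text-based levels, since every word is wrapped in a <w> element carrying @type and @lemma. As a hedged illustration of how it might be read, the sketch below collects one (speaker, word, tag, lemma) tuple per <w>. It assumes un-namespaced XML as in the examples above; the function name tagged_words and the shortened TAGGED sample are hypothetical and not part of DECTE.

```python
import xml.etree.ElementTree as ET

# Shortened version of example (4); elided material is simply omitted.
TAGGED = """
<text xml:id="decten1tlsg37tagged">
  <body>
    <u who="#informantTLSG37">
      <w type="PPIS1" lemma="I">I</w>
      <w type="VABDZ" lemma="be">was</w>
      <w type="VVN" lemma="born">born</w>
      <pause/> <w type="NP1" lemma="Gateshead">Gateshead</w>
    </u>
  </body>
</text>
"""


def tagged_words(text_xml):
    """Return (speaker, word, tag, lemma) for every <w> in every <u>."""
    root = ET.fromstring(text_xml)
    rows = []
    for utterance in root.iter("u"):
        # @who is a pointer of the form "#id"; drop the leading "#".
        speaker = utterance.get("who", "").lstrip("#")
        for word in utterance.iter("w"):
            rows.append((speaker, word.text, word.get("type"), word.get("lemma")))
    return rows


for row in tagged_words(TAGGED):
    print(row)
```

Non-word elements such as <pause/> are skipped here because only <w> elements are visited; a fuller reader would probably want to keep them in sequence.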