THE NEWCASTLE ELECTRONIC CORPUS OF TYNESIDE ENGLISH


Documentation: structure of the <text> element
As already noted, a single NECTE interview consists of five types of representation: audio, NECTE orthographic transcription, TLS orthographic transcription, phonetic transcription, and part-of-speech tagged text. To encode these in a TEI-conformant way, each interview is treated as a composite text, for which TEI provides the <group> element (Guidelines 7). Using <group>, the structure of a NECTE interview looks like this:
 
<text>
 
<group>
 
<!-- a sequence of five <text> elements -->
 
<text id='tlsg01audio'>
<body>
<!-- content -->
</body>
</text>
 
<text id='tlsg01necteortho'>
<body>
<!-- content -->
</body>
</text>
 
<text id='tlsg01tlsortho'>
<body>
<!-- content -->
</body>
</text>
 
<text id='tlsg01phonetic'>
<body>
<!-- content -->
</body>
</text>
 
<text id='tlsg01tagged'>
<body>
<!-- content -->
</body>
</text>
 
</group>
 
</text>

where:

  • Each of the five <text> elements in a <group> contains one of the types of representation. The 'id' attribute identifies both the interview and the representational type. It is formed by concatenating the relevant entity name, taken from the list defined by <!ENTITY % interviews SYSTEM 'interviews.ent'> %interviews; in the DOCTYPE declaration, with a mnemonic indicating the type of representation (audio, NECTE orthographic, and so on).
  • The <body> subelement of each of the <text> elements contains the actual content of the associated representational type:

-- Audio: Non-text content such as audio and graphics cannot appear directly in XML documents, but it can be embedded using a referencing mechanism that an XML processor is able to interpret appropriately. The TEI Guidelines do not provide an embedding mechanism for audio, but they do provide one for graphics via the <figure> tag (Guidelines 22.3). <!ENTITY % TEI.extensions.ent SYSTEM 'necte.ent' > in the DOCTYPE declaration adapts this graphics mechanism for audio by renaming the <figure> tag as <audio>, in accordance with Guidelines 29: Modifying and Customizing the TEI DTD. <!NOTATION wav SYSTEM "wmplayer.exe"> in the DOCTYPE declaration tells XML processors which application to use for playing the audio. An example of an audio <text> element looks like this:

<text id='tlsg37audio'>
<body>
<audio entity='tlsaudiog37'>
</audio>
</body>
</text>

where the entity 'tlsaudiog37' is defined by <!ENTITY % audiofiles SYSTEM "audiofiles.ent"> %audiofiles; in the DOCTYPE declaration as denoting the operating system file 'tlsaudiog37.wav'.

-- The other four representational types are textual, and so appear literally in the <body> elements of their respective <text>s. A brief example of each follows:

<text id="tlsg37necteortho">
<body>
<u who="informantTlsg37"> <anchor id="tlsg37necteortho0000"/>t l s what’s that </u><u who="interviewerTlsg37"> g </u><u who="informantTlsg37"> e <pause/> five two </u><u who="interviewerTlsg37"> thanks <pause/> ta <pause/> eh could you tell us first eh where you were born please <event desc="interruption"/> <unclear/> </u><u who="informantTlsg37"> i was born at eleven victoria street <pause/> gateshead </u><u who="interviewerTlsg37"> eh <pause/> aye yeah whereabouts is that again...
</body>
</text>

<text id="tlsg37tlsortho">
<body>
<u who="informantTlsg37"><anchor id="tlsg37tlsortho0000"/>I was born at eleven Victoria Street Gateshead thats just against the <pause/> the flats <anchor id="tlsg37tlsortho0020"/>you know no Barney Close the old <pause/> Victoria Street eleven Victoria Street oh well I my mother got out they were building houses for the people then down the Old Fold and I went down the Old <anchor id="tlsg37tlsortho0040"/>Fold to live...
</body>
</text>

<text id="tlsg37phonetic">
<body>
<u who="informantTlsg37"><anchor id="tlsg37phonetic0000"/>01304 02941 02641 02201 00626 02741 08760 02301 02081 02781 00244 02561 02021 02741 02561 00144 02421 02263 00626 02861 17801 02621 02262 02861 00023 02301 02442 01123 02301 02623 02365 02603 00342 02301 09040 02521 00823 02623 02442 11202 02741 02623 09030 08440 08580 02603 02541 02801 00342 02301 28803 <anchor id="tlsg37phonetic0020"/>02741...
</body>
</text>

<text id="tlsg37tagged">
<body>
<u who="informantTlsg37"><anchor id="0000"/> <s> <w type="VVN" lemma="see"> seen </w> <w type="II" lemma="at"> at </w> <w type="AT" lemma="the"> the </w> <w type="NN2" lemma="picture"> pictures </w> <w type="VVBDZ" lemma="be"> was </w> <w type="UH" lemma="ehm"> ehm </w> <w type="RR" lemma="so"> so </w> <w type="PPIS1" lemma="i"> i </w> <w type="VVD" lemma="marry"> married </w> <w type="AT1" lemma="an"> an </w> <w type="NN1" lemma="axe"> axe </w> <w type="NN1" lemma="murderer"> murderer </w>...
</body>
</text>
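
The composite structure and the 'id' convention described in this section can be illustrated with a short script. What follows is only a minimal sketch in Python, using the standard xml.etree.ElementTree module; it is not part of the NECTE distribution. The SAMPLE string is a hand-made, well-formed miniature modelled on the examples above (with whitespace between attributes, as XML requires). The function names split_text_id and tagged_words are hypothetical, and the list of id suffixes is an assumption read off the five example ids shown here.

```python
import xml.etree.ElementTree as ET

# Assumption: these suffixes are read off the five example ids in this section.
REPRESENTATION_TYPES = ("audio", "necteortho", "tlsortho", "phonetic", "tagged")

# Hand-made miniature of a composite text; it omits the DOCTYPE and most content.
SAMPLE = """
<text>
<group>
<text id="tlsg37audio"><body><audio entity="tlsaudiog37"></audio></body></text>
<text id="tlsg37tagged"><body>
<u who="informantTlsg37"><s>
<w type="VVN" lemma="see"> seen </w>
<w type="II" lemma="at"> at </w>
<w type="AT" lemma="the"> the </w>
<w type="NN2" lemma="picture"> pictures </w>
</s></u>
</body></text>
</group>
</text>
"""


def split_text_id(text_id):
    """Split an id such as 'tlsg37audio' into (interview, representation type)."""
    for suffix in REPRESENTATION_TYPES:
        if text_id.endswith(suffix) and len(text_id) > len(suffix):
            return text_id[: -len(suffix)], suffix
    raise ValueError(f"unrecognised representation type in id: {text_id!r}")


def tagged_words(text_element):
    """Return (word, tag, lemma) triples for every <w> inside a tagged <text>."""
    return [
        ((w.text or "").strip(), w.get("type"), w.get("lemma"))
        for w in text_element.iter("w")
    ]


root = ET.fromstring(SAMPLE)

# Index the inner <text> elements of the <group> by representation type.
representations = {}
for text in root.findall("./group/text"):
    interview, kind = split_text_id(text.get("id"))
    representations[kind] = text

words = tagged_words(representations["tagged"])
print(words)
```

Indexing the inner <text> elements by their id suffix mirrors the way the 'id' attribute encodes both the interview and the representational type, so one representation can be picked out without depending on element order within the <group>.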