Alignment within <text>


Within the <group>-based structure of the individual <text>, the real-time alignment scheme described elsewhere on this website is implemented using the <anchor> tag. In each tag, the @xml:id attribute encodes a real-time offset from the start of the audio file in question. The tag <anchor xml:id="decten1tlsg01necteortho0040"/>, for example, marks the place in the DECTE orthographic transcription that corresponds to a time offset of 40 seconds from the start of the corresponding decten1tlsaudiog01 audio file.

Since, for a given interview XML file, such markers are inserted into the correct places not only in the DECTE orthographic text but also in the various other transcribed representations of it (TLS orthographic, phonetic, tagged), an XML processor can use them to align all the <text> elements enclosed by <group>.
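
As a rough illustration of how a processor might use these markers, the following Python sketch groups <anchor> identifiers by the offset they carry. It is not part of the DECTE tooling. The sample markup is a hypothetical, heavily simplified interview with no TEI namespace, and the function name anchors_by_offset and the assumption that every identifier ends in a four-digit offset in seconds are illustrative only.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, simplified sample: two representations of one interview.
SAMPLE = """
<text>
  <group>
    <text><body>
      <u>eh <anchor xml:id="decten1tlsg01necteortho0020"/> where do you mean</u>
      <u>five year <anchor xml:id="decten1tlsg01necteortho0040"/> lived eh</u>
    </body></text>
    <text><body>
      <anchor xml:id="decten1tlsg01phonetic0020"/> 02081 02301
      <anchor xml:id="decten1tlsg01phonetic0040"/>
    </body></text>
  </group>
</text>
"""


def anchors_by_offset(interview):
    """Map each offset (in seconds) to the anchor IDs that carry it."""
    table = defaultdict(list)
    for anchor in interview.iter("anchor"):
        anchor_id = anchor.get(XML_ID, "")
        # Assumption: the last four characters of the ID are the offset.
        match = re.search(r"(\d{4})$", anchor_id)
        if match:
            table[int(match.group(1))].append(anchor_id)
    return dict(table)


root = ET.fromstring(SAMPLE)
aligned = anchors_by_offset(root)
for offset in sorted(aligned):
    print(offset, aligned[offset])
```

Anchors that share an offset mark the same point in the recording, so each list in the result names the places that line up across the representations.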

This alignment mechanism requires careful attention to the values assigned to the @xml:id attribute.

The TEI specification of the <anchor> tag states that the value assigned to its @xml:id attribute must be unique within the document. This condition is violated if any given offset value, say '0040', is assigned in more than one <text> in an interview <group>, as it would have to be for the alignment to work; to compound the problem, any such offset value would be non-unique not only within a given interview <text>, but also across the interview <text> elements that comprise the corpus.

To overcome this difficulty, a unique string prefix is assigned to the offset value of each <anchor> instance in the corpus, thus guaranteeing overall uniqueness.
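
The uniqueness requirement can be checked mechanically. The Python sketch below is a minimal illustration under stated assumptions: the two sample fragments are hypothetical and stripped of the TEI namespace, and duplicate_ids is an illustrative helper, not part of any DECTE tool. The bare offsets in the first fragment serve only to show the clash.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_ids(root):
    """Return the xml:id values that occur more than once under root."""
    ids = [el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None]
    return sorted(value for value, count in Counter(ids).items() if count > 1)


# Hypothetical fragment: the same bare offset in two <text> elements.
BARE = ET.fromstring(
    "<group>"
    '<text><anchor xml:id="0040"/></text>'
    '<text><anchor xml:id="0040"/></text>'
    "</group>"
)

# The same offset, made unique by a prefix.
PREFIXED = ET.fromstring(
    "<group>"
    '<text><anchor xml:id="decten1tlsg01necteortho0040"/></text>'
    '<text><anchor xml:id="decten1tlsg01phonetic0040"/></text>'
    "</group>"
)

print(duplicate_ids(BARE))      # the clashing bare offset is reported
print(duplicate_ids(PREFIXED))  # no duplicates once prefixes are added
```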

The @xml:id value of each <anchor> in any given interview is constructed by concatenation: the entity name for that interview in the entity list defined by interviews.ent in the DTD, plus a mnemonic string for the representational type (together, these two form the prefix), plus the time offset string.

Thus, for the interview whose entity name is decten1tlsg01, for the DECTE orthographic representation, and for time offset '0040', the <anchor> looks like this: <anchor xml:id="decten1tlsg01necteortho0040"/>; for the interview whose entity name is decten1tlsg21, for the phonetic representation, and for time offset '0340', the <anchor> looks like this: <anchor xml:id="decten1tlsg21phonetic0340"/>, and so on.
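
The concatenation can be sketched in Python as follows. The function names make_anchor_id and parse_anchor_id and the regular expression are illustrative assumptions, not part of the corpus tooling. The pattern assumes that the entity name ends in a digit, that the mnemonic consists of lowercase letters only, and that the offset is always four digits, as in the two examples above.

```python
import re

# Assumed shape: entity name ending in a digit, a lowercase mnemonic,
# then a four-digit offset in seconds.
ANCHOR_ID = re.compile(r"^(?P<entity>.+\d)(?P<kind>[a-z]+)(?P<offset>\d{4})$")


def make_anchor_id(entity, kind, seconds):
    """Concatenate entity name, representation mnemonic and padded offset."""
    return f"{entity}{kind}{seconds:04d}"


def parse_anchor_id(anchor_id):
    """Split an anchor ID back into (entity, kind, offset in seconds)."""
    match = ANCHOR_ID.match(anchor_id)
    if match is None:
        raise ValueError(f"not a recognised anchor ID: {anchor_id!r}")
    return match["entity"], match["kind"], int(match["offset"])


print(make_anchor_id("decten1tlsg01", "necteortho", 40))
print(parse_anchor_id("decten1tlsg21phonetic0340"))
```

Parsing works here only because the mnemonic contains no digits. A mnemonic with digits in it would make the split ambiguous and would need an explicit list of known mnemonics instead.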

The <anchor> tags for a randomly chosen passage of DECTE orthographic transcription and the corresponding phonetic transcription look like this:

(a) DECTE orthographic transcription

<u who="#interviewerTLSG01">eh <anchor xml:id="decten1tlsg01necteortho0020"/> where do you mean by that eh</u>

<u who="#informantTLSG01">that's ehm <pause/> down by eh Clark Chapman's</u>

<u who="#interviewerTLSG01">oh aye like Saltmeadows</u>

<u who="#informantTLSG01">yes Saltmeadows</u>

<u who="#interviewerTLSG01"><unclear/> whereabouts else have you lived since then you know I mean how long did you stay there</u>

<u who="#informantTLSG01">five year <anchor xml:id="decten1tlsg01necteortho0040"/> lived eh over in Gateshead moved from there to ehm Elswick Street of Gateshead on Sunderland Road</u>

(b) TLS phonetic transcription (recording informant only)

<anchor xml:id="decten1tlsg01phonetic0020"/> 02081 02301 08580 02322 01443 02741 02201 01284 08580 02383 02801 00421 02421 02501 00342 02164 02721 02021 02741 02642 04321 02621 00503 02825 02301 02721 00246 02341 12601 02642 02541 01284 02561 02881 01641 <anchor xml:id="decten1tlsg01phonetic0040"/>

Note that, for the files representing the Newcastle group of speakers, decten1tlsn01 to decten1tlsn07, it has not been possible to time-align the phonetic transcriptions, since the audio recordings for this group have not survived and there is consequently no basis for alignment. The formatting of numerical phonetic codes in these files therefore differs from that in the other TLS-based files, where the codes are in a continuous sequence with interspersed <anchor> elements.

For the Newcastle files, the numerical codes are arranged in a sequence of code-strings terminated by a line-break, where each code-string in the sequence corresponds to a single informant utterance. The motivation for this formatting was to facilitate re-ordering of the codes if this is ever undertaken in future — if, for example, the audio files or the index card sets for the Newcastle group should ever come to light.
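
A line-per-utterance layout of this kind is straightforward to read back into a per-utterance structure. The Python sketch below is illustrative only: it assumes that the codes within a code-string are separated by whitespace, the function name utterance_codes is hypothetical, and the sample block reuses code values from the passage above in an invented two-utterance grouping.

```python
def utterance_codes(block):
    """Split a block of code-strings into one list of codes per utterance."""
    # Each non-empty line is taken to be one informant utterance.
    return [line.split() for line in block.splitlines() if line.strip()]


# Invented grouping, for illustration only.
SAMPLE_BLOCK = "02081 02301 08580\n02322 01443\n"

for number, codes in enumerate(utterance_codes(SAMPLE_BLOCK), start=1):
    print(number, codes)
```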