Alignment implementation

Documentation: alignment within <text>

Within the <group>-based structure of the interview <text>, the real-time alignment scheme described elsewhere in this website is implemented using the <anchor> tag (Guidelines 14.3). In each tag, the 'id' attribute specifies a real-time offset from the start of the operating system audio file in question. For example, <anchor id='0060'/> in the NECTE orthographic transcription of the audio file for an interview would mark the place in the text corresponding to a time offset of 60 seconds in that audio file. If, for a given interview, such markers are inserted at the correct places not only in the NECTE orthographic representation but also in the TLS orthographic, phonetic, and tagged representations, then an XML processor can use them to align all the <text>s in an interview <group>.

This alignment mechanism requires careful attention to the values assigned to the 'id' attribute. The TEI specification of the <anchor> tag states that the value assigned to its 'id' attribute 'may be chosen freely provided that it is unique within the document' (Guidelines 35). This condition is violated if any given offset value, say '0060', is assigned in more than one <text> in an interview <group>, as it would have to be for the alignment to work. To compound the problem, any such offset value would be non-unique not only within a given interview but also across the interviews that comprise the corpus. To overcome this difficulty, a unique string prefix is attached to the offset value of each <anchor> instance in the corpus, thus guaranteeing overall uniqueness. The 'id' for an <anchor> in any given interview is constructed by concatenating three strings: the entity name for that interview in the entity list defined by <!ENTITY % interviews SYSTEM 'interviews.ent'> %interviews; in the DOCTYPE declaration, a mnemonic string for the representational type, and the time offset string. Thus, for the interview whose entity name is 'tlsg01', the NECTE orthographic representation, and time offset '0060', the <anchor> looks like this: <anchor id='tlsg01necteortho0060'/>. For the interview whose entity name is 'tlsg21', the phonetic representation, and time offset '0340', it looks like this: <anchor id='tlsg21phonetic0340'/>, and so on. In practice, <anchor> tags for a randomly chosen passage of NECTE orthographic transcription and the corresponding phonetic transcription look like this:

<anchor id="tlsg01necteortho0020"/>where do you mean by that eh that's ehm <pause/> down by eh clark chapman's oh aye like saltmeadows yes saltmeadows <unclear/> whereabouts else have you lived since then you know i mean how long did you stay there five year <anchor id="tlsg01necteortho0040"/>

<anchor id="tlsg01phonetic0020"/>02081 02301 08580 02322 01443 02741 02201 01284 08580 02383 02801 00421 02421 02501 00342 02164 02721 02021 02741 02642 04321 02621 00503 02825 02301 02721 00246 02341 12601 02642 02541 01284 02561 02881 01641 <anchor id="tlsg01phonetic0040"/>
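The 'id' scheme and the alignment it enables can be sketched in Python. This is only an illustration under stated assumptions, not the project's own tooling: the function names make_anchor_id, split_anchor_id and segments_by_offset are hypothetical, only the two mnemonics shown above are handled, the zero-padding to four digits is inferred from the examples, and the anchors are located with a regular expression rather than a full XML processor. The sample strings are shortened excerpts of the two passages above.

```python
import re

# Mnemonics taken from the examples above; other representation types are omitted.
MNEMONICS = ("necteortho", "phonetic")

# Matches an empty <anchor/> element and captures the value of its id attribute.
ANCHOR_RE = re.compile(r'<anchor id=["\']([^"\']+)["\']\s*/>')


def make_anchor_id(entity, mnemonic, offset_seconds):
    """Concatenate entity name + representation mnemonic + zero-padded offset."""
    return f"{entity}{mnemonic}{offset_seconds:04d}"


def split_anchor_id(anchor_id):
    """Split an id into (entity name, mnemonic, offset in seconds)."""
    for mnemonic in MNEMONICS:
        m = re.fullmatch(rf"(.+?){mnemonic}(\d+)", anchor_id)
        if m:
            return m.group(1), mnemonic, int(m.group(2))
    raise ValueError(f"unrecognised anchor id: {anchor_id}")


def segments_by_offset(text):
    """Map each anchor's offset to the text between it and the next anchor."""
    # With one capturing group, split() returns
    # [text before first anchor, id1, text1, id2, text2, ...].
    parts = ANCHOR_RE.split(text)
    segments = {}
    for anchor_id, body in zip(parts[1::2], parts[2::2]):
        _, _, offset = split_anchor_id(anchor_id)
        segments[offset] = body.strip()
    return segments


# Shortened excerpts of the two passages shown above.
ortho = ('<anchor id="tlsg01necteortho0020"/>where do you mean by that '
         '<anchor id="tlsg01necteortho0040"/>')
phon = ('<anchor id="tlsg01phonetic0020"/>02081 02301 08580 02322 '
        '<anchor id="tlsg01phonetic0040"/>')

ortho_segments = segments_by_offset(ortho)
phon_segments = segments_by_offset(phon)

# Pair up the segments that share a time offset in both representations.
aligned = {
    offset: (ortho_segments[offset], phon_segments[offset])
    for offset in sorted(ortho_segments.keys() & phon_segments.keys())
}
```

Because the prefix differs between representations while the offset does not, stripping the prefix recovers a key that is shared across all the <text>s of an interview. In this sketch the final anchor of each excerpt maps to an empty segment, since no text follows it.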

Note that, for the Newcastle group of speakers 'tlsn01' to 'tlsn07', it has not been possible to time-align the phonetic transcriptions, since the audio recordings for this group have not survived and there is consequently no basis for alignment. The formatting of numerical codes therefore differs from that in the other TLS-based files, where the codes are in a continuous sequence with interspersed <anchor>s. For the Newcastle files, the numerical codes are arranged in a sequence of code-strings, each terminated by a line-break, where each code-string in the sequence corresponds to a single informant utterance. The motivation for this formatting was to facilitate re-ordering of the codes if this is ever undertaken in future, for example if the audio files or the index card sets for the Newcastle group should ever come to light.
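A minimal sketch of reading the line-based Newcastle layout follows. It assumes that the codes within a line are separated by whitespace, as in the passage quoted earlier; the function name utterance_codes is hypothetical, and the sample lines are an invented arrangement of codes reused from that passage purely for illustration, not data from a Newcastle file.

```python
def utterance_codes(block):
    """Return one list of codes per non-empty line (one line = one utterance)."""
    return [line.split() for line in block.splitlines() if line.strip()]


# Invented arrangement for illustration only: two utterances, one per line.
sample = "02081 02301 08580\n02322 01443\n"
utterances = utterance_codes(sample)
```

Keeping each utterance on its own line means that utterances can be re-ordered by moving whole lines, without touching the codes inside them.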