THE NEWCASTLE ELECTRONIC CORPUS OF TYNESIDE ENGLISH
Documentation: alignment within <text>
Within the <group>-based structure of the interview <text>, the real-time alignment scheme described elsewhere in this website is implemented using the <anchor> tag (Guidelines 14.3). In each tag, the 'id' attribute specifies a real-time offset from the start of the operating system audio file in question. For example, <anchor id='0060'/> in the NECTE orthographic transcription of the audio file for an interview would mark a place in the text corresponding to a time offset of 60 seconds in the audio file. If, for a given interview, such markers are inserted at the correct places not only in the NECTE orthographic but also in the TLS orthographic, phonetic, and tagged representations, then an XML processor can use them to align all the <text>s in an interview <group>.
This alignment mechanism requires careful attention to the values assigned to the 'id' attribute. The TEI specification of the <anchor> tag states that the value assigned to its 'id' attribute 'may be chosen freely provided that it is unique within the document' (Guidelines 35). This condition is violated if any given offset value, say '0060', is assigned in more than one <text> in an interview <group>, as it would have to be for the alignment to work. To compound the problem, any such offset value would be non-unique not only within a given interview <text>, but also across the interview <text>s that comprise the corpus.
To overcome this difficulty, a unique string prefix is attached to the offset value of each <anchor> instance in the corpus, thus guaranteeing overall uniqueness. The 'id' of each <anchor> in any given interview is constructed by concatenation: the entity name for that interview in the entity list defined by <!ENTITY % interviews SYSTEM 'interviews.ent'> %interviews; in the DOCTYPE declaration + a mnemonic string for the representational type + the time offset string.
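The concatenation scheme can be illustrated with a minimal Python sketch. It is not part of the NECTE tooling; the function names are hypothetical, and the four-digit zero-padded offset and the shape of the entity name (letters followed by digits) are assumptions inferred from the examples in this documentation.

```python
import re

def build_anchor_id(entity: str, rep_type: str, offset_seconds: int) -> str:
    """Concatenate entity name + representation mnemonic + time offset.

    Assumption: offsets are written as four zero-padded digits,
    as in the documentation's examples '0060' and '0340'.
    """
    return f"{entity}{rep_type}{offset_seconds:04d}"

# Assumption: entity names are lowercase letters followed by digits
# (e.g. 'tlsg01'), and the mnemonic consists of lowercase letters only.
_ID_PATTERN = re.compile(
    r"^(?P<entity>[a-z]+\d+)(?P<rep_type>[a-z]+)(?P<offset>\d{4})$"
)

def parse_anchor_id(anchor_id: str) -> tuple[str, str, int]:
    """Split an 'id' value back into (entity, mnemonic, offset in seconds)."""
    match = _ID_PATTERN.match(anchor_id)
    if match is None:
        raise ValueError(f"not a recognised anchor id: {anchor_id!r}")
    return match["entity"], match["rep_type"], int(match["offset"])

# The same offset in two representations of one interview yields distinct ids.
ids = [
    build_anchor_id("tlsg01", "necteortho", 60),
    build_anchor_id("tlsg01", "phonetic", 60),
]
print(ids)
print(parse_anchor_id("tlsg21phonetic0340"))
```

Because the entity name and the mnemonic differ for every <text> in the corpus, two <anchor>s can share an offset without sharing an 'id', while the offset remains recoverable from the end of the string.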
Thus, for the interview whose entity name is 'tlsg01', for the NECTE orthographic representation, for time offset '0060', the <anchor> looks like this: <anchor id='tlsg01necteortho0060'/>. For the interview whose entity name is 'tlsg21', for the phonetic representation, for time offset '0340', the <anchor> looks like this: <anchor id='tlsg21phonetic0340'/>, and so on. In practice, <anchor> tags for a randomly chosen passage of NECTE orthographic transcription and the corresponding phonetic transcription look like this, with relevant tags highlighted:
Note that, for the Newcastle group of speakers 'tlsn01' to 'tlsn07', it has not been possible to time-align the phonetic transcriptions, since the audio recordings for this group have not survived and there is consequently no basis for alignment. The formatting of numerical codes therefore differs from that in the other TLS-based files, where the codes are in a continuous sequence with interspersed <anchor>s. For the Newcastle files, the numerical codes are arranged in a sequence of code-strings, each terminated by a line-break, where each code-string in the sequence corresponds to a single informant utterance. The motivation for this formatting was to facilitate re-ordering of the codes if this is ever undertaken in future: if, for example, the audio files or the index card sets for the Newcastle group should ever come to light.
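By contrast, where <anchor>s are present, an XML processor can group the text that follows each marker by time offset. The following Python sketch shows one way this might be done with the standard library. The sample markup, its placeholder content, and the function name are hypothetical, and the four-digit offset suffix is an assumption based on the examples given earlier; this is not the processing actually used for the corpus.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, much-simplified interview <group>: placeholder words stand in
# for the real orthographic and phonetic content.
SAMPLE = """
<group>
  <text>alpha <anchor id="tlsg01necteortho0000"/> beta <anchor id="tlsg01necteortho0060"/> gamma</text>
  <text>111 <anchor id="tlsg01phonetic0000"/> 222 <anchor id="tlsg01phonetic0060"/> 333</text>
</group>
"""

def align_by_offset(group_xml: str, entity: str) -> dict[int, dict[str, str]]:
    """Map each time offset to the text following its anchor, per representation.

    Assumption: every id is entity name + mnemonic + four-digit offset.
    Only the text immediately after each anchor (its 'tail') is collected,
    so nested markup between anchors is ignored in this sketch.
    """
    root = ET.fromstring(group_xml.strip())
    aligned: dict[int, dict[str, str]] = defaultdict(dict)
    for anchor in root.iter("anchor"):
        anchor_id = anchor.get("id", "")
        if not anchor_id.startswith(entity):
            continue  # anchor belongs to a different interview
        rep_type = anchor_id[len(entity):-4]  # mnemonic between prefix and offset
        offset = int(anchor_id[-4:])          # offset in seconds
        aligned[offset][rep_type] = (anchor.tail or "").strip()
    return dict(aligned)

aligned = align_by_offset(SAMPLE, "tlsg01")
for offset in sorted(aligned):
    print(offset, aligned[offset])
```

A representation without <anchor>s, such as the Newcastle phonetic files, would simply contribute no entries to the result.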