THE NEWCASTLE ELECTRONIC CORPUS OF TYNESIDE ENGLISH |
Documentation:
content representation The NECTE content is provided in several types of representation: audio, orthographic transcription, part-of-speech tagged orthographic transcription, and phonetic transcription. 1. Audio The primary representation of both TLS and PVC interviews is audio-recorded speech. The TLS recordings are on analog reel-to-reel tape dating from the late 1960s and on analog tape cassettes taken from the original tapes in 1994-5. Because they are re-recordings, the cassettes are inevitably of lower quality than the originals. The originals have, however, deteriorated quite badly, especially in recent years, and the cassettes in many cases provide information no longer recoverable from the reel-to-reel tapes. All the TLS recordings included in NECTE were digitized from the cassette versions in .wav format at 12000 Hz 16-bit mono, enhanced by amplitude adjustment, graphic equalisation, clip and hiss elimination, and regularisation of speed. All PVC recordings were digitised in .wav format direct from the original DAT tapes. 2. Orthographic transcription NECTE created a complete orthographic transcription of the TLS and PVC audio recordings. In doing the transcription, NECTE was guided by protocols used in comparable and successful projects such as Poplack (1989), Poplack and Tagliamonte (1991), and Tagliamonte (2004). The transcription process consisted of four passes through the audio files, where:
a) Capitalization and punctuation Capitalization and punctuation are syntactic markers. To avoid pre-judging discourse structure, they have not been used in transcription. b) Spelling As a general principle, NECTE aimed to use Standard English orthography for transcription, where 'Standard English' is taken to mean spellings of words found in the Oxford English Dictionary. This implies that, where British and US conventions differ, British conventions are used, e.g. 'colour', 'theatre', 'legalise', 'traveller'. Because TLS and PVC are dialect corpora, however, they contain frequent morphological and lexical segments for which no Standard English spelling exists. In such cases:
NECTE also includes the orthographic transcriptions produced by the TLS project. These are included partly for historical reasons, and partly because they sometimes have readings which were no longer recoverable by the NECTE transcribers on account of deterioration of the audio tapes. Electronic copies of the orthographic transcription text on the index cards were made, and the copies were proof-read relative to the cards. No changes of any kind, including corrections, were made. Note that TLS only transcribed the interviewees' utterances, ignoring the interviewer entirely. 3. Part-of-speech tagged orthographic transcription The NECTE (that is, not the TLS) orthographic transcriptions of the TLS and the PVC audio were part-of-speech tagged by the University Centre for Computer Corpus Research on Language (UCREL) at the University of Lancaster, UK, using the CLAWS4 tagger. Interpretation of the results is via the listing of the C8 tagset provided in Appendix 3, with permission from UCREL. 4 Phonetic transcription NECTE includes partial phonetic transcriptions of the TLS and PVC interviews. These, particularly the TLS phonetic transcriptions, require some detailed discussion. a) TLS phonetic transcriptions To realize its main research aim, TLS had to compare the audio interviews it had collected at the phonetic level of representation. This required the analog speech signal to be discretized into phonetic segment sequences, or, in other words, to be phonetically transcribed. The standard way of doing this is to select a transcription scheme, that is, a set of symbols each of which represents a single phonetic segment (for example, the International Phonetic Alphabet, or IPA), and then to partition the linguistically-relevant parts of the analog audio stream such that each partition is assigned a phonetic symbol. The result is a set of symbol strings each of which represents the corresponding interview phonetically. These strings can then be compared and, if also given a digital electronic representation, the comparison can be done computationally. TLS generated phonetic transcriptions of a substantial part of its audio materials, and they are included in the NECTE corpus, but to make them usable in the NECTE context they have required extensive restoration. This section describes (i) the TLS phonetic transcription scheme and (ii) the restoration of the TLS electronic phonetic files. i. TLS phonetic transcription and digital encoding schemes TLS made the simple, purely sequential transcription procedure described above the basis for a rather complex hierarchical scheme for representing the phonetics of its corpus. That scheme has to be understood if its phonetic data is to be competently interpreted, and it is consequently explained in detail. TLS developed its hierarchical phonetic transcription scheme in order to capture as much of the phonetic variability in the interviews as possible. To see this, consider what happens when data generated by a sequential transcription procedure is analyzed, and, more specifically, the transcribed interviews are compared. An obvious way to do the comparison is to count, for each interview, the number of times each of the phonetic symbols in the transcription scheme being used occurs; this yields a phonetic frequency profile for each of the interviews, and the resulting profiles can then be compared using a wide variety of methods. But such profiles fail to take into account a commonplace of variation between and among individual speakers and speaker groups: that different speakers and groups typically distribute the phonetics of their speech differently in different lexical environments. Frequency profiles of the sort in question here only say how many times each of the various speakers uses phonetic segment x without regard to the possibility that they distribute x differently over their lexical repertoires. The hierarchical TLS transcription scheme was designed to capture such distributional variation. The scheme is based on a way of specifying the phonemes of any given dialect which, following Wells (1982), has come to be known as ‘lexical sets’, and is fairly widely used in English dialectology and historical phonology. As its name indicates, a lexical set is a set of words. More specifically, it is the set of words that contains a specific phoneme, and thus gives an extensional definition of the phoneme –the set {ship, rib, dim, milk, slither, myth, pretty, build, women, busy…}, for example, defines the phoneme /i/ in Standard American and Received Pronunciation British English. The TLS hierarchical transcription scheme has three levels:
For example (Jones-Sargent (1983), 295):
The OU i: defined by the lexical set from which there are examples in the rightmost column can be realized by the phonetic segment symbols in the States column, and these symbols are grouped by phonetic relatedness in the PDV column. This transcription scheme captures the required distributional phonetic information by allowing any given State segment to realize more than one OU. In the above figure, note that several of the State symbols for OU i: occur also in the OU I. What this means is that, in the TLS transcription scheme, a State phonetic segment symbol represents not a distribution-independent sound, but a sound in relation to the phonemes over which it is distributed. The implications of this can be seen in the encoding scheme that the TLS developed for its transcription scheme so that its phonetic data could be computationally analyzed. Each State symbol is encoded as a five-digit integer. The first four digits of any given State symbol designate the PDV to which the symbol belongs, and the fifth digit indexes the specific State within that PDV. Thus, for the OU i: there are 6 PDVs, each of which is assigned a unique four-digit code; the specifics of which numbers are used are irrelevant, and could have been anything else. For a given PDV within the i: OU, say I, the first of the state symbols in left-to-right order is encoded as 00041, the second as 00042, and so on. Now, note that the State symbols 00023 and 00141 are identical, that is, they denote the same sound. Crucially, however, they have different codes because they realize different phonemes relative to OU –or, in other words, the different codes represent the phonemic distribution of the single sound that both the codes denote. The complete TLS phonetic transcription scheme is available in Appendix 1. ii. Restoration of the TLS phonetic transcriptions The phonetic transcriptions of the TLS interviews survive in two forms: as a collection of index cards, and as electronic files. Each electronic file is a sequence of the 5-digit codes just described; a random excerpt from one of these files looks like this (the ‘-&&&&’ sequences are end-of-line markers):
The electronic files initially appeared to be a labour and time saving alternative to keying-in the numerical codes from the index cards, but a peculiarity that stems from the original electronic data entry by the TLS meant that they had to be extensively edited. The problem arose from the way in which the 5-digit codes were laid out on the index cards:
For reasons that are no longer clear, all the consonant codes were written on one line, and all the vowel codes on the line or lines below. When the TLS gave these index cards to the University of Newcastle data entry service, the typists entered the codes line by line, with the result that, in any given electronic line, all the consonant codes come first, followed by the vowel codes. This problem pervades the TLS electronic phonetic transcription files. Simply to keep this ordering would have made the phonetic representation difficult to relate to the other types of representation. The TLS files were therefore edited with reference to the index cards so as to restore the correct code sequencing, and the result was proofread. The only exception to this restoration are the files for the Newcastle speakers. Because neither the audio recordings nor the index card sets for these speakers survive, restoration of the correct sequencing would have been a hugely time-consuming task, and one that could not be undertaken within the limited time available to the NECTE project. Even in their unordered state, however, these files are still usable for certain types of phonetic analysis such as ones that involve segment frequency counts, and they are included in NECTE in their present state for that reason. Moreover, the formatting of numerical codes in these files differs from that in the other TLS-based files, where the codes are in a continuous sequence. For the Newcastle files, the original TLS formatting has been retained: the numerical codes are arranged in a sequence of code-strings each of which is terminated by a line-break, where a code-string in the sequence corresponds to a single informant utterance. The motivation was to facilitate re-ordering of the codes if this is ever undertaken in future --if, for example, the audio files or the index card sets for the Newcastle group should ever come to light. It should, finally, be noted that the Gateshead TLS transcriptions were done exclusively by a single member of the project, Vince McNeany, who was both a trained phonetician and a native speaker of the Tyneside dialect of which the TLS corpus is a sample. This is important for analysis of the phonetic level because it minimizes the subjectivity and variation that inevitably compromises phonetic transcriptions. Who did that TLS Newcastle transcriptions is unknown, however. b) PVC phonetic transcriptions Sample phonetic transcriptions of the PVC materials are provided for comparison with the TLS phonetic transcriptions. These are far less extensive than TLS on account of the extremely time-consuming nature of the process. Consultation with socio-phonetic specialists (Gerard Docherty, Paul Foulkes, Paul Kerswill and Dom Watt) and potential end-users confirmed that most researchers whose primary interest was in phonetics would prefer to do their own analysis, so a decision was taken to provide only broad transcriptions of a startified sub-sample. The first five minutes of each of four PVC tapes was transcribed, giving samples of eight speakers in all: two middle class males; two middle class females; two working class males and two working class females. This was done at a ‘broad’ level, equivalent to the TLS ‘PDV’ level, using minimal diacritics. |