Content alignment

The DECTE project felt that the usefulness of its corpus would be enhanced by provision of an alignment mechanism to relate the representational types to one another, so that corresponding segments in the various types can be conveniently identified and simultaneously displayed — given some orthographic segment of interest, for example, what do the corresponding audio, part-of-speech tagged, and phonetic transcription segments look like?

The decision to provide an alignment mechanism immediately raises the question of granularity: how large should the alignment segments be? Should the levels be aligned phonetic segment by phonetic segment, word by word, sentence by sentence, or utterance by utterance?

The answer had to take into account research utility on the one hand, and feasibility in terms of cost on the other — is, say, word-by-word alignment useful enough from the point of view of potential research on the corpus to justify the human effort required to insert the necessary markers at so fine-grained a resolution?

For the TLS materials, the format of the interviews made alignment at the granularity of utterance the natural choice. An interview consists of a sequence of paired utterances (interviewer-question, interviewee-answer) in which the utterance boundaries are generally clear-cut; there is some degree of overlap on account of interruption and third-party intervention, but this is infrequent enough to be handled fairly straightforwardly within an utterance-aligned framework.

For the PVC and NECTE2 materials, however, the situation is very different. Interviews are less structured: the interviewer is frequently in the background, conversation is multi-speaker and free-form, and overlaps are the norm. Attempting to disentangle the speakers would on the one hand require very detailed markup (with consequent additional cost), and on the other would necessitate ad hoc decisions about conversational structure, thereby imposing an undesirable pre-analysis on the data.

Assuming the need for a uniform alignment mechanism across the entire corpus, it was clear that alignment at the utterance level was impractical. What were the alternatives?

More detailed alignment at the granularity of the phonetic segment or of the word was ruled out on account of excessive cost. So was alignment on the basis of syntactic unit, since this would have necessitated either manual syntactic markup, which would once again have been expensive, or provision of a reliable automatic parser for the highly non-standard English that the corpus contains, which to our knowledge does not exist.

This left alignment by real-time interval, which was the method that the DECTE team finally adopted.

Our real-time interval alignment mechanism works as follows. It begins with the observation that real time (that is, time as it is conceived by humans in day-to-day life) is meaningful only for the audio level of representation in the corpus; text, be it orthographic, tagged, or a sequence of phonetic symbols, has no temporal dimension. A time interval t is selected, and the audio level is partitioned into some number n of length-t audio segments s: s(t × 1), s(t × 2), …, s(t × n).

Corresponding markers are then inserted into the other levels of representation such that they demarcate substrings corresponding to the audio segments: for the audio segment s(t × i), for some i in the range 1 … n, there are markers in the other representational levels which identify the corresponding orthographic, phonetic, and part-of-speech tagged segments.

In this way, selection of any segment s in any level of representation allows the segments corresponding to s in all the other levels to be identified.
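
The mechanism just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the DECTE implementation: the function names, the dictionary layout, and the toy strings are all hypothetical, and it assumes that each textual level has already been split into chunks at its inserted markers and that a final partial audio segment is simply kept as a shorter segment.

```python
from math import ceil


def segment_bounds(duration: float, interval: float = 20.0) -> list[tuple[float, float]]:
    """Partition [0, duration] seconds into consecutive audio segments of
    length `interval`. Assumption: a trailing partial segment is kept."""
    n = ceil(duration / interval)
    return [(i * interval, min((i + 1) * interval, duration)) for i in range(n)]


def segment_index(time: float, interval: float = 20.0) -> int:
    """Return the 1-based index i of the audio segment containing `time`."""
    return int(time // interval) + 1


def corresponding(levels: dict[str, list[str]], index: int) -> dict[str, str]:
    """Return the chunk with the given 1-based index from every level.
    Each level is assumed to be a list of chunks split at its markers."""
    return {name: chunks[index - 1] for name, chunks in levels.items()}


# Hypothetical placeholder data: three 20-second chunks per textual level.
levels = {
    "orthographic": ["first chunk", "second chunk", "third chunk"],
    "tagged": ["<tagged 1>", "<tagged 2>", "<tagged 3>"],
    "phonetic": ["<phonetic 1>", "<phonetic 2>", "<phonetic 3>"],
}

bounds = segment_bounds(50.0)       # a 50-second recording gives 3 segments
i = segment_index(25.0)             # 25 s into the audio falls in segment 2
aligned = corresponding(levels, i)  # the matching chunk from every level
```

Because every level shares the same index i, a lookup can start from any level: an index found in the orthographic chunks selects the audio interval `bounds[i - 1]` just as an audio offset selects the text chunks.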

A time interval of 20 seconds was selected on the grounds of cost and usability. With regard to cost, it is clear that the shorter the interval, the greater the effort of marker insertion. The increase is more than linear as the interval shrinks: markup at a 1-second interval takes more than 20 times as long as markup at a 20-second interval, because of the simple mechanics of starting and stopping the audio stream in exactly the right place and then deciding where to put the markers in the other levels.

A 20-second interval was found to be a cost-effective choice. With regard to usability, 20-second chunks were found to yield about the right amount of aligned text on a typical computer screen when all the levels of representation are displayed simultaneously.
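
To make the cost argument concrete, the following sketch counts the markers that would be needed at different intervals. It is only a rough illustration under stated assumptions: the function name is hypothetical, the 30-minute recording length is an arbitrary example rather than a figure from the corpus, and one marker per segment is assumed in each of the three textual levels (orthographic, phonetic, and part-of-speech tagged).

```python
from math import ceil


def marker_count(duration_s: float, interval_s: float, text_levels: int = 3) -> int:
    """Number of markers to insert, assuming one marker per audio segment
    in each textual level (an assumption made for illustration)."""
    return ceil(duration_s / interval_s) * text_levels


recording = 30 * 60                     # hypothetical 30-minute recording
coarse = marker_count(recording, 20)    # markers at a 20-second interval
fine = marker_count(recording, 1)       # markers at a 1-second interval
ratio = fine / coarse                   # growth in marker count alone
```

The marker count by itself grows only by the factor of 20. The more-than-linear increase in effort reported above therefore reflects the per-marker overhead of positioning the audio stream and locating the corresponding points in the text, which this count does not model.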

It is, of course, a straightforward matter to move to a coarser granularity in multiples of 20 seconds if required, since this only involves ignoring intermediate markers. Finer granularity would, however, require insertion of additional markers at the appropriate places in all levels of representation.
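
Moving to a coarser granularity amounts to merging adjacent 20-second chunks, as the following sketch suggests. The function name and the representation of a level as a list of strings are hypothetical, and joining chunks with a single space is an assumption that suits orthographic text but might need adjusting for other levels.

```python
def coarsen(chunks: list[str], factor: int) -> list[str]:
    """Merge every `factor` adjacent chunks into one, so that 20-second
    chunks become (20 * factor)-second chunks. A shorter final group is kept."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [" ".join(chunks[i:i + factor]) for i in range(0, len(chunks), factor)]


# Five 20-second chunks merged into 40-second chunks.
merged = coarsen(["a", "b", "c", "d", "e"], 2)
```

Applying the same factor to every level keeps the levels aligned, because the merged chunks still share a common index. No comparable operation exists in the other direction: splitting a chunk requires new markers, which is why finer granularity entails further manual work.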

Implementation of this time-interval mechanism in TEI-conformant XML is described elsewhere on this site.