Tagging

Documentation: part-of-speech tags

Part-of-speech tagging assigns class labels to the morphemes of an utterance or a sequence of utterances. As noted in the overview of content representation, NECTE provides a part-of-speech tagged representation of the corpus content, and in particular of the NECTE orthographic transcriptions of the TLS and PVC audio interviews. The tagging was carried out by the University Centre for Computer Corpus Research on Language (UCREL) at the University of Lancaster, UK, using the CLAWS4 tagger and the C8 tagset.

Given a randomly selected NECTE passage, the CLAWS4 tagger generates the following output:

<w id="2.2" pos="UH" lem="well">well</w>
<w id="2.3" pos="VM" lem="could">could</w>
<w id="2.4" pos="PPY" lem="you">you</w>
<w id="2.5" pos="VVI" lem="tell">tell</w>
<w id="2.6" pos="PPIO2" lem="we">us</w>
<w id="2.7" pos="MD" lem="first">first</w>

where:

the <w> tag marks a morpheme.
the 'id' attribute identifies, for a given morpheme token, the line number of the text in which the morpheme occurs, followed by its positional index in the line. Thus 'you' in the above example occurs in line 2, and is the fourth morpheme in that line.
the 'pos' attribute states the class label of the morpheme as detailed in the C8 tagset.
the 'lem' attribute states the lemma of the morpheme, that is, its uninflected form as it would appear in a standard dictionary.
the morpheme between the <w> and </w> tags is the form that actually occurs in the text.
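
As a rough illustration of how these attributes can be read programmatically, the following Python sketch splits a single CLAWS4 <w> element into its parts. The function name parse_claws_word and the dictionary keys it returns are hypothetical, chosen for this example only; the sketch assumes the element is well-formed XML and relies on the standard-library xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET


def parse_claws_word(xml_fragment):
    """Split one CLAWS4 <w> element into its fields (illustrative helper)."""
    w = ET.fromstring(xml_fragment)
    # The 'id' value has the form "<line>.<position>", e.g. "2.4".
    line, index = w.get("id").split(".")
    return {
        "line": int(line),      # line number of the text
        "index": int(index),    # positional index within that line
        "pos": w.get("pos"),    # C8 class label
        "lemma": w.get("lem"),  # uninflected dictionary form
        "form": w.text,         # form that actually occurs in the text
    }


print(parse_claws_word('<w id="2.4" pos="PPY" lem="you">you</w>'))
```

For the 'you' token above, the helper reports line 2 and positional index 4, matching the reading of the 'id' attribute given in the list.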

NECTE has emended the CLAWS4 output format in two ways:

1. The NECTE corpus structure does not partition the various types of content into lines, so the 'id' attribute is not only superfluous but also potentially misleading relative to the <anchor>-based alignment system. The 'id' attribute is therefore deleted.

2. The CLAWS4 output is not TEI-conformant. The <w> element is included in TEI to represent a single morpheme in part-of-speech tagging (Guidelines 15.1), but neither the 'pos' nor the 'lem' attribute is defined for <w> in TEI. There are two ways to establish TEI conformance. One is to emend the TEI DTD to include these attributes. The other is to substitute the corresponding TEI attribute names 'type' and 'lemma' for 'pos' and 'lem' in the tagged text. NECTE chose the latter to avoid having to emend the TEI standard, which is after all a standard.

Applying these changes, the above passage now looks like this:

<w type="UH" lemma="well">well</w>
<w type="VM" lemma="could">could</w>
<w type="PPY" lemma="you">you</w>
<w type="VVI" lemma="tell">tell</w>
<w type="PPIO2" lemma="we">us</w>
<w type="MD" lemma="first">first</w>
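
The two emendations can be sketched in a few lines of Python. This is not the procedure NECTE actually used, which the documentation does not describe; it is a minimal illustration that assumes each <w> element is available as a well-formed XML string. The names RENAMES and claws_to_necte are hypothetical.

```python
import xml.etree.ElementTree as ET

# CLAWS4 attribute name -> TEI attribute name. 'id' is deliberately absent,
# so it is dropped during conversion.
RENAMES = {"pos": "type", "lem": "lemma"}


def claws_to_necte(xml_fragment):
    """Convert one CLAWS4 <w> element to the emended form (illustrative)."""
    w = ET.fromstring(xml_fragment)
    # Keep only the attributes that have a TEI counterpart, under their new names.
    new_attrs = {RENAMES[k]: v for k, v in w.attrib.items() if k in RENAMES}
    w.attrib.clear()
    w.attrib.update(new_attrs)
    return ET.tostring(w, encoding="unicode")


print(claws_to_necte('<w id="2.6" pos="PPIO2" lem="we">us</w>'))
```

Driving the conversion from a mapping table keeps the rule explicit: any attribute not listed, such as 'id', is removed, and the text content of the element is left untouched.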