Part-of-speech tags
Part-of-speech tagging assigns class labels to the words or morphemes of an utterance or a sequence of utterances. As noted in the overview of content representation, DECTE provides a part-of-speech tagged representation of the corpus content, and in particular of the DECTE orthographic transcriptions of the TLS and PVC audio interviews.
The tagging was carried out as part of the earlier NECTE project (2001-2005) by the University Centre for Computer Corpus Research on Language (UCREL) at the University of Lancaster, UK, using the CLAWS4 tagger.
Given a randomly selected DECTE passage, the CLAWS4 tagger generates the following output:
<w id="2.2" pos="UH" lem="well">well</w>
<w id="2.3" pos="VM" lem="could">could</w>
<w id="2.4" pos="PPY" lem="you">you</w>
<w id="2.5" pos="VVI" lem="tell">tell</w>
<w id="2.6" pos="PPIO2" lem="we">us</w>
<w id="2.7" pos="MD" lem="first">first</w>
where:
- the <w> tag marks a word
- the @id attribute locates each word token: the number before the point is the line of the text in which the word occurs, and the number after it is the word's positional index in that line. Thus the word you in the above example (id="2.4") occurs in line 2, and is the fourth word in that line
- the @pos attribute states the class label of the word as detailed in the CLAWS C8 tagset
- the @lem attribute states the lemma of the word, that is, its uninflected form as it would appear in a standard dictionary
- the word that appears between the opening <w> tag and the closing </w> tag is the form that actually occurs in the text.
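The attribute layout described above can be read programmatically. The following is a minimal Python sketch, not part of the DECTE or CLAWS4 tooling: the function name parse_claws and the wrapping <passage> element are assumptions introduced only so that the sample, which has no single root element, parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# The sample passage shown above, as emitted by CLAWS4.
CLAWS_SAMPLE = """\
<w id="2.2" pos="UH" lem="well">well</w>
<w id="2.3" pos="VM" lem="could">could</w>
<w id="2.4" pos="PPY" lem="you">you</w>
<w id="2.5" pos="VVI" lem="tell">tell</w>
<w id="2.6" pos="PPIO2" lem="we">us</w>
<w id="2.7" pos="MD" lem="first">first</w>
"""


def parse_claws(fragment):
    """Turn a run of CLAWS4 <w> elements into a list of dictionaries."""
    # <passage> is a hypothetical wrapper element, added only so the
    # fragment has a single root and can be parsed as XML.
    root = ET.fromstring(f"<passage>{fragment}</passage>")
    tokens = []
    for w in root.iter("w"):
        # @id has the form "line.position", e.g. "2.4".
        line, position = w.attrib["id"].split(".")
        tokens.append({
            "line": int(line),
            "position": int(position),
            "pos": w.attrib["pos"],    # class label from the tagset
            "lemma": w.attrib["lem"],  # uninflected dictionary form
            "form": w.text,            # form that occurs in the text
        })
    return tokens


tokens = parse_claws(CLAWS_SAMPLE)
for token in tokens:
    print(token)
```

The sketch assumes that every <w> element carries all three attributes and that @id always contains exactly one point; input that departs from this would need extra handling.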
DECTE has emended the CLAWS4 output format in two ways:
1. The corpus does not partition the various types of DECTE content into lines, so the @id attribute is not only superfluous but also potentially misleading relative to the <anchor>-based alignment system. The @id attribute is therefore deleted.
2. The CLAWS4 output is not TEI-conformant. The <w> element is included in TEI to represent 'a grammatical (not necessarily orthographic) word' (TEI Guidelines 17.1.1), but neither the @pos nor the @lem attributes are defined for <w> in the TEI scheme.
To establish TEI conformance, one approach is to emend the TEI DTD to include these attributes. The other is to substitute the corresponding TEI attributes @type and @lemma for the CLAWS @pos and @lem in the part-of-speech tagged text.
DECTE chose the latter, since emending the TEI scheme would defeat the purpose of conforming to a standard.
Applying these changes, the example passage above now looks like this:
<w type="UH" lemma="well">well</w>
<w type="VM" lemma="could">could</w>
<w type="PPY" lemma="you">you</w>
<w type="VVI" lemma="tell">tell</w>
<w type="PPIO2" lemma="we">us</w>
<w type="MD" lemma="first">first</w>
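The two emendations amount to a simple per-element rewrite: drop @id, rename @pos to @type, and rename @lem to @lemma. The Python sketch below illustrates that mapping only; it is not the procedure DECTE actually used, and the function name claws_to_decte and the wrapping <passage> element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Two elements from the CLAWS4 sample shown earlier.
CLAWS_FRAGMENT = """\
<w id="2.2" pos="UH" lem="well">well</w>
<w id="2.6" pos="PPIO2" lem="we">us</w>
"""


def claws_to_decte(fragment):
    """Rewrite CLAWS4 <w> elements in the emended DECTE form."""
    # <passage> is a hypothetical wrapper so the fragment has one root.
    root = ET.fromstring(f"<passage>{fragment}</passage>")
    converted = []
    for w in root.iter("w"):
        # Build a fresh element: @id is dropped, @pos becomes @type,
        # and @lem becomes @lemma. A missing attribute raises KeyError.
        new_w = ET.Element("w")
        new_w.set("type", w.attrib["pos"])
        new_w.set("lemma", w.attrib["lem"])
        new_w.text = w.text
        converted.append(ET.tostring(new_w, encoding="unicode"))
    return "\n".join(converted)


decte_xml = claws_to_decte(CLAWS_FRAGMENT)
print(decte_xml)
```

Building a new element for each word, rather than editing attributes in place, keeps the output limited to the two attributes the emended format retains.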