THE NEWCASTLE ELECTRONIC CORPUS OF TYNESIDE ENGLISH
Documentation: part-of-speech tags

Part-of-speech tagging assigns class labels to the morphemes of an utterance or a sequence of utterances. As noted in the overview of content representation, NECTE provides a part-of-speech tagged representation of the corpus content, and in particular of the NECTE orthographic transcriptions of the TLS and PVC audio interviews. The tagging was carried out by the University Centre for Computer Corpus Research on Language (UCREL) at the University of Lancaster, UK, using the CLAWS4 tagger and the C8 tagset. Given a randomly-selected NECTE passage, the CLAWS4 tagger generates the following output:
where:
NECTE has emended the CLAWS4 output format in two ways:

1. The structuring of the corpus does not feature partition of the various types of NECTE content into lines, and the 'id' attribute is consequently not only superfluous but also potentially misleading relative to the <anchor>-based alignment system. The 'id' attribute is therefore deleted.

2. The CLAWS4 output is not TEI-conformant. The <w> element is included in TEI to represent a single morpheme in part-of-speech tagging (Guidelines 15.1), but neither the 'pos' nor the 'lem' attributes are defined for <w> in TEI. To establish TEI conformance, one approach is to emend the TEI DTD to include these attributes. The other is to substitute the corresponding TEI attribute names 'type' and 'lemma' for 'pos' and 'lem' in the tagged text. NECTE chose the latter to avoid having to emend the TEI standard, which is after all a standard.

Applying these changes, the above passage now looks like this:
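
For readers who want to see the two emendations as an operation on markup, the following is a minimal sketch in Python. It is not the procedure NECTE actually used, and its sample input is invented for illustration: it is not the NECTE passage referred to above, and the <u> wrapper, the tag values, and the placement of 'id' on <w> are all assumptions. Only the attribute names 'id', 'pos', 'lem', 'type' and 'lemma' come from the description above. The function name emend is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, NOT actual CLAWS4 output. The <u> wrapper, the tag
# values and the placement of 'id' on <w> are assumptions for illustration.
SAMPLE = (
    '<u>'
    '<w id="1.1" pos="PPIS1" lem="i">I</w> '
    '<w id="1.2" pos="VV0" lem="know">know</w>'
    '</u>'
)

# CLAWS4 attribute name -> TEI attribute name, as described in emendation 2.
RENAMES = {"pos": "type", "lem": "lemma"}


def emend(xml_text: str) -> str:
    """Apply the two emendations to a well-formed XML fragment.

    1. Delete every 'id' attribute.
    2. On <w> elements, rename 'pos' to 'type' and 'lem' to 'lemma'.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Emendation 1: drop 'id' wherever it occurs (no error if absent).
        elem.attrib.pop("id", None)
        # Emendation 2: rename the non-TEI attributes on <w> only.
        if elem.tag == "w":
            for old, new in RENAMES.items():
                if old in elem.attrib:
                    elem.set(new, elem.attrib.pop(old))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(emend(SAMPLE))
```

Renaming attributes in the tagged text, rather than extending the DTD, keeps the change local to the corpus files, which mirrors the choice described in point 2.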