

The corpus







Documentation: document type definition

A valid XML document, as opposed to a merely well-formed one, must include a document type definition (DTD) against which it can be validated. This is done by means of a document type (DOCTYPE) declaration, which specifies the elements, attributes, and so on used in the document. The specification can be internal, in that its components appear lexically within the DOCTYPE declaration; external, in that the DOCTYPE declaration names one or more files containing the specification; or a combination of the two. For further information on DTDs see Guidelines 2.4; 2.10; Guidelines 3.
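As a minimal sketch of the internal form, the following document carries its whole DTD inside the DOCTYPE declaration. The element and attribute names are hypothetical and have nothing to do with TEI or NECTE. The external form would instead read <!DOCTYPE note SYSTEM "note.dtd">, with the same declarations moved into that (equally hypothetical) file, and the combined form gives both a file name and a bracketed subset.

```xml
<?xml version="1.0"?>
<!-- Internal subset: every declaration appears lexically inside the DOCTYPE. -->
<!DOCTYPE note [
  <!-- "note" must contain exactly one "to" followed by one "body" -->
  <!ELEMENT note (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
  <!-- optional attribute on the root element -->
  <!ATTLIST note lang CDATA #IMPLIED>
]>
<note lang="en">
  <to>Reader</to>
  <body>This document is valid against its internal subset.</body>
</note>
```

A validating parser would reject this document if, for example, the "to" element were removed; a parser that checks only for well-formedness would accept it either way.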

The NECTE document "necte.xml" has been validated using the Topologi Schematron Validator, which was downloaded from

The NECTE DOCTYPE declaration looks like this:

<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!ENTITY % TEI.XML 'INCLUDE'>
<!ENTITY % TEI.mixed 'INCLUDE'>
<!ENTITY % TEI.spoken 'INCLUDE'>
<!ENTITY % TEI.linking 'INCLUDE'>
<!ENTITY % TEI.analysis 'INCLUDE'>
<!ENTITY % TEI.figures 'INCLUDE'>
<!ENTITY % TEI.corpus 'INCLUDE'>
<!NOTATION wav SYSTEM "wmplayer.exe">
<!NOTATION jpg SYSTEM "iexplorer.exe">
<!ENTITY % interviews SYSTEM "interviews.ent"> %interviews;
<!ENTITY % audiofiles SYSTEM "audiofiles.ent"> %audiofiles;
<!ENTITY % graphics SYSTEM "graphics.ent"> %graphics;
<!ENTITY % urls SYSTEM "urls.ent"> %urls;
<!ENTITY % emails SYSTEM "emails.ent"> %emails;
]>

  • <!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [ ]> is the DOCTYPE declaration, in which:

-- TEI.2 names the root element of the document or documents to which the DTD applies. As we shall see, the NECTE corpus consists of a sequence of XML documents, each of which has a root element TEI.2; the DOCTYPE declaration applies to them all.

-- SYSTEM "tei2.dtd" says that the required DTD definitions are available locally in the file "tei2.dtd" (Guidelines 3.6). This file contains the full TEI DTD, and is provided free by the TEI Consortium. NECTE has downloaded this file and distributes it with the rest of the NECTE materials, rather than referring to it at a remote site, so that the corpus is usable on a standalone computer and convenient for users with a slow internet connection.

-- The square brackets [ ] enclose declarations that select parts of the DTD definitions contained in "tei2.dtd", followed by a sequence of NECTE-specific NOTATION and ENTITY declarations.

  • The ENTITY declarations from <!ENTITY % TEI.XML 'INCLUDE'> to <!ENTITY % TEI.corpus 'INCLUDE'> select the relevant parts of the full DTD provided by TEI. The TEI DTD is partitioned into fragments that can be selected according to the requirements of a particular application, so that the entire DTD need not be included where its full range is not required. For more information on this selection mechanism see Guidelines 2.8.2 and Guidelines 3.

-- <!ENTITY % TEI.XML 'INCLUDE'> selects the XML rather than the SGML version of the TEI DTD, and with it the core tag set that every TEI-conformant document must have access to; see Guidelines 6.

-- <!ENTITY % TEI.mixed 'INCLUDE'> selects the mixed base fragment. TEI regards corpora as 'mixed' documents that transcend conceptually unitary types like verse or prose, and that consequently require tags from a selection of DTD fragments. The mixed base requires that all DTD fragments that are to constitute the mix be specified; these are given below.

-- <!ENTITY % TEI.spoken 'INCLUDE'> defines tags appropriate to spoken / audio corpora; see Guidelines 11.

-- <!ENTITY % TEI.linking 'INCLUDE'> defines tags for linking and aligning documents; see Guidelines 14.

-- <!ENTITY % TEI.analysis 'INCLUDE'> defines 'a tag set for associating simple analyses and interpretations with text elements', including grammatical markup; see Guidelines 15.

-- <!ENTITY % TEI.figures 'INCLUDE'> defines a tag set for referring to graphics declared as external entities; see Guidelines 22.

-- <!ENTITY % TEI.corpus 'INCLUDE'> defines corpus-specific tags; see Guidelines 23.

  •  NOTATION declarations: Non-text entities such as graphics and sound can be embedded in XML documents, but instructions on how these are to be dealt with must be provided for XML processors. This is done using NOTATION declarations (Guidelines 2.7.4; Guidelines 22.3). In the NECTE document, audio and graphics are required: <!NOTATION wav SYSTEM "wmplayer.exe"> says that any audio files are in '.wav' format and are played using the Microsoft Windows Media Player©, and <!NOTATION jpg SYSTEM "iexplorer.exe"> that any graphics files are in '.jpg' format and are viewed using Microsoft Internet Explorer©. These applications were selected on account of their ubiquity; users are, of course, at liberty to alter this arrangement if desired.

  • <!ENTITY % interviews SYSTEM "interviews.ent"> %interviews; declares a file of entity declarations, "interviews.ent", and inserts it into the DOCTYPE declaration by means of the "%interviews;" reference (Guidelines 2.7; Guidelines 22.3). The entity declarations in this file are used to insert the interviews into the corpus, as described in the discussion of the document instance.

  • <!ENTITY % graphics SYSTEM "graphics.ent"> %graphics; declares a file of entity declarations, "graphics.ent", and inserts it into the DOCTYPE declaration by means of the "%graphics;" reference (Guidelines 2.7; Guidelines 22.3). The entity declarations in this file are used in the global header of the NECTE document to refer to figures in ".jpg" format.

  • <!ENTITY % urls SYSTEM "urls.ent"> %urls; declares a file of entity declarations, "urls.ent", and inserts it into the DOCTYPE declaration by means of the "%urls;" reference (Guidelines 2.7; Guidelines 22.3). The entity declarations in this file are used in the global header of the NECTE document to refer to external websites.

  • <!ENTITY % emails SYSTEM "emails.ent"> %emails; declares a file of entity declarations, "emails.ent", and inserts it into the DOCTYPE declaration by means of the "%emails;" reference (Guidelines 2.7; Guidelines 22.3). The entity declarations in this file are used in the global header of the NECTE document to refer to email addresses.
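The mechanisms described in the list above can be sketched in miniature. Everything in the sketch is hypothetical: the file names "mini.dtd" and "clips.ent", the entity names mini.spoken, clips and clip01, and the elements corpus, u and clip are invented for illustration. It is not an excerpt from "tei2.dtd" or from the NECTE ".ent" files, whose contents are not shown in this section. It assumes only standard XML 1.0 behaviour: the bracketed declarations are read before the external DTD, and the first declaration of an entity is the one that counts.

```xml
<!-- ===== File 1: mini.dtd (hypothetical external DTD) ===== -->
<!-- Default setting: the optional fragment is switched off. A declaration
     of mini.spoken inside the document's square brackets is read first,
     so it overrides this one. -->
<!ENTITY % mini.spoken 'IGNORE'>
<!ELEMENT corpus (u | clip)*>
<![%mini.spoken;[
  <!-- Read only when mini.spoken has the value 'INCLUDE' -->
  <!ELEMENT u (#PCDATA)>
]]>
<!ELEMENT clip EMPTY>
<!-- The attribute value must be the name of an unparsed entity -->
<!ATTLIST clip source ENTITY #REQUIRED>

<!-- ===== File 2: clips.ent (hypothetical entity file) ===== -->
<!-- An unparsed entity: names a data file and the notation describing it -->
<!ENTITY clip01 SYSTEM "clip01.wav" NDATA wav>

<!-- ===== File 3: the document ===== -->
<!DOCTYPE corpus SYSTEM "mini.dtd" [
  <!ENTITY % mini.spoken 'INCLUDE'>       <!-- select the optional fragment -->
  <!NOTATION wav SYSTEM "wmplayer.exe">   <!-- how .wav data is to be handled -->
  <!ENTITY % clips SYSTEM "clips.ent">    <!-- declare the entity file ... -->
  %clips;                                 <!-- ... and insert its contents here -->
]>
<corpus>
  <u>hello</u>
  <clip source="clip01"/>
</corpus>
```

If the mini.spoken line were removed from the document, the default 'IGNORE' in "mini.dtd" would apply, the declaration of u would be skipped, and a validating parser would report the u element as undeclared.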