THE NEWCASTLE ELECTRONIC CORPUS OF TYNESIDE ENGLISH

Documentation: document prolog

The necte.xml prolog consists of an XML declaration and a document type declaration.

1. XML declaration

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

  • The version specification is standard and requires no comment.

  • The encoding specification says that the document is encoded in UTF-8, the Unicode encoding in which each character of universal 7-bit ASCII is represented by a single byte; see the Wikipedia entry on Unicode.

  • The standalone specification makes explicit that the document depends on declarations held outside it, in this case an external DTD.

2. Document type declaration

A valid, as opposed to merely well-formed, XML document must be associated with a DTD against which it can be validated. This is done by means of a document type or DOCTYPE declaration, which specifies the elements, attributes, and so on that may be used in the document. This specification can be internal, in the sense that its components appear lexically within the DOCTYPE declaration; external, in that the names of one or more files containing the specification are given in the DOCTYPE declaration; or a combination of the two. For further information on DTD definition see Guidelines 2.4; Guidelines 2.10; Guidelines 3.
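As a minimal sketch of the three possibilities, using a hypothetical <note> document and a hypothetical file note.dtd rather than anything taken from NECTE, the alternatives might look as follows. Each declaration would appear on its own in a separate document, since a document may contain only one DOCTYPE declaration:

```xml
<!-- (a) internal: the declarations appear inside the square brackets -->
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
]>

<!-- (b) external: the declarations are held in the file note.dtd -->
<!DOCTYPE note SYSTEM 'note.dtd'>

<!-- (c) combined: note.dtd is read as well as the bracketed declarations -->
<!DOCTYPE note SYSTEM 'note.dtd' [
  <!ENTITY corpusName 'an illustrative entity value'>
]>
```

The necte.xml declaration is of the combined kind (c).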

The necte.xml DOCTYPE declaration looks like this:

<!DOCTYPE teiCorpus.2 SYSTEM 'tei2.dtd' [

<!ENTITY % TEI.XML 'INCLUDE'>

<!ENTITY % TEI.extensions.ent SYSTEM 'necte_extensions.ent' >

<!ENTITY % TEI.spoken 'INCLUDE'>
<!ENTITY % TEI.mixed 'INCLUDE'>
<!ENTITY % TEI.linking 'INCLUDE'>
<!ENTITY % TEI.analysis 'INCLUDE'>
<!ENTITY % TEI.figures 'INCLUDE'>
<!ENTITY % TEI.corpus 'INCLUDE'>

<!NOTATION wav SYSTEM "wmplayer.exe">

<!ENTITY % interviews SYSTEM 'interviews.ent'> %interviews;

<!ENTITY % audiofiles SYSTEM 'audiofiles.ent'> %audiofiles;

]>

where:

  • <!DOCTYPE teiCorpus.2 SYSTEM 'tei2.dtd' [ ]> is the DOCTYPE declaration in which

-- teiCorpus.2 names the root element of the document to which the DTD applies. This is the name defined by the TEI as the root element name for language corpora (Guidelines 23).

-- SYSTEM 'tei2.dtd' says that the required DTD definitions are available locally via the file tei2.dtd. To understand the role of this file, one has to realize that the TEI DTD is partitioned into fragments that can be selected according to the requirements of particular applications, thus obviating the need to include the entire DTD in situations where its full range is not required; tei2.dtd is a 'driver' file (Guidelines 3.2) 'which refers to a number of other DTD files, the exact set of other files referred to depending on which base and which additional tagsets are in use' (Guidelines 3.6). For more information on this selection mechanism see Guidelines 2.8.2 and Guidelines 3.

-- the square brackets [ ] enclose the NECTE-specific selections from the TEI DTD and some NECTE-specific NOTATION and ENTITY declarations. These are described in what follows.

  • The ENTITY declarations from <!ENTITY % TEI.XML 'INCLUDE'> to <!ENTITY % TEI.corpus 'INCLUDE'> select the parts of the full TEI DTD that are relevant to NECTE, and also specify a single NECTE-specific modification to the DTD.

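-- <!ENTITY % TEI.XML 'INCLUDE'> selects the XML rather than the SGML form of the TEI DTD. The core tag set that every TEI-conformant document must have access to (Guidelines 6) is available by default and needs no declaration of its own.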

-- <!ENTITY % TEI.extensions.ent SYSTEM 'necte_extensions.ent' > is the NECTE-specific modification to the TEI DTD. The NECTE corpus includes audio files that must be embedded into its TEI-conformant encoding. The Guidelines do not, however, provide any obvious way of doing this, and so the mechanism for embedding graphics (Guidelines 22.3) was adapted by renaming the <figure> tag as <audio>. The procedure for this modification is given in Guidelines 29; necte_extensions.ent is the name of the file containing the required declaration.

-- <!ENTITY % TEI.mixed 'INCLUDE'> selects the mixed base fragment. TEI regards corpora as ‘mixed’ documents that transcend conceptually unitary types like prose, and that consequently require tags from a selection of DTD fragments. The mixed base requires that all DTD fragments that are to constitute the mix be specified; these are given below.

-- <!ENTITY % TEI.spoken 'INCLUDE'> defines tags appropriate to transcriptions of spoken language (Guidelines 11).

-- <!ENTITY % TEI.linking 'INCLUDE'> defines tags for linking and aligning documents (Guidelines 14).

-- <!ENTITY % TEI.analysis 'INCLUDE'> defines 'a tag set for associating simple analyses and interpretations with text elements', including grammatical markup (Guidelines 15).

-- <!ENTITY % TEI.figures 'INCLUDE'> defines a tag set for referring to graphics declared as external entities (Guidelines 22); this is required for the <audio> tag mentioned above.

-- <!ENTITY % TEI.corpus 'INCLUDE'> defines corpus-specific tags (Guidelines 23).

  • <!NOTATION wav SYSTEM "wmplayer.exe"> declares a notation for audio. Non-text entities such as graphics and sound can be embedded in XML documents, but instructions on how these are to be dealt with must be provided for XML processors. This is done using NOTATION declarations (Guidelines 2.7.4; Guidelines 22.3). In the NECTE document audio is required, and this declaration says that any audio files are in '.wav' format and are played using the Microsoft Windows Media Player. This application was selected on account of its ubiquity; NECTE users are, of course, at liberty to alter the specification if desired.

  • <!ENTITY % interviews SYSTEM 'interviews.ent'> %interviews; declares a file of entity declarations, 'interviews.ent', and inserts it into the DOCTYPE declaration by means of the '%interviews;' reference (Guidelines 2.7; Guidelines 22.3). The entity declarations in this file are used to insert the interviews into the corpus, as described in the discussion of the document instance.

  • <!ENTITY % audiofiles SYSTEM 'audiofiles.ent'> %audiofiles; declares a file of entity declarations, 'audiofiles.ent', and inserts it into the DOCTYPE declaration by means of the '%audiofiles;' reference (Guidelines 2.7; Guidelines 22.3). The entity declarations in this file are used to insert the non-text audio component of the interviews into the corpus, as described in the discussion of the interview structure.
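
The contents of the two entity files are not reproduced in this section. As a rough sketch of the kind of declaration such files could contain, and assuming that interviews are held as external parsed entities and audio files as unparsed entities bound to the wav notation, the arrangement might resemble the following. The entity and file names (interview01, audio01 and so on) are invented for illustration, and the entity attribute on <audio> is assumed to carry over from the TEI <figure> element:

```xml
<!-- interviews.ent (hypothetical content): external parsed entities -->
<!ENTITY interview01 SYSTEM 'interview01.xml'>
<!ENTITY interview02 SYSTEM 'interview02.xml'>

<!-- audiofiles.ent (hypothetical content): unparsed entities tied to the wav notation -->
<!ENTITY audio01 SYSTEM 'audio01.wav' NDATA wav>
<!ENTITY audio02 SYSTEM 'audio02.wav' NDATA wav>

<!-- document instance: parsed entities are expanded where they are referenced -->
<teiCorpus.2>
  <!-- corpus header omitted from this sketch -->
  &interview01;
  &interview02;
</teiCorpus.2>

<!-- inside an interview: an unparsed entity is named in an attribute value -->
<audio entity="audio01"/>
```

An unparsed entity such as audio01 cannot be referenced as &audio01; in running text; XML allows it to be named only in an attribute declared as type ENTITY or ENTITIES, which is why a tag is needed to carry it.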

The DOCTYPE declaration refers to a variety of files that must be available both for validation and when the NECTE document is used by an XML processor. These are all available from the present NECTE website; see the corpus download area for details.
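
Among these files is the driver tei2.dtd. The way in which 'INCLUDE' declarations in the DOCTYPE declaration can switch on parts of an external DTD relies on two general XML rules: the internal subset is read before the external one, and the first declaration of an entity takes precedence over any later one. The following is a simplified sketch of that mechanism and not the actual text of tei2.dtd; the entity name spokenDecls and the file name spoken-tagset.dtd are placeholders:

```xml
<!-- internal subset of the document: read first -->
<!ENTITY % TEI.spoken 'INCLUDE'>

<!-- inside a driver file: read afterwards -->
<!-- default value; it has no effect if TEI.spoken has already been declared -->
<!ENTITY % TEI.spoken 'IGNORE'>

<!-- conditional section: processed only if the keyword resolves to INCLUDE -->
<![ %TEI.spoken; [
  <!ENTITY % spokenDecls SYSTEM 'spoken-tagset.dtd'>
  %spokenDecls;
]]>
```

Because TEI.spoken has already been declared as 'INCLUDE' by the time the driver's default is read, the conditional section is processed; had the internal declaration been omitted, the default 'IGNORE' would apply and the section would be skipped.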