Corpus Files




A valid, as opposed to merely well-formed, XML document must include a DTD against which the document can be validated. This is done by means of a document type (DOCTYPE) declaration, which specifies the elements, attributes, and entities used in the document. This specification can be internal, in the sense that its components appear lexically within the DOCTYPE declaration; external, in that the names of one or more files containing the specification are given; or a combination of the two. In the present case the specification takes the form of references to external files, some of which are a selection of TEI module files containing the tag sets used in the corpus, and some of which are files containing the names of the content files which constitute the corpus.
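The three forms can be sketched as follows. This is a generic illustration, not taken from DECTE: the element name note, the file name note.dtd, and the entity editor are hypothetical, and the three declarations are alternatives rather than parts of a single document.

```xml
<!-- (a) Internal: the declarations appear inside the DOCTYPE declaration itself -->
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
]>

<!-- (b) External: the declarations are read from a named file -->
<!DOCTYPE note SYSTEM "note.dtd">

<!-- (c) Combined: an external file plus local declarations in square brackets -->
<!DOCTYPE note SYSTEM "note.dtd" [
  <!ENTITY editor "A. N. Other">
]>
```

The DECTE declaration discussed in this section has the shape of form (c): an external file named after SYSTEM, followed by local declarations in square brackets.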

The DOCTYPE declaration in decte.xml is as follows:

<!-- DTD declaration -->
<!DOCTYPE teiCorpus SYSTEM "tei.dtd" [

<!-- Additions to core entities in tei.dtd -->
<!ENTITY % TEI.corpus 'INCLUDE'>
<!ENTITY % TEI.header 'INCLUDE'>
<!ENTITY % TEI.textstructure 'INCLUDE'>
<!ENTITY % TEI.linking 'INCLUDE'>
<!ENTITY % TEI.analysis 'INCLUDE'>
<!ENTITY % TEI.namesdates 'INCLUDE'>
<!ENTITY % TEI.spoken 'INCLUDE'>

<!-- Additional ENTITY declarations -->
<!-- Interview XML files -->
<!ENTITY % interviews SYSTEM "interviews.ent"> %interviews;
<!-- Interview audio files -->
<!ENTITY % audiofiles SYSTEM "audiofiles.ent"> %audiofiles;
]>


  • <!DOCTYPE teiCorpus SYSTEM "tei.dtd" [...]> is the DOCTYPE declaration, in which:

- teiCorpus names the root element of the document to which the DTD applies. This is the name defined by the TEI as the root element name for language corpora (Guidelines 15).

- SYSTEM "tei.dtd" says that the required DTD definitions are available locally in the file tei.dtd. To understand the role of this file, note that the TEI DTD is partitioned into modules that can be selected according to the requirements of a particular application, so that the entire DTD need not be included where its full range is not required; tei.dtd is a 'driver' file which refers to these TEI DTD module files (Guidelines 1).

- the square brackets [ ] enclose the DECTE-specific selections from the TEI DTD and some DECTE-specific <!ENTITY> declarations. These are described in what follows.

  • The <!ENTITY> declarations from <!ENTITY % TEI.core 'INCLUDE'> to <!ENTITY % TEI.spoken 'INCLUDE'> select the parts of the full TEI DTD that are relevant to DECTE.

- <!ENTITY % TEI.core 'INCLUDE'> : The core tag set that every TEI-conformant document must have access to (Guidelines 3).

- <!ENTITY % TEI.corpus 'INCLUDE'> : Tags specific to language corpora (Guidelines 15).

- <!ENTITY % TEI.header 'INCLUDE'> : Inclusion of metadata (Guidelines 2).

- <!ENTITY % TEI.textstructure 'INCLUDE'> : Document structuring tags (Guidelines 4).

- <!ENTITY % TEI.linking 'INCLUDE'> : Linking and alignment of document components (Guidelines 16).

- <!ENTITY % TEI.analysis 'INCLUDE'> : Interpretation of document elements (Guidelines 17).

- <!ENTITY % TEI.namesdates 'INCLUDE'> : Inclusion of names, dates, and related information (Guidelines 13).

- <!ENTITY % TEI.spoken 'INCLUDE'> : Transcription of spoken language (Guidelines 8).
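How a value of 'INCLUDE' can switch a module on is sketched below, on the assumption that the driver file uses the standard XML conditional-section technique; the contents of the actual tei.dtd are not reproduced here, and the names spoken.decls and spoken-module.dtd are hypothetical. The sketch relies on two XML rules: the internal subset is read before the external file, and when an entity is declared more than once the first declaration is the binding one.

```xml
<!-- In the document's internal subset: read first, so this declaration is binding -->
<!ENTITY % TEI.spoken 'INCLUDE'>

<!-- In the external driver file (sketch only; file and entity names are hypothetical) -->
<!-- Default value, which has no effect if the document has already declared the entity -->
<!ENTITY % TEI.spoken 'IGNORE'>

<!-- Conditional section: processed only when %TEI.spoken; expands to INCLUDE -->
<![%TEI.spoken;[
  <!ENTITY % spoken.decls SYSTEM "spoken-module.dtd">
  %spoken.decls;
]]>
```

Under this arrangement, a module whose entity is not declared in the document keeps the driver's default and is skipped.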

  • The additional <!ENTITY> declarations are DECTE-specific additions to the TEI DTD.

- <!ENTITY % interviews SYSTEM "interviews.ent"> %interviews; : List of the XML-formatted files included in DECTE.

- <!ENTITY % audiofiles SYSTEM "audiofiles.ent"> %audiofiles; : List of the audio files included in DECTE.
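The contents of interviews.ent and audiofiles.ent are not shown here. As a rough sketch of what files of this kind commonly contain, assuming one external entity per content file, they might look like the following; every entity name, file name, and the notation below is hypothetical.

```xml
<!-- interviews.ent (hypothetical contents): one external parsed entity per interview -->
<!ENTITY interview01 SYSTEM "interview01.xml">
<!ENTITY interview02 SYSTEM "interview02.xml">

<!-- audiofiles.ent (hypothetical contents): unparsed entities for non-XML data -->
<!NOTATION wav SYSTEM "audio/x-wav">
<!ENTITY audio01 SYSTEM "audio01.wav" NDATA wav>

<!-- In the corpus document, an interview could then be pulled in by entity reference -->
<teiCorpus>
  <!-- corpus-level header would go here -->
  &interview01;
  &interview02;
</teiCorpus>
```

Declaring the files through a parameter entity such as %interviews; keeps the list of content files in one place, so files can be added to or removed from the corpus without editing the DOCTYPE declaration itself.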