Corpus Files




The DECTE corpus files

DECTE is available free of charge for non-commercial use by individuals or groups that can demonstrate a bona fide interest. Potential users include those in the list below, though others are also welcome to apply.

  • Academic researchers in linguistics and related disciplines such as anthropology, ethnography, sociology, social history, and cultural studies.

  • Educationalists.

  • The media in non-commercial applications.

  • Organisations such as language societies, and individuals, that may not belong to the categories above but have a serious interest in historical dialect materials.

How to obtain the DECTE corpus

DECTE can be obtained from the School of English Literary and Linguistic Studies at Newcastle University. The DECTE access request form should be downloaded and returned by post or email to:

  • Postal address: The Newcastle Electronic Corpus of Tyneside English, School of English Literary and Linguistic Studies, Percy Building, University of Newcastle, Newcastle upon Tyne NE1 7RU, United Kingdom.

  • Email: k.p.corrigan@ncl.ac.uk

Successful applicants will be sent a user ID and password for access to the download area of this site. It should, however, be noted that DECTE as a whole is quite large, particularly on account of the audio files, and that downloading the entire corpus may be impractical. We therefore offer applicants the option of receiving DECTE by post on DVD.

To enter the download area, click here.

Using the DECTE corpus

There are two main ways of using DECTE: (1) as an integrated corpus, and (2) as a collection of files that can be individually accessed according to interest or need. These are discussed separately.

1. Using DECTE as an integrated corpus

DECTE was designed to be used in this way, but it's important to understand what is meant by 'integrated'. DECTE is just a collection of text files. It doesn't 'do' anything. The files are, however, formatted using Text Encoding Initiative (TEI)-conformant XML, which has emerged as a world standard for the structuring of text, in such a way that application software can interpret them as constituting a single corpus with a well-defined structure. Details of the TEI XML formatting are given in the Documentation section of this website.

For XML-aware application software to interpret DECTE as an integrated corpus, the following files must all be in the same operating system folder:

  • The master file decte.xml: This contains the document type declaration, the global header containing a range of information about the corpus, and entity references to the interview files that comprise the content of the corpus. This is the master file in the sense that XML-aware application software uses it to assemble the other DECTE files into a single corpus.

  • The TEI P5 document type definition files:

analysis.dtd analysis-decl.dtd core.dtd core-decl.dtd corpus.dtd corpus-decl.dtd
header.dtd header-decl.dtd linking.dtd linking-decl.dtd namesdates.dtd namesdates-decl.dtd
spoken.dtd spoken-decl.dtd tei.dtd textstructure.dtd textstructure-decl.dtd

  • The files containing DECTE-specific modifications to the TEI document type definition: audiofiles.ent and interviews.ent.

  • The interview content files. For each interview there are two files: an audio recording (with *.wav or *.mp3 extension) and an XML text (with *.xml extension) containing whichever of the orthographic, phonetic, and part-of-speech-tagged transcriptions of the audio are available for that interview.

All of these files are available in the download area.
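
Before loading the corpus into application software, it can be useful to confirm that the required files really are together in one folder. The following Python fragment is a minimal sketch of such a check and is not part of the DECTE distribution. The file names are those listed above; the function name missing_files is hypothetical, and the interview content files are not checked because they depend on which interviews have been downloaded.

```python
from pathlib import Path

# TEI P5 modules that come as a pair: <name>.dtd and <name>-decl.dtd.
TEI_MODULES = [
    "analysis", "core", "corpus", "header",
    "linking", "namesdates", "spoken", "textstructure",
]

# The 17 DTD files named in the list above (tei.dtd has no -decl partner).
DTD_FILES = ["tei.dtd"]
for module in TEI_MODULES:
    DTD_FILES.append(module + ".dtd")
    DTD_FILES.append(module + "-decl.dtd")

# Master file, DECTE-specific entity files, and the DTD files.
REQUIRED = ["decte.xml", "audiofiles.ent", "interviews.ent"] + sorted(DTD_FILES)


def missing_files(folder):
    """Return the required file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in REQUIRED if not (folder / name).is_file()]
```

Calling missing_files on the folder that holds the corpus returns an empty list when every required file is in place, and otherwise the names of the files still to be copied there.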

The master file decte.xml includes all the available TLS, PVC, and NECTE2 content files in the DECTE corpus by means of the following list of entity references:

&decten1tlsg01; &decten1tlsg02; &decten1tlsg03; &decten1tlsg04; &decten1tlsg05; &decten1tlsg06;
&decten1tlsg07; &decten1tlsg08; &decten1tlsg09; &decten1tlsg10; &decten1tlsg11; &decten1tlsg12;
&decten1tlsg13; &decten1tlsg14; &decten1tlsg15; &decten1tlsg16; &decten1tlsg17; &decten1tlsg18;
&decten1tlsg19; &decten1tlsg20; &decten1tlsg21; &decten1tlsg22; &decten1tlsg23; &decten1tlsg24;
&decten1tlsg25; &decten1tlsg26; &decten1tlsg27; &decten1tlsg28; &decten1tlsg29; &decten1tlsg30;
&decten1tlsg31; &decten1tlsg32; &decten1tlsg33; &decten1tlsg34; &decten1tlsg35; &decten1tlsg36;
&decten1tlsg37; &decten1tlsn01; &decten1tlsn02; &decten1tlsn03; &decten1tlsn04; &decten1tlsn05;
&decten1tlsn06; &decten1tlsn07; &decten1pvc01; &decten1pvc02; &decten1pvc03; &decten1pvc04;
&decten1pvc05; &decten1pvc06; &decten1pvc07; &decten1pvc08; &decten1pvc09; &decten1pvc10;
&decten1pvc11; &decten1pvc12; &decten1pvc13; &decten1pvc14; &decten1pvc15; &decten1pvc16;
&decten1pvc17; &decten1pvc18; &decten2y07i001; &decten2y07i002; &decten2y07i003; &decten2y07i004;
&decten2y07i005; &decten2y07i006; &decten2y07i007; &decten2y07i008; &decten2y07i009; &decten2y07i010;
&decten2y07i011; &decten2y07i012; &decten2y07i013; &decten2y07i014; &decten2y08i001; &decten2y08i002;
&decten2y08i003; &decten2y08i004; &decten2y10i001; &decten2y10i002; &decten2y10i003; &decten2y10i004;
&decten2y10i005; &decten2y10i006; &decten2y10i007; &decten2y10i008; &decten2y10i009; &decten2y10i010;
&decten2y10i011; &decten2y10i012; &decten2y10i013; &decten2y10i014; &decten2y10i015; &decten2y10i016;
&decten2y10i017; &decten2y10i018; &decten2y10i019; &decten2y10i020; &decten2y10i021; &decten2y10i022;
&decten2y10i023; &decten2y10i024; &decten2y10i025; &decten2y10i026;    

This list can be edited to include only files of interest using either a conventional text editor or a special-purpose one like the oXygen XML editor. One might, for example, be interested only in the TLS material and would therefore edit the PVC and NECTE2 files out of the above list.
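
For users who prefer to script this edit rather than do it by hand, the following Python fragment sketches one way to filter the list. It is an illustration only: the function name keep_entities is hypothetical, and the prefix "decten1tls" for the TLS material is inferred from the entity names shown above rather than taken from the DECTE documentation. Only references whose names begin with "decte" are considered, so standard references such as &amp; are left untouched.

```python
import re

# Matches corpus entity references such as &decten1tlsg01; and captures the name.
DECTE_REF = re.compile(r"&(decte\w+);")


def keep_entities(text, prefix):
    """Delete every DECTE entity reference whose name does not start with `prefix`."""
    def replace(match):
        name = match.group(1)
        # Keep the reference unchanged if it matches; otherwise remove it.
        return match.group(0) if name.startswith(prefix) else ""
    return DECTE_REF.sub(replace, text)
```

Applying keep_entities(text, "decten1tls") to the text of the reference list would keep the TLS references and remove the PVC and NECTE2 ones. It is advisable to work on a copy of decte.xml and to check the result with an XML parser before use.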

Numerous XML viewers, editors, and parsers are available; see for example

For corpus analysis, the Oxford University Computing Service's Xaira system is the obvious first choice of XML-aware analytical software. It is a general-purpose XML search engine that works with any corpus of well-formed XML documents, but is best used with TEI-conformant documents like DECTE.

2. Accessing individual DECTE files

DECTE files can be individually accessed as required without recourse to the full TEI XML corpus structuring apparatus using suitable software such as that referred to in the preceding section or standard multimedia applications for the audio files. To judge from our experience with the earlier NECTE corpus, however, the XML formatting can itself be a problem for some users. The Documentation section of this website observes, rather economically, that 'familiarity with XML and TEI is assumed throughout'; users not familiar with these may find the pervasive markup tags in the DECTE files a distracting encumbrance and yearn for the good old days of plain text files.

This is a not-unreasonable position. XML was never intended to be reader-friendly. It is a markup language that provides a standard for structuring of documents and document collections, and, though XML-formatted documents are text files that can be read by humans, in general they should not be. For an XML document to be readily legible, software that can represent the structural markup in a visually accessible way is required. That notwithstanding, some users may still want only plain text files without the markup, and for their convenience text files with the markup stripped out are provided in the download area. These are, however, not part of the corpus proper in the sense that the master file decte.xml does not refer to them, and they are consequently invisible to XML-aware software used to process DECTE. In other words, users of the plain text files are on their own.
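
To illustrate what 'stripping out the markup' involves, the following Python fragment is a minimal sketch using the standard library's xml.etree.ElementTree module. It is not the procedure used to produce the plain text files in the download area, and the function name strip_markup is hypothetical. It assumes that the input is a well-formed XML document that can be parsed on its own; an interview file that relies on the DTD or on entity declarations held elsewhere may need further preparation.

```python
import xml.etree.ElementTree as ET


def strip_markup(xml_text):
    """Return the character data of an XML document with all tags removed."""
    root = ET.fromstring(xml_text)
    # itertext() yields every text fragment in document order; split() and
    # join() then collapse runs of whitespace left behind by the markup.
    words = " ".join(root.itertext()).split()
    return " ".join(words)
```

Note that this sketch keeps all text, including any header content, and inserts a space wherever a tag occurred, which would split a word that contains markup in the middle.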

Two final notes:

1. Because DECTE is freely available to legitimate users, there can be and therefore is no restriction on user emendation of content and/or TEI-conformant XML encoding. The '.ent' and '.dtd' files should, however, only be edited by users conversant with XML and TEI. Changes that have not been carefully considered will cause the corpus to behave in unpredictable ways or to malfunction when used with application software. The DECTE team would, moreover, be obliged if any such changes were explicitly stated in any public output based on an emended corpus.

2. DECTE is hot off the press, and as such we would be more than pleased to be told by users about omissions, errors, and improvements so that these can be incorporated into future revisions. The relevant contacts are karen.corrigan@ncl.ac.uk and hermann.moisl@ncl.ac.uk.