DECTE | Corpus Files

The DECTE files

DECTE is available free of charge for non-commercial use by individuals or groups that can demonstrate a bona fide interest. Potential users include those in the list below, though others are also welcome to apply.

Academic researchers in linguistics and related disciplines such as anthropology, ethnography, sociology, social history, and cultural studies.
Educationalists.
The media in non-commercial applications.
Individuals and organisations such as language societies that may not belong to the three categories above, but have a serious interest in historical dialect materials.

How to obtain DECTE

DECTE can be obtained from the School of English Literature, Language and Linguistics at Newcastle University. The DECTE access request form should be downloaded and returned by email or by post to:

Postal address:
Professor Karen Corrigan
The Diachronic Electronic Corpus of Tyneside English
School of English Literature, Language and Linguistics
Percy Building
Newcastle University
Newcastle upon Tyne
NE1 7RU
United Kingdom

Email:
k.p.corrigan@ncl.ac.uk

Successful applicants will be sent a user ID and password for access to the download area of this site. It should be noted, however, that DECTE as a whole is quite large, particularly on account of the audio files, and that download of the entire corpus may be impractical. We therefore offer applicants the option of receiving DECTE by post on DVD.

To enter the download area click here.

Using DECTE

There are two main ways of using DECTE: (1) as an integrated corpus, and (2) as a collection of files that can be individually accessed according to interest or need. These are discussed separately.

1. Using DECTE as an integrated corpus

DECTE was designed to be used in this way, but it's important to understand what is meant by 'integrated'. DECTE is just a collection of text files (and associated audio files). It doesn't 'do' anything on its own.

The files are, however, formatted in such a way that application software can interpret them as constituting a single corpus with a well-defined structure. The formatting uses Text Encoding Initiative (TEI)-conformant XML, which has emerged as a world standard for the structuring of electronic text; details of the TEI XML formatting used in DECTE are given in the Documentation section of this website.

For XML-aware application software to interpret DECTE as an integrated corpus, the following files must all be in the same operating system folder:

The master file decte.xml: this contains the document type declaration, the global header containing a range of information about the corpus, and entity references to the interview files that constitute the content of the corpus. This is the master file in the sense that XML-aware application software uses it to identify and assemble the other DECTE files into a single corpus.
The TEI P5 document type definition files:

analysis.dtd	analysis-decl.dtd	core.dtd	core-decl.dtd	corpus.dtd
corpus-decl.dtd	header.dtd	header-decl.dtd	linking.dtd	linking-decl.dtd
namesdates.dtd	namesdates-decl.dtd	spoken.dtd	spoken-decl.dtd	tei.dtd
textstructure.dtd	textstructure-decl.dtd

The files containing DECTE-specific modifications to the TEI document type definition: audiofiles.ent and interviews.ent.
The interview content files. For each interview there are two files: an audio recording (with *.wav or *.mp3 extension) and an XML text (with *.xml extension) containing whichever of the orthographic, phonetic, and part-of-speech-tagged transcriptions of the audio are available for that interview.

All these files are available in the download area.
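
Before loading the corpus into XML-aware software, it can help to confirm that the folder is complete. The following Python sketch is an illustration only and is not part of DECTE: the function name `missing_decte_files` and the folder name `decte` are hypothetical, and the check covers only the master file, the DTD files, and the two `.ent` files named above. The interview content files are not checked.

```python
from pathlib import Path

# File names taken from the list above; interview content files are not included.
DTD_STEMS = [
    "analysis", "core", "corpus", "header", "linking",
    "namesdates", "spoken", "textstructure",
]

REQUIRED_FILES = (
    ["decte.xml", "tei.dtd", "audiofiles.ent", "interviews.ent"]
    + [f"{stem}.dtd" for stem in DTD_STEMS]
    + [f"{stem}-decl.dtd" for stem in DTD_STEMS]
)


def missing_decte_files(folder):
    """Return the sorted names of required files that are absent from `folder`."""
    folder = Path(folder)
    return sorted(name for name in REQUIRED_FILES if not (folder / name).is_file())


if __name__ == "__main__":
    # "decte" is a hypothetical folder name; substitute the actual location.
    missing = missing_decte_files("decte")
    print("All required files present." if not missing else f"Missing: {missing}")
```

An empty result means that every file in the list was found in the folder; it says nothing about whether the files themselves are intact.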

The master file, decte.xml, uses the following list of entity references to identify and include all the available TLS, PVC, and NECTE2 content files in the corpus:

*(a) TLS file entity references*
&decten1tlsg01;	&decten1tlsg02;	&decten1tlsg03;	&decten1tlsg04;	&decten1tlsg05;
&decten1tlsg06;	&decten1tlsg07;	&decten1tlsg08;	&decten1tlsg09;	&decten1tlsg10;
&decten1tlsg11;	&decten1tlsg12;	&decten1tlsg13;	&decten1tlsg14;	&decten1tlsg15;
&decten1tlsg16;	&decten1tlsg17;	&decten1tlsg18;	&decten1tlsg19;	&decten1tlsg20;
&decten1tlsg21;	&decten1tlsg22;	&decten1tlsg23;	&decten1tlsg24;	&decten1tlsg25;
&decten1tlsg26;	&decten1tlsg27;	&decten1tlsg28;	&decten1tlsg29;	&decten1tlsg30;
&decten1tlsg31;	&decten1tlsg32;	&decten1tlsg33;	&decten1tlsg34;	&decten1tlsg35;
&decten1tlsg36;	&decten1tlsg37;	&decten1tlsn01;	&decten1tlsn02;	&decten1tlsn03;
&decten1tlsn04;	&decten1tlsn05;	&decten1tlsn06;	&decten1tlsn07;
*(b) PVC file entity references*
&decten1pvc01;	&decten1pvc02;	&decten1pvc03;	&decten1pvc04;	&decten1pvc05;
&decten1pvc06;	&decten1pvc07;	&decten1pvc08;	&decten1pvc09;	&decten1pvc10;
&decten1pvc11;	&decten1pvc12;	&decten1pvc13;	&decten1pvc14;	&decten1pvc15;
&decten1pvc16;	&decten1pvc17;	&decten1pvc18;
*(c) NECTE2 file entity references*
&decten2y07i001;	&decten2y07i002;	&decten2y07i003;	&decten2y07i004;	&decten2y07i005;
&decten2y07i006;	&decten2y07i007;	&decten2y07i008;	&decten2y07i009;	&decten2y07i010;
&decten2y07i011;	&decten2y07i012;	&decten2y07i013;	&decten2y07i014;	&decten2y08i001;
&decten2y08i002;	&decten2y08i003;	&decten2y08i004;	&decten2y10i001;	&decten2y10i002;
&decten2y10i003;	&decten2y10i004;	&decten2y10i005;	&decten2y10i006;	&decten2y10i007;
&decten2y10i008;	&decten2y10i009;	&decten2y10i010;	&decten2y10i011;	&decten2y10i012;
&decten2y10i013;	&decten2y10i014;	&decten2y10i015;	&decten2y10i016;	&decten2y10i017;
&decten2y10i018;	&decten2y10i019;	&decten2y10i020;	&decten2y10i021;	&decten2y10i022;
&decten2y10i023;	&decten2y10i024;	&decten2y10i025;	&decten2y10i026;

This list can be edited to include only files of interest using either a conventional text editor program or more specialized software that is designed for editing XML files, such as the oXygen XML editor or the freeware Notepad++ source code editor. One might, for example, be interested only in the TLS material and would therefore edit the PVC and NECTE2 file names out of the above list.
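
For users who prefer to script this edit, the following Python sketch shows one way of keeping only the entity references for a single sub-corpus. It is a minimal illustration under stated assumptions: the function name `keep_entity_refs` is hypothetical, the sample string is a shortened stand-in for the list above, and the sketch assumes that every entity reference beginning with `decte` in the text being processed belongs to that list. It should be applied to a copy of decte.xml, never to the original.

```python
import re

# An entity reference such as &decten1tlsg01; is an ampersand, a name, and a semicolon.
ENTITY_REF = re.compile(r"&(decte[A-Za-z0-9]+);")


def keep_entity_refs(text, prefix):
    """Delete DECTE entity references whose name does not start with `prefix`."""
    def replace(match):
        return match.group(0) if match.group(1).startswith(prefix) else ""
    return ENTITY_REF.sub(replace, text)


# Shortened stand-in for the full list of entity references.
sample = "&decten1tlsg01; &decten1tlsn07; &decten1pvc01; &decten2y07i001;"
tls_only = keep_entity_refs(sample, "decten1tls")
print(tls_only.split())
```

Passing the prefix `decten1pvc` or `decten2` instead would keep the PVC or NECTE2 references respectively. Other entity references, such as `&amp;`, are left untouched because the pattern matches only names beginning with `decte`.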

Numerous XML viewers, editors, and parsers are available.

For corpus analysis, the Oxford University Computing Service's Xaira system is the obvious first choice of XML-aware analytical software. It is a general-purpose XML search engine that works with any corpus of well-formed XML documents, but is best used with TEI-conformant documents like DECTE.

2. Accessing individual DECTE files

DECTE files can be individually accessed as required, without recourse to the full TEI XML corpus structuring apparatus, using suitable software such as that referred to in the preceding section or standard multimedia applications for the audio files.

To judge from our experience with the earlier NECTE project, however, the XML formatting can itself be a problem for some users. The Documentation section of this website observes, rather economically, that 'familiarity with XML and TEI is assumed throughout'; users not familiar with these may find the pervasive markup tags in the DECTE files a distracting encumbrance and yearn for the good old days of plain text files.

This is a not-unreasonable position. XML was never intended to be reader-friendly. It is a markup language that provides a standard for the structuring of documents and document collections, and, though XML-formatted documents are text files that can be read by humans, in general they should not be. For an XML document to be rendered easily legible, software that can represent the structural markup in a visually-accessible way is required.

That notwithstanding, some users may still only want plain text files without the markup, and for their convenience text file versions of the interview transcripts without the TEI XML markup are provided in the download area.

These are, however, not part of the corpus proper, in the sense that the master file decte.xml does not refer to them, and they are consequently invisible to any XML-aware software that might be used to process DECTE.

In addition to being read as plain text files, the TXT file versions of the interview transcripts can be used in a fairly straightforward way with text corpus analysis software such as AntConc and WordSmith. (There are some basic guidelines for using the text files with AntConc in the Schools section of our public-facing Talk of Toon website.)
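
The plain text versions can also be processed with a few lines of general-purpose code. The following Python sketch counts word frequencies, which is one of the basic operations that concordancing software performs. It is an illustration only: the function name `word_frequencies` is hypothetical, the sample sentence is invented rather than taken from DECTE, and the tokenisation rule (runs of letters, optionally with internal apostrophes) is a simplifying assumption that may not suit every transcript.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lower-cased word tokens (letters, optionally with internal apostrophes)."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


# Invented sample sentence, standing in for the contents of a transcript file.
sample = "I know the town well, and the town knows me."
counts = word_frequencies(sample)
print(counts["town"], counts["knows"])
```

To apply this to a real transcript, the contents of the file would be read into a string first, for example with `Path(...).read_text(...)` from Python's `pathlib` module; the appropriate file name and text encoding depend on the downloaded files and are not assumed here.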

Two final notes:

1. Because DECTE is freely available to legitimate users, there can be and therefore is no restriction on user emendation of content and/or TEI-conformant XML encoding. The '.ent' and '.dtd' files should, however, only be edited by users conversant with XML and TEI. Changes that have not been carefully considered will cause the corpus to behave in unpredictable ways or to malfunction when used with XML-aware software. The DECTE team would, moreover, be obliged if any such changes were explicitly stated in any public output based on an emended corpus.

2. We would be more than pleased to be told by users about any omissions or errors that they encounter, or suggestions for improvements, so that these can be considered for incorporation into future revisions of the corpus. The relevant contacts are Karen Corrigan and Adam Mearns.