Data on diversity | COST ES1103 | Newcastle University

Task 2: Generate protocols for the analysis of the data on diversity

The excitement generated by next-generation sequencing has been matched by considerable controversy. Broadly speaking, there are two kinds of error:

sequencing errors, and
erroneous estimates of diversity.

The two forms of error are linked. Sequencing error is largely caused by the tendency of massively parallel sequencing to miscall homopolymers (runs of the same base in a DNA sequence), though chimeras (single sequences formed when DNA from two separate organisms is joined) are undoubtedly a problem as well.
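To make the idea of a homopolymer concrete, the following minimal Python sketch locates runs of identical bases in a read. The function name `homopolymer_runs` and the `min_len` threshold of 3 are assumptions chosen for illustration; they are not part of any protocol described here.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=3):
    """Return (base, start, length) for every run of identical bases
    that is at least min_len long. min_len=3 is an illustrative choice."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((base, pos, length))
        pos += length
    return runs


# Hypothetical read with a run of five T and a run of three A.
print(homopolymer_runs("ACGTTTTTGCAAAC"))
# [('T', 3, 5), ('A', 10, 3)]
```

Runs such as these are the positions where the length of the run is most likely to be miscalled.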

The effect of this error is to increase the apparent diversity of the sample by inflating the number of unique sequences. It also skews the underlying distribution of the sequences, which is used to estimate the true diversity.
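The inflation can be illustrated with a toy example. The sketch below assumes a deliberately crude, hypothetical error model (every tenth read has its first long homopolymer shortened by one base); real platforms behave differently, and the reads and counts are invented for illustration only.

```python
from itertools import groupby


def perturb_homopolymer(seq, min_len=3):
    """Mock miscall: shorten the first run of at least min_len bases by one."""
    out = []
    changed = False
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if not changed and length >= min_len:
            length -= 1
            changed = True
        out.append(base * length)
    return "".join(out)


# Hypothetical community of three true sequences (100 reads in total).
true_reads = ["ACGTTTTGCA"] * 50 + ["GGGACCTAGT"] * 30 + ["TTACGAAAAC"] * 20

# Assumed error model: every tenth read carries a homopolymer miscall.
observed = [
    perturb_homopolymer(read) if i % 10 == 0 else read
    for i, read in enumerate(true_reads)
]

print(len(set(true_reads)), len(set(observed)))
# 3 6
```

Under these assumptions, three true sequences appear as six unique sequences, and the extra ones are all rare, which is how the error distorts the distribution as well as the count.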

Solutions for sequencing errors

There are a number of approaches to resolving the issue of sequencing error.

The best, but perhaps more difficult, approach is to reanalyse the data to remove the sequencing error.
The simpler, but arguably less satisfactory, approach is to analyse the data in such a way that the effects of sequencing error are minimised.
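As one hypothetical illustration of the simpler approach, reads can be compared after collapsing every homopolymer run to a single base, so that run-length miscalls no longer create new unique sequences. This is an assumption made for the sake of example, not a method proposed here, and the function name `collapse_homopolymers` is invented.

```python
from itertools import groupby


def collapse_homopolymers(seq):
    """Reduce every run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq))


# Hypothetical reads: the second is a miscall of the first,
# while the third is assumed to be a genuinely different organism.
reads = ["ACGTTTTGCA", "ACGTTTGCA", "ACGTGCA"]

unique_raw = len(set(reads))
unique_collapsed = len({collapse_homopolymers(r) for r in reads})
print(unique_raw, unique_collapsed)
# 3 1
```

The miscall disappears, but so does the genuinely distinct third sequence. That loss of resolution is the kind of degraded picture of diversity that a simpler approach risks.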

The relative merits of these approaches need to be discussed and evaluated, and we need to consider their effect on the quality and clarity of our picture of microbial diversity.

We are particularly concerned that we should clearly understand the long-term impact of using a simpler approach. Will it condemn us and future generations to a degraded picture of diversity that will prevent us from seeing the fundamental patterns in the data? We could cause long-term harm to the knowledge base if, for short-term gain, we neglect to analyse the data correctly.

Choosing the right approach

We need to evaluate the costs and consequences of the contrasting approaches. Even with very large sample sizes, the observed diversity typically falls well below the true diversity of the system under study.

There is a range of mathematical tools for extrapolating from a sample to the diversity of the whole community. Again, there are differing techniques and differing perspectives on how best to achieve this.
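One widely used example of such a tool is the Chao1 richness estimator, sketched below in its bias-corrected form, S = S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of taxa observed exactly once and exactly twice. It is shown only as an illustration of the general idea, not as the technique recommended here, and the abundance values are invented.

```python
def chao1(abundances):
    """Bias-corrected Chao1 estimate of total richness from per-taxon counts."""
    counts = [n for n in abundances if n > 0]
    s_obs = len(counts)                      # taxa actually observed
    f1 = sum(1 for n in counts if n == 1)    # singletons
    f2 = sum(1 for n in counts if n == 2)    # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical sample: 9 observed taxa, 4 singletons, 2 doubletons.
sample = [10, 6, 3, 2, 2, 1, 1, 1, 1]
print(chao1(sample))
# 11.0
```

Because the estimate depends on the number of singletons, spurious rare sequences created by sequencing error raise it directly, which is why the choice of error-handling strategy and the choice of estimator cannot be made independently.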

We need to discuss and recommend how to determine diversity. This debate must be informed by the debate on the sequencing problem, because the strategy for coping with sequencing error will inevitably affect the strategy for estimating diversity.