
Are multiple transcription start sites functional or mistakes?

If you look in the various databases you'll see that most human genes have multiple transcription start sites. The evidence for the existence of these variants is solid (they exist) but it's not clear whether the minor start sites are truly functional or whether they are just due to mistakes in transcription initiation. They are included in the databases because annotators are unable to distinguish between these possibilities.

Let's look at the entry for the human triosephosphate isomerase gene (TPI1; Gene ID 7167).


The correct mRNA is NM_000365.5, third from the top. (Trust me on this!) The three other variants have different transcription start sites: two of them are upstream and one is downstream of the major site. Are these variants functional or are they simply transcription initiation errors? This is the same problem that we dealt with when we looked at splice variants. In that case I concluded that most splice variants are due to splicing errors and true alternative splicing is rare.

This is not a difficult question to answer when you are looking at specific well-characterized genes such as TPI1. The three variants are present at very low concentrations, they are not conserved in other species, and they encode variant proteins that have never been detected. It seems reasonable to go with the null hypothesis; namely, that they are non-functional transcripts due to errors in transcription initiation.

However, this approach is not practical for every one of the 25,000 genes in the human genome, so several groups have looked for a genomics experiment that will address the question. I'd like to recommend a recent paper in PLoS Biology that tries to do this in a very clever way. It's also a paper that does an excellent job of explaining the controversy in a way that all scientific papers should copy. [1]
Xu, C., Park, J.-K., and Zhang, J. (2019) Evidence that alternative transcriptional initiation is largely nonadaptive. PLoS Biology, 17(3), e3000197. [doi: 10.1371/journal.pbio.3000197]

Abstract

Alternative transcriptional initiation (ATI) refers to the frequent observation that one gene has multiple transcription start sites (TSSs). Although this phenomenon is thought to be adaptive, the specific advantage is rarely known. Here, we propose that each gene has one optimal TSS and that ATI arises primarily from imprecise transcriptional initiation that could be deleterious. This error hypothesis predicts that (i) the TSS diversity of a gene reduces with its expression level; (ii) the fractional use of the major TSS increases, but that of each minor TSS decreases, with the gene expression level; and (iii) cis-elements for major TSSs are selectively constrained, while those for minor TSSs are not. By contrast, the adaptive hypothesis does not make these predictions a priori. Our analysis of human and mouse transcriptomes confirms each of the three predictions. These and other findings strongly suggest that ATI predominantly results from molecular errors, requiring a major revision of our understanding of the precision and regulation of transcription. [my emphasis - LAM]

Author summary

Multiple surveys of transcriptional initiation showed that mammalian genes typically have multiple transcription start sites such that transcription is initiated from any one of these sites. Many researchers believe that this phenomenon is adaptive because it allows production of multiple transcripts, from the same gene, that potentially vary in function or post-transcriptional regulation. Nevertheless, it is also possible that each gene has only one optimal transcription start site and that alternative transcriptional initiation arises primarily from molecular errors that are slightly deleterious. This error hypothesis makes a series of predictions about the amount of transcription start site diversity per gene, relative uses of the various start sites of a gene, among-tissue and across-species differences in start site usage, and the evolutionary conservation of cis-regulatory elements of various start sites, all of which are verified in our analyses of genome-wide transcription start site data from the human and mouse. These findings strongly suggest that alternative transcriptional initiation largely reflects molecular errors instead of molecular adaptations and require a rethink of the precision and regulation of transcription.
I'm not going to describe the experimental results; if you're interested you can read the paper yourself. Instead, I want to focus on the way the authors present the problem and how it could be resolved.
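
For readers who want predictions (i) and (ii) from the abstract in concrete terms, here is a minimal sketch. It assumes you have read counts for each start site of a gene and uses the Shannon index as the measure of TSS diversity. That choice of measure is an assumption made for illustration and may not match what the paper uses. The function name `tss_usage` and the counts are hypothetical, not data from the paper.

```python
import math

def tss_usage(counts):
    """Return (fractional use of the major TSS, Shannon diversity) for one gene.

    counts: reads observed at each transcription start site of the gene.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("gene has no reads at any start site")
    # Fraction of initiation events at each start site (unused sites are skipped)
    fractions = [c / total for c in counts if c > 0]
    major_fraction = max(fractions)
    # Shannon index: 0 when one site is used exclusively, larger when use is spread out
    diversity = -sum(p * math.log(p) for p in fractions)
    return major_fraction, diversity

# Hypothetical counts for a weakly expressed and a highly expressed gene
low_expression_gene = [40, 30, 20, 10]
high_expression_gene = [9400, 300, 200, 100]

for label, counts in [("low", low_expression_gene), ("high", high_expression_gene)]:
    major, diversity = tss_usage(counts)
    print(f"{label}: major TSS fraction = {major:.2f}, diversity = {diversity:.2f}")
```

Under the error hypothesis you would expect the pattern in these made-up numbers to hold across real genes: the highly expressed gene concentrates its initiation on one site and has lower diversity.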

One of the important issues in these kinds of problems is not whether there are well-established cases where the phenomenon is responsible for functional alternatives but whether the phenomenon is widespread. In this case, we know of specific examples of genes with multiple transcription start sites (TSS) that have a well-established function. The authors include a brief summary of these examples and conclude with an important caveat.
Nevertheless, alternative TSSs with verified benefits account for only a tiny fraction of all known TSSs, while the vast majority of TSSs have unknown functions. More than 90,000 TSSs are annotated for approximately 20,000 human protein-coding genes in ENSEMBL genome reference consortium human build 37 (GRCh37). Recent surveys using high-throughput sequencing methods such as deep cap analysis gene expression (deepCAGE) showed that human TSSs are much more abundant than what has been annotated. Are most TSSs of a gene functionally distinct, and is ATI generally adaptive? While this possibility exists, here we propose and test an alternative, nonadaptive hypothesis that is at least as reasonable as the adaptive hypothesis. Specifically, we propose that there is only one optimal TSS per gene and that other TSSs arise from errors in transcriptional initiation that are mostly slightly deleterious. This hypothesis is based on the consideration that transcriptional initiation has a limited fidelity, and harmful ATI may not be fully suppressed by natural selection if the harm is sufficiently small or if the cost of fully suppressing harmful ATI is even larger than the benefit from suppressing it.
This is how scientific papers should be written but too often we see scientists who assume that because some variants are functional it must mean that all variants are functional. They don't bother to mention the possibility that some could be functional but most are not.

Why is it important to decide whether multiple transcription start sites are functional? The simple answer is that it's always better to know the truth but there's more to it than that. Because these variants are included in the sequence databases, they are usually assumed to be functional. Let's say someone wants to look at 5' UTR sequences in order to see if there are specific signals that control RNA stability. In the case of the TPI1 gene (see above) they will get four different results because there are four different transcription start sites and the programs that scan the databases aren't able to recognize that three of these might be artifacts. That's a problem.
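
One way a scanning program could guard against this is to keep only the transcripts that account for a substantial share of a gene's initiation events before it extracts 5' UTRs. The following is a rough sketch of that idea, not a description of how any existing database tool works. The `Transcript` fields, the `utrs_to_scan` helper, the 50% threshold, the variant names, the read counts, and the sequences are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    name: str   # hypothetical label, not a real accession
    reads: int  # hypothetical read count supporting this start site
    utr5: str   # hypothetical 5' UTR sequence

def utrs_to_scan(transcripts, min_fraction=0.5):
    """Keep transcripts whose start site accounts for at least min_fraction of reads.

    The threshold is an arbitrary assumption; with the default of 0.5,
    at most one or two transcripts per gene can pass.
    """
    total = sum(t.reads for t in transcripts)
    if total == 0:
        return []
    return [t for t in transcripts if t.reads / total >= min_fraction]

# Four made-up variants of one gene: one dominant start site and three rare ones
gene_variants = [
    Transcript("variant_A", 12, "GGCTAGCTAGGCATCG"),
    Transcript("variant_B", 8, "GCTAGGCATCG"),
    Transcript("variant_C", 950, "GCATCG"),
    Transcript("variant_D", 30, "TCG"),
]

for transcript in utrs_to_scan(gene_variants):
    print(transcript.name, transcript.utr5)
```

A filter like this only sidesteps the problem. It does not tell you whether a minor start site is functional, and a rare transcript that matters in one tissue would be thrown out along with the errors.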

It also affects the definition of a gene and the amount of DNA devoted to genes. If the longest transcript is taken as the true size of the gene, as it often is, then this misrepresents the true nature of the gene. There's no easy way to fix this problem unless we pay annotators to closely examine each individual gene to figure out which transcripts are functional and which ones are not. They've done this for many splice variants, which is why many splice variants have been removed from the sequence databases, but it's a labor-intensive and expensive task.

Up until now, most scientists have not been aware that there's a problem. As is the case with alternative splicing and other phenomena, the average scientist just assumes that the variants in the databases represent true functional alternatives that contribute to gene expression. The authors of this paper (Xu et al., 2019) want to alert everyone to the distinct possibility that their results with transcription start sites raise a much more general concern that needs to be addressed. That's why they say,
Our results on ATI echo recent findings about a number of phenomena that increase transcriptome diversity, including alternative polyadenylation, alternative splicing, and several forms of RNA editing. They have all been shown to be largely the results of molecular errors instead of adaptive regulatory mechanisms. Together, these findings reveal the astonishing imprecision of key molecular processes in the cell, contrasting the common view of an exquisitely perfected cellular life.
Read that last sentence very carefully because it addresses what I think is the main problem. It's a question of contradictory worldviews that color one's interpretation of the data. If you think that life is exquisitely designed (by natural selection) then you tend to look at all variants as part of an extremely complex system that fine-tunes gene expression. On the other hand, if you think that the drift-barrier hypothesis is valid then you tend to discount the power of natural selection to weed out all transcription and splicing errors and you see biochemistry as an inherently messy process.


1. I've been highly critical of papers about junk DNA and alternative splicing because they often ignore the fact that there is a controversy. They do not mention that there is solid evidence for junk DNA and solid evidence that alternative splicing is uncommon.


This post first appeared on Sandwalk.
