Deciphering Splicing Codes of Spliceosomal Introns

Figure1
Fig.1 The central dogma of molecular biology. 1. Duplication; 2. Transcription; 3. Reverse transcription; 4. RNA modification; and 5. Translation.

The Central Dogma of molecular biology (Fig.1) was first articulated by F. Crick in 1958 and describes how information is transferred among different macromolecules [3]. Among the information transfers in Fig.1, duplication (1), transcription (2), reverse transcription (3) and translation (5) always follow fixed rules and transfer their information faithfully. In prokaryotes, the information that regulates these transfers and mediates communication with environmental factors is stored on “naked” DNA molecules, and a number of related genes are organized as an operon whose expression is regulated via (inducible or repressible) negative and/or positive control. For example, the lac operon of the model bacterium Escherichia coli consists of three adjacent structural genes, a promoter, a terminator, and an operator. In the absence of lactose, the repressor protein binds very tightly to the lac operator. When lactose levels rise, lactose interacts with the repressor, which is then released from the lac operator to allow transcription of the lac structural genes. This demonstrates that prokaryotes are able to sense their environments and that their genetic information can engage in direct dialogue with those environments. More importantly, this information is encoded in their DNA sequences in the form of the repressor and operator. Unlike prokaryotes, eukaryotic organisms have a nuclear membrane that encloses the genetic material and prevents direct dialogue or communication between that material and the environment.

Do eukaryotic organisms communicate with their environments? This question was answered by Charles Darwin's “On the Origin of Species”, published on 24 November 1859, so you already know the answer. During the last two decades, more than 170 eukaryotic genomes have been sequenced and analyzed (http://www.genomesonline.org/). No large numbers of repressors have been identified so far, and no conserved sequences similar to the lac operator have been characterized in functionally related genes. Consistent with these data, the human genome encodes far fewer protein-coding genes than had been predicted. However, most nuclear genes of eukaryotic organisms characterized so far contain intervening sequences, or spliceosomal introns. The numbers of spliceosomal introns, intron densities and average intron sizes generally increase as eukaryotic organisms become more complex. Unlike a drop of ink in a cup of water, which disperses uniformly over time, spliceosomal introns persist and accumulate against this natural tendency. This indicates that these “junk” DNA sequences have been favored by natural selection and are beneficial to eukaryotic organisms.

Figure2
Fig.2 a) Example of E5-I3 and I5-E3 alignments for intron 8 (168 bp) of the human ciz1 gene [2]. The black and red italic uppercase letters represent the 5’ and 3’ exonic sequences at splice sites, respectively, and the red italic and black lowercase letters indicate the 5’ and 3’ intronic sequences. The vertical lines indicate uninterrupted identical nucleotides extending from the splice junctions for the E5-I3 and I5-E3 alignments; the length of this run is designated LIN (length of identical nucleotides). Asterisks represent identical nucleotides outside of this region. b) Sizes of animal intron datasets and the proportion with LIN ≥6, also expressed as the ratio between E5-I3 (black in b) and I5-E3 (red). All observed differences between E5-I3 and I5-E3 were statistically significant (p < 0.001).

We have previously shown that recently acquired human spliceosomal introns bear signatures of similar 5’ and 3’ splice sites. To assess whether the degree of similarity is uniform along the splice site regions, we divided each splice junction into its exonic and intronic portions (designated E5 and I5 for the 5’ splice site, and I3 and E3 for the 3’ splice site). Starting from the splice junction, we then scored the length of identical nucleotides (LIN) in an uninterrupted stretch, independently for the E5-I3 and I5-E3 alignments. Fig.2a gives a specific example showing the sequences flanking intron 8 of the human ciz1 gene, which encodes Cip1-interacting zinc finger protein 1 [2]. This was done for human introns as well as for those of other vertebrates (mouse, zebrafish and chicken) and of the invertebrates Caenorhabditis elegans and Drosophila melanogaster (Fig.2b). Notably, as shown in Fig.2b, the percentage of E5-I3 alignments with LIN ≥6 is significantly higher (p < 0.001) than that of I5-E3 alignments in humans (by 3-fold), in other vertebrates (by 2.4- to 5.3-fold) and in D. melanogaster (by 4.5-fold). Interestingly, in C. elegans, whose genome is believed to contain relatively few recently gained introns, there is a 14-fold excess of E5-I3, driven in part by a low frequency of I5-E3 alignments with LIN ≥6 (compared to vertebrates). One intriguing possibility is that this pronounced asymmetry relates to trans-splicing, which plays a significant role in gene expression in C. elegans, unlike in the other animals surveyed here.
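The LIN scoring described above can be illustrated with a minimal Python sketch. The function names, the 20-nt window and the toy sequences are assumptions made for illustration only; they are not taken from the original analysis pipeline or from the ciz1 data.

```python
def run_from_junction(a: str, b: str, junction_at_end: bool) -> int:
    """Count identical nucleotides in an uninterrupted run starting at the junction."""
    a, b = a.upper(), b.upper()
    if junction_at_end:
        # E5 and I3 both END at a splice junction, so walk backwards from it.
        a, b = a[::-1], b[::-1]
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def splice_site_lins(upstream_exon: str, intron: str, downstream_exon: str,
                     window: int = 20) -> dict:
    """Return LIN for the E5-I3 and I5-E3 alignments of a single intron.

    `window` (hypothetical default: 20 nt) is how far from each junction we look.
    """
    e5 = upstream_exon[-window:]    # exonic side of the 5' splice site
    i5 = intron[:window]            # intronic side of the 5' splice site
    i3 = intron[-window:]           # intronic side of the 3' splice site
    e3 = downstream_exon[:window]   # exonic side of the 3' splice site
    return {
        "E5-I3": run_from_junction(e5, i3, junction_at_end=True),
        "I5-E3": run_from_junction(i5, e3, junction_at_end=False),
    }


# Toy sequences (invented): both E5 and I3 end in ...TTCCAG.
print(splice_site_lins("ACGTTCCAG", "GTAAGTCCCTTTTCCAG", "GTTACC"))
# {'E5-I3': 6, 'I5-E3': 2}
```

In this toy case the E5-I3 alignment shares six identical nucleotides counted back from the junctions, whereas the I5-E3 alignment shares only two, which is the kind of asymmetry the LIN ≥6 threshold is meant to capture.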


To examine the distributions of E5-I3 and I5-E3 alignments over the full range of LIN from 0 to ≥20, we plotted their frequencies for human and D. melanogaster (Fig.3a, b; mouse, zebrafish, chicken and C. elegans omitted and available on request). The black arrows delimit the window in which the value observed for E5-I3 (blue) was significantly higher than that for I5-E3 (red), as judged by a U-test with p < 0.001, and the values for all vertebrates were significantly higher than those for random sequences (Fig. 3a,b, black). For large LIN (≥10 for human and C. elegans, ≥8 for D. melanogaster and ≥9 for the others), no significant difference was seen between E5-I3 and I5-E3 at distances more than 10 nt from the splice junctions, suggesting that the asymmetry is restricted to the vicinity of the splice sites. For both invertebrates, the virtually complete absence of introns with long LINs is consistent with few recent intron gains. Moreover, the observed bias is not the result of multiple, linked evolutionary events in a few genes for any of the six organisms (data not shown). The E5-I3 and I5-E3 alignments were also compared to scrambled (mix-and-match) data produced by randomly aligning E5 with I3, and I5 with E3, from a non-redundant intron dataset, and again a statistically significant difference was observed in all cases (data available on request).
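The scrambled (mix-and-match) control can be sketched as follows for the E5-I3 case. This is a minimal illustration under stated assumptions: the helper names, the fixed random seed and the toy sequences are hypothetical, and the sketch only computes the fraction of alignments at or above a LIN threshold; it does not perform the U-test itself (a Mann-Whitney U implementation such as `scipy.stats.mannwhitneyu` could be applied to the per-intron LIN values).

```python
import random


def lin_at_junction_end(a: str, b: str) -> int:
    """LIN for two sequences that both end at a splice junction (E5 vs I3)."""
    n = 0
    for x, y in zip(reversed(a.upper()), reversed(b.upper())):
        if x != y:
            break
        n += 1
    return n


def fraction_lin_at_least(pairs, threshold: int = 6) -> float:
    """Fraction of (E5, I3) pairs whose LIN is >= threshold."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if lin_at_junction_end(a, b) >= threshold)
    return hits / len(pairs)


def scrambled_pairs(e5_list, i3_list, seed: int = 0):
    """Mix-and-match control: pair each E5 with a randomly reassigned I3."""
    rng = random.Random(seed)   # fixed seed only to make the sketch reproducible
    shuffled = list(i3_list)
    rng.shuffle(shuffled)
    return list(zip(e5_list, shuffled))


# Toy data (invented): compare observed pairs with one scrambled replicate.
e5 = ["AACCAG", "AAAAAA", "ACGT"]
i3 = ["TTCCAG", "AAAAAA", "ACGA"]
observed = fraction_lin_at_least(zip(e5, i3), threshold=4)
control = fraction_lin_at_least(scrambled_pairs(e5, i3, seed=1), threshold=4)
print(observed, control)
```

The same pattern applies to the I5-E3 control by counting forwards from the junction instead of backwards. In practice one would probably draw many scrambled replicates rather than a single one.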


Figure3
Fig.3 Comparison of LIN (length of identical nucleotides) distributions for E5-I3 and I5-E3 alignments from various animals. a) human, b) mouse, c) zebrafish, d) chicken, e) C. elegans and f) D. melanogaster. The solid blue squares and open red triangles represent E5-I3 and I5-E3 alignments, respectively. The dashed lines show the random sequence controls. The black arrows delimit the windows for which the frequencies of E5-I3 were statistically significantly higher than those of I5-E3 (p < 0.001), with the exception of one case in each of zebrafish and human (p < 0.05).

Because it is known that U1 snRNA, in addition to base-pairing with sequences at the 5’ end of the intron, also imposes a strong constraint on the terminal AG of the upstream exon (within E5), as does the binding of U2AF35 to the cAG region at the 3’ end of the intron (within I3), we repeated the analysis omitting the sequences located at positions -3 to +3 of both the 5’ and 3’ splice sites. The frequencies of LINs with values ≥1 and ≥5 for human E5-I3 and I5-E3 were significantly higher (p<0.05) than those of the corresponding scrambles. Taken together, our analyses indicate that the known preference of AG at the 5’ (E5) and 3’ (I3) splice sites, although strong, is not entirely responsible for the observed E5-I3 bias. Thus, the excess of introns with high LIN values for the E5-I3 alignment (compared to I5-E3) appears not to be due simply to the conservation of sequences that are part of the splicing consensus motifs.
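One way to picture the repeated analysis is to trim the consensus-constrained nucleotides from each side of both junctions before scoring. The sketch below assumes that omitting positions -3 to +3 means dropping three nucleotides on each side of each splice junction and then counting identical nucleotides from the new boundary; that interpretation, the function names and the toy sequences are assumptions, not details confirmed by the text.

```python
def run_length(a: str, b: str) -> int:
    """Length of the leading run of identical nucleotides in a and b."""
    n = 0
    for x, y in zip(a.upper(), b.upper()):
        if x != y:
            break
        n += 1
    return n


def masked_lins(e5: str, i5: str, i3: str, e3: str, skip: int = 3):
    """LIN for E5-I3 and I5-E3 after dropping `skip` nt next to each junction.

    Assumption: positions -3..-1 are the last `skip` nt of E5 and I3, and
    positions +1..+3 are the first `skip` nt of I5 and E3.
    """
    e5_trim = e5[:len(e5) - skip]
    i3_trim = i3[:len(i3) - skip]
    i5_trim = i5[skip:]
    e3_trim = e3[skip:]
    # E5 and I3 end at the junction, so compare them from the right-hand side.
    return run_length(e5_trim[::-1], i3_trim[::-1]), run_length(i5_trim, e3_trim)


# Toy sequences (invented): the shared terminal CAG no longer contributes.
print(masked_lins("ACGTTCCAG", "GTAAGTCC", "CCCTTTTCCAG", "GTTACC"))
# (3, 1)
```

With the terminal CAG removed, any remaining E5-I3 identity cannot be attributed to the AG preference at the two sites, which is the point of the control.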

This disparity suggests that E5 sequences resemble those of their ancestors, the self-splicing group II ribozymes, in which intron-binding sequences in the 5’ exon (IBS1 and IBS2) are complementary to specific exon-binding sites (EBS1 and EBS2) within domain I of the intron, in addition to a long-range single base-pair interaction at the 3’ splice site; these interactions are important for the specificity of splicing. We can hypothesize that EBS1, EBS2 and EBS3 have evolved into separate molecules (RNAs or proteins), just as the conserved domains of group II ribozymes have evolved into the U1, U2, U4, U5 and U6 snRNAs of spliceosomes. Therefore, we propose that both 5’ exonic and 3’ intronic sequences constitute the splicing codes of spliceosomal introns, which are decoded by as yet uncharacterized trans-acting splicer RNAs/proteins, as first proposed by V. Murray and R. Holliday in 1979.

From this splicing codes theory, we can make the following predictions: