Landscapes and Dynamics of Fusion Transcripts in Healthy Human Tissues

Fusion transcripts, or chimeric mRNAs, are traditionally thought to be hallmarks of cancer and have been used as cancer diagnostic biomarkers and therapeutic targets. However, previous studies have also detected fusion transcripts in normal healthy tissues. Recent rapid advances in RNA-seq technologies make it possible to systematically investigate fusion transcripts in cancerous and normal tissues. More than 20 software tools have been developed to identify fusion transcripts, but these tools are not able to identify fusion transcripts both accurately and quickly. Using splicingcode theory, we have developed SCIF (SplicingCodes Identify Fusion Transcripts) to identify fusion transcripts resulting from genomic instabilities (such as CNVs, insertions, deletions, inversions, and translocations), read-through, and trans-splicing. SCIF incorporates methods to eliminate “spurious” fusion transcripts produced during the generation of RNA-seq data and false fusion transcripts introduced by the software itself. Here, we report the use of SCIF to analyze fusion transcripts in RNA-seq datasets of normal tissues from GTEx and from the Science for Life Laboratory, Stockholm, Sweden (referred to as HPA). To reduce variation arising from different laboratories and experimental protocols and to generate consistent data, we focus here only on the analysis and results of the HPA RNA-seq normal tissues.

The HPA dataset comprises a total of 203 RNA-seq samples from 32 healthy human tissues of 127 different individuals and contains 6,917 million RNA-seq reads. We identified a total of 33,975 fusion transcripts with unique fusion junctions, an average of 4.9 fusion transcripts per million reads (FTPM). A total of 1,815 fusion transcripts were detected in more than 2 samples, while 944 fusion transcripts were observed in over 2 different tissues. The analysis shows that skeletal muscle and salivary gland are poor in fusion transcripts, with 1.94 and 2.80 reads per million (RPM), respectively. In contrast, bone marrow is rich in fusion transcripts, with 33.1 RPM, about 17-fold that of skeletal muscle. These tissue differences in fusion transcripts are supported by other datasets and suggest that fusion transcripts may be associated with cellular functions. In addition, some fusion transcripts are tissue-specific. For example, fusion transcripts involving the B2M gene are most likely to be found in bone marrow.
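The normalized figures quoted above can be reproduced with simple arithmetic. The following is a minimal sketch, assuming that both FTPM and RPM are counts divided by the number of sequenced reads in millions; the function and variable names are illustrative only and are not part of SCIF.

```python
# Figures quoted in the text (HPA dataset).
TOTAL_FUSION_TRANSCRIPTS = 33_975   # unique fusion junctions
TOTAL_READS_MILLIONS = 6_917        # 6,917 million RNA-seq reads


def per_million(count: float, reads_millions: float) -> float:
    """Normalize a count by sequencing depth expressed in millions of reads.

    Assumption: this is the normalization behind FTPM and RPM in the text.
    """
    if reads_millions <= 0:
        raise ValueError("reads_millions must be positive")
    return count / reads_millions


# Average fusion transcripts per million reads across the dataset.
ftpm = per_million(TOTAL_FUSION_TRANSCRIPTS, TOTAL_READS_MILLIONS)

# Per-tissue RPM values as quoted in the text.
rpm = {"skeletal muscle": 1.94, "salivary gland": 2.80, "bone marrow": 33.1}

# Fold difference between the richest and the poorest tissue.
fold = rpm["bone marrow"] / rpm["skeletal muscle"]

print(f"FTPM: {ftpm:.1f}")
print(f"Bone marrow vs. skeletal muscle: {fold:.1f}-fold")
```

The same helper could be applied per sample or per tissue if the corresponding read counts were available; those per-sample counts are not given here, so they are not shown.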

The most frequent fusion transcripts are KANSARL (KANSL1-ARL17) fusion transcripts, which were detected in 78 of 203 samples and 27 of 32 tissues. Recently, we validated the KANSARL fusion gene as the first cancer predisposition fusion gene specific to populations of European ancestry. Six additional fusion genes similar to KANSARL have been identified, suggesting that familially inherited fusion genes may play essential roles in human genetic diversity. Another highly recurrent fusion transcript is the read-through MTG1-SCART1, which is expressed as seven isoforms. The main MTG1-SCART1 fusion transcripts were detected in 75 samples and 28 different tissues. However, these MTG1-SCART1 fusion transcripts are highly variable among tissues and among individuals, suggesting that the variation may be caused by alterations in cellular physiology, environment, and epigenetics, as well as by potential genetic mutations.

In summary, we have identified large numbers of fusion transcripts in healthy human tissues, many of which are highly recurrent and may reflect the complex consequences of genetic, developmental, epigenetic, and environmental interactions in healthy individuals. More systematic efforts are required to characterize these fusion transcripts.