Analysis of Transcriptome Complexity Through RNA Sequencing in Normal and Failing Murine HeartsNovelty and Significance
Rationale: Accurate and comprehensive de novo transcriptome profiling in heart is a central issue to better understand cardiac physiology and diseases. Although significant progress has been made in genome-wide profiling for quantitative changes in cardiac gene expression, current knowledge offers limited insights to the total complexity in cardiac transcriptome at individual exon level.
Objective: To develop more robust bioinformatic approaches to analyze high-throughput RNA sequencing (RNA-Seq) data, with the focus on the investigation of transcriptome complexity at individual exon and transcript levels.
Methods and Results: In addition to overall gene expression analysis, the methods developed in this study were used to analyze RNA-Seq data with respect to individual transcript isoforms, novel spliced exons, novel alternative terminal exons, novel transcript clusters (ie, novel genes), and long noncoding RNA genes. We applied these approaches to RNA-Seq data obtained from mouse hearts after pressure-overload–induced by transaortic constriction. Based on experimental validations, analyses of the features of the identified exons/transcripts, and expression analyses including previously published RNA-Seq data, we demonstrate that the methods are highly effective in detecting and quantifying individual exons and transcripts. Novel insights inferred from the examined aspects of the cardiac transcriptome open ways to further experimental investigations.
Conclusions: Our work provided a comprehensive set of methods to analyze mouse cardiac transcriptome complexity at individual exon and transcript levels. Applications of the methods may infer important new insights to gene regulation in normal and disease hearts in terms of exon utilization and potential involvement of novel components of cardiac transcriptome.
Regulation of gene expression has a critical role in normal cardiac function and pathogenesis of heart failure. A global change in cardiac transcriptome from normal to one with characteristics of “fetal-like” profile is a major part of the pathological remodeling in failing hearts.1–3 Although much insight has been learnt from transcriptome profiling studies using microarray technologies, limitations in coverage and sensitivity still leave a significant part of the cardiac transcriptome landscape unexplored, especially concerning expression and variation at single exon resolution. Recent advances in high-throughput sequencing technologies are enabling a new way to study transcriptomes: massively parallel sequencing of short reads derived from mRNAs (RNA-Seq).4,5 Compared with microarray technologies, RNA-Seq was shown to enable more accurate quantification of gene expression levels.6,7 More importantly, RNA-Seq does not require a priori annotation of gene and transcript structures. It allows not only in-depth studies of expression changes in known genes and alternative isoforms but also unbiased characterization of novel exons and novel transcript clusters. It also enables investigation of long noncoding RNA (lncRNA) genes, which are not usually targeted by alternative transcriptome profiling methods, such as microarrays. Thus, RNA-Seq opens the way to de novo transcriptome reconstruction and discovery of novel transcripts of any mammalian cell. Indeed, recent reports using RNA-Seq to profile transcriptome in mouse heart have revealed interesting new insights in cardiac transcriptional and signaling networks in genetic models of heart failure.7–10
In the present study, we developed bioinformatic methods to identify transcript structures and analyze transcriptome complexities with a particular emphasis on quantification of RNA splicing variants at single exon resolution using RNA-Seq data of normal and failing murine hearts. The methods take full advantage of the strength of RNA-Seq. We show that they allowed in-depth profiling and quantification of alternative mRNA structures, novel exons, novel transcript clusters (NTCs) and long noncoding RNA genes. The results open ways to direct experimental investigation of these novel transcriptome features and highlight the power of RNA-Seq to provide a comprehensive bioinformatic delineation of disease-specific transcriptomes.
RNA-Seq Data Generation and Mapping
An expanded Methods section is available in the Data Supplement at http://hyper.ahajournals.org.
Left ventricular tissues were collected from male C57BL/6 mice after 1 week (hypertrophy stage, HY) and 8 weeks after transaortic constriction (TAC) procedure (heart failure stage, HF) and their corresponding sham controls (sham-HY, sham-HF) (Online Supplement and Online Table I). To conduct RNA-Seq analysis, total RNAs from 6 TAC and sham-operated mice at the HY stage and 4 TAC and corresponding sham mice at the HF stage were obtained. Paired-end RNA-Seq reads (2×72nt or 2×76nt long) were mapped to the mouse Ensembl transcript sequences (release 56), using Bowtie11 and BLAT.12 Mapping was first carried out for individual reads without considering the read-pairing information. Next, all read pairs were inspected for correct pairing by considering whether they map to the same chromosome, potentially in the same gene and with correct orientation relative to each other (Online Supplement). The pair of reads was considered as uniquely mapped if and only if 1 unique pair of mapped locations was identified.
Analysis of Gene and Transcript Isoform Expression
Levels of gene and exon expression were quantified using the RPKM measure13 and a minimum RPKM value of 3 (≈1 copy per cell) is required for expressed genes/isoforms (see Results for justification of this cutoff). Gene expression differences were evaluated using Fisher's exact test after normalizing by the total number of mapped reads in each lane using the upper-quartile normalization method.14 The resulted probability values were corrected through the Benjamini and Hochberg method. Differentially expressed genes were defined as those with changes of at least 1.5-fold between a pair of samples at a false discovery rate (FDR) of 5% for genes expressed at ≥3 RPKM in ≥1 sample. The Cufflinks software (v 0.9.2)15 was used to estimate expression levels of individual isoforms of an Ensembl gene, which allowed identification of isoform-specific expression changes due to alternative transcription start site (ATSS) or alternative splicing (AS). To further assess overall isoform expression dissimilarity in 2 samples (A and B), a dissimilarity score was defined based on the Morisita-Horn similarity index as follows:
where Pi(A) and Pi(B) represent the expression of isoform i normalized by overall gene expression in the sample A or B. We only considered genes expressed at ≥3 RPKM in this analysis.
We developed 2 different methods for the reconstruction of transcript isoforms: (1) guided transcriptome reconstruction: a method to reconstruct isoforms and discover novel exons within known genes; and (2) de novo reconstruction: a method to reconstruct completely new isoforms independent of known gene annotations. Details of the 2 approaches are presented in the Online Supplement. Briefly, the following steps are common to both approaches: (1) define expressed sequence fragments (seq-frags); (2) identify connections between seq-frags based on reads mapped to spliced junctions; (3) generate a directed graph using seq-frags and their connections for each gene or each chromosome; and (4) construct the isoforms by finding all possible paths in the graph. In guided transcriptome reconstruction, novel seq-frags can represent novel exons or extended regions of known exons. Identification of exon boundaries (novel or known) depends on the presence of reads mapped to spliced junctions. Thus, to reduce possible false-positive isoforms, we required at least 2 junction reads as evidence of a splicing event. In de novo reconstruction, NTCs were identified in intergenic regions and clustered together on each chromosome. The boundaries of NTCs were decided by confirming the absence of spliced junctions or expressed seq-frags. Filters for minimum expression levels of seq-frags and canonical splicing signals were applied. The coding potential of NTCs was evaluated using the Coding Potential Calculator software.16
Statistical and Computational Methods
For gene expression analysis, the statistical significance was assessed by Fisher's exact test as described above. Pearson correlation coefficients for gene expression validation were calculated in R. For GO analysis, empirical probability values were estimated based on 10 000 randomized simulations (Online Supplement), and the Bonferroni cutoff was used to determine significant probability values. All other computational procedures including transcript isoform reconstructions were carried out using in-house programs written in Python, Perl, and R.
Mapping of RNA-Seq Reads
We obtained a total of 168 million pairs of reads using the standard paired-end RNA-Seq protocol on the Illumina GA II sequencer. Table 1 shows the number of reads in each TAC/sham group and the mapping results. Our mapping procedure (Figure 1) ensures that reads generated by both known genes and novel transcribed regions were identifiable. In addition, it enabled detection of novel and known spliced junctions that connect 2 or more exons intervened by long introns. The usage of paired-end sequencing brings the advantage of improved mapping performance. We estimated that 4% of all the original reads were mapped nonuniquely as singletons but uniquely as pairs. Ambiguous mapping results can be removed by examining the pairing of reads. For example, 12% of all reads were categorized as unmapped in the paired-end mode but mapped uniquely as singletons (possible mapping errors in the single-end mode). Only pairs of reads that mapped uniquely to the transcriptome and/or the genome were retained for further analyses (Figure 1). Among the 168 million pairs of reads obtained in our study, about 95 million (57%) were uniquely mapped in total, which covered 72% of the known exon-exon junctions, 82% of exon bodies and 77% of known genes.
Analysis of Gene Expression Levels Through RNA-Seq
RNA-Seq has been demonstrated to be an effective approach for gene expression profiling in mouse heart.7–9 One advantage of this method is its ability to provide quantitative read-out of the mRNA expression levels in 1 sample, in contrast to microarrays that just permit comparative analyses without absolute expression values. Consistent with the previous studies,7–9 we also observed a wide dynamic range of expression values varying from about 1 copy per cell (3 RPKM) for the Ankrd12 gene to 8,048 copies per cell for the mt-Co1 gene in the sham hearts. In this work, we use 3 RPKM as the minimum cutoff to filter for expressed genes13 considering its biological relevance and the fact that some genes (eg, Kcnd2, Kcnd3) with heart-related function are expressed at about 3 RPKM in shams. Other RNA-Seq studies also showed that low abundant transcripts expressed at about 1 copy per cell include transcriptional factors and other functionally important genes for cardiac regulation.7–9 In addition, in a PubMed search of published abstracts, we found that genes with ≥3 RPKM expression levels are about twice as often associated with the keywords “heart” and “cardiac” than those expressed at <3 RPKM. Altogether, 9833 genes (29% of all Ensembl genes) are expressed at ≥3 RPKM in at least 1 sample in our study.
Another advantage of the RNA-Seq method over microarrays is the improved quantification of differential gene expression between samples.7 We validated expression changes of 42 genes using real-time PCR (Online Figure I and Online Table II). This validation demonstrated that the results from RNA-Seq and real-time PCR are highly concordant (r=0.90, Pearson correlation). To establish biologically meaningful criteria to determine significant differential gene expression, we examined the levels of genes known to be altered in failing hearts, including Myh7, Egr1, Nppb, Pln, and Actb. We confirmed that all the above genes had expression changes of at least 1.5-fold between the HF and sham-HF samples in the real-time PCR or RNA-Seq. Thus, we used the following criteria to identify differentially expressed genes: (1) gene expression level ≥3 RPKM in either sham or HY/HF or both; (2) change in expression level ≥1.5 fold; and (3) Fisher's exact test (Methods) FDR <5%. Altogether, 97 and 1435 genes passed the above filters between sham-HY and HY, sham-HF and HF, respectively (Online Table III). The number of genes and their magnitudes of changes are larger at the HF stage, consistent with the expected higher degrees of cardiac remodeling in HF contributed by vascular remodeling, inflammatory response and fibrosis in the myocardium. In addition to the well-known HF or HY-related genes, we found genes with significant alteration in expression but that have never been implicated in heart failure or cardiac hypertrophy, such as Prc1, E2f1, Birc5, Iqgap3, Cdc20, and Cdca8. Our results suggest that RNA-Seq may provide novel insights in overall gene expression as demonstrated previously.7–9
Analysis of Alternative Transcript Isoforms Through RNA-Seq
One of the main voids in our current knowledge of cardiac transcriptome is the genome-wide profile of mRNA splicing variants, a very challenging task not readily accomplishable by most microarray platforms. For this purpose, we used the package Cufflinks,15 which was shown to effectively capture isoform-specific expression and alteration in RNA-Seq data. Although Cufflinks has a number of modules for different purposes, we focused on its usage to infer expression levels and differential expression of individual transcript isoforms. As inputs to Cufflinks, we used our read mapping results described above and the set of Ensembl-defined genes and their spliced variants. The novel exons and NTCs identified in our study were not included because the nature of the short sequencing reads limits the accuracy in predicting complete structures of spliced variants that are needed to estimate their expression levels.
We first examined the absolute isoform expression estimated by Cufflinks. For genes with multiple transcript isoforms, we detected a total of 7811 isoforms with expression level ≥3 RPKM in at least 1 sample. The most abundant isoform was from the gene Myl2 (Myosin regulatory light chain 2) in the sham-HF sample. The isoforms of other heart-related transcripts such as Actc1, Atp2a2, Myh6, Tnnt2, and Tpm1 were also highly expressed. We observed spliced variants for numerous genes (eg, Atp2a2, Cacna1c, Slc6a8, and Ank2) known to undergo alternative splicing in cardiac tissues.7 For the gene Camk2, we confirmed that its neuronal-specific isoform is much less expressed than the heart-specific isoforms (2.96 versus 22.11 RPKM) in the sham-HF sample. These findings suggest that RNA-Seq can readily detect isoforms of a gene due to alternative RNA splicing.
We next analyzed the differential expression patterns of individual transcript isoforms. Cufflinks analysis identified 1087 genes (mostly protein-coding genes) with significant isoform-specific expression changes (q-value <0.05) due to ATSS or AS or both (Figure 2A and Online Table IV). As examples, Figure 2B shows 2 genes (with ATSS and AS, respectively), their reads distributions, and RT-PCR validation results. If the same criteria were applied as for determining differential gene expression (q-value <0.05, fold change ≥1.5, and expression level ≥3 RPKM), a total of 720 isoforms in 475 genes were identified as differentially expressed. Overall, genes with ATSS and AS are both enriched in biological processes related to muscle function and ATP synthesis (Online Table V).
Similar to our findings in overall gene expression changes, more genes were found to have altered isoform expression in the HF stage than the HY stage. To further compare the 2 stages, we calculated a dissimilarity score that quantifies the overall isoform difference of a gene between a pair of samples (see Methods). This measure is independent of gene expression levels. Figure 2C shows that most genes have similar scores between the HY and HF stages (data distributed close to the diagonal line). However, a significant number of genes (n>250) have dissimilarity scores differing by more than 0.2, suggesting a significant change in isoform usage at different stages of heart failure. Interestingly, among these genes, many (eg, Garnl1, Sipa1, Rgs12, Rin2, and Rabgap1) are known to be involved in processes well-studied in heart failure. In addition, genes involved in chromatin and histone modifications (such as Hdac7, Ezh1, and Aof2) also demonstrated stage-specific isoform expression changes. Therefore, the quantitative gene isoform analysis suggests a global change in exon utilization due to alternative RNA splicing that can potentially affect functionally important genes in failing hearts.
Guided Transcriptome Reconstruction for Novel Transcripts of Known Genes
Another major advantage of the RNA-Seq approach is the capability to discover previously unknown transcript isoforms. We thus developed a guided transcriptome reconstruction method to enable identification of novel isoforms in known genes (Online Supplement). In this method, RNA-Seq reads that mapped inside or in the vicinity of Ensembl genes were examined. Reads that do not support the known Ensembl transcript structures (eg, those in the intronic regions) may suggest existence of novel transcripts. However, such reads may also arise from other sources such as incompletely processed transcripts, degradation intermediates of introns, or mapping errors. To reduce false-positives, we implemented 2 additional requirements to define novel transcripts. First, we applied stringent filters for expression levels of the novel fragments (details in Online Supplement). Second, because novel exons should be spliced to other exons, we required the existence of at least 2 spliced junction reads flanking each end of the novel exon. When the detected novel fragments were identified to be extensions of known internal exons or terminal exons, they would indicate new exon splicing pattern, or alternative transcriptional initiation or termination events.
For the 10 061 multi-exon genes (defined by Ensembl and/or our isoform reconstruction) with an expression level of ≥3 RPKM, 5112 (51%) were detected with novel isoforms (with novel exons or novel splicing patterns among known exons) as a result of alternative splicing, 1651 (16%) as a result of alternative initiation or termination, and 830 (8%) with both types of novel isoforms. Novel transcript structures were identified across a broad range of expression levels (Online Figure II), with more isoforms detected for higher expressed genes (most likely due to higher read coverage). Therefore, our findings suggest a significant deficiency in the current mouse transcriptome annotation (Ensembl v 56 used in this analysis) and deep RNA-Seq combined with guided transcriptome reconstruction can provide a much more comprehensive profile of the total complexity in transcript structures.
Evaluation of Novel Spliced Exons Identified by Guided Transcriptome Reconstruction
Although a large number of genes were identified with novel spliced variants, it is possible that many of them were resulted from random errors or noise in the process of splicing or transcription detected by the highly sensitive RNA-Seq method. Thus, we analyzed in detail the novel exons in the spliced variants to determine if our method leads to findings with potential biological significance. A total of 1873 novel exons were identified corresponding to different types of alternative splicing events (Table 2 and Online Table VI). Because the Ensembl v 56 database was used as a reference to define known genes and exons and updated databases now exist, we examined whether the above novel exons are annotated as known exons in the new Ensembl v 61 database (April 2011), or the UCSC, RefSeq databases. Indeed, 26% of the novel spliced exons identified from our original analysis are now “known” to one or more of the above databases (Table 2). This serves as a validation of our approach for identifying the novel exons and we refer to those exons that remain to be not annotated in the above databases (1384 in total) as “updated novel exons.” We validated 29 of the randomly selected updated novel exons through RT-PCR and the expression of 97% of them was confirmed (Online Table VII).
To provide further evaluation on the identified “updated novel exons,” we performed additional bioinformatic analyses on these novel exons that were alternatively skipped, the most common type of alternatively spliced exons. We first examined whether the novel exons have features that resemble those of known skipped exons. The following features were considered: evolutionary conservation, expression level, exon length, and splice site strength. The conservation level and expression level of the novel exons, although lower than those of the known skipped exons, are significantly higher than intronic regions with matched GC content and length (Figure 3A and 3B). About 54% of the novel skipped exons have a length that is multiples of 3, significantly higher than expected (probability value=3.1e-07, χ2 test). This observation implies the existence of strong selection on the alternative protein products derived from these exons. Approximately 96% of the novel skipped exons are flanked by GT-AG in the immediate intronic regions, representing consensus splice site sequences. Figure 3C shows the result of principal component analysis of the above features in the novel and the known exon populations, respectively, which suggests that the 2 groups of exons are largely similar. The similarities of the novel exons to the known alternative exons suggest that they are likely authentic exons with biological function.
To further evaluate their biological significance, we then analyzed the expression patterns of the novel exons in more detail. If a novel exon exists due to nonfunctional random splicing noise, its absolute expression level is most likely low and similar across different samples. However, we found about 72% of the novel exons had an expression level of ≥3 RPKM in at least 1 of the samples used in our study. In examining the expression difference of the novel exons in the HY/HF and sham controls, we observed that a substantial fraction (682 exons, 68% of all with ≥3 RPKM in absolute expression) had an expression difference of at least 1.5-fold (10 examples shown in Figure 3D, left panel). Furthermore, we computed the expression of the novel exons in other mouse tissues or cell types based on available RNA-Seq data.13,17,18 Interestingly, many novel exons with relatively low abundance in the mouse heart had much higher expression levels in one or more of other tissues (Online Figure III; examples in Figure 3D). Thus, our result suggests that the novel exons identified by RNA-Seq even with low expression level in heart may be authentic exons with biological roles in other tissues.
Finally, we analyzed the impact of these novel exons on the predicted protein products. A large fraction (60%) of them will introduce premature termination codons or are expected to induce nonsense-mediated decay, suggesting that many novel exons may have a major impact on the final protein expression (Online Supplement). Furthermore, we found evidence of translation for 174 of the detected novel exons in public proteomic databases, suggesting that these exons may contribute to the overall complexity of the proteome (Online Supplement).
Alternative Initiation and Termination Events
On the basis of the reconstructed transcript structures, we identified novel alternative 5′ terminal exons that differ from the Ensembl annotations. Two different categories of such events were defined: (1) 5′ terminal exon overlapping the annotated first exon but with extended regions or alternative splice sites supported by junction reads and (2) 5′ terminal exon not overlapping any annotated exon and occurring upstream of the annotated 5′ start sites. To be conservative and avoid the complication of incomplete transcript reconstruction, we excluded 5′ terminal exons whose start sites are downstream of the annotated 5′ start sites. Similar analyses were carried out to identify alternative 3′ terminal exons. Altogether, we identified 1613 exons (in 1535 genes expressed at ≥3 RPKM) with novel alternative initiation or termination events that are not annotated in the most recent databases including UCSC, RefSeq and Ensembl (v 61) (Online Tables VIII and IX). Figure 4 shows 2 examples of such events. Among these exons, about 469 differ in expression by at least 1.5-fold between HY and sham or HF and sham. In addition, the corresponding genes significantly overlap the list of differentially expressed genes (162 genes in common, P=4.2e-13). Interestingly, 39% of the alternative 3′ terminal exons contain predicted target sites of known miRNAs expressed in mouse heart (Online Supplement), suggesting that the alternative terminal events may be functionally involved in gene regulation.
Methods to Identify and Evaluate Novel Transcript Clusters
In addition to the guided transcriptome reconstruction, we also conducted de novo transcript identification for those reads that match to genomic regions with no annotated genes (Methods). We focused on NTCs corresponding to genes with multiple exons only. We refer to an NTC as a cluster of all possible transcript isoforms. It is possible that an NTC contains false-positive isoforms because the nature of the short reads in RNA-Seq does not allow identification of complete transcript structures. In addition, complete profile of all isoforms may not be identifiable due to low read coverage. Nevertheless, each NTC suggests the possible existence of a novel gene with multiple transcript isoforms.
Using these criteria, 1884 NTCs were identified that do not overlap any Ensembl genes (v 56), all of which were multi-exon transcripts (Online Table X). Among the 5869 exons within these clusters, 8% had spliced junction reads suggesting alternative splicing, with the most prevalent type being alternatively skipped exons (Table 2). Of all NTCs, 863 (46%) are now annotated as known genes in the most recent databases of UCSC, Refseq and Ensembl (v 61), supporting the validity of our method in identifying novel genes. We validated the expression patterns of 7 NTCs (3 newly annotated and 4 remain novel) using real-time PCR in the same mouse samples as used for RNA-Seq (Online Figure I and Online Table II). The results confirmed the expression of all 7 NTCs and also showed that RNA-Seq and real-time PCR gave highly consistent measures in gene expression changes (r=0.87, Pearson correlation).
Next, we focused on the 1021 NTCs that are still not annotated in the newest databases. To infer biological significance, we analyzed their expression patterns in our samples and the other mouse RNA-Seq data sets mentioned above.13,17,18 Remarkably, many of these NTCs with low abundance in the heart had much higher expression levels in other tissues (examples shown in Figure 5A). A total of 199 NTCs passed the minimum expression level cutoff of 3 RPKM in at least 1 sample. When examined for differential expression using the same criteria as for known genes, 195 NTCs were differentially expressed between at least 1 pair of samples (34 between HF/HY and sham controls). In addition, we found that the NTCs may be more tissue-specific than annotated genes evaluated by an entropy-based tissue specificity index (Online Supplement and Figure 5B). These results suggest that many NTCs may be actively regulated and may have functional ramifications in specific tissues.
We next classified the NTCs into coding and noncoding clusters. Among all 1021 NTCs, 105 clusters (containing 315 possible transcript isoforms) were found to have significant coding potential.16 Interestingly, the coding clusters had significantly higher expression and conservation levels than the noncoding clusters (Online Figure IV), consistent with the existence of stronger selection pressure on protein-coding sequences as observed in known genes. Indeed, we found that 46% (48 of 105) of the putatively coding NTCs had sequence matches in public proteomic databases (Online Supplement). These results suggest that our de novo transcript identification method can effectively identify novel transcripts from RNA-Seq data sets, and the novel genes identified in heart contribute to the total complexity of cardiac transcriptome and proteome.
Inference of Potential Function of Novel Transcript Clusters
Functionally related genes involved in the same biological pathways or protein interaction networks are often regulated by similar transcription factors or other gene regulators. Thus, one approach to infer the potential function of novel genes is by determining whether their expression patterns correlate with those of known genes of certain function, based on coexpression analysis.19 Note that such analyses only provide tentative functional indications of a gene. However, they may help to formulate hypotheses for further experimental studies. We applied this scheme to examine the potential functions of NTCs. Using the WGCNA method,20 we constructed coexpression networks encompassing all known genes and 98 NTCs discovered in the heart samples (≥3 RPKM). We identified 52 network modules in total and focused on 20 that are enriched with differentially expressed genes between HY/HF and the corresponding shams. Genes in these modules are significantly associated with GO categories related to heart and muscle functions (Online Table XI). A total of 58 NTCs were included in the 20 significant modules. Among them, we analyzed the 10 most highly connected NTCs (ie, with highest connectivity) in each module that had at least 10 neighboring known genes. Fifteen NTCs from 6 modules were chosen in this way (example shown in Figure 6A). A complete list of GO categories related to each of the 15 NTCs is included in Online Table XII.
Long Noncoding RNA Genes
Among the novel multi-exon transcript clusters, a large fraction (81%) had low coding potential (Online Table X) and thus most likely belongs to the category of lncRNA genes. LncRNAs were recently recognized to have diverse functions in gene regulation and may contribute to disease etiology.21 To assess the expression of lncRNAs in mouse heart failure, we combined our noncoding NTCs with the Ensembl and other mouse lncRNA data sets18,22 to constitute a comprehensive repository of 5767 lncRNAs. A total of 703 lncRNAs were expressed at levels ≥3 RPKM in our RNA-Seq data of at least 1 sample. Interestingly, for lncRNAs expressed below 3 RPKM, 62% of them had an expression level of at least 3 RPKM in 1 or more of the publicly available mouse RNA-Seq data.13,17,18 Figure 5B shows that lncRNAs have higher tissue specificity in their expression than protein-coding genes on average. Among all expressed lncRNAs (≥3 RPKM in ≥1 of our samples), 15 and 135 are differentially expressed (≥1.5-fold change, FDR <5%) between HY and sham-HY and between HF and sham-HF, respectively. Intriguingly, the well-known H19 gene demonstrated significant upregulation in the HF stage compared with the corresponding sham controls (Figure 6B). This gene is highly expressed during embryogenesis and was shown to have tumor-suppressor activity.23 However, its role in heart failure is not yet clear.
Accurate and de novo transcriptome profiling is a central issue in disease research. We describe methods and applications of transcriptome analysis through RNA-Seq in normal and failing murine hearts. As clearly demonstrated by Matkovich et al,7 RNA-Seq has both accuracy and sensitivity to detect transcripts with low abundance (such as those encoding transcription factors), so new gene expression networks associated with heart failure can be established (such as transcriptional networks). Our work highlights the power of RNA-Seq in de novo transcriptome profiling given its accuracy and the independence of a priori knowledge of transcripts to be analyzed. We used newly developed bioinformatic tools, namely a combination of guided transcriptome reconstruction and de novo reconstruction approaches, to identify novel exons and transcripts in normal and diseased mouse hearts. As demonstrated by our validation studies, the vast majority of the identified novel exons or transcripts are confirmed by RT-PCR for their expression in heart. We showed that the sequence and conservation features of the novel exons are generally similar to those of the known alternatively spliced exons. In addition, both novel exons and NTCs were found to be expressed in a tissue-specific manner. These properties suggest the potential functional significance of the novel isoforms and novel transcripts. They also suggest that the RNA-Seq technology, combined with bioinformatic analysis, is a sensitive tool to comprehensively profile the transcriptome complexity at individual exon resolution.
It is somewhat unexpected to us that cardiac transcriptome encompasses such a large number of novel transcripts and exons with potential biological significance that have never been annotated despite extensive genome-wide profiling of cardiac transcriptome in the past decade. These findings open the way to further experimental investigations of their relevance in the pathogenesis of heart failure. Yet, our study does have limitations. Because the findings are mainly based on bioinformatic observations, their functional relevance would need to be further established at molecular and cellular levels experimentally. In addition, the poly-A selection procedure during RNA-Seq library preparation may introduce a positional bias in the coverage of the entire transcript (favoring the 3′ end) and not all expressed RNAs can be included in the library. In this study, we focused on pressure-overload–induced heart failure, and whether the findings are general to other types of heart failure models is unclear. Nevertheless, our analyses revealed previously uncharacterized complexity in the cardiac transcriptome as well as their dynamic changes during heart failure. This study highlights the need to use both RNA-Seq and bioinformatic tools to reevaluate cardiac transcriptome in other heart disease models as well as in human heart failure. Because the bioinformatic approaches developed in our study are generally applicable to any RNA-Seq data sets, we suggest that application of these new tools will vastly expand our current knowledge of transcriptome architecture and dynamics in general.
Sources of Funding
This work was supported in part by NIH grant R21DA027039, an Alfred P. Sloan Foundation Research Fellowship, and research grant No. 5-FY10-486 from the March of Dimes Foundation to X.X., an American Heart Association Postdoctoral Fellowship to J.H.L., and NIH grants HL088640, HL070079, HL103205, and HL098954 to Y.W.
In September 2011, the average time from submission to first decision for all original research papers submitted to Circulation Research was 16 days.
Data availability: RNA-Seq data are available at the Gene Expression Omnibus with ID GSE29446.
Non-standard Abbreviations and Acronyms
- alternative splicing
- alternative transcription start site
- heart failure
- long noncoding RNA
- novel transcript cluster
- RNA sequencing
- reads per kilobase of exon per million mapped reads
- corresponding sham control for hypertrophy
- corresponding sham control for heart failure
- Received May 27, 2011.
- Revision received October 18, 2011.
- Accepted October 20, 2011.
- © 2011 American Heart Association, Inc.
- Olson EN
- Marioni JC,
- Mason CE,
- Mane SM,
- Stephens M,
- Gilad Y
- Matkovich SJ,
- Zhang Y,
- Van Booven DJ,
- Dorn GW II.
- Zhang Y,
- Matkovich SJ,
- Duan X,
- Gold JI,
- Koch WJ,
- Dorn GW II.
- Zhang Y,
- Matkovich SJ,
- Duan X,
- Diwan A,
- Kang MY,
- Dorn GW II.
- Kent WJ
- Kong L,
- Zhang Y,
- Ye ZQ,
- Liu XQ,
- Zhao SQ,
- Wei L,
- Gao G
- Gregg C,
- Zhang J,
- Weissbourd B,
- Luo S,
- Schroth GP,
- Haig D,
- Dulac C
- Guttman M,
- Garber M,
- Levin JZ,
- Donaghey J,
- Robinson J,
- Adiconis X,
- Fan L,
- Koziol MJ,
- Gnirke A,
- Nusbaum C,
- Rinn JL,
- Lander ES,
- Regev A
- Stuart JM,
- Segal E,
- Koller D,
- Kim SK
- Wilusz JE,
- Sunwoo H,
- Spector DL
Novelty and Significance
What Is Known?
Accurate and de novo transcriptome profiling is a central issue in studying the mechanisms of cardiovascular development and diseases.
Understanding of the global cardiac transcriptome landscape is currently limited concerning expression and variation at single exon resolution.
Whole-transcriptome sequencing (RNA-Seq) offers a new way to study transcriptomes.
What New Information Does This Article Contribute?
We present bioinformatic methods to identify transcript structures and to analyze transcriptome complexities with a particular emphasis on quantification of RNA splicing variants at single exon resolution using RNA-Seq data of normal and failing murine hearts.
We validate the effectiveness and accuracy of our bioinformatic approaches, based on experimental confirmation and cross-database analyses.
We show that the bioinformatic analyses of RNA-Seq allow in-depth profiling and quantification of alternative mRNA structures, novel exons, novel transcript clusters and long noncoding RNA genes in mouse heart.
Transcriptome profiling offers detailed insight in understanding gene regulation in health and diseases. Previous technologies (eg, microarrays) for this purpose are limited in coverage and sensitivity. RNA-Seq using massively parallel next-generation sequencing platforms can potentially enable de novo and unbiased characterization of the transcriptome at individual exon resolution. However, as a result of the massive amount of raw data, RNA-Seq poses new challenges to data analysis and interpretation. We present bioinformatic methods that enable effective analysis of RNA-Seq data either integrating or independent of known gene annotations. We demonstrate that such analyses can provide a more comprehensive profile of the mouse cardiac transcriptome. Indeed, a large number of novel transcripts and exons with potential biological significance were found from this study that had never been annotated previously. These findings open the way to further experimental investigations of their relevance in the pathogenesis of heart failure. Our study highlights the need to use both RNA-Seq and bioinformatic tools to achieve comprehensive evaluation of cardiac transcriptome in heart disease models as well as in human heart failure.