Integration of Cardiac Proteome Biology and Medicine by a Specialized Knowledgebase
Rationale: Omics sciences enable a systems-level perspective in characterizing cardiovascular biology. Integration of diverse proteomics data via a computational strategy will catalyze the assembly of contextualized knowledge, foster discoveries through multidisciplinary investigations, and minimize unnecessary redundancy in research efforts.
Objective: The goal of this project is to develop a consolidated cardiac proteome knowledgebase with a novel bioinformatics pipeline and Web portals, thereby providing a new resource to advance cardiovascular biology and medicine.
Methods and Results: We created Cardiac Organellar Protein Atlas Knowledgebase (COPaKB; www.HeartProteome.org), a centralized platform of high-quality cardiac proteomic data, bioinformatics tools, and relevant cardiovascular phenotypes. Currently, COPaKB features 8 organellar modules, comprising 4203 LC-MS/MS experiments from human, mouse, drosophila, and Caenorhabditis elegans, as well as expression images of 10 924 proteins in human myocardium. In addition, the Java-coded bioinformatics tools provided by COPaKB enable cardiovascular investigators in all disciplines to retrieve and analyze pertinent organellar protein properties of interest.
Conclusions: COPaKB provides an innovative and interactive resource that connects research interests with the new biological discoveries in protein sciences. With an array of intuitive tools in this unified Web server, nonproteomics investigators can conveniently collaborate with proteomics specialists to dissect the molecular signatures of cardiovascular phenotypes.
- computational biology
- Omics science
- translational medical research
Recent studies on cardiovascular biology have been transformed by growing applications of Omics technologies.1–5 For example, large-scale proteomic investigations have uncovered the protein anatomy and dynamics of individual cardiac organelles as well as new principles of cardiac regulation in the form of widespread post-translational modifications. Numerous other large-scale data sets are now being generated, which enable investigators to ask more complex biological questions.
Efficient bioinformatics resources are vital for connecting these increasingly voluminous and diversified data to experts of various disciplines to formulate new biological insights.6,7 However, at this moment, the data are often distributed in forms that are not readily accessible, and as a result, they are often of limited value to researchers outside a particular area of expertise. Furthermore, efficient use of these data sets has been impeded by inconsistent annotation guidelines. Addressing these challenges, therefore, requires new computational tools and bioinformatics infrastructures to integrate, analyze, and visualize multidisciplinary data sets.8
Here, we present the Cardiac Organellar Protein Atlas Knowledgebase (COPaKB), a specialized resource for the cardiovascular community with 3 distinct components. First, it comprises comprehensive spectral libraries of individual cardiac organelles and a search engine for investigators to quickly identify proteins from supplied data sets with high coverage. Second, it contains a curated database and a set of bioinformatics tools to integrate the identified proteins with relevant biomedical attributes (eg, genetic mutations, disease phenotypes) and orthogonal biomolecular properties (eg, protein expression imaging, gene transcription activity; Figure 1). Lastly, COPaKB provides a unified Web portal with a robust Web service infrastructure to allow proteomic data to be efficiently analyzed, distributed, and queried. The Wiki component of the Web portal, in particular, facilitates interactions and collaborations among investigators, supporting a knowledge-building process in cardiovascular biology and medicine.
A primary benefit of this unified knowledgebase is that it contextualizes and distributes cardiac proteome data within a consistent set of standards. The data sets from multiple studies of different investigators can be combined through COPaKB as the shared reference for effective comparative analyses. Moreover, COPaKB Client is created to enable high-speed analysis of large data sets on a proteome scale. Overall, COPaKB encapsulates carefully curated data, new informatics schema, and an effective Web portal in a complete package. We anticipate this platform will broaden the use of proteomic data for the entire cardiovascular community and help bridge discovery-driven and hypothesis-driven studies.
Construction of a Modular Knowledgebase for Cardiovascular Proteome Biology
COPaKB contains the following components: a relational database supporting multiple modules of proteome knowledge, a Wiki interface promoting user input, and a computational toolbox facilitating data analyses (Figure 1). All components are regularly updated and maintained.
Structure and Organization of COPaKB
COPaKB contains a repertoire of protein properties based on their subcellular compartments. The underlying relational database is configured using spectral libraries as a backbone structure. The selection of representative spectra has been based on the cross-correlation score (Xcorr) assigned by ProLuCID.8 An Oracle database has been used to manage orthogonal proteomic data sets in COPaKB. Known associations between protein function and cardiac diseases were retrieved from Online Mendelian Inheritance in Man Web service9 and from peer-reviewed publications via PubMed using keyword combinations of protein name, gene symbol, and heart diseases.
Protein expression profiles probed with specific antibodies were integrated from the Human Protein Atlas (HPA; http://www.proteinatlas.org).10 The differential expression of gene transcripts was integrated from Gene Expression Atlas.11 Gene Ontology (GO) annotations were obtained using Universal Protein Resource (UniProt)12 and QuickGO services.13 The relationships among different GO terms were delineated using the Ontology Lookup Service14 of the European Molecular Biology Laboratory - European Bioinformatics Institute. The schema of this relational database is readily expandable to accommodate additional forms of knowledge (Figure 1). Each component of COPaKB is regularly updated and maintained. The release history is outlined on the COPaKB Website (http://www.HeartProteome.org/copa/ReleaseHistory.aspx).
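The Xcorr-based selection of representative spectra described above can be illustrated with a minimal sketch. The source states only that selection was based on the ProLuCID cross-correlation score; the grouping key (peptide sequence plus charge state), the `SpectrumMatch` record, and the example values below are assumptions made for illustration, not the COPaKB implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpectrumMatch:
    """One peptide-spectrum match (hypothetical record layout)."""
    peptide: str
    charge: int
    scan_id: str
    xcorr: float


def select_representatives(matches):
    """Keep the highest-Xcorr match for each (peptide, charge) pair.

    Grouping by peptide and charge is an assumption; the source only
    says that representative spectra were chosen by Xcorr.
    """
    best = {}
    for m in matches:
        key = (m.peptide, m.charge)
        if key not in best or m.xcorr > best[key].xcorr:
            best[key] = m
    return best


# Made-up example matches, not data from the knowledgebase.
matches = [
    SpectrumMatch("LVNELTEFAK", 2, "scan_0101", 3.12),
    SpectrumMatch("LVNELTEFAK", 2, "scan_0450", 4.05),
    SpectrumMatch("LVNELTEFAK", 3, "scan_0462", 2.40),
    SpectrumMatch("AEFVEVTK", 2, "scan_0200", 2.88),
]
library = select_representatives(matches)
```

Keeping one spectrum per key condenses the search space, which is consistent with the efficiency argument made for the library-based workflow.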
COPaKB Computational Toolbox
We have implemented the COPaKB Web server on a DELL Precision T7500 workstation. Details on server configuration are documented in the Online Data Supplement.
We developed the COPaKB Client software to enable Web-based data transfer via the Simple Object Access Protocol (SOAP). Details on the coding and application of this program are documented in the Online Data Supplement.
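As a rough illustration of what a SOAP-based submission involves, the sketch below builds a SOAP 1.1 envelope that wraps a spectral file as base64 text. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the `SubmitSearch` operation, and its field names are hypothetical placeholders, since the actual COPaKB Client interface is documented only in the Online Data Supplement.

```python
import base64
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 namespace
SERVICE_NS = "urn:example:spectral-search"  # hypothetical service namespace


def build_search_request(module, file_name, payload):
    """Return a SOAP envelope (bytes) carrying one spectral file.

    The operation and element names are illustrative assumptions,
    not the real COPaKB Web service contract.
    """
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    operation = ET.SubElement(body, f"{{{SERVICE_NS}}}SubmitSearch")
    ET.SubElement(operation, f"{{{SERVICE_NS}}}module").text = module
    ET.SubElement(operation, f"{{{SERVICE_NS}}}fileName").text = file_name
    # Binary content is base64-encoded so it can travel inside XML text.
    encoded = base64.b64encode(payload).decode("ascii")
    ET.SubElement(operation, f"{{{SERVICE_NS}}}data").text = encoded
    return ET.tostring(envelope, encoding="utf-8")


request = build_search_request("murine heart mitochondria", "run01.mzML", b"<mzML/>")
```

The sketch only constructs the message; posting it to a server endpoint is deliberately left out because the endpoint is not given in the text.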
The analysis of proteomic data files from COPaKB users has been supported by a spectral library search engine that we previously developed.15
We processed mass spectral data files to create spectral libraries for COPaKB. Thus far, 10 modules have been configured on multiple replicate analyses (biological and technical), 8 of which are organellar modules. They include human heart mitochondria (29 replicates), human heart proteasomes (20 replicates), murine heart mitochondria (34 replicates), murine heart proteasomes (22 replicates), murine heart nucleus (30 replicates), murine heart cytosol (9 replicates), drosophila mitochondria (18 replicates), and Caenorhabditis elegans mitochondria (9 replicates). Two modules are total tissue lysates: human heart lysate (20 replicates) and mouse heart lysate (1 replicate). Details on data source and data processing are documented in the Online Data Supplement.
We created a publicly accessible Website to interface COPaKB with the scientific community. The specific procedures of implementing both the Website and its Wiki Web portal (software development for each component) are described in the Online Data Supplement.
Data Source and Tissue Collection
As of May 31, 2013, COPaKB hosts 10 modules. The modules of murine heart mitochondria, murine heart proteasome, murine heart cytosol, drosophila mitochondria, human heart mitochondria, and human heart proteasome were created using the data collected at UCLA; the modules of murine heart nuclei, murine heart total lysates, C. elegans mitochondria, and human heart total lysates were curated from public resources.
Regarding data created at UCLA, all procedures involving mice (ICR strain) were performed in accordance with the Animal Research Committee guidelines at UCLA and the Guide for the Care and Use of Laboratory Animals, published by the National Institutes of Health. Drosophila mitochondria were extracted from the Oregon-R-C strain. The experimental procedures involving human samples were approved by the UCLA Human Subjects Protection Committee and the UCLA Institutional Review Boards. The phenotypes of all samples are documented online at COPaKB Website. Additional information can be found in the Online Data Supplement.
Demonstration of the COPaKB-Assisted Proteomics Workflow
A test data set was downloaded from the Peptide Atlas Repository16 (PAe000353),17 containing a total of 111 raw proteomic data files. In this data set, murine heart mitochondrial proteins were extracted from the female ICR mouse strain and analyzed on a LCQ Deca XP mass spectrometer. This data set was processed by the COPaKB-directed workflow to benchmark its performance against that of the SEQUEST (BioWorks)-assisted workflow.
The analytic efficiency and robustness of the COPaKB Client–directed workflow were evaluated using spectral files in mzML format by scientists at 6 different test centers globally. Each center conducted 3 replicate tests and reported the fastest rate.
The use of COPaKB in integrating discoveries from multiple analyses was examined using 3 sets of LTQ-Orbitrap–collected data on murine heart mitochondria, with the mass resolutions of MS1 scan set at 60 000, 15 000, and 7500, respectively. Each set contains 21 LC-MS/MS experiments.
To create COPaKB (www.HeartProteome.org), we compiled a large collection of annotated protein mass spectral data sets on human, mouse, drosophila, and C. elegans samples. We integrated these data in a modular structure that parallels the organization of subcellular organelles inside the cardiac cell. The following examples demonstrate the use of COPaKB for efficiently conducting analyses of protein properties.
Assembly of Cardiac Spectral Data Sets
Mass spectral data sets of proteins were organized in a modular fashion based on their subcellular locations. Each module included multiple replicate analyses, which encapsulate the dynamic range of protein expression. For the human mitochondria module, a total of 6 biological replicates were integrated (Figure 2A); for the human proteasome module, a total of 5 biological replicates were integrated (Figure 2B). Altogether, the human mitochondria mass spectral library module was built on 856 LC-MS/MS experiments, and the human proteasome module incorporated 160 LC-MS/MS experiments.
A total of 41 758 nonredundant mass spectra representing 1398 proteins and 28 031 peptides were compiled for the human mitochondria module. A total of 5668 mass spectra representing 283 proteins and 3482 peptides were assembled for the human proteasome module; these data include proteasome subunits and their associated proteins. A total of 59 020 mass spectra representing 1619 proteins and 38 421 peptides were organized into the murine mitochondria module. A total of 9442 mass spectra representing 151 proteins and 6409 peptides were collected for the murine proteasome module (Table 1).
The data coverage of knowledgebase modules was evaluated by analyzing the cumulative number of protein entries as more replicates were incorporated. For the 4 example modules (human heart mitochondria, human heart proteasome, murine heart mitochondria, and murine heart proteasome) presented in Figure 2, coverage reached a plateau once a sufficient number of replicates had been included.
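The coverage evaluation amounts to counting distinct protein entries after each added replicate and checking when the count stops growing. A minimal sketch follows; the protein labels are placeholders rather than data from any COPaKB module.

```python
def cumulative_coverage(replicates):
    """Return the number of distinct proteins seen after each replicate."""
    seen = set()
    counts = []
    for proteins in replicates:
        seen.update(proteins)
        counts.append(len(seen))
    return counts


def new_per_replicate(counts):
    """Return how many previously unseen proteins each replicate added."""
    return [after - before for before, after in zip([0] + counts, counts)]


# Placeholder identifications for four replicates.
replicates = [
    {"P1", "P2", "P3"},
    {"P2", "P3", "P4"},
    {"P1", "P4"},
    {"P3"},
]
counts = cumulative_coverage(replicates)
gains = new_per_replicate(counts)
```

A run of zeros at the end of `gains` corresponds to the plateau described in the text.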
Data Integration in COPaKB
COPaKB organizes cardiac proteome knowledge in a relational database (Figure 1). The core structure of this database is built on an integrated framework using mass spectrometry data sets, which capture multidimensional molecular features ranging from peptides to proteins. This hierarchical structure allows diverse properties to be incorporated seamlessly.
Protein expression image data sets were synchronized with those curated by the HPA10 and compiled into an HPA reference table of COPaKB. Immunofluorescence and immunohistochemistry images of a total of 10 924 human proteins were included (Table 1). Immunofluorescence images show protein expression profiles with subcellular resolution; immunohistochemistry images assist visualization of the protein expression profiles in different cardiac cell types.
Changes in protein properties associated with cardiovascular pathogenesis were documented by performing a systematic literature search on peer-reviewed sources; a total of 413 nonredundant perturbations were found. In parallel, biomedical data from Online Mendelian Inheritance in Man18 and Gene Expression Atlas19 were incorporated into COPaKB to present the relevance of individual proteins to heart diseases.
Application of COPaKB to Support New Biology in Cardiovascular Studies
The main utilities and functional outputs of COPaKB are summarized in Table 2. Queries may take the format of (1) a protein identifier of interest as Protein Identifier, (2) a particular amino acid sequence of interest as Amino Acid Sequence; (3) any mass spectrometry data files from users as MS Data File(s); or (4) identifiers of existing analyses by any investigator team(s) as Analysis of Multiple Data Sets. Specifically, a protein identifier may be a name of a protein; the glycogen synthase kinase-3α is shown in Table 2 as an example. Moreover, the protein identifier may also be its gene symbol, for example, GSK3α, or its UniProt ID, for example, P49840. When a query is made, COPaKB will generate reports regarding the relevance of this protein in cardiac phenotypes, information in the literature on its mRNA expression, its interacting protein partners, its immunohistochemistry images in the myocardium, and its immunocytofluorescence images in human cells. In addition, COPaKB will also report a list of peptides identified with corresponding mass spectra data about this protein and all peer-reviewed publications on this protein as documented by Information Hyperlinked over Proteins (iHOP). Furthermore, COPaKB welcomes user inputs to further annotate this protein in the format of a Wiki page. An example output of this query by COPaKB is presented in Table 2 as www.heartproteome.org/copa/proteinInfo.aspx?qType=protein%20ID&qValue=P49840. In similar fashions, Table 2 details examples highlighting query formats made as Amino Acid Sequence, MS Data File(s), or Analysis of Multiple Data Sets.
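The example query address above follows an ordinary query-string pattern, so it can be generated programmatically. The sketch below reproduces that address for a UniProt ID; the `http://` scheme is an assumption (the example in the text omits it), and only the `protein ID` query type is taken from the source.

```python
from urllib.parse import quote, urlencode

# Path taken from the example query in the text; the scheme is assumed.
BASE_URL = "http://www.heartproteome.org/copa/proteinInfo.aspx"


def protein_query_url(uniprot_id):
    """Build the protein report URL for a UniProt accession."""
    params = {"qType": "protein ID", "qValue": uniprot_id}
    # quote_via=quote encodes the space as %20, matching the published example.
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


url = protein_query_url("P49840")
```

Other query formats listed in Table 2 presumably use different parameters, which are not given in the text and are therefore not modeled here.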
Furthermore, to demonstrate the utility of COPaKB-facilitated analyses, we acquired a test data set of murine mitochondria proteins from the Peptide Atlas Repository (PAe000353).17 This test data set was reprocessed using the murine mitochondria module of COPaKB as a reference. In this analysis, we identified a total of 261 proteins with a statistical confidence of 95%. In contrast, when this test data set was analyzed using SEQUEST, a commonly used mass spectra search engine, we identified 183 proteins at the same confidence level (Online Table II). Specifically, 78.7% of proteins were commonly identified by the 2 approaches (Figure 3A), whereas COPaKB identified an additional 64.5% of proteins (ie, 117 proteins) that were not covered by the SEQUEST search (Figure 3B). This added protein coverage significantly expanded the search outcome; the additional 117 proteins include mitochondrial proteins (eg, cytochrome c oxidase 7A1 of the electron transport chain complex IV).
Moreover, an automatic query to the knowledgebase for protein identification was integrated with functional annotations. Among the 117 proteins uniquely identified by COPaKB, 92 proteins (79%) had a GO annotation of the mitochondrion as their primary subcellular location (Figure 3A). Seventy-four out of the 92 mitochondrial proteins (80%) were involved in metabolism, 3 were involved in transport, and 4 were involved in apoptosis. Furthermore, 199 protein expression images (among the 261 identified proteins) were available in HPA and were automatically retrieved (Figure 3C). According to peer-reviewed publications, 49 of the 261 proteins were involved in the processes of cardiac pathogenesis.
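Several of the percentages in the two preceding paragraphs can be traced with simple arithmetic. The sketch below assumes that the shared proteins are the COPaKB identifications minus the 117 unique ones, and that the 78.7% figure is expressed relative to the 183 SEQUEST identifications; both are one reading consistent with the reported numbers rather than a statement from the authors.

```python
def pct(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


copakb_total = 261    # proteins identified via the COPaKB workflow
sequest_total = 183   # proteins identified via SEQUEST
copakb_only = 117     # proteins found only by COPaKB

# Assumed reading: shared proteins = COPaKB total minus COPaKB-only.
shared = copakb_total - copakb_only
shared_pct = pct(shared, sequest_total)

# Functional annotation of the 117 COPaKB-only proteins.
mito_pct = pct(92, copakb_only)   # GO annotation: mitochondrion
metabolism_pct = pct(74, 92)      # mitochondrial proteins in metabolism

print(shared, shared_pct, mito_pct, metabolism_pct)
```

Under these assumptions the shared set contains 144 proteins, and the rounded values agree with the 78.7%, 79%, and 80% quoted in the text.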
Performance Efficiency of COPaKB Workflow
A reliable delivery of Omics-scale information requires robust Web portals. The traditional Web-based data transfer protocol is limited by its ability to transfer data effectively within a defined timeframe (eg, 60 minutes). To overcome this challenge, a COPaKB Client program was engineered to implement a SOAP-assisted workflow (Figure 4). Its performance efficiency was benchmarked against that of the traditional HTTP-assisted workflow by participating investigators from 6 test centers (Table 3 and Online Table I). In all these tests, large-scale data files were reliably and consistently delivered by the COPaKB Client. In a load test, the COPaKB server was able to process 50 simultaneous search requests without compromising its reliability.
The analytic efficiency of the COPaKB-supported search engine was also benchmarked against that of SEQUEST. Analysis using COPaKB requires extra time to upload the data onto the knowledgebase server. Despite this extra time, the COPaKB-supported workflow completed the analyses faster than the SEQUEST-supported workflow. This test was repeated on a personal computer and on a computing cluster of moderate size (7 nodes; Table 4). The enhanced efficiency of the COPaKB workflow is achieved through a condensed search space within the mass spectral library, a simplified spectral matching algorithm, a strategy of selecting only MS2 spectra for Web-based protein identification, and the integration of SOAP.
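The text does not specify the spectral matching algorithm, so the following sketch substitutes a generic technique that is common in spectral library searching: a normalized dot product (cosine score) between binned MS2 peak lists. It is offered only to illustrate why matching against a condensed library is inexpensive; the bin width, function names, and example peaks are all assumptions.

```python
import math
from collections import defaultdict


def bin_peaks(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def cosine_score(query, reference, bin_width=1.0):
    """Normalized dot product between two binned spectra (0.0 to 1.0)."""
    q = bin_peaks(query, bin_width)
    r = bin_peaks(reference, bin_width)
    dot = sum(value * r[key] for key, value in q.items() if key in r)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_r = math.sqrt(sum(v * v for v in r.values()))
    norm = norm_q * norm_r
    return dot / norm if norm else 0.0


def best_match(query, library):
    """Return the library entry name with the highest cosine score."""
    return max(library, key=lambda name: cosine_score(query, library[name]))


# Made-up peak lists for illustration only.
query = [(100.1, 10.0), (200.2, 5.0)]
library = {
    "pepA": [(100.3, 9.0), (200.4, 6.0)],
    "pepB": [(150.0, 10.0), (300.0, 5.0)],
}
top_hit = best_match(query, library)
```

Each comparison is linear in the number of peaks, so total search time scales with the number of library entries, which is why a condensed library shortens the search.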
Platform to Promote New Discoveries Through Collaborative Effort
Discoveries from multiple studies can be integrated in real time via the COPaKB server (Table 2). COPaKB receives the identifier of each analysis task as input and returns a list of proteins that are identified in each of these analyses. Three test data sets of murine mitochondrial proteins were collected using an LTQ-Orbitrap mass spectrometer with different settings of MS1 resolution; they were independently analyzed via COPaKB (Figure 5A). These results were then combined using the murine mitochondria module of COPaKB as a reference. Comparative analysis on these data sets was subsequently conducted (Figure 5B).
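The comparison step can be pictured as set operations over the protein lists returned for each analysis identifier. The sketch below is a minimal stand-in for that logic; the analysis identifiers and protein labels are hypothetical and do not correspond to the data in Figure 5.

```python
def compare_analyses(results):
    """Compare protein lists keyed by analysis identifier.

    Returns the proteins shared by every analysis and, per analysis,
    the proteins found in no other analysis.
    """
    sets = {name: set(proteins) for name, proteins in results.items()}
    shared = set.intersection(*sets.values()) if sets else set()
    unique = {}
    for name, proteins in sets.items():
        others = set().union(*(s for other, s in sets.items() if other != name))
        unique[name] = proteins - others
    return shared, unique


# Hypothetical identifiers for three runs at different MS1 resolutions.
results = {
    "run_60000": ["protA", "protB", "protC", "protD"],
    "run_15000": ["protA", "protB", "protC"],
    "run_7500": ["protA", "protB", "protE"],
}
shared, unique = compare_analyses(results)
```

Using a common reference module matters here because set comparison is only meaningful when all analyses report proteins under the same identifiers.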
COPaKB relies on inputs from the scientific community. Integration of mass spectrometry–based data sets from various sources necessitates standard operation protocols. COPaKB achieves consistency following the principles established by the Minimum Information about a Proteomics Experiment.20 This guideline describes experimental procedures with sufficient details using controlled vocabularies. The complexity of the controlled vocabulary, however, often serves as a double-edged sword; the redundant terminology allows the same experiments to be described in different terms. Using a specialized subset of the data submission utility,21 COPaKB conducts standardized collection and propagation of proteomics data (Online Figure III).
Along with the effort to streamline the knowledgebase with consistent vocabularies, a text-based Wiki component has been created to facilitate communication among investigators. The Wiki component is coupled to each peptide, protein, and spectrum entry of the knowledgebase, welcoming users to add or edit the content relevant to the subject. The open nature of the Wiki component fosters a worldwide collaborative effort (Online Figure IV).
Our goal is to create a protein knowledgebase to support the long-term advancement of cardiovascular biology in an informatics-driven era. As the continued growth of any bioinformatics platform hinges on community support, we designed COPaKB from the ground up with community participation in mind.
Systems Integration and Software Engineering for COPaKB
COPaKB assembles its individual modules based on protein localizations. Each module consists of orthogonal data sets curated from either a public resource or a large cohort of experiments. The public resources also include UniProt,12 HPA,22 OMIM,9 and Gene Expression Atlas.19 The annotated data sets are integrated using a relational database schema, which offers the investigators easy access to an array of protein properties.
COPaKB operates via an innovative workflow. First, it has the design features to support multidimensional data integration. Specifically, using mass spectrometry–based proteomics data has the benefit that information from different disciplines can be synthesized. Second, it introduces a new mechanism for querying data. Many online proteomics resources exist, including repertoires of 2-dimensional PAGE images,23 mass spectra,10 and protein expression images.16,21 Retrieving these data from multiple databases remains laborious and requires repetitive mining efforts. COPaKB overcomes these challenges with 2 unique functional features. One is to connect specific interests of individual investigators with the vast information hosted in the knowledgebase; the other is to support the analysis of raw data files in an efficient manner. Both features are supported by computational strategies created for COPaKB. These strategies include a spectral library and a search engine that automatically decodes raw spectral data,24–26 subsequently combining them with orthogonal properties of proteins and genes. This computational toolbox provides a new way of using a biomedical knowledgebase, in which functional annotations of each protein in the sample are presented in a cohesive context. Accordingly, this built-in bioinformatics pipeline in COPaKB helps investigators complement their targeted investigation with discovery-based or hypothesis-driven approaches.
Finally, COPaKB Client addresses the technical limitations of transferring large data files on a proteome scale. We have engineered this program to use a Web-based SOAP portal to enable robust connections with the COPaKB server, allowing reliable access to contextualized properties on proteins of interest.
Building COPaKB Via a Community-Driven Paradigm
Comprehensive understanding of cardiovascular biology on a proteome scale is a long-term goal that often exceeds the capacity of individual investigators. COPaKB alleviates this limitation by implementing the bioinformatics infrastructure necessary to facilitate effective collaboration. In particular, COPaKB allows real-time integration of analyses on multiple data sets from numerous investigators in parallel. This strategy surmounts technical challenges in long-distance collaborations, namely geographic boundaries and platform discrepancies. Additionally, the Wiki component in COPaKB lets investigators communicate and contribute information on individual proteins in a variety of formats (eg, images or text).
Investigator participation is essential to the future growth of a community-driven knowledgebase. Currently, the capacity of COPaKB modules derives primarily from publicly available data sets. However, the current coverage of several COPaKB modules is at an early stage; this applies particularly to the mouse heart total lysate module. Because the ProteomeXchange consortium27 led by EBI has accelerated the supply of raw proteomic data, COPaKB is expected to grow and will cover additional modules on organelles and cells from cardiovascular-relevant model systems. The content of each module will expand with the increasingly available public data.
Despite the rapid development in proteomics science, the gap in translating the new tools to biological applications has widened. This is largely due to limited access to high-end instrumentation and difficulties in managing large data analyses. The user-friendly workflow of COPaKB helps nonproteomics investigators benefit directly from the technological advancement and new data sets. Investigators are no longer restricted by a lack of direct access to high-end mass spectrometry. For example, COPaKB documents the parameters (eg, molecular mass and charge state) that are associated with peptides; the institutional proteomic core can then adjust instrument settings accordingly to improve the detection of their selected proteins. In this scenario, the cost and time of using proteomic technologies are optimized. Taken together, COPaKB provides a number of effective workflows to guide cardiovascular investigators from proteomics data to systematic interpretation of biomedical properties. We are continuing to develop innovative workflows to aid better understanding of protein functions in cardiovascular diseases.
In conclusion, COPaKB is a novel computational platform with its unique bioinformatics pipelines and Web portals, engaging a community effort to build a knowledgebase. As proteome biology has become increasingly integrated into cardiovascular medicine, we envision a growing importance of this new resource.
Sources of Funding
This work was supported, in part, by National Heart, Lung, and Blood Institute Proteomics Center Award HHSN268201000035C, NIH R01 HL063901, an endowment from Theodore C. Laubisch to P.P.; R37HL063901 diversity supplement award to D.M.; and R37HHSN268201000035C diversity supplement award to I.Z.
In July 2013, the average time from submission to first decision for all original research papers submitted to Circulation Research was 13.24 days.
The online-only Data Supplement is available with this article at http://circres.ahajournals.org/lookup/suppl/doi:10.1161/CIRCRESAHA.113.301151/-/DC1.
- Nonstandard Abbreviations and Acronyms
- COPaKB: Cardiac Organellar Protein Atlas Knowledgebase
- GO: Gene Ontology
- HPA: Human Protein Atlas
- SOAP: Simple Object Access Protocol
- UniProt: Universal Protein Resource
- Received February 9, 2013.
- Revision received August 19, 2013.
- Accepted August 21, 2013.
- © 2013 American Heart Association, Inc.
- Pazdrak K, Young TW, Straub C, Stafford S, Kurosky A
- Wang SB, Foster DB, Rucker J, O’Rourke B, Kass DA, Van Eyk JE
- Chen C, McGarvey PB, Huang H, Wu CH
- Kapushesky M, Adamusiak T, Burdett T, et al
- Wu CH, Apweiler R, Bairoch A, et al
- Binns D, Dimmer E, Huntley R, Barrell D, O’Donovan C, Apweiler R
- Côté R, Reisinger F, Martens L, Barsnes H, Vizcaino JA, Hermjakob H
- Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R
- Kislinger T, Cox B, Kannan A, Chung C, Hu P, Ignatchenko A, Scott MS, Gramolini AO, Morris Q, Hallett MT, Rossant J, Hughes TR, Frey B, Emili A
- Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA
- Kapushesky M, Emam I, Holloway E, Kurnosov P, Zorin A, Malone J, Rustici G, Williams E, Parkinson H, Brazma A
- Jones P, Côté RG, Cho SY, Klie S, Martens L, Quinn AF, Thorneycroft D, Hermjakob H
- Zhai P, Sadoshima J
- Shiojima I, Walsh K
Novelty and Significance
What Is Known?
Proteomics techniques allow for large-scale analysis of global protein expression, but are not commonly accessible because of the need for specialized computational and informatics expertise.
What New Information Does This Article Contribute?
Cardiac Organellar Protein Atlas Knowledgebase (COPaKB) is a new resource consolidating relevant protein data sets from multiple scientific disciplines and linking protein molecular properties to their functional phenotypes.
COPaKB features a novel algorithm supporting user-directed protein pathway studies in their specified biological context of interests.
COPaKB could serve as a centralized Web portal, allowing remote data management, and as an online platform for collaboration among investigators.
Proteomics investigations have received increasing attention in cardiovascular research, but several obstacles remain before effective translation and utilization of proteomic data. These challenges include fragmented data structure, inconsistent data annotations, and, often, investigator inaccessibility to relevant technology platforms. COPaKB was created as a unique resource to facilitate better understanding of proteomic data sets. This platform is a curated relational database of protein molecular and biomedical phenotype properties, interfaced to a Website for public data retrieval. It allows any investigator to process raw proteomic data sets without the need of accessing high-end instrumentation, and it returns a consistently annotated report of protein properties. The platform also offers a wide range of informatics tools for investigators to analyze different studies in parallel and to conduct meta-analyses.