ABSTRACT
Functional genomics, the effort to understand the role of genomic elements in biological processes, has led to an avalanche of diverse experimental and semantic information defining associations between genes and various biological concepts across species and experimental paradigms. Integrating this rapidly expanding wealth of heterogeneous data, and finding consensus among so many diverse sources for specific research questions, require highly sophisticated big data structures and algorithms for harmonization and scalable analysis. In this context, multipartite graphs can often serve as useful structures for representing questions about the role of genes in multiple, frequently occurring disease processes. The main focus of this paper is on finding and analyzing efficient algorithms for dense subgraph enumeration in such graphs. An O(3^{n/3})-time procedure is devised to enumerate all maximal k-partite cliques in a k-partite graph on n vertices, where k ≥ 3. The maximum number of such cliques is also shown to obey this bound, and thus this procedure obtains the best possible asymptotic performance. Empirical testing on both real and synthetic data is conducted. Concrete applications to biological data are described, as are scalability issues in the context of big data analysis.
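To make the enumeration task concrete, the following is a minimal sketch and not the procedure devised in the paper. It assumes that a k-partite clique is a vertex set that meets every partite set and in which every two vertices from different partite sets are adjacent. Under that assumption, one standard reduction makes each partite set internally complete and then lists maximal cliques of the augmented graph with Bron-Kerbosch plus pivoting, a variant for which a worst-case bound of the form O(3^{n/3}) is known. The function name, argument layout, and toy graph are hypothetical.

```python
def maximal_kpartite_cliques(parts, edges):
    """Enumerate maximal k-partite cliques (illustrative sketch only).

    parts: list of disjoint vertex sets, one per partite set (assumed input layout).
    edges: iterable of (u, v) pairs joining vertices in different partite sets.
    """
    part_of = {v: i for i, p in enumerate(parts) for v in p}
    adj = {v: set() for v in part_of}
    for u, v in edges:
        if part_of[u] == part_of[v]:
            raise ValueError("edge inside a partite set")
        adj[u].add(v)
        adj[v].add(u)
    # Augmentation: vertices sharing a partite set become mutually adjacent,
    # so cliques of the augmented graph are complete multipartite subgraphs.
    for p in parts:
        for v in p:
            adj[v] |= set(p) - {v}

    all_parts = set(range(len(parts)))
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already explored vertices.
        if not p and not x:
            # r is maximal; keep it only if it touches every partite set.
            if {part_of[v] for v in r} == all_parts:
                found.append(frozenset(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(part_of), set())
    return found


# Toy tripartite graph (hypothetical data).
parts = [{"a1", "a2"}, {"b1", "b2"}, {"c1"}]
edges = [("a1", "b1"), ("a1", "c1"), ("b1", "c1"),
         ("a2", "b2"), ("a2", "c1"), ("b2", "c1")]
found = maximal_kpartite_cliques(parts, edges)
```

In the toy graph, the augmented graph also has maximal cliques such as {a1, a2, c1} that miss a partite set; the final filter discards them, leaving only the vertex sets that span all three parts.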
Index Terms
- Scalable multipartite subgraph enumeration for integrative analysis of heterogeneous experimental functional genomics data