ABSTRACT
How can the truly relevant information be extracted from a large relational data set? This paper answers with a technique that integrates graph summarization, graph clustering, link prediction, and the discovery of hidden structure on the basis of data compression. Our novel algorithm SCMiner (for Summarization-Compression Miner) reduces a large bipartite input graph to a highly compact representation that is useful for several data mining tasks: 1) Clustering: The compact summary graph contains the truly relevant clusters of both types of nodes of a bipartite graph. 2) Link prediction: The compression scheme of SCMiner reveals suspicious edges, which are probably erroneous, as well as missing edges, i.e., pairs of nodes that should be connected by an edge. 3) Discovery of the hidden structure: Unlike traditional co-clustering methods, the result of SCMiner is not limited to row and column clusters. Besides the clusters, the summary graph also contains the essential relationships between both types of clusters and thus reveals the hidden structure of the data. Extensive experiments on synthetic and real data demonstrate that SCMiner outperforms state-of-the-art techniques for clustering and link prediction. Moreover, SCMiner discovers the hidden structure and reports it to the user in an interpretable way. Because it is based on data compression, our technique does not rely on any input parameters that are difficult to estimate.
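The idea of describing a bipartite graph by a compact summary plus a list of deviations can be illustrated with a toy sketch. This is not SCMiner's actual coding scheme, which the abstract does not specify. It assumes a simplified majority rule: a super-edge between a row group and a column group is kept when more than half of the possible node pairs are connected, and every deviation from the summary is recorded as a correction. The function name `summarize`, the majority rule, and the example groups are all hypothetical and chosen only for illustration.

```python
from itertools import product


def summarize(edges, row_groups, col_groups):
    """Toy summary of a bipartite graph (illustrative; not SCMiner itself).

    edges: set of (row_node, col_node) pairs.
    row_groups, col_groups: dict mapping a group name to its list of nodes.
    Returns (super_edges, missing, suspicious).
    """
    super_edges, missing, suspicious = set(), set(), set()
    for (rg, rows), (cg, cols) in product(row_groups.items(), col_groups.items()):
        pairs = set(product(rows, cols))   # all possible edges between the groups
        present = pairs & edges            # edges that actually exist
        if 2 * len(present) > len(pairs):  # assumed majority rule
            super_edges.add((rg, cg))
            missing |= pairs - present     # implied by the summary but absent
        else:
            suspicious |= present          # present but not explained by the summary
    return super_edges, missing, suspicious


# Hypothetical example data: two row groups and two column groups.
edges = {
    ("a1", "x1"), ("a1", "x2"), ("a2", "x1"),
    ("b1", "y1"), ("b2", "y1"),
    ("b1", "x1"),
}
row_groups = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
col_groups = {"X": ["x1", "x2"], "Y": ["y1"]}

super_edges, missing, suspicious = summarize(edges, row_groups, col_groups)
print(sorted(super_edges))  # [('A', 'X'), ('B', 'Y')]
print(sorted(missing))      # [('a2', 'x2')]
print(sorted(suspicious))   # [('b1', 'x1')]
```

In this toy setting the six original edges are described by two super-edges and two corrections. The corrections play the two roles named in the abstract: `missing` holds pairs that the summary suggests should be connected, and `suspicious` holds edges that the summary does not explain.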
Index Terms
- Summarization-based mining bipartite graphs