Abstract
How do we find a natural clustering of a real-world point set that contains an unknown number of clusters with different shapes, and that may be contaminated by noise? Because most clustering algorithms were designed under certain assumptions (e.g., Gaussianity), they often require the user to supply input parameters, and they are sensitive to noise. In this article, we propose a robust framework for determining a natural clustering of a given dataset, based on the minimum description length (MDL) principle. The proposed framework, robust information-theoretic clustering (RIC), is orthogonal to any known clustering algorithm: given a preliminary clustering, RIC purifies the clusters from noise and adjusts them so as to simultaneously determine the most natural number and shape (subspace) of the clusters. RIC can be combined with any clustering technique, from K-means and K-medoids to advanced methods such as spectral clustering; in fact, it can purify and improve even a very coarse initial clustering produced by simple methods. As an extension, we propose a fully automatic stand-alone clustering method along with efficiency improvements. RIC scales well with the dataset size. Extensive experiments on synthetic and real-world datasets validate the proposed framework.
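The core MDL idea behind this style of clustering can be sketched in a few lines: a clustering is scored by the number of bits needed to encode the model plus the bits needed to encode the data given the model, and the clustering with the shortest total description wins. The sketch below is a toy illustration under strong simplifying assumptions (1-D data, hard-assignment Gaussian clusters, a BIC-style parameter penalty), not the paper's actual coding scheme; the helper names `kmeans_1d` and `description_length` are hypothetical.

```python
import math
import random

def kmeans_1d(points, k, iters=50):
    """Simple deterministic 1-D k-means (quantile initialization)."""
    srt = sorted(points)
    centers = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return clusters

def description_length(clusters, n):
    """Bits for the model (BIC-style: 0.5*log2(n) per parameter) plus
    bits for the data under a hard-assignment Gaussian mixture."""
    k = len(clusters)
    bits = (3 * k - 1) * 0.5 * math.log2(n)  # mean, sigma, weight per cluster
    for c in clusters:
        if not c:
            continue
        mu = sum(c) / len(c)
        var = sum((x - mu) ** 2 for x in c) / len(c)
        sigma = max(math.sqrt(var), 1e-3)    # floor to avoid zero variance
        w = len(c) / n                        # hard-assignment cluster weight
        for x in c:
            pdf = (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                   / (sigma * math.sqrt(2 * math.pi)))
            bits += -math.log2(max(w * pdf, 1e-300))
    return bits

# Two well-separated Gaussian blobs: the MDL score should bottom out at k = 2.
random.seed(0)
data = ([random.gauss(0.0, 0.5) for _ in range(50)] +
        [random.gauss(10.0, 0.5) for _ in range(50)])

costs = {k: description_length(kmeans_1d(data, k), len(data))
         for k in range(1, 5)}
best_k = min(costs, key=costs.get)
print(best_k)
```

Splitting a true cluster buys a slightly better data fit but costs extra parameter and assignment bits, so the total code length penalizes over-segmentation automatically; this is the same trade-off RIC exploits, generalized to arbitrarily oriented subspace clusters and noise points.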
Index Terms
- RIC: Parameter-free noise-robust clustering