RIC: Parameter-free noise-robust clustering

Published: 01 December 2007

Abstract

How do we find a natural clustering of a real-world point set that contains an unknown number of clusters with different shapes, and that may be contaminated by noise? Because most clustering algorithms were designed under certain assumptions (e.g., Gaussianity), they often require the user to supply input parameters, and they are sensitive to noise. In this article, we propose a robust framework for determining a natural clustering of a given dataset, based on the minimum description length (MDL) principle. The proposed framework, robust information-theoretic clustering (RIC), is orthogonal to any known clustering algorithm: given a preliminary clustering, RIC purifies these clusters of noise and adjusts the clustering so as to simultaneously determine the most natural number and shape (subspace) of the clusters. Our RIC method can be combined with any clustering technique, ranging from K-means and K-medoids to advanced methods such as spectral clustering. In fact, RIC can purify and improve even a coarse initial clustering produced by very simple methods. As an extension, we propose a fully automatic stand-alone clustering method along with efficiency improvements. RIC scales well with the dataset size. Extensive experiments on synthetic and real-world datasets validate the proposed RIC framework.
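The abstract describes RIC's MDL criterion only at a high level. As a rough sketch of the underlying idea, and not the paper's actual coding scheme, the Python snippet below scores a candidate clustering by the total number of bits needed to encode the data under per-cluster Gaussian models plus the cost of the cluster assignments. The function names and the 0.5 * log2(n)-bits-per-parameter model cost are illustrative assumptions, not details taken from the paper.

```python
# A minimal, hypothetical sketch of MDL-based clustering scoring in the
# spirit of RIC. This is NOT the paper's coding scheme: the parameter
# cost (0.5 * log2(n) bits per parameter) and all names are assumptions.

import numpy as np
from scipy.stats import multivariate_normal

def gaussian_code_length_bits(points):
    """Bits to encode `points` under a Gaussian fitted to them, plus an
    assumed cost of 0.5 * log2(n) bits per model parameter."""
    n, d = points.shape
    mean = points.mean(axis=0)
    # Regularize the covariance so the density stays well defined.
    cov = np.cov(points, rowvar=False) + 1e-6 * np.eye(d)
    log_lik_nats = multivariate_normal.logpdf(points, mean, cov).sum()
    data_bits = -log_lik_nats / np.log(2.0)
    n_params = d + d * (d + 1) / 2            # mean + symmetric covariance
    return data_bits + 0.5 * n_params * np.log2(n)

def clustering_code_length_bits(points, labels):
    """Total description length: per-cluster coding costs plus the bits
    needed to state each point's cluster membership."""
    ids = np.unique(labels)
    total = sum(gaussian_code_length_bits(points[labels == c]) for c in ids)
    total += len(points) * np.log2(len(ids))  # membership bits (0 if one cluster)
    return total

# Usage: between candidate clusterings of the same data, keep the one
# with the smaller total code length.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),    # well-separated blob A
               rng.normal(8.0, 1.0, (200, 2))])   # well-separated blob B
print(clustering_code_length_bits(X, np.zeros(400, dtype=int)))  # one cluster
print(clustering_code_length_bits(X, np.repeat([0, 1], 200)))    # two clusters: cheaper
```

Under this kind of scoring, the candidate clustering with the smaller total code length wins, so the data itself decides how many clusters to keep. This is the sense in which an MDL criterion can make a clustering framework parameter-free.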
