DOI: 10.1145/1557019.1557042
research-article

CoCo: coding cost for parameter-free outlier detection

Published: 28 June 2009

ABSTRACT

How can we automatically spot all outstanding observations in a data set? This question arises in a large variety of applications, e.g. in economics, biology and medicine. Existing approaches to outlier detection suffer from one or more of the following drawbacks: The results of many methods depend strongly on suitable parameter settings, such as the minimum cluster size or the number of desired outliers, which are very difficult to estimate without background knowledge of the data. Many methods implicitly assume Gaussian or uniformly distributed data, and/or their results are difficult to interpret. To cope with these problems, we propose CoCo, a technique for parameter-free outlier detection. The basic idea of our technique relates outlier detection to data compression: Outliers are objects which cannot be effectively compressed given the data set. To avoid the assumption of a certain data distribution, CoCo relies on a very general data model combining the Exponential Power Distribution with Independent Components. We define an intuitive outlier factor based on the principle of the Minimum Description Length together with a novel algorithm for outlier detection. An extensive experimental evaluation on synthetic and real-world data demonstrates the benefits of our technique. Availability: The source code of CoCo and the data sets used in the experiments are available at: http://www.dbs.ifi.lmu.de/Forschung/KDD/Boehm/CoCo.
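The compression view of outliers in the abstract can be illustrated with a small sketch. It is not the CoCo algorithm: CoCo models the data with the Exponential Power Distribution and Independent Components, whereas this sketch assumes a one-dimensional Gaussian fitted with the median and the median absolute deviation, purely to keep the example short. The function names and the sample data are hypothetical. The coding cost of a value is taken as the negative base-2 logarithm of its density, so a value that the model finds improbable needs many more bits than the rest.

```python
import math
from statistics import median


def robust_gaussian_fit(values):
    """Estimate location and scale with median and MAD so one extreme value cannot dominate."""
    mu = median(values)
    mad = median([abs(v - mu) for v in values])
    sigma = 1.4826 * mad  # MAD-to-standard-deviation factor for Gaussian data
    if sigma == 0:
        raise ValueError("scale estimate is zero; the data are too concentrated for this sketch")
    return mu, sigma


def coding_cost_bits(x, mu, sigma):
    """Negative log2 density of x under N(mu, sigma^2), computed in log space to avoid underflow."""
    z = (x - mu) / sigma
    nats = 0.5 * z * z + math.log(sigma * math.sqrt(2.0 * math.pi))
    return nats / math.log(2.0)


def coding_costs(values):
    """Coding cost in bits for every value under the model fitted to the same values."""
    mu, sigma = robust_gaussian_fit(values)
    return [coding_cost_bits(v, mu, sigma) for v in values]


# Hypothetical measurements: six values near 10 and one far away.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]
costs = coding_costs(data)

# The value with the highest coding cost is the least compressible one.
most_expensive = max(range(len(data)), key=costs.__getitem__)
print(data[most_expensive])
```

Because the cost is derived from a density rather than a probability, individual costs can be negative; only the differences between values are meaningful here. The Gaussian assumption is exactly the kind of restriction the abstract says CoCo avoids, so the sketch shows the coding-cost idea only, not the paper's data model or its outlier factor.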

Supplemental Material

p149-haegler.mp4 (mp4, 51.6 MB)


Published in

KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
June 2009
1426 pages
ISBN: 9781605584959
DOI: 10.1145/1557019

Copyright © 2009 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
