CoCo: coding cost for parameter-free outlier detection

Authors:
Christian Böhm, University of Munich, Munich, Germany
Katrin Haegler, University of Munich, Munich, Germany
Nikola S. Müller, Max Planck Institute of Biochemistry, Martinsried, Germany
Claudia Plant, Technische Universität München, Munich, Germany

KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 2009, Pages 149–158. https://doi.org/10.1145/1557019.1557042

Published: 28 June 2009

ABSTRACT

How can we automatically spot all outstanding observations in a data set? This question arises in a large variety of applications, e.g. in economics, biology and medicine. Existing approaches to outlier detection suffer from one or more of the following drawbacks: The results of many methods depend strongly on suitable parameter settings, e.g. the minimum cluster size or the number of desired outliers, which are very difficult to estimate without background knowledge of the data. Many methods implicitly assume Gaussian or uniformly distributed data, and/or their results are difficult to interpret. To cope with these problems, we propose CoCo, a technique for parameter-free outlier detection. The basic idea of our technique relates outlier detection to data compression: Outliers are objects which cannot be effectively compressed given the data set. To avoid the assumption of a certain data distribution, CoCo relies on a very general data model combining the Exponential Power Distribution with Independent Components. We define an intuitive outlier factor based on the principle of the Minimum Description Length together with a novel algorithm for outlier detection. An extensive experimental evaluation on synthetic and real-world data demonstrates the benefits of our technique. Availability: The source code of CoCo and the data sets used in the experiments are available at: http://www.dbs.ifi.lmu.de/Forschung/KDD/Boehm/CoCo.
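The link between compression and outlierness described in the abstract can be illustrated with a minimal sketch. It is not the CoCo algorithm or its outlier factor. It assumes the generalized normal parameterization of the exponential power distribution (location mu, scale alpha, shape beta), assumes the coordinates have already been made independent (for instance by an independent component analysis step not shown here), and ignores the discretization term of a real code length. The function names epd_pdf, coding_cost_bits and total_cost_bits are hypothetical.

```python
import math


def epd_pdf(x, mu=0.0, alpha=1.0, beta=2.0):
    """Density of the exponential power (generalized normal) distribution.

    Assumed parameterization: f(x) = beta / (2 * alpha * Gamma(1 / beta))
    * exp(-(|x - mu| / alpha) ** beta). beta = 2 gives a Gaussian shape,
    beta = 1 a Laplace shape.
    """
    norm = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return norm * math.exp(-((abs(x - mu) / alpha) ** beta))


def coding_cost_bits(x, mu=0.0, alpha=1.0, beta=2.0):
    """Idealized cost -log2 f(x) of coding x under the fitted density.

    Computed in log space so that very unlikely values do not underflow.
    """
    log_norm = math.log(beta / (2.0 * alpha * math.gamma(1.0 / beta)))
    log_density = log_norm - (abs(x - mu) / alpha) ** beta
    return -log_density / math.log(2.0)


def total_cost_bits(point, params):
    """Sum of per-coordinate costs, assuming independent coordinates.

    params holds one (mu, alpha, beta) tuple per coordinate.
    """
    return sum(coding_cost_bits(x, *p) for x, p in zip(point, params))


if __name__ == "__main__":
    # Hypothetical fitted parameters for two independent components.
    params = [(0.0, 1.0, 2.0), (0.0, 1.0, 1.0)]
    for point in ([0.1, -0.2], [4.0, 5.0]):
        print(point, round(total_cost_bits(point, params), 2), "bits")
```

Under these assumptions, a point far from the bulk of the data receives a higher cost in bits than a point near the location parameters, which is the intuition behind ranking objects by how poorly they compress.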

Index Terms

Information systems → Information systems applications → Data mining

Published in
KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 2009, 1426 pages
ISBN: 9781605584959
DOI: 10.1145/1557019
General Chairs: John Elder (Elder Research, Inc., USA), Françoise Soulié Fogelman (KXEN, France)
Program Chairs: Peter Flach (University of Bristol, UK), Mohammed Zaki (RPI, USA)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: coding costs, data compression, minimum description length, outlier detection
Qualifiers: research-article