ABSTRACT
How can we automatically spot all outstanding observations in a data set? This question arises in a large variety of applications, e.g. in economics, biology and medicine. Existing approaches to outlier detection suffer from one or more of the following drawbacks: The results of many methods depend strongly on parameter settings that are very difficult to estimate without background knowledge of the data, e.g. the minimum cluster size or the number of desired outliers. Many methods implicitly assume Gaussian or uniformly distributed data, and/or their results are difficult to interpret. To cope with these problems, we propose CoCo, a technique for parameter-free outlier detection. The basic idea of our technique relates outlier detection to data compression: Outliers are objects that cannot be compressed effectively given the data set. To avoid assuming a particular data distribution, CoCo relies on a very general data model combining the Exponential Power Distribution with Independent Components. We define an intuitive outlier factor based on the Minimum Description Length principle, together with a novel algorithm for outlier detection. An extensive experimental evaluation on synthetic and real-world data demonstrates the benefits of our technique. Availability: The source code of CoCo and the data sets used in the experiments are available at: http://www.dbs.ifi.lmu.de/Forschung/KDD/Boehm/CoCo.
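The link between compression and outlierness can be illustrated with a small sketch. It is not the CoCo algorithm or its outlier factor: it only computes an idealized code length, the negative base-2 logarithm of a density, for one-dimensional values under an Exponential Power (generalized normal) distribution. The function names and the parameters `mu`, `alpha` and `beta` are hypothetical and fixed by hand here; nothing is estimated from data and no Independent Components step is performed.

```python
import math


def epd_pdf(x, mu=0.0, alpha=1.0, beta=2.0):
    """Density of the Exponential Power (generalized normal) distribution.

    mu is the location, alpha the scale and beta the shape parameter
    (beta=2 gives a Gaussian, beta=1 a Laplacian). All three are
    illustrative assumptions, not values taken from the paper.
    """
    norm = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return norm * math.exp(-((abs(x - mu) / alpha) ** beta))


def coding_cost_bits(x, mu=0.0, alpha=1.0, beta=2.0):
    """Idealized code length of x in bits: -log2 of the density at x.

    Computed in log space so that far-away points do not underflow.
    For continuous data this is a differential code length and can be
    negative wherever the density exceeds 1.
    """
    log_norm = math.log(beta) - math.log(2.0 * alpha) - math.lgamma(1.0 / beta)
    log_pdf = log_norm - (abs(x - mu) / alpha) ** beta
    return -log_pdf / math.log(2.0)


# Toy data: four values near the centre and one far away.
values = [0.1, -0.4, 0.3, 0.0, 6.0]
costs = [coding_cost_bits(v) for v in values]

# The value that is most expensive to encode is the outlier candidate.
most_costly = values[costs.index(max(costs))]
```

In this toy setting the value farthest from `mu` receives the largest code length, which is the intuition behind treating poorly compressible objects as outliers. A parameter-free method would have to derive the model parameters from the data itself rather than fixing them as done here.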
Index Terms
- CoCo: coding cost for parameter-free outlier detection