High performance clustering based on the similarity join

Authors:
Christian Böhm
Bernhard Braunmüller
Markus Breunig
Hans-Peter Kriegel

Institute for Computer Science, University of Munich, Oettingenstr. 67, 80538 München, Germany

CIKM '00: Proceedings of the ninth international conference on Information and knowledge management, November 2000, Pages 298–305, https://doi.org/10.1145/354756.354832

Published: 06 November 2000

Index Terms

Information systems


Published in
CIKM '00: Proceedings of the ninth international conference on Information and knowledge management
November 2000
532 pages
ISBN: 1581133200
DOI: 10.1145/354756
Chairmen:
Arvin Agah, Univ. of Kansas, Lawrence
Jamie Callan, Carnegie Mellon Univ., Pittsburgh, PA
Elke Rundensteiner, Worcester Polytechnic Institute, Worcester, MA
Susan Gauch, Univ. of Kansas, Lawrence
Copyright © 2000 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clustering
data mining
database primitives
multidimensional index structure
similarity join
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
