poster

Scalable clustering of news search results

Authors:
Srinivas Vadrevu, Yahoo! Labs, Sunnyvale, CA, USA
Choon Hui Teo, Yahoo! Labs, Sunnyvale, CA, USA
Suju Rajan, Yahoo! Labs, Sunnyvale, CA, USA
Kunal Punera, Yahoo! Labs, Sunnyvale, CA, USA
Byron Dom, Yahoo! Labs, Sunnyvale, CA, USA
Alexander J. Smola, Yahoo! Labs, Sunnyvale, CA, USA
Yi Chang, Yahoo! Labs, Sunnyvale, CA, USA
Zhaohui Zheng, Yahoo! Labs, Sunnyvale, CA, USA

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
February 2011
Pages 675–684
https://doi.org/10.1145/1935826.1935918

Published: 09 February 2011

ABSTRACT

We present a system for clustering the search results of a news search engine in a fast and scalable manner. The news search interface presents the articles relevant to a given query organized into related news stories: each cluster corresponds to one news story, and the articles are grouped by story. The clustering system has three components: offline clustering, incremental clustering, and realtime clustering. We propose novel techniques for clustering the search results in realtime. Experimental results on large collections of news documents show that our system is scalable and achieves good accuracy in clustering news search results.
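
The abstract names an incremental clustering component but does not describe how it works. As a purely illustrative sketch, and not the authors' method, the following Python shows one common way such a step could be done: each incoming article joins the most similar existing story cluster when cosine similarity over bag-of-words counts passes a threshold, and otherwise starts a new cluster. The function names, the bag-of-words features, and the 0.3 threshold are all assumptions made for this example.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed stand-in for real features)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_incremental(clusters, article, threshold=0.3):
    """Add `article` to the most similar cluster, or open a new one.

    `clusters` is a list of dicts with a summed term-count vector
    ("centroid") and the member articles. Returns the cluster index.
    The threshold value is hypothetical, not taken from the paper.
    """
    vec = vectorize(article)
    best_idx, best_sim = None, 0.0
    for i, cluster in enumerate(clusters):
        sim = cosine(vec, cluster["centroid"])
        if sim > best_sim:
            best_idx, best_sim = i, sim
    if best_idx is not None and best_sim >= threshold:
        # Counter.update adds counts, so the centroid stays a running sum;
        # cosine similarity is unaffected by the vector's overall scale.
        clusters[best_idx]["centroid"].update(vec)
        clusters[best_idx]["articles"].append(article)
        return best_idx
    clusters.append({"centroid": vec, "articles": [article]})
    return len(clusters) - 1


clusters = []
first = assign_incremental(clusters, "earthquake hits coastal city overnight")
second = assign_incremental(clusters, "coastal city earthquake damage assessed")
third = assign_incremental(clusters, "tech company reports quarterly earnings")
```

In this toy run the two earthquake articles share enough terms to land in the same cluster, while the earnings article opens a second one. A production system would presumably use richer features and a tuned threshold.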

Index Terms

Scalable clustering of news search results
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information retrieval

Published in
WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
February 2011
870 pages
ISBN: 9781450304931
DOI: 10.1145/1935826
General Chair: Irwin King (CUHK, Hong Kong)
Program Chairs: Wolfgang Nejdl (L3S and University of Hannover, Germany), Hang Li (Microsoft Research Asia, China)
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clustering
news clustering
news search
query based clustering
realtime clustering
Qualifiers
- poster
Acceptance Rates
WSDM '11 Paper Acceptance Rate: 83 of 372 submissions, 22%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%