ABSTRACT
In this paper, we present a system that clusters the search results of a news search engine in a fast and scalable manner. The news search interface presents the articles relevant to a given query organized into related news stories, so that each cluster corresponds to one news story. The clustering system is organized into three components: offline clustering, incremental clustering, and realtime clustering. We propose novel techniques for clustering the search results in realtime. Experimental results on large collections of news documents show that our system is scalable and achieves good accuracy in clustering news search results.