DOI: 10.1145/1935826.1935918
poster

Scalable clustering of news search results

Published: 09 February 2011

ABSTRACT

In this paper, we present a system for clustering the search results of a news search engine in a fast and scalable manner. The news search interface presents the news articles relevant to a given query, organized into related news stories: each cluster corresponds to a news story, and the news articles are clustered into stories. The clustering system is organized into three components: offline clustering, incremental clustering, and realtime clustering. We propose novel techniques for clustering the search results in realtime. Experimental results on large collections of news documents show that our system is scalable and achieves good accuracy in clustering the news search results.
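
The abstract names the three components but gives no algorithmic detail. The following is a minimal sketch of how such a pipeline could fit together, not the authors' method. It assumes bag-of-words vectors, cosine similarity, and a greedy single-pass assignment with a fixed threshold; the class `StoryClusters`, its methods, and the threshold value are all hypothetical names and choices made for illustration.

```python
from collections import Counter
import math


def vectorize(text):
    """Bag-of-words term counts (an assumed document representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class StoryClusters:
    """Hypothetical three-stage clustering: offline, incremental, realtime."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed similarity cut-off
        self.centroids = []         # one summed term vector per story
        self.assignment = {}        # article id -> story index

    def add(self, doc_id, text):
        """Incremental step: attach an article to the closest story,
        or open a new story if nothing is similar enough."""
        vec = vectorize(text)
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.threshold:
            self.centroids.append(Counter())
            best = len(self.centroids) - 1
        self.centroids[best].update(vec)
        self.assignment[doc_id] = best
        return best

    def fit_offline(self, corpus):
        """Offline step: cluster an existing collection in one pass."""
        for doc_id, text in corpus.items():
            self.add(doc_id, text)

    def group_results(self, result_ids):
        """Realtime step: group a ranked result list by story,
        keeping the rank order of stories and of articles within them."""
        groups = {}
        for doc_id in result_ids:
            groups.setdefault(self.assignment[doc_id], []).append(doc_id)
        return list(groups.values())
```

In this sketch the expensive similarity work happens offline or at ingestion time, so the query-time step reduces to a dictionary lookup per result. A real system would likely refine the grouping per query rather than reuse precomputed story ids unchanged.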

        Published in

        WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
        February 2011
        870 pages
        ISBN: 9781450304931
        DOI: 10.1145/1935826

        Copyright © 2011 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • poster

        Acceptance Rates

        WSDM '11 Paper Acceptance Rate: 83 of 372 submissions, 22%
        Overall Acceptance Rate: 498 of 2,863 submissions, 17%
