DOI: 10.1145/2783258.2788599
research-article
Open Access

Going In-Depth: Finding Longform on the Web

Published: 10 August 2015

ABSTRACT

tl;dr: Longform articles are extended, in-depth pieces that often serve as feature stories in newspapers and magazines. In this work, we develop a system to automatically identify longform content across the web. Our novel classifier is highly accurate despite huge variation within longform in terms of topic, voice, and editorial taste. It is also scalable and interpretable, requiring a surprisingly small set of features based only on language and parse structures, length, and document interest. We implement our system at scale and use it to identify a corpus of several million longform documents. Using this corpus, we provide the first web-scale study with quantifiable and measurable information on longform, giving new insight into questions posed by the media on the past and current state of this famed literary medium.

Published in

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2015
2378 pages
ISBN: 9781450336642
DOI: 10.1145/2783258

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

KDD '15 Paper Acceptance Rate: 160 of 819 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
