ABSTRACT
Longform articles are extended, in-depth pieces that often serve as feature stories in newspapers and magazines. In this work, we develop a system to automatically identify longform content across the web. Our novel classifier is highly accurate despite huge variation within longform in terms of topic, voice, and editorial taste. It is also scalable and interpretable, requiring a surprisingly small set of features based only on language and parse structures, length, and document interest. We implement our system at scale and use it to identify a corpus of several million longform documents. Using this corpus, we provide the first web-scale quantitative study of longform, giving new insight into questions posed by the media on the past and current state of this famed literary medium.
Index Terms
- Going In-Depth: Finding Longform on the Web