Going In-Depth: Finding Longform on the Web

Authors:
Virginia Smith, University of California, Berkeley, Berkeley, CA, USA
Miriam Connor, Google Inc., Mountain View, CA, USA
Isabelle Stanton, Google Inc., Mountain View, CA, USA

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2015, Pages 2109–2118. https://doi.org/10.1145/2783258.2788599

Published: 10 August 2015

ABSTRACT

tl;dr: Longform articles are extended, in-depth pieces that often serve as feature stories in newspapers and magazines. In this work, we develop a system to automatically identify longform content across the web. Our novel classifier is highly accurate despite huge variation within longform in terms of topic, voice, and editorial taste. It is also scalable and interpretable, requiring a surprisingly small set of features based only on language and parse structures, length, and document interest. We implement our system at scale and use it to identify a corpus of several million longform documents. Using this corpus, we provide the first web-scale study with quantifiable and measurable information on longform, giving new insight into questions posed by the media on the past and current state of this famed literary medium.

Supplemental Material

p2109.mp4 (MP4, 242 MB)

Index Terms

Going In-Depth: Finding Longform on the Web
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees

Published in
KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2015
2378 pages
ISBN: 9781450336642
DOI: 10.1145/2783258
General Chairs: Longbing Cao (University of Technology, Sydney), Chengqi Zhang (University of Technology, Sydney)
Program Chairs: Thorsten Joachims (Cornell University), Geoff Webb (Monash University), Dragos D. Margineantu (Boeing Research), Graham Williams (Australian Taxation Office)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
feature engineering
machine learning
natural language processing
web mining
Qualifiers
- research-article
Acceptance Rates
KDD '15 Paper Acceptance Rate: 160 of 819 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%