Showing 1–27 of 27 results for author: Steorts, R C

Searching in archive stat.
  1. Bayesian Graphical Entity Resolution Using Exchangeable Random Partition Priors

    Authors: Neil G. Marchant, Benjamin I. P. Rubinstein, Rebecca C. Steorts

    Abstract: Entity resolution (record linkage or deduplication) is the process of identifying and linking duplicate records in databases. In this paper, we propose a Bayesian graphical approach for entity resolution that links records to latent entities, where the prior representation on the linkage structure is exchangeable. First, we adopt a flexible and tractable set of priors for the linkage structure, wh…

    Submitted 7 January, 2023; originally announced January 2023.

    Comments: 27 pages, 4 figures, 3 tables. Includes 37 pages of appendices. This is an accepted manuscript to be published in the Journal of Survey Statistics and Methodology

  2. arXiv:2112.01594  [pdf, other]

    stat.ME stat.AP

    On the Reliability of Multiple Systems Estimation for the Quantification of Modern Slavery

    Authors: Olivier Binette, Rebecca C. Steorts

    Abstract: The quantification of modern slavery has received increased attention recently as organizations have come together to produce global estimates, where multiple systems estimation (MSE) is often used to this end. Echoing a long-standing controversy, disagreements have re-surfaced regarding the underlying MSE assumptions, the robustness of MSE methodology, and the accuracy of MSE estimates in this ap…

    Submitted 2 December, 2021; originally announced December 2021.

    Journal ref: Journal of the Royal Statistical Society: Series A (Statistics in Society), 1 - 37 (2022)

  3. arXiv:2103.04025  [pdf, other]

    stat.ME

    Transformed Fay-Herriot Model with Measurement Error in Covariates

    Authors: Sepideh Mosaferi, Malay Ghosh, Rebecca C. Steorts

    Abstract: Statistical agencies are often asked to produce small area estimates (SAEs) for positively skewed variables. When domain sample sizes are too small to support direct estimators, effects of skewness of the response variable can be large. As such, it is important to appropriately account for the distribution of the response variable given available auxiliary information. Motivated by this issue and…

    Submitted 5 March, 2021; originally announced March 2021.

  4. arXiv:2008.04443  [pdf, other]

    stat.ME cs.DB stat.ML

    (Almost) All of Entity Resolution

    Authors: Olivier Binette, Rebecca C. Steorts

    Abstract: Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme - integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrat…

    Submitted 17 January, 2022; v1 submitted 10 August, 2020; originally announced August 2020.

  5. arXiv:2004.02008  [pdf, other]

    stat.ME math.ST

    Random Partition Models for Microclustering Tasks

    Authors: Brenda Betancourt, Giacomo Zanella, Rebecca C. Steorts

    Abstract: Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution, modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the…

    Submitted 4 April, 2020; originally announced April 2020.

  6. arXiv:1909.06039  [pdf, other]

    stat.CO cs.DB cs.LG stat.ML

    d-blink: Distributed End-to-End Bayesian Entity Resolution

    Authors: Neil G. Marchant, Andee Kaplan, Daniel N. Elazar, Benjamin I. P. Rubinstein, Rebecca C. Steorts

    Abstract: Entity resolution (ER; also known as record linkage or de-duplication) is the process of merging noisy databases, often in the absence of unique identifiers. A major advancement in ER methodology has been the application of Bayesian generative models, which provide a natural framework for inferring latent entities with rigorous quantification of uncertainty. Despite these advantages, existing mode…

    Submitted 22 September, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: 32 pages, 6 figures, 5 tables. Includes 22 pages of supplementary material. This revision incorporates a case study on the 2010 U.S. Decennial Census

    MSC Class: 62F15; 65C40; 68W15

  7. arXiv:1810.05497  [pdf, other]

    cs.DB cs.LG stat.AP stat.ML

    Probabilistic Blocking with An Application to the Syrian Conflict

    Authors: Rebecca C. Steorts, Anshumali Shrivastava

    Abstract: Entity resolution seeks to merge databases so as to remove duplicate entries where unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce $k$-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into…

    Submitted 10 October, 2018; originally announced October 2018.

    Comments: 16 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:1510.07714, arXiv:1710.02690

    Journal ref: Steorts R.C., Shrivastava A. (2018) Probabilistic Blocking with an Application to the Syrian Conflict. PSD (2018)

  8. arXiv:1810.04808  [pdf, other]

    stat.ME

    Generalized Bayesian Record Linkage and Regression with Exact Error Propagation

    Authors: Rebecca C. Steorts, Andrea Tancredi, Brunero Liseo

    Abstract: Record linkage (de-duplication or entity resolution) is the process of merging noisy databases to remove duplicate entities. While record linkage removes duplicate entities from such databases, the downstream task is any inferential, predictive, or post-linkage task on the linked data. One goal of the downstream task is obtaining a larger reference data set, allowing one to perform more accurate s…

    Submitted 10 October, 2018; originally announced October 2018.

    Comments: 18 pages, 5 figures

    Journal ref: Steorts, R.C., Tancredi, A., Liseo, B. Generalized Bayesian Record Linkage and Regression with Exact Error Propagation, PSD (2018)

  9. arXiv:1810.01538  [pdf, other]

    stat.ME cs.DB cs.LG

    A Practical Approach to Proper Inference with Linked Data

    Authors: Andee Kaplan, Brenda Betancourt, Rebecca C. Steorts

    Abstract: Entity resolution (ER), comprising record linkage and de-duplication, is the process of merging noisy databases in the absence of unique identifiers to remove duplicate entities. One major challenge of analysis with linked data is identifying a representative record among determined matches to pass to an inferential or predictive task, referred to as the \emph{downstream task}. Additionally, incor…

    Submitted 8 February, 2022; v1 submitted 2 October, 2018; originally announced October 2018.

    Comments: 31 pages, 2 figures

  10. arXiv:1710.02690  [pdf, other]

    stat.AP cs.DB cs.DS

    Unique Entity Estimation with Application to the Syrian Conflict

    Authors: Beidi Chen, Anshumali Shrivastava, Rebecca C. Steorts

    Abstract: Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus o…

    Submitted 7 October, 2017; originally announced October 2017.

    Comments: 35 pages, 6 figures, 2 tables

  11. arXiv:1703.02679  [pdf, other]

    math.ST cs.IT stat.ME stat.ML

    Performance Bounds for Graphical Record Linkage

    Authors: Rebecca C. Steorts, Matt Barnes, Willie Neiswanger

    Abstract: Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional linkage methods directly linking records to one another are computationally infeasible as the number of records grows. As a res…

    Submitted 7 March, 2017; originally announced March 2017.

    Comments: 11 pages with supplement; 4 figures and 2 tables; to appear in AISTATS 2017

  12. arXiv:1610.09780  [pdf, other]

    stat.ME math.ST stat.AP stat.ML

    Flexible Models for Microclustering with Application to Entity Resolution

    Authors: Giacomo Zanella, Brenda Betancourt, Hanna Wallach, Jeffrey Miller, Abbas Zaidi, Rebecca C. Steorts

    Abstract: Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman--Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. F…

    Submitted 31 October, 2016; originally announced October 2016.

    Comments: 15 pages, 3 figures, 1 table, to appear in NIPS 2016. arXiv admin note: text overlap with arXiv:1512.00792

  13. arXiv:1608.02209  [pdf, other]

    stat.ML stat.ME

    Bayesian Learning of Dynamic Multilayer Networks

    Authors: Daniele Durante, Nabanita Mukherjee, Rebecca C. Steorts

    Abstract: A plethora of networks is being collected in a growing number of fields, including disease transmission, international relations, social interactions, and others. As data streams continue to grow, the complexity associated with these highly multidimensional connectivity data presents novel challenges. In this paper, we focus on the time-varying interconnections among a set of actors in multiple co…

    Submitted 30 December, 2016; v1 submitted 7 August, 2016; originally announced August 2016.

    Journal ref: Journal of Machine Learning Research (2017). 18, 1-29

  14. arXiv:1512.00792  [pdf, other]

    stat.ME stat.AP stat.CO stat.ML

    Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set

    Authors: Jeffrey Miller, Brenda Betancourt, Abbas Zaidi, Hanna Wallach, Rebecca C. Steorts

    Abstract: Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman--Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For exampl…

    Submitted 2 December, 2015; originally announced December 2015.

    Comments: 8 pages, 3 figures, NIPS Bayesian Nonparametrics: The Next Generation Workshop Series

  15. arXiv:1510.07714  [pdf, other]

    stat.AP cs.DB

    Blocking Methods Applied to Casualty Records from the Syrian Conflict

    Authors: Peter Sadosky, Anshumali Shrivastava, Megan Price, Rebecca C. Steorts

    Abstract: Estimation of death counts and associated standard errors is of great importance in armed conflict such as the ongoing violence in Syria, as well as historical conflicts in Guatemala, Perú, Colombia, Timor Leste, and Kosovo. For example, statistical estimates of death counts were cited as important evidence in the trial of General Efraín Ríos Montt for acts of genocide in Guatemala. Estimation rel…

    Submitted 26 October, 2015; originally announced October 2015.

    Comments: 25 pages, 6 figures

  16. arXiv:1410.7056  [pdf, other]

    stat.ME stat.AP

    Smoothing, Clustering, and Benchmarking for Small Area Estimation

    Authors: Rebecca C. Steorts

    Abstract: We develop constrained Bayesian estimation methods for small area problems: those requiring smoothness with respect to similarity across areas, such as geographic proximity or clustering by covariates; and benchmarking constraints, requiring (weighted) means of estimates to agree across levels of aggregation. We develop methods for constrained estimation decision-theoretically and discuss their ge…

    Submitted 26 October, 2014; originally announced October 2014.

    Comments: 24 pages, 4 figures, Submitted

  17. arXiv:1410.4792  [pdf, ps, other]

    stat.ME stat.ML

    Variational Bayes for Merging Noisy Databases

    Authors: Tamara Broderick, Rebecca C. Steorts

    Abstract: Bayesian entity resolution merges together multiple, noisy databases and returns the minimal collection of unique individuals represented, together with their true, latent record values. Bayesian methods allow flexible generative models that share power across databases as well as principled quantification of uncertainty for queries of the final, resolved database. However, existing Bayesian metho…

    Submitted 17 October, 2014; originally announced October 2014.

    Comments: 12 pages

  18. arXiv:1409.0643  [pdf, other]

    stat.ME

    Entity Resolution with Empirically Motivated Priors

    Authors: Rebecca C. Steorts

    Abstract: Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records t…

    Submitted 27 April, 2015; v1 submitted 2 September, 2014; originally announced September 2014.

    Comments: 30 pages, 12 figures

  19. arXiv:1407.3191  [pdf, other]

    cs.DB stat.AP

    A Comparison of Blocking Methods for Record Linkage

    Authors: Rebecca C. Steorts, Samuel L. Ventura, Mauricio Sadinle, Stephen E. Fienberg

    Abstract: Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sens…

    Submitted 11 July, 2014; originally announced July 2014.

    Comments: 22 pages, 2 tables, 7 figures

  20. arXiv:1405.6416  [pdf, other]

    stat.OT stat.ME

    Discussion of "Single and Two-Stage Cross-Sectional and Time Series Benchmarking Procedures for SAE"

    Authors: Rebecca C. Steorts, M. Delores Ugarte

    Abstract: We congratulate the authors for a stimulating and valuable manuscript, providing a careful review of the state-of-the-art in cross-sectional and time-series benchmarking procedures for small area estimation. They develop a novel two-stage benchmarking method for hierarchical time series models, where they evaluate their procedure by estimating monthly total unemployment using data from the U.S. Ce…

    Submitted 25 May, 2014; originally announced May 2014.

    Comments: 6 pages, 1 figure

  21. Discussion of "Estimating the Distribution of Dietary Consumption Patterns"

    Authors: Stephen E. Fienberg, Rebecca C. Steorts

    Abstract: Discussion of "Estimating the Distribution of Dietary Consumption Patterns" by Raymond J. Carroll [arXiv:1405.4667].

    Submitted 20 May, 2014; v1 submitted 3 March, 2014; originally announced March 2014.

    Comments: Published at http://dx.doi.org/10.1214/13-STS448 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-STS-STS448

    Journal ref: Statistical Science 2014, Vol. 29, No. 1, 95-96

  22. arXiv:1403.0211  [pdf, other]

    stat.CO stat.AP

    SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication

    Authors: Rebecca C. Steorts, Rob Hall, Stephen E. Fienberg

    Abstract: We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a {\em bipartite} graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of…

    Submitted 2 March, 2014; originally announced March 2014.

    Comments: AISTATS (2014), to appear; 9 pages with references, 2 page supplement, 4 figures. Shorter version of arXiv:1312.4645

  23. Regularized brain reading with shrinkage and smoothing

    Authors: Leila Wehbe, Aaditya Ramdas, Rebecca C. Steorts, Cosma Rohilla Shalizi

    Abstract: Functional neuroimaging measures how the brain responds to complex stimuli. However, sample sizes are modest, noise is substantial, and stimuli are high dimensional. Hence, direct estimates are inherently imprecise and call for regularization. We compare a suite of approaches which regularize via shrinkage: ridge regression, the elastic net (a generalization of ridge regression and the lasso), and…

    Submitted 4 February, 2016; v1 submitted 25 January, 2014; originally announced January 2014.

    Comments: Published at http://dx.doi.org/10.1214/15-AOAS837 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOAS-AOAS837

    Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 4, 1997-2022

  24. arXiv:1312.4645  [pdf, other]

    stat.ME

    A Bayesian Approach to Graphical Record Linkage and De-duplication

    Authors: Rebecca C. Steorts, Rob Hall, Stephen E. Fienberg

    Abstract: We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of…

    Submitted 30 October, 2015; v1 submitted 17 December, 2013; originally announced December 2013.

    Comments: 39 pages, 8 figures, 8 tables. Longer version of arXiv:1403.0211, In press, Journal of the American Statistical Association: Theory and Methods (2015)

  25. arXiv:1305.6657  [pdf, other]

    stat.ME

    Two-stage Benchmarking as Applied to Small Area Estimation

    Authors: Malay Ghosh, Rebecca C. Steorts

    Abstract: There has been recent growth in small area estimation due to the need for more precise estimation of small geographic areas, which has led to groups such as the U.S. Census Bureau, Google, and the RAND corporation utilizing small area estimation procedures. We develop novel two-stage benchmarking methodology using a single weighted squared error loss function that combines the loss at the unit lev…

    Submitted 15 July, 2013; v1 submitted 28 May, 2013; originally announced May 2013.

    MSC Class: 62D05; 62F15

  26. arXiv:1304.1756  [pdf, other]

    stat.AP

    Trouble With The Curve: Improving MLB Pitch Classification

    Authors: Michael A. Pane, Samuel L. Ventura, Rebecca C. Steorts, A. C. Thomas

    Abstract: The PITCHf/x database has allowed the statistical analysis of Major League Baseball (MLB) to flourish since its introduction in late 2006. Using PITCHf/x, pitches have been classified by hand, requiring considerable effort, or using neural network clustering and classification, which is often difficult to interpret. To address these issues, we use model-based clustering with a multivariate Gaus…

    Submitted 5 April, 2013; originally announced April 2013.

  27. On estimation of mean squared errors of benchmarked empirical Bayes estimators

    Authors: Rebecca C. Steorts, Malay Ghosh

    Abstract: We consider benchmarked empirical Bayes (EB) estimators under the basic area-level model of Fay and Herriot while requiring the standard benchmarking constraint. In this paper we determine the excess mean squared error (MSE) from constraining the estimates through benchmarking. We show that the increase due to benchmarking is O(m^{-1}), where m is the number of small areas. Furthermore, we find an…

    Submitted 4 April, 2013; originally announced April 2013.

    Journal ref: Statistica Sinica 23 (2013), 749-767