
doi 10.1093/jssam/smac030

Bayesian Graphical Entity Resolution Using Exchangeable Random Partition Priors

Authors: Neil G. Marchant, Benjamin I. P. Rubinstein, Rebecca C. Steorts

Abstract: Entity resolution (record linkage or deduplication) is the process of identifying and linking duplicate records in databases. In this paper, we propose a Bayesian graphical approach for entity resolution that links records to latent entities, where the prior representation on the linkage structure is exchangeable. First, we adopt a flexible and tractable set of priors for the linkage structure, which corresponds to a special class of random partition models. Second, we propose a more realistic distortion model for categorical/discrete record attributes, which corrects a logical inconsistency with the standard hit-miss model. Third, we incorporate hyperpriors to improve flexibility. Fourth, we employ a partially collapsed Gibbs sampler for inferential speedups. Using a selection of private and nonprivate data sets, we investigate the impact of our modeling contributions and compare our model with two alternative Bayesian models. In addition, we conduct a simulation study for household survey data, where we vary distortion, duplication rates and data set size. We find that our model performs more consistently than the alternatives across a variety of scenarios and typically achieves the highest entity resolution accuracy (F1 score). Open source software is available for our proposed methodology, and we provide a discussion regarding our work and future directions.

Submitted 7 January, 2023; originally announced January 2023.

Comments: 27 pages, 4 figures, 3 tables. Includes 37 pages of appendices. This is an accepted manuscript to be published in the Journal of Survey Statistics and Methodology

arXiv:2112.01594 [pdf, other]

On the Reliability of Multiple Systems Estimation for the Quantification of Modern Slavery

Authors: Olivier Binette, Rebecca C. Steorts

Abstract: The quantification of modern slavery has received increased attention recently as organizations have come together to produce global estimates, where multiple systems estimation (MSE) is often used to this end. Echoing a long-standing controversy, disagreements have re-surfaced regarding the underlying MSE assumptions, the robustness of MSE methodology, and the accuracy of MSE estimates in this application. Our goal is to help address and move past these controversies. To do so, we review MSE, its assumptions, and commonly used models for modern slavery applications. We introduce all of the publicly available modern slavery datasets in the literature, providing a reproducible analysis and highlighting current issues. Specifically, we utilize an internal consistency approach that constructs subsets of data for which ground truth is available, allowing us to evaluate the accuracy of MSE estimators. Next, we propose a characterization of the large sample bias of estimators as a function of misspecified assumptions. Then, we propose an alternative to traditional (e.g., bootstrap-based) assessments of reliability, which allows us to visualize trajectories of MSE estimates to illustrate the robustness of estimates. Finally, our complementary analyses are used to provide guidance regarding the application and reliability of MSE methodology.

Submitted 2 December, 2021; originally announced December 2021.

Journal ref: Journal of the Royal Statistical Society: Series A (Statistics in Society), 1 - 37 (2022)

arXiv:2103.04025 [pdf, other]

Transformed Fay-Herriot Model with Measurement Error in Covariates

Authors: Sepideh Mosaferi, Malay Ghosh, Rebecca C. Steorts

Abstract: Statistical agencies are often asked to produce small area estimates (SAEs) for positively skewed variables. When domain sample sizes are too small to support direct estimators, effects of skewness of the response variable can be large. As such, it is important to appropriately account for the distribution of the response variable given available auxiliary information. Motivated by this issue and in order to stabilize the skewness and achieve normality in the response variable, we propose an area-level log-measurement error model on the response variable. Then, under our proposed modeling framework, we derive an empirical Bayes (EB) predictor of positive small area quantities subject to the covariates containing measurement error. We propose a corresponding mean squared prediction error (MSPE) of the EB predictor using both a jackknife and a bootstrap method. We show that the order of the bias is $O(m^{-1})$, where $m$ is the number of small areas. Finally, we investigate the performance of our methodology using both design-based and model-based simulation studies.

Submitted 5 March, 2021; originally announced March 2021.

arXiv:2008.04443 [pdf, other]

(Almost) All of Entity Resolution

Authors: Olivier Binette, Rebecca C. Steorts

Abstract: Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme - integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivational applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work that began in the 1940's and 50's and has led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are being used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Finally, we discuss current research topics of practical importance.

Submitted 17 January, 2022; v1 submitted 10 August, 2020; originally announced August 2020.

arXiv:2004.02008 [pdf, other]

Random Partition Models for Microclustering Tasks

Authors: Brenda Betancourt, Giacomo Zanella, Rebecca C. Steorts

Abstract: Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution, modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points -- the microclustering property. Motivated by these issues, we propose a general class of random partition models that satisfy the microclustering property with well-characterized theoretical properties. Our proposed models overcome major limitations in the existing literature on microclustering models, namely a lack of interpretability, identifiability, and full characterization of model asymptotic properties. Crucially, we drop the classical assumption of having an exchangeable sequence of data points, and instead assume an exchangeable sequence of clusters. In addition, our framework provides flexibility in terms of the prior distribution of cluster sizes, computational tractability, and applicability to a large number of microclustering tasks. We establish theoretical properties of the resulting class of priors, where we characterize the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size. Our framework allows a simple and efficient Markov chain Monte Carlo algorithm to perform statistical inference. We illustrate our proposed methodology on the microclustering task of entity resolution, where we provide a simulation study and real experiments on survey panel data.

Submitted 4 April, 2020; originally announced April 2020.

arXiv:1909.06039 [pdf, other]

doi 10.1080/10618600.2020.1825451

d-blink: Distributed End-to-End Bayesian Entity Resolution

Authors: Neil G. Marchant, Andee Kaplan, Daniel N. Elazar, Benjamin I. P. Rubinstein, Rebecca C. Steorts

Abstract: Entity resolution (ER; also known as record linkage or de-duplication) is the process of merging noisy databases, often in the absence of unique identifiers. A major advancement in ER methodology has been the application of Bayesian generative models, which provide a natural framework for inferring latent entities with rigorous quantification of uncertainty. Despite these advantages, existing models are severely limited in practice, as standard inference algorithms scale quadratically in the number of records. While scaling can be managed by fitting the model on separate blocks of the data, such a naïve approach may induce significant error in the posterior. In this paper, we propose a principled model for scalable Bayesian ER, called "distributed Bayesian linkage" or d-blink, which jointly performs blocking and ER without compromising posterior correctness. Our approach relies on several key ideas, including: (i) an auxiliary variable representation that induces a partition of the entities and records into blocks; (ii) a method for constructing well-balanced blocks based on k-d trees; (iii) a distributed partially-collapsed Gibbs sampler with improved mixing; and (iv) fast algorithms for performing Gibbs updates. Empirical studies on six data sets---including a case study on the 2010 Decennial Census---demonstrate the scalability and effectiveness of our approach.

Submitted 22 September, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

Comments: 32 pages, 6 figures, 5 tables. Includes 22 pages of supplementary material. This revision incorporates a case study on the 2010 U.S. Decennial Census

MSC Class: 62F15; 65C40; 68W15

arXiv:1810.05497 [pdf, other]

Probabilistic Blocking with An Application to the Syrian Conflict

Authors: Rebecca C. Steorts, Anshumali Shrivastava

Abstract: Entity resolution seeks to merge databases so as to remove duplicate entries where unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce $k$-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce a subquadratic variant of LSH to the literature, known as Densified One Permutation Hashing (DOPH). Third, we propose a weighted variant of DOPH. We illustrate each method on an application to a subset of the ongoing Syrian conflict, giving a discussion of each method.

Submitted 10 October, 2018; originally announced October 2018.

Comments: 16 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:1510.07714, arXiv:1710.02690

Journal ref: Steorts R.C., Shrivastava A. (2018) Probabilistic Blocking with an Application to the Syrian Conflict. PSD (2018)

arXiv:1810.04808 [pdf, other]

Generalized Bayesian Record Linkage and Regression with Exact Error Propagation

Authors: Rebecca C. Steorts, Andrea Tancredi, Brunero Liseo

Abstract: Record linkage (de-duplication or entity resolution) is the process of merging noisy databases to remove duplicate entities. While record linkage removes duplicate entities from such databases, the downstream task is any inferential, predictive, or post-linkage task on the linked data. One goal of the downstream task is obtaining a larger reference data set, allowing one to perform more accurate statistical analyses. In addition, there is inherent record linkage uncertainty passed to the downstream task. Motivated by the above, we propose a generalized Bayesian record linkage method and consider multiple regression analysis as the downstream task. Records are linked via a random partition model, which allows for a wide class to be considered. In addition, we jointly model the record linkage and downstream task, which allows one to account for the record linkage uncertainty exactly. Moreover, one is able to generate a feedback propagation mechanism of the information from the proposed Bayesian record linkage model into the downstream task. This feedback effect is essential to eliminate potential biases that can jeopardize the resulting downstream task. We apply our methodology to multiple linear regression, and illustrate empirically that the "feedback effect" is able to improve the performance of record linkage.

Submitted 10 October, 2018; originally announced October 2018.

Comments: 18 pages, 5 figures

Journal ref: Steorts, R.C., Tancredi, A., Liseo, B. Generalized Bayesian Record Linkage and Regression with Exact Error Propagation, PSD (2018)

arXiv:1810.01538 [pdf, other]

A Practical Approach to Proper Inference with Linked Data

Authors: Andee Kaplan, Brenda Betancourt, Rebecca C. Steorts

Abstract: Entity resolution (ER), comprising record linkage and de-duplication, is the process of merging noisy databases in the absence of unique identifiers to remove duplicate entities. One major challenge of analysis with linked data is identifying a representative record among determined matches to pass to an inferential or predictive task, referred to as the \emph{downstream task}. Additionally, incorporating uncertainty from ER in the downstream task is critical to ensure proper inference. To bridge the gap between ER and the downstream task in an analysis pipeline, we propose five methods to choose a representative (or canonical) record from linked data, referred to as canonicalization. Our methods are scalable in the number of records, appropriate in general data scenarios, and provide natural error propagation via a Bayesian canonicalization stage. The proposed methodology is evaluated on three simulated data sets and one application -- determining the relationship between demographic information and party affiliation in voter registration data from the North Carolina State Board of Elections. We first perform Bayesian ER and evaluate our proposed methods for canonicalization before considering the downstream tasks of linear and logistic regression. Bayesian canonicalization methods are empirically shown to improve downstream inference in both settings through prediction and coverage.

Submitted 8 February, 2022; v1 submitted 2 October, 2018; originally announced October 2018.

Comments: 31 pages, 2 figures

arXiv:1710.02690 [pdf, other]

Unique Entity Estimation with Application to the Syrian Conflict

Authors: Beidi Chen, Anshumali Shrivastava, Rebecca C. Steorts

Abstract: Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus on a related problem of unique entity estimation, which is the task of estimating the unique number of entities and associated standard errors in a data set with duplicate entities. Unique entity estimation shares many fundamental challenges of entity resolution, namely, that the computational cost of all-to-all entity comparisons is intractable for large databases. To circumvent this computational barrier, we propose an efficient (near-linear time) estimation algorithm based on locality sensitive hashing. Our estimator, under realistic assumptions, is unbiased and has provably low variance compared to existing random sampling based approaches. In addition, we empirically show its superiority over the state-of-the-art estimators on three real applications. The motivation for our work is to derive an accurate estimate of the documented, identifiable deaths in the ongoing Syrian conflict. Our methodology, when applied to the Syrian data set, provides an estimate of $191,874 \pm 1772$ documented, identifiable deaths, which is very close to the Human Rights Data Analysis Group (HRDAG) estimate of 191,369. Our work provides an example of challenges and efforts involved in solving a real, noisy, challenging problem where modeling assumptions may not hold.

Submitted 7 October, 2017; originally announced October 2017.

Comments: 35 pages, 6 figures, 2 tables

arXiv:1703.02679 [pdf, other]

Performance Bounds for Graphical Record Linkage

Authors: Rebecca C. Steorts, Matt Barnes, Willie Neiswanger

Abstract: Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional linkage methods directly linking records to one another are computationally infeasible as the number of records grows. As a result, it is increasingly common for researchers to treat record linkage as a clustering task, in which each latent entity is associated with one or more noisy database records. We critically assess performance bounds using the Kullback-Leibler (KL) divergence under a Bayesian record linkage framework, making connections to Kolchin partition models. We provide an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity. We give insights for when our bounds hold using simulated data and provide practical user guidance.

Submitted 7 March, 2017; originally announced March 2017.

Comments: 11 pages with supplement; 4 figures and 2 tables; to appear in AISTATS 2017

arXiv:1610.09780 [pdf, other]

Flexible Models for Microclustering with Application to Entity Resolution

Authors: Giacomo Zanella, Brenda Betancourt, Hanna Wallach, Jeffrey Miller, Abbas Zaidi, Rebecca C. Steorts

Abstract: Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman--Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.

Submitted 31 October, 2016; originally announced October 2016.

Comments: 15 pages, 3 figures, 1 table, to appear NIPS 2016. arXiv admin note: text overlap with arXiv:1512.00792

arXiv:1608.02209 [pdf, other]

Bayesian Learning of Dynamic Multilayer Networks

Authors: Daniele Durante, Nabanita Mukherjee, Rebecca C. Steorts

Abstract: A plethora of networks is being collected in a growing number of fields, including disease transmission, international relations, social interactions, and others. As data streams continue to grow, the complexity associated with these highly multidimensional connectivity data presents novel challenges. In this paper, we focus on the time-varying interconnections among a set of actors in multiple contexts, called layers. Current literature lacks flexible statistical models for dynamic multilayer networks, which can enhance quality in inference and prediction by efficiently borrowing information within each network, across time, and between layers. Motivated by this gap, we develop a Bayesian nonparametric model leveraging latent space representations. Our formulation characterizes the edge probabilities as a function of shared and layer-specific actors' positions in a latent space, with these positions changing in time via Gaussian processes. This representation facilitates dimensionality reduction and incorporates different sources of information in the observed data. In addition, we obtain tractable procedures for posterior computation, inference, and prediction. We provide theoretical results on the flexibility of our model. Our methods are tested on simulations and infection studies monitoring dynamic face-to-face contacts among individuals over multiple days, where we perform better than current methods in inference and prediction.

Submitted 30 December, 2016; v1 submitted 7 August, 2016; originally announced August 2016.

Journal ref: Journal of Machine Learning Research (2017). 18, 1-29

arXiv:1512.00792 [pdf, other]

Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set

Authors: Jeffrey Miller, Brenda Betancourt, Abbas Zaidi, Hanna Wallach, Rebecca C. Steorts

Abstract: Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman--Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For example, when performing entity resolution, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the \emph{microclustering property} and introducing a new model that exhibits this property. We compare this model to several commonly used clustering models by checking model fit using real and simulated data sets.

Submitted 2 December, 2015; originally announced December 2015.

Comments: 8 pages, 3 figures, NIPS Bayesian Nonparametrics: The Next Generation Workshop Series

arXiv:1510.07714 [pdf, other]

Blocking Methods Applied to Casualty Records from the Syrian Conflict

Authors: Peter Sadosky, Anshumali Shrivastava, Megan Price, Rebecca C. Steorts

Abstract: Estimation of death counts and associated standard errors is of great importance in armed conflicts such as the ongoing violence in Syria, as well as historical conflicts in Guatemala, Perú, Colombia, Timor Leste, and Kosovo. For example, statistical estimates of death counts were cited as important evidence in the trial of General Efraín Ríos Montt for acts of genocide in Guatemala. Estimation relies on both record linkage and multiple systems estimation. A key first step in this process is identifying ways to partition the records such that they are computationally manageable. This step is referred to as blocking and is a major challenge for the Syrian database since it is sparse in the number of duplicate records and feature poor in its attributes. As a consequence, we propose locality sensitive hashing (LSH) methods to overcome these challenges. We demonstrate the computational superiority and error rates of these methods by comparing our proposed approach with others in the literature. We conclude with a discussion of many challenges of merging LSH with record linkage to achieve an estimate of the number of uniquely documented deaths in the Syrian conflict.

Submitted 26 October, 2015; originally announced October 2015.

Comments: 25 pages, 6 figures

arXiv:1410.7056 [pdf, other]

Smoothing, Clustering, and Benchmarking for Small Area Estimation

Authors: Rebecca C. Steorts

Abstract: We develop constrained Bayesian estimation methods for small area problems: those requiring smoothness with respect to similarity across areas, such as geographic proximity or clustering by covariates; and benchmarking constraints, requiring (weighted) means of estimates to agree across levels of aggregation. We develop methods for constrained estimation decision-theoretically and discuss their geometric interpretation. Our constrained estimators are the solutions to tractable optimization problems and have closed-form solutions. Mean squared errors of the constrained estimators are calculated via bootstrapping. Our techniques are free of distributional assumptions and apply whether the estimator is linear or non-linear, univariate or multivariate. We illustrate our methods using data from the U.S. Census's Small Area Income and Poverty Estimates program.

Submitted 26 October, 2014; originally announced October 2014.

Comments: 24 pages, 4 figures, Submitted

arXiv:1410.4792 [pdf, ps, other]

Variational Bayes for Merging Noisy Databases

Authors: Tamara Broderick, Rebecca C. Steorts

Abstract: Bayesian entity resolution merges together multiple, noisy databases and returns the minimal collection of unique individuals represented, together with their true, latent record values. Bayesian methods allow flexible generative models that share power across databases as well as principled quantification of uncertainty for queries of the final, resolved database. However, existing Bayesian methods for entity resolution use Markov chain Monte Carlo (MCMC) approximations and are too slow to run on modern databases containing millions or billions of records. Instead, we propose applying variational approximations to allow scalable Bayesian inference in these models. We derive a coordinate-ascent approximation for mean-field variational Bayes, qualitatively compare our algorithm to existing methods, note unique challenges for inference that arise from the expected distribution of cluster sizes in entity resolution, and discuss directions for future work in this domain.

Submitted 17 October, 2014; originally announced October 2014.

Comments: 12 pages

arXiv:1409.0643 [pdf, other]

Entity Resolution with Empirically Motivated Priors

Authors: Rebecca C. Steorts

Abstract: Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities. Bayesian approaches provide attractive benefits, naturally providing uncertainty quantification via posterior probabilities. We propose a novel record linkage approach based on empirical Bayesian principles. Specifically, the empirical Bayesian-type step consists of taking the empirical distribution function of the data as the prior for the latent entities. This approach improves on the earlier HB approach not only by avoiding the prior specification problem but also by allowing both categorical and string-valued variables. Our extension to string-valued variables also involves the proposal of a new probabilistic mechanism by which observed record values for string fields can deviate from the values of their associated latent entities. Categorical fields that deviate from their corresponding true value are simply drawn from the empirical distribution function. We apply our proposed methodology to a simulated data set of German names and an Italian household survey, showing our method performs favorably compared to several standard methods in the literature. We also consider the robustness of our methods to changes in the hyper-parameters.

Submitted 27 April, 2015; v1 submitted 2 September, 2014; originally announced September 2014.

Comments: 30 pages, 12 figures

arXiv:1407.3191 [pdf, other]

A Comparison of Blocking Methods for Record Linkage

Authors: Rebecca C. Steorts, Samuel L. Ventura, Mauricio Sadinle, Stephen E. Fienberg

Abstract: Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as "private blocking." We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues.

Submitted 11 July, 2014; originally announced July 2014.

Comments: 22 pages, 2 tables, 7 figures

arXiv:1405.6416 [pdf, other]

Discussion of "Single and Two-Stage Cross-Sectional and Time Series Benchmarking Procedures for SAE"

Authors: Rebecca C. Steorts, M. Delores Ugarte

Abstract: We congratulate the authors for a stimulating and valuable manuscript, providing a careful review of the state-of-the-art in cross-sectional and time-series benchmarking procedures for small area estimation. They develop a novel two-stage benchmarking method for hierarchical time series models, where they evaluate their procedure by estimating monthly total unemployment using data from the U.S. Census Bureau. We discuss three topics: linearity and model misspecification, computational complexity and model comparisons, and some aspects of small area estimation in practice. More specifically, we pose the following questions to the authors, which they may wish to answer: How robust is their model to misspecification? Is it time to perhaps move away from linear models of the type considered by Battese et al. (1988) and Fay and Herriot (1979)? What is the asymptotic computational complexity and what comparisons can be made to other models? Should the benchmarking constraints be inherently fixed or should they be random?

Submitted 25 May, 2014; originally announced May 2014.

Comments: 6 pages, 1 figure

arXiv:1403.0566 [pdf, ps, other]

doi 10.1214/13-STS448

Discussion of "Estimating the Distribution of Dietary Consumption Patterns"

Authors: Stephen E. Fienberg, Rebecca C. Steorts

Abstract: Discussion of "Estimating the Distribution of Dietary Consumption Patterns" by Raymond J. Carroll [arXiv:1405.4667].

Submitted 20 May, 2014; v1 submitted 3 March, 2014; originally announced March 2014.

Comments: Published at http://dx.doi.org/10.1214/13-STS448 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS448

Journal ref: Statistical Science 2014, Vol. 29, No. 1, 95-96

arXiv:1403.0211 [pdf, other]

SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication

Authors: Rebecca C. Steorts, Rob Hall, Stephen E. Fienberg

Abstract: We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate $k$-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high dimensional parameter space. We assess our results on real and simulated data.

Submitted 2 March, 2014; originally announced March 2014.

Comments: AISTATS (2014), to appear; 9 pages with references, 2 page supplement, 4 figures. Shorter version of arXiv:1312.4645

arXiv:1401.6595 [pdf, ps, other]

doi 10.1214/15-AOAS837

Regularized brain reading with shrinkage and smoothing

Authors: Leila Wehbe, Aaditya Ramdas, Rebecca C. Steorts, Cosma Rohilla Shalizi

Abstract: Functional neuroimaging measures how the brain responds to complex stimuli. However, sample sizes are modest, noise is substantial, and stimuli are high dimensional. Hence, direct estimates are inherently imprecise and call for regularization. We compare a suite of approaches which regularize via shrinkage: ridge regression, the elastic net (a generalization of ridge regression and the lasso), and a hierarchical Bayesian model based on small area estimation (SAE). We contrast regularization with spatial smoothing and combinations of smoothing and shrinkage. All methods are tested on functional magnetic resonance imaging (fMRI) data from multiple subjects participating in two different experiments related to reading, for both predicting neural response to stimuli and decoding stimuli from responses. Interestingly, when the regularization parameters are chosen by cross-validation independently for every voxel, low/high regularization is chosen in voxels where the classification accuracy is high/low, indicating that the regularization intensity is a good tool for identification of relevant voxels for the cognitive task. Surprisingly, all the regularization methods work about equally well, suggesting that beating basic smoothing and shrinkage will take not only clever methods, but also careful modeling.

Submitted 4 February, 2016; v1 submitted 25 January, 2014; originally announced January 2014.

Comments: Published at http://dx.doi.org/10.1214/15-AOAS837 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS837

Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 4, 1997-2022

arXiv:1312.4645 [pdf, other]

A Bayesian Approach to Graphical Record Linkage and De-duplication

Authors: Rebecca C. Steorts, Rob Hall, Stephen E. Fienberg

Abstract: We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture-recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature.

Submitted 30 October, 2015; v1 submitted 17 December, 2013; originally announced December 2013.

Comments: 39 pages, 8 figures, 8 tables. Longer version of arXiv:1403.0211, In press, Journal of the American Statistical Association: Theory and Methods (2015)

arXiv:1305.6657 [pdf, other]

Two-stage Benchmarking as Applied to Small Area Estimation

Authors: Malay Ghosh, Rebecca C. Steorts

Abstract: There has been recent growth in small area estimation due to the need for more precise estimation of small geographic areas, which has led to groups such as the U.S. Census Bureau, Google, and the RAND corporation utilizing small area estimation procedures. We develop novel two-stage benchmarking methodology using a single weighted squared error loss function that combines the loss at the unit level and the area level without any specific distributional assumptions. We consider this loss while benchmarking the weighted means at each level or both the weighted means and weighted variability at the unit level. Multivariate extensions are immediate. We analyze the behavior of our methods using a complex study from the National Health Interview Survey (NHIS) from 2000, which estimates the proportion of people that do not have health insurance for many domains of an Asian subpopulation. Finally, the methodology is explored via simulated data under the proposed model. We ultimately conclude that three proposed benchmarked Bayes estimators do not dominate each other, leaving much exploration for future research.

Submitted 15 July, 2013; v1 submitted 28 May, 2013; originally announced May 2013.

MSC Class: 62D05; 62F15

arXiv:1304.1756 [pdf, other]

Trouble With The Curve: Improving MLB Pitch Classification

Authors: Michael A. Pane, Samuel L. Ventura, Rebecca C. Steorts, A. C. Thomas

Abstract: The PITCHf/x database has allowed the statistical analysis of Major League Baseball (MLB) to flourish since its introduction in late 2006. Using PITCHf/x, pitches have been classified by hand, requiring considerable effort, or using neural network clustering and classification, which is often difficult to interpret. To address these issues, we use model-based clustering with a multivariate Gaussian mixture model and an appropriate adjustment factor as an alternative to current methods. Furthermore, we describe a new pitch classification algorithm based on our clustering approach to address the problems of pitch misclassification. We illustrate our methods for various pitchers from the PITCHf/x database that covers a wide variety of pitch types.

Submitted 5 April, 2013; originally announced April 2013.

arXiv:1304.1600 [pdf, other]

doi 10.5705/ss.2012.053

On estimation of mean squared errors of benchmarked empirical Bayes estimators

Authors: Rebecca C. Steorts, Malay Ghosh

Abstract: We consider benchmarked empirical Bayes (EB) estimators under the basic area-level model of Fay and Herriot while requiring the standard benchmarking constraint. In this paper we determine the excess mean squared error (MSE) from constraining the estimates through benchmarking. We show that the increase due to benchmarking is O(m^{-1}), where m is the number of small areas. Furthermore, we find an asymptotically unbiased estimator of this MSE and compare it to the second-order approximation of the MSE of the EB estimator or, equivalently, of the MSE of the empirical best linear unbiased predictor (EBLUP), that was derived by Prasad and Rao (1990). Moreover, using methods similar to those of Butar and Lahiri (2003), we compute a parametric bootstrap estimator of the MSE of the benchmarked EB estimator under the Fay-Herriot model and compare it to the MSE of the benchmarked EB estimator found by a second-order approximation. Finally, we illustrate our methods using SAIPE data from the U.S. Census Bureau, and in a simulation study.

Submitted 4 April, 2013; originally announced April 2013.

Journal ref: Statistica Sinica 23 (2013), 749-767

Showing 1–27 of 27 results for author: Steorts, R C