Dependencies revisited for improving data quality

Wenfei Fan

doi:10.1145/1376916.1376940

Dependency theory is almost as old as relational databases themselves, and has traditionally been used to improve the quality of schema, among other things. Recently there has been renewed interest in dependencies for improving the quality of data. The increasing demand for data quality technology has also motivated revisions of classical dependencies, to capture more inconsistencies in real-life data, and to match, repair and query the inconsistent data. This paper aims to provide an overview


... of recent advances in revising classical dependencies for improving data quality.

Conditional dependencies. We begin with extensions of traditional FDs and INDs that capture more of the inconsistencies in real-life data. Consider, for example, a relation consisting of records of customers in the US and UK. While in the UK, zip code determines street, it is not the case in the US; thus one cannot detect errors in the UK records by enforcing zip → street as an FD on the entire customer relation. To remedy these limitations, extensions of functional and inclusion dependencies have been introduced [36, 20], referred to as conditional functional dependencies and conditional inclusion dependencies (CFDs and CINDs), respectively. Conditional dependencies add to their traditional counterparts a specification of patterns of data values and variables. The semantics is obtained by restricting the traditional semantics to only those tuples that match the patterns, rather than applying it to the entire relation(s). These dependencies make a weaker assertion than traditional FDs and INDs, and hence are more widely applicable.

Matching dependencies. Another longstanding line of research associated with data quality is object identification, a.k.a. data deduplication, record linkage, merge-purge and record matching. Given one or more relations, we want to identify tuples from those relations that refer to the same real-world object. This is essential to, among other things, data cleaning, data integration, and credit card fraud detection. Prior approaches to object identification are often seen as orthogonal to dependency-based ones. Central to those approaches is the determination of comparison vectors and matching rules, i.e., what attributes should be selected and how they should be compared in order to identify tuples; these rules are typically given in a procedural way, and rely heavily on domain-specific heuristics (see [32] for a recent survey on object identification).
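As an illustration of the conditional functional dependency described earlier in this passage (zip determines street, but only for UK tuples), the following is a minimal Python sketch. It is not taken from the paper: the function name `cfd_violations`, the attribute names, and the sample customer records are all hypothetical, and the pattern is simplified to constant values on the condition attributes only.

```python
from collections import defaultdict

def cfd_violations(rows, pattern, lhs, rhs):
    """Return groups of rows that match `pattern`, agree on `lhs`,
    but disagree on `rhs`. An empty pattern gives a plain FD check."""
    groups = defaultdict(list)
    for row in rows:
        # Only tuples matching the constant pattern are constrained.
        if all(row[attr] == value for attr, value in pattern.items()):
            groups[tuple(row[a] for a in lhs)].append(row)
    violations = []
    for members in groups.values():
        # More than one distinct right-hand side means the group is inconsistent.
        if len({tuple(m[a] for a in rhs) for m in members}) > 1:
            violations.append(members)
    return violations

# Hypothetical sample data, invented for this sketch.
customers = [
    {"country": "UK", "zip": "EH8 9AB", "street": "Crichton St"},
    {"country": "UK", "zip": "EH8 9AB", "street": "Mayfield Rd"},
    {"country": "US", "zip": "07974", "street": "Mountain Ave"},
    {"country": "US", "zip": "07974", "street": "Main St"},
]

# Conditional version: zip -> street enforced only where country = "UK".
uk_violations = cfd_violations(customers, {"country": "UK"}, ["zip"], ["street"])

# Unconditional FD zip -> street on the whole relation.
plain_fd_violations = cfd_violations(customers, {}, ["zip"], ["street"])
```

Under these assumptions, the conditional check flags only the conflicting UK pair, whereas the unconditional FD also flags the two US records, which share a zip code legitimately.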
We show that matching rules can be incorporated into the framework of dependencies by introducing matching dependencies, an extension of FDs that is defined across multiple relations and incorporates domain-specific similarity and matching operators [38]. For example, a matching rule (taken from [48]) can be expressed as a matching dependency to assure that if two customer tuples have the same address and last name, and moreover, their first names are similar (but may not be identical), then the two tuples refer to the same person. These rules could then be combined with other dependencies (traditional or conditional) for data cleaning. This allows us to study the interaction between matching and cleaning rules in a uniform framework, and to automatically deduce new matching rules via implication analysis of the dependencies.
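The example matching rule above (same address and last name, similar first names) can be sketched in Python as follows. This is an illustrative assumption rather than the paper's formalism: the similarity operator is stood in for by `difflib.SequenceMatcher` from the standard library, the 0.7 threshold is arbitrary, and the function names and sample tuples are hypothetical.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.7):
    """Stand-in similarity operator; the threshold is an assumed value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def refer_to_same_person(t1, t2):
    """Equal address and last name, plus similar first names,
    imply that the two tuples are identified with each other."""
    return (
        t1["address"] == t2["address"]
        and t1["last_name"] == t2["last_name"]
        and similar(t1["first_name"], t2["first_name"])
    )

# Hypothetical customer tuples, invented for this sketch.
a = {"first_name": "Robert", "last_name": "Smith", "address": "10 Oak Rd"}
b = {"first_name": "Roberto", "last_name": "Smith", "address": "10 Oak Rd"}
c = {"first_name": "Alice", "last_name": "Smith", "address": "10 Oak Rd"}

match_ab = refer_to_same_person(a, b)
match_ac = refer_to_same_person(a, c)
```

Here `a` and `b` are matched because their first names are close, while `a` and `c` are not. Expressing the rule as a declarative predicate, rather than burying it in a procedure, is what makes it possible in principle to reason about it together with other dependencies.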

doi:10.1145/1376916.1376940 dblp:conf/pods/Fan08 fatcat:uyfve5cp4vfwfph3i7mfrus3yq
