ABSTRACT
In recent years, large amounts of data have been collected from multiple sources, and the demand for analyzing these data has increased enormously. Data sharing is a valuable part of this data-intensive, collaborative environment because of the synergies and added value created by multi-modal datasets generated from different sources. In this work, we introduce a technique for quantifying the degree of information gain (IG) that may be obtained through data sharing. Our method captures both where-provenance (to compute the IG over values) and how-provenance (to find matching records) and accurately computes the IG based on them. We conduct a preliminary evaluation to show the runtime of our approach over a real-world dataset.
Index Terms
- Measuring information gain using provenance