The Detection Algorithms for Similar Duplicate Data.

A novel method for XML duplicate detection, called XMLDup. ... Although there is a long line of work on identifying duplicates in relational data, only a few solutions focus on duplicate detection in more complex hierarchical structures, like XML data. ... To overcome this problem, we implement the proposed Priority algorithm to detect duplicate data in large XML data. ...

doi:10.9790/0661-1628101105 fatcat:wzub2vvw6vaz7d5xeslctainj4

Therefore, data cleaning is essential for on-site programming big data. Duplicate data detection is an important step in data cleaning, which can save storage resources and enhance data consistency. ... Due to the insufficiency of the traditional Sorted Neighborhood Method (SNM) and the difficulty of high-dimensional data detection, an optimized algorithm based on random forests with the dynamic and adaptive ... Experiments on duplicate data detection have demonstrated the effectiveness and advantages of the proposed optimized algorithm. ...

doi:10.32604/jihpp.2020.016299 fatcat:6orhh5esefd5beb6255uxihy5y
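
For orientation, the traditional Sorted Neighborhood Method mentioned in the snippet above can be sketched as follows. This is a minimal illustration of the classic sort-then-slide-a-window idea, not the optimized random-forest variant the paper proposes; the function name `sorted_neighborhood`, the `key` and `is_match` callbacks, and the sample records are all assumptions made for this example.

```python
def sorted_neighborhood(records, key, is_match, window=3):
    """Classic SNM sketch: sort by a blocking key, then compare each
    record only with the next (window - 1) records in sorted order."""
    ordered = sorted(records, key=key)  # sorted() is stable
    pairs = []
    for i, rec in enumerate(ordered):
        # Only neighbours inside the sliding window are compared.
        for other in ordered[i + 1 : i + window]:
            if is_match(rec, other):
                pairs.append((rec, other))
    return pairs


# Hypothetical sample data and matching rule (assumptions for illustration).
def normalize(r):
    return r.replace(" ", "").lower()


def same_after_normalizing(a, b):
    return normalize(a) == normalize(b)


records = ["John Smith", "johnsmith", "Mary Adams", "JOHN SMITH"]
narrow = sorted_neighborhood(records, normalize, same_after_normalizing, window=2)
wide = sorted_neighborhood(records, normalize, same_after_normalizing, window=3)
```

With these sample records the narrower window finds fewer matching pairs than the wider one, which illustrates one commonly cited weakness of fixed-window SNM: duplicates that sort more than `window - 1` positions apart are never compared.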

The new method offers a more accurate dissimilarity measure for each data cluster without manual intervention at the time of duplicate detection. ... the volume of data for text comparisons. ... Limitation: At present, the algorithm does not support detecting duplicates of two different images or two different video files. ...

doi:10.3844/jcssp.2013.1514.1518 fatcat:ehls5zls35hp3kikxgknbvigya

The similarity metrics commonly used to detect similar field entries are covered, along with some algorithms used for duplicate detection to find approximately duplicate records in a database. ... Linking data to detect duplicates improves the quality and integrity of data, which allows existing data sources to be reused for future research work [1]. ... Duplicate Detection Algorithm: There are several duplicate detection algorithms, but this study discusses a few of them that are effective and commonly used. Jaccard Similarity Algorithm. ...

doi:10.21817/ijcse/2018/v10i2/181002013 fatcat:pbaohn4ywrhp7ovitwcarbsany
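
The Jaccard similarity named in the snippet above is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to word tokens of two field values; the tokenization (lowercasing and splitting on whitespace) and the handling of two empty fields are assumptions of this example, not details taken from the paper.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over word-token sets."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # assumption: two empty fields are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# "john" and "smith" are shared; "a" appears on one side only: 2 / 3.
score = jaccard_similarity("John A Smith", "john smith")
```

In a duplicate detection setting, a pair of records would typically be flagged when the score exceeds a chosen threshold; the threshold itself is application specific.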

Against this backdrop, we developed an automated configurable data cleaning environment based on training and physical-semantic data similarity, aiming to provide a more efficient and extensible tool for ... Approaches were also demonstrated to show that besides detecting and treating information inconsistencies and duplication of positive cases, they also addressed cases of detected false-positives and the ... The first module provides several algorithms for detecting duplicate and inconsistent data. ...

doi:10.5935/jetia.v6i25.685 fatcat:4hxsg3z2ijduvge6u5qqumu35e

Several different methods of data analysis are studied here with various approaches for duplicate detection. ... The efficiency can be doubled over the conventional duplicate detection method using this algorithm. ... Techniques for duplicate record detection are essential to improving the quality of extracted data. U. ...

doi:10.15623/ijret.2016.0503082 fatcat:tbmxipuqfzcvxo4d4y7r6p7vrm

Duplicate detection is a major task in data mining, used to find duplicates in the original data as well as in data objects. ... In this method, a number of XML data items are taken as input, and the conditional probability value is predicted for each data item in the hierarchical structure. ... The subsequent measurement discerns among three methods used to execute duplicate detection: machine learning and similarity measures are performed to learn duplicate data objects, clustering algorithms ...

doi:10.14445/22312803/ijctt-v7p105 fatcat:ixqtxbybyfbzln2tilqp4zsivq

Much work has already been presented on finding duplicates in relational data. But nowadays there is more focus on finding duplicates in XML data. ... This is because XML is very popular for data storage and extensively used for data exchange between organizations. ... We would also like to thank our department for giving us the resources and the freedom to pursue this project. ...

doi:10.17148/ijarcce.2015.44142 fatcat:hre3v4cponbulihwnzyhz6ytta

The detection of similar duplicate records was a key link in database data cleaning. ... The proposed method also solved the problem of database similar duplicate record detection effectively. ... For example, Song et al. proposed a big data similar duplicate record detection algorithm based on the MapReduce model. ...

doi:10.23940/ijpe.19.02.p35.710718 fatcat:6iuota7s6bgt5mioemm3md7l2m

We scale up duplicate detection in graph data (DDG) to large amounts of data using the support of a relational database system. ... Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. ... We observe that for duplicate detection in graph data, no methods for scalable iterative duplicate detection have been proposed, a shortcoming we address in this paper. ...

doi:10.1145/1458082.1458259 dblp:conf/cikm/HerschelN08 fatcat:7jlts2o6jzc6vgelesgsen4tom

Also, due to differences between various data models, the algorithms designed for single relations cannot be applied to XML data. ... Here a Bayesian network is used with a modified pruning algorithm for duplicate detection, and experiments are performed on both artificial and real-world datasets. ... Algorithm for Proposed Pruning Method Algorithm: XMLMulDup(N) Input: The node or subtree N for which the algorithm will detect duplicates. ...

doi:10.5120/21751-5018 fatcat:r6a7x6xzofcqnc6enuoma7npsa

framework of different similarity detection algorithms for researchers. ... Among the main problems of requirements engineering, the detection and management of duplicated requirements is highlighted. ... Acknowledgments The work presented in this paper has been supported by the GENESIS project under the National Spanish Program for Research Aimed at the Challenges of Society (RETOS) 2016, contract TIN2016 ...

dblp:conf/refsq/MotgerPM20 fatcat:vzozvhjd6ncsjb7fvyjlxcfk5i

The accuracy obtained for the proposed Duplicate Record Detection is found to be 89%. ... In the previous work, duplicate record detection was done using the Q-gram concept and the fuzzy classifier. ... The data values from the distance calculation can be used in feature selection with the PSO algorithm, and the fitness function to compute should be the precise and accurate value for detecting the duplicate ...

doi:10.5120/14166-9829 fatcat:n3omlmrhl5bi5kzz73ydocuo7m
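
To make the Q-gram concept in the snippet above concrete, the sketch below splits a string into overlapping substrings of length q and compares two strings by the Dice coefficient of their q-gram sets. It does not reproduce the PSO feature selection or the fuzzy classifier from the paper; the padding character `#`, the default `q=2`, and the choice of the Dice coefficient are assumptions of this example.

```python
def qgrams(text: str, q: int = 2) -> list[str]:
    """Overlapping substrings of length q, padded so that the first and
    last characters also appear in q grams (padding is an assumption)."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return [padded[i : i + q] for i in range(len(padded) - q + 1)]


def qgram_dice(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient 2|A & B| / (|A| + |B|) over q-gram sets."""
    grams_a, grams_b = set(qgrams(a, q)), set(qgrams(b, q))
    if not grams_a and not grams_b:
        return 1.0  # assumption: nothing to compare counts as identical
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# "smith" and "smyth" share four of six bigrams each: 2 * 4 / 12.
score = qgram_dice("smith", "smyth")
```

Because a single typo only disturbs the q-grams that overlap it, q-gram measures tend to tolerate small spelling differences between otherwise identical field values.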

The experimental results show that our algorithm outperforms in terms of similarity measures. Identifying near-duplicate and duplicate documents has reduced memory use in repositories. ... Web mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near duplicates is very difficult in a large collection of data like the "internet". ... Future work will be to research more robust and accurate methods for near-duplicate detection and for elimination based on that detection. ...

doi:10.5121/ijdkp.2014.4604 fatcat:66jmy6xqhrbqdistbyawfrfo6u

Duplicate records are a widespread problem in many databases. There are extensive efforts focused on eliminating duplicates in data sets, because it is an important part of data cleaning. ... This paper focuses on discovering and removing duplicates using a fuzzy logic technique. ... Atiqur Rahaman presented A Domain-Independent Data Cleaning Algorithm for Detecting Similar-Duplicates. ...

doi:10.21928/juhd.v1n4y2015.pp423-426 fatcat:vr5d7mauxbhynilyk2xzkqbnhy

An Effective Solution to Adequate and Operative Duplicate Detection in Stratified Data

Random Forests Algorithm Based Duplicate Detection in On-Site Programming Big Data Environment

CLUSTER BASED DUPLICATE DETECTION

Database Record Duplicate Detection System using Simil Algorithm

A system proposal for automated data cleaning environment

A STUDY AND SURVEY ON VARIOUS PROGRESSIVE DUPLICATE DETECTION MECHANISMS

Deriving the Probability with Machine Learning and Efficient Duplicate Detection in Hierarchical Objects

EDDDS: An Efficient Duplicate Data Detection System

Database Repeat Record Detection based on Improved Quantum Particle Swarm Optimization Algorithm

Scaling up duplicate detection in graph data

Efficient Duplicate Detection and Elimination in Hierarchical Multimedia Data

RESim - Automated Detection of Duplicated Requirements in Software Engineering Projects

PSO Algorithm to Select Subsets of Q-Gram Features for Record Duplicate Detection

A Near-Duplicate Detection Algorithm to Facilitate Document Clustering

Using Fuzzy Logic Technique to Eliminate the Duplicates in Large Database
