Detecting Sensitive Information from Textual Documents: An Information-Theoretic Approach
[chapter]
2012
Lecture Notes in Computer Science
In this paper, we present a general-purpose method to automatically detect sensitive information from textual documents in a domain-independent way. ...
In this paper, we tackle the problem of automatic detection of sensitive text for sanitization purposes. ...
Disclaimer and acknowledgments Authors are solely responsible for the views expressed in this paper, which do not necessarily reflect the position of UNESCO nor commit that organization. This ...
doi:10.1007/978-3-642-34620-0_17
fatcat:lcxymnrqmjcubhgcnxh6ozayaa
Detecting Term Relationships to Improve Textual Document Sanitization
2013
Pacific Asia Conference on Information Systems
In this paper, we present a general-purpose method to automatically detect semantically related terms that may enable disclosure of sensitive data. ...
Several automatic sanitization mechanisms can be found in the literature; however, most of them evaluate the sensitivity of the textual terms considering them as independent variables. ...
Our proposal relies on the foundations of information theory and a corpus as global as the Web to offer a general-purpose solution that can be automatically applied to heterogeneous textual documents ...
dblp:conf/pacis/0001BV13
fatcat:afuqrvar3zelljp7fmzjnzof24
Automatic Declassification of Textual Documents by Generalizing Sensitive Terms
2014
International Journal of Computer Applications
This paper presents a generalized sanitization method that discovers the sensitive information based on the concept of information content. ...
So before document publishing, sanitization operations are performed on the document to preserve privacy and, in order to retain the utility of the document. ...
CONCLUSION: Publishing of textual documents is essential for various purposes, such as research, decision making, and regulatory compliance. ...
doi:10.5120/17626-8390
fatcat:g3rkbm7g3rclfaambq75h4t6ti
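Several of the entries above rely on the notion of information content to decide which terms are sensitive. As a rough illustration (not any paper's exact procedure), the criterion can be sketched in Python; the corpus size and hit counts below are hypothetical:

```python
import math

def information_content(term_count: int, corpus_size: int) -> float:
    """IC(t) = -log2 p(t), where p(t) is the term's relative frequency
    in a (web-scale) reference corpus."""
    if term_count <= 0 or corpus_size <= 0:
        raise ValueError("counts must be positive")
    return -math.log2(term_count / corpus_size)

# Hypothetical web hit counts: the rarer a term, the more information
# it carries, and the more likely it is to be flagged as sensitive.
corpus_size = 1_000_000_000
ic_common = information_content(50_000_000, corpus_size)  # frequent term, low IC
ic_rare = information_content(120, corpus_size)           # rare term, high IC
print(ic_rare > ic_common)
```

A sanitizer built on this idea would redact or generalize terms whose IC exceeds some threshold, on the assumption that highly informative (rare) terms are the ones that identify individuals or disclose confidential facts.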
C-sanitized: A privacy model for document redaction and sanitization
2015
Journal of the Association for Information Science and Technology
To do so, human experts are usually requested to redact or sanitize document contents. ...
The sensitive nature of much of this information causes a serious privacy threat when documents are uncontrollably made available to untrusted third parties. ...
Its goal is to mimic and, hence, automatize the reasoning of human sanitizers with regard to semantic inferences, disclosure analysis and protection of textual documents. ...
doi:10.1002/asi.23363
fatcat:5rq7pnxknjarnavnfxcajfyy6y
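The C-sanitized model assesses, for each entity to protect, whether any remaining term semantically discloses it. A simplified reading of that disclosure test (the probabilities and the co-occurrence example are hypothetical, and the real model adds further machinery) can be sketched as:

```python
import math

def information_content(p: float) -> float:
    """IC(x) = -log2 p(x)."""
    return -math.log2(p)

def discloses(p_c: float, p_t: float, p_ct: float, alpha: float = 1.0) -> bool:
    """Term t is deemed to disclose entity c when the pointwise mutual
    information PMI(c, t) = log2(p(c,t) / (p(c) * p(t))) reaches alpha
    times the information content of c. alpha = 1 demands unequivocal
    disclosure; smaller alpha makes the check stricter."""
    pmi = math.log2(p_ct / (p_c * p_t))
    return pmi >= alpha * information_content(p_c)

# Hypothetical probabilities estimated from web co-occurrence counts
# for a protected entity c and a co-occurring term t.
print(discloses(p_c=0.001, p_t=0.01, p_ct=0.001, alpha=0.5))  # prints True
```

Terms for which the check fires would be removed or generalized, mimicking how a human sanitizer reasons about what a reader could infer.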
Toward sensitive document release with privacy guarantees
2017
Engineering applications of artificial intelligence
In fact, textual documents are usually protected manually, in a process known as document redaction or sanitization. ...
automatic document redaction/sanitization algorithms and offers clear a priori privacy guarantees on data protection; despite its potential benefits, C-sanitization still presents some limitations ...
Its goal is to mimic and, hence, automatize the analysis of semantic inferences that human experts perform for document sanitization. ...
doi:10.1016/j.engappai.2016.12.013
fatcat:bivycjh7fvfmhkt4djlc6gy2ki
A Review on Text Sanitization
2014
International Journal of Computer Applications
Several semi-automatic and automatic methods are used for identifying sensitive information and thereby sanitizing the document by removing such terms. ...
So before publishing or sharing documents, the sensitive information should be removed or masked. This is the major goal of Text sanitization. ...
general-purpose knowledge bases/corpora such as the Web. ...
doi:10.5120/16749-6916
fatcat:jr22xjtgqbc5didgv4mtsh547q
A Privacy Preserving Data Publishing Middleware for Unstructured, Textual Social Media Data
2020
International Conference on Language Resources and Evaluation
Privacy is going to be an integral part of data science and analytics in the coming years. ...
Privacy preservation becomes more challenging, especially in the context of unstructured data. ...
The purpose of this research is to develop a framework to sanitize data and preserve privacy, which can be applied before publishing textual social media data to any analytical third party. ...
dblp:conf/lrec/AbeywardanaT20
fatcat:j6vxa2j34ncqbdf26cmeavgtxy
Text Sanitization Beyond Specific Domains: Zero-Shot Redaction Substitution with Large Language Models
[article]
2023
arXiv
pre-print
In the context of information systems, text sanitization techniques are used to identify and remove sensitive data to comply with security and regulatory requirements. ...
generality and requiring customization for each desired domain. ...
A general-purpose sanitization method exploiting knowledge bases to compute term frequency for sensitive term substitution is proposed in [27] . ...
arXiv:2311.10785v1
fatcat:nzcgiepdcbdfzj7z4miktmi26e
An Information Retrieval Approach to Document Sanitization
[chapter]
2014
Studies in Computational Intelligence
In this paper we use information retrieval metrics to evaluate the effect of a document sanitization process, measuring information loss and risk of disclosure. ...
In order to sanitize the documents we have developed a semiautomatic anonymization process following the guidelines of Executive Order 13526 (2009) of the US Administration. ...
The work contributed by the second author was carried out as part of the Computer Science Ph.D. program of the Universitat Autònoma de Barcelona (UAB). ...
doi:10.1007/978-3-319-09885-2_9
fatcat:26c2tedf65g75ad26bwrdqwv4q
Privacy-preserving data outsourcing in the cloud via semantic data splitting
2017
Computer Communications
cloud storage locations we need; to show its potential and generality, we have applied it to the least structured and most challenging data type: plain textual documents. ...
We propose a semantically-grounded data splitting mechanism that is able to automatically detect pieces of data that may cause privacy risks and split them on local premises, so that each chunk does not ...
As far as we know, the only privacy model that fits with this scenario and these requirements is C-sanitization [25, 26] , a general privacy model for (textual) document sanitization. ...
doi:10.1016/j.comcom.2017.06.012
fatcat:3fmwhsr6ijdb3m2zfmi6t5yrwi
Semantify CEUR-WS Proceedings: Towards the Automatic Generation of Highly Descriptive Scholarly Publishing Linked Datasets
[chapter]
2014
Communications in Computer and Information Science
To foster this trend, in the context of the ESWC2014 Semantic Publishing Challenge, we present a system that automatically generates rich RDF datasets from CEUR-WS workshop proceedings. ...
Our system is provided as an on-line Web service to support on-the-fly RDF generation. ...
PHASE 2: Semantic annotator: this component automatically adds semantic annotations to the textual contents of HTML documents without semantic markups. ...
doi:10.1007/978-3-319-12024-9_10
fatcat:cymj5lrgszable4csh7gylnzem
Detecting Inconsistencies Between Process Models and Textual Descriptions
[chapter]
2015
Lecture Notes in Computer Science
To reduce the time and effort needed to repair such situations, this paper presents the first approach to automatically identify inconsistencies between a process model and a corresponding textual description ...
When considering that hundreds of such descriptions may be in use in a particular organization by dozens of people, using a variety of editors, there is a clear risk that such models become misaligned. ...
Textual process descriptions generally describe process steps in a chronological order [33] . ...
doi:10.1007/978-3-319-23063-4_6
fatcat:k7mbvp4rzngtjgrdk7ftrncc3q
Document Sanitization: Measuring Search Engine Information Loss and Risk of Disclosure for the Wikileaks cables
[chapter]
2012
Lecture Notes in Computer Science
In order to sanitize the documents we have developed a semi-automatic anonymization process following the guidelines of Executive Order 13526 (2009) of the US Administration, by (i) identifying and anonymizing ...
In this paper we evaluate the effect of a document sanitization process on a set of information retrieval metrics, in order to measure information loss and risk of disclosure. ...
By manual inspection of the documents, we can conclude in general that a worse value is due to the loss of key textual information relevant to the query. ...
doi:10.1007/978-3-642-33627-0_24
fatcat:ecreun3syjgavcfbnrijo6taba
Development of custom notation for XML-based language: A model-driven approach
2017
Computer Science and Information Systems
We provide recommendations for application of the approach and demonstrate them on a case study of a language for definition of graphs. ...
In spite of its popularity, XML provides a poor user experience, and many domain-specific languages can be improved by introducing a custom, more human-friendly notation. ...
This work was supported by projects KEGA 047TUKE-4/2016 "Integrating software processes into the teaching of programming" and FEI-2015-23 "Pattern based domainspecific language development". ...
doi:10.2298/csis170116036c
fatcat:axuuywgggncw3me4wquik2ui74
De-Identification of French Unstructured Clinical Notes for Machine Learning Tasks
[article]
2023
arXiv
pre-print
Unstructured textual data are at the heart of health systems: liaison letters between doctors, operating reports, coding of procedures according to the ICD-10 standard, etc. ...
The result is an approach that effectively protects the privacy of the patients at the heart of these medical documents. ...
For example in [36] , the authors present a new state-of-the-art for French Named Entity Recognition in a general purpose context (not medical). ...
arXiv:2209.09631v2
fatcat:hummxvhotncnlliprkh776bli4