A frequency-filtering strategy of obtaining PHI-free sentences from clinical data repository.

We report on creation of the deceased subject Integrated Data Repository (dsIDR) at National Institutes of Health, Clinical Center and a pilot methodology to remove secondary protected health information ... We characterize available structured coded data in dsIDR and report the estimated frequencies of secondary PxI, ranging from 12.9% (sensitive token presence) to 1.1% (using stricter criteria). ... Acknowledgments: This work has been supported by intramural research funds from the NIH Clinical Center and the National Library of Medicine. ...

pmid:25954378 pmcid:PMC4420001 fatcat:xesdrfhaxnb27krmwblhvyg6k4

Open Access

To build a new de-identification tool, we created the largest manually annotated clinical note corpus for PHI and developed a customizable open-source de-identification software called Philter ("Protected ... of data available in structured medical record data. ... We developed a privacy-centric approach to removing PHI from free-text clinical notes using both rule-based and statistical natural language processing (NLP) approaches. ...

doi:10.1038/s41746-020-0258-y pmid:32337372 pmcid:PMC7156708 fatcat:loudfjimwra4zkiddjima3s2xy

DOAJ

Citation

Beau Norgeot, Kathleen Muenzen, Thomas A. Peterson, Xuancheng Fan, Benjamin S. Glicksberg, Gundolf Schenk, Eugenia Rutenberg, Boris Oskotsky, Marina Sirota, Jinoos Yazdany, Gabriela Schmajuk, Dana Ludwig, Theodore Goldstein, Atul J. Butte. "Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes." npj Digital Medicine 3.1 (2020) 57

This raises a lot of concerns about the ways the data is acquired and the potential information leaks. ... A considerable portion of user-contributed data is in natural language, and in the past few years, many researchers have proposed NLP-based methods to address these data privacy challenges. ... [29] investigated the use of a frequency-filtering approach where they filter out rare sentences (frequency < 3) and sentences containing bigrams under a certain frequency threshold (frequency < 256 ...
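The frequency-filtering idea mentioned in this snippet can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function names, the whitespace tokenization, and the lowercasing are assumptions, and because the snippet is truncated after the bigram threshold, the defaults of 3 and 256 are simply the two numbers it quotes.

```python
from collections import Counter


def bigrams(sentence):
    """Return the adjacent token pairs of a sentence (assumed whitespace tokenization)."""
    tokens = sentence.lower().split()
    return list(zip(tokens, tokens[1:]))


def frequency_filter(sentences, min_sentence_freq=3, min_bigram_freq=256):
    """Keep distinct sentences that are frequent and contain no rare bigram.

    Both thresholds are hypothetical parameters; the defaults echo the
    values quoted in the snippet (sentence frequency < 3, bigram frequency < 256).
    """
    sentence_counts = Counter(sentences)
    bigram_counts = Counter(bg for s in sentences for bg in bigrams(s))

    kept = []
    for sentence, count in sentence_counts.items():
        # Drop sentences that occur too rarely in the corpus.
        if count < min_sentence_freq:
            continue
        # Drop sentences containing any bigram below the corpus-wide threshold.
        if any(bigram_counts[bg] < min_bigram_freq for bg in bigrams(sentence)):
            continue
        kept.append(sentence)
    return kept


# Toy corpus with a lowered bigram threshold so the effect is visible.
corpus = (
    ["no acute distress"] * 5
    + ["follow up with john"] * 3
    + ["seen by dr smith"]
)
print(frequency_filter(corpus, min_sentence_freq=3, min_bigram_freq=4))
```

The intuition is that sentences repeated many times across a repository tend to be generic clinical phrasing, while rare sentences and rare word pairs are more likely to carry names or other identifying details.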

doi:10.1109/access.2021.3124163 fatcat:pb5kf2yv7jbyfhtmpxwvrmtfdm

DOAJ

Acknowledgments The work presented in this paper comes from a 3 year project (ALADIN) started in 2009 and funded by the French Agence Nationale de la Recherche (National Research Agency -ANR) in the context ... Acknowledgments We would like to thank all members of our research group, IT for Health, for their support and input. ... Materials and Method The ever-increasing amount of biomedical (molecular biology, genetics, proteomics) and clinical data repositories increase in a dramatic manner. ...

doi:10.1162/coli_r_00281 fatcat:6abwqppd3bgyvkzfc2mq6kzo6u

DOAJ

Citation

Jin-Dong Kim. "Biomedical Natural Language Processing. Kevin Bretonnel Cohen and Dina Demner-Fushman (University of Colorado School of Medicine, and National Library of Medicine). John Benjamins Publishing (Book series on Natural Language Processing, edited by Ruslan Mitkov, volume 11), 2014, 160 pp; hardbound, ISBN 978-90-272-4997-5." Computational Linguistics 43.1 (2017) 265-267

A cornerstone for these programs is the establishment of enterprise-wide Clinical Data Warehouses. ... information originating from data sources, including Electronic Medical Records, Clinical Trial Management Systems, Tumor Registries, Biospecimen Repositories, Radiology and Pathology archives, and Next ... RSD and LR managed administrative and clinical adoption. SG, KH, LAG, ES, and GR contributed in data elements selection and clinical evaluation. ...

doi:10.1177/1176935117694349 pmid:28469389 pmcid:PMC5392017 fatcat:azzuk5zt2zh7tjynwituv4khdy

DOAJ

Alongside, we developed a randomization algorithm to substitute the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data ... Background Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. ... Acknowledgements We would like to thank the Medical Image Bank of the Valencian Community, from which the data used in this publication come from. ...

doi:10.1186/s13326-021-00236-2 pmid:33781334 pmcid:PMC8006627 fatcat:qi2ra7z7frhhbdvcprevcklbta

DOAJ

De-identification, i.e., the process of removing PHI, is a critical step in making EHR data accessible. ... Objectives: Our study aims to provide systematic evidence on how the de-identification of clinical free text has evolved in the last thirteen years, and to report on the performances and limitations of ... ., “A Frequency-based Strategy of Obtaining Sentences from Clinical Data Repository for Crowdsourcing,” in Medinfo 2015: Ehealth-Enabled Health, I. N. Sarkar, A. Georgiou, and P. M. D. ...

arXiv:2312.03736v1 fatcat:gd5oci3z7nbd3bpmvmxt6unbry

Open Access

We evaluated the system with a publicly available dataset of 515 notes from the I2B2 2014 de-identification challenge and a dataset of 10,000 notes from the Mayo Clinic. ... The results indicated a recall of 0.992 and 0.994 and a precision of 0.979 and 0.967 on the I2B2 and the Mayo Clinic data, respectively. ... that was used for testing the performance of the system, the Mayo Data Team of Ahmed Hadad, Connie Nehls and Salena Tong for preparing and helping us understand the Mayo EHR data and Andy Danielsen for ...
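To compare the two precision and recall pairs quoted above on a single scale, one option is the harmonic mean (F1). The sketch below only does that arithmetic on the four numbers in the snippet; the snippet itself does not state an F1 score, so the derived values are an illustration rather than a reported result, and the function and variable names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as quoted in the snippet, in the order it gives them.
reported = {
    "I2B2 2014": (0.979, 0.992),
    "Mayo Clinic": (0.967, 0.994),
}

for dataset, (precision, recall) in reported.items():
    print(f"{dataset}: F1 = {f1_score(precision, recall):.3f}")
```

In de-identification work recall is usually weighted more heavily than precision, since a missed identifier is a privacy leak while a false positive only removes some usable text, so a balanced F1 should be read with that asymmetry in mind.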

doi:10.1101/2020.12.22.20248270 fatcat:z2akx4pz5fcopa5towq7h4w6py

We conducted a literature review of clinical NLP research from 2008 to 2014, emphasizing recent publications (2012-2014), based on PubMed and ACL proceedings as well as relevant referenced publications ... We present a review of recent advances in clinical Natural Language Processing (NLP), with a focus on semantic analysis and key subtasks that support such analysis. ... runtime and complexity to support knowledge discovery efforts from a large-scale clinical repository. ...

doi:10.15265/iy-2015-009 pmid:26293867 pmcid:PMC4587060 fatcat:4cqwat2q2jhkphvrfytfowmyge

The Pile is constructed from 22 diverse high-quality subsets – both existing and newly constructed – many of which derive from academic or professional sources. ... Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction. ... Terms of Service (ToS) compliant data is data which is obtained and used in a fashion that is known to be consistent with the terms of service of the data host. ...

arXiv:2101.00027v1 fatcat:74dgmcl55rdupks3kzygosjlca

Open Access

Background Transfer learning is a common practice in image classification with deep learning where the available data is often limited for training a complex model with millions of parameters. ... Conclusion We conclude that transfer learning along with fine-tuning the discriminative model is often more effective for performing shared targeted tasks than the training for a language space from scratch ... To obtain a representative sample of benign cases from the MR studies (which represented 1% of the LI-RADS coded reports), two radiologists manually annotated 537 benign cases from the EUH MRI dataset. ...

doi:10.1186/s13326-022-00262-8 pmid:35197110 pmcid:PMC8867666 fatcat:zsh5chrib5am5i6xnrrxoludye

DOAJ

In the first analysis, kinematic and coordination data from error-free fluent speech samples were compared to the same type of data from a group of six age-matched control speakers (males & females). ... In this study, movement data from lips, jaw and tongue were acquired using the AG-100 EMMA system from a relatively young individual with apraxia of speech (AOS) and Broca's aphasia. ... Acknowledgements This study was supported by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC), awarded to the first author. The authors wish to thank Dr. ...

doi:10.1080/02699200600812331 pmid:17364624 fatcat:5a3wl55fjraupnxhaomgqlcsvi

Information from 20 dimensions is incorporated into the dataset statistics. The total data size surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for other datasets. ... Additionally, a comprehensive review of the existing available dataset resources is also provided, including statistics from 444 datasets, covering 8 language categories and spanning 32 domains. ... Smashwords is a large repository of free ebooks, containing over 500K electronic books. ...

arXiv:2402.18041v1 fatcat:lizkw5xllrfthn6efbggy7jhh4

It is traditionally a participatory form of music with no distinction between performers and audience, a characteristic that makes for acoustical requirements that differ considerably from those of a concert ... Sacred Harp singing, a common type of shape-note singing, is a centuries-old tradition of American community choral music. ... from a clinical scanner. ...

doi:10.1121/1.4830794 fatcat:2sexsmbypjcozjctnorzoiy7te

After several decades of the study of language and the brain from a linguistic angle, there is now a relatively dense body of facts that can be seriously evaluated. ... An outlook on language derived from current linguistic theory can lead to a new and more precise picture of language and the brain. ... I would like to thank Michal Ben-Shachar for her invaluable comments and help and Danny Fox for saving me from several pitfalls. ...

doi:10.1017/s0140525x00002399 fatcat:gqirknf7f5cvpc5vjyz3yv6gmq

Piloting a deceased subject integrated data repository and protecting privacy of relatives

Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes

Review: Privacy-preservation in the context of Natural Language Processing

Roadmap to a Comprehensive Clinical Data Warehouse for Precision Medicine Applications in Oncology

De-identifying Spanish medical texts - named entity recognition applied to radiology reports

De-identification of clinical free text using natural language processing: A systematic review of current approaches [article]

Building a Best-in-Class De-identification Tool for Electronic Medical Records Through Ensemble Learning [article]

Recent Advances in Clinical Natural Language Processing in Support of Semantic Analysis

The Pile: An 800GB Dataset of Diverse Text for Language Modeling [article]

Transfer language space with similar domain adaptation: a case study with hepatocellular carcinoma

Speech motor control in fluent and dysfluent speech production of an individual with apraxia of speech and Broca's aphasia

Datasets for Large Language Models: A Comprehensive Survey [article]

Grafting acoustic instruments and signal processing: Creative control and augmented expressivity

The neurology of syntax: Language use without Broca's area
