Extraction of Type Style based Meta-Information from Imaged Documents.

Information extraction (IE) from documents is an intensive area of research with a large set of industrial applications. ... Our experiments on three real-world complex datasets demonstrate that using token style attributes based embedding instead of a raw visual embedding in the LayoutLM model is beneficial. ... In the task of information extraction, one needs to get textual information from the document. ...

arXiv:2111.04045v2 fatcat:visu7gjpyravzkp4kwtq26t7uy

Then the meta-information (e.g. title, author) and the logical structure (e.g. section, theorem) of the documents are automatically extracted. ... The purpose of this paper is to show the extraction method of logical structure specialized for mathematical documents. ... Conclusion A method of extracting meta-information and logical structure from mathematical documents was presented and implemented on the base of the INFTY system. ...

doi:10.1007/978-3-540-27818-4_20 fatcat:bljntj4l5nb6bn3bhkv2pcvfqu

Several types of queries are permitted: (i) entire document image; (ii) a region of interest (ROI) of a document; (iii) a word image; and (iv) textual. ... The design and performance of a content-based information retrieval system for handwritten documents is described. ... Acknowledgments CEDAR-FOX is the result of the effort of many students and researchers at CEDAR. ...

doi:10.1007/978-3-540-28640-0_28 fatcat:dih37sul6nf5tm5irpkqehlnlm

semantics of image information. ... Therefore, with the rapid development of internet technology, the number of internet users and the amount of web image information on the internet is ever increasing. ... Feature vectors are extracted by image preprocessing and meta information, such as keywords, semantic information, and visual information, are manually or automatically added. ...

doi:10.48009/1_iis_2010_483-490 fatcat:j7vx6kcr35cytf7cj6m4wcmolq

We introduce meta-metadata, a language and software architecture addressing a metadata semantics lifecycle: (1) data structures for representation of metadata in programs; (2) metadata extraction from ... Collecting, organizing, and thinking about diverse information resources is the keystone of meaningful digital information experiences, from research to education to leisure. ... Future parsers will handle other types of documents, such as images, and various formats of audio and video. ...

doi:10.1145/1871437.1871580 dblp:conf/cikm/KerneQWDLM10 fatcat:tamcv6sjzvfurpa7d2axu46p4a

We developed a multi-dimensional, ontology-based framework, and a collection of problem-solving methods, which enable to characterise DWM applications at an abstract level. ... We show that the heterogeneity and unboundedness of the web demands for some modifications of the problem-solving method paradigm used in the context of traditional artificial intelligence. ... Acknowledgements The research is partially supported by the grant no.201/03/1318 of the Czech Science Foundation. ...

doi:10.1007/978-3-540-30202-5_23 fatcat:edk55kn3tjahjkzguj2udsnyau

By converting the data to such files, the readability of the data are guaranteed, and the meta-data of the documents, e.g. timestamp, patient ID, document type etc., are used as key information of search ... Our method can cover this problem as much of these meta-data are automatically extracted from the images, which would contribute to improve DACS. ... .), ISBN: 978-953-51-0647-0, InTech, Available from: http://www.intechopen.com/books/modern-informationsystems/document-image-processing-for-hospital-information-systems ...

doi:10.5772/37400 fatcat:adqfd6tmefeqlon7a7yccqx7rm

In this paper, we propose an efficient hierarchical approach for annotation of large collection of printed document images. We align document images with independently keyed-in text. ... We employ an XML representation for storage of the annotation information. ... Authors also thank members of the consortium working on development of Indian language OCRs for their inputs and cooperation. ...

doi:10.1109/icdar.2007.4377025 dblp:conf/icdar/KumarJ07 fatcat:pl74yrflzjd2jm7jxltwjzkleq

This type of image analysis is readily addressed by a map-reduce approach. Examples include document skew detection and multiple face detection and tracking. ... However, there are three less trivial categories of parallel processing that will be considered in this paper: parallel processing (1) by task; (2) by image region; and (3) by meta-algorithm. ... This type of document classification is roughly equivalent to template matching. I term this document classification by style. ...

doi:10.1117/12.879645 dblp:conf/ppia/Simske11 fatcat:oad7gfwshfhz5pwh3ghwpq5d3a

This paper describes a technique for embedding document metadata, and potentially other semantic references inline in word processing documents, which the authors have implemented with the help of a software ... Several assumptions underlie the approach: it must be available across computing platforms and work with both Microsoft Word (because of its user base) and OpenOffice.org (because of its free availability ... Thanks to Peter Murray-Rust and Joe Townsend at the Unilever Centre for Molecular Science Informatics for their input into this document and to Linda Octalina at the University of Southern Queensland ...

doi:10.2218/ijdc.v4i2.96 fatcat:nxnki5nl5bazvdn57pn6vauypm

Previous research efforts on optical font recognition have mostly limited applications since they deal with only a few types of font attributes and estimate them from a line or block of text. ... At the word-level, it has the advantages of obtaining more detailed font attributes including the following: script (Korean and English), font style (regular, bold, italic, and underlined), typeface (Myung-jo ... Acknowledgment This work was supported by grant number R05-2003-000-10396-0 from the Program for Regional Scientists of the Korea Science and Engineering Foundation (KOSEF). ...

doi:10.1142/s0218001404003307 fatcat:pnafajmfnnbszdwtqm3r54oude

Meticulous extraction is further needed when evaluating the similarity of documents and calculating their citation impact. ... This meta-data can be difficult to acquire accurately due to the thousands of different styles and noise that can be applied to a bibliography to create the citation string. ... Particularly the documents structure and layout which impede upon meta data extraction. This information usually needs to be reverse engineered for it to be recovered. ...

arXiv:1805.04798v2 fatcat:dhbatuekzvf77nehg6juapkucu

Structure Based Mark-up Languages One of the major problems that information managers face with digital information is the need to identify the type of documents being maintained on the myriad of disks ... , which functions differently from the caption of a graphic or the tabular data extracted from a database table, etc.). ...

Table of Content The document is re-created based on extracted content and a list of heading is created. ... extract the content from web page as shown below: HTML Parser HTML represents a certain range of hypertext information, it is a simple markup language used to create hypertext documents that are platform ... The following conclusions reached from the implementation of the proposed adaptation system. 7. ...

doi:10.5120/2978-3817 fatcat:dmq3jpwww5h2hojsljmzb5lsm4

For each dataset, we provide the page images, bounding box annotations, PDF files, and the rendering layers extracted from the PDF files. ... We establish a benchmark suite consisting of different types of PDF document datasets that can be utilized for cross-domain DOD model training and evaluation. ... We thank Richard Cohn and Kana Sethu for coding the tool and instructing how to use it for synthesizing documents. ...

arXiv:2003.13197v1 fatcat:t46n3ompnvbc7p2vb2x2j35ebu

Information Extraction from Visually Rich Documents with Font Style Embeddings [article]

Extraction of Logical Structure from Articles in Mathematics [chapter]

Information Retrieval System for Handwritten Documents [chapter]

A SMART WEB IMAGE RETRIEVAL SYSTEM BASED ON SEMANTIC MARKUP FOR INTELLIGENT E-BUSINESS

Meta-metadata

Knowledge Modelling for Deductive Web Mining [chapter]

Document Image Processing for Hospital Information Systems [chapter]

Content-level Annotation of Large Collection of Printed Document Images

Parallel processing considerations for image recognition tasks

Embedding Metadata and Other Semantics in Word Processing Documents

WORD-LEVEL OPTICAL FONT RECOGNITION USING TYPOGRAPHICAL FEATURES

Citation Data-set for Machine Learning Citation Styles and Entity Extraction from Citation Strings [article]

Page 12 of The Information Management Journal Vol. 33, Issue 2 [page]

Web Content Adaptation System

Cross-Domain Document Object Detection: Benchmark Suite and Method [article]
