Information Extraction from Visually Rich Documents with Font Style Embeddings
[article]
2022
arXiv
pre-print
Information extraction (IE) from documents is an intensive area of research with a large set of industrial applications. ...
Our experiments on three real-world complex datasets demonstrate that using token style attributes based embedding instead of a raw visual embedding in LayoutLM model is beneficial. ...
In the task of information extraction, one needs to get textual information from the document. ...
arXiv:2111.04045v2
fatcat:visu7gjpyravzkp4kwtq26t7uy
Extraction of Logical Structure from Articles in Mathematics
[chapter]
2004
Lecture Notes in Computer Science
Then the meta-information (e.g. title, author) and the logical structure (e.g. section, theorem) of the documents are automatically extracted. ...
The purpose of this paper is to show the extraction method of logical structure specialized for mathematical documents. ...
Conclusion A method of extracting meta-information and logical structure from mathematical documents was presented and implemented on the base of the INFTY system. ...
doi:10.1007/978-3-540-27818-4_20
fatcat:bljntj4l5nb6bn3bhkv2pcvfqu
Information Retrieval System for Handwritten Documents
[chapter]
2004
Lecture Notes in Computer Science
Several types of queries are permitted: (i) entire document image; (ii) a region of interest (ROI) of a document; (iii) a word image; and (iv) textual. ...
The design and performance of a content-based information retrieval system for handwritten documents is described. ...
Acknowledgments CEDAR-FOX is the result of the effort of many students and researchers at CEDAR. ...
doi:10.1007/978-3-540-28640-0_28
fatcat:dih37sul6nf5tm5irpkqehlnlm
A SMART WEB IMAGE RETRIEVAL SYSTEM BASED ON SEMANTIC MARKUP FOR INTELLIGENT E-BUSINESS
2010
Issues in Information Systems
semantics of image information. ...
Therefore, with the rapid development of internet technology, the number of internet users and the amount of web image information on the internet is ever increasing. ...
Feature vectors are extracted by image preprocessing and meta information, such as keywords, semantic information, and visual information, are manually or automatically added. ...
doi:10.48009/1_iis_2010_483-490
fatcat:j7vx6kcr35cytf7cj6m4wcmolq
Meta-metadata
2010
Proceedings of the 19th ACM international conference on Information and knowledge management - CIKM '10
We introduce meta-metadata, a language and software architecture addressing a metadata semantics lifecycle: (1) data structures for representation of metadata in programs; (2) metadata extraction from ...
Collecting, organizing, and thinking about diverse information resources is the keystone of meaningful digital information experiences, from research to education to leisure. ...
Future parsers will handle other types of documents, such as images, and various formats of audio and video. ...
doi:10.1145/1871437.1871580
dblp:conf/cikm/KerneQWDLM10
fatcat:tamcv6sjzvfurpa7d2axu46p4a
Knowledge Modelling for Deductive Web Mining
[chapter]
2004
Lecture Notes in Computer Science
We developed a multi-dimensional, ontology-based framework, and a collection of problem-solving methods, which make it possible to characterise DWM applications at an abstract level. ...
We show that the heterogeneity and unboundedness of the web demand some modifications of the problem-solving method paradigm used in the context of traditional artificial intelligence. ...
Acknowledgements The research is partially supported by the grant no.201/03/1318 of the Czech Science Foundation. ...
doi:10.1007/978-3-540-30202-5_23
fatcat:edk55kn3tjahjkzguj2udsnyau
Document Image Processing for Hospital Information Systems
[chapter]
2012
Modern Information Systems
By converting the data to such files, the readability of the data is guaranteed, and the meta-data of the documents, e.g. timestamp, patient ID, document type etc., are used as key information of search ...
Our method can address this problem, as many of these meta-data are automatically extracted from the images, which would contribute to improving DACS. ...
.), ISBN: 978-953-51-0647-0, InTech, Available from: http://www.intechopen.com/books/modern-informationsystems/document-image-processing-for-hospital-information-systems ...
doi:10.5772/37400
fatcat:adqfd6tmefeqlon7a7yccqx7rm
Content-level Annotation of Large Collection of Printed Document Images
2007
Proceedings of the International Conference on Document Analysis and Recognition
In this paper, we propose an efficient hierarchical approach for annotation of large collection of printed document images. We align document images with independently keyed-in text. ...
We employ an XML representation for storage of the annotation information. ...
Authors also thank members of the consortium working on development of Indian language OCRs for their inputs and cooperation. ...
doi:10.1109/icdar.2007.4377025
dblp:conf/icdar/KumarJ07
fatcat:pl74yrflzjd2jm7jxltwjzkleq
Parallel processing considerations for image recognition tasks
2011
Parallel Processing for Imaging Applications
This type of image analysis is readily addressed by a map-reduce approach. Examples include document skew detection and multiple face detection and tracking. ...
However, there are three less trivial categories of parallel processing that will be considered in this paper: parallel processing (1) by task; (2) by image region; and (3) by meta-algorithm. ...
This type of document classification is roughly equivalent to template matching. I term this document classification by style. ...
doi:10.1117/12.879645
dblp:conf/ppia/Simske11
fatcat:oad7gfwshfhz5pwh3ghwpq5d3a
Embedding Metadata and Other Semantics in Word Processing Documents
2009
International Journal of Digital Curation
This paper describes a technique for embedding document metadata, and potentially other semantic references inline in word processing documents, which the authors have implemented with the help of a software ...
Several assumptions underlie the approach: it must be available across computing platforms and work with both Microsoft Word (because of its user base) and OpenOffice.org (because of its free availability ...
Thanks to Peter Murray-Rust and Joe Townsend at the Unilever Centre for Molecular Science Informatics for their input into this document and to Linda Octalina at the University of Southern Queensland ...
doi:10.2218/ijdc.v4i2.96
fatcat:nxnki5nl5bazvdn57pn6vauypm
WORD-LEVEL OPTICAL FONT RECOGNITION USING TYPOGRAPHICAL FEATURES
2004
International journal of pattern recognition and artificial intelligence
Previous research efforts on optical font recognition have mostly limited applications since they deal with only a few types of font attributes and estimate them from a line or block of text. ...
At the word-level, it has the advantages of obtaining more detailed font attributes including the following: script (Korean and English), font style (regular, bold, italic, and underlined), typeface (Myung-jo ...
Acknowledgment This work was supported by grant number R05-2003-000-10396-0 from the Program for Regional Scientists of the Korea Science and Engineering Foundation (KOSEF). ...
doi:10.1142/s0218001404003307
fatcat:pnafajmfnnbszdwtqm3r54oude
Citation Data-set for Machine Learning Citation Styles and Entity Extraction from Citation Strings
[article]
2018
arXiv
pre-print
Meticulous extraction is further needed when evaluating the similarity of documents and calculating their citation impact. ...
This meta-data can be difficult to acquire accurately due to the thousands of different styles and noise that can be applied to a bibliography to create the citation string. ...
Particularly the document's structure and layout, which impede meta-data extraction. This information usually needs to be reverse engineered for it to be recovered. ...
arXiv:1805.04798v2
fatcat:dhbatuekzvf77nehg6juapkucu
Page 12 of The Information Management Journal Vol. 33, Issue 2
[page]
1999
The Information Management Journal
Structure Based Mark-up Languages
One of the major problems that information managers face with digital information is the need to identify the type of documents being maintained on the myriad of disks ...
, which functions differently from the caption of a graphic or the tabular data extracted from a database table, etc.). ...
Web Content Adaptation System
2011
International Journal of Computer Applications
Table of Content The document is re-created based on extracted content and a list of headings is created. ...
extract the content from web page as shown below:
HTML Parser HTML represents a certain range of hypertext information, it is a simple markup language used to create hypertext documents that are platform ...
The following conclusions were reached from the implementation of the proposed adaptation system. ...
doi:10.5120/2978-3817
fatcat:dmq3jpwww5h2hojsljmzb5lsm4
Cross-Domain Document Object Detection: Benchmark Suite and Method
[article]
2020
arXiv
pre-print
For each dataset, we provide the page images, bounding box annotations, PDF files, and the rendering layers extracted from the PDF files. ...
We establish a benchmark suite consisting of different types of PDF document datasets that can be utilized for cross-domain DOD model training and evaluation. ...
We thank Richard Cohn and Kana Sethu for coding the tool and instructing how to use it for synthesizing documents. ...
arXiv:2003.13197v1
fatcat:t46n3ompnvbc7p2vb2x2j35ebu