TabbyPDF: Web-Based System for PDF Table Extraction
Авторы: Shigarov A., Altaev A., Mikhailov A., Paramonov V., Cherkashin E.
Журнал: Communications in Computer and Information Science
Отчётный год: 2018
Аннотация: PDF is one of the most widespread ways to represent non-editable documents. Many of PDF documents are machine-readable but remain untagged. They have no tags for identifying layout items such as paragraphs, columns, or tables. One of the important challenges with these documents is how to extract tabular data from them. The paper presents a novel web-based system for extracting tables located in untagged PDF documents with a complex layout, for recovering their cell structures, and for exporting them into a tagged form (e.g. in CSV or HTML format). The system uses a heuristic-based approach to table detection and structure recognition. It mainly relies on recovering a human reading order of text, including document paragraphs and table cells. A prototype of the system was evaluated, using the methodology and dataset of “ICDAR 2013 Table Competition”. The standard metric F-score is 93.64% for the structure recognition phase and 83.18% for the table extraction with automatic table detection. The results are comparable with the state-of-the-art academic solutions.
Индексируется WOS: 1
Индексируется Scopus: 1
Индексируется РИНЦ: 1
Публикация в печати: 0
Добавил в систему: