Страница публикации

PyTabby: A Docreader’s module for extracting text and tables from PDF with a text layer

Авторы: Mikhailov A.A., Shigarov A., Kozlov I.S.

Журнал: CEUR Workshop Proceedings: 4th Scientific-Practical Workshop Information Technologies: Algorithms, Models, Systems (ITAMS 2021, Irkutsk, 14 September 2021)

Том: 2984

Номер:

Год: 2021

Отчётный год: 2021

Издательство:

Местоположение издательства:

URL:

Аннотация: This paper presents a complete solution for extraction of textual information and tables from PDF with a text layer. The presented solution consist of two parts: PyTabby is a tool for extracting text and tables from PDF with a complex background and layout, and Python wrapper module for Docreader tool. The PyTabby tool extracts text and tables from the low level representation of the PDF format. It enables employment of the additional information excluded in scanned documents and provides improvement of quality and performance compared with Optical Character Recognition (OCR) methods. The presented solution is incorporated into Docreader tool to parse PDF files with a text layer and is used as a part of the TALISMAN technology for social analytics.

Индексируется WOS: 0

Индексируется Scopus: 1

Индексируется РИНЦ: 1

Публикация в печати: 0

Добавил в систему: