Publication page
PyTabby: A Docreader’s module for extracting text and tables from PDF with a text layer
Authors: Mikhailov A.A., Shigarov A., Kozlov I.S.
Journal: CEUR Workshop Proceedings: 4th Scientific-Practical Workshop Information Technologies: Algorithms, Models, Systems (ITAMS 2021, Irkutsk, 14 September 2021)
Volume: 2984
Issue:
Year: 2021
Reporting year: 2021
Publisher:
Publisher location:
URL:
Abstract: This paper presents a complete solution for extracting textual information and tables from PDF documents with a text layer. The solution consists of two parts: PyTabby, a tool for extracting text and tables from PDF documents with a complex background and layout, and a Python wrapper module for the Docreader tool. PyTabby extracts text and tables from the low-level representation of the PDF format. This makes it possible to use additional information that is not available in scanned documents and improves quality and performance compared with Optical Character Recognition (OCR) methods. The solution is incorporated into the Docreader tool to parse PDF files with a text layer and is used as part of the TALISMAN technology for social analytics.
Indexed in WoS: 0
Indexed in Scopus: 1
Indexed in RSCI (РИНЦ): 1
Publication in press: 0
Added to the system by: