Extracting information from PDF documents for use in automatic indexing of e-books

Gil-leiva, Isidoro; Fujita, Mariangela Spotti Lopes [UNESP]; Redigolo, Franciele Marques; Saran, Jordan Ferreira [UNESP]

Date

2022-01-01

Authors

Gil-leiva, Isidoro

Fujita, Mariangela Spotti Lopes

Redigolo, Franciele Marques

Saran, Jordan Ferreira

Publisher

Pontificia Universidade Catolica Campinas

Type

Article

Abstract

The number of electronic books entering libraries in PDF format grows every day, which complicates, and makes almost unfeasible, some processes traditionally carried out manually by librarians, such as the assignment of subjects. In this context, it is necessary to design and develop applications that assist librarians. With this in mind, we present in this work an evaluation of tools for extracting information from books in PDF format that could later be used as raw material for an automatic indexing system. To do this, we carried out a first evaluation of five software tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract, and Grobib). Because PDFAct achieved the best performance, we then carried out a second evaluation to determine its ability to identify and extract information from the books, such as titles, indexes, sections, titles of tables and graphs, and bibliographic references, all of which are relevant to any indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although PDFAct performed better than the rest.

Keywords

Software evaluation, PDFMiner.six, PDFAct, PDF-extract, PDFExtract, Grobib, Automatic indexing

Language

English

How to cite

Transinformacao. Campinas: Pontificia Universidade Catolica Campinas, v. 34, 11 p., 2022.

URI

http://hdl.handle.net/11449/237794

Collections

Marília - FFC - Faculdade de Filosofia e Ciências
