
Publication:
Extracting information from PDF documents for use in automatic indexing of e-books

dc.contributor.author: Gil-leiva, Isidoro
dc.contributor.author: Fujita, Mariangela Spotti Lopes [UNESP]
dc.contributor.author: Redigolo, Franciele Marques
dc.contributor.author: Saran, Jordan Ferreira [UNESP]
dc.contributor.institution: Univ Murcia
dc.contributor.institution: Universidade Estadual Paulista (UNESP)
dc.contributor.institution: Univ Fed Para
dc.date.accessioned: 2022-11-30T13:45:11Z
dc.date.available: 2022-11-30T13:45:11Z
dc.date.issued: 2022-01-01
dc.description.abstract: The number of electronic books entering libraries in PDF format grows every day, which complicates, and makes almost unfeasible, some processes traditionally carried out manually by librarians, such as the assignment of subjects. In this context, it is necessary to design and develop applications that assist librarians. With this in mind, we present an evaluation of tools for extracting information from books in PDF format that could later be used as raw material for an automatic indexing system. To do this, we first evaluated five software tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract, and Grobib). Then, as PDFAct achieved the best performance, we carried out a second evaluation of its ability to identify and extract information from the books, such as titles, indexes, sections, titles of tables and graphs, and bibliographic references, which are relevant information for any indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although PDFAct performed better than the rest. (en)
dc.description.affiliation: Univ Murcia, Fac Comunicac & Documentac, Campus Univ Espinardo s n, Murcia 30100, Spain
dc.description.affiliation: Univ Estadual Paulista, Fac Filosofia & Ciencias, Programa Posgrad Ciencia Informacao, Marilia, SP, Brazil
dc.description.affiliation: Univ Fed Para, Fac Bibliotecon, Programa Posgrad Ciencia Informacao, Belem, PA, Brazil
dc.description.affiliationUnesp: Univ Estadual Paulista, Fac Filosofia & Ciencias, Programa Posgrad Ciencia Informacao, Marilia, SP, Brazil
dc.format.extent: 11
dc.identifier: http://dx.doi.org/10.1590/2318-0889202234e210069
dc.identifier.citation: Transinformacao. Campinas: Pontificia Universidade Catolica Campinas, v. 34, 11 p., 2022.
dc.identifier.doi: 10.1590/2318-0889202234e210069
dc.identifier.issn: 0103-3786
dc.identifier.uri: http://hdl.handle.net/11449/237794
dc.identifier.wos: WOS:000830903000001
dc.language.iso: eng
dc.publisher: Pontificia Universidade Catolica Campinas
dc.relation.ispartof: Transinformacao
dc.source: Web of Science
dc.subject: Software evaluation
dc.subject: PDFMiner.six
dc.subject: PDFAct
dc.subject: PDF-extract
dc.subject: PDFExtract
dc.subject: Grobib
dc.subject: Automatic indexing
dc.title: Extracting information from PDF documents for use in automatic indexing of e-books (en)
dc.type: Article
dcterms.rightsHolder: Pontificia Universidade Catolica Campinas
dspace.entity.type: Publication
unesp.author.orcid: 0000-0002-8239-7114[2]
unesp.campus: Universidade Estadual Paulista (UNESP), Faculdade de Filosofia e Ciências, Marília (pt)
unesp.department: Ciência da Informação - FFC (pt)

Files