GROBID in Focus: Machine Learning-Driven Extraction of Structured Bibliographic Data from Scientific Literature

Authors

  • Sailendra Malik, PhD Scholar, Department of Library and Information Science, The University of Burdwan
  • Sukumar Mandal, Assistant Professor, Department of Library and Information Science, The University of Burdwan

DOI:

https://doi.org/10.48165/lt.2025.11.1.4

Keywords:

GROBID, Bibliographic Data Extraction, Machine Learning, Open Source, PDF Parsing

Abstract

This paper discusses GROBID as a machine learning tool for extracting bibliographic metadata from PDFs. It provides Ubuntu installation guidance, explains the core extraction features, demonstrates applications such as metadata and citation analysis, and covers deployment across environments, with the aim of helping users integrate the tool into their workflows. Installation on Ubuntu/Debian requires setting up JDK, Maven, and Git in sequence. The GROBID repository is cloned via Git, then built using Gradle and compiled by Maven. Users can configure it as a persistent system service, and customization involves editing the grobid.properties file. Functionality was tested through the web interface using sample PDFs. GROBID effectively transforms unstructured PDFs into structured XML/TEI, extracting titles, authors, abstracts, and references. It processes standard scholarly PDFs rapidly (2-5 seconds per page) with over 90% accuracy. Its flexibility comes from configurable PDF parsing, citation extraction, and memory management settings. Deployment is simplified by a standalone JAR, and system service setup enables continuous operation in production. The tool's open-source machine learning foundation allows deep operational customization, making it exceptional for automated document processing. It operates offline and leverages machine learning to handle diverse document formats and languages. Its adaptable architecture serves varied domain needs, making it significant for research, library digitization, and enhanced search capabilities.
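To make the XML/TEI output described in the abstract more concrete, the following is a minimal sketch of how a downstream script might read titles, authors, abstracts, and references from a TEI document. It is not taken from the paper. The TEI namespace is the standard one, but the sample document is a hand-written, simplified fragment modeled on common TEI conventions, and the names `parse_tei` and `_text` are hypothetical helpers. Real GROBID output is richer and its exact element layout should be checked against an actual result file.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; every element in a TEI document lives in it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified TEI fragment for illustration only.
# ASSUMPTION: this layout is a plausible shape, not verbatim GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Paper</title></titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author><persName>
              <forename type="first">Ada</forename><surname>Example</surname>
            </persName></author>
            <author><persName>
              <forename type="first">Ben</forename><surname>Sample</surname>
            </persName></author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
    <profileDesc><abstract><p>Short abstract text.</p></abstract></profileDesc>
  </teiHeader>
  <text>
    <back>
      <div type="references">
        <listBibl>
          <biblStruct><analytic><title level="a">First cited work</title></analytic></biblStruct>
          <biblStruct><analytic><title level="a">Second cited work</title></analytic></biblStruct>
        </listBibl>
      </div>
    </back>
  </text>
</TEI>"""


def _text(elem):
    """Return the whitespace-normalized text of an element, or '' if missing."""
    if elem is None:
        return ""
    return " ".join("".join(elem.itertext()).split())


def parse_tei(tei_xml):
    """Extract title, authors, abstract, and reference titles from a TEI string."""
    root = ET.fromstring(tei_xml)

    title = _text(root.find(".//tei:titleStmt/tei:title", TEI_NS))

    # Authors of the document itself sit under the header's sourceDesc.
    authors = []
    source_desc = root.find(".//tei:sourceDesc", TEI_NS)
    if source_desc is not None:
        for pers in source_desc.findall(".//tei:author/tei:persName", TEI_NS):
            parts = [
                _text(pers.find("tei:forename", TEI_NS)),
                _text(pers.find("tei:surname", TEI_NS)),
            ]
            authors.append(" ".join(p for p in parts if p))

    abstract = _text(root.find(".//tei:profileDesc/tei:abstract", TEI_NS))

    # Cited works are listed as biblStruct entries inside listBibl.
    references = []
    list_bibl = root.find(".//tei:listBibl", TEI_NS)
    if list_bibl is not None:
        for bibl in list_bibl.findall("tei:biblStruct", TEI_NS):
            references.append(_text(bibl.find(".//tei:title", TEI_NS)))

    return {
        "title": title,
        "authors": authors,
        "abstract": abstract,
        "references": references,
    }


if __name__ == "__main__":
    for field, value in parse_tei(SAMPLE_TEI).items():
        print(f"{field}: {value}")
```

The sketch uses only the Python standard library so that it can run offline, in keeping with the offline operation the abstract highlights. In practice the input string would come from a TEI file produced by GROBID rather than from the embedded sample.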

Published

2025-07-11

Section

Research Article

How to Cite

GROBID in Focus: Machine Learning-Driven Extraction of Structured Bibliographic Data from Scientific Literature. (2025). LIS TODAY, 11(1), 31-37. https://doi.org/10.48165/lt.2025.11.1.4