GROBID in Focus: Machine Learning-Driven Extraction of Structured Bibliographic Data from Scientific Literature

Sailendra Malik; Sukumar Mandal

doi:10.48165/lt.2025.11.1.4

Authors

Sailendra Malik PhD Scholar Department of Library and Information Science The University of Burdwan
Sukumar Mandal Assistant Professor Department of Library and Information Science The University of Burdwan

DOI:

https://doi.org/10.48165/lt.2025.11.1.4

Keywords:

GROBID, Bibliographic Data Extraction, Machine Learning, Open Source, PDF Parsing

Abstract

This paper discusses GROBID, a machine learning tool for extracting bibliographic metadata from PDFs. It provides Ubuntu installation guidance, explains the core extraction features, demonstrates applications such as metadata and citation analysis, and covers deployment across environments, with the aim of helping users integrate the tool into their workflows. Installation on Ubuntu/Debian requires setting up JDK, Maven, and Git in sequence. The GROBID repository is cloned via Git, then built using Gradle and compiled with Maven. Users can configure it as a persistent system service. Customization involves editing the grobid.properties file. Functionality was tested via the web interface using sample PDFs. GROBID effectively transforms unstructured PDFs into structured XML/TEI, extracting titles, authors, abstracts, and references. It processes standard scholarly PDFs rapidly (2-5 seconds per page) with over 90% accuracy. Its flexibility comes from configurable PDF parsing, citation extraction, and memory management settings. Deployment is simplified via a standalone JAR, and system service setup enables continuous operation in production. The tool's open-source machine learning foundation allows deep operational customization, making it well suited to automated document processing. It operates offline and leverages machine learning to handle diverse document formats and multiple languages. Its adaptable architecture serves varied domain needs, making it valuable for research, library digitization, and enhanced search capabilities.
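
To make the structured XML/TEI output mentioned in the abstract more concrete, the following is a minimal sketch of how the extracted title, authors, abstract, and references might be read back out of a TEI document using only the Python standard library. The sample TEI fragment is hand-written for illustration and is only assumed to resemble the general shape of GROBID output; the names parse_tei, SAMPLE_TEI, and _text are hypothetical helpers, not part of GROBID or of the paper.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace, used by TEI-encoded documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written TEI fragment (an assumption for illustration,
# not actual output produced by the tool).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Paper</title></titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
            <author><persName><forename type="first">Alan</forename><surname>Turing</surname></persName></author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
    <profileDesc><abstract><p>We study things.</p></abstract></profileDesc>
  </teiHeader>
  <text>
    <back>
      <div type="references">
        <listBibl>
          <biblStruct xml:id="b0">
            <analytic><title level="a" type="main">First cited work</title></analytic>
          </biblStruct>
          <biblStruct xml:id="b1">
            <monogr><title level="m">Second cited work</title></monogr>
          </biblStruct>
        </listBibl>
      </div>
    </back>
  </text>
</TEI>"""


def _text(elem):
    """Return the whitespace-normalized text of an element, or '' if it is missing."""
    if elem is None:
        return ""
    return " ".join("".join(elem.itertext()).split())


def parse_tei(tei_xml):
    """Extract title, authors, abstract, and reference titles from a TEI string."""
    root = ET.fromstring(tei_xml)
    header = root.find("tei:teiHeader", TEI_NS)
    if header is None:
        raise ValueError("no teiHeader element found")

    title = _text(header.find(".//tei:titleStmt/tei:title", TEI_NS))

    # Join the name parts (forename, surname, ...) of each author in document order.
    authors = []
    for pers in header.findall(".//tei:sourceDesc//tei:author/tei:persName", TEI_NS):
        parts = [_text(part) for part in pers]
        authors.append(" ".join(p for p in parts if p))

    abstract = _text(header.find(".//tei:profileDesc/tei:abstract", TEI_NS))

    # Use the first title found in each bibliography entry as its label.
    references = [
        _text(bibl.find(".//tei:title", TEI_NS))
        for bibl in root.findall(".//tei:listBibl/tei:biblStruct", TEI_NS)
    ]

    return {
        "title": title,
        "authors": authors,
        "abstract": abstract,
        "references": references,
    }


if __name__ == "__main__":
    record = parse_tei(SAMPLE_TEI)
    for field, value in record.items():
        print(field, "->", value)
```

In practice the TEI string would come from the tool's own output for a real PDF rather than from a hard-coded sample, and real documents may omit some of these elements, which is why missing elements are mapped to empty strings here.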



LIS Today Journal
