GROBID in Focus: Machine Learning-Driven Extraction of Structured Bibliographic Data from Scientific Literature
DOI:
https://doi.org/10.48165/lt.2025.11.1.4
Keywords:
GROBID, Bibliographic Data Extraction, Machine Learning, Open Source, PDF Parsing
Abstract
This paper discusses GROBID, a machine learning tool for extracting bibliographic metadata from PDFs. It provides Ubuntu installation guidance, explains the core extraction features, demonstrates applications such as metadata and citation analysis, and covers deployment across environments, with the aim of helping users integrate the tool into their workflows. Installation on Ubuntu/Debian requires setting up JDK, Maven, and Git in sequence. The GROBID repository is cloned via Git, then built with Gradle and compiled with Maven. Users can configure GROBID as a persistent system service, and customization involves editing the grobid.properties file. Functionality was tested through the web interface using sample PDFs. GROBID effectively transforms unstructured PDFs into structured XML/TEI, extracting titles, authors, abstracts, and references. It processes standard scholarly PDFs rapidly (2-5 seconds per page) with over 90% accuracy. Its flexibility comes from configurable settings for PDF parsing, citation extraction, and memory management. Deployment is simplified by a standalone JAR, and the system service setup enables continuous operation in production. The tool's open-source machine learning foundation allows deep operational customization, making it well suited to automated document processing. It operates offline and uses machine learning to handle diverse document formats and multiple languages. Its adaptable architecture serves varied domain needs, which makes it significant for research, library digitization, and improved search capabilities.
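To illustrate what working with the structured XML/TEI output can look like, the following is a minimal Python sketch that reads a title, authors, abstract, and reference titles from a TEI document using only the standard library. The embedded sample is a hand-written, heavily simplified stand-in for real output, and the author names, the helper `text_of`, and the function `summarize_tei` are hypothetical names introduced for this example. The element paths assume the usual TEI layout (`teiHeader`, `titleStmt`, `analytic`, `listBibl`) and may need adjusting for actual GROBID output.

```python
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified sample (an assumption for illustration,
# not actual GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Paper</title></titleStmt>
      <sourceDesc><biblStruct><analytic>
        <author><persName><forename type="first">Jane</forename><surname>Doe</surname></persName></author>
        <author><persName><forename type="first">John</forename><surname>Roe</surname></persName></author>
      </analytic></biblStruct></sourceDesc>
    </fileDesc>
    <profileDesc><abstract><p>An example abstract.</p></abstract></profileDesc>
  </teiHeader>
  <text><back><div type="references"><listBibl>
    <biblStruct><analytic><title level="a">First cited work</title></analytic></biblStruct>
    <biblStruct><analytic><title level="a">Second cited work</title></analytic></biblStruct>
  </listBibl></div></back></text>
</TEI>"""


def text_of(elem):
    """Return the whitespace-normalized text of an element, or '' if missing."""
    if elem is None:
        return ""
    return " ".join("".join(elem.itertext()).split())


def summarize_tei(tei_xml):
    """Collect title, authors, abstract, and reference titles from TEI XML."""
    root = ET.fromstring(tei_xml)
    header = root.find("tei:teiHeader", TEI_NS)

    title = text_of(header.find(".//tei:titleStmt/tei:title", TEI_NS))

    authors = []
    for pers in header.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        parts = [
            text_of(pers.find("tei:forename", TEI_NS)),
            text_of(pers.find("tei:surname", TEI_NS)),
        ]
        authors.append(" ".join(p for p in parts if p))

    abstract = text_of(header.find(".//tei:profileDesc/tei:abstract", TEI_NS))

    # Cited works are listed under listBibl in the back matter.
    reference_titles = [
        text_of(t)
        for t in root.findall(
            ".//tei:listBibl/tei:biblStruct/tei:analytic/tei:title", TEI_NS
        )
    ]

    return {
        "title": title,
        "authors": authors,
        "abstract": abstract,
        "reference_titles": reference_titles,
    }


if __name__ == "__main__":
    for key, value in summarize_tei(SAMPLE_TEI).items():
        print(f"{key}: {value}")
```

In practice the input string would come from a TEI file saved from the tool rather than from an inline sample. Using explicit, namespace-qualified paths keeps header authors separate from the authors of cited works.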