Use of Large Language Models for Extracting and Analyzing Data from Heterogeneous Catalysis Literature


Abstract: Extracting experimentally measured heterogeneous catalysis data from the text of research articles into structured databases would facilitate the rapid screening of catalysts with target properties and the development of models capable of directly predicting experimental outcomes. This text mining task has been transformed by the release of large language models (LLMs) capable of following general natural language instructions, which have made it possible to mine text without training task-specific models or defining comprehensive expression-matching rules. Here, we develop and share a text mining tool called CatMiner that extracts arbitrary user-specified structure–environment–property data using LLMs. It is agnostic to LLM choice, with both OpenAI GPT models and open-source Llama and DeepSeek models supported without modification. We benchmark the ability of CatMiner to rapidly extract useful data from abundant published literature by focusing on a case study of the oxidative coupling of methane. We explore how model choice and prompting strategies affect extraction quality. Key capabilities, including the use of domain knowledge, iterative prompting, and document-wide context handling, are shown to be critical for effective performance. We identify situations where CatMiner struggles and suggest reporting standards for the community to make catalysis data easier to extract going forward. CatMiner enables the creation of machine-readable catalysis datasets, streamlining access to experimental insights buried in the literature.
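
To make the idea of model-agnostic extraction concrete, the sketch below shows one way such a pipeline could be organized. It is not CatMiner's actual interface: the `Record` fields, the `extract` function, the prompt wording, and the retry loop (one simple reading of "iterative prompting") are all hypothetical names and choices made for illustration. The only assumption about the model is that it can be wrapped as a function from a prompt string to a reply string.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical abstraction: any LLM backend wrapped as prompt -> reply text.
LLM = Callable[[str], str]


@dataclass
class Record:
    """One structure-environment-property data point (illustrative schema)."""
    catalyst: str                   # structure
    temperature_c: Optional[float]  # environment
    property_name: str              # user-specified property
    value: Optional[float]
    unit: Optional[str]


# Assumed prompt wording; no literal braces other than the two placeholders.
PROMPT = (
    "From the text below, list every reported value of '{property}'. "
    "Reply with a JSON list of objects using the keys "
    "catalyst, temperature_c, value, unit. Use null when a field is not stated.\n\n"
    "TEXT:\n{text}"
)


def extract(text: str, property_name: str, llm: LLM, max_rounds: int = 2) -> list[Record]:
    """Ask the model for structured records, re-prompting if the reply is unusable."""
    prompt = PROMPT.format(property=property_name, text=text)
    for _ in range(max_rounds):
        reply = llm(prompt)
        try:
            rows = json.loads(reply)
            return [
                Record(
                    catalyst=row["catalyst"],
                    temperature_c=row.get("temperature_c"),
                    property_name=property_name,
                    value=row.get("value"),
                    unit=row.get("unit"),
                )
                for row in rows
            ]
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
            # Follow-up round: tell the model what went wrong and ask again.
            prompt += "\n\nYour previous reply did not match the requested JSON format. Reply with JSON only."
    return []  # no usable reply within max_rounds
```

Because `extract` depends only on the `LLM` callable, switching between a hosted model and a locally served open-source model would mean swapping the wrapper function passed as `llm`, with the prompt and parsing code left unchanged.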


Benjamin W. Walls, Suljo Linic*
Cite this: ACS Catal. 2025, 15, XXX, 14751–14763
https://doi.org/10.1021/acscatal.5c03844
Published August 10, 2025
© 2025 The Authors. Published by American Chemical Society


https://pubs.acs.org/doi/10.1021/acscatal.5c03844