Fatigue data is essential in many industrial sectors for evaluating the service life and reliability of materials and components under repeated stress and for designing them to withstand such stress. Generating fatigue data experimentally is costly, which makes data from the literature an economical alternative. The potential of the data in the materials science literature is enormous, but this data is very heterogeneous and can only be used together with its relevant context.
While the latest foundation language models achieve state-of-the-art performance on many general information extraction tasks, they lack the detailed domain-specific knowledge required for accurate extraction of materials science data.
In a fatigue test, a material sample is subjected to cyclic loading until, for example, cracks appear or the sample fails. The results of fatigue tests are presented in so-called S-N diagrams, which show the relationship between the maximum stress (S) and the number of load cycles (N) that a material can withstand before it fails.
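As a minimal sketch of what an S-N relationship looks like in practice, the following Python snippet fits a power law to a handful of points. It assumes the Basquin form S = A * N^b, a common description of the finite-life regime; the source does not state which model is used. The data points are synthetic, generated from assumed parameters (A = 1000, b = -0.1) purely for illustration, and the function name fit_basquin is hypothetical.

```python
import math


def fit_basquin(stresses, cycles):
    """Least-squares fit of log10(S) = log10(A) + b * log10(N).

    Returns the tuple (A, b) of the assumed power law S = A * N**b.
    """
    xs = [math.log10(n) for n in cycles]
    ys = [math.log10(s) for s in stresses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space
    a = 10 ** (mean_y - b * mean_x)    # intercept converted back to stress
    return a, b


# Synthetic illustration data (not measurements): points generated
# from an assumed Basquin law, then recovered by the fit.
true_a, true_b = 1000.0, -0.1
cycles = [1e4, 1e5, 1e6, 1e7]
stresses = [true_a * n ** true_b for n in cycles]

a, b = fit_basquin(stresses, cycles)
print(f"A = {a:.3f}, b = {b:.4f}")
```

Real literature data scatters around such a line and often includes run-outs (tests stopped without failure), which a plain least-squares fit like this one does not handle.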
For the generative extraction of fatigue data from scientific literature, Fraunhofer IWM, in cooperation with the University of California, has developed two (agentic) workflows for Vision and Reasoning Language Models (VLM/RLM). Both follow a schema-based approach, in which a data schema unambiguously defines the target entities and the constraints that apply to them. The schema is passed to the reasoning language model to supply contextual domain knowledge. The two workflows extend this approach by i) applying a human-inspired sequential analysis that examines discriminative features within figures before delving into textual details, and ii) augmenting knowledge dynamically, with different data validators, reasoners and knowledge bases enabling detailed verification of the extractions.
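The idea of a schema with constraints, checked by validators, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the schema or validators used in the workflows: the class FatiguePoint, its fields, the allowed units and the function validate are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical constraint: units the schema accepts for stress values.
ALLOWED_STRESS_UNITS = {"MPa", "ksi"}


@dataclass
class FatiguePoint:
    """One extracted S-N data point (illustrative schema only)."""
    stress: float         # stress level S
    stress_unit: str      # unit as reported in the paper
    cycles: float         # number of load cycles N
    runout: bool = False  # True if the test stopped without failure


def validate(point: FatiguePoint) -> List[str]:
    """Return the list of violated constraints; empty means the point passes."""
    problems = []
    if point.stress <= 0:
        problems.append("stress must be positive")
    if point.stress_unit not in ALLOWED_STRESS_UNITS:
        problems.append(f"unknown stress unit: {point.stress_unit!r}")
    if point.cycles < 1:
        problems.append("cycles must be at least 1")
    return problems


# Made-up extraction results: one plausible point and two that a
# validator should flag as suspicious.
extracted = [
    FatiguePoint(stress=300.0, stress_unit="MPa", cycles=1.2e5),
    FatiguePoint(stress=-50.0, stress_unit="MPa", cycles=4.0e6),
    FatiguePoint(stress=250.0, stress_unit="kg", cycles=0),
]

reports = [validate(p) for p in extracted]
for point, report in zip(extracted, reports):
    print(point, "->", report or "ok")
```

Points that violate a constraint are candidates for re-extraction or closer inspection. Checks of this kind are one simple way in which implausible, possibly hallucinated values could surface.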
This methodology exposes so-called language model hallucinations. It facilitates not only the recovery of high-quality fatigue data sets, which reduces the demand for experimental data generation, but also the seamless integration of literature and proprietary data sources.