Towards Knowledge Graph Construction from Unstructured Text with LLMs : Triple Identification and Alignment to Wikidata

Bibliographic Details
Published in: PQDT - Global (2025)
Main Author: Salman, Muhammad
Published: ProQuest Dissertations & Theses
Subjects:
Online Access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3235007906
003 UK-CbPIL
020 |a 9798290640631 
035 |a 3235007906 
045 2 |b d20250101  |b d20251231 
084 |a 189128  |2 nlm 
100 1 |a Salman, Muhammad 
245 1 |a Towards Knowledge Graph Construction from Unstructured Text with LLMs : Triple Identification and Alignment to Wikidata 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a The exponential growth of digital text has underscored the pivotal role of Knowledge Graphs (KGs) in structuring, managing, and deriving value from unstructured data. However, a vast portion of textual content remains unstructured, posing critical challenges for the automatic construction and enrichment of KGs, particularly in the accurate extraction and linking of knowledge triples. This thesis addresses these challenges by presenting a comprehensive framework for extracting high-quality subject-predicate-object triples from unstructured text and linking them to a structured KG. To tackle the complexity of natural language input, a novel preprocessing technique, Controlled Syntactic Simplification (CSynSim), is introduced. CSynSim analyses the syntactic structure of input sentences and applies a controlled "split and rewrite" strategy to simplify complex constructions while preserving semantic fidelity. To support systematic evaluation of the triple extraction task, a crowdsourced benchmark dataset, TinyButMighty, is developed. It comprises richly annotated compound and complex sentences, validated by expert ontologists. This dataset serves as a standard for assessing triple extraction systems, supported by tailored triple-similarity-based precision, recall, and F-measure metrics that evaluate the alignment between system outputs and human-annotated ground truth. Building on this foundation, the thesis proposes a baseline approach, Doc-KG, a rule-based pipeline that employs traditional NLP tools, such as semantic dependency parsing and SPARQL querying, to extract triples and align them with Wikidata entities. Doc-KG includes a predicate mapping mechanism and is evaluated both quantitatively and qualitatively against existing approaches. The advent of LLMs such as GPT models marked a significant shift in the methodology: their integration enhances the capacity to address the limitations of the earlier Doc-KG pipeline. The thesis introduces SALMON (Syntactically Analysed and LLM-Optimised Natural language), a hybrid framework that leverages LLMs for both triple extraction and entity linking. The integration of CSynSim as a preprocessing step significantly improves extraction accuracy, especially for structurally complex sentences. To address hallucination and ambiguity in LLM outputs, the thesis further proposes an LLM-SPARQL hybrid framework for Named Entity Linking (NEL) and Named Entity Disambiguation (NED). This approach harnesses SPARQL queries to retrieve candidate entities and employs LLMs to refine disambiguation decisions, thereby enhancing precision and reliability in linking textual mentions to Wikidata identifiers. In summary, this thesis presents an end-to-end pipeline that integrates syntactic simplification, rule-based processing, and large language models to advance the state of knowledge extraction from unstructured text. The resulting SALMON framework (https://w3id.org/salmon/) offers a modular and extensible approach to KG construction, with strong empirical performance and broader implications for natural language understanding, semantic technologies, and information integration. 
653 |a Language 
653 |a Graphs 
653 |a Ontology 
653 |a Web Ontology Language-OWL 
653 |a Semantic web 
653 |a Natural language processing 
653 |a Resource Description Framework-RDF 
653 |a Crowdsourcing 
653 |a Large language models 
653 |a Knowledge representation 
653 |a Semantics 
773 0 |t PQDT - Global  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3235007906/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3235007906/fulltextPDF/embedded/6A8EOT78XXH2IG52?source=fedsrch