Structuring PubMed Content into Knowledge Graphs for Enhanced Biomedical Intelligence

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main Author: Huang, Yuanhao
Published: ProQuest Dissertations & Theses
Description
Abstract: As artificial intelligence (AI) continues to advance, its applications in biomedical research have expanded significantly. AI is now being applied to increasingly complex and knowledge-intensive tasks such as drug discovery, disease prediction, and clinical decision-making. Despite these achievements, concerns about the trustworthiness and transparency of AI systems in such applications have become more prominent. The knowledge embedded within large models is often difficult for humans to interpret or verify, and these models are prone to generating hallucinations. Moreover, large language models (LLMs) struggle with tasks that require comprehensive and context-aware knowledge retrieval, such as literature reviews and hypothesis generation. For example, writing a review on the gene SPRY2 necessitates understanding its diverse biological roles across tissues and diseases and systematically aggregating relevant publications. LLMs alone often fail to retrieve such comprehensive information effectively. These complex tasks are better facilitated by structured knowledge bases and domain-specific tools. Therefore, there is a critical need to develop hybrid frameworks that integrate the reasoning capabilities of AI with the structured reliability of relational knowledge representations.

This dissertation addresses how to bridge vast biomedical knowledge from peer-reviewed literature with machine learning models and AI agents through the use of knowledge graphs. Knowledge graphs represent heterogeneous biomedical data as interconnected nodes and edges, explicitly modeling meaningful relationships. They have emerged as powerful tools in biomedical informatics, particularly for enabling machine learning and inference on structured biomedical knowledge.

Chapter 2 presents an automated pipeline for curating semantic knowledge from over 33 million PubMed articles into a structured knowledge graph. This process represents the foundational step in connecting semantic biomedical knowledge with machine learning models and AI agents. To achieve this, we developed the LiteralGraph framework, which includes (1) a scalable and automated information extraction pipeline, and (2) a flexible, extensible schema designed for machine learning applications. The resulting knowledge graph, named the Genomic Literature Knowledge Base (GLKB), harmonizes semantic knowledge from both PubMed literature and curated biomedical databases.

Chapter 3 introduces methods for generating knowledge graph embeddings from GLKB, enabling integration with various machine learning applications. These embeddings capture rich semantic features from the graph structure and are applicable to a range of downstream tasks, including literature retrieval, topic recommendation, and biological function prediction (e.g., gene-gene interactions and drug-disease associations).

Chapter 4 explores how GLKB facilitates integration with LLMs and AI agents. The use of standardized biomedical terms and their explicit relationships forms a graph-based index over PubMed, enabling comprehensive and structured literature retrieval. This enhances the performance and trustworthiness of retrieval-augmented generation (RAG) methods. Furthermore, we introduce the GLKB Agent, a modular system capable of handling both graph-based and literature-based queries. It supports complex biomedical tasks such as automated literature reviews and hypothesis-driven knowledge discovery.

In summary, this dissertation makes four key contributions to the intersection of biomedical informatics and AI. First, it introduces LiteralGraph, a scalable framework that extracts, disambiguates, and semantically structures knowledge from over 33 million PubMed articles into a unified biomedical knowledge graph, GLKB. Second, it demonstrates how GLKB supports a wide range of machine learning applications, including information retrieval, link prediction, and fact-checking, through the development of graph-based embeddings and standardized tabular datasets. Third, it presents the design and implementation of GLKB Agent, a retrieval-augmented generation (RAG) system that integrates graph reasoning and semantic search to improve the factuality and transparency of biomedical question answering. Finally, the dissertation introduces a deep research agent architecture, modeled on dual-system cognitive theory, capable of orchestrating multi-step reasoning over GLKB to perform literature reviews and generate scientific hypotheses. Together, these contributions advance the field by providing a structured, scalable, and interpretable infrastructure for biomedical AI.
ISBN: 9798291566336
Source: ProQuest Dissertations & Theses Global